├── dataset
│   └── README.md
├── results
│   └── README.md
├── training
│   ├── requirements.txt
│   └── train.py
├── tutorial
│   ├── serving.template
│   ├── utils
│   │   └── LineIterator.py
│   └── Deploy_model_on_Amazon_SageMaker_with_vLLM.ipynb
├── CODE_OF_CONDUCT.md
├── LICENSE
├── README.md
├── CONTRIBUTING.md
├── test_training_script_local_gpu.ipynb
├── Finetune_Mistral_7B_on_Amazon_SageMaker.ipynb
├── Local_Compare_Finetune_Mistral.ipynb
├── Local_Finetune_Mistral.ipynb
└── Deploy_Mistral_7B_on_Amazon_SageMaker_with_vLLM.ipynb
/dataset/README.md:
--------------------------------------------------------------------------------
1 | # Mistral-7B-Instruct-fine-tune-and-deploy-on-SageMaker
--------------------------------------------------------------------------------
/results/README.md:
--------------------------------------------------------------------------------
1 | # Mistral-7B-Instruct-fine-tune-and-deploy-on-SageMaker
--------------------------------------------------------------------------------
/training/requirements.txt:
--------------------------------------------------------------------------------
1 | transformers==4.38.1
2 | datasets==2.17.1
3 | peft==0.8.2
4 | bitsandbytes==0.42.0
5 | trl==0.7.11
6 | diffusers==0.26.3
--------------------------------------------------------------------------------
/tutorial/serving.template:
--------------------------------------------------------------------------------
1 | engine=Python
2 | option.model_id={{ model_id }}
3 | option.tensor_parallel_degree=1
4 | option.max_rolling_batch_size=16
5 | option.rolling_batch=vllm
6 | option.task=text-generation
7 | option.dtype=fp16
8 | option.max_model_len=2048
9 |
--------------------------------------------------------------------------------
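For reference, once the deploy notebook renders the `{{ model_id }}` placeholder with an S3 URI, the resulting `serving.properties` looks like the following (the bucket and prefix here are illustrative placeholders, not values from this repo):

```
engine=Python
option.model_id=s3://<your-bucket>/models/<your-model>
option.tensor_parallel_degree=1
option.max_rolling_batch_size=16
option.rolling_batch=vllm
option.task=text-generation
option.dtype=fp16
option.max_model_len=2048
```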
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | ## Code of Conduct
2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
4 | opensource-codeofconduct@amazon.com with any additional questions or comments.
5 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT No Attribution
2 |
3 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy of
6 | this software and associated documentation files (the "Software"), to deal in
7 | the Software without restriction, including without limitation the rights to
8 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
9 | the Software, and to permit persons to whom the Software is furnished to do so.
10 |
11 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
12 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
13 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
14 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
15 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
16 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
17 |
18 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## Mistral 7B Instruct fine-tune and deploy on SageMaker
2 |
3 | Mistral 7B is an open LLM from Mistral AI. The Mistral 7B Instruct model is a quick demonstration that the base model can easily be fine-tuned to achieve compelling performance. See more at https://mistral.ai/news/announcing-mistral-7b/
4 |
5 | In this code sample, we walk through, step by step, how to fine-tune the Mistral 7B Instruct model:
6 | 1. Locally, with a small sample of the data
7 | 2. On SageMaker, with the full dataset
8 |
9 | Readers can start with `Local_Finetune_Mistral.ipynb`, the notebook for fine-tuning the Mistral 7B model with QLoRA (Efficient Finetuning of Quantized LLMs). After fine-tuning for a few steps, you should be able to observe a difference between the original model and the fine-tuned version; use the `Local_Compare_Finetune_Mistral.ipynb` notebook to compare the two locally.
10 |
11 | The following script can be run locally on a GPU instance (we executed it on a single GPU on a SageMaker Notebook ml.g5.2xlarge instance; see more details in `test_training_script_local_gpu.ipynb`):
12 |
13 | python ./training/train.py --dataset_path "./dataset/dolly.hf" --model_save_path "./results" --job_output_path "./results" --per_device_train_batch_size 1 --epochs 1
14 |
15 |
16 | Additionally, readers can run the end-to-end fine-tuning code with SageMaker on the full dataset using the `Finetune_Mistral_7B_on_Amazon_SageMaker.ipynb` notebook.
17 |
18 | Once the fine-tuning process is complete, the model can be deployed as a SageMaker endpoint using the `Deploy_Mistral_7B_on_Amazon_SageMaker_with_vLLM.ipynb` notebook.
19 |
20 | ## Security
21 |
22 | See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.
23 |
24 | ## License
25 |
26 | This library is licensed under the MIT-0 License. See the LICENSE file.
27 |
28 |
--------------------------------------------------------------------------------
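Following up on the README's local comparison step, here is a minimal sketch of the kind of side-by-side check `Local_Compare_Finetune_Mistral.ipynb` performs, assuming the training command above wrote the merged fine-tuned weights to `./results`. The prompt and generation settings are illustrative assumptions, not values taken from the notebook:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_id = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Load both models in 4-bit so they fit together on a single 24GB GPU
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
base = AutoModelForCausalLM.from_pretrained(base_id, quantization_config=bnb,
                                            device_map={"": 0})
tuned = AutoModelForCausalLM.from_pretrained("./results",  # assumed merged-model path
                                             quantization_config=bnb,
                                             device_map={"": 0})

def generate(model, prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(out[0], skip_special_tokens=True)

prompt = "[INST] What is Amazon SageMaker? [/INST]"  # illustrative prompt
print("base :", generate(base, prompt))
print("tuned:", generate(tuned, prompt))
```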
/tutorial/utils/LineIterator.py:
--------------------------------------------------------------------------------
1 | import io
2 | import re
3 |
4 | NEWLINE = re.compile(r'\\n')
5 | DOUBLE_NEWLINE = re.compile(r'\\n\\n')
6 |
7 | class LineIterator:
8 |     """
9 |     A helper class for parsing the byte stream from a model served with the LMI container.
10 | 
11 |     The output of the model will be in the following repetitive but incremental format:
12 |     ```
13 |     b'{"generated_text": "Hel'
14 |     b'lo from L'
15 |     b'LM \\n\\n'
16 |     b'How are you?"}'
17 |     ...
18 |     ```
19 | 
20 |     For each iteration, we read only the incremental part and seek to the new position for the next iteration, until the end of the stream.
21 |     """
22 |
23 | def __init__(self, stream):
24 | self.byte_iterator = iter(stream)
25 | self.buffer = io.BytesIO()
26 | self.read_pos = 0
27 |
28 | def __iter__(self):
29 | return self
30 |
31 | def __next__(self):
32 | start_sequence = b'{"generated_text": "'
33 | stop_sequence = b'"}'
34 | new_line = '\n'
35 | double_new_line = '\n\n'
36 | while True:
37 | self.buffer.seek(self.read_pos)
38 | line = self.buffer.readline()
39 | if line:
40 | self.read_pos += len(line)
41 |                 if line.startswith(start_sequence):
42 |                     # slice off the JSON prefix; lstrip() would strip a character set, not a prefix
43 |                     line = line[len(start_sequence):]
44 |                 if line.endswith(stop_sequence):
45 |                     line = line[:-len(stop_sequence)]
46 | line = line.decode('utf-8')
47 | line = NEWLINE.sub(new_line, line)
48 | line = DOUBLE_NEWLINE.sub(double_new_line, line)
49 | return line
50 | try:
51 | chunk = next(self.byte_iterator)
52 | except StopIteration:
53 | if self.read_pos < self.buffer.getbuffer().nbytes:
54 | continue
55 | raise
56 | if 'PayloadPart' not in chunk:
57 |                 print('Unknown event type: ' + str(chunk))
58 | continue
59 | self.buffer.seek(0, io.SEEK_END)
60 | self.buffer.write(chunk['PayloadPart']['Bytes'])
--------------------------------------------------------------------------------
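A quick local check of `LineIterator`'s parsing logic, feeding it a few synthetic `PayloadPart` events shaped like the docstring example above (the payload bytes are made up for illustration):

```python
from utils.LineIterator import LineIterator

# Synthetic event stream mimicking invoke_endpoint_with_response_stream output
events = [
    {'PayloadPart': {'Bytes': b'{"generated_text": "Hel'}},
    {'PayloadPart': {'Bytes': b'lo from L'}},
    {'PayloadPart': {'Bytes': b'LM \\n\\n'}},  # literal \n\n, as emitted inside the JSON text
    {'PayloadPart': {'Bytes': b'How are you?"}'}},
]

for text in LineIterator(events):
    print(text, end='')
# prints "Hello from LLM", two real newlines, then "How are you?"
```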
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | # Contributing Guidelines
2 |
3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
4 | documentation, we greatly value feedback and contributions from our community.
5 |
6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
7 | information to effectively respond to your bug report or contribution.
8 |
9 |
10 | ## Reporting Bugs/Feature Requests
11 |
12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features.
13 |
14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already
15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:
16 |
17 | * A reproducible test case or series of steps
18 | * The version of our code being used
19 | * Any modifications you've made relevant to the bug
20 | * Anything unusual about your environment or deployment
21 |
22 |
23 | ## Contributing via Pull Requests
24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:
25 |
26 | 1. You are working against the latest source on the *main* branch.
27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted.
29 |
30 | To send us a pull request, please:
31 |
32 | 1. Fork the repository.
33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
34 | 3. Ensure local tests pass.
35 | 4. Commit to your fork using clear commit messages.
36 | 5. Send us a pull request, answering any default questions in the pull request interface.
37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.
38 |
39 | GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/).
41 |
42 |
43 | ## Finding contributions to work on
44 | Looking at the existing issues is a great way to find something to contribute to. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start.
45 |
46 |
47 | ## Code of Conduct
48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
50 | opensource-codeofconduct@amazon.com with any additional questions or comments.
51 |
52 |
53 | ## Security issue notifications
54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public GitHub issue.
55 |
56 |
57 | ## Licensing
58 |
59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.
60 |
--------------------------------------------------------------------------------
/test_training_script_local_gpu.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": null,
6 | "id": "0877bf5e-dcb6-4377-94e5-06332ab32848",
7 | "metadata": {
8 | "tags": []
9 | },
10 | "outputs": [],
11 | "source": [
12 | "!pip install transformers==4.38.1 datasets==2.17.1 peft==0.8.2 bitsandbytes==0.42.0 trl==0.7.11 --upgrade --quiet"
13 | ]
14 | },
15 | {
16 | "cell_type": "code",
17 | "execution_count": null,
18 | "id": "73cac042-5399-46d3-bfde-2728a06fe450",
19 | "metadata": {
20 | "tags": []
21 | },
22 | "outputs": [],
23 | "source": [
24 | "from datasets import load_dataset\n",
25 | "from random import randrange\n",
26 | "\n",
27 | "# Load dataset from the hub\n",
28 | "dataset = load_dataset(\"databricks/databricks-dolly-15k\", split=\"train\")\n",
29 | "\n",
30 |     "# For local testing of the fine-tuning code, you can limit the dataset to 20 samples\n",
31 | "#dataset = load_dataset(\"databricks/databricks-dolly-15k\", split=\"train\").select(range(20))\n",
32 | "\n",
33 | "print(f\"dataset size: {len(dataset)}\")\n",
34 | "print(dataset[randrange(len(dataset))])"
35 | ]
36 | },
37 | {
38 | "cell_type": "code",
39 | "execution_count": null,
40 | "id": "7b382391-42b8-4393-8523-55057824e3dc",
41 | "metadata": {
42 | "tags": []
43 | },
44 | "outputs": [],
45 | "source": [
46 | "dataset.save_to_disk(\"./dataset/dolly.hf\")"
47 | ]
48 | },
49 | {
50 | "cell_type": "markdown",
51 | "id": "43f9b55d-eae2-4195-8aaf-43c04f1f3eca",
52 | "metadata": {},
53 | "source": [
54 | "### Local test with Python script"
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": null,
60 | "id": "eb580a8a-9a36-4538-b6ea-b4e2285de384",
61 | "metadata": {
62 | "tags": []
63 | },
64 | "outputs": [],
65 | "source": [
66 | "!python ./training/train.py --dataset_path \"./dataset/dolly.hf\" --model_save_path \"./results\" --job_output_path \"./results\" --per_device_train_batch_size 1 --epochs 1"
67 | ]
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "id": "55d6a529-2f76-4a97-bfd8-254c20239a18",
72 | "metadata": {},
73 | "source": [
74 |     "### Local test with SageMaker SDK"
75 | ]
76 | },
77 | {
78 | "cell_type": "code",
79 | "execution_count": null,
80 | "id": "a8ab9d75-994e-42dc-95fc-36aed741a365",
81 | "metadata": {
82 | "tags": []
83 | },
84 | "outputs": [],
85 | "source": [
86 | "import sagemaker\n",
87 | "import boto3\n",
88 | "sess = sagemaker.Session()\n",
89 | "# sagemaker session bucket -> used for uploading data, models and logs\n",
90 |     "# sagemaker will automatically create this bucket if it does not exist\n",
91 | "sagemaker_session_bucket=None\n",
92 | "if sagemaker_session_bucket is None and sess is not None:\n",
93 | " # set to default bucket if a bucket name is not given\n",
94 | " sagemaker_session_bucket = sess.default_bucket()\n",
95 | "\n",
96 | "try:\n",
97 | " role = sagemaker.get_execution_role()\n",
98 | "except ValueError:\n",
99 | " iam = boto3.client('iam')\n",
100 | " role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']\n",
101 | "\n",
102 | "sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)\n",
103 | "\n",
104 | "print(f\"sagemaker role arn: {role}\")\n",
105 | "print(f\"sagemaker session region: {sess.boto_region_name}\")"
106 | ]
107 | },
108 | {
109 | "cell_type": "code",
110 | "execution_count": null,
111 | "id": "dbeac35c-38d8-4823-85a2-a05deb347e68",
112 | "metadata": {
113 | "tags": []
114 | },
115 | "outputs": [],
116 | "source": [
117 | "model_id = \"mistralai/Mistral-7B-Instruct-v0.1\"\n",
118 |     "instance_type = 'local_gpu' # instance type used for the training job\n",
119 | "training_input_path = \"file://./dataset\""
120 | ]
121 | },
122 | {
123 | "cell_type": "code",
124 | "execution_count": null,
125 | "id": "dc690758-4303-4e16-a115-eccbed1a2f65",
126 | "metadata": {
127 | "tags": []
128 | },
129 | "outputs": [],
130 | "source": [
131 | "import time\n",
132 | "from sagemaker.huggingface import HuggingFace\n",
133 | "from huggingface_hub import HfFolder\n",
134 | "\n",
135 | "# define Training Job Name\n",
136 | "job_name = f'huggingface-qlora-{model_id.replace(\"/\", \"-\").lower()}'\n",
137 | "\n",
138 | "# hyperparameters, which are passed into the training job\n",
139 | "hyperparameters ={\n",
140 | " 'model_id': model_id, # pre-trained model\n",
141 |     "    'dataset_path': '/opt/ml/input/data/training/dolly.hf', # path where sagemaker mounts the training dataset\n",
142 | " 'epochs': 1, # number of training epochs\n",
143 | " 'per_device_train_batch_size': 1, # batch size for training\n",
144 | " 'lr': 2e-4, # learning rate used during training\n",
145 | "}\n",
146 | "metric=[\n",
147 | " {\"Name\": \"loss\", \"Regex\": r\"'loss':\\s*([0-9.]+)\"},\n",
148 | " {\"Name\": \"epoch\", \"Regex\": r\"'epoch':\\s*([0-9.]+)\"},\n",
149 | "]\n",
150 | "# create the Estimator\n",
151 | "huggingface_estimator = HuggingFace(\n",
152 | " entry_point = 'train.py', # train script\n",
153 | " source_dir = 'training', # directory which includes all the files needed for training\n",
154 | " metric_definitions = metric,\n",
155 |     "    instance_type        = instance_type,    # instance type used for the training job\n",
156 |     "    instance_count       = 1,                # the number of instances used for training\n",
157 |     "    base_job_name        = job_name,         # the name of the training job\n",
158 |     "    role                 = role,             # IAM role used in training job to access AWS resources, e.g. S3\n",
159 |     "    volume_size          = 300,              # the size of the EBS volume in GB\n",
160 |     "    transformers_version = '4.28',           # the transformers version used in the training job\n",
161 |     "    pytorch_version      = '2.0',            # the pytorch version used in the training job\n",
162 |     "    py_version           = 'py310',          # the python version used in the training job\n",
163 |     "    hyperparameters      = hyperparameters,  # the hyperparameters passed to the training job\n",
164 |     "    environment          = { \"HUGGINGFACE_HUB_CACHE\": \"/opt/ml/.cache\" }, # set env variable to cache models in /opt/ml/.cache\n",
165 | ")"
166 | ]
167 | },
168 | {
169 | "cell_type": "code",
170 | "execution_count": null,
171 | "id": "adb98352-f5b8-4f46-8631-0995f1f17ade",
172 | "metadata": {
173 | "tags": []
174 | },
175 | "outputs": [],
176 | "source": [
177 |     "# define a data input dictionary with our training dataset location\n",
178 | "data = {'training': training_input_path}\n",
179 | "\n",
180 | "# starting the train job with our uploaded datasets as input\n",
181 | "huggingface_estimator.fit(data, wait=True)"
182 | ]
183 | },
184 | {
185 | "cell_type": "code",
186 | "execution_count": null,
187 | "id": "5b35f442-63fa-487e-b90d-8f79a4eadaf3",
188 | "metadata": {},
189 | "outputs": [],
190 | "source": []
191 | }
192 | ],
193 | "metadata": {
194 | "kernelspec": {
195 | "display_name": "conda_pytorch_p310",
196 | "language": "python",
197 | "name": "conda_pytorch_p310"
198 | },
199 | "language_info": {
200 | "codemirror_mode": {
201 | "name": "ipython",
202 | "version": 3
203 | },
204 | "file_extension": ".py",
205 | "mimetype": "text/x-python",
206 | "name": "python",
207 | "nbconvert_exporter": "python",
208 | "pygments_lexer": "ipython3",
209 | "version": "3.10.13"
210 | }
211 | },
212 | "nbformat": 4,
213 | "nbformat_minor": 5
214 | }
215 |
--------------------------------------------------------------------------------
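A note on the `local_gpu` instance type used in the notebook above: SageMaker local mode runs the training container on the notebook instance itself via Docker, so it needs the SDK's local extras and a working Docker daemon. One common setup pattern, sketched under the assumption that the standard `sagemaker[local]` extra is what your environment uses:

```python
# One-time setup on the notebook instance before using instance_type='local_gpu':
#   pip install 'sagemaker[local]'

from sagemaker.local import LocalSession

local_session = LocalSession()
# use the code on disk directly instead of packaging it to S3 first
local_session.config = {'local': {'local_code': True}}
```

The estimator can then be constructed with `sagemaker_session=local_session`; with `instance_type='local_gpu'` the training container runs against the notebook instance's own GPU.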
/training/train.py:
--------------------------------------------------------------------------------
1 | import os
2 | import argparse
3 | import json
4 | import torch
5 |
6 | from datasets import Dataset, load_from_disk
7 |
8 | from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, logging
9 |
10 | import bitsandbytes as bnb
11 | from peft import LoraConfig, PeftModel
12 | from trl import SFTTrainer, DataCollatorForCompletionOnlyLM
13 |
14 |
15 | # Define the create_prompt function
16 | def create_prompt(sample):
17 | bos_token = ""
18 | eos_token = ""
19 |
20 | instruction = sample['instruction']
21 | context = sample['context']
22 | response = sample['response']
23 |
24 |     text_row = f"""[INST] Below is the question based on the context. Question: {instruction}. Below is the given context: {context}. Write a response that appropriately completes the request.[/INST]"""
25 | answer_row = response
26 |
27 | sample["prompt"] = bos_token + text_row
28 | sample["completion"] = answer_row + eos_token
29 |
30 | return sample
31 |
32 |
33 | def prepare_dataset(dataset_location):
34 | dataset = load_from_disk(dataset_location)
35 | dataset_instruct_format = dataset.map(create_prompt, remove_columns=['instruction','context','response','category'])
36 | return dataset_instruct_format
37 |
38 |
39 | def parse_args():
40 | """Parse the arguments."""
41 | parser = argparse.ArgumentParser()
42 | # add model id and dataset path argument
43 | parser.add_argument(
44 | "--model_id",
45 | type=str,
46 | default="mistralai/Mistral-7B-Instruct-v0.1",
47 | help="Model id to use for training.",
48 | )
49 | # loading dataset from SageMaker training job instance local location
50 | parser.add_argument(
51 | "--dataset_path", type=str, default="/opt/ml/input/data/training", help="Path to dataset."
52 | )
53 | # saving the merged model to SageMaker training job instance local location
54 | parser.add_argument(
55 | "--model_save_path", type=str, default="/opt/ml/model", help="Path to save pretrained model."
56 | )
57 | parser.add_argument(
58 | "--job_output_path", type=str, default="/opt/ml/output/data", help="Path to model output artifact."
59 | )
60 |
61 | # add training hyperparameters for epochs, batch size, learning rate, and seed
62 | parser.add_argument(
63 | "--epochs", type=int, default=3, help="Number of epochs to train for."
64 | )
65 | parser.add_argument(
66 | "--per_device_train_batch_size",
67 | type=int,
68 | default=4,
69 | help="Batch size to use for training.",
70 | )
71 | parser.add_argument(
72 | "--lr", type=float, default=5e-5, help="Learning rate to use for training."
73 | )
74 |
75 | args, _ = parser.parse_known_args()
76 |
77 | return args
78 |
79 |
80 | def formatting_prompts_func(example):
81 | output_texts = []
82 | for i in range(len(example['prompt'])):
83 | text = f"{example['prompt'][i]}\n\n ### Answer: {example['completion'][i]}"
84 | output_texts.append(text)
85 | return output_texts
86 |
87 |
88 |
89 |
90 |
91 | def training_function(args):
92 | model_id = args.model_id
93 |
94 | ################################################################################
95 | # loading dataset from SageMaker training job instance local location
96 | ################################################################################
97 | # SageMaker copy the data from S3 to /opt/ml/input/data/training
98 | dataset_location = args.dataset_path
99 | print(dataset_location)
100 |     dataset_instruct_format = prepare_dataset(dataset_location)
101 |
102 |
103 | ################################################################################
104 | # bitsandbytes parameters and configuration, args and fixed parameters
105 | ################################################################################
106 |
107 | # Activate 4-bit precision base model loading
108 | use_4bit = True
109 |
110 | # Compute dtype for 4-bit base models
111 | bnb_4bit_compute_dtype = "bfloat16"
112 |
113 | # Quantization type (fp4 or nf4)
114 | bnb_4bit_quant_type = "nf4"
115 |
116 | # Activate nested quantization for 4-bit base models (double quantization)
117 | use_nested_quant = True
118 |
119 | # Load the base model with QLoRA configuration
120 | compute_dtype = getattr(torch, bnb_4bit_compute_dtype)
121 |
122 | bnb_config = BitsAndBytesConfig(
123 | load_in_4bit=use_4bit,
124 | bnb_4bit_quant_type=bnb_4bit_quant_type,
125 | bnb_4bit_compute_dtype=compute_dtype,
126 | bnb_4bit_use_double_quant=use_nested_quant,
127 | )
128 |     # Load the entire model on GPU 0
129 | device_map = {"": 0}
130 | #device_map = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
131 |
132 | ################################################################################
133 | # Loading the base model from Mistral Huggingface
134 | ################################################################################
135 | base_model = AutoModelForCausalLM.from_pretrained(
136 | model_id,
137 | quantization_config=bnb_config,
138 | device_map=device_map
139 | )
140 |
141 | base_model.config.use_cache = False
142 | base_model.config.pretraining_tp = 1
143 |
144 |     # Load the Mistral AI tokenizer
145 | tokenizer = AutoTokenizer.from_pretrained(model_id)
146 | tokenizer.pad_token = tokenizer.eos_token
147 | tokenizer.padding_side = 'right'
148 |
149 |
150 | ################################################################################
151 | # QLoRA parameters and Lora Config
152 | ################################################################################
153 |
154 | # LoRA attention dimension
155 | lora_r = 64
156 |
157 | # Alpha parameter for LoRA scaling
158 | lora_alpha = 16
159 |
160 | # Dropout probability for LoRA layers
161 | lora_dropout = 0.1
162 |
163 |
164 | # Set LoRA configuration
165 | peft_config = LoraConfig(
166 | lora_alpha=lora_alpha,
167 | lora_dropout=lora_dropout,
168 | r=lora_r,
169 | target_modules=[
170 | "q_proj",
171 | "k_proj",
172 | "v_proj",
173 | "o_proj",
174 | "gate_proj",
175 | "up_proj",
176 | "down_proj",
177 | ],
178 | bias="none",
179 | task_type="CAUSAL_LM",
180 | )
181 |
182 | ################################################################################
183 | # TrainingArguments parameters
184 | ################################################################################
185 |
186 | # Output directory where the model predictions and checkpoints will be stored
187 | # Similarly, SageMaker will copy the local output artifact to S3
188 | output_dir = args.job_output_path
189 |
190 | # Number of training epochs
191 | num_train_epochs = args.epochs
192 |
193 | # Enable fp16/bf16 training (set bf16 to True with an A100)
194 | fp16 = False
195 | bf16 = False
196 |
197 | # Batch size per GPU for training
198 | per_device_train_batch_size = args.per_device_train_batch_size
199 |
200 | # Batch size per GPU for evaluation
201 | per_device_eval_batch_size = args.per_device_train_batch_size
202 |
203 | # Number of update steps to accumulate the gradients for
204 | gradient_accumulation_steps = 1
205 |
206 | # Enable gradient checkpointing
207 | gradient_checkpointing = True
208 |
209 |     # Maximum gradient norm (gradient clipping)
210 | max_grad_norm = 0.3
211 |
212 | # Initial learning rate (AdamW optimizer)
213 | learning_rate = args.lr
214 |
215 | # Weight decay to apply to all layers except bias/LayerNorm weights
216 | weight_decay = 0.001
217 |
218 | # Optimizer to use
219 | optim = "paged_adamw_32bit"
220 |
221 | # Learning rate schedule (constant a bit better than cosine)
222 | lr_scheduler_type = "constant"
223 |
224 | # Number of training steps (overrides num_train_epochs)
225 | #max_steps = -1
226 |
227 | # Ratio of steps for a linear warmup (from 0 to learning rate)
228 | warmup_ratio = 0.03
229 |
230 | # Group sequences into batches with same length
231 | # Saves memory and speeds up training considerably
232 | group_by_length = True
233 |
234 | # Save checkpoint every X updates steps
235 | save_steps = 250
236 |
237 | # Log every X updates steps
238 | logging_steps = 25
239 |
240 | # Set training parameters
241 | training_arguments = TrainingArguments(
242 | output_dir=output_dir,
243 | num_train_epochs=num_train_epochs,
244 | per_device_train_batch_size=per_device_train_batch_size,
245 | gradient_accumulation_steps=gradient_accumulation_steps,
246 | optim=optim,
247 | save_steps=save_steps,
248 | logging_steps=logging_steps,
249 | learning_rate=learning_rate,
250 | weight_decay=weight_decay,
251 | gradient_checkpointing=gradient_checkpointing,
252 | fp16=fp16,
253 | bf16=bf16,
254 | max_grad_norm=max_grad_norm,
255 | #max_steps=100, # the total number of training steps to perform
256 | warmup_ratio=warmup_ratio,
257 | group_by_length=group_by_length,
258 | lr_scheduler_type=lr_scheduler_type,
259 | )
260 |
261 |
262 | response_template = "### Answer:"
263 | collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)
264 |
265 | ################################################################################
266 | # SFT parameters
267 | ################################################################################
268 |
269 | # Maximum sequence length to use
270 | max_seq_length = 2048
271 |
272 | # Pack multiple short examples in the same input sequence to increase efficiency
273 | packing = False
274 |
275 |
276 | # Initialize the SFTTrainer for fine-tuning
277 | trainer = SFTTrainer(
278 | model=base_model,
279 | train_dataset=dataset_instruct_format,
280 | formatting_func=formatting_prompts_func,
281 | data_collator=collator,
282 | peft_config=peft_config,
283 | max_seq_length=max_seq_length, # You can specify the maximum sequence length here
284 | tokenizer=tokenizer,
285 | args=training_arguments,
286 | packing=packing
287 | )
288 |
289 | # Start the training process
290 | trainer.train()
291 |
292 |
293 | ################################################################################
294 | # Save the pretrained model
295 | ################################################################################
296 | sagemaker_save_dir = args.model_save_path
297 |     # set the name of the new model
298 | new_model = "Mistral-qlora-7B-Instruct-v0.1"
299 |
300 | # Save the fine-tuned model
301 | trainer.model.save_pretrained(new_model)
302 |
303 | base_model = AutoModelForCausalLM.from_pretrained(
304 | model_id,
305 | low_cpu_mem_usage=True,
306 | torch_dtype=torch.float16,
307 | device_map=device_map
308 | )
309 |
310 |     merged_model = PeftModel.from_pretrained(base_model, new_model)
311 | 
312 |     merged_model = merged_model.merge_and_unload()
313 |
314 | merged_model.save_pretrained(sagemaker_save_dir, safe_serialization=True, max_shard_size="2GB")
315 | # save tokenizer for easy inference
316 | tokenizer.save_pretrained(sagemaker_save_dir)
317 |
318 | def main():
319 |     args = parse_args()
320 | training_function(args)
321 |
322 |
323 | if __name__ == "__main__":
324 | main()
--------------------------------------------------------------------------------
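A quick sanity check of the prompt pipeline in `train.py`, using a single made-up record in the dolly schema (the record contents are illustrative, not from the dataset):

```python
# assumes the repo root is the working directory and the training deps are installed
from training.train import create_prompt, formatting_prompts_func

# Illustrative record following the dolly schema consumed by create_prompt
sample = {
    "instruction": "What is Amazon SageMaker?",
    "context": "Amazon SageMaker is a managed machine learning service from AWS.",
    "response": "A managed AWS service for building, training, and deploying ML models.",
}

row = create_prompt(dict(sample))
print(row["prompt"])      # "[INST] Below is the question based on the context. ... [/INST]"
print(row["completion"])  # the expected answer

# formatting_prompts_func operates on a batch (column-oriented dict of lists)
batch = {"prompt": [row["prompt"]], "completion": [row["completion"]]}
print(formatting_prompts_func(batch)[0])  # prompt + "\n\n ### Answer: " + completion
```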
/Finetune_Mistral_7B_on_Amazon_SageMaker.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": null,
6 | "id": "0877bf5e-dcb6-4377-94e5-06332ab32848",
7 | "metadata": {
8 | "tags": []
9 | },
10 | "outputs": [],
11 | "source": [
12 | "!pip install transformers==4.38.1 datasets==2.17.1 peft==0.8.2 bitsandbytes==0.42.0 trl==0.7.11 --upgrade --quiet"
13 | ]
14 | },
15 | {
16 | "cell_type": "markdown",
17 | "id": "be57a9b4",
18 | "metadata": {},
19 | "source": [
20 |     "This notebook has been tested on an Amazon SageMaker Notebook Instance with a single GPU (ml.g5.2xlarge)."
21 | ]
22 | },
23 | {
24 | "cell_type": "code",
25 | "execution_count": 45,
26 | "id": "73cac042-5399-46d3-bfde-2728a06fe450",
27 | "metadata": {
28 | "tags": []
29 | },
30 | "outputs": [
31 | {
32 | "name": "stdout",
33 | "output_type": "stream",
34 | "text": [
35 | "dataset size: 15011\n",
36 | "{'instruction': 'Given this paragraph, which highs school did Drake Maye attend?', 'context': 'Drake Maye was born on August 30, 2002, in Charlotte, North Carolina. He attended and played high school football for Myers Park High School in Charlotte, where he was named MaxPreps North Carolina player of the year. He was a four-star prospect and originally committed to Alabama before flipping to North Carolina.', 'response': 'Based on this text, Drake Maye attended Myers Park High School in Charlotte, North Carolina.', 'category': 'closed_qa'}\n"
37 | ]
38 | }
39 | ],
40 | "source": [
41 | "from datasets import load_dataset\n",
42 | "\n",
43 | "from random import randrange\n",
44 | "\n",
45 | "# Load dataset from the hub\n",
46 | "dataset = load_dataset(\"databricks/databricks-dolly-15k\", split=\"train\")\n",
47 | "\n",
48 |     "# For local testing of the fine-tuning code, you can limit the dataset to 20 samples\n",
49 | "#dataset = load_dataset(\"databricks/databricks-dolly-15k\", split=\"train\").select(range(20))\n",
50 | "\n",
51 | "print(f\"dataset size: {len(dataset)}\")\n",
52 | "print(dataset[randrange(len(dataset))])\n"
53 | ]
54 | },
55 | {
56 | "cell_type": "code",
57 | "execution_count": 37,
58 | "id": "7b382391-42b8-4393-8523-55057824e3dc",
59 | "metadata": {
60 | "tags": []
61 | },
62 | "outputs": [
63 | {
64 | "data": {
65 | "application/vnd.jupyter.widget-view+json": {
66 | "model_id": "4914025056f24db28ad831c972f94555",
67 | "version_major": 2,
68 | "version_minor": 0
69 | },
70 | "text/plain": [
71 | "Saving the dataset (0/1 shards): 0%| | 0/15011 [00:00, ? examples/s]"
72 | ]
73 | },
74 | "metadata": {},
75 | "output_type": "display_data"
76 | }
77 | ],
78 | "source": [
79 | "local_path = \"./dataset/dolly.hf\"\n",
80 | "dataset.save_to_disk(local_path)"
81 | ]
82 | },
83 | {
84 | "cell_type": "code",
85 | "execution_count": 38,
86 | "id": "31100935-9aec-4643-9447-9e9daf35d770",
87 | "metadata": {
88 | "tags": []
89 | },
90 | "outputs": [
91 | {
92 | "name": "stdout",
93 | "output_type": "stream",
94 | "text": [
95 | "sagemaker role arn: arn:aws:iam::70768*******:role/service-role/AmazonSageMaker-ExecutionRole-20191024T163188\n",
96 | "sagemaker session region: us-east-1\n"
97 | ]
98 | }
99 | ],
100 | "source": [
101 | "import sagemaker\n",
102 | "import boto3\n",
103 | "sess = sagemaker.Session()\n",
104 | "# sagemaker session bucket -> used for uploading data, models and logs\n",
105 |     "# sagemaker will automatically create this bucket if it does not exist\n",
106 | "sagemaker_session_bucket=None\n",
107 | "if sagemaker_session_bucket is None and sess is not None:\n",
108 | " # set to default bucket if a bucket name is not given\n",
109 | " sagemaker_session_bucket = sess.default_bucket()\n",
110 | "\n",
111 | "try:\n",
112 | " role = sagemaker.get_execution_role()\n",
113 | "except ValueError:\n",
114 | " iam = boto3.client('iam')\n",
115 | " role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']\n",
116 | "\n",
117 | "sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)\n",
118 | "\n",
119 | "print(f\"sagemaker role arn: {role}\")\n",
120 | "print(f\"sagemaker session region: {sess.boto_region_name}\")"
121 | ]
122 | },
123 | {
124 | "cell_type": "code",
125 | "execution_count": 39,
126 | "id": "08165c6c-cf0b-45df-ab4a-7073b421b9fb",
127 | "metadata": {
128 | "tags": []
129 | },
130 | "outputs": [
131 | {
132 | "name": "stdout",
133 | "output_type": "stream",
134 | "text": [
135 | "training dataset uploaded to --- > s3://sagemaker-us-east-1-70768*******/train/data/dolly.hf\n"
136 | ]
137 | }
138 | ],
139 | "source": [
140 | "# save train_dataset to s3\n",
141 | "\n",
142 | "s3_data_prefix = \"train/data/dolly.hf\"\n",
143 | "bucket = sagemaker_session_bucket # bucket to house artifacts\n",
144 | "training_input_path = sess.upload_data(local_path, bucket, s3_data_prefix)\n",
145 | "print(f\"training dataset uploaded to --- > {training_input_path}\")"
146 | ]
147 | },
148 | {
149 | "cell_type": "markdown",
150 | "id": "55d6a529-2f76-4a97-bfd8-254c20239a18",
151 | "metadata": {},
152 | "source": [
153 | "### Training with SageMaker"
154 | ]
155 | },
156 | {
157 | "cell_type": "code",
158 | "execution_count": 40,
159 | "id": "dbeac35c-38d8-4823-85a2-a05deb347e68",
160 | "metadata": {
161 | "tags": []
162 | },
163 | "outputs": [],
164 | "source": [
165 | "model_id = \"mistralai/Mistral-7B-Instruct-v0.1\"\n",
166 |     "instance_type = 'ml.g5.4xlarge' # instance type used for the training job"
167 | ]
168 | },
169 | {
170 | "cell_type": "code",
171 | "execution_count": 41,
172 | "id": "dc690758-4303-4e16-a115-eccbed1a2f65",
173 | "metadata": {
174 | "tags": []
175 | },
176 | "outputs": [],
177 | "source": [
178 | "import time\n",
179 | "from sagemaker.huggingface import HuggingFace\n",
180 | "from huggingface_hub import HfFolder\n",
181 | "\n",
182 | "# define Training Job Name\n",
183 | "job_name = f'huggingface-qlora-{model_id.replace(\"/\", \"-\").lower()}'\n",
184 | "\n",
185 | "# hyperparameters, which are passed into the training job\n",
186 | "hyperparameters ={\n",
187 | " 'model_id': model_id, # pre-trained model\n",
188 |     "    'dataset_path': '/opt/ml/input/data/training/dolly.hf', # path where sagemaker mounts the training dataset\n",
189 | " 'epochs': 1, # number of training epochs\n",
190 | " 'per_device_train_batch_size': 1, # batch size for training\n",
191 | " 'lr': 2e-5, # learning rate used during training\n",
192 | "}\n",
193 | "metric=[\n",
194 | " {\"Name\": \"loss\", \"Regex\": r\"'loss':\\s*([0-9.]+)\"},\n",
195 | " {\"Name\": \"epoch\", \"Regex\": r\"'epoch':\\s*([0-9.]+)\"},\n",
196 | "]\n",
197 | "# create the Estimator\n",
198 | "huggingface_estimator = HuggingFace(\n",
199 | " entry_point = 'train.py', # train script\n",
200 | " source_dir = 'training', # directory which includes all the files needed for training\n",
201 | " metric_definitions = metric,\n",
202 |     "    instance_type        = instance_type,    # instance type used for the training job\n",
203 |     "    instance_count       = 1,                # the number of instances used for training\n",
204 |     "    base_job_name        = job_name,         # the name of the training job\n",
205 |     "    role                 = role,             # IAM role used in training job to access AWS resources, e.g. S3\n",
206 |     "    volume_size          = 300,              # the size of the EBS volume in GB\n",
207 |     "    transformers_version = '4.28',           # the transformers version used in the training job\n",
208 |     "    pytorch_version      = '2.0',            # the pytorch version used in the training job\n",
209 |     "    py_version           = 'py310',          # the python version used in the training job\n",
210 |     "    hyperparameters      = hyperparameters,  # the hyperparameters passed to the training job\n",
211 |     "    environment          = { \"HUGGINGFACE_HUB_CACHE\": \"/tmp/.cache\" }, # set env variable to cache models in /tmp\n",
212 | ")"
213 | ]
214 | },
215 | {
216 | "cell_type": "code",
217 | "execution_count": 42,
218 | "id": "adb98352-f5b8-4f46-8631-0995f1f17ade",
219 | "metadata": {
220 | "tags": []
221 | },
222 | "outputs": [
223 | {
224 | "name": "stderr",
225 | "output_type": "stream",
226 | "text": [
227 | "INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.\n",
228 | "INFO:sagemaker:Creating training-job with name: huggingface-qlora-mistralai-mistral-7b--2024-03-04-07-40-17-256\n"
229 | ]
230 | }
231 | ],
232 | "source": [
233 |     "# define a data input dictionary with our uploaded s3 uris\n",
234 | "training_input_path = \"s3://sagemaker-us-east-1-70768*******/train/data\"\n",
235 | "data = {'training': training_input_path}\n",
236 | "\n",
237 | "# starting the train job with our uploaded datasets as input\n",
238 | "huggingface_estimator.fit(data, wait=False)"
239 | ]
240 | },
241 | {
242 | "cell_type": "markdown",
243 | "id": "73c04245-ffd1-44af-a7af-20fdc9756ea1",
244 | "metadata": {},
245 | "source": [
246 | "### Download the model weight from SageMaker Training job"
247 | ]
248 | },
249 | {
250 | "cell_type": "code",
251 | "execution_count": 26,
252 | "id": "8c841257-839e-4bf7-9708-0159a2d9f65b",
253 | "metadata": {
254 | "tags": []
255 | },
256 | "outputs": [
257 | {
258 | "name": "stdout",
259 | "output_type": "stream",
260 | "text": [
261 | "sagemaker-us-east-1-70768********\n"
262 | ]
263 | },
264 | {
265 | "data": {
266 | "text/plain": [
267 | "['./results/training_job/model.tar.gz']"
268 | ]
269 | },
270 | "execution_count": 26,
271 | "metadata": {},
272 | "output_type": "execute_result"
273 | }
274 | ],
275 | "source": [
276 | "# Specify the training job name\n",
277 | "from sagemaker.s3 import S3Downloader\n",
278 | "\n",
279 | "training_job_name = 'huggingface-qlora-mistralai-mistral-7b--2024-03-04-07-40-17-256'\n",
280 | "print(sagemaker_session_bucket)\n",
281 | "key = f'{training_job_name}/output/model.tar.gz'\n",
282 | "\n",
283 | "# Download the output of the training job\n",
284 | "local_path = './results/training_job/'\n",
285 | "S3Downloader.download(f's3://{bucket}/{key}', local_path)"
286 | ]
287 | },
288 | {
289 | "cell_type": "markdown",
290 | "id": "42161f1c-5c05-4b24-9b21-11a609154645",
291 | "metadata": {},
292 | "source": [
293 | "### The deployable model artifact with huggingface safe tensor"
294 | ]
295 | },
296 | {
297 | "cell_type": "code",
298 | "execution_count": 30,
299 | "id": "0f03e0d2-fee2-4222-9b89-a086e3d178b0",
300 | "metadata": {
301 | "tags": []
302 | },
303 | "outputs": [],
304 | "source": [
305 | "import tarfile\n",
306 | "\n",
307 | "# Specify the path to the tar.gz file\n",
308 | "tar_gz_file = local_path + \"model.tar.gz\"\n",
309 | "\n",
310 | "# Extract the contents of the tar.gz file\n",
311 | "with tarfile.open(tar_gz_file, 'r:gz') as tar:\n",
312 | " tar.extractall('./results/training_job') # Specify the directory where you want to extract the contents"
313 | ]
314 | },
315 | {
316 | "cell_type": "code",
317 | "execution_count": null,
318 | "id": "4f8bae1f-5015-49c4-be96-ca319582b212",
319 | "metadata": {},
320 | "outputs": [],
321 | "source": []
322 | }
323 | ],
324 | "metadata": {
325 | "kernelspec": {
326 | "display_name": "conda_pytorch_p310",
327 | "language": "python",
328 | "name": "conda_pytorch_p310"
329 | },
330 | "language_info": {
331 | "codemirror_mode": {
332 | "name": "ipython",
333 | "version": 3
334 | },
335 | "file_extension": ".py",
336 | "mimetype": "text/x-python",
337 | "name": "python",
338 | "nbconvert_exporter": "python",
339 | "pygments_lexer": "ipython3",
340 | "version": "3.10.13"
341 | }
342 | },
343 | "nbformat": 4,
344 | "nbformat_minor": 5
345 | }
346 |
--------------------------------------------------------------------------------
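After extraction, the safetensors model directory can be uploaded back to S3 so the deploy notebooks can reference it. A short sketch; the local directory and S3 prefix are illustrative assumptions:

```python
import sagemaker
from sagemaker.s3 import S3Uploader

sess = sagemaker.Session()
model_dir = "./results/training_job"  # assumed: directory with the extracted safetensors and tokenizer
s3_model_uri = f"s3://{sess.default_bucket()}/models/mistral-7b-dolly-qlora"  # illustrative prefix

S3Uploader.upload(local_path=model_dir, desired_s3_uri=s3_model_uri)
print(f"model uploaded to {s3_model_uri}")
```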
/tutorial/Deploy_model_on_Amazon_SageMaker_with_vLLM.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "82d23eb4-072b-4832-b7ff-6d9516bb1d3c",
6 | "metadata": {},
7 | "source": [
8 |     "# Deploying Large Language Models Using the vLLM Backend for General Inference\n",
9 | "\n",
10 | "\n",
11 |     "In this tutorial, you will use the vLLM backend of the Large Model Inference (LMI) DLC to deploy a Hugging Face model, and use boto3 to test inference with both streaming and non-streaming options.\n",
12 | "\n",
13 | "Please ensure that your machine has sufficient disk space before proceeding.\n"
14 | ]
15 | },
16 | {
17 | "cell_type": "markdown",
18 | "id": "7741f93b-8127-4d2f-9e08-804cf5e516a2",
19 | "metadata": {},
20 | "source": [
21 | "## Step 1: Setup development environment"
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "execution_count": null,
27 | "id": "9980179f-2656-4f3a-a662-57d7eefb45b6",
28 | "metadata": {
29 | "tags": []
30 | },
31 | "outputs": [],
32 | "source": [
33 | "!pip install \"sagemaker>=2.216.0\" --upgrade --quiet"
34 | ]
35 | },
36 | {
37 | "cell_type": "code",
38 | "execution_count": null,
39 | "id": "0220210b-6db2-42c3-a36e-947f82645160",
40 | "metadata": {
41 | "tags": []
42 | },
43 | "outputs": [],
44 | "source": [
45 | "!pip install huggingface_hub jinja2"
46 | ]
47 | },
48 | {
49 | "cell_type": "code",
50 | "execution_count": null,
51 | "id": "4237222b-a70d-4c77-b8b0-e5114308634d",
52 | "metadata": {
53 | "tags": []
54 | },
55 | "outputs": [],
56 | "source": [
57 | "import sagemaker\n",
58 | "import boto3\n",
59 | "sess = sagemaker.Session()\n",
60 | "# sagemaker session bucket -> used for uploading data, models and logs\n",
61 |     "# sagemaker will automatically create this bucket if it does not exist\n",
62 | "sagemaker_session_bucket=None\n",
63 | "if sagemaker_session_bucket is None and sess is not None:\n",
64 | " # set to default bucket if a bucket name is not given\n",
65 | " sagemaker_session_bucket = sess.default_bucket()\n",
66 | "\n",
67 | "try:\n",
68 | " role = sagemaker.get_execution_role()\n",
69 | "except ValueError:\n",
70 | " iam = boto3.client('iam')\n",
71 | " role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']\n",
72 | "\n",
73 | "sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)\n",
74 | "\n",
75 | "print(f\"sagemaker role arn: {role}\")\n",
76 | "print(f\"sagemaker session region: {sess.boto_region_name}\")\n"
77 | ]
78 | },
79 | {
80 | "cell_type": "markdown",
81 | "id": "6f9667c3-d3b8-4ad8-845f-2eecbee8443a",
82 | "metadata": {},
83 | "source": [
84 | "## Step 2: Start preparing model artifacts\n",
85 |     "In the LMI container, we expect some artifacts to help set up the model:\n",
86 | "\n",
87 | "- serving.properties (required): Defines the model server settings\n",
88 | "- model.py (optional): A python file to define the core inference logic\n",
89 |     "- requirements.txt (optional): Any additional pip packages that need to be installed\n",
90 | "\n",
91 | "For the purpose of this tutorial, we will focus only on the `serving.properties` file."
92 | ]
93 | },
94 | {
95 | "cell_type": "markdown",
96 | "id": "cb1fcbda-113c-4eb3-94de-1e31884c64db",
97 | "metadata": {},
98 | "source": [
99 | "### Download the model to local and upload to s3\n",
100 | "\n",
101 | "Please skip this step if you already possess a model on S3, whether it's a downloaded or a fine-tuned version.\n",
102 | "\n",
103 | "\n",
104 | "Update your model ID and Hugging Face token."
105 | ]
106 | },
107 | {
108 | "cell_type": "code",
109 | "execution_count": null,
110 | "id": "cdd51cba-dfc8-4158-9fec-a52bee345be8",
111 | "metadata": {
112 | "tags": []
113 | },
114 | "outputs": [],
115 | "source": [
116 | "model_id =\"****\"\n",
117 | "huggingface_token = \"****\""
118 | ]
119 | },
120 | {
121 | "cell_type": "code",
122 | "execution_count": null,
123 | "id": "afc86b0b-2380-43d8-9e45-ae78d3ae4848",
124 | "metadata": {
125 | "tags": []
126 | },
127 | "outputs": [],
128 | "source": [
129 | "from huggingface_hub import snapshot_download\n",
130 | "# Download the model repository from the Hugging Face Hub\n",
131 | "model_directory = snapshot_download(model_id, token= huggingface_token, local_dir=f\"/home/ec2-user/SageMaker/{model_id}\", ignore_patterns=[\"*.pth\", \"original/*\"])\n",
132 | "print(f\"Downloaded model {model_id} to {model_directory}\")"
133 | ]
134 | },
135 | {
136 | "cell_type": "code",
137 | "execution_count": null,
138 | "id": "53e7519e-b461-44b2-926b-07fae6710016",
139 | "metadata": {
140 | "tags": []
141 | },
142 | "outputs": [],
143 | "source": [
144 | "### Upload to s3\n",
145 | "from sagemaker.s3 import S3Uploader\n",
146 | "\n",
147 | "S3Uploader.upload(\n",
148 |     "    local_path=model_directory,\n",
149 | " desired_s3_uri=f\"s3://{sagemaker_session_bucket}/models/{model_id}\",\n",
150 | " sagemaker_session=sess\n",
151 | " )"
152 | ]
153 | },
154 | {
155 | "cell_type": "markdown",
156 | "id": "28c17391-9e16-470e-aad4-d9a152f7ee31",
157 | "metadata": {},
158 | "source": [
159 | "### Prepare the serving.properties\n",
160 | "\n",
161 | "Update the model location to the correct S3 path. If you have fine-tuned a model stored in S3, please change the value to reflect your specific S3 bucket location."
162 | ]
163 | },
164 | {
165 | "cell_type": "code",
166 | "execution_count": null,
167 | "id": "6686fe4a-6901-463e-a5fa-6af9ce5a5ae5",
168 | "metadata": {
169 | "tags": []
170 | },
171 | "outputs": [],
172 | "source": [
173 | "import jinja2\n",
174 | "import os\n",
175 | "from pathlib import Path\n",
176 | "\n",
177 | "# Define the directory path\n",
178 | "deployment_path = \"deployment\"\n",
179 | "\n",
180 | "# Check if the directory exists. If not, create it.\n",
181 | "os.makedirs(deployment_path, exist_ok=True)\n",
182 | "\n",
183 | "jinja_env = jinja2.Environment()\n",
184 | "\n",
185 | "template = jinja_env.from_string(Path(\"serving.template\").open().read())\n",
186 | "Path(f\"{deployment_path}/serving.properties\").open(\"w\").write(\n",
187 | " template.render(model_id=f\"s3://{sagemaker_session_bucket}/models/{model_id}\")\n",
188 | "\n",
189 | ")\n",
190 | "!pygmentize deployment/serving.properties | cat -n"
191 | ]
192 | },
193 | {
194 | "cell_type": "markdown",
195 | "id": "c03a19ad-fcc2-41f3-bbcf-01e745db6fe8",
196 | "metadata": {},
197 | "source": [
198 |     "Pack your serving.properties into a tar file."
199 | ]
200 | },
201 | {
202 | "cell_type": "code",
203 | "execution_count": null,
204 | "id": "74176a52-7ecd-414c-bb95-4a07d26e2851",
205 | "metadata": {
206 | "tags": []
207 | },
208 | "outputs": [],
209 | "source": [
210 | "%%sh\n",
211 | "mkdir mymodel\n",
212 | "rm -f mymodel.tar.gz\n",
213 | "mv deployment/serving.properties mymodel/\n",
214 | "tar czvf mymodel.tar.gz mymodel/\n",
215 | "rm -rf mymodel"
216 | ]
217 | },
218 | {
219 | "cell_type": "markdown",
220 | "id": "a7872a3c-fd31-4102-b154-cfea6b804fcf",
221 | "metadata": {},
222 | "source": [
223 | "## Step 3: Start building SageMaker endpoint\n",
224 | "\n",
225 | "Getting the container image URI"
226 | ]
227 | },
228 | {
229 | "cell_type": "code",
230 | "execution_count": null,
231 | "id": "2710a93c-8d0c-40c2-a87a-d62206346c4e",
232 | "metadata": {
233 | "tags": []
234 | },
235 | "outputs": [],
236 | "source": [
237 | "from sagemaker import image_uris \n",
238 | "image_uri = image_uris.retrieve(\n",
239 | " framework=\"djl-deepspeed\",\n",
240 | " region=sess.boto_session.region_name,\n",
241 | " version=\"0.27.0\"\n",
242 | " )"
243 | ]
244 | },
245 | {
246 | "cell_type": "markdown",
247 | "id": "68729439-b647-447b-b128-0466b8231110",
248 | "metadata": {},
249 | "source": [
250 | "Upload artifact on S3 and create SageMaker model"
251 | ]
252 | },
253 | {
254 | "cell_type": "code",
255 | "execution_count": null,
256 | "id": "ca7886ed-8dda-460e-8442-7d2e3829dd67",
257 | "metadata": {
258 | "tags": []
259 | },
260 | "outputs": [],
261 | "source": [
262 | "from sagemaker import Model\n",
263 | "\n",
264 | "s3_code_prefix = f\"large-model-vllm/{model_id}_code\"\n",
265 | "bucket = sess.default_bucket() # bucket to house artifacts\n",
266 | "code_artifact = sess.upload_data(\"mymodel.tar.gz\", bucket, s3_code_prefix)\n",
267 | "print(f\"S3 Code or Model tar ball uploaded to --- > {code_artifact}\")\n",
268 | "\n",
269 | "model = Model(image_uri=image_uri, model_data=code_artifact, role=role)"
270 | ]
271 | },
272 | {
273 | "cell_type": "markdown",
274 | "id": "b557f038-76c8-4593-9e4e-48fc2dff5800",
275 | "metadata": {},
276 | "source": [
277 | "Deploy SageMaker endpoint\n",
278 | "- instance_type = \"ml.g5.2xlarge\": This line sets the type of machine that SageMaker will use to host the endpoint. The instance type ml.g5.2xlarge is generally suitable for demanding machine learning tasks. If you are planning to use larger token lengths in your model, you might need to choose a more powerful instance type to ensure optimal performance.\n",
279 | "- endpoint_name = sagemaker.utils.name_from_base(f\"lmi-model-{model_id.replace('/', '-')}\"): This line generates a unique name for the SageMaker endpoint. It uses the model ID, modifying it to replace slashes with hyphens to create a valid endpoint name. This is necessary because certain characters like slashes may not be permitted in AWS resource names.\n",
280 | "- model.deploy(...): This function call deploys the model to the configured SageMaker endpoint. Here are the parameters used:\n",
281 | " * initial_instance_count=1: This specifies that one instance of the specified type should be used.\n",
282 | " * instance_type: As defined earlier, this is the type of instance to deploy.\n",
283 | " * endpoint_name: The unique name generated for the endpoint.\n",
284 | " * container_startup_health_check_timeout=1800: This sets a timeout value in seconds for the container startup \n",
285 | " health check, ensuring that the deployment does not hang indefinitely if issues occur during startup."
286 | ]
287 | },
288 | {
289 | "cell_type": "code",
290 | "execution_count": null,
291 | "id": "4ba8fdfd-c2c9-4c00-8de1-6e08b70a7be8",
292 | "metadata": {
293 | "tags": []
294 | },
295 | "outputs": [],
296 | "source": [
297 | "# Set the instance type; update this if using larger token lengths\n",
298 | "instance_type = \"ml.g5.2xlarge\"\n",
299 | "endpoint_name = sagemaker.utils.name_from_base(f\"lmi-model-{model_id.replace('/', '-')}\")\n",
300 | "print(f\"endpoint_name: {endpoint_name}\")\n",
301 | "\n",
302 | "model.deploy(initial_instance_count=1,\n",
303 | " instance_type=instance_type,\n",
304 | " endpoint_name=endpoint_name,\n",
305 | " container_startup_health_check_timeout=1800\n",
306 | " )"
307 | ]
308 | },
309 | {
310 | "cell_type": "markdown",
311 | "id": "3f27076f-db37-4b45-a4b8-5cd47beb175d",
312 | "metadata": {},
313 | "source": [
314 | "## Step 4: Run inference\n",
315 | "\n",
316 | "In the example below, we demonstrate the inference process using a sample question as follows:"
317 | ]
318 | },
319 | {
320 | "cell_type": "code",
321 | "execution_count": null,
322 | "id": "923797e4-8f2e-4291-bb35-76cec20237da",
323 | "metadata": {
324 | "tags": []
325 | },
326 | "outputs": [],
327 | "source": [
328 | "question= \"tell me about Harry Potter in 100 words\""
329 | ]
330 | },
331 | {
332 | "cell_type": "markdown",
333 | "id": "87abee78-54e6-4c4e-8348-ee6fa6aa35de",
334 | "metadata": {},
335 | "source": [
336 | "### Normal request\n",
337 | "\n",
338 | "To query an Amazon SageMaker model endpoint effectively, you use the invoke_endpoint API provided by the SageMaker Runtime service. This API allows you to send input data to your deployed model and receive predictions in response. "
339 | ]
340 | },
341 | {
342 | "cell_type": "code",
343 | "execution_count": null,
344 | "id": "4ab6698e-7271-465f-82ca-e3346724ea84",
345 | "metadata": {
346 | "tags": []
347 | },
348 | "outputs": [],
349 | "source": [
350 | "input_data = {\n",
351 | " \"inputs\": f\"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\\n\\n{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\\n\\n\\n\\n\", \n",
352 | " \"parameters\": {\"max_new_tokens\":1024}\n",
353 | "}"
354 | ]
355 | },
356 | {
357 | "cell_type": "code",
358 | "execution_count": null,
359 | "id": "2b3b6bed-e201-490b-b4f2-cbc0107658c8",
360 | "metadata": {
361 | "tags": []
362 | },
363 | "outputs": [],
364 | "source": [
365 | "import json\n",
366 | "# Create a SageMaker runtime client with the AWS SDK\n",
367 | "client = boto3.client('sagemaker-runtime')\n",
368 | "\n",
369 | "# Convert the input data to JSON string\n",
370 | "payload = json.dumps(input_data)\n",
371 | "\n",
372 | "# Set the content type for the endpoint, adjust if different for your model\n",
373 | "content_type = \"application/json\"\n",
374 | "\n",
375 | "# Invoke the SageMaker endpoint\n",
376 | "response = client.invoke_endpoint(\n",
377 | " EndpointName=endpoint_name,\n",
378 | " ContentType=content_type,\n",
379 | " Body=payload\n",
380 | ")\n",
381 | "\n",
382 | "# The response is a stream of bytes. We need to read and decode it.\n",
383 | "result = response['Body'].read().decode('utf-8')\n",
384 | "\n",
385 | "print(result)"
386 | ]
387 | },
388 | {
389 | "cell_type": "markdown",
390 | "id": "4c7a41fe-3417-457a-8d06-580d3ad230a3",
391 | "metadata": {},
392 | "source": [
393 | "### Streaming\n",
394 | "\n",
395 | "The invoke_endpoint_with_response_stream function is an API provided by Amazon SageMaker, designed to handle streaming responses from a deployed model endpoint. \n",
396 | "\n",
397 | "The `LineIterator` is copied from https://github.com/deepjavalibrary/djl-demo/blob/master/aws/sagemaker/large-model-inference/sample-llm/utils/LineIterator.py"
398 | ]
399 | },
400 | {
401 | "cell_type": "code",
402 | "execution_count": null,
403 | "id": "121de517-f590-4f2f-9307-a0c3a3a9f8ba",
404 | "metadata": {
405 | "tags": []
406 | },
407 | "outputs": [],
408 | "source": [
409 | "import json\n",
410 | "import boto3\n",
411 | "from utils.LineIterator import LineIterator\n",
412 | "\n",
413 | "smr_client = boto3.client(\"sagemaker-runtime\")\n",
414 | "def get_realtime_response_stream(sagemaker_runtime, endpoint_name, payload):\n",
415 | " response_stream = sagemaker_runtime.invoke_endpoint_with_response_stream(\n",
416 | " EndpointName=endpoint_name,\n",
417 | " Body=json.dumps(payload), \n",
418 | " ContentType=\"application/json\"\n",
419 | " )\n",
420 | " return response_stream\n",
421 | "\n",
422 | "\n",
423 | "\n",
424 | "def print_response_stream(response_stream):\n",
425 | " event_stream = response_stream.get('Body')\n",
426 |     "    last_error_line = ''\n",
427 |     "    for line in LineIterator(event_stream):\n",
428 |     "        try:\n",
429 |     "            print(json.loads(last_error_line + line)[\"token\"][\"text\"], end='')\n",
430 |     "            last_error_line = ''\n",
431 |     "        except (json.JSONDecodeError, KeyError):  # buffer partial JSON lines until complete\n",
432 |     "            last_error_line = line"
433 | ]
434 | },
435 | {
436 | "cell_type": "code",
437 | "execution_count": null,
438 | "id": "2898fd61-2323-498c-92d8-54853ae187f2",
439 | "metadata": {
440 | "tags": []
441 | },
442 | "outputs": [],
443 | "source": [
444 | "payload = { \n",
445 | " \"inputs\": f\"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\\n\\n{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\\n\\n\\n\\n\", \n",
446 | " \"parameters\": {\n",
447 | " \"max_new_tokens\":1024, \n",
448 | " \"stop\":[\"<|start_header_id|>\", \"<|end_header_id|>\", \"<|eot_id|>\", \"<|reserved_special_token\"]\n",
449 | " },\n",
450 | " \"stream\": True ## <-- to have response stream.\n",
451 | "}\n",
452 | "response_stream = get_realtime_response_stream(smr_client, endpoint_name, payload)\n",
453 | "print_response_stream(response_stream)"
454 | ]
455 | },
456 | {
457 | "cell_type": "markdown",
458 | "id": "82b766c8-c97c-4189-a682-5a9d6f66fad4",
459 | "metadata": {
460 | "tags": []
461 | },
462 | "source": [
463 | "## Clear Resources"
464 | ]
465 | },
466 | {
467 | "cell_type": "code",
468 | "execution_count": null,
469 | "id": "9b38fd78-6ef3-4d08-9ea4-bb6b96fe3462",
470 | "metadata": {},
471 | "outputs": [],
472 | "source": [
473 | "sess.delete_endpoint(endpoint_name)\n",
474 | "sess.delete_endpoint_config(endpoint_name)\n",
475 | "model.delete_model()\n"
476 | ]
477 | },
478 | {
479 | "cell_type": "markdown",
480 | "source": [
481 | "## Further reading\n",
482 | "\n",
483 | "Please visit https://github.com/deepjavalibrary/djl-demo/tree/master/aws/sagemaker/large-model-inference/sample-llm \n",
484 | "for more LMI usage tutorials."
485 | ],
486 | "metadata": {
487 | "collapsed": false
488 | },
489 | "id": "9a3f28be3b9bb769"
490 | }
491 | ],
492 | "metadata": {
493 | "kernelspec": {
494 | "display_name": "conda_python3",
495 | "language": "python",
496 | "name": "conda_python3"
497 | },
498 | "language_info": {
499 | "codemirror_mode": {
500 | "name": "ipython",
501 | "version": 3
502 | },
503 | "file_extension": ".py",
504 | "mimetype": "text/x-python",
505 | "name": "python",
506 | "nbconvert_exporter": "python",
507 | "pygments_lexer": "ipython3",
508 | "version": "3.10.14"
509 | }
510 | },
511 | "nbformat": 4,
512 | "nbformat_minor": 5
513 | }
514 |
--------------------------------------------------------------------------------
/Local_Compare_Finetune_Mistral.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "2c6d91ea-8c0c-4157-a5fc-c16a4491f896",
6 | "metadata": {},
7 | "source": [
8 | "## Setup development environment"
9 | ]
10 | },
11 | {
12 | "cell_type": "markdown",
13 | "id": "db8eff5e",
14 | "metadata": {},
15 | "source": [
16 | "This notebook has been tested on Amazon SageMaker Notebook Instances with single GPU on ml.g5.2xlarge"
17 | ]
18 | },
19 | {
20 | "cell_type": "code",
21 | "execution_count": 1,
22 | "id": "756f5440-6f6a-4329-8f03-b545b1753f9e",
23 | "metadata": {
24 | "tags": []
25 | },
26 | "outputs": [],
27 | "source": [
28 | "!pip install transformers==4.38.1 datasets==2.17.1 peft==0.8.2 bitsandbytes==0.42.0 trl==0.7.11 --upgrade --quiet"
29 | ]
30 | },
31 | {
32 | "cell_type": "markdown",
33 | "id": "a2891d64-4920-465e-9781-c86dcdd31743",
34 | "metadata": {
35 | "tags": []
36 | },
37 | "source": [
38 | "## Load and prepare the dataset\n",
39 | "\n",
40 | "\n",
41 | "### Choose a dataset\n",
42 | "\n",
43 | "For the purpose of this tutorial, we will use dolly, an open-source dataset containing 15k instruction pairs.\n",
44 | "\n",
45 | "Example record from dolly:\n",
46 | "```\n",
47 | "{\n",
48 | " \"instruction\": \"Who was the first woman to have four country albums reach No. 1 on the Billboard 200?\",\n",
49 | " \"context\": \"\",\n",
50 | " \"response\": \"Carrie Underwood.\"\n",
51 | "}\n",
52 | "```\n"
53 | ]
54 | },
55 | {
56 | "cell_type": "code",
57 | "execution_count": 2,
58 | "id": "c248c8fa-1026-4e5b-9182-9ee37e64e993",
59 | "metadata": {
60 | "tags": []
61 | },
62 | "outputs": [
63 | {
64 | "name": "stdout",
65 | "output_type": "stream",
66 | "text": [
67 | "dataset size: 15011\n",
68 | "{'instruction': 'Where is the Lighthouse Point, Bahamas', 'context': 'Lighthouse Point, Bahamas, or simply Lighthouse Point, is a private peninsula in The Bahamas which serves as an exclusive port for the Disney Cruise Line ships. It is located in the south-eastern region of Bannerman Town, Eleuthera. In March 2019, The Walt Disney Company purchased the peninsula from the Bahamian government, giving the company control over the area.', 'response': 'The Lighthouse Point, Bahamas, or simply Lighthouse Point, is a private peninsula in the Bahamas which serves as an exclusive port for the Disney Cruise Line ships. It is located in the south-eastern region of Bannerman Town, Eleuthera.', 'category': 'summarization'}\n"
69 | ]
70 | }
71 | ],
72 | "source": [
73 | "from datasets import load_dataset\n",
74 | "from random import randrange\n",
75 | "\n",
76 | "# Load dataset from the hub\n",
77 | "dataset = load_dataset(\"databricks/databricks-dolly-15k\", split=\"train\")\n",
78 | "\n",
79 | "print(f\"dataset size: {len(dataset)}\")\n",
80 | "print(dataset[randrange(len(dataset))])"
81 | ]
82 | },
83 | {
84 | "cell_type": "markdown",
85 | "id": "8cd4df77-add2-461b-bf92-cb8d6a4c9afe",
86 | "metadata": {},
87 | "source": [
88 | "### Understand the Mistral format\n",
89 | "\n",
90 | "The mistralai/Mixtral-8x7B-Instruct-v0.1 is a conversational chat model meaning we can chat with it using the following prompt:\n",
91 | "\n",
92 | "\n",
93 | "```\n",
94 | " [INST] User Instruction 1 [/INST] Model answer 1 [INST] User instruction 2 [/INST]\n",
95 | "```\n"
96 | ]
97 | },
98 | {
99 | "cell_type": "markdown",
100 | "id": "ebc630c2-3682-4d8d-a321-14f64374a0d1",
101 | "metadata": {},
102 | "source": [
103 | "For instruction fine-tuning, it is quite common to have two columns inside the dataset: one for the prompt & the other for the response."
104 | ]
105 | },
106 | {
107 | "cell_type": "code",
108 | "execution_count": 3,
109 | "id": "e62a310d-d98a-460d-b00a-78e7f4f60882",
110 | "metadata": {
111 | "tags": []
112 | },
113 | "outputs": [],
114 | "source": [
115 | "from random import randint\n",
116 | "\n",
117 | "# Define the create_prompt function\n",
118 | "def create_prompt(sample):\n",
119 | " bos_token = \"\"\n",
120 | " eos_token = \"\"\n",
121 | " \n",
122 | " instruction = sample['instruction']\n",
123 | " context = sample['context']\n",
124 | " response = sample['response']\n",
125 | "\n",
126 | " text_row = f\"\"\"[INST] Below is the question based on the context. Question: {instruction}. Below is the given the context {context}. Write a response that appropriately completes the request.[/INST]\"\"\"\n",
127 | " answer_row = response\n",
128 | "\n",
129 | " sample[\"prompt\"] = bos_token + text_row\n",
130 | " sample[\"completion\"] = answer_row + eos_token\n",
131 | "\n",
132 | " return sample"
133 | ]
134 | },
135 | {
136 | "cell_type": "markdown",
137 | "id": "c63f0625-6acf-4ed1-8f22-2a73b6e23163",
138 | "metadata": {},
139 | "source": [
140 | "### Mistral finetuned model inference "
141 | ]
142 | },
143 | {
144 | "cell_type": "code",
145 | "execution_count": 3,
146 | "id": "bff5f74d-64f6-4de1-9540-b0c4c7b81685",
147 | "metadata": {
148 | "tags": []
149 | },
150 | "outputs": [],
151 | "source": [
152 | "new_model_path = \"./Mistral-Finetuned-Merged\" #set the name of the new model"
153 | ]
154 | },
155 | {
156 | "cell_type": "code",
157 | "execution_count": 4,
158 | "id": "f6549cad-a99d-4ad9-aff7-71298e4fa2f4",
159 | "metadata": {
160 | "tags": []
161 | },
162 | "outputs": [
163 | {
164 | "data": {
165 | "application/vnd.jupyter.widget-view+json": {
166 | "model_id": "a9a0e47c2c4245f8b0b3c16809b3ed0b",
167 | "version_major": 2,
168 | "version_minor": 0
169 | },
170 | "text/plain": [
171 | "Loading checkpoint shards: 0%| | 0/8 [00:00, ?it/s]"
172 | ]
173 | },
174 | "metadata": {},
175 | "output_type": "display_data"
176 | }
177 | ],
178 | "source": [
179 | "import json\n",
180 | "import pandas as pd\n",
181 | "import torch\n",
182 | "from datasets import Dataset, load_dataset\n",
183 | "from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig\n",
184 | "\n",
185 | "new_model = AutoModelForCausalLM.from_pretrained(\n",
186 | " new_model_path,\n",
187 | " torch_dtype=torch.float16,\n",
188 | " device_map=\"auto\"\n",
189 | ")\n",
190 | "\n",
191 | "\n",
192 | "new_model.config.use_cache = False\n",
193 | "new_model.config.pretraining_tp = 1"
194 | ]
195 | },
196 | {
197 | "cell_type": "code",
198 | "execution_count": 5,
199 | "id": "3a6f37ff-4a6e-46f1-bf54-1cc7645a2e35",
200 | "metadata": {
201 | "tags": []
202 | },
203 | "outputs": [],
204 | "source": [
205 | "# Load MitsralAi tokenizer\n",
206 | "tokenizer = AutoTokenizer.from_pretrained(new_model_path)"
207 | ]
208 | },
209 | {
210 | "cell_type": "code",
211 | "execution_count": 6,
212 | "id": "d0ca49ec-6f35-40e8-95ad-dc6267235e19",
213 | "metadata": {
214 | "tags": []
215 | },
216 | "outputs": [
217 | {
218 | "name": "stdout",
219 | "output_type": "stream",
220 | "text": [
221 | "[INST] Below is the question based on the context. Question: Given a reference text about Lollapalooza, where does it take place, who started it and what is it?. Below is the given the context Lollapalooza /ˌlɒləpəˈluːzə/ (Lolla) is an annual American four-day music festival held in Grant Park in Chicago. It originally started as a touring event in 1991, but several years later, Chicago became its permanent location. Music genres include but are not limited to alternative rock, heavy metal, punk rock, hip hop, and electronic dance music. Lollapalooza has also featured visual arts, nonprofit organizations, and political organizations. The festival, held in Grant Park, hosts an estimated 400,000 people each July and sells out annually. Lollapalooza is one of the largest and most iconic music festivals in the world and one of the longest-running in the United States.\n",
222 | "\n",
223 | "Lollapalooza was conceived and created in 1991 as a farewell tour by Perry Farrell, singer of the group Jane's Addiction.. Write a response that appropriately completes the request.[/INST]\n",
224 | "Dataset ground truth:\n",
225 | "Lollapalooze is an annual musical festival held in Grant Park in Chicago, Illinois. It was started in 1991 as a farewell tour by Perry Farrell, singe of the group Jane's Addiction. The festival includes an array of musical genres including alternative rock, heavy metal, punk rock, hip hop, and electronic dance music. The festivals welcomes an estimated 400,000 people each year and sells out annually. Some notable headliners include: the Red Hot Chili Peppers, Chance the Rapper, Metallica, and Lady Gage. Lollapalooza is one of the largest and most iconic festivals in the world and a staple of Chicago.\n"
226 | ]
227 | }
228 | ],
229 | "source": [
230 | "#benchmark_test = create_prompt(dataset[randrange(len(dataset))])\n",
231 | "benchmark_test = create_prompt(dataset[6])\n",
232 | "eval_prompt = benchmark_test[\"prompt\"]\n",
233 | "eval_completion = benchmark_test[\"completion\"]\n",
234 | "\n",
235 | "print(eval_prompt)\n",
236 | "print(\"Dataset ground truth:\")\n",
237 | "print(eval_completion)"
238 | ]
239 | },
240 | {
241 | "cell_type": "code",
242 | "execution_count": 7,
243 | "id": "4811c847-c2b4-4ad9-9c60-feb78f4b7af6",
244 | "metadata": {
245 | "tags": []
246 | },
247 | "outputs": [
248 | {
249 | "name": "stdout",
250 | "output_type": "stream",
251 | "text": [
252 | " [INST] Below is the question based on the context. Question: Given a reference text about Lollapalooza, where does it take place, who started it and what is it?. Below is the given the context Lollapalooza /ˌlɒləpəˈluːzə/ (Lolla) is an annual American four-day music festival held in Grant Park in Chicago. It originally started as a touring event in 1991, but several years later, Chicago became its permanent location. Music genres include but are not limited to alternative rock, heavy metal, punk rock, hip hop, and electronic dance music. Lollapalooza has also featured visual arts, nonprofit organizations, and political organizations. The festival, held in Grant Park, hosts an estimated 400,000 people each July and sells out annually. Lollapalooza is one of the largest and most iconic music festivals in the world and one of the longest-running in the United States.\n",
253 | "\n",
254 | "Lollapalooza was conceived and created in 1991 as a farewell tour by Perry Farrell, singer of the group Jane's Addiction.. Write a response that appropriately completes the request.[/INST] Lollapalooze is an annual musical festival held in Grant Park in Chicago, Illinois. It was started in 1991 as a farewell tour by Perry Farrell, singe of the group Jane's Addiction. The festival includes an array of musical genres including alternative rock, heavy metal, punk rock, hip hop, and electronic dance music. The festivals welcomes an estimated 400,000 people each year and sells out annually. Some notable headliners include: the Red Hot Chili Peppers, Chance the Rapper, Metallica, and Lady Gage. Lollapalooza is one of the largest and most iconic festivals in the world and a staple of Chicago.\n"
255 | ]
256 | }
257 | ],
258 | "source": [
259 | "model_input = tokenizer(eval_prompt, return_tensors=\"pt\").to(\"cuda\")\n",
260 | "\n",
261 | "new_model.eval()\n",
262 | "with torch.no_grad():\n",
263 | " print(tokenizer.decode(new_model.generate(**model_input, max_new_tokens=256, pad_token_id=2)[0], skip_special_tokens=False))"
264 | ]
265 | },
266 | {
267 | "cell_type": "markdown",
268 | "id": "574a4fe2",
269 | "metadata": {},
270 | "source": [
271 | "#### You might notice that the output is identical to dataset ground truth\n",
272 | "\n",
273 | "This is expected behavious as we fine tuned the model on small samples."
274 | ]
275 | },
276 | {
277 | "cell_type": "markdown",
278 | "id": "c687e074-3ca7-4ef9-b2f3-ea80a73bf03a",
279 | "metadata": {},
280 | "source": [
281 | "### Mistral original model inference "
282 | ]
283 | },
284 | {
285 | "cell_type": "code",
286 | "execution_count": 4,
287 | "id": "31f3b9c0-4b26-4984-b166-75da00a557a5",
288 | "metadata": {
289 | "tags": []
290 | },
291 | "outputs": [],
292 | "source": [
293 | "model_id = \"mistralai/Mistral-7B-Instruct-v0.1\""
294 | ]
295 | },
296 | {
297 | "cell_type": "code",
298 | "execution_count": 5,
299 | "id": "20be602f-8400-4290-9103-568e0776dd5b",
300 | "metadata": {
301 | "tags": []
302 | },
303 | "outputs": [
304 | {
305 | "data": {
306 | "application/vnd.jupyter.widget-view+json": {
307 | "model_id": "4f30830afc824ad3a34140e88da4f03f",
308 | "version_major": 2,
309 | "version_minor": 0
310 | },
311 | "text/plain": [
312 | "Loading checkpoint shards: 0%| | 0/2 [00:00, ?it/s]"
313 | ]
314 | },
315 | "metadata": {},
316 | "output_type": "display_data"
317 | }
318 | ],
319 | "source": [
320 | "import json\n",
321 | "import pandas as pd\n",
322 | "import torch\n",
323 | "from datasets import Dataset, load_dataset\n",
324 | "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
325 | "\n",
326 | "base_model = AutoModelForCausalLM.from_pretrained(\n",
327 | " model_id,\n",
328 | " torch_dtype=torch.float16,\n",
329 | " device_map=\"auto\"\n",
330 | ")\n",
331 | "\n",
332 | "# Load MitsralAi tokenizer\n",
333 | "tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)"
334 | ]
335 | },
342 | {
343 | "cell_type": "code",
344 | "execution_count": 6,
345 | "id": "a01679b4-fc57-458a-ba4b-0749da3b11d3",
346 | "metadata": {
347 | "tags": []
348 | },
349 | "outputs": [
350 | {
351 | "name": "stdout",
352 | "output_type": "stream",
353 | "text": [
354 | "[INST] Below is the question based on the context. Question: Given a reference text about Lollapalooza, where does it take place, who started it and what is it?. Below is the given the context Lollapalooza /ˌlɒləpəˈluːzə/ (Lolla) is an annual American four-day music festival held in Grant Park in Chicago. It originally started as a touring event in 1991, but several years later, Chicago became its permanent location. Music genres include but are not limited to alternative rock, heavy metal, punk rock, hip hop, and electronic dance music. Lollapalooza has also featured visual arts, nonprofit organizations, and political organizations. The festival, held in Grant Park, hosts an estimated 400,000 people each July and sells out annually. Lollapalooza is one of the largest and most iconic music festivals in the world and one of the longest-running in the United States.\n",
355 | "\n",
356 | "Lollapalooza was conceived and created in 1991 as a farewell tour by Perry Farrell, singer of the group Jane's Addiction.. Write a response that appropriately completes the request.[/INST]\n",
357 | "Dataset ground truth:\n",
358 | "Lollapalooze is an annual musical festival held in Grant Park in Chicago, Illinois. It was started in 1991 as a farewell tour by Perry Farrell, singe of the group Jane's Addiction. The festival includes an array of musical genres including alternative rock, heavy metal, punk rock, hip hop, and electronic dance music. The festivals welcomes an estimated 400,000 people each year and sells out annually. Some notable headliners include: the Red Hot Chili Peppers, Chance the Rapper, Metallica, and Lady Gage. Lollapalooza is one of the largest and most iconic festivals in the world and a staple of Chicago.\n"
359 | ]
360 | }
361 | ],
362 | "source": [
363 | "#benchmark_test = create_prompt(dataset[randrange(len(dataset))])\n",
364 | "benchmark_test = create_prompt(dataset[6])\n",
365 | "eval_prompt = benchmark_test[\"prompt\"]\n",
366 | "eval_completion = benchmark_test[\"completion\"]\n",
367 | "\n",
368 | "print(eval_prompt)\n",
369 | "print(\"Dataset ground truth:\")\n",
370 | "print(eval_completion)"
371 | ]
372 | },
373 | {
374 | "cell_type": "markdown",
375 | "id": "72e707a1-88f5-43a3-8be6-a035ae568b39",
376 | "metadata": {
377 | "tags": []
378 | },
379 | "source": [
380 | "#### You might notice that the output is semantically correct\n",
381 | "\n",
382 | "This is expected behavious thanks the the zero shot capabilities of original Mistral model "
383 | ]
384 | },
385 | {
386 | "cell_type": "code",
387 | "execution_count": 7,
388 | "id": "5512f9ed-8b99-4b9b-bec7-fe3ebb6260db",
389 | "metadata": {
390 | "tags": []
391 | },
392 | "outputs": [
393 | {
394 | "name": "stdout",
395 | "output_type": "stream",
396 | "text": [
397 | " [INST] Below is the question based on the context. Question: Given a reference text about Lollapalooza, where does it take place, who started it and what is it?. Below is the given the context Lollapalooza /ˌlɒləpəˈluːzə/ (Lolla) is an annual American four-day music festival held in Grant Park in Chicago. It originally started as a touring event in 1991, but several years later, Chicago became its permanent location. Music genres include but are not limited to alternative rock, heavy metal, punk rock, hip hop, and electronic dance music. Lollapalooza has also featured visual arts, nonprofit organizations, and political organizations. The festival, held in Grant Park, hosts an estimated 400,000 people each July and sells out annually. Lollapalooza is one of the largest and most iconic music festivals in the world and one of the longest-running in the United States.\n",
398 | "\n",
399 | "Lollapalooza was conceived and created in 1991 as a farewell tour by Perry Farrell, singer of the group Jane's Addiction.. Write a response that appropriately completes the request.[/INST] Lollapalooza takes place in Grant Park in Chicago. It was started by Perry Farrell, the singer of Jane's Addiction, as a farewell tour in 1991. The festival features a variety of music genres including alternative rock, heavy metal, punk rock, hip hop, and electronic dance music. Lollapalooza has also included visual arts, nonprofit organizations, and political organizations in its lineup. The festival attracts an estimated 400,000 people each July and sells out annually. Lollapalooza is considered one of the largest and most iconic music festivals in the world and one of the longest-running in the United States.\n"
400 | ]
401 | }
402 | ],
403 | "source": [
404 | "model_input = tokenizer(eval_prompt, return_tensors=\"pt\").to(\"cuda\")\n",
405 | "\n",
406 | "\n",
407 | "base_model.eval()\n",
408 | "with torch.no_grad():\n",
409 | " print(tokenizer.decode(base_model.generate(**model_input, max_new_tokens=256, pad_token_id=2)[0], skip_special_tokens=False))"
410 | ]
411 | },
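412 | {
413 | "cell_type": "markdown",
414 | "id": "3b2a1c0d-overlap-note",
415 | "metadata": {},
416 | "source": [
417 | "As an optional sketch (not part of the original notebook), a crude way to quantify how far the two generations diverge is plain token overlap. The `finetuned_text` and `base_text` names below are hypothetical -- assign them the decoded strings from the two `generate()` cells above first."
418 | ]
419 | },
420 | {
421 | "cell_type": "code",
422 | "execution_count": null,
423 | "id": "4c3b2a1e-overlap-demo",
424 | "metadata": {},
425 | "outputs": [],
426 | "source": [
427 | "# Rough comparison sketch: Jaccard overlap of the two decoded outputs.\n",
428 | "# `finetuned_text` / `base_text` are hypothetical -- assign them from the\n",
429 | "# decoded generate() outputs above before uncommenting the print below.\n",
430 | "def jaccard(a: str, b: str) -> float:\n",
431 | "    sa, sb = set(a.lower().split()), set(b.lower().split())\n",
432 | "    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0\n",
433 | "\n",
434 | "# print(f\"token overlap: {jaccard(finetuned_text, base_text):.2f}\")"
435 | ]
436 | },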
412 | {
413 | "cell_type": "code",
414 | "execution_count": null,
415 | "id": "81eab384-6c1c-4dcd-8709-09abd18d7788",
416 | "metadata": {
417 | "tags": []
418 | },
419 | "outputs": [],
420 | "source": []
421 | }
422 | ],
423 | "metadata": {
424 | "kernelspec": {
425 | "display_name": "conda_pytorch_p310",
426 | "language": "python",
427 | "name": "conda_pytorch_p310"
428 | },
429 | "language_info": {
430 | "codemirror_mode": {
431 | "name": "ipython",
432 | "version": 3
433 | },
434 | "file_extension": ".py",
435 | "mimetype": "text/x-python",
436 | "name": "python",
437 | "nbconvert_exporter": "python",
438 | "pygments_lexer": "ipython3",
439 | "version": "3.10.13"
440 | }
441 | },
442 | "nbformat": 4,
443 | "nbformat_minor": 5
444 | }
445 |
--------------------------------------------------------------------------------
/Local_Finetune_Mistral.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "603f53ab-ed44-4950-aac6-ba20b77fa16a",
6 | "metadata": {},
7 | "source": [
8 | "# Local Finetune Mistral 7B\n",
9 | "\n",
10 | "Mistral 7B is the open LLM from Mistral AI.\n",
11 | "\n",
12 | "This sample is modified from this tutorial \n",
13 | "https://adithyask.medium.com/a-beginners-guide-to-fine-tuning-mistral-7b-instruct-model-0f39647b20fe\n",
14 | "\n",
15 | "What this tutorial will, step-by-step, cover:\n",
16 | "\n",
17 | "- Setup Development Environment\n",
18 | "- Load and prepare the dataset\n",
19 | "- Fine-Tune Mistral with QLoRA\n"
20 | ]
21 | },
22 | {
23 | "cell_type": "markdown",
24 | "id": "8f81be81",
25 | "metadata": {},
26 | "source": [
27 | "This notebook has been tested on Amazon SageMaker Notebook Instances with single GPU on ml.g5.2xlarge"
28 | ]
29 | },
30 | {
31 | "cell_type": "markdown",
32 | "id": "2c6d91ea-8c0c-4157-a5fc-c16a4491f896",
33 | "metadata": {},
34 | "source": [
35 | "## Setup development environment"
36 | ]
37 | },
38 | {
39 | "cell_type": "code",
40 | "execution_count": 1,
41 | "id": "4150b595-3807-48d9-b60e-75c69638e4c0",
42 | "metadata": {
43 | "tags": []
44 | },
45 | "outputs": [],
46 | "source": [
47 | "!pip install transformers==4.38.1 datasets==2.17.1 peft==0.8.2 bitsandbytes==0.42.0 trl==0.7.11 --upgrade --quiet"
48 | ]
49 | },
50 | {
51 | "cell_type": "markdown",
52 | "id": "a2891d64-4920-465e-9781-c86dcdd31743",
53 | "metadata": {
54 | "tags": []
55 | },
56 | "source": [
57 | "## Load and prepare the dataset\n",
58 | "\n",
59 | "\n",
60 | "### Choose a dataset\n",
61 | "\n",
62 | "For the purpose of this tutorial, we will use dolly, an open-source dataset containing 15k instruction pairs.\n",
63 | "\n",
64 | "Example record from dolly:\n",
65 | "```\n",
66 | "{\n",
67 | " \"instruction\": \"Who was the first woman to have four country albums reach No. 1 on the Billboard 200?\",\n",
68 | " \"context\": \"\",\n",
69 | " \"response\": \"Carrie Underwood.\"\n",
70 | "}\n",
71 | "```\n"
72 | ]
73 | },
74 | {
75 | "cell_type": "code",
76 | "execution_count": 2,
77 | "id": "c248c8fa-1026-4e5b-9182-9ee37e64e993",
78 | "metadata": {
79 | "tags": []
80 | },
81 | "outputs": [
82 | {
83 | "name": "stdout",
84 | "output_type": "stream",
85 | "text": [
86 | "dataset size: 20\n",
87 | "{'instruction': 'Why mobile is bad for human', 'context': '', 'response': 'We are always engaged one phone which is not good.', 'category': 'brainstorming'}\n"
88 | ]
89 | }
90 | ],
91 | "source": [
92 | "from datasets import load_dataset\n",
93 | "from random import randrange\n",
94 | "\n",
95 | "# Load dataset from the hub\n",
96 | "#dataset = load_dataset(\"databricks/databricks-dolly-15k\", split=\"train\")\n",
97 | "\n",
98 | "#For local testing the fine tuning code, we limit the dataset to 20 samples \n",
99 | "dataset = load_dataset(\"databricks/databricks-dolly-15k\", split=\"train\").select(range(20))\n",
100 | "\n",
101 | "print(f\"dataset size: {len(dataset)}\")\n",
102 | "print(dataset[randrange(len(dataset))])"
103 | ]
104 | },
105 | {
106 | "cell_type": "markdown",
107 | "id": "8cd4df77-add2-461b-bf92-cb8d6a4c9afe",
108 | "metadata": {},
109 | "source": [
110 | "### Understand the Mistral format and prepare the prompt input\n",
111 | "\n",
112 | "The mistralai/Mistral-7B-Instruct-v0.1 is a conversational chat model meaning we can chat with it using the following prompt:\n",
113 | "\n",
114 | "\n",
115 | "```\n",
116 | " [INST] User Instruction 1 [/INST] Model answer 1 [INST] User instruction 2 [/INST]\n",
117 | "```\n"
118 | ]
119 | },
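120 | {
121 | "cell_type": "markdown",
122 | "id": "9a8b7c6d-format-demo-note",
123 | "metadata": {},
124 | "source": [
125 | "Below is a minimal sketch (not part of the original tutorial) of how this multi-turn layout can be assembled in code. The `<s>`/`</s>` placement follows the published Mistral chat template and is an assumption here; the tokenizer normally inserts these special tokens itself."
126 | ]
127 | },
128 | {
129 | "cell_type": "code",
130 | "execution_count": null,
131 | "id": "0b9c8d7e-format-demo-code",
132 | "metadata": {},
133 | "outputs": [],
134 | "source": [
135 | "# Illustration only: assemble the multi-turn Mistral prompt shown above.\n",
136 | "# <s>/</s> mark the conversation start and answer ends, per the Mistral chat template.\n",
137 | "def build_mistral_prompt(turns, next_instruction):\n",
138 | "    prompt = \"<s>\"\n",
139 | "    for instruction, answer in turns:\n",
140 | "        prompt += f\"[INST] {instruction} [/INST] {answer}</s>\"\n",
141 | "    prompt += f\"[INST] {next_instruction} [/INST]\"\n",
142 | "    return prompt\n",
143 | "\n",
144 | "print(build_mistral_prompt([(\"User Instruction 1\", \"Model answer 1\")], \"User instruction 2\"))"
145 | ]
146 | },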
120 | {
121 | "cell_type": "markdown",
122 | "id": "ebc630c2-3682-4d8d-a321-14f64374a0d1",
123 | "metadata": {},
124 | "source": [
125 | "For instruction fine-tuning, it is quite common to have two columns inside the dataset: one for the prompt & the other for the response."
126 | ]
127 | },
128 | {
129 | "cell_type": "code",
130 | "execution_count": 3,
131 | "id": "e62a310d-d98a-460d-b00a-78e7f4f60882",
132 | "metadata": {
133 | "tags": []
134 | },
135 | "outputs": [],
136 | "source": [
137 | "from random import randint\n",
138 | "\n",
139 | "# Define the create_prompt function\n",
140 | "def create_prompt(sample):\n",
141 | " bos_token = \"\"\n",
142 | " eos_token = \"\"\n",
143 | " \n",
144 | " instruction = sample['instruction']\n",
145 | " context = sample['context']\n",
146 | " response = sample['response']\n",
147 | "\n",
148 | " text_row = f\"\"\"[INST] Below is the question based on the context. Question: {instruction}. Below is the given the context {context}. Write a response that appropriately completes the request.[/INST]\"\"\"\n",
149 | " answer_row = response\n",
150 | "\n",
151 | " sample[\"prompt\"] = bos_token + text_row\n",
152 | " sample[\"completion\"] = answer_row + eos_token\n",
153 | "\n",
154 | " return sample"
155 | ]
156 | },
157 | {
158 | "cell_type": "markdown",
159 | "id": "5e322854-8f7c-41b4-9f69-627cc35058a2",
160 | "metadata": {},
161 | "source": [
162 | "lets test our formatting function on a random example."
163 | ]
164 | },
165 | {
166 | "cell_type": "code",
167 | "execution_count": 4,
168 | "id": "6c6e67f6-7705-4a7f-936c-4dcb9a6f921a",
169 | "metadata": {
170 | "tags": []
171 | },
172 | "outputs": [
173 | {
174 | "data": {
175 | "application/vnd.jupyter.widget-view+json": {
176 | "model_id": "4f5583c0cc4343f4bd0628ef603d2797",
177 | "version_major": 2,
178 | "version_minor": 0
179 | },
180 | "text/plain": [
181 | "Map: 0%| | 0/20 [00:00, ? examples/s]"
182 | ]
183 | },
184 | "metadata": {},
185 | "output_type": "display_data"
186 | },
187 | {
188 | "name": "stdout",
189 | "output_type": "stream",
190 | "text": [
191 | "{'prompt': \"[INST] Below is the question based on the context. Question: Who was John Moses Browning?. Below is the given the context John Moses Browning (January 23, 1855 – November 26, 1926) was an American firearm designer who developed many varieties of military and civilian firearms, cartridges, and gun mechanisms – many of which are still in use around the world. He made his first firearm at age 13 in his father's gun shop and was awarded the first of his 128 firearm patents on October 7, 1879, at the age of 24. He is regarded as one of the most successful firearms designers of the 19th and 20th centuries and pioneered the development of modern repeating, semi-automatic, and automatic firearms.\\n\\nBrowning influenced nearly all categories of firearms design, especially the autoloading of ammunition. He invented, or made significant improvements to, single-shot, lever-action, and pump-action rifles and shotguns. He developed the first reliable and compact autoloading pistols by inventing the telescoping bolt, then integrating the bolt and barrel shroud into what is known as the pistol slide. Browning's telescoping bolt design is now found on nearly every modern semi-automatic pistol, as well as several modern fully automatic weapons. He also developed the first gas-operated firearm, the Colt–Browning Model 1895 machine gun – a system that surpassed mechanical recoil operation to become the standard for most high-power self-loading firearm designs worldwide. He also made significant contributions to automatic cannon development.\\n\\nBrowning's most successful designs include the M1911 pistol, the water-cooled M1917, the air-cooled M1919, and heavy M2 machine guns, the M1918 Browning Automatic Rifle, and the Browning Auto-5 – the first semi-automatic shotgun. Some of these arms are still manufactured, often with only minor changes in detail and cosmetics to those assembled by Browning or his licensees. The Browning-designed M1911 and Hi-Power are some of the most copied firearms in the world.. Write a response that appropriately completes the request.[/INST]\", 'completion': \"John Moses Browning is one of the most well-known designer of modern firearms. He started building firearms in his father's shop at the age of 13, and was awarded his first patent when he was 24.\\n\\nHe designed the first reliable automatic pistol, and the first gas-operated firearm, as well inventing or improving single-shot, lever-action, and pump-action rifles and shotguns.\\n\\nToday, he is most well-known for the M1911 pistol, the Browning Automatic Rifle, and the Auto-5 shotgun, all of which are in still in current production in either their original design, or with minor changes. His M1911 and Hi-Power pistols designs are some of the most reproduced firearms in the world today.\"}\n"
192 | ]
193 | }
194 | ],
195 | "source": [
196 | "dataset_instruct_format = dataset.map(create_prompt, remove_columns=['instruction','context','response','category'])\n",
197 | "# print random sample\n",
198 | "print(dataset_instruct_format[randint(0, len(dataset_instruct_format))])"
199 | ]
200 | },
201 | {
202 | "cell_type": "code",
203 | "execution_count": null,
204 | "id": "6ae72dc8-48ee-4dd2-b067-a973baa48a86",
205 | "metadata": {
206 | "tags": []
207 | },
208 | "outputs": [],
209 | "source": []
210 | },
211 | {
212 | "cell_type": "markdown",
213 | "id": "d9d1aa8e-84ea-4555-8fb0-5ce45850a48e",
214 | "metadata": {},
215 | "source": [
216 | "### Prepare the configuration for training the LLM\n",
217 | "\n",
218 | "Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA\n",
219 | "https://huggingface.co/blog/4bit-transformers-bitsandbytes\n"
220 | ]
221 | },
222 | {
223 | "cell_type": "code",
224 | "execution_count": 5,
225 | "id": "31f3b9c0-4b26-4984-b166-75da00a557a5",
226 | "metadata": {
227 | "tags": []
228 | },
229 | "outputs": [],
230 | "source": [
231 | "model_id = \"mistralai/Mistral-7B-Instruct-v0.1\"\n",
232 | "new_model = \"Mistral-qlora-7B-Instruct-v0.1\" #set the name of the new model"
233 | ]
234 | },
243 | {
244 | "cell_type": "code",
245 | "execution_count": 6,
246 | "id": "c68c3166-10d3-48c4-b6f2-caa1b3d4c69d",
247 | "metadata": {
248 | "tags": []
249 | },
250 | "outputs": [],
251 | "source": [
252 | "\n",
253 | "################################################################################\n",
254 | "# QLoRA parameters\n",
255 | "################################################################################\n",
256 | "\n",
257 | "# LoRA attention dimension\n",
258 | "lora_r = 64\n",
259 | "\n",
260 | "# Alpha parameter for LoRA scaling\n",
261 | "lora_alpha = 16\n",
262 | "\n",
263 | "# Dropout probability for LoRA layers\n",
264 | "lora_dropout = 0.1\n",
265 | "\n",
266 | "################################################################################\n",
267 | "# bitsandbytes parameters\n",
268 | "################################################################################\n",
269 | "\n",
270 | "# Activate 4-bit precision base model loading\n",
271 | "use_4bit = True\n",
272 | "\n",
273 | "# Compute dtype for 4-bit base models\n",
274 | "bnb_4bit_compute_dtype = \"bfloat16\"\n",
275 | "\n",
276 | "# Quantization type (fp4 or nf4)\n",
277 | "bnb_4bit_quant_type = \"nf4\"\n",
278 | "\n",
279 | "# Activate nested quantization for 4-bit base models (double quantization)\n",
280 | "use_nested_quant = True\n",
281 | "\n",
282 | "\n",
283 | "################################################################################\n",
284 | "# TrainingArguments parameters\n",
285 | "################################################################################\n",
286 | "\n",
287 | "# Output directory where the model predictions and checkpoints will be stored\n",
288 | "output_dir = \"./results\"\n",
289 | "\n",
290 | "# Number of training epochs\n",
291 | "num_train_epochs = 1\n",
292 | "\n",
293 | "# Enable fp16/bf16 training (set bf16 to True with an A100)\n",
294 | "fp16 = False\n",
295 | "bf16 = False\n",
296 | "\n",
297 | "# Batch size per GPU for training\n",
298 | "per_device_train_batch_size = 4\n",
299 | "\n",
300 | "# Batch size per GPU for evaluation\n",
301 | "per_device_eval_batch_size = 4\n",
302 | "\n",
303 | "# Number of update steps to accumulate the gradients for\n",
304 | "gradient_accumulation_steps = 1\n",
305 | "\n",
306 | "# Enable gradient checkpointing\n",
307 | "gradient_checkpointing = True\n",
308 | "\n",
309 | "# Maximum gradient normal (gradient clipping)\n",
310 | "max_grad_norm = 0.3\n",
311 | "\n",
312 | "# Initial learning rate (AdamW optimizer)\n",
313 | "learning_rate = 2e-4\n",
314 | "\n",
315 | "# Weight decay to apply to all layers except bias/LayerNorm weights\n",
316 | "weight_decay = 0.001\n",
317 | "\n",
318 | "# Optimizer to use\n",
319 | "optim = \"paged_adamw_32bit\"\n",
320 | "\n",
321 | "# Learning rate schedule (constant a bit better than cosine)\n",
322 | "lr_scheduler_type = \"constant\"\n",
323 | "\n",
324 | "# Number of training steps (overrides num_train_epochs)\n",
325 | "max_steps = -1\n",
326 | "\n",
327 | "# Ratio of steps for a linear warmup (from 0 to learning rate)\n",
328 | "warmup_ratio = 0.03\n",
329 | "\n",
330 | "# Group sequences into batches with same length\n",
331 | "# Saves memory and speeds up training considerably\n",
332 | "group_by_length = True\n",
333 | "\n",
334 | "# Save checkpoint every X updates steps\n",
335 | "save_steps = 25\n",
336 | "\n",
337 | "# Log every X updates steps\n",
338 | "logging_steps = 25\n",
339 | "\n",
340 | "################################################################################\n",
341 | "# SFT parameters\n",
342 | "################################################################################\n",
343 | "\n",
344 | "# Maximum sequence length to use\n",
345 | "max_seq_length = 1024\n",
346 | "\n",
347 | "# Pack multiple short examples in the same input sequence to increase efficiency\n",
348 | "packing = False\n",
349 | "\n",
350 | "# Load the entire model on the GPU 0\n",
351 | "device_map = {\"\": 0}\n",
352 | "#device_map = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n"
353 | ]
354 | },
355 | {
356 | "cell_type": "code",
357 | "execution_count": 7,
358 | "id": "20be602f-8400-4290-9103-568e0776dd5b",
359 | "metadata": {
360 | "tags": []
361 | },
362 | "outputs": [
363 | {
364 | "data": {
365 | "application/vnd.jupyter.widget-view+json": {
366 | "model_id": "496bc29f1dbc4329965782365e1bcebb",
367 | "version_major": 2,
368 | "version_minor": 0
369 | },
370 | "text/plain": [
371 | "config.json: 0%| | 0.00/571 [00:00, ?B/s]"
372 | ]
373 | },
374 | "metadata": {},
375 | "output_type": "display_data"
376 | },
377 | {
378 | "data": {
379 | "application/vnd.jupyter.widget-view+json": {
380 | "model_id": "4014baf66b4149749abdd5f72a34bfba",
381 | "version_major": 2,
382 | "version_minor": 0
383 | },
384 | "text/plain": [
385 | "model.safetensors.index.json: 0%| | 0.00/25.1k [00:00, ?B/s]"
386 | ]
387 | },
388 | "metadata": {},
389 | "output_type": "display_data"
390 | },
391 | {
392 | "data": {
393 | "application/vnd.jupyter.widget-view+json": {
394 | "model_id": "76fa7fc8d2ef40bc9cc574dc5e3e4149",
395 | "version_major": 2,
396 | "version_minor": 0
397 | },
398 | "text/plain": [
399 | "Downloading shards: 0%| | 0/2 [00:00, ?it/s]"
400 | ]
401 | },
402 | "metadata": {},
403 | "output_type": "display_data"
404 | },
405 | {
406 | "data": {
407 | "application/vnd.jupyter.widget-view+json": {
408 | "model_id": "24418f6c9cb047a0ac45ca822660f09a",
409 | "version_major": 2,
410 | "version_minor": 0
411 | },
412 | "text/plain": [
413 | "model-00001-of-00002.safetensors: 0%| | 0.00/9.94G [00:00, ?B/s]"
414 | ]
415 | },
416 | "metadata": {},
417 | "output_type": "display_data"
418 | },
419 | {
420 | "data": {
421 | "application/vnd.jupyter.widget-view+json": {
422 | "model_id": "355c8f94a315400197731f79a4d982da",
423 | "version_major": 2,
424 | "version_minor": 0
425 | },
426 | "text/plain": [
427 | "model-00002-of-00002.safetensors: 0%| | 0.00/4.54G [00:00, ?B/s]"
428 | ]
429 | },
430 | "metadata": {},
431 | "output_type": "display_data"
432 | },
433 | {
434 | "data": {
435 | "application/vnd.jupyter.widget-view+json": {
436 | "model_id": "4e211909071648b0932d38780ed5869f",
437 | "version_major": 2,
438 | "version_minor": 0
439 | },
440 | "text/plain": [
441 | "Loading checkpoint shards: 0%| | 0/2 [00:00, ?it/s]"
442 | ]
443 | },
444 | "metadata": {},
445 | "output_type": "display_data"
446 | },
447 | {
448 | "data": {
449 | "application/vnd.jupyter.widget-view+json": {
450 | "model_id": "dfa1e790f9f9419e8691057707a95472",
451 | "version_major": 2,
452 | "version_minor": 0
453 | },
454 | "text/plain": [
455 | "generation_config.json: 0%| | 0.00/116 [00:00, ?B/s]"
456 | ]
457 | },
458 | "metadata": {},
459 | "output_type": "display_data"
460 | },
461 | {
462 | "data": {
463 | "application/vnd.jupyter.widget-view+json": {
464 | "model_id": "e17932bbe7c84821892dfabeb2a6ec50",
465 | "version_major": 2,
466 | "version_minor": 0
467 | },
468 | "text/plain": [
469 | "tokenizer_config.json: 0%| | 0.00/1.47k [00:00, ?B/s]"
470 | ]
471 | },
472 | "metadata": {},
473 | "output_type": "display_data"
474 | },
475 | {
476 | "data": {
477 | "application/vnd.jupyter.widget-view+json": {
478 | "model_id": "330d14b156ce4c8c939e4fbba3fed4b2",
479 | "version_major": 2,
480 | "version_minor": 0
481 | },
482 | "text/plain": [
483 | "tokenizer.model: 0%| | 0.00/493k [00:00, ?B/s]"
484 | ]
485 | },
486 | "metadata": {},
487 | "output_type": "display_data"
488 | },
489 | {
490 | "data": {
491 | "application/vnd.jupyter.widget-view+json": {
492 | "model_id": "3f9814134d114f86abb395c681ecfc33",
493 | "version_major": 2,
494 | "version_minor": 0
495 | },
496 | "text/plain": [
497 | "tokenizer.json: 0%| | 0.00/1.80M [00:00, ?B/s]"
498 | ]
499 | },
500 | "metadata": {},
501 | "output_type": "display_data"
502 | },
503 | {
504 | "data": {
505 | "application/vnd.jupyter.widget-view+json": {
506 | "model_id": "88efbfa14c874bdb8f8942c269613212",
507 | "version_major": 2,
508 | "version_minor": 0
509 | },
510 | "text/plain": [
511 | "special_tokens_map.json: 0%| | 0.00/72.0 [00:00, ?B/s]"
512 | ]
513 | },
514 | "metadata": {},
515 | "output_type": "display_data"
516 | }
517 | ],
518 | "source": [
519 | "import json\n",
520 | "import pandas as pd\n",
521 | "import torch\n",
522 | "from datasets import Dataset, load_dataset\n",
523 | "from transformers import (\n",
524 | " AutoModelForCausalLM,\n",
525 | " AutoTokenizer,\n",
526 | " BitsAndBytesConfig,\n",
527 | " TrainingArguments,\n",
528 | " pipeline,\n",
529 | " logging,\n",
530 | ")\n",
531 | "\n",
532 | "# Load the base model with QLoRA configuration\n",
533 | "compute_dtype = getattr(torch, bnb_4bit_compute_dtype)\n",
534 | "\n",
535 | "bnb_config = BitsAndBytesConfig(\n",
536 | " load_in_4bit=use_4bit,\n",
537 | " bnb_4bit_quant_type=bnb_4bit_quant_type,\n",
538 | " bnb_4bit_compute_dtype=compute_dtype,\n",
539 | " bnb_4bit_use_double_quant=use_nested_quant,\n",
540 | ")\n",
541 | "\n",
542 | "base_model = AutoModelForCausalLM.from_pretrained(\n",
543 | " model_id,\n",
544 | " quantization_config=bnb_config,\n",
545 | " device_map=device_map\n",
546 | ")\n",
547 | "\n",
548 | "base_model.config.use_cache = False\n",
549 | "base_model.config.pretraining_tp = 1\n",
550 | "\n",
551 | "# Load MitsralAi tokenizer\n",
552 | "tokenizer = AutoTokenizer.from_pretrained(model_id)\n",
553 | "tokenizer.pad_token = tokenizer.eos_token\n",
554 | "tokenizer.padding_side = 'right'"
555 | ]
556 | },
557 | {
558 | "cell_type": "code",
559 | "execution_count": null,
560 | "id": "71488fc2-a707-4b1f-a12d-bb169eb6c85e",
561 | "metadata": {
562 | "tags": []
563 | },
564 | "outputs": [],
565 | "source": [
566 | "print(base_model)"
567 | ]
568 | },
569 | {
570 | "cell_type": "code",
571 | "execution_count": null,
572 | "id": "1db7ea67-cb73-43bc-8d99-9981429499c0",
573 | "metadata": {
574 | "tags": []
575 | },
576 | "outputs": [],
577 | "source": [
578 | "import bitsandbytes as bnb\n",
579 | "\n",
580 | "def find_all_linear_names(model):\n",
581 | " lora_module_names = set()\n",
582 | " for name, module in model.named_modules():\n",
583 | " if isinstance(module, bnb.nn.Linear4bit):\n",
584 | " names = name.split(\".\")\n",
585 | " lora_module_names.add(names[0] if len(names) == 1 else names[-1])\n",
586 | "\n",
587 | " if \"lm_head\" in lora_module_names: # needed for 16-bit\n",
588 | " lora_module_names.remove(\"lm_head\")\n",
589 | " return list(lora_module_names)"
590 | ]
591 | },
592 | {
593 | "cell_type": "code",
594 | "execution_count": null,
595 | "id": "7d3efac0-8989-43a9-937c-64921015b73c",
596 | "metadata": {
597 | "tags": []
598 | },
599 | "outputs": [],
600 | "source": [
601 | "# get lora target modules\n",
602 | "modules = find_all_linear_names(base_model)"
603 | ]
604 | },
605 | {
606 | "cell_type": "code",
607 | "execution_count": null,
608 | "id": "69921e29-c5d8-4a93-a8e1-9e4b6006ec9c",
609 | "metadata": {
610 | "tags": []
611 | },
612 | "outputs": [],
613 | "source": [
614 | "print(modules)"
615 | ]
616 | },
617 | {
618 | "cell_type": "markdown",
619 | "id": "843c9954-95f5-4761-a85b-39caa2f8889c",
620 | "metadata": {
621 | "tags": []
622 | },
623 | "source": [
624 | "#### Inference using base model only before fine tuning "
625 | ]
626 | },
627 | {
628 | "cell_type": "code",
629 | "execution_count": null,
630 | "id": "bba52c5d-93a5-4328-9d89-80d46323e54a",
631 | "metadata": {
632 | "tags": []
633 | },
634 | "outputs": [],
635 | "source": [
636 | "# eval_prompt = create_prompt(dataset[randrange(len(dataset))])[\"prompt\"]\n",
637 | "\n",
638 | "# # import random\n",
639 | "# model_input = tokenizer(eval_prompt, return_tensors=\"pt\").to(\"cuda\")\n",
640 | "\n",
641 | "# base_model.eval()\n",
642 | "# with torch.no_grad():\n",
643 | "# print(tokenizer.decode(base_model.generate(**model_input, max_new_tokens=256, pad_token_id=2)[0], skip_special_tokens=True))"
644 | ]
645 | },
646 | {
647 | "cell_type": "code",
648 | "execution_count": null,
649 | "id": "9d42fbd4-35e2-4a8a-ab82-b4b8be07bc29",
650 | "metadata": {
651 | "tags": []
652 | },
653 | "outputs": [],
654 | "source": [
655 | "from peft import LoraConfig, PeftModel\n",
656 | "from trl import SFTTrainer, DataCollatorForCompletionOnlyLM\n",
657 | "\n",
658 | "\n",
659 | "# Set LoRA configuration\n",
660 | "peft_config = LoraConfig(\n",
661 | " lora_alpha=lora_alpha,\n",
662 | " lora_dropout=lora_dropout,\n",
663 | " r=lora_r,\n",
664 | " target_modules=[\n",
665 | " \"q_proj\",\n",
666 | " \"k_proj\",\n",
667 | " \"v_proj\",\n",
668 | " \"o_proj\",\n",
669 | " \"gate_proj\",\n",
670 | " \"up_proj\",\n",
671 | " \"down_proj\",\n",
672 | " ],\n",
673 | " bias=\"none\",\n",
674 | " task_type=\"CAUSAL_LM\",\n",
675 | ")\n",
676 | "\n",
677 | "# Set training parameters\n",
678 | "training_arguments = TrainingArguments(\n",
679 | " output_dir=output_dir,\n",
680 | " num_train_epochs=num_train_epochs,\n",
681 | " per_device_train_batch_size=per_device_train_batch_size,\n",
682 | " gradient_accumulation_steps=gradient_accumulation_steps,\n",
683 | " optim=optim,\n",
684 | " save_steps=save_steps,\n",
685 | " logging_steps=logging_steps,\n",
686 | " learning_rate=learning_rate,\n",
687 | " weight_decay=weight_decay,\n",
688 | " gradient_checkpointing=gradient_checkpointing,\n",
689 | " fp16=fp16,\n",
690 | " bf16=bf16,\n",
691 | " max_grad_norm=max_grad_norm,\n",
692 | " max_steps=100, # the total number of training steps to perform\n",
693 | " warmup_ratio=warmup_ratio,\n",
694 | " group_by_length=group_by_length,\n",
695 | " lr_scheduler_type=lr_scheduler_type,\n",
696 | ")"
697 | ]
698 | },
699 | {
700 | "cell_type": "markdown",
701 | "id": "59b38140-fcfc-4786-84fd-a7c3590811c5",
702 | "metadata": {},
703 | "source": [
704 | "Train on completions only https://huggingface.co/docs/trl/en/sft_trainer"
705 | ]
706 | },
707 | {
708 | "cell_type": "code",
709 | "execution_count": null,
710 | "id": "1128648c-c865-4ac2-9823-4954c5915201",
711 | "metadata": {
712 | "tags": []
713 | },
714 | "outputs": [],
715 | "source": [
716 | "def formatting_prompts_func(example):\n",
717 | " output_texts = []\n",
718 | " for i in range(len(example['prompt'])):\n",
719 | " text = f\"{example['prompt'][i]}\\n\\n ### Answer: {example['completion'][i]}\"\n",
720 | " output_texts.append(text)\n",
721 | " return output_texts\n",
722 | "\n",
723 | "response_template = \"### Answer:\"\n",
724 | "collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)"
725 | ]
726 | },
727 | {
728 | "cell_type": "code",
729 | "execution_count": null,
730 | "id": "1f43f1ab-e0a6-49ed-b09d-7286f09f3d3d",
731 | "metadata": {
732 | "tags": []
733 | },
734 | "outputs": [],
735 | "source": [
736 | "# Initialize the SFTTrainer for fine-tuning\n",
737 | "trainer = SFTTrainer(\n",
738 | " model=base_model,\n",
739 | " train_dataset=dataset_instruct_format,\n",
740 | " formatting_func=formatting_prompts_func,\n",
741 | " data_collator=collator,\n",
742 | " peft_config=peft_config,\n",
743 | " max_seq_length=max_seq_length, # You can specify the maximum sequence length here\n",
744 | " tokenizer=tokenizer,\n",
745 | " args=training_arguments,\n",
746 | " packing=packing\n",
747 | ")"
748 | ]
749 | },
750 | {
751 | "cell_type": "code",
752 | "execution_count": null,
753 | "id": "3e873e68-01ee-4d42-b7e1-edc375b79157",
754 | "metadata": {
755 | "tags": []
756 | },
757 | "outputs": [],
758 | "source": [
759 | "# Start the training process\n",
760 | "trainer.train()"
761 | ]
762 | },
763 | {
764 | "cell_type": "code",
765 | "execution_count": null,
766 | "id": "9f5b6b7e-e8c1-458d-a7b3-fa5d08f703d8",
767 | "metadata": {},
768 | "outputs": [],
769 | "source": [
770 | "# Save the fine-tuned model\n",
771 | "trainer.model.save_pretrained(new_model)"
772 | ]
773 | },
774 | {
775 | "cell_type": "markdown",
776 | "id": "81708229-ed18-4a35-909c-5a8abd2dc473",
777 | "metadata": {},
778 | "source": [
779 | "### Finished all training steps ... "
780 | ]
781 | },
782 | {
783 | "cell_type": "markdown",
784 | "id": "daeb016f-0a5f-4202-8c5d-d87e6f6dd190",
785 | "metadata": {},
786 | "source": [
787 | "#### Merge the trained qlora into the base model "
788 | ]
789 | },
790 | {
791 | "cell_type": "code",
792 | "execution_count": null,
793 | "id": "5183c57d-b905-4adb-adbf-fc788861d0b7",
794 | "metadata": {
795 | "tags": []
796 | },
797 | "outputs": [],
798 | "source": [
799 | "base_model = AutoModelForCausalLM.from_pretrained(\n",
800 | " model_id,\n",
801 | " low_cpu_mem_usage=True,\n",
802 | " torch_dtype=torch.float16,\n",
803 | " device_map=device_map\n",
804 | ")"
805 | ]
806 | },
807 | {
808 | "cell_type": "code",
809 | "execution_count": null,
810 | "id": "2c9d4b5a-d421-4941-9606-3151d3b9d988",
811 | "metadata": {
812 | "tags": []
813 | },
814 | "outputs": [],
815 | "source": [
816 | "print(base_model)"
817 | ]
818 | },
819 | {
820 | "cell_type": "code",
821 | "execution_count": null,
822 | "id": "96fefd55-780a-4edd-a407-8d38eeb98314",
823 | "metadata": {
824 | "tags": []
825 | },
826 | "outputs": [],
827 | "source": [
828 | "merged_model= PeftModel.from_pretrained(base_model, new_model)"
829 | ]
830 | },
831 | {
832 | "cell_type": "code",
833 | "execution_count": null,
834 | "id": "daa1c293-1daa-45a3-8531-9334163ba123",
835 | "metadata": {
836 | "tags": []
837 | },
838 | "outputs": [],
839 | "source": [
840 | "print(merged_model)"
841 | ]
842 | },
843 | {
844 | "cell_type": "code",
845 | "execution_count": null,
846 | "id": "c019bb5c-3215-4bf4-9c9c-1a9e84792739",
847 | "metadata": {
848 | "tags": []
849 | },
850 | "outputs": [],
851 | "source": [
852 | "merged_model= merged_model.merge_and_unload()"
853 | ]
854 | },
855 | {
856 | "cell_type": "code",
857 | "execution_count": null,
858 | "id": "ed33bd53-467a-4d57-8265-8ced8a7a01d8",
859 | "metadata": {
860 | "tags": []
861 | },
862 | "outputs": [],
863 | "source": [
864 | "print(merged_model)"
865 | ]
866 | },
867 | {
868 | "cell_type": "code",
869 | "execution_count": null,
870 | "id": "bdf5879d-06e4-4146-80c1-f8a026854456",
871 | "metadata": {
872 | "tags": []
873 | },
874 | "outputs": [],
875 | "source": [
876 | "sagemaker_save_dir = \"Mistral-Finetuned-Merged\""
877 | ]
878 | },
879 | {
880 | "cell_type": "code",
881 | "execution_count": null,
882 | "id": "61ef243d-0356-4c7f-859a-83800ce61d3f",
883 | "metadata": {
884 | "tags": []
885 | },
886 | "outputs": [],
887 | "source": [
888 | "merged_model.save_pretrained(sagemaker_save_dir, safe_serialization=True, max_shard_size=\"2GB\")\n",
889 | "# save tokenizer for easy inference\n",
890 | "tokenizer.save_pretrained(sagemaker_save_dir)"
891 | ]
892 | }
909 | ],
910 | "metadata": {
911 | "kernelspec": {
912 | "display_name": "conda_pytorch_p310",
913 | "language": "python",
914 | "name": "conda_pytorch_p310"
915 | },
916 | "language_info": {
917 | "codemirror_mode": {
918 | "name": "ipython",
919 | "version": 3
920 | },
921 | "file_extension": ".py",
922 | "mimetype": "text/x-python",
923 | "name": "python",
924 | "nbconvert_exporter": "python",
925 | "pygments_lexer": "ipython3",
926 | "version": "3.10.13"
927 | }
928 | },
929 | "nbformat": 4,
930 | "nbformat_minor": 5
931 | }
932 |
--------------------------------------------------------------------------------
/Deploy_Mistral_7B_on_Amazon_SageMaker_with_vLLM.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "603f53ab-ed44-4950-aac6-ba20b77fa16a",
6 | "metadata": {},
7 | "source": [
8 | "# Deploy Finetuned Mistral on Amazon SageMaker\n",
9 | "\n",
10 | "Mistral 7B is the open LLM from Mistral AI.\n",
11 | "\n",
12 | "This sample is modified from this documentation \n",
13 | "https://docs.djl.ai/docs/demos/aws/sagemaker/large-model-inference/sample-llm/vllm_deploy_mistral_7b.html\n",
14 | "\n",
15 | "You only need to provide the new model location to deploy the Mistral fine tune version"
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "id": "efc64a47",
21 | "metadata": {},
22 | "source": [
23 | "This notebook has been tested on Amazon SageMaker Notebook Instances with single GPU on ml.g5.2xlarge\n",
24 | "\n",
25 | "The deployment has been tested on Amazon SageMaker real time inference endpoint with single GPU on ml.g5.2xlarge"
26 | ]
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "id": "2c6d91ea-8c0c-4157-a5fc-c16a4491f896",
31 | "metadata": {},
32 | "source": [
33 | "## Setup development environment"
34 | ]
35 | },
36 | {
37 | "cell_type": "code",
38 | "execution_count": 7,
39 | "id": "756f5440-6f6a-4329-8f03-b545b1753f9e",
40 | "metadata": {
41 | "tags": []
42 | },
43 | "outputs": [],
44 | "source": [
45 | "# TODO: update when container is added to sagemaker sdk\n",
46 | "!pip install sagemaker huggingface_hub jinja2 --upgrade --quiet"
47 | ]
48 | },
49 | {
50 | "cell_type": "code",
51 | "execution_count": 6,
52 | "id": "c1721d88-85c2-4eec-92c6-ba1cdb28edf5",
53 | "metadata": {
54 | "tags": []
55 | },
56 | "outputs": [
57 | {
58 | "name": "stdout",
59 | "output_type": "stream",
60 | "text": [
61 | "sagemaker role arn: arn:aws:iam::70768*******:role/service-role/AmazonSageMaker-ExecutionRole-20191024T163188\n",
62 | "sagemaker session region: us-east-1\n",
63 | "sagemaker version: 2.209.0\n"
64 | ]
65 | }
66 | ],
67 | "source": [
68 | "import sagemaker\n",
69 | "import boto3\n",
70 | "sess = sagemaker.Session()\n",
71 | "# sagemaker session bucket -> used for uploading data, models and logs\n",
72 | "# sagemaker will automatically create this bucket if it not exists\n",
73 | "sagemaker_session_bucket=None\n",
74 | "if sagemaker_session_bucket is None and sess is not None:\n",
75 | " # set to default bucket if a bucket name is not given\n",
76 | " sagemaker_session_bucket = sess.default_bucket()\n",
77 | "\n",
78 | "try:\n",
79 | " role = sagemaker.get_execution_role()\n",
80 | "except ValueError:\n",
81 | " iam = boto3.client('iam')\n",
82 | " role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']\n",
83 | "\n",
84 | "sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)\n",
85 | "\n",
86 | "print(f\"sagemaker role arn: {role}\")\n",
87 | "print(f\"sagemaker session region: {sess.boto_region_name}\")\n",
88 | "print(f\"sagemaker version: {sagemaker.__version__}\")\n"
89 | ]
90 | },
91 | {
92 | "cell_type": "code",
93 | "execution_count": 8,
94 | "id": "0c5725e0-285c-4c67-b02d-8cc60b376abe",
95 | "metadata": {
96 | "tags": []
97 | },
98 | "outputs": [],
99 | "source": [
100 | "model_id = \"mistralai/Mistral-7B-Instruct-v0.1\"\n",
101 | "instance_type = 'ml.g5.2xlarge' # instances type used for the deployment "
102 | ]
103 | },
104 | {
105 | "cell_type": "markdown",
106 | "id": "f6dcac4f-5f26-4797-975b-c54f0b481ac2",
107 | "metadata": {},
108 | "source": [
109 | "## Download the model and upload to s3\n",
110 | "\n",
111 | "We recommend to first save the model in a S3 location and provide the S3 url in the serving.properties file. This allows faster downloads times.\n",
112 | "\n",
113 | "Now you need to agree the policy to access this model, and use your Hugging Face token to download the model \n",
114 | "\n",
115 | "from huggingface_hub import login\n",
116 | "\n",
117 | "access_token_read = \"hf_****************XU\"\n",
118 | "\n",
119 | "login(token = access_token_read)\n",
120 | "\n",
121 | "*If you already download the model to the notebook, please ingore this steps*."
122 | ]
123 | },
124 | {
125 | "cell_type": "code",
126 | "execution_count": 4,
127 | "id": "83ba4c27-49f0-4418-bf58-7749b15b9443",
128 | "metadata": {
129 | "tags": []
130 | },
131 | "outputs": [
132 | {
133 | "data": {
134 | "application/vnd.jupyter.widget-view+json": {
135 | "model_id": "2b580cd9b3354b87a1d153d22daaabe9",
136 | "version_major": 2,
137 | "version_minor": 0
138 | },
139 | "text/plain": [
140 | "Fetching 12 files: 0%| | 0/12 [00:00, ?it/s]"
141 | ]
142 | },
143 | "metadata": {},
144 | "output_type": "display_data"
145 | },
146 | {
147 | "data": {
148 | "application/vnd.jupyter.widget-view+json": {
149 | "model_id": "726b4f7f834a48b1b03e2c2ebc7e2985",
150 | "version_major": 2,
151 | "version_minor": 0
152 | },
153 | "text/plain": [
154 | "generation_config.json: 0%| | 0.00/116 [00:00, ?B/s]"
155 | ]
156 | },
157 | "metadata": {},
158 | "output_type": "display_data"
159 | },
160 | {
161 | "data": {
162 | "application/vnd.jupyter.widget-view+json": {
163 | "model_id": "ea4ec51032c84a26be9f4ee09805aa84",
164 | "version_major": 2,
165 | "version_minor": 0
166 | },
167 | "text/plain": [
168 | "model.safetensors.index.json: 0%| | 0.00/25.1k [00:00, ?B/s]"
169 | ]
170 | },
171 | "metadata": {},
172 | "output_type": "display_data"
173 | },
174 | {
175 | "data": {
176 | "application/vnd.jupyter.widget-view+json": {
177 | "model_id": "b66d12288630458793824c828047890b",
178 | "version_major": 2,
179 | "version_minor": 0
180 | },
181 | "text/plain": [
182 | "model-00002-of-00002.safetensors: 0%| | 0.00/4.54G [00:00, ?B/s]"
183 | ]
184 | },
185 | "metadata": {},
186 | "output_type": "display_data"
187 | },
188 | {
189 | "data": {
190 | "application/vnd.jupyter.widget-view+json": {
191 | "model_id": "6d013d5d370c4c47996640ab2e3644fc",
192 | "version_major": 2,
193 | "version_minor": 0
194 | },
195 | "text/plain": [
196 | "pytorch_model-00001-of-00002.bin: 0%| | 0.00/9.94G [00:00, ?B/s]"
197 | ]
198 | },
199 | "metadata": {},
200 | "output_type": "display_data"
201 | },
202 | {
203 | "data": {
204 | "application/vnd.jupyter.widget-view+json": {
205 | "model_id": "b1fc3e6d6a644ba88a076cc94e11a42a",
206 | "version_major": 2,
207 | "version_minor": 0
208 | },
209 | "text/plain": [
210 | "model-00001-of-00002.safetensors: 0%| | 0.00/9.94G [00:00, ?B/s]"
211 | ]
212 | },
213 | "metadata": {},
214 | "output_type": "display_data"
215 | },
216 | {
217 | "data": {
218 | "application/vnd.jupyter.widget-view+json": {
219 | "model_id": "db32e5873d3440d7a8d4e35aaca76571",
220 | "version_major": 2,
221 | "version_minor": 0
222 | },
223 | "text/plain": [
224 | "pytorch_model.bin.index.json: 0%| | 0.00/23.9k [00:00, ?B/s]"
225 | ]
226 | },
227 | "metadata": {},
228 | "output_type": "display_data"
229 | },
230 | {
231 | "data": {
232 | "application/vnd.jupyter.widget-view+json": {
233 | "model_id": "00acbeb1136f4052bff0651ddd3a6950",
234 | "version_major": 2,
235 | "version_minor": 0
236 | },
237 | "text/plain": [
238 | "pytorch_model-00002-of-00002.bin: 0%| | 0.00/5.06G [00:00, ?B/s]"
239 | ]
240 | },
241 | "metadata": {},
242 | "output_type": "display_data"
243 | },
244 | {
245 | "data": {
246 | "application/vnd.jupyter.widget-view+json": {
247 | "model_id": "70aa0f18215e493eaa6d61d7ff4cae15",
248 | "version_major": 2,
249 | "version_minor": 0
250 | },
251 | "text/plain": [
252 | "special_tokens_map.json: 0%| | 0.00/72.0 [00:00, ?B/s]"
253 | ]
254 | },
255 | "metadata": {},
256 | "output_type": "display_data"
257 | },
258 | {
259 | "data": {
260 | "application/vnd.jupyter.widget-view+json": {
261 | "model_id": "1b4539a4868143b88b4af349b6797342",
262 | "version_major": 2,
263 | "version_minor": 0
264 | },
265 | "text/plain": [
266 | "config.json: 0%| | 0.00/571 [00:00, ?B/s]"
267 | ]
268 | },
269 | "metadata": {},
270 | "output_type": "display_data"
271 | },
272 | {
273 | "data": {
274 | "application/vnd.jupyter.widget-view+json": {
275 | "model_id": "aac43333ef74450fb07ffd76466a88be",
276 | "version_major": 2,
277 | "version_minor": 0
278 | },
279 | "text/plain": [
280 | "tokenizer.json: 0%| | 0.00/1.80M [00:00, ?B/s]"
281 | ]
282 | },
283 | "metadata": {},
284 | "output_type": "display_data"
285 | },
286 | {
287 | "data": {
288 | "application/vnd.jupyter.widget-view+json": {
289 | "model_id": "58a2df599b624ec9a49ef8bf5667aea6",
290 | "version_major": 2,
291 | "version_minor": 0
292 | },
293 | "text/plain": [
294 | "tokenizer.model: 0%| | 0.00/493k [00:00, ?B/s]"
295 | ]
296 | },
297 | "metadata": {},
298 | "output_type": "display_data"
299 | },
300 | {
301 | "data": {
302 | "application/vnd.jupyter.widget-view+json": {
303 | "model_id": "91574f67a0d1464983573aab55b82cc0",
304 | "version_major": 2,
305 | "version_minor": 0
306 | },
307 | "text/plain": [
308 | "tokenizer_config.json: 0%| | 0.00/1.47k [00:00, ?B/s]"
309 | ]
310 | },
311 | "metadata": {},
312 | "output_type": "display_data"
313 | }
314 | ],
315 | "source": [
316 | "# from huggingface_hub import snapshot_download\n",
317 | "# from pathlib import Path\n",
318 | "# import os\n",
319 | "\n",
320 | "# # - This will download the model into the current directory where ever the jupyter notebook is running\n",
321 | "# local_model_path = Path(\".\")\n",
322 | "# local_model_path.mkdir(exist_ok=True)\n",
323 | "\n",
324 | "# # Only download pytorch checkpoint files\n",
325 | "# allow_patterns = [\"*.json\", \"*.pt\", \"*.bin\", \"*.txt\", \"*.model\", \"*.safetensors\"]\n",
326 | "\n",
327 | "# # - Leverage the snapshot library to donload the model since the model is stored in repository using LFS\n",
328 | "# model_download_path = snapshot_download(\n",
329 | "# repo_id=model_id,\n",
330 | "# cache_dir=local_model_path,\n",
331 | "# allow_patterns=allow_patterns,\n",
332 | "# )"
333 | ]
334 | },
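335 | {
336 | "cell_type": "markdown",
337 | "id": "added-local-size-check",
338 | "metadata": {},
339 | "source": [
340 | "Before spending time on the upload, it can help to sanity-check the size of the local snapshot. A minimal sketch, assuming the download cell above was uncommented and run so that `model_download_path` is set:\n",
341 | "\n",
342 | "```python\n",
343 | "from pathlib import Path\n",
344 | "\n",
345 | "# sum the sizes of all files under the local snapshot directory\n",
346 | "total_bytes = sum(p.stat().st_size for p in Path(model_download_path).rglob(\"*\") if p.is_file())\n",
347 | "print(f\"local snapshot size: {total_bytes / 1e9:.1f} GB\")\n",
348 | "```"
349 | ]
350 | },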
335 | {
336 | "cell_type": "markdown",
337 | "id": "cb749e4d-7747-4d62-87b8-c913a4eb1e9c",
338 | "metadata": {},
339 | "source": [
340 | "#### If you have run the notebook of Local_Finetune_Mistral.ipynb, you should have the model artifact in Mistral-Finetuned-Merged\n",
341 | "\n",
342 | "#### If you have run the notebook of Finetune_Mistral_7B_on_Amazon_SageMaker.ipynb, you should have the model artifact in ./results/training_job, downloaded from SageMaker Training job\n"
343 | ]
344 | },
345 | {
346 | "cell_type": "code",
347 | "execution_count": 17,
348 | "id": "60b4006c-589b-4dc1-83cb-084ca6e27d88",
349 | "metadata": {
350 | "tags": []
351 | },
352 | "outputs": [],
353 | "source": [
354 | "model_download_path\n",
355 | "#model_download_path = \"Mistral-Finetuned-Merged\"\n",
356 | "#model_download_path = \"./results/training_job\"\n"
357 | ]
358 | },
359 | {
360 | "cell_type": "markdown",
361 | "id": "1b037369-a5a0-4618-b690-723d8ff6549b",
362 | "metadata": {},
363 | "source": [
364 | "Define where the model should upload in S3"
365 | ]
366 | },
367 | {
368 | "cell_type": "code",
369 | "execution_count": 15,
370 | "id": "910b29d3-24f8-4073-a1f5-788404bfe635",
371 | "metadata": {
372 | "tags": []
373 | },
374 | "outputs": [
375 | {
376 | "name": "stdout",
377 | "output_type": "stream",
378 | "text": [
379 | "Pretrained model will be uploaded to ---- > s3://sagemaker-us-east-1-70768*******/mistralai/Mistral-7B-Instruct-v0.1/lmi/\n"
380 | ]
381 | }
382 | ],
383 | "source": [
384 | "# define a variable to contain the s3url of the location that has the model\n",
385 | "s3_model_prefix = f\"{model_id}/lmi\" # folder within bucket where model artifact will go\n",
386 | "pretrained_model_location = f\"s3://{sagemaker_session_bucket}/{s3_model_prefix}/\"\n",
387 | "print(f\"Pretrained model will be uploaded to ---- > {pretrained_model_location}\")"
388 | ]
389 | },
390 | {
391 | "cell_type": "markdown",
392 | "id": "1d022085-82e6-4d5e-a956-1fa48eaba71a",
393 | "metadata": {},
394 | "source": [
395 | "We upload the model files to s3 bucket, please be patient, this will takes serveral (around 20) minutes."
396 | ]
397 | },
398 | {
399 | "cell_type": "code",
400 | "execution_count": 16,
401 | "id": "c341a860-6751-417b-8b4d-49ce6ce2940c",
402 | "metadata": {
403 | "tags": []
404 | },
405 | "outputs": [
406 | {
407 | "name": "stdout",
408 | "output_type": "stream",
409 | "text": [
410 | "Model uploaded to --- > s3://sagemaker-us-east-1-70768*******/mistralai/Mistral-7B-Instruct-v0.1/lmi\n",
411 | "We will set option.model_id=s3://sagemaker-us-east-1-70768*******/mistralai/Mistral-7B-Instruct-v0.1/lmi\n"
412 | ]
413 | }
414 | ],
415 | "source": [
416 | "model_artifact = sess.upload_data(path=model_download_path, key_prefix=s3_model_prefix)\n",
417 | "print(f\"Model uploaded to --- > {model_artifact}\")\n",
418 | "print(f\"We will set option.model_id={model_artifact}\")"
419 | ]
420 | },
421 | {
422 | "cell_type": "markdown",
423 | "id": "4b92b25c-6b19-4779-a1dc-5bd09336f213",
424 | "metadata": {
425 | "tags": []
426 | },
427 | "source": [
428 | "## Choose an inference image\n",
429 | "\n",
430 | "[vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference and serving. Here we demo how to use VLLM to host the model.\n",
431 | "\n",
432 | "Starting LMI V10 (0.28.0), we are changing the name from LMI DeepSpeed DLC to LMI (LargeModelInference).\n",
433 | "\n",
434 | "https://github.com/aws/deep-learning-containers/blob/master/available_images.md"
435 | ]
436 | },
437 | {
438 | "cell_type": "code",
439 | "execution_count": 18,
440 | "id": "e5383930-19e9-47a6-b787-cdfc04b216d9",
441 | "metadata": {
442 | "tags": []
443 | },
444 | "outputs": [
445 | {
446 | "name": "stdout",
447 | "output_type": "stream",
448 | "text": [
449 | "Docker image: 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.25.0-deepspeed0.11.0-cu118\n"
450 | ]
451 | }
452 | ],
453 | "source": [
454 | "import sagemaker \n",
455 | "\n",
456 | "import sagemaker \n",
457 | "\n",
458 | "inference_image_uri = sagemaker.image_uris.retrieve(\n",
459 | " framework=\"djl-lmi\", region=sess.boto_session.region_name, version=\"0.29.0\"\n",
460 | ")\n",
461 | "\n",
462 | "print(f\"Docker image: {inference_image_uri}\")"
463 | ]
464 | },
465 | {
466 | "cell_type": "markdown",
467 | "id": "79b6140d-66e7-41e6-a895-7e6d668efb63",
468 | "metadata": {
469 | "tags": []
470 | },
471 | "source": [
472 | "### Prepare serving.properties file"
473 | ]
474 | },
475 | {
476 | "cell_type": "code",
477 | "execution_count": 19,
478 | "id": "79641d46-a272-4447-a626-a577e9fc91bf",
479 | "metadata": {
480 | "tags": []
481 | },
482 | "outputs": [],
483 | "source": [
484 | "code_folder = \"code_vllm\"\n",
485 | "#pretrained_model_location = \"TheBloke/mixtral-8x7b-v0.1-AWQ\""
486 | ]
487 | },
488 | {
489 | "cell_type": "code",
490 | "execution_count": 20,
491 | "id": "bd5c3277-edcf-4651-bc04-e5bf43018776",
492 | "metadata": {
493 | "tags": []
494 | },
495 | "outputs": [],
496 | "source": [
497 | "from pathlib import Path\n",
498 | "code_path = Path(code_folder)\n",
499 | "code_path.mkdir(exist_ok=True)"
500 | ]
501 | },
502 | {
503 | "cell_type": "markdown",
504 | "id": "485b3660",
505 | "metadata": {},
506 | "source": [
507 | "Large Model Inference Configurations\n",
508 | "\n",
509 | "https://docs.djl.ai/docs/serving/serving/docs/lmi/configurations_large_model_inference_containers.html"
510 | ]
511 | },
520 | {
521 | "cell_type": "code",
522 | "execution_count": 21,
523 | "id": "339a0dba-f1eb-41df-a068-97f9fff12f54",
524 | "metadata": {
525 | "tags": []
526 | },
527 | "outputs": [
528 | {
529 | "name": "stdout",
530 | "output_type": "stream",
531 | "text": [
532 | "Writing ./code_vllm/serving.properties\n"
533 | ]
534 | }
535 | ],
536 | "source": [
537 | "%%writefile ./{code_folder}/serving.properties\n",
538 | "engine = Python\n",
539 | "option.model_id = {{s3url}}\n",
540 | "option.dtype=fp16\n",
541 | "option.tensor_parallel_degree = 1\n",
542 | "option.output_formatter = json\n",
543 | "option.task=text-generation\n",
544 | "option.model_loading_timeout = 1200\n",
545 | "option.rolling_batch=vllm\n",
546 | "option.device_map=auto\n",
547 | "option.max_model_len=2048"
548 | ]
549 | },
550 | {
551 | "cell_type": "code",
552 | "execution_count": 22,
553 | "id": "5e9931d2-14da-459a-9c24-4d3301659fac",
554 | "metadata": {
555 | "tags": []
556 | },
557 | "outputs": [
558 | {
559 | "name": "stdout",
560 | "output_type": "stream",
561 | "text": [
562 | " 1\t\u001b[36mengine\u001b[39;49;00m\u001b[37m \u001b[39;49;00m=\u001b[37m \u001b[39;49;00m\u001b[33mPython\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n",
563 | " 2\t\u001b[36moption.model_id\u001b[39;49;00m\u001b[37m \u001b[39;49;00m=\u001b[37m \u001b[39;49;00m\u001b[33ms3://sagemaker-us-east-1-70768*******/mistralai/Mistral-7B-Instruct-v0.1/lmi/\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n",
564 | " 3\t\u001b[36moption.dtype\u001b[39;49;00m=\u001b[33mfp16\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n",
565 | " 4\t\u001b[36moption.tensor_parallel_degree\u001b[39;49;00m\u001b[37m \u001b[39;49;00m=\u001b[37m \u001b[39;49;00m\u001b[33m1\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n",
566 | " 5\t\u001b[36moption.output_formatter\u001b[39;49;00m\u001b[37m \u001b[39;49;00m=\u001b[37m \u001b[39;49;00m\u001b[33mjson\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n",
567 | " 6\t\u001b[36moption.task\u001b[39;49;00m=\u001b[33mtext-generation\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n",
568 | " 7\t\u001b[36moption.model_loading_timeout\u001b[39;49;00m\u001b[37m \u001b[39;49;00m=\u001b[37m \u001b[39;49;00m\u001b[33m600\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n",
569 | " 8\t\u001b[36moption.rolling_batch\u001b[39;49;00m=\u001b[33mvllm\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n",
570 | " 9\t\u001b[36moption.device_map\u001b[39;49;00m=\u001b[33mauto\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n"
571 | ]
572 | }
573 | ],
574 | "source": [
575 | "import jinja2\n",
576 | "jinja_env = jinja2.Environment()\n",
577 | "# we plug in the appropriate model location into our `serving.properties` file based on the region in which this notebook is running\n",
578 | "template = jinja_env.from_string(Path(f\"{code_folder}/serving.properties\").open().read())\n",
579 | "Path(f\"{code_folder}/serving.properties\").open(\"w\").write(\n",
580 | " template.render(s3url=pretrained_model_location)\n",
581 | ")\n",
582 | "!pygmentize {code_folder}/serving.properties | cat -n"
583 | ]
584 | },
585 | {
586 | "cell_type": "markdown",
587 | "id": "6dd1f43e-4448-47ba-aefd-9924da9d4d18",
588 | "metadata": {},
589 | "source": [
590 | "### create tarball and upload to s3 location\n"
591 | ]
592 | },
593 | {
594 | "cell_type": "code",
595 | "execution_count": 23,
596 | "id": "818ddddd-36ea-4bc6-bc46-6601e52a69eb",
597 | "metadata": {
598 | "tags": []
599 | },
600 | "outputs": [
601 | {
602 | "name": "stdout",
603 | "output_type": "stream",
604 | "text": [
605 | "./\n",
606 | "./serving.properties\n"
607 | ]
608 | }
609 | ],
610 | "source": [
611 | "!rm -f model.tar.gz\n",
612 | "!rm -rf {code_folder}/.ipynb_checkpoints\n",
613 | "!tar czvf model.tar.gz -C {code_folder} ."
614 | ]
615 | },
616 | {
617 | "cell_type": "code",
618 | "execution_count": 24,
619 | "id": "d34bebe9-682a-4561-b03e-c2bfd7dedf43",
620 | "metadata": {
621 | "tags": []
622 | },
623 | "outputs": [
624 | {
625 | "name": "stdout",
626 | "output_type": "stream",
627 | "text": [
628 | "S3 Code or Model tar ball uploaded to --- > s3://sagemaker-us-east-1-70768*******/mistralai/Mistral-7B-Instruct-v0.1/code_vllm/model.tar.gz\n"
629 | ]
630 | }
631 | ],
632 | "source": [
633 | "s3_code_prefix = f\"{model_id}/{code_folder}\"\n",
634 | "s3_code_artifact = sess.upload_data(\"model.tar.gz\", sagemaker_session_bucket, s3_code_prefix)\n",
635 | "print(f\"S3 Code or Model tar ball uploaded to --- > {s3_code_artifact}\")"
636 | ]
637 | },
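638 | {
639 | "cell_type": "markdown",
640 | "id": "added-verify-upload",
641 | "metadata": {},
642 | "source": [
643 | "To double-check that the tarball landed where expected, a quick listing sketch using plain boto3 (reusing the `sagemaker_session_bucket` and `s3_code_prefix` variables from above):\n",
644 | "\n",
645 | "```python\n",
646 | "import boto3\n",
647 | "\n",
648 | "s3 = boto3.client(\"s3\")\n",
649 | "# list the objects under the code prefix we just uploaded to\n",
650 | "resp = s3.list_objects_v2(Bucket=sagemaker_session_bucket, Prefix=s3_code_prefix)\n",
651 | "for obj in resp.get(\"Contents\", []):\n",
652 | "    print(obj[\"Key\"], obj[\"Size\"])\n",
653 | "```"
654 | ]
655 | },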
638 | {
639 | "cell_type": "markdown",
640 | "id": "6027cb0e-27b1-41ac-a61d-f142bea36f7e",
641 | "metadata": {},
642 | "source": [
643 | "## Create the endpoint"
644 | ]
645 | },
646 | {
647 | "cell_type": "code",
648 | "execution_count": 25,
649 | "id": "24a5dacf-b66c-4704-aaba-aa7fbc159f6c",
650 | "metadata": {
651 | "tags": []
652 | },
653 | "outputs": [
654 | {
655 | "name": "stdout",
656 | "output_type": "stream",
657 | "text": [
658 | "mistralai/Mistral-7B-Instruct-v0.1\n",
659 | "ml.g5.2xlarge\n"
660 | ]
661 | }
662 | ],
663 | "source": [
664 | "## Deploy Mixtral 8x7B to Amazon SageMaker\n",
665 | "\n",
666 | "import json\n",
667 | "from sagemaker import Model\n",
668 | "from datetime import datetime\n",
669 | "\n",
670 | "# sagemaker config\n",
671 | "print(model_id)\n",
672 | "print(instance_type)\n",
673 | "health_check_timeout = 1200\n",
674 | "\n",
675 | "timestamp = datetime.now().strftime(\"%Y-%m-%d-%H-%M-%S\")\n",
676 | "\n",
677 | "# create HuggingFaceModel with the image uri\n",
678 | "llm_model = Model(\n",
679 | " role=role,\n",
680 | " name=f\"vllm-{model_id.replace('/', '-').lower().replace('.', '')}-{timestamp}\",\n",
681 | " model_data=s3_code_artifact, \n",
682 | " image_uri=inference_image_uri,\n",
683 | ")\n"
684 | ]
685 | },
686 | {
687 | "cell_type": "markdown",
688 | "id": "fa217f75-4aff-4753-8333-df4928def65c",
689 | "metadata": {},
690 | "source": [
691 | "After we have created the `HuggingFaceModel` we can deploy it to Amazon SageMaker using the `deploy` method. We will deploy the model with the `ml.g5.48xlarge` instance type. TGI will automatically distribute and shard the model across all GPUs."
692 | ]
693 | },
694 | {
695 | "cell_type": "code",
696 | "execution_count": 26,
697 | "id": "5794eb27-441f-48e5-9591-3f63b0921647",
698 | "metadata": {
699 | "tags": []
700 | },
701 | "outputs": [
702 | {
703 | "name": "stdout",
704 | "output_type": "stream",
705 | "text": [
706 | "-----------!"
707 | ]
708 | }
709 | ],
710 | "source": [
711 | "# Deploy model to an endpoint\n",
712 | "# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy\n",
713 | "llm = llm_model.deploy(\n",
714 | " initial_instance_count=1,\n",
715 | " instance_type=instance_type,\n",
716 | " container_startup_health_check_timeout=health_check_timeout, # 10 minutes to be able to load the model\n",
717 | ")\n"
718 | ]
719 | },
720 | {
721 | "cell_type": "code",
722 | "execution_count": 27,
723 | "id": "fe4457b7-bfa7-44c7-b0c8-d3f4d06c7372",
724 | "metadata": {
725 | "tags": []
726 | },
727 | "outputs": [
728 | {
729 | "name": "stdout",
730 | "output_type": "stream",
731 | "text": [
732 | "vllm-mistralai-mistral-7b-instruct-v01--2024-02-28-07-04-46-288\n"
733 | ]
734 | }
735 | ],
736 | "source": [
737 | "endpoint_name = llm_model.endpoint_name\n",
738 | "print(endpoint_name)"
739 | ]
740 | },
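741 | {
742 | "cell_type": "markdown",
743 | "id": "added-endpoint-status-check",
744 | "metadata": {},
745 | "source": [
746 | "Optionally, confirm the endpoint is `InService` before sending requests. A minimal sketch using the standard boto3 SageMaker client:\n",
747 | "\n",
748 | "```python\n",
749 | "import boto3\n",
750 | "\n",
751 | "sm_client = boto3.client(\"sagemaker\")\n",
752 | "# expect 'InService' once the deployment above has finished\n",
753 | "status = sm_client.describe_endpoint(EndpointName=endpoint_name)[\"EndpointStatus\"]\n",
754 | "print(f\"Endpoint status: {status}\")\n",
755 | "```"
756 | ]
757 | },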
741 | {
742 | "cell_type": "markdown",
743 | "id": "e955f836-4928-4098-8bb4-7118ba0a7781",
744 | "metadata": {
745 | "tags": []
746 | },
747 | "source": [
748 | "## Run inference and chat with the model\n",
749 | "\n",
750 | "See below as one example of prompt and response "
751 | ]
752 | },
753 | {
754 | "cell_type": "markdown",
755 | "id": "36f67d3b-edd0-4cb5-972d-96e72b2a47f0",
756 | "metadata": {},
757 | "source": [
758 | "[INST] Below is the question based on the context. Question: Given a reference text about Lollapalooza, where does it take place, who started it and what is it?. Below is the given the context Lollapalooza /ˌlɒləpəˈluːzə/ (Lolla) is an annual American four-day music festival held in Grant Park in Chicago. It originally started as a touring event in 1991, but several years later, Chicago became its permanent location. Music genres include but are not limited to alternative rock, heavy metal, punk rock, hip hop, and electronic dance music. Lollapalooza has also featured visual arts, nonprofit organizations, and political organizations. The festival, held in Grant Park, hosts an estimated 400,000 people each July and sells out annually. Lollapalooza is one of the largest and most iconic music festivals in the world and one of the longest-running in the United States.\n",
759 | "Lollapalooza was conceived and created in 1991 as a farewell tour by Perry Farrell, singer of the group Jane's Addiction.. Write a response that appropriately completes the request.[/INST]"
760 | ]
761 | },
762 | {
763 | "cell_type": "markdown",
764 | "id": "d1587944-6ba0-48a7-b96a-2263f19dc6ac",
765 | "metadata": {},
766 | "source": [
767 | "#### Dataset ground truth:\n",
768 | "Lollapalooze is an annual musical festival held in Grant Park in Chicago, Illinois. It was started in 1991 as a farewell tour by Perry Farrell, singe of the group Jane's Addiction. The festival includes an array of musical genres including alternative rock, heavy metal, punk rock, hip hop, and electronic dance music. The festivals welcomes an estimated 400,000 people each year and sells out annually. Some notable headliners include: the Red Hot Chili Peppers, Chance the Rapper, Metallica, and Lady Gage. Lollapalooza is one of the largest and most iconic festivals in the world and a staple of Chicago."
769 | ]
770 | },
771 | {
772 | "cell_type": "code",
773 | "execution_count": 28,
774 | "id": "21abbf85-05df-453b-8704-65d229abe84e",
775 | "metadata": {
776 | "tags": []
777 | },
778 | "outputs": [],
779 | "source": [
780 | "import logging\n",
781 | "\n",
782 | "import boto3\n",
783 | "import json\n",
784 | "\n",
785 | "# Create a Boto3 client for SageMaker Runtime\n",
786 | "sagemaker_client = boto3.client(\"sagemaker-runtime\")\n",
787 | "\n",
788 | "max_tokens_to_sample = 200\n",
789 | " \n",
790 | "# Define the prompt and other parameters\n",
791 | "prompt = \"\"\"\n",
792 | "[INST] Below is the question based on the context. \n",
793 | "Question: Given a reference text about Lollapalooza, where does it take place, who started it and what is it?. \n",
794 | "Below is the given the context Lollapalooza /ˌlɒləpəˈluːzə/ (Lolla) is an annual American four-day music festival held in Grant Park in Chicago. \n",
795 | "It originally started as a touring event in 1991, but several years later, Chicago became its permanent location. Music genres include but are not limited to alternative rock, heavy metal, punk rock, hip hop, and electronic dance music. Lollapalooza has also featured visual arts, nonprofit organizations, and political organizations. \n",
796 | "The festival, held in Grant Park, hosts an estimated 400,000 people each July and sells out annually. Lollapalooza is one of the largest and most iconic music festivals in the world and one of the longest-running in the United States. Lollapalooza was conceived and created in 1991 as a farewell tour by Perry Farrell, singer of the group Jane's Addiction.. \n",
797 | "Write a response that appropriately completes the request.[/INST]\n",
798 | "\"\"\"\n",
799 | "\n",
800 | "# hyperparameters for llm\n",
801 | "parameters = {\n",
802 | " \"max_new_tokens\": max_tokens_to_sample,\n",
803 | " \"do_sample\": True,\n",
804 | " \"top_p\": 0.9,\n",
805 | " \"temperature\": 0.5,\n",
806 | "}\n",
807 | "\n",
808 | "contentType = 'application/json'\n",
809 | "\n",
810 | "body = json.dumps({\n",
811 | " \"inputs\": prompt,\n",
812 | " # specify the parameters as needed\n",
813 | " \"parameters\": parameters\n",
814 | "})\n",
815 | "\n"
816 | ]
817 | },
818 | {
819 | "cell_type": "code",
820 | "execution_count": 29,
821 | "id": "f4376239-b8a7-476a-a874-ccc10a2b2303",
822 | "metadata": {
823 | "tags": []
824 | },
825 | "outputs": [],
826 | "source": [
827 | "response = sagemaker_client.invoke_endpoint(\n",
828 | " EndpointName=endpoint_name, Body=body, ContentType=contentType)\n",
829 | "\n",
830 | "# Process the response\n",
831 | "response_body = json.loads(response.get('Body').read())"
832 | ]
833 | },
834 | {
835 | "cell_type": "code",
836 | "execution_count": 30,
837 | "id": "8e2508ff-1f63-4634-818c-50a0424fd132",
838 | "metadata": {
839 | "tags": []
840 | },
841 | "outputs": [
842 | {
843 | "name": "stdout",
844 | "output_type": "stream",
845 | "text": [
846 | "Response: Lollapalooze is an annual musical festival held in Grant Park in Chicago, Illinois. It was started in 1991 as a farewell tour by Perry Farrell, singe of the group Jane's Addiction. The festival includes an array of musical genres including alternative rock, heavy metal, punk rock, hip hop, and electronic dance music. The festivals welcomes an estimated 400,000 people each year and sells out annually. Some notable headliners include: the Red Hot Chili Peppers, Chance the Rapper, Metallica, and Lady Gage. Lollapalooza is one of the largest and most iconic festivals in the world and a staple of Chicago.\n"
847 | ]
848 | }
849 | ],
850 | "source": [
851 | "print(response_body['generated_text'])"
852 | ]
853 | },
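854 | {
855 | "cell_type": "markdown",
856 | "id": "added-streaming-sketch",
857 | "metadata": {},
858 | "source": [
859 | "For longer generations you can stream tokens instead of waiting for the full response. A rough sketch of the pattern, assuming the container is configured to emit newline-delimited JSON chunks (for example with `option.output_formatter=jsonlines` instead of the `json` formatter written above):\n",
860 | "\n",
861 | "```python\n",
862 | "import json\n",
863 | "\n",
864 | "# returns an event stream of PayloadPart chunks instead of a single body\n",
865 | "response = sagemaker_client.invoke_endpoint_with_response_stream(\n",
866 | "    EndpointName=endpoint_name, Body=body, ContentType=contentType\n",
867 | ")\n",
868 | "\n",
869 | "buffer = b\"\"\n",
870 | "for event in response[\"Body\"]:\n",
871 | "    buffer += event.get(\"PayloadPart\", {}).get(\"Bytes\", b\"\")\n",
872 | "    # each complete line is one JSON chunk; partial lines stay buffered\n",
873 | "    while b\"\\n\" in buffer:\n",
874 | "        line, buffer = buffer.split(b\"\\n\", 1)\n",
875 | "        if line.strip():\n",
876 | "            print(json.loads(line))\n",
877 | "```"
878 | ]
879 | },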
854 | {
855 | "cell_type": "markdown",
856 | "id": "510a4186-0d35-44f2-9040-b87321bc9e80",
857 | "metadata": {},
858 | "source": [
859 | "## Clean up\n"
860 | ]
861 | },
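862 | {
863 | "cell_type": "markdown",
864 | "id": "added-cleanup-by-name",
865 | "metadata": {},
866 | "source": [
867 | "Uncomment the next cell to delete the model and endpoint. If the `llm` predictor object is no longer in scope (for example after a kernel restart), a sketch like the following deletes the resources by name instead, assuming `endpoint_name` is still set (the SDK typically names the endpoint config after the endpoint):\n",
868 | "\n",
869 | "```python\n",
870 | "import boto3\n",
871 | "\n",
872 | "sm_client = boto3.client(\"sagemaker\")\n",
873 | "sm_client.delete_endpoint(EndpointName=endpoint_name)\n",
874 | "sm_client.delete_endpoint_config(EndpointConfigName=endpoint_name)\n",
875 | "```"
876 | ]
877 | },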
862 | {
863 | "cell_type": "code",
864 | "execution_count": null,
865 | "id": "9a20046d-6f3b-40b7-a984-414a4e5af055",
866 | "metadata": {},
867 | "outputs": [],
868 | "source": [
869 | "#llm.delete_model()\n",
870 | "#llm.delete_endpoint()"
871 | ]
872 | }
873 | ],
874 | "metadata": {
875 | "kernelspec": {
876 | "display_name": "conda_python3",
877 | "language": "python",
878 | "name": "conda_python3"
879 | },
880 | "language_info": {
881 | "codemirror_mode": {
882 | "name": "ipython",
883 | "version": 3
884 | },
885 | "file_extension": ".py",
886 | "mimetype": "text/x-python",
887 | "name": "python",
888 | "nbconvert_exporter": "python",
889 | "pygments_lexer": "ipython3",
890 | "version": "3.10.13"
891 | }
892 | },
893 | "nbformat": 4,
894 | "nbformat_minor": 5
895 | }
896 |
--------------------------------------------------------------------------------