├── FUNDING.yml ├── docs ├── quantization.md ├── merging_the_weights.md ├── making_datasets.md └── training.md ├── accelerate_config.yaml ├── README.md ├── research ├── abstract.md └── paper.md ├── merge_adapter_weights.py ├── .gitignore ├── utils.py ├── examples ├── alter_dataset.ipynb └── create_dataset.ipynb └── finetune_peft_8bit.py /FUNDING.yml: -------------------------------------------------------------------------------- 1 | github: serp-ai 2 | -------------------------------------------------------------------------------- /docs/quantization.md: -------------------------------------------------------------------------------- 1 | # Quantization 2 | 4/3/2-bit quantization instructions and the repo can be found [here](https://github.com/megvii-research/Sparsebit/blob/main/large_language_models/llama/quantization/README.md). 3 | 4 | Since transformers now has LLaMA compatibility, you can skip to the `Run` section and continue from there. 4-bit performs best, but you can also try setting the candidate bits to `2 3 4` or `3 4` for a mix of precisions. 5 | 6 | Make sure you merge your adapter before quantizing! -------------------------------------------------------------------------------- /accelerate_config.yaml: -------------------------------------------------------------------------------- 1 | compute_environment: LOCAL_MACHINE 2 | deepspeed_config: 3 | deepspeed_multinode_launcher: standard 4 | gradient_clipping: 1.0 5 | offload_optimizer_device: cpu 6 | offload_param_device: cpu 7 | zero3_init_flag: false 8 | zero_stage: 2 9 | distributed_type: DEEPSPEED 10 | downcast_bf16: no 11 | dynamo_backend: 'no' 12 | fsdp_config: {} 13 | machine_rank: 0 14 | main_training_function: main 15 | megatron_lm_config: {} 16 | mixed_precision: 'no' 17 | num_machines: 1 18 | num_processes: 8 19 | rdzv_backend: static 20 | same_network: true 21 | use_cpu: false -------------------------------------------------------------------------------- /docs/merging_the_weights.md: -------------------------------------------------------------------------------- 1 | # Weight merging 2 | You can merge the LoRA adapter with the foundation model for further training (such as the PPO stage of RLHF), for faster inference, to quantize the model to 4/3/2 bit, or simply to avoid juggling two sets of weights. 3 | 4 | All you have to do is run this command: 5 | - `python merge_adapter_weights.py --model_name=<lora adapter name or path> --output_dir=<output directory>` 6 | 7 | The LoRA config stores the path to the foundation model, so you can run the command without any extra arguments and it will merge the weights for you. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Chat LLaMA 2 | 3 | 8bit-LoRA or 4bit-LoRA 4 | 5 | Repository for training a LoRA for the LLaMA (1 and 2) models on HuggingFace with 8-bit or 4-bit quantization. LLaMA 1 is for research use only; LLaMA 2 is licensed for commercial use. 6 |
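
For orientation, here is a minimal sketch of the setup this repo wraps: loading a LLaMA checkpoint in 4-bit and attaching a LoRA adapter with `peft`. The checkpoint id is a placeholder and the LoRA values mirror the defaults in `finetune_peft_8bit.py`; treat this as a sketch rather than the exact training path.

```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base = "meta-llama/Llama-2-7b-hf"  # placeholder; any LLaMA 1/2 checkpoint you have access to
model = AutoModelForCausalLM.from_pretrained(
    base,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,                      # or load_in_8bit=True for 8-bit LoRA
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)
model = prepare_model_for_kbit_training(model)  # upcast norms / enable input grads for k-bit training
model = get_peft_model(model, LoraConfig(
    r=64, lora_alpha=32, lora_dropout=0.05,     # the repo's default LoRA hyperparameters
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none", task_type="CAUSAL_LM",
))
tokenizer = AutoTokenizer.from_pretrained(base)
```

From there, a standard `transformers.Trainer` run trains only the adapter weights; `finetune_peft_8bit.py` wires all of this up behind CLI flags, and the docs cover dataset creation, merging, and quantization.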


7 | 👉 [Join our Discord Server](https://serp.ly/@serpai/discord) for updates, support & collaboration 8 |


9 | 10 | Dataset creation, training, weight merging, and quantization instructions are in the [docs](docs/). 11 | 12 | # Check out our trained LoRAs on HuggingFace 13 | ## Anthropic's HH 14 | - [7B](https://huggingface.co/serpdotai/llama-hh-lora-7B) 15 | - [13B](https://huggingface.co/serpdotai/llama-hh-lora-13B) 16 | - [30B](https://huggingface.co/serpdotai/llama-hh-lora-30B) 17 | -------------------------------------------------------------------------------- /research/abstract.md: -------------------------------------------------------------------------------- 1 | Title: Leveraging Large Language Models with LLaMA: Dataset Creation, Weight Merging, Quantization, and Training 2 | Authors: [Author Name] ([Affiliation]), [Author Name] ([Affiliation]) 3 | Abstract: 4 | In this paper, we present a comprehensive approach to fine-tuning large language models (LLMs) using the LLaMA (Large Language Model Adaptation) framework. We demonstrate how to create finetune datasets, merge adapter weights, quantize models, and train LLMs with LoRA (Low-Rank Adapters). We provide details on gathering samples of inputs and outputs, constructing prompt templates, and training models to complete prompts. We also discuss the use of weight merging for faster inference and quantization for memory-efficient deployment. Our experiments show that our approach is effective in generating high-quality datasets and training LLMs with improved performance. 5 | Comments: [Number of pages] pages, [Number of figures] figures 6 | Report-no: [Report Number] (if applicable) 7 | Category: cs.AI (Artificial Intelligence) 8 | Journal-ref: [Journal Reference] (if applicable) 9 | DOI: [DOI] (if applicable) 10 | MSC-class: [MSC-class] (if applicable, for math archives only) 11 | ACM-class: [ACM-class] (if applicable, for cs archives only) 12 | -------------------------------------------------------------------------------- /docs/making_datasets.md: -------------------------------------------------------------------------------- 1 | # Making datasets 2 | 3 | When creating a finetune dataset, you can either gather samples of inputs and outputs and construct the prompt template the model will expect during inference, or you can use a corpus of text to train the model to complete prompts in the style of the corpus. (Both use the same training objective of predicting the next token in the sequence, but the way you construct the dataset and prompt the model differs.) 4 | 5 | ## Gathering samples 6 | 7 | The simplest way to create a dataset is to gather a set of inputs and outputs using existing LLMs. Another technique is to take an existing dataset (Anthropic's HH, for example) and use an LLM to alter the outputs and/or inputs to create a new dataset in the tone/personality of your choosing. This is a good way to create a dialogue-style dataset for a specific domain, like a specific company or a specific person. You can also generate datasets using few-shot templates with a desired task/API usage and have an LLM generate the outputs for you (inputs too if you'd like). (Examples of these are in the [examples](examples/) folder.) 8 | 9 | ## Training a model to complete prompts 10 | 11 | The other way to create a dataset is to train a model to complete prompts. This is a good way to create a dataset for creative writing, idea generation, or any other task where you want the model to finish your prompt in the way that it was trained.
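
A rough sketch of the preparation described in the rest of this section (the tokenizer checkpoint is a placeholder, and `block_size` should match what you plan to train with):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder base model
block_size = 2048  # match the --block_size you plan to train with

def chunk_corpus(documents):
    """Wrap each document in BOS/EOS, concatenate everything, and split into fixed-length blocks."""
    ids = []
    for text in documents:
        ids += [tokenizer.bos_token_id] + tokenizer.encode(text, add_special_tokens=False) + [tokenizer.eos_token_id]
    # drop the ragged tail; each block becomes one training sample
    return [ids[i : i + block_size] for i in range(0, len(ids) - block_size + 1, block_size)]

samples = chunk_corpus(["first document ...", "second document ..."])
```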
To create these datasets you can just add start and end tokens to each body of text, concatenate them, chunk them by the desired sequence length, and then train a model. 12 | 13 | ### Details 14 | - 30k - 50k samples is a good starting point for a dataset of this type. (10k or possibly less may also turn out well depending on the task and the quality of the data) 15 | - If the model doesn't seem to be modeling your data well, you can try adding more data, training for more epochs, or revisting your dataset to see if there are actual patterns for the model to learn. (If you're using a corpus of text, you can try using a different corpus, or filtering out irrelevant text) (If they are instructions or dialogue, you can try making sure the inputs and outputs match your desired behavior) 16 | -------------------------------------------------------------------------------- /merge_adapter_weights.py: -------------------------------------------------------------------------------- 1 | # Taken from https://github.com/lvwerra/trl/blob/main/examples/sentiment/scripts/gpt-neox-20b_peft/merge_peft_adapter.py for ease of use (copy and pasted) 2 | 3 | from dataclasses import dataclass, field 4 | from typing import Optional 5 | 6 | import peft 7 | import torch 8 | from peft import PeftConfig, PeftModel 9 | from transformers import AutoModelForCausalLM, AutoTokenizer, HfArgumentParser, LlamaForCausalLM, LlamaTokenizer 10 | 11 | 12 | @dataclass 13 | class ScriptArguments: 14 | """ 15 | The name of the Casual LM model we wish to fine with PPO 16 | """ 17 | 18 | # NOTE: gpt2 models use Conv1D instead of Linear layers which are not yet supported in 8 bit mode 19 | # models like gpt-neo* models are more suitable 20 | model_name: Optional[str] = field(default="serpdotai/llama-hh-lora-7B", metadata={"help": "the lora model name"}) 21 | output_dir: Optional[str] = field(default="llama-7B-hh-adapter-merged", metadata={"help": "the output directory"}) 22 | push_to_hub: Optional[bool] = field(default=False, metadata={"help": "push the model to the huggingface hub"}) 23 | 24 | 25 | parser = HfArgumentParser(ScriptArguments) 26 | script_args = parser.parse_args_into_dataclasses()[0] 27 | 28 | peft_model_id = script_args.model_name 29 | peft_config = PeftConfig.from_pretrained(peft_model_id) 30 | model = LlamaForCausalLM.from_pretrained( 31 | peft_config.base_model_name_or_path, 32 | return_dict=True, 33 | torch_dtype=torch.float16, 34 | ) 35 | tokenizer =LlamaTokenizer.from_pretrained(peft_config.base_model_name_or_path) 36 | 37 | # Load the Lora model 38 | model = PeftModel.from_pretrained(model, peft_model_id) 39 | model.eval() 40 | 41 | key_list = [key for key, _ in model.base_model.model.named_modules() if "lora" not in key] 42 | for key in key_list: 43 | parent, target, target_name = model.base_model._get_submodules(key) 44 | if isinstance(target, peft.tuners.lora.Linear): 45 | bias = target.bias is not None 46 | new_module = torch.nn.Linear(target.in_features, target.out_features, bias=bias) 47 | model.base_model._replace_module(parent, target_name, new_module, target) 48 | 49 | model = model.base_model.model 50 | 51 | if script_args.push_to_hub: 52 | model.push_to_hub(f"{script_args.model_name}-adapter-merged", use_temp_dir=False) 53 | model.save_pretrained(script_args.output_dir) 54 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | LoRA/ 2 | wandb/ 3 | 4 | # Byte-compiled / 
optimized / DLL files 5 | __pycache__/ 6 | *.py[cod] 7 | *$py.class 8 | 9 | # C extensions 10 | *.so 11 | 12 | # Distribution / packaging 13 | .Python 14 | build/ 15 | develop-eggs/ 16 | dist/ 17 | downloads/ 18 | eggs/ 19 | .eggs/ 20 | lib/ 21 | lib64/ 22 | parts/ 23 | sdist/ 24 | var/ 25 | wheels/ 26 | pip-wheel-metadata/ 27 | share/python-wheels/ 28 | *.egg-info/ 29 | .installed.cfg 30 | *.egg 31 | MANIFEST 32 | 33 | # PyInstaller 34 | # Usually these files are written by a python script from a template 35 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 36 | *.manifest 37 | *.spec 38 | 39 | # Installer logs 40 | pip-log.txt 41 | pip-delete-this-directory.txt 42 | 43 | # Unit test / coverage reports 44 | htmlcov/ 45 | .tox/ 46 | .nox/ 47 | .coverage 48 | .coverage.* 49 | .cache 50 | nosetests.xml 51 | coverage.xml 52 | *.cover 53 | *.py,cover 54 | .hypothesis/ 55 | .pytest_cache/ 56 | 57 | # Translations 58 | *.mo 59 | *.pot 60 | 61 | # Django stuff: 62 | *.log 63 | local_settings.py 64 | db.sqlite3 65 | db.sqlite3-journal 66 | 67 | # Flask stuff: 68 | instance/ 69 | .webassets-cache 70 | 71 | # Scrapy stuff: 72 | .scrapy 73 | 74 | # Sphinx documentation 75 | docs/_build/ 76 | 77 | # PyBuilder 78 | target/ 79 | 80 | # Jupyter Notebook 81 | .ipynb_checkpoints 82 | 83 | # IPython 84 | profile_default/ 85 | ipython_config.py 86 | 87 | # pyenv 88 | .python-version 89 | 90 | # pipenv 91 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 92 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 93 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 94 | # install all needed dependencies. 95 | #Pipfile.lock 96 | 97 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow 98 | __pypackages__/ 99 | 100 | # Celery stuff 101 | celerybeat-schedule 102 | celerybeat.pid 103 | 104 | # SageMath parsed files 105 | *.sage.py 106 | 107 | # Environments 108 | .env 109 | .venv 110 | env/ 111 | venv/ 112 | ENV/ 113 | env.bak/ 114 | venv.bak/ 115 | 116 | # Spyder project settings 117 | .spyderproject 118 | .spyproject 119 | 120 | # Rope project settings 121 | .ropeproject 122 | 123 | # mkdocs documentation 124 | /site 125 | 126 | # mypy 127 | .mypy_cache/ 128 | .dmypy.json 129 | dmypy.json 130 | 131 | # Pyre type checker 132 | .pyre/ 133 | -------------------------------------------------------------------------------- /utils.py: -------------------------------------------------------------------------------- 1 | import transformers 2 | import torch 3 | import torch.nn as nn 4 | from typing import Dict 5 | 6 | 7 | def smart_tokenizer_and_embedding_resize( 8 | special_tokens_dict: Dict, 9 | tokenizer: transformers.PreTrainedTokenizer, 10 | model: transformers.PreTrainedModel, 11 | ): 12 | """From: https://github.com/artidoro/qlora/blob/main/qlora.py 13 | Resize tokenizer and embedding. 14 | 15 | Note: This is the unoptimized version that may make your embedding size not be divisible by 64. 
16 | """ 17 | num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict) 18 | model.resize_token_embeddings(len(tokenizer)) 19 | 20 | input_embeddings = model.get_input_embeddings() 21 | output_embeddings = model.get_output_embeddings() 22 | if num_new_tokens > 0: 23 | input_embeddings_data = input_embeddings.weight.data 24 | output_embeddings_data = output_embeddings.weight.data 25 | 26 | input_embeddings_avg = input_embeddings_data[:-num_new_tokens].mean(dim=0, keepdim=True) 27 | output_embeddings_avg = output_embeddings_data[:-num_new_tokens].mean(dim=0, keepdim=True) 28 | 29 | input_embeddings_data[-num_new_tokens:] = input_embeddings_avg 30 | output_embeddings_data[-num_new_tokens:] = output_embeddings_avg 31 | model.tie_weights() 32 | 33 | # Temporary bug fix #214: 34 | # freeze embeddings otherwise need to store them with checkpoint 35 | input_embeddings.weight.requires_grad = False 36 | output_embeddings.weight.requires_grad = False 37 | # re-register forward hook 38 | if hasattr(model, "enable_input_require_grads"): 39 | model.enable_input_require_grads() 40 | else: 41 | 42 | def make_inputs_require_grad(module, input, output): 43 | output.requires_grad_(True) 44 | 45 | model.get_input_embeddings().register_forward_hook(make_inputs_require_grad) 46 | 47 | 48 | class CastOutputToFloat(nn.Sequential): 49 | def forward(self, x): return super().forward(x).to(torch.float32) 50 | 51 | 52 | def print_trainable_parameters(args, model): 53 | """ 54 | Prints the number of trainable parameters in the model. 55 | """ 56 | trainable_params = 0 57 | all_param = 0 58 | for _, param in model.named_parameters(): 59 | all_param += param.numel() 60 | if param.requires_grad: 61 | trainable_params += param.numel() 62 | if args.bits == 4: trainable_params /= 2 63 | print( 64 | f"trainable params: {trainable_params} || " 65 | f"all params: {all_param} || " 66 | f"trainable: {100 * trainable_params / all_param}" 67 | ) 68 | 69 | 70 | def str_or_bool(value): 71 | if str(value).lower() in ('yes', 'true', 't', 'y', '1'): 72 | return True 73 | elif str(value).lower() in ('no', 'false', 'f', 'n', '0'): 74 | return False 75 | else: 76 | return str(value) # if it's not a recognizable boolean, treat it as a string 77 | -------------------------------------------------------------------------------- /research/paper.md: -------------------------------------------------------------------------------- 1 | Title: Leveraging Large Language Models with LLaMA: Dataset Creation, Weight Merging, Quantization, and Training 2 | 3 | Authors: Francis LaBounty Jr. and Devin Schumacher 4 | 5 | Affiliations: SERP AI 6 | 7 | Abstract: 8 | In this paper, we present a comprehensive approach to fine-tuning large language models (LLMs) using the LLaMA (Large Language Model Adaptation) framework. We demonstrate how to create finetune datasets, merge adapter weights, quantize models, and train LLMs with LoRA (Low-Rank Adapters). We provide details on gathering samples of inputs and outputs, constructing prompt templates, and training models to complete prompts. We also discuss the use of weight merging for faster inference and quantization for memory-efficient deployment. Our experiments show that our approach is effective in generating high-quality datasets and training LLMs with improved performance. 9 | 10 | Introduction 11 | Large language models (LLMs) have achieved state-of-the-art performance on a wide range of natural language processing (NLP) tasks. 
However, fine-tuning LLMs for specific tasks or domains can be challenging due to the need for high-quality datasets, efficient training techniques, and memory-efficient deployment. In this paper, we present a comprehensive approach to fine-tuning LLMs using the LLaMA (Large Language Model Adaptation) framework. Our approach includes creating finetune datasets, merging adapter weights, quantizing models, and training LLMs with LoRA (Low-Rank Adapters). 12 | 13 | Making Datasets 14 | 2.1 Gathering Samples 15 | The simplest way to create a dataset is to gather a set of inputs and outputs using existing LLMs. We demonstrate how to take an existing dataset (e.g., Anthropic's HH) and use an LLM to alter the outputs and/or inputs to create a new dataset in the tone or personality of one's choosing. We also show how to generate datasets using few-shot templates with a desired task or API usage and have an LLM generate the outputs (and inputs, if desired). 16 | 17 | 2.2 Training a Model to Complete Prompts 18 | We discuss training a model to complete prompts for tasks such as creative writing and idea generation. We describe how to add start and end tokens to each body of text, concatenate them, chunk them by the desired sequence length, and train a model. 19 | 20 | Weight Merging 21 | We present a technique to merge the LoRA adapter with the foundation model for further training, faster inference, and quantization. We provide a command to merge adapter weights and discuss the benefits of this approach. 22 | 23 | Quantization 24 | We discuss quantizing the model to 4/3/2 bit for memory-efficient deployment. We provide instructions and a link to the repository for quantization. We also emphasize the importance of merging the adapter before quantization. 25 | 26 | Training 27 | We provide details on training LLMs with LoRA using the Peft library. We discuss the recommended starting parameters, requirements, and training scripts. We also present optional techniques such as flash attention for faster training. 28 | 29 | Conclusion 30 | In conclusion, we have presented a comprehensive approach to fine-tuning LLMs using the LLaMA framework. Our approach includes creating finetune datasets, merging adapter weights, quantizing models, and training LLMs with LoRA. Our experiments demonstrate the effectiveness of our approach in generating high-quality datasets and training LLMs with improved performance. We believe that our approach will be valuable to researchers and practitioners working with large language models. 31 | 32 | References: 33 | [1] [Reference details] 34 | [2] [Reference details] 35 | [3] [Reference details] 36 | 37 | Note: Please replace the placeholders such as [Author Name], [Affiliation], and [Reference details] with the actual information. 
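
Technical note: the low-rank adaptation referred to throughout keeps each pretrained weight matrix $W_0 \in \mathbb{R}^{d \times k}$ frozen and learns only a low-rank correction,

$$W = W_0 + \frac{\alpha}{r} B A, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k),$$

where the rank $r$ and scaling $\alpha$ correspond to the `--r` and `--lora_alpha` arguments of the training script.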
38 | -------------------------------------------------------------------------------- /examples/alter_dataset.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 10, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "# imports and setup\n", 10 | "import openai\n", 11 | "from datasets import load_dataset\n", 12 | "\n", 13 | "api_key = ''\n", 14 | "chat_model = 'gpt-3.5-turbo'\n", 15 | "temperature = 1.0\n", 16 | "top_p = 1.0\n", 17 | "max_tokens = 2048 # llama's max sequence length\n", 18 | "presence_penalty = 0.0\n", 19 | "frequency_penalty = 0.0\n", 20 | "logit_bias = {}\n", 21 | "\n", 22 | "openai.api_key = api_key" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 11, 28 | "metadata": {}, 29 | "outputs": [ 30 | { 31 | "name": "stderr", 32 | "output_type": "stream", 33 | "text": [ 34 | "Found cached dataset parquet (C:/Users/labou/.cache/huggingface/datasets/Dahoas___parquet/default-b25c081aeeca3652/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)\n" 35 | ] 36 | }, 37 | { 38 | "data": { 39 | "application/vnd.jupyter.widget-view+json": { 40 | "model_id": "d169f505f83949c88ec0ca414106b0f0", 41 | "version_major": 2, 42 | "version_minor": 0 43 | }, 44 | "text/plain": [ 45 | " 0%| | 0/2 [00:00` or login with the hugginface cli) 28 | 29 | Example launch command: 30 | `python finetune_peft_8bit.py --num_train_epochs=2 --model_name_or_path=meta-llama/Llama-2-7b-hf --model_output_dir=LLaMA/LoRA/7B --output_dir=LLaMA/LoRA/train/7B --block_size=600 --per_device_train_batch_size=2 --gradient_accumulation_steps=64 --fp16=true --logging_steps=1 --log_level=info --learning_rate=2.0e-04 --lr_scheduler_type=linear --warmup_ratio=0.06 --weight_decay=0.1 --optim=adamw_torch_fused --evaluation_strategy=steps --save_strategy=steps --eval_steps=400 --save_steps=400 --output_dir="LoRA" --save_total_limit=3 --load_best_model_at_end=True --dataset_name=Dahoas/full-hh-rlhf --r=64 --lora_alpha=32 --lora_dropout=0.05` 31 | 32 | 4-bit training example: 33 | `python finetune_peft_8bit.py --num_train_epochs=1 --model_name_or_path=meta-llama/Llama-2-7b-hf --model_output_dir=LLaMA/LoRA/7B --output_dir=LLaMA/LoRA/train/7B --bits=4 --bf16 --quant_type=nf4 --double_quant=True --gradient_checkpointing=True --block_size=2048 --per_device_train_batch_size=4 --gradient_accumulation_steps=32 --logging_steps=1 --log_level=info --learning_rate=2.0e-04 --lr_scheduler_type=linear --warmup_ratio=0.06 --weight_decay=0.1 --optim=paged_adamw_32bit --evaluation_strategy=steps --save_strategy=steps --eval_steps=400 --save_steps=400 --output_dir="LoRA" --save_total_limit=3 --load_best_model_at_end=True --dataset_name=Dahoas/full-hh-rlhf --r=64 --lora_alpha=32 --lora_dropout=0.05 --max_grad_norm=0.3` 34 | 35 | Or if using multiple gpus (make sure you have the correct number of gpus in the `accelerate_config.yaml` file) 36 | `accelerate launch --config_file=accelerate_config.yaml finetune_peft_8bit.py --multi_gpu=True --tensor_parallel=False --num_train_epochs=2 --model_name_or_path=meta-llama/Llama-2-13b-hf --model_output_dir=LLaMA/LoRA/13B --output_dir=LLaMA/LoRA/train/13B --block_size=600 --per_device_train_batch_size=2 --gradient_accumulation_steps=8 --fp16=true --logging_steps=1 --log_level=info --learning_rate=2.0e-04 --lr_scheduler_type=linear --warmup_ratio=0.06 --weight_decay=0.1 --optim=adamw_torch_fused --evaluation_strategy=steps --save_strategy=steps 
--eval_steps=400 --save_steps=400 --output_dir="LoRA" --save_total_limit=3 --load_best_model_at_end=True --remove_unused_columns=False --dataset_name=Dahoas/full-hh-rlhf --r=64, --lora_alpha=32, --lora_dropout=0.05` 37 | 38 | Or if using multiple gpus and tensor parrallelism (go in to the main training file and edit the `get_device_map` function with your number of devices and desired memory allocation) 39 | `python finetune_peft_8bit.py --num_train_epochs=2 --multi_gpu=True --tensor_parallel=True --model_name_or_path=meta-llama/Llama-2-70b-hf --model_output_dir=LLaMA/LoRA/70B --output_dir=LLaMA/LoRA/train/70B --block_size=600 --per_device_train_batch_size=2 --gradient_accumulation_steps=64 --fp16=true --logging_steps=1 --log_level=info --learning_rate=2.0e-04 --lr_scheduler_type=linear --warmup_ratio=0.06 --weight_decay=0.1 --optim=adamw_torch_fused --evaluation_strategy=steps --save_strategy=steps --eval_steps=400 --save_steps=400 --output_dir="LoRA" --save_total_limit=3 --load_best_model_at_end=True --dataset_name=Dahoas/full-hh-rlhf --r=64 --lora_alpha=32 --lora_dropout=0.05` 40 | 41 | 42 | ## Flash attention (optional - pytorch 2.0 required) 43 | If you want to use flash attention, you need to change the source code of the transformers library. You can find the source code [here](https://github.com/huggingface/transformers). You need to change the `modeling_llama.py` file. located at `transformers/src/transformers/models/llama/modeling_llama.py`. You need to change the `LlamaAttention` class to the following: 44 | 45 | ```python 46 | class LlamaAttention(nn.Module): 47 | """Multi-headed attention from 'Attention Is All You Need' paper""" 48 | 49 | def __init__( 50 | self, 51 | hidden_size: int, 52 | num_heads: int, 53 | ): 54 | super().__init__() 55 | self.hidden_size = hidden_size 56 | self.num_heads = num_heads 57 | self.head_dim = hidden_size // num_heads 58 | 59 | if (self.head_dim * num_heads) != self.hidden_size: 60 | raise ValueError( 61 | f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" 62 | f" and `num_heads`: {num_heads})." 63 | ) 64 | self.q_proj = nn.Linear( 65 | hidden_size, 66 | num_heads * self.head_dim, 67 | bias=False, 68 | ) 69 | self.k_proj = nn.Linear( 70 | hidden_size, 71 | num_heads * self.head_dim, 72 | bias=False, 73 | ) 74 | self.v_proj = nn.Linear( 75 | hidden_size, 76 | num_heads * self.head_dim, 77 | bias=False, 78 | ) 79 | self.o_proj = nn.Linear( 80 | num_heads * self.head_dim, 81 | hidden_size, 82 | bias=False, 83 | ) 84 | self.rotary_emb = LlamaRotaryEmbedding(self.head_dim) 85 | 86 | self.flash = hasattr(torch.nn.functional, 'scaled_dot_product_attention') 87 | if not self.flash: 88 | print("WARNING: using slow attention. 
Flash Attention requires PyTorch >= 2.0") 89 | self.register_buffer("bias", torch.tril(torch.ones(hidden_size, hidden_size)) 90 | .view(1, 1, hidden_size, hidden_size)) 91 | 92 | def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int): 93 | return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous() 94 | 95 | def forward( 96 | self, 97 | hidden_states: torch.Tensor, 98 | past_key_value: Optional[Tuple[torch.Tensor]] = None, 99 | attention_mask: Optional[torch.Tensor] = None, 100 | output_attentions: bool = False, 101 | use_cache: bool = False, 102 | ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]: 103 | """Input shape: Batch x Time x Channel""" 104 | 105 | bsz, q_len, _ = hidden_states.size() 106 | 107 | query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2) 108 | key_states = self.k_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2) 109 | value_states = self.v_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2) 110 | 111 | kv_seq_len = key_states.shape[-2] 112 | offset = 0 113 | if past_key_value is not None: 114 | offset = past_key_value[0].shape[-2] 115 | kv_seq_len += offset 116 | cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) 117 | query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, offset=offset) 118 | # [bsz, nh, t, hd] 119 | 120 | if past_key_value is not None: 121 | # reuse k, v, self_attention 122 | key_states = torch.cat([past_key_value[0], key_states], dim=2) 123 | value_states = torch.cat([past_key_value[1], value_states], dim=2) 124 | 125 | if self.flash: 126 | attn_output = torch.nn.functional.scaled_dot_product_attention(query_states, key_states, value_states, attn_mask=attention_mask) 127 | else: 128 | past_key_value = (key_states, value_states) 129 | 130 | attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim) 131 | 132 | if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len): 133 | raise ValueError( 134 | f"Attention weights should be of size {(bsz * self.num_heads, q_len, kv_seq_len)}, but is" 135 | f" {attn_weights.size()}" 136 | ) 137 | 138 | if attention_mask is not None: 139 | if attention_mask.size() != (bsz, 1, q_len, kv_seq_len): 140 | raise ValueError( 141 | f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}" 142 | ) 143 | attn_weights = attn_weights + attention_mask 144 | attn_weights = torch.max(attn_weights, torch.tensor(torch.finfo(attn_weights.dtype).min)) 145 | 146 | # upcast attention to fp32 147 | attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype) 148 | attn_output = torch.matmul(attn_weights, value_states) 149 | 150 | if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim): 151 | raise ValueError( 152 | f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" 153 | f" {attn_output.size()}" 154 | ) 155 | 156 | attn_output = attn_output.transpose(1, 2) 157 | attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) 158 | 159 | attn_output = self.o_proj(attn_output) 160 | 161 | if not output_attentions: 162 | attn_weights = None 163 | 164 | return attn_output, attn_weights, past_key_value 165 | ``` 166 | 167 | The changes are 168 | ```python 169 | self.flash = hasattr(torch.nn.functional, 'scaled_dot_product_attention') 170 | 
if not self.flash: 171 | print("WARNING: using slow attention. Flash Attention requires PyTorch >= 2.0") 172 | self.register_buffer("bias", torch.tril(torch.ones(hidden_size, hidden_size)) 173 | .view(1, 1, hidden_size, hidden_size)) 174 | ``` 175 | and 176 | ```python 177 | if self.flash: 178 | attn_output = torch.nn.functional.scaled_dot_product_attention(query_states, key_states, value_states, attn_mask=attention_mask) 179 | else: 180 | ... 181 | ``` 182 | 183 | After those changes, pip install the local repo by cd'ing in to the transformers repo (root directory) and then running `pip install .`. Then, add/uncomment the following to the imports of the training script: 184 | ```python 185 | import torch.backends.cuda 186 | torch.backends.cuda.enable_flash_sdp(enabled=True) 187 | ``` 188 | -------------------------------------------------------------------------------- /examples/create_dataset.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 23, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "# imports and setup\n", 10 | "import openai\n", 11 | "from duckduckgo_search import ddg\n", 12 | "import re\n", 13 | "\n", 14 | "api_key = ''\n", 15 | "chat_model = 'gpt-3.5-turbo'\n", 16 | "temperature = 1.0\n", 17 | "top_p = 1.0\n", 18 | "max_tokens = 2048 # llama's max sequence length\n", 19 | "presence_penalty = 0.0\n", 20 | "frequency_penalty = 0.0\n", 21 | "logit_bias = {}\n", 22 | "\n", 23 | "# change to None if you don't want to use any stop sequence\n", 24 | "stop = ['Observation:']\n", 25 | "\n", 26 | "openai.api_key = api_key" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": null, 32 | "metadata": {}, 33 | "outputs": [], 34 | "source": [ 35 | "# This is an example on a chain of thought loop with tool usage (add support for whatever tool/api you want)" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 62, 41 | "metadata": {}, 42 | "outputs": [], 43 | "source": [ 44 | "def parse_action(text):\n", 45 | " try:\n", 46 | " # Regular expression pattern to match the last \"Action\" and \"Action Input\" in the text\n", 47 | " pattern = r'Action: ([^\\n]*\\n)+Action Input: ([^\\n]*)$'\n", 48 | "\n", 49 | " match = re.search(pattern, text, re.MULTILINE)\n", 50 | "\n", 51 | " if match:\n", 52 | " last_action = match.group(1).strip()\n", 53 | " last_action_input = match.group(2).strip()\n", 54 | " if last_action.lower() == 'web search':\n", 55 | " last_action_input = last_action_input.strip('\\\"') # remove quotes to receive better search results\n", 56 | " print('Searching for: ' + last_action_input + '...')\n", 57 | " results = ddg(keywords=last_action_input, safesearch='Off', time=None, max_results=5)\n", 58 | " out = '{'\n", 59 | " for result in results:\n", 60 | " out += 'title: ' + result['title'] + ',\\n\\tbody: ' + result['body'] + ',\\n\\t'\n", 61 | " return out.strip() + '}'\n", 62 | " elif last_action.lower() == 'calculator':\n", 63 | " print('Calculating: ' + last_action_input + '...')\n", 64 | " # rough example of a calculator\n", 65 | " return eval(last_action_input)\n", 66 | " else:\n", 67 | " return None\n", 68 | " else:\n", 69 | " return None\n", 70 | " except:\n", 71 | " return None" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 55, 77 | "metadata": {}, 78 | "outputs": [], 79 | "source": [ 80 | "# define prompt (adapted from gpt-4 paper)\n", 81 | "prompt = \"\"\"Answer the following 
questions as best as you can. You have access to the following tools:\n", 82 | "Web Search: Searches the web for the given search query.\n", 83 | "Calculator: Performs basic arithmetic operations.\n", 84 | "Use the following format:\n", 85 | "Question: the input question you must answer\n", 86 | "Thought: you should always think about what to do\n", 87 | "Action: the action you to take, should be one of [Web Search, Calculator]\n", 88 | "Action Input: the input to the action\n", 89 | "Observation: the result of the action\n", 90 | "... (this Thought/Action/Observation loop can repeat N times)\n", 91 | "Thought: I now know the final answer\n", 92 | "Final Answer: the final answer to the original input question\n", 93 | "Begin!\n", 94 | "Question: {question}\"\"\"" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": 5, 100 | "metadata": {}, 101 | "outputs": [], 102 | "source": [ 103 | "# load model_inputs if applicable\n", 104 | "questions = [\n", 105 | " 'How many moons are in the solar system?',\n", 106 | " 'What is the capital of France?',\n", 107 | " 'What is the latest ml research?',\n", 108 | "]" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": 60, 114 | "metadata": {}, 115 | "outputs": [ 116 | { 117 | "name": "stdout", 118 | "output_type": "stream", 119 | "text": [ 120 | "Searching for: \"how many moons are in the solar system\"...\n", 121 | "Searching for: \"how many moons are in the solar system in total\"...\n" 122 | ] 123 | } 124 | ], 125 | "source": [ 126 | "# loop through questions\n", 127 | "responses = []\n", 128 | "for question in questions:\n", 129 | " # Initial response\n", 130 | " user_messages = [{\n", 131 | " 'role': 'user',\n", 132 | " 'content': prompt.format(question=question)\n", 133 | " }]\n", 134 | " response = openai.ChatCompletion.create(\n", 135 | " model=chat_model,\n", 136 | " messages=user_messages,\n", 137 | " temperature=temperature,\n", 138 | " top_p=top_p,\n", 139 | " n=1,\n", 140 | " stream=False,\n", 141 | " stop=stop,\n", 142 | " max_tokens=max_tokens,\n", 143 | " presence_penalty=presence_penalty,\n", 144 | " frequency_penalty=frequency_penalty,\n", 145 | " logit_bias=logit_bias,\n", 146 | " user=''\n", 147 | " )\n", 148 | "\n", 149 | " # action loop\n", 150 | " next_message = ''\n", 151 | " while True:\n", 152 | " action_result = parse_action(response.choices[0].message.content)\n", 153 | " if next_message != '':\n", 154 | " next_message = next_message + ' '\n", 155 | " if action_result is None:\n", 156 | " break\n", 157 | " next_message += response.choices[0].message.content.strip() + '\\nObservation: ' + action_result + '\\nThought:'\n", 158 | " # get the next action\n", 159 | " response = openai.ChatCompletion.create(\n", 160 | " model=chat_model,\n", 161 | " messages=[user_messages[0], {'role': 'assistant', 'content': next_message}],\n", 162 | " temperature=temperature,\n", 163 | " top_p=top_p,\n", 164 | " n=1,\n", 165 | " stream=False,\n", 166 | " stop=stop,\n", 167 | " max_tokens=max_tokens,\n", 168 | " presence_penalty=presence_penalty,\n", 169 | " frequency_penalty=frequency_penalty,\n", 170 | " logit_bias=logit_bias,\n", 171 | " user=''\n", 172 | " )\n", 173 | " \n", 174 | " responses.append({'question': question, 'answer': next_message + response.choices[0].message.content.strip()})" 175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "execution_count": 61, 180 | "metadata": {}, 181 | "outputs": [ 182 | { 183 | "name": "stdout", 184 | "output_type": "stream", 185 | "text": [ 186 | 
"Thought: I don't think I know the answer to this question off the top of my head.\n", 187 | "Action: Web Search\n", 188 | "Action Input: \"how many moons are in the solar system\"\n", 189 | "Observation: {title: Overview | Moons - NASA Solar System Exploration,\n", 190 | "\tbody: How Many Moons Are There in the Solar System? The traditional moon count most people are familiar with stands at 226: One moon for Earth; Two for Mars; 95 at Jupiter; 83 at Saturn; 27 at Uranus; 14 at Neptune; and 5 for dwarf planet Pluto.According to NASA/JPLs Solar System Dynamics team, astronomers have documented another 462 moons orbiting smaller objects, such as asteroids, dwarf ...,\n", 191 | "\ttitle: How Many Moons? | NASA Space Place - NASA Science for Kids,\n", 192 | "\tbody: Uranus and Neptune. Uranus has 27 moons that we know of. Some of them are half made of ice. Lastly, Neptune has 14 named moons. One of Neptunes moons, Triton, is as big as dwarf planet Pluto. To learn more about the moons in our solar system, visit the NASA Solar System Exploration moons page. article last updated August 29, 2022.,\n", 193 | "\ttitle: Our Solar System | NASA Solar System Exploration,\n", 194 | "\tbody: Our solar system is made up of a star—the Sun—eight planets, 146 moons, a bunch of comets, asteroids and space rocks, ice, and several dwarf planets, such as Pluto.,\n", 195 | "\ttitle: How Many Moons are in the Solar System? - Universe Today,\n", 196 | "\tbody: Similar to Jupiter, it is estimated that Saturn has at least 150 moons and moonlets, but only 83 of these moons have been given official names or designations. Of these, 57 are less than 10 km (6. ...,\n", 197 | "\ttitle: List of moons | Britannica,\n", 198 | "\tbody: Table of Contents. There are 171 moons, or natural satellites, orbiting the planets in our solar system; Earth, Mars, Jupiter, Saturn, Uranus, and Neptune have 1, 2, 66, 62, 27, and 13 moons, respectively. The following is a list of some of the major planetary moons, including those of the dwarf planet Pluto.,}\n", 199 | "Thought: There are likely more than 200 moons in the solar system.\n", 200 | "Action: Web Search\n", 201 | "Action Input: \"how many moons are in the solar system in total\"\n", 202 | "Observation: {title: How Many Moons are in the Solar System? - Universe Today,\n", 203 | "\tbody: Similar to Jupiter, it is estimated that Saturn has at least 150 moons and moonlets, but only 83 of these moons have been given official names or designations. Of these, 57 are less than 10 km (6. ...,\n", 204 | "\ttitle: List of Moons in the Solar System · Facts and information,\n", 205 | "\tbody: Moons in the Solar System. There are currently 181 known moons in our solar system orbiting the various planets and dwarf planets. Of the 13 planets and dwarf planets, there are four which dont have any moons. These are the planets Mercury and Venus, and the dwarf planets Ceres and Makemake.,\n", 206 | "\ttitle: Total Number of Moons in the Solar System with Facts - Planets Education,\n", 207 | "\tbody: Solar System. There are a total number of 216 moons (natural satellites) in the solar system for Planets and Dwarf planets. All 8 planets have a total of 207 moons and all 5 dwarf planets have a total of 9 moons. Some Possible Dwarf Planets and Minor Planets including Asteroids also have small moons. Where some planets have no moon, some have ...,\n", 208 | "\ttitle: What are the moons of the solar system and how many are there - ZME Science,\n", 209 | "\tbody: A A. 
This is an illustration of some of the most significant moons of our solar system at their correct relative sizes to each other and to Earth. Pictured are Earths Moon; Jupiters Callisto ...,\n", 210 | "\ttitle: How many moons does our solar system have? - Universal-Sci,\n", 211 | "\tbody: In truth, answering that question requires a bit of a clarification first. If we are talking about confirmed moons that orbit any of the planets of the Solar System (i.e. those that are consistent with the definition adopted by the IAU in 2006 ), then we can say that there are currently 173 known moons.,}\n", 212 | "Thought: There are approximately 181 moons in the solar system according to my web search results.\n", 213 | "Final Answer: Approximately 181 moons.\n" 214 | ] 215 | } 216 | ], 217 | "source": [ 218 | "print(responses[0]['answer'])" 219 | ] 220 | }, 221 | { 222 | "cell_type": "code", 223 | "execution_count": null, 224 | "metadata": {}, 225 | "outputs": [], 226 | "source": [ 227 | "# Then you can just create the training dataset by doing combining the question and answer in to one prompt in the training script data loading portion." 228 | ] 229 | } 230 | ], 231 | "metadata": { 232 | "kernelspec": { 233 | "display_name": "Python 3", 234 | "language": "python", 235 | "name": "python3" 236 | }, 237 | "language_info": { 238 | "codemirror_mode": { 239 | "name": "ipython", 240 | "version": 3 241 | }, 242 | "file_extension": ".py", 243 | "mimetype": "text/x-python", 244 | "name": "python", 245 | "nbconvert_exporter": "python", 246 | "pygments_lexer": "ipython3", 247 | "version": "3.10.8" 248 | }, 249 | "orig_nbformat": 4 250 | }, 251 | "nbformat": 4, 252 | "nbformat_minor": 2 253 | } 254 | -------------------------------------------------------------------------------- /finetune_peft_8bit.py: -------------------------------------------------------------------------------- 1 | # Code used from https://github.com/CarperAI/trlx/blob/main/examples/hh/sft_hh.py https://github.com/lvwerra/trl/blob/main/examples/sentiment/scripts/gpt-neox-20b_peft/gpt-neo-20b_sentiment_peft.py 2 | 3 | from dataclasses import dataclass, field 4 | from itertools import chain 5 | from typing import Optional, Union 6 | 7 | import os 8 | import torch 9 | import torch.nn as nn 10 | import transformers 11 | import accelerate 12 | from datasets import load_dataset 13 | from peft import LoraConfig, PeftConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training 14 | from peft.tuners.lora import LoraLayer 15 | from transformers import AutoModelForCausalLM, AutoTokenizer, HfArgumentParser, TrainingArguments, LlamaTokenizer, LlamaForCausalLM, LlamaConfig 16 | 17 | from accelerate import init_empty_weights, infer_auto_device_map 18 | from transformers import AutoConfig, set_seed, BitsAndBytesConfig 19 | 20 | from utils import CastOutputToFloat, smart_tokenizer_and_embedding_resize, print_trainable_parameters, str_or_bool 21 | 22 | import torch.backends.cuda 23 | torch.backends.cuda.matmul.allow_tf32 = True 24 | # Uncomment the following line to enable flash attention (model source code must be modified) 25 | # torch.backends.cuda.enable_flash_sdp(enabled=True) 26 | 27 | 28 | IGNORE_INDEX = -100 29 | DEFAULT_PAD_TOKEN = "[PAD]" 30 | 31 | 32 | def save_tunable_parameters(model, path): 33 | saved_params = { 34 | k: v.to("cpu") 35 | for k, v in model.named_parameters() 36 | if v.requires_grad 37 | } 38 | torch.save(saved_params, path) 39 | 40 | 41 | @dataclass 42 | class ModelArguments: 43 | """ 44 | Arguments pertaining to which 
model/config/tokenizer we are going to fine-tune, or train from scratch. 45 | """ 46 | 47 | model_name_or_path: Optional[str] = field( 48 | default="meta-llama/Llama-2-7b-hf", 49 | metadata={ 50 | "help": ( 51 | "The model checkpoint for weights initialization.Don't set if you want to train a model from scratch." 52 | ) 53 | }, 54 | ) 55 | cache_dir: Optional[str] = field( 56 | default=None 57 | ) 58 | r: Optional[int] = field( 59 | default=64, metadata={"help": "The LoRA rank."} 60 | ) 61 | lora_alpha: Optional[float] = field( 62 | default=32, metadata={"help": "The LoRA alpha."} 63 | ) 64 | lora_dropout: Optional[float] = field( 65 | default=0.05, metadata={"help": "The LoRA dropout."} 66 | ) 67 | bits: Optional[int] = field( 68 | default=4, metadata={"help": "The number of bits to quantize to."} 69 | ) 70 | double_quant: Optional[bool] = field( 71 | default=True, metadata={"help": "Whether to use double quantization."} 72 | ) 73 | quant_type: str = field( 74 | default="nf4", metadata={"help": "Quantization data type to use. [fp4, nf4]"} 75 | ) 76 | trust_remote_code: Optional[bool] = field( 77 | default=False, 78 | metadata={"help": "Enable unpickling of arbitrary code in AutoModelForCausalLM."} 79 | ) 80 | use_auth_token: str_or_bool = field( 81 | default=False, 82 | metadata={"help": "Enables using Huggingface auth token to download private/restricted models."} 83 | ) 84 | 85 | 86 | @dataclass 87 | class DataTrainingArguments: 88 | dataset_name: Optional[str] = field( 89 | default="Dahoas/full-hh-rlhf", metadata={"help": "The name of the dataset to use (via the datasets library)."} 90 | ) 91 | block_size: Optional[int] = field( 92 | default=4096, metadata={"help": "The maximum length of the training sequence."} 93 | ) 94 | multi_gpu: Optional[bool] = field( 95 | default=False, metadata={"help": "Whether to use multiple GPUs."} 96 | ) 97 | tensor_parallel: Optional[bool] = field( 98 | default=False, metadata={"help": "Whether to use tensor parallelism. 
(Must be used with multi_gpu)"} 99 | ) 100 | model_output_dir: Optional[str] = field( 101 | default="LLaMA/LoRA", metadata={"help": "The directory to save the model."} 102 | ) 103 | 104 | 105 | def get_device_map(model_name, id_=0, do_int8=False, do_int4=True): 106 | 107 | with init_empty_weights(): 108 | config = LlamaConfig.from_pretrained(model_name) 109 | model = AutoModelForCausalLM.from_config(config) 110 | 111 | d = {id_: "5000MiB"} 112 | d[1] = "4500MiB" 113 | d[2] = "4500MiB" 114 | d[3] = "4500MiB" 115 | d[4] = "4500MiB" 116 | d[5] = "4500MiB" 117 | d[6] = "4500MiB" 118 | d[7] = "6000MiB" 119 | dtype = torch.float16 120 | if do_int8: 121 | dtype = torch.int8 122 | elif do_int4: 123 | dtype = torch.int4 124 | device_map = infer_auto_device_map( 125 | model, max_memory=d, dtype=dtype, no_split_module_classes=["BloomBlock", "OPTDecoderLayer", "LLaMADecoderLayer", "LlamaDecoderLayer"] 126 | ) 127 | print(device_map) 128 | del model 129 | return device_map 130 | 131 | 132 | def main(): 133 | parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments)) 134 | 135 | model_args, data_args, training_args = parser.parse_args_into_dataclasses() 136 | if data_args.tensor_parallel == True and data_args.multi_gpu == False: 137 | raise ValueError("Tensor parallelism can only be used with multi_gpu.") 138 | 139 | if data_args.multi_gpu == True: 140 | if data_args.tensor_parallel == True: 141 | # split the model across GPUs 142 | device_map = get_device_map(model_args.model_name_or_path) 143 | else: 144 | # stick a copy of the model on each GPU 145 | device_map = {"": accelerate.Accelerator().process_index} 146 | else: 147 | device_map = "auto" 148 | 149 | compute_dtype = (torch.float16 if training_args.fp16 else (torch.bfloat16 if training_args.bf16 else torch.float32)) 150 | 151 | model = AutoModelForCausalLM.from_pretrained( 152 | model_args.model_name_or_path, 153 | cache_dir=model_args.cache_dir, 154 | load_in_4bit=model_args.bits == 4, 155 | load_in_8bit=model_args.bits == 8, 156 | device_map=device_map, 157 | quantization_config=BitsAndBytesConfig( 158 | load_in_4bit=model_args.bits == 4, 159 | load_in_8bit=model_args.bits == 8, 160 | llm_int8_threshold=6.0, 161 | llm_int8_has_fp16_weight=False, 162 | bnb_4bit_compute_dtype=compute_dtype, 163 | bnb_4bit_use_double_quant=model_args.double_quant, 164 | bnb_4bit_quant_type=model_args.quant_type, 165 | ), 166 | torch_dtype=(torch.float32 if training_args.fp16 else (torch.bfloat16 if training_args.bf16 else torch.float32)), 167 | trust_remote_code=model_args.trust_remote_code, 168 | use_auth_token=model_args.use_auth_token 169 | ) 170 | 171 | model.config.torch_dtype=(torch.float32 if training_args.fp16 else (torch.bfloat16 if training_args.bf16 else torch.float32)) 172 | 173 | tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, max_length=4096) 174 | if tokenizer._pad_token is None: 175 | smart_tokenizer_and_embedding_resize( 176 | special_tokens_dict=dict(pad_token=DEFAULT_PAD_TOKEN), 177 | tokenizer=tokenizer, 178 | model=model, 179 | ) 180 | 181 | tokenizer.add_special_tokens({ 182 | "eos_token": tokenizer.convert_ids_to_tokens(model.config.eos_token_id), 183 | "bos_token": tokenizer.convert_ids_to_tokens(model.config.bos_token_id), 184 | "unk_token": tokenizer.convert_ids_to_tokens( 185 | model.config.pad_token_id if model.config.pad_token_id != -1 else tokenizer.pad_token_id 186 | ), 187 | }) 188 | 189 | 190 | # ### Prepare model for training 191 | # 192 | # Some pre-processing needs to be done 
before training such an int8 model using `peft`, therefore let's import an utiliy function `prepare_model_for_int8_training` that will: 193 | # - Cast the layer norm in `float32` for stability purposes 194 | # - Add a `forward_hook` to the input embedding layer to enable gradient computation of the input hidden states 195 | # - Enable gradient checkpointing for more memory-efficient training 196 | # - Cast the output logits in `float32` for smoother sampling during the sampling procedure 197 | 198 | # for param in model.parameters(): 199 | # param.requires_grad = False # freeze the model - train adapters later 200 | # if param.ndim == 1: 201 | # # cast the small parameters (e.g. layernorm) to fp32 for stability 202 | # param.data = param.data.to(torch.float16) #32) half precision seems to work just as well in practice 203 | model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=training_args.gradient_checkpointing) 204 | 205 | # model.lm_head = CastOutputToFloat(model.lm_head) 206 | 207 | 208 | # model = prepare_model_for_int8_training(model) seemed to mess up training stability for some reason 209 | 210 | 211 | # ### Apply LoRA 212 | # 213 | # Here comes the magic with `peft`! Let's load a `PeftModel` and specify that we are going to use low-rank adapters (LoRA) using `get_peft_model` utility function from `peft`. 214 | 215 | target_modules = ['q_proj', 'k_proj', 'v_proj', 'o_proj'] # edit with your desired target modules 216 | config = LoraConfig( 217 | r=model_args.r, 218 | lora_alpha=model_args.lora_alpha, 219 | target_modules=target_modules, 220 | lora_dropout=model_args.lora_dropout, 221 | bias="none", 222 | task_type="CAUSAL_LM" 223 | ) 224 | 225 | model = get_peft_model(model, config) 226 | 227 | for name, module in model.named_modules(): 228 | if isinstance(module, LoraLayer): 229 | if training_args.bf16: 230 | module = module.to(torch.bfloat16) 231 | if 'norm' in name: 232 | module = module.to(torch.float32) 233 | if 'lm_head' in name or 'embed_tokens' in name: 234 | if hasattr(module, 'weight'): 235 | if training_args.bf16 and module.weight.dtype == torch.float32: 236 | module = module.to(torch.bfloat16) 237 | 238 | print_trainable_parameters(model_args, model) 239 | 240 | block_size = data_args.block_size 241 | 242 | 243 | ### Prepare dataset 244 | 245 | # Use this function to concatenate all texts from your dataset and generate chunks of block_size. 246 | # def group_texts(examples): 247 | # # Concatenate all texts. 248 | # concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()} 249 | # total_length = len(concatenated_examples[list(examples.keys())[0]]) 250 | # # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can 251 | # # customize this part to your needs. 252 | # if total_length >= block_size: 253 | # total_length = (total_length // block_size) * block_size 254 | # # Split by chunks of max_len. 
255 | # result = { 256 | # k: [t[i : i + block_size] for i in range(0, total_length, block_size)] 257 | # for k, t in concatenated_examples.items() 258 | # } 259 | # result["labels"] = result["input_ids"].copy() 260 | # return result 261 | 262 | # def group_texts(examples): 263 | # examples["labels"] = examples["input_ids"].copy() 264 | # return examples 265 | 266 | def preprocess(sample): 267 | sample["chosen_sample"] = sample["prompt"] + sample["chosen"] 268 | return sample 269 | 270 | # tokenizer.add_bos_token = True 271 | # tokenizer.add_eos_token = False # Uncomment if you concatenate all texts from your dataset and generate chunks of block_size. 272 | # tokenizer.padding_side = "left" 273 | # tokenizer.truncation_side = "left" 274 | 275 | def tokenize(prompt): 276 | result = tokenizer( 277 | prompt, 278 | truncation=True, 279 | max_length=block_size, 280 | padding="max_length", 281 | add_special_tokens=True 282 | ) 283 | return { 284 | "input_ids": result["input_ids"], 285 | "attention_mask": result["attention_mask"], 286 | } 287 | 288 | 289 | ### Training 290 | dataset = load_dataset(data_args.dataset_name).map(preprocess) 291 | columns = dataset["train"].features 292 | # Use this for simple exmaple samples (conversation turns with dialogue history, instructions/responses, etc.) 293 | dataset = dataset.map(lambda samples: tokenize(samples["chosen_sample"]), batched=True, remove_columns=columns) 294 | # Use this to concatenate all texts from your dataset and generate chunks of block_size. (Books, etc.) 295 | #dataset = dataset.map(lambda samples: tokenizer(samples["chosen_sample"], padding=False, add_special_tokens=True), batched=True, remove_columns=columns) 296 | #dataset = dataset.map(group_texts, batched=True) 297 | 298 | # Train 299 | # model = torch.compile(model) # pytorch 2.0 but doesn't seem to work yet? (Should increase speed) 300 | if data_args.tensor_parallel == True: 301 | model.is_parallelizable = True 302 | model.model_parallel = True 303 | 304 | trainer = transformers.Trainer( 305 | model=model, 306 | train_dataset=dataset['train'], 307 | eval_dataset=dataset['test'], 308 | args=training_args, 309 | data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False), 310 | ) 311 | model.config.use_cache = False # silence the warnings. Please re-enable for inference! 312 | trainer.train() 313 | model.config.use_cache = True 314 | 315 | # Save model 316 | model.save_pretrained(data_args.model_output_dir) 317 | 318 | if __name__ == "__main__": 319 | main() 320 | --------------------------------------------------------------------------------
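
A closing usage note: `finetune_peft_8bit.py` saves only the LoRA adapter (to `--model_output_dir`). Below is a minimal sketch of loading that adapter back onto the base model for a quick generation test; the paths are illustrative, and the prompt follows the Human/Assistant format of `Dahoas/full-hh-rlhf`.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-2-7b-hf"  # the foundation model the adapter was trained on
adapter_dir = "LLaMA/LoRA/7B"      # whatever was passed as --model_output_dir

model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(model, adapter_dir)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(base)
inputs = tokenizer("Human: How do LoRA adapters work?\n\nAssistant:", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

For deployment you would typically merge the adapter first (see `docs/merging_the_weights.md`) so inference does not depend on `peft` at all.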