├── README.md
├── chat
│   └── README.md
├── instruct
│   ├── README.md
│   ├── make_dataset.py
│   └── utils.py
├── requirements.txt
└── simple
    ├── README.md
    ├── create_quotes_dataset.py
    └── quotes.json

/README.md:
--------------------------------------------------------------------------------
# Llama 2 Dataset Formats

When creating a dataset for Llama 2 (or most GPT-based models, for that matter), there are typically four different dataset formats in my experience.

While these datasets can have subtypes of formats, they generally fit into one of these four.

Each of these formats varies in difficulty and flexibility, with the easier formats being less flexible in what they can accomplish.

## Pretraining Format

This is the format used to actually pretrain GPT-like models.

It is simply a whole bunch of text, with a BOS token to mark the beginning of the text and an EOS token to mark the end.

Base models are trained with this format of dataset.

Models trained with these datasets are typically not as useful outside of few-shot and zero-shot learning (with creative prompts).

## Simple Format

The simple format is the easiest one that can provide value with finetuning.

You use this one when you want an LLM to accomplish a few tasks and you have input and output pairs for your dataset.

The model will learn to generate an output given an input.

The following is an example of the format.

This example is from a quotes dataset. The input is the type of quote, and the quote itself is the output:

```
success: The successful cannot be unhappy -- it was a contradiction in terms.
```

See a video on this dataset [here](https://www.youtube.com/watch?v=07ppAKvOhqk&ab_channel=Brillibits)

### Simple Format With Tags

You could also use tags to clearly mark the start and end of the different parts of your input and output.

Doing this allows you to train a single model to do different tasks and makes the output easy to parse.

You use models trained with this format like functions.

```
<|START TASK 1|><|START TASK 1 INPUT|>Task 1 input data is here<|END TASK 1 INPUT|><|START TASK 1 OUTPUT|>The correct output given the task and input data<|END TASK 1 OUTPUT|><|END TASK 1|>
```

## Instruct Format

The instruct dataset format takes more work, but is great for letting you give instructions to the LLM and have it perform those tasks.

These models are flexible across a variety of tasks, and you can also include your own custom tasks in the dataset so the model is both flexible and good at your custom tasks.

There are many different types of instruct dataset formats. Here is an example of an instruct dataset with and without context.

The following is an example of the format with context:

```
From the text below, tell me where Mount Balinhard got its name.

Input:
Mount Balinhard is a summit in Alberta, Canada.

Mount Balinhard was named for a title bestowed on the Earl of Southesk.

Output:
Balinhard was a title bestowed on the Earl of Southesk, from which Mount Balinhard gets its name.
```

Without context:

```
How many syllables are in the word smarter?

Output:
There are two syllables in the word smarter: smart-er.
```

## Chat Format

Chat is the hardest format to get working well. This is due to the fact that conversations have a high level of variance.

Having a large, diverse dataset and then using RLHF is typically key to getting good results.

The following is an example of the format for chat Llama 2 models:

```
[INST]<<SYS>>
You are a friendly chatbot that gives helpful answers
<</SYS>>

Hello[/INST]Hello, how are you?[INST]Good, please tell me what 1+1 is.[/INST]1+1=2. Please let me know if you need anything else!
```

Chat models can be trained with or without system prompts, but a well-working system prompt gives you more control over how the model behaves.

Building a large, diverse dataset that also covers a large variety of system prompts is therefore very hard.
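To make the template concrete, here is a minimal sketch of how a chat prompt in this format could be assembled from a list of turns. The function and its name are illustrative, not part of this repo:

```python
def build_chat_prompt(system_prompt, turns):
    """Assemble a Llama 2 style chat prompt.

    `turns` is a list of (user, assistant) tuples; pass None as the
    assistant reply of the last turn to prompt for a new completion.
    """
    prompt = "[INST]<<SYS>>\n" + system_prompt + "\n<</SYS>>\n\n"
    for i, (user, assistant) in enumerate(turns):
        if i > 0:
            prompt += "[INST]"
        prompt += user + "[/INST]"
        if assistant is not None:
            prompt += assistant
    return prompt


# Reproduces the conversation above, ready for the model to answer the last message
print(build_chat_prompt(
    "You are a friendly chatbot that gives helpful answers",
    [("Hello", "Hello, how are you?"),
     ("Good, please tell me what 1+1 is.", None)],
))
```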
--------------------------------------------------------------------------------
/chat/README.md:
--------------------------------------------------------------------------------
# Chat Dataset

Code to be added later
--------------------------------------------------------------------------------
/instruct/README.md:
--------------------------------------------------------------------------------
# Instruct Dataset

This code creates an instruct dataset from databricks/databricks-dolly-15k.

## Running

See the flags with ```python make_dataset.py -h```

You will need to set your HuggingFace token if you have not already downloaded the Llama 2 tokenizer

```export HF_TOKEN=```
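For reference, each row the script produces is a rendered prompt template wrapped in the tokenizer's special tokens. Assuming the Llama 2 tokenizer (BOS `<s>`, EOS `</s>`), an entry without context would look roughly like this:

```
<s>How many syllables are in the word smarter?

Output:
There are two syllables in the word smarter: smart-er.</s>
```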
--------------------------------------------------------------------------------
/instruct/make_dataset.py:
--------------------------------------------------------------------------------
import os
import argparse
import plotly.express as px
from transformers import AutoTokenizer
import datasets
import random
import torch
import numpy as np
import pandas as pd
from langchain import PromptTemplate

from utils import get_logger


def seed_all(seed):
    """Seed all random number generators for reproducibility."""
    torch.manual_seed(seed)
    np.random.seed(seed)
    random.seed(seed)


class InstructDataset:
    def __init__(self, args):
        self.args = args
        self.logger = get_logger("Dataset logger", "info")
        if self.args.token is None:
            self.logger.info("No token passed, looking at HF_TOKEN environment variable")
            self.args.token = os.getenv("HF_TOKEN", None)
            if self.args.token is None:
                self.logger.info("HF_TOKEN not set")

        # The special tokens are added to the templates by hand below, so the
        # tokenizer must not add them again.
        self.tokenizer = AutoTokenizer.from_pretrained(self.args.model, trust_remote_code=True, token=args.token, add_eos_token=False, add_bos_token=False)

        self.end_of_text_token = self.tokenizer.eos_token
        if self.tokenizer.bos_token is None:
            self.beginning_of_text_token = self.end_of_text_token
        else:
            self.beginning_of_text_token = self.tokenizer.bos_token

        instruct_prompt_template = self.beginning_of_text_token + "{instruction}\n\nOutput:\n{response}" + self.end_of_text_token
        self.instruct_prompt_template = PromptTemplate(
            input_variables=["instruction", "response"],
            template=instruct_prompt_template)

        instruct_prompt_template_with_context = self.beginning_of_text_token + "{instruction}\n\nInput:\n{context}\n\nOutput:\n{response}" + self.end_of_text_token
        self.instruct_prompt_template_with_context = PromptTemplate(
            input_variables=["instruction", "context", "response"],
            template=instruct_prompt_template_with_context)
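        # For reference, with the default Llama 2 tokenizer (bos "<s>",
        # eos "</s>") the with-context template renders as:
        #
        #   <s>{instruction}
        #
        #   Input:
        #   {context}
        #
        #   Output:
        #   {response}</s>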
        self.dataset = self.create_instruction_dataset()
        self.train_dataset, self.validation_dataset, self.test_dataset = self.split_dataset()
        self.save_dataset_splits()

    def create_instruction_dataset(self):
        all_entries = self.get_base_instruction_dataset()
        all_lengths = []

        # Collect token lengths to report statistics on the dataset
        for entry in all_entries:
            tokenized_entry = self.tokenizer(entry, return_tensors="np")
            tokenized_entry_length = len(tokenized_entry["input_ids"][0])
            all_lengths.append(tokenized_entry_length)

        if self.args.histogram:
            fig = px.histogram(all_lengths)
            fig.show()
        print(f"Max length: {max(all_lengths)}")
        print(f"Min length: {min(all_lengths)}")
        print(f"Mean length: {np.mean(all_lengths)}")
        print(f"Median length: {np.median(all_lengths)}")
        print(f"Std: {np.std(all_lengths)}")
        print(f"Number of entries: {len(all_entries)}")
        return all_entries

    def get_base_instruction_dataset(self):
        all_prompts = []

        instruct_dataset = datasets.load_dataset(self.args.dataset)["train"]
        for item in instruct_dataset:
            instruction = item["instruction"]
            context = item["context"]
            response = item["response"]
            if context == "" or context is None:
                prompt = self.instruct_prompt_template.format(instruction=instruction, response=response)
            else:
                prompt = self.instruct_prompt_template_with_context.format(instruction=instruction, context=context, response=response)

            tokenized_entry_input_ids = self.tokenizer(prompt, return_tensors="np")["input_ids"][0]
            tokenized_entry_length = len(tokenized_entry_input_ids)
            if tokenized_entry_length >= self.args.max_tokens:
                self.logger.info("Too long, not adding to dataset")
            elif self.args.pad_token_id is not None and self.args.pad_token_id in tokenized_entry_input_ids:
                self.logger.info("Contains pad token, not adding to dataset")
                self.logger.warning("Recreate the dataset with a different pad token")
                print(prompt)
                quit()
            else:
                all_prompts.append(prompt)
        return all_prompts

    def split_dataset(self, train_size=0.85, validation_size=0.149, test_size=0.001):
        random.shuffle(self.dataset)
        train_size = int(train_size * len(self.dataset))
        validation_size = int(validation_size * len(self.dataset))
        # The test split receives whatever remains after train and validation
        test_size = int(test_size * len(self.dataset))
        train_dataset = self.dataset[:train_size]
        validation_dataset = self.dataset[train_size:train_size + validation_size]
        test_dataset = self.dataset[train_size + validation_size:]
        print("Train size: ", len(train_dataset))
        print("Validation size: ", len(validation_dataset))
        print("Test size: ", len(test_dataset))
        return train_dataset, validation_dataset, test_dataset

    def save_dataset_splits(self):
        train_name = self.args.name + "_" + "train_" + str(self.args.seed) + ".csv"
        validation_name = self.args.name + "_" + "validation_" + str(self.args.seed) + ".csv"
        test_name = self.args.name + "_" + "test_" + str(self.args.seed) + ".csv"
        train_path = os.path.join(os.path.realpath('./'), train_name)
        validation_path = os.path.join(os.path.realpath('./'), validation_name)
        test_path = os.path.join(os.path.realpath('./'), test_name)

        train_df = pd.DataFrame(self.train_dataset, columns=["text"])
        validation_df = pd.DataFrame(self.validation_dataset, columns=["text"])
        test_df = pd.DataFrame(self.test_dataset, columns=["text"])

        train_df.to_csv(train_path, index=False)
        validation_df.to_csv(validation_path, index=False)
        test_df.to_csv(test_path, index=False)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    parser.add_argument("-s", "--seed", type=int, help="Seed for random number generator", default=42)
    parser.add_argument("-m", "--model", type=str, help="What model to use for the tokenizer", default="meta-llama/Llama-2-7b-hf")
    parser.add_argument("-t", "--token", type=str, help="Huggingface token", default=None)
    parser.add_argument("-n", "--name", type=str, help="What name to save the csv files under", default="instruct")
    parser.add_argument("-d", "--dataset", type=str, help="What huggingface dataset to use", default="databricks/databricks-dolly-15k")
    parser.add_argument("--histogram", action="store_true", default=False, help="Show a histogram of tokenized prompt lengths")
    parser.add_argument("--max_tokens", type=int, help="Maximum number of tokens to use", default=1024)
    parser.add_argument("--pad_token_id", type=int, help="Token id that must not appear in any prompt", default=0)

    args = parser.parse_args()
    seed_all(args.seed)

    instruct_dataset = InstructDataset(args)
--------------------------------------------------------------------------------
/instruct/utils.py:
--------------------------------------------------------------------------------
import logging
from typing_extensions import Literal
from rich.logging import RichHandler


def get_logger(name: str, level: Literal["info", "warning", "debug"]) -> logging.Logger:
    log_level = logging._nameToLevel[level.upper()]
    # Give the handler the same level as the logger so that debug messages
    # are not silently filtered out by the handler.
    rich_handler = RichHandler(level=log_level, rich_tracebacks=True, markup=True)

    logger = logging.getLogger(name)
    logger.setLevel(log_level)

    if not logger.handlers:
        logger.addHandler(rich_handler)

    logger.propagate = False

    return logger
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
datasets
pandas
torch
transformers
plotly
rich
numpy
langchain
scipy
scikit-learn
--------------------------------------------------------------------------------
/simple/README.md:
--------------------------------------------------------------------------------
# Simple Dataset

This code creates a simple dataset for quotes. It is used in the video tutorial [here](https://www.youtube.com/watch?v=07ppAKvOhqk&ab_channel=Blake)

## Running

Unless you want to run with a different dataset file, you don't need any flags

```python create_quotes_dataset.py```
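The script writes `train.csv` and `validation.csv`, each with a single `text` column. Every row follows the `category: quote` pattern wrapped in `<|endoftext|>` delimiters, so the example quote from the top-level README would be stored as:

```
<|endoftext|>success: The successful cannot be unhappy -- it was a contradiction in terms.<|endoftext|>
```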
--------------------------------------------------------------------------------
/simple/create_quotes_dataset.py:
--------------------------------------------------------------------------------
import json
import argparse
import os
import numpy as np
from sklearn.model_selection import train_test_split
import pandas as pd


def isascii(s):
    """Check if the characters in string s are in ASCII, U+0-U+7F."""
    return len(s) == len(s.encode())


def create_sentences(sentences, key, category_dict):
    data = category_dict[key]
    for item in data:
        # cleans the data of these weird "[10w]" entries
        if item.find("[10w]") == -1:
            # skips non-ascii quotes
            if not isascii(item):
                continue
            # saves the quotes with style of: <|endoftext|>love: quote about love here<|endoftext|>
            sentences.append("<|endoftext|>" + key + ": " + item + "<|endoftext|>")
    return sentences


if __name__ == "__main__":
    parser = argparse.ArgumentParser("Tool to create a simple quotes dataset")
    parser.add_argument("-f", "--file", type=str, default="./quotes.json")
    args = parser.parse_args()
    json_file = os.path.realpath(args.file)
    # loads the file
    with open(json_file) as f:
        data = json.load(f)
    # creates a dictionary that will hold the categories
    category_dict = {}
    # loops through all quotes and adds them to the appropriate categories
    for index, item in enumerate(data):
        if item["Category"] in category_dict:
            category_dict[item["Category"]].append(item["Quote"])
        else:
            entry = [item["Quote"]]
            category_dict[item["Category"]] = entry
    # removes the empty category if present
    category_dict.pop("", None)
    # prints out the different quote categories
    print(category_dict.keys())

    sentences = []
    for key in category_dict.keys():
        # creates the data entries for training the model
        sentences = create_sentences(sentences, key, category_dict)
    # splits into train and test data for training
    train_sentences, test_sentences = train_test_split(sentences, test_size=0.2)
    # creates dataframes with a "text" column that will be saved as csv files
    train_df = pd.DataFrame(train_sentences, columns=['text'])
    validate_df = pd.DataFrame(test_sentences, columns=['text'])
    # drops na items if there are any
    train_df = train_df.dropna()
    validate_df = validate_df.dropna()
    train_df.to_csv("train.csv")
    validate_df.to_csv("validation.csv")
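    # A sketch of how the generated csv files could then be loaded for
    # finetuning with the `datasets` library from requirements.txt; kept as a
    # comment since training itself is out of scope for this script:
    #
    #   from datasets import load_dataset
    #   dataset = load_dataset("csv", data_files={"train": "train.csv",
    #                                             "validation": "validation.csv"})
--------------------------------------------------------------------------------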