├── README.md
├── chat
│   └── README.md
├── instruct
│   ├── README.md
│   ├── make_dataset.py
│   └── utils.py
├── requirements.txt
└── simple
    ├── README.md
    ├── create_quotes_dataset.py
    └── quotes.json
/README.md:
--------------------------------------------------------------------------------
# Llama 2 Dataset Formats

When creating a dataset for Llama 2 (or most GPT-based models, for that matter), there are typically four different dataset formats in my experience.

While these datasets can have subtypes of formats, they generally fit into one of these four.

Each of these formats varies in its difficulty and flexibility, with the easier formats being less flexible in what they can do.

## Pretraining Format

This is the format used to actually pretrain GPT-like models.

It's simply a whole bunch of text, with a BOS token to mark the beginning of the text and an EOS token to mark the end.

Base models are trained with this format of dataset.

Models trained with these datasets are typically not very useful outside of few-shot and zero-shot learning (with creative prompts).

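For Llama 2, whose tokenizer uses `<s>` as the BOS token and `</s>` as the EOS token, a single pretraining document would look something like this (the passage itself is just a placeholder):

```
<s>Some long passage of raw text from your corpus goes here...</s>
```
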
## Simple Format

The simple format is the easiest one to create that can still provide value with finetuning.

You use this one when you want an LLM to accomplish a few specific tasks and you have input and output pairs for your dataset.

The model will learn to generate an output given an input.

The following is an example of the format:

This example is from a quotes dataset. The input is the type of quote, and the actual quote is the output.

```
success: The successful cannot be unhappy -- it was a contradiction in terms.
```

See a video on this dataset [here](https://www.youtube.com/watch?v=07ppAKvOhqk&ab_channel=Brillibits)

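As a rough sketch, turning such pairs into training rows is just string formatting. The `<s>`/`</s>` BOS/EOS tokens below are an assumption; many training pipelines add them automatically, in which case the plain "input: output" string is enough:

```python
# Hypothetical sketch: build simple-format training rows from (input, output) pairs.
pairs = [
    ("success", "The successful cannot be unhappy -- it was a contradiction in terms."),
]

# <s> and </s> are Llama 2's BOS/EOS tokens; drop them if your trainer adds them.
rows = [f"<s>{category}: {quote}</s>" for category, quote in pairs]
print(rows[0])
```
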
### Simple Format With Tags

You could also use tags to clearly mark the start and end of the different parts of your input and output.

Doing this allows you to easily train a model to do different tasks, and it makes the output easy to parse.

You use models trained with this format like functions.

```
<|START TASK 1|><|START TASK 1 INPUT|>Task 1 input data is here<|END TASK 1 INPUT|><|START TASK 1 OUTPUT|>The correct output given the task and input data<|END TASK 1 OUTPUT|><|END TASK 1|>
```

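For example, a minimal sketch of parsing a completion in the (hypothetical) tag scheme above, using only the standard library:

```python
import re

# The tag names here match the hypothetical example above.
generated = (
    "<|START TASK 1|><|START TASK 1 INPUT|>Task 1 input data is here"
    "<|END TASK 1 INPUT|><|START TASK 1 OUTPUT|>The correct output given "
    "the task and input data<|END TASK 1 OUTPUT|><|END TASK 1|>"
)

# Pull out everything between the output tags.
match = re.search(r"<\|START TASK 1 OUTPUT\|>(.*?)<\|END TASK 1 OUTPUT\|>", generated, re.DOTALL)
if match:
    print(match.group(1))  # The correct output given the task and input data
```
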
## Instruct Format

The instruct dataset format takes more work but is great for allowing you to give instructions to an LLM and have it perform those tasks.

These models can be flexible across a variety of tasks, and you can also add your own custom tasks to the dataset so the model is both flexible and good at your custom tasks.

There are many different types of instruct dataset formats. Here is an example of an instruct dataset entry with and without context.

The following is an example of the format with context:
```
From the text below, tell me where Mount Balinhard got its name.

Input:
Mount Balinhard is a summit in Alberta, Canada.

Mount Balinhard was named for a title bestowed on the Earl of Southesk.

Output:
Balinhard was a title bestowed on the Earl of Southesk, from which Mount Balinhard gets its name.
```

Without context:

```
How many syllables are in the word smarter?

Output:
There are two syllables in the word smarter: smart-er.
```

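A minimal sketch of how these two variants can be assembled; it mirrors the prompt templates in instruct/make_dataset.py, with `<s>`/`</s>` assumed as the Llama 2 BOS/EOS tokens:

```python
# Sketch of the two instruct prompt variants (with and without context).
BOS, EOS = "<s>", "</s>"

def build_prompt(instruction, response, context=None):
    if context:
        body = f"{instruction}\n\nInput:\n{context}\n\nOutput:\n{response}"
    else:
        body = f"{instruction}\n\nOutput:\n{response}"
    return BOS + body + EOS

print(build_prompt("How many syllables are in the word smarter?",
                   "There are two syllables in the word smarter: smart-er."))
```
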
## Chat Format

Chat is the hardest format to get working well. This is due to the fact that conversations have a high level of variance.

Having a large, diverse dataset and then using RLHF is typically key to getting good results.

The following is an example of the format for chat Llama 2 models:

```
[INST]<<SYS>>
You are a friendly chatbot that gives helpful answers
<</SYS>>

Hello[/INST]Hello, how are you?[INST]Good, please tell me what 1+1 is.[/INST]1+1=2. Please let me know if you need anything else!
```

Chat models can be trained with or without system prompts, but having a well-working system prompt can give you more control over how the model behaves.

Thus, building a large, diverse dataset with a correspondingly diverse set of system prompts can be very hard.

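A minimal sketch that assembles alternating user/assistant turns into the format above (exact spacing and any BOS/EOS handling between turns vary by training setup, so treat this as illustrative):

```python
# Format a conversation into the Llama 2 chat style shown above.
def format_chat(system_prompt, turns):
    """turns is a list of (user_message, assistant_reply) pairs."""
    text = ""
    for i, (user, assistant) in enumerate(turns):
        if i == 0 and system_prompt:
            # The system prompt is wrapped in <<SYS>> tags inside the first [INST] block.
            user = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user}"
        text += f"[INST]{user}[/INST]{assistant}"
    return text

print(format_chat(
    "You are a friendly chatbot that gives helpful answers",
    [("Hello", "Hello, how are you?"),
     ("Good, please tell me what 1+1 is.",
      "1+1=2. Please let me know if you need anything else!")],
))
```
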
--------------------------------------------------------------------------------
/chat/README.md:
--------------------------------------------------------------------------------
# Chat Dataset

Code to be added later

--------------------------------------------------------------------------------
/instruct/README.md:
--------------------------------------------------------------------------------
# Instruct Dataset

This code creates an instruct dataset from databricks/databricks-dolly-15k.

## Running

See the flags with ```python make_dataset.py -h```

You will need to set your HuggingFace token if you have not already downloaded the Llama 2 tokenizer:

```export HF_TOKEN=```

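For example, the following run makes the key defaults explicit and also shows the token-length histogram (all of these flags come from make_dataset.py): ```python make_dataset.py -m meta-llama/Llama-2-7b-hf -d databricks/databricks-dolly-15k --max_tokens 1024 --histogram```
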
--------------------------------------------------------------------------------
/instruct/make_dataset.py:
--------------------------------------------------------------------------------
import os
import argparse
import random

import plotly.express as px
import datasets
import torch
import numpy as np
import pandas as pd
from transformers import AutoTokenizer
from langchain.prompts import PromptTemplate

from utils import get_logger


def seed_all(seed):
    """Seed every RNG used here (torch, numpy, random) so shuffles and splits are reproducible."""
    torch.manual_seed(seed)
    np.random.seed(seed)
    random.seed(seed)


class InstructDataset:
    def __init__(self, args):
        self.args = args
        self.logger = get_logger("Dataset logger", "info")
        if self.args.token is None:
            self.logger.info("No token passed, looking at HF_TOKEN environment variable")
            self.args.token = os.getenv("HF_TOKEN", None)
            if self.args.token is None:
                self.logger.info("HF_TOKEN not set")

        # BOS/EOS are added explicitly by the prompt templates below, so the
        # tokenizer itself must not add them again.
        self.tokenizer = AutoTokenizer.from_pretrained(
            self.args.model,
            trust_remote_code=True,
            token=args.token,
            add_eos_token=False,
            add_bos_token=False,
        )

        self.end_of_text_token = self.tokenizer.eos_token
        if self.tokenizer.bos_token is None:
            self.beginning_of_text_token = self.end_of_text_token
        else:
            self.beginning_of_text_token = self.tokenizer.bos_token

        # Prompt without context: instruction followed by the expected output.
        instruct_prompt_template = self.beginning_of_text_token + "{instruction}\n\nOutput:\n{response}" + self.end_of_text_token
        self.instruct_prompt_template = PromptTemplate(
            input_variables=["instruction", "response"],
            template=instruct_prompt_template)

        # Prompt with context: instruction, then the supporting input, then the output.
        instruct_prompt_template_with_context = self.beginning_of_text_token + "{instruction}\n\nInput:\n{context}\n\nOutput:\n{response}" + self.end_of_text_token
        self.instruct_prompt_template_with_context = PromptTemplate(
            input_variables=["instruction", "context", "response"],
            template=instruct_prompt_template_with_context)

        self.dataset = self.create_instruction_dataset()
        self.train_dataset, self.validation_dataset, self.test_dataset = self.split_dataset()
        self.save_dataset_splits()

    def create_instruction_dataset(self):
        all_entries = self.get_base_instruction_dataset()
        all_lengths = []

        # Token length of every entry, for the stats and the optional histogram.
        for entry in all_entries:
            tokenized_entry = self.tokenizer(entry, return_tensors="np")
            tokenized_entry_length = len(tokenized_entry["input_ids"][0])
            all_lengths.append(tokenized_entry_length)

        if self.args.histogram:
            fig = px.histogram(all_lengths)
            fig.show()
        print(f"Max length: {max(all_lengths)}")
        print(f"Min length: {min(all_lengths)}")
        print(f"Mean length: {np.mean(all_lengths)}")
        print(f"Median length: {np.median(all_lengths)}")
        print(f"Std: {np.std(all_lengths)}")
        print(f"Number of entries: {len(all_entries)}")
        return all_entries

    def get_base_instruction_dataset(self):
        all_prompts = []

        instruct_dataset = datasets.load_dataset(self.args.dataset)["train"]
        for item in instruct_dataset:
            instruction = item["instruction"]
            context = item["context"]
            response = item["response"]
            if context == "" or context is None:
                prompt = self.instruct_prompt_template.format(instruction=instruction, response=response)
            else:
                prompt = self.instruct_prompt_template_with_context.format(instruction=instruction, context=context, response=response)

            tokenized_entry_input_ids = self.tokenizer(prompt, return_tensors="np")["input_ids"][0]
            tokenized_entry_length = len(tokenized_entry_input_ids)
            if tokenized_entry_length >= self.args.max_tokens:
                # Drop entries that would not fit in the training context window.
                self.logger.info("Too long, not adding to dataset")
            elif self.args.pad_token_id is not None and self.args.pad_token_id in tokenized_entry_input_ids:
                self.logger.info("Contains pad token, not adding to dataset")
                self.logger.warning("Recreate the dataset with a different pad token")
                print(prompt)
                raise SystemExit(1)
            else:
                all_prompts.append(prompt)
        return all_prompts

    def split_dataset(self, train_size=0.85, validation_size=0.149, test_size=0.001):
        # Shuffle, then slice. The test split is simply whatever remains after the
        # train and validation slices, so it absorbs any rounding leftovers.
        random.shuffle(self.dataset)
        train_size = int(train_size * len(self.dataset))
        validation_size = int(validation_size * len(self.dataset))
        train_dataset = self.dataset[:train_size]
        validation_dataset = self.dataset[train_size:train_size + validation_size]
        test_dataset = self.dataset[train_size + validation_size:]
        print("Train size: ", len(train_dataset))
        print("Validation size: ", len(validation_dataset))
        print("Test size: ", len(test_dataset))
        return train_dataset, validation_dataset, test_dataset

    def save_dataset_splits(self):
        train_name = self.args.name + "_" + "train_" + str(self.args.seed) + ".csv"
        validation_name = self.args.name + "_" + "validation_" + str(self.args.seed) + ".csv"
        test_name = self.args.name + "_" + "test_" + str(self.args.seed) + ".csv"
        train_path = os.path.join(os.path.realpath('./'), train_name)
        validation_path = os.path.join(os.path.realpath('./'), validation_name)
        test_path = os.path.join(os.path.realpath('./'), test_name)

        train_df = pd.DataFrame(self.train_dataset, columns=["text"])
        validation_df = pd.DataFrame(self.validation_dataset, columns=["text"])
        test_df = pd.DataFrame(self.test_dataset, columns=["text"])

        train_df.to_csv(train_path, index=False)
        validation_df.to_csv(validation_path, index=False)
        test_df.to_csv(test_path, index=False)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    parser.add_argument("-s", "--seed", type=int, help="Seed for random number generator", default=42)
    parser.add_argument("-m", "--model", type=str, help="What model to use for the tokenizer", default="meta-llama/Llama-2-7b-hf")
    parser.add_argument("-t", "--token", type=str, help="Huggingface token", default=None)
    parser.add_argument("-n", "--name", type=str, help="What name to save the csv files under", default="instruct")
    parser.add_argument("-d", "--dataset", type=str, help="What huggingface dataset to use", default="databricks/databricks-dolly-15k")
    parser.add_argument("--histogram", action="store_true", default=False, help="Show a histogram of tokenized entry lengths")
    parser.add_argument("--max_tokens", type=int, help="Maximum number of tokens to use", default=1024)
    parser.add_argument("--pad_token_id", type=int, default=0, help="Token id that must not appear in any entry")

    args = parser.parse_args()
    seed_all(args.seed)

    instruct_dataset = InstructDataset(args)

--------------------------------------------------------------------------------
/instruct/utils.py:
--------------------------------------------------------------------------------
import logging

from typing_extensions import Literal
from rich.logging import RichHandler


def get_logger(name: str, level: Literal["info", "warning", "debug"]) -> logging.Logger:
    """Return a logger that writes through a RichHandler at the requested level."""
    log_level = logging._nameToLevel[level.upper()]
    rich_handler = RichHandler(level=log_level, rich_tracebacks=True, markup=True)

    logger = logging.getLogger(name)
    logger.setLevel(log_level)

    # Avoid stacking duplicate handlers when get_logger is called more than once.
    if not logger.handlers:
        logger.addHandler(rich_handler)

    logger.propagate = False

    return logger
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
datasets
pandas
torch
transformers
plotly
rich
numpy
langchain
scipy
scikit-learn
typing_extensions
--------------------------------------------------------------------------------
/simple/README.md:
--------------------------------------------------------------------------------
# Simple Dataset

This code creates a simple dataset for quotes. It is used in the video tutorial [here](https://www.youtube.com/watch?v=07ppAKvOhqk&ab_channel=Blake)

## Running

Unless you want to run with a different dataset file, you don't need any flags:

```python create_quotes_dataset.py```
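
The script writes train.csv and validation.csv, each with a single "text" column, to the directory it is run from.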
--------------------------------------------------------------------------------
/simple/create_quotes_dataset.py:
--------------------------------------------------------------------------------
import json
import argparse
import os
from sklearn.model_selection import train_test_split
import pandas as pd


def isascii(s):
    """Check if the characters in string s are in ASCII, U+0-U+7F."""
    return len(s) == len(s.encode())


def create_sentences(sentences, key, quotes_dict):
    data = quotes_dict[key]
    for item in data:
        # Cleans data with these weird "[10w]" entries
        if item.find("[10w]") == -1:
            # Skips non-ASCII quotes
            if not isascii(item):
                continue
            # Saves the quote in the simple format, e.g. "love: quote about love here"
            sentences.append(key + ": " + item)
    return sentences

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Tool to create a simple quotes dataset for finetuning")
    parser.add_argument("-f", "--file", type=str, default="./quotes.json")
    args = parser.parse_args()
    json_file = os.path.realpath(args.file)
    # Loads the file
    with open(json_file) as f:
        data = json.load(f)
    # Creates the dictionary that will hold the categories
    category_dict = {}
    # Loops through all quotes and adds them to the appropriate categories
    for item in data:
        if item["Category"] in category_dict:
            category_dict[item["Category"]].append(item["Quote"])
        else:
            category_dict[item["Category"]] = [item["Quote"]]
    # Removes the empty entry
    category_dict.pop("", None)
    # Prints out the different quote categories
    print(category_dict.keys())

    sentences = []
    for key in category_dict.keys():
        # Creates the data entries for training the model
        sentences = create_sentences(sentences, key, category_dict)
    # Splits into train and validation data for training
    train_sentences, test_sentences = train_test_split(sentences, test_size=0.2)
    # Creates dataframes with a "text" column that will be saved as csv files
    train_df = pd.DataFrame(train_sentences, columns=['text'])
    validate_df = pd.DataFrame(test_sentences, columns=['text'])
    # Drops NA items if there are any
    train_df = train_df.dropna()
    validate_df = validate_df.dropna()
    # index=False keeps the saved CSVs to the single "text" column
    train_df.to_csv("train.csv", index=False)
    validate_df.to_csv("validation.csv", index=False)
--------------------------------------------------------------------------------