├── README.md
├── chat
│   └── README.md
├── instruct
│   ├── README.md
│   ├── make_dataset.py
│   └── utils.py
├── requirements.txt
└── simple
    ├── README.md
    ├── create_quotes_dataset.py
    └── quotes.json

/README.md:
--------------------------------------------------------------------------------
# Llama 2 Dataset Formats

When creating a dataset for Llama 2 (or most GPT-based models, for that matter), there are typically four different dataset formats in my experience.

While these datasets can have subtypes of formats, they generally fit into one of these four.

Each of these formats varies in difficulty and flexibility, with the easier formats being less flexible in what they can accomplish.

## Pretraining Format

This is the format used to actually pretrain GPT-like models.

It is simply a whole bunch of text, with a BOS token to mark the beginning of the text and an EOS token to mark the end.

Base models are trained with this format of dataset.

Models trained with these datasets are typically not as useful outside of few-shot and zero-shot learning (with creative prompts).

## Simple Format

The simple format is the easiest one that can provide value with finetuning.

You use this one when you want an LLM to accomplish a few tasks and you have input and output pairs for your dataset.

The model will learn to generate an output given an input.

The following is an example of the format.

This example is from a quotes dataset. The input is the type of quote, and the quote itself is the output:

```
success: The successful cannot be unhappy -- it was a contradiction in terms.
```

See a video on this dataset [here](https://www.youtube.com/watch?v=07ppAKvOhqk&ab_channel=Brillibits)

### Simple Format With Tags

You could also use tags to clearly mark the start and end of the different parts of your input and output.

Doing this allows you to train a single model to do different tasks and makes the output easy to parse.

You use models trained with this format like functions.

```
<|START TASK 1|><|START TASK 1 INPUT|>Task 1 input data is here<|END TASK 1 INPUT|><|START TASK 1 OUTPUT|>The correct output given the task and input data<|END TASK 1 OUTPUT|><|END TASK 1|>
```

## Instruct Format

The instruct dataset format takes more work, but is great for letting you give instructions to the LLM and have it perform those tasks.

These models are flexible across a variety of tasks, and you can also include your own custom tasks in the dataset so the model is both flexible and good at your custom tasks.

There are many different types of instruct dataset formats. Here is an example of an instruct dataset with and without context.

The following is an example of the format with context:

```
From the text below, tell me where Mount Balinhard got its name.

Input:
Mount Balinhard is a summit in Alberta, Canada.

Mount Balinhard was named for a title bestowed on the Earl of Southesk.

Output:
Balinhard was a title bestowed on the Earl of Southesk, from which Mount Balinhard gets its name.
```

Without context:

```
How many syllables are in the word smarter?

Output:
There are two syllables in the word smarter: smart-er.
```

## Chat Format

Chat is the hardest format to get working well. This is due to the fact that conversations have a high level of variance.

Having a large, diverse dataset and then using RLHF is typically key to getting good results.

The following is an example of the format for chat Llama 2 models:

```
[INST]<<SYS>>
You are a friendly chatbot that gives helpful answers
<</SYS>>

Hello[/INST]Hello, how are you?[INST]Good, please tell me what 1+1 is.[/INST]1+1=2. Please let me know if you need anything else!
```

Chat models can be trained with or without system prompts, but a well-working system prompt gives you more control over how the model behaves.

Building a large, diverse dataset that also covers a large variety of system prompts is therefore very hard.
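To make the template concrete, here is a minimal sketch of how a chat prompt in this format could be assembled from a list of turns. The function and its name are illustrative, not part of this repo:

```python
def build_chat_prompt(system_prompt, turns):
    """Assemble a Llama 2 style chat prompt.

    `turns` is a list of (user, assistant) tuples; pass None as the
    assistant reply of the last turn to prompt for a new completion.
    """
    prompt = "[INST]<<SYS>>\n" + system_prompt + "\n<</SYS>>\n\n"
    for i, (user, assistant) in enumerate(turns):
        if i > 0:
            prompt += "[INST]"
        prompt += user + "[/INST]"
        if assistant is not None:
            prompt += assistant
    return prompt


# Reproduces the conversation above, ready for the model to answer the last message
print(build_chat_prompt(
    "You are a friendly chatbot that gives helpful answers",
    [("Hello", "Hello, how are you?"),
     ("Good, please tell me what 1+1 is.", None)],
))
```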
--------------------------------------------------------------------------------
/chat/README.md:
--------------------------------------------------------------------------------
# Chat Dataset

Code to be added later
--------------------------------------------------------------------------------
/instruct/README.md:
--------------------------------------------------------------------------------
# Instruct Dataset

This code creates an instruct dataset from databricks/databricks-dolly-15k.

## Running

See the flags with ```python make_dataset.py -h```

You will need to set your HuggingFace token if you have not already downloaded the Llama 2 tokenizer

```export HF_TOKEN=```
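For reference, each row the script produces is a rendered prompt template wrapped in the tokenizer's special tokens. Assuming the Llama 2 tokenizer (BOS `<s>`, EOS `</s>`), an entry without context would look roughly like this:

```
<s>How many syllables are in the word smarter?

Output:
There are two syllables in the word smarter: smart-er.</s>
```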
--------------------------------------------------------------------------------
/instruct/make_dataset.py:
--------------------------------------------------------------------------------
import os
import argparse
import plotly.express as px
from transformers import AutoTokenizer
import datasets
import random
import torch
import numpy as np
import pandas as pd
from langchain import PromptTemplate

from utils import get_logger


def seed_all(seed):
    """Seed all random number generators for reproducibility."""
    torch.manual_seed(seed)
    np.random.seed(seed)
    random.seed(seed)


class InstructDataset:
    def __init__(self, args):
        self.args = args
        self.logger = get_logger("Dataset logger", "info")
        if self.args.token is None:
            self.logger.info("No token passed, looking at HF_TOKEN environment variable")
            self.args.token = os.getenv("HF_TOKEN", None)
            if self.args.token is None:
                self.logger.info("HF_TOKEN not set")

        # The special tokens are added to the templates by hand below, so the
        # tokenizer must not add them again.
        self.tokenizer = AutoTokenizer.from_pretrained(self.args.model, trust_remote_code=True, token=args.token, add_eos_token=False, add_bos_token=False)

        self.end_of_text_token = self.tokenizer.eos_token
        if self.tokenizer.bos_token is None:
            self.beginning_of_text_token = self.end_of_text_token
        else:
            self.beginning_of_text_token = self.tokenizer.bos_token

        instruct_prompt_template = self.beginning_of_text_token + "{instruction}\n\nOutput:\n{response}" + self.end_of_text_token
        self.instruct_prompt_template = PromptTemplate(
            input_variables=["instruction", "response"],
            template=instruct_prompt_template)

        instruct_prompt_template_with_context = self.beginning_of_text_token + "{instruction}\n\nInput:\n{context}\n\nOutput:\n{response}" + self.end_of_text_token
        self.instruct_prompt_template_with_context = PromptTemplate(
            input_variables=["instruction", "context", "response"],
            template=instruct_prompt_template_with_context)
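        # For reference, with the default Llama 2 tokenizer (bos "<s>",
        # eos "</s>") the with-context template renders as:
        #
        #   <s>{instruction}
        #
        #   Input:
        #   {context}
        #
        #   Output:
        #   {response}</s>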
        self.dataset = self.create_instruction_dataset()
        self.train_dataset, self.validation_dataset, self.test_dataset = self.split_dataset()
        self.save_dataset_splits()

    def create_instruction_dataset(self):
        all_entries = self.get_base_instruction_dataset()
        all_lengths = []

        # Collect token lengths to report statistics on the dataset
        for entry in all_entries:
            tokenized_entry = self.tokenizer(entry, return_tensors="np")
            tokenized_entry_length = len(tokenized_entry["input_ids"][0])
            all_lengths.append(tokenized_entry_length)

        if self.args.histogram:
            fig = px.histogram(all_lengths)
            fig.show()
        print(f"Max length: {max(all_lengths)}")
        print(f"Min length: {min(all_lengths)}")
        print(f"Mean length: {np.mean(all_lengths)}")
        print(f"Median length: {np.median(all_lengths)}")
        print(f"Std: {np.std(all_lengths)}")
        print(f"Number of entries: {len(all_entries)}")
        return all_entries

    def get_base_instruction_dataset(self):
        all_prompts = []

        instruct_dataset = datasets.load_dataset(self.args.dataset)["train"]
        for item in instruct_dataset:
            instruction = item["instruction"]
            context = item["context"]
            response = item["response"]
            if context == "" or context is None:
                prompt = self.instruct_prompt_template.format(instruction=instruction, response=response)
            else:
                prompt = self.instruct_prompt_template_with_context.format(instruction=instruction, context=context, response=response)

            tokenized_entry_input_ids = self.tokenizer(prompt, return_tensors="np")["input_ids"][0]
            tokenized_entry_length = len(tokenized_entry_input_ids)
            if tokenized_entry_length >= self.args.max_tokens:
                self.logger.info("Too long, not adding to dataset")
            elif self.args.pad_token_id is not None and self.args.pad_token_id in tokenized_entry_input_ids:
                self.logger.info("Contains pad token, not adding to dataset")
                self.logger.warning("Recreate the dataset with a different pad token")
                print(prompt)
                quit()
            else:
                all_prompts.append(prompt)
        return all_prompts

    def split_dataset(self, train_size=0.85, validation_size=0.149, test_size=0.001):
        random.shuffle(self.dataset)
        train_size = int(train_size * len(self.dataset))
        validation_size = int(validation_size * len(self.dataset))
        # The test split receives whatever remains after train and validation
        test_size = int(test_size * len(self.dataset))
        train_dataset = self.dataset[:train_size]
        validation_dataset = self.dataset[train_size:train_size + validation_size]
        test_dataset = self.dataset[train_size + validation_size:]
        print("Train size: ", len(train_dataset))
        print("Validation size: ", len(validation_dataset))
        print("Test size: ", len(test_dataset))
        return train_dataset, validation_dataset, test_dataset

    def save_dataset_splits(self):
        train_name = self.args.name + "_" + "train_" + str(self.args.seed) + ".csv"
        validation_name = self.args.name + "_" + "validation_" + str(self.args.seed) + ".csv"
        test_name = self.args.name + "_" + "test_" + str(self.args.seed) + ".csv"
        train_path = os.path.join(os.path.realpath('./'), train_name)
        validation_path = os.path.join(os.path.realpath('./'), validation_name)
        test_path = os.path.join(os.path.realpath('./'), test_name)

        train_df = pd.DataFrame(self.train_dataset, columns=["text"])
        validation_df = pd.DataFrame(self.validation_dataset, columns=["text"])
        test_df = pd.DataFrame(self.test_dataset, columns=["text"])

        train_df.to_csv(train_path, index=False)
        validation_df.to_csv(validation_path, index=False)
        test_df.to_csv(test_path, index=False)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    parser.add_argument("-s", "--seed", type=int, help="Seed for random number generator", default=42)
    parser.add_argument("-m", "--model", type=str, help="What model to use for the tokenizer", default="meta-llama/Llama-2-7b-hf")
    parser.add_argument("-t", "--token", type=str, help="Huggingface token", default=None)
    parser.add_argument("-n", "--name", type=str, help="What name to save the csv files under", default="instruct")
    parser.add_argument("-d", "--dataset", type=str, help="What huggingface dataset to use", default="databricks/databricks-dolly-15k")
    parser.add_argument("--histogram", action="store_true", default=False, help="Show a histogram of tokenized prompt lengths")
    parser.add_argument("--max_tokens", type=int, help="Maximum number of tokens to use", default=1024)
    parser.add_argument("--pad_token_id", type=int, help="Token id that must not appear in any prompt", default=0)

    args = parser.parse_args()
    seed_all(args.seed)

    instruct_dataset = InstructDataset(args)
--------------------------------------------------------------------------------
/instruct/utils.py:
--------------------------------------------------------------------------------
import logging
from typing_extensions import Literal
from rich.logging import RichHandler


def get_logger(name: str, level: Literal["info", "warning", "debug"]) -> logging.Logger:
    log_level = logging._nameToLevel[level.upper()]
    # Give the handler the same level as the logger so that debug messages
    # are not silently filtered out by the handler.
    rich_handler = RichHandler(level=log_level, rich_tracebacks=True, markup=True)

    logger = logging.getLogger(name)
    logger.setLevel(log_level)

    if not logger.handlers:
        logger.addHandler(rich_handler)

    logger.propagate = False

    return logger
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
datasets
pandas
torch
transformers
plotly
rich
numpy
langchain
scipy
scikit-learn
--------------------------------------------------------------------------------
/simple/README.md:
--------------------------------------------------------------------------------
# Simple Dataset

This code creates a simple dataset for quotes. It is used in the video tutorial [here](https://www.youtube.com/watch?v=07ppAKvOhqk&ab_channel=Blake)

## Running

Unless you want to run with a different dataset file, you don't need any flags

```python create_quotes_dataset.py```
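The script writes `train.csv` and `validation.csv`, each with a single `text` column. Every row follows the `category: quote` pattern wrapped in `<|endoftext|>` delimiters, so the example quote from the top-level README would be stored as:

```
<|endoftext|>success: The successful cannot be unhappy -- it was a contradiction in terms.<|endoftext|>
```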
--------------------------------------------------------------------------------
/simple/create_quotes_dataset.py:
--------------------------------------------------------------------------------
import json
import argparse
import os
import numpy as np
from sklearn.model_selection import train_test_split
import pandas as pd


def isascii(s):
    """Check if the characters in string s are in ASCII, U+0-U+7F."""
    return len(s) == len(s.encode())


def create_sentences(sentences, key, category_dict):
    data = category_dict[key]
    for item in data:
        # cleans the data of these weird "[10w]" entries
        if item.find("[10w]") == -1:
            # skips non-ascii quotes
            if not isascii(item):
                continue
            # saves the quotes with style of: <|endoftext|>love: quote about love here<|endoftext|>
            sentences.append("<|endoftext|>" + key + ": " + item + "<|endoftext|>")
    return sentences


if __name__ == "__main__":
    parser = argparse.ArgumentParser("Tool to create a simple quotes dataset")
    parser.add_argument("-f", "--file", type=str, default="./quotes.json")
    args = parser.parse_args()
    json_file = os.path.realpath(args.file)
    # loads the file
    with open(json_file) as f:
        data = json.load(f)
    # creates a dictionary that will hold the categories
    category_dict = {}
    # loops through all quotes and adds them to the appropriate categories
    for index, item in enumerate(data):
        if item["Category"] in category_dict:
            category_dict[item["Category"]].append(item["Quote"])
        else:
            entry = [item["Quote"]]
            category_dict[item["Category"]] = entry
    # removes the empty category if present
    category_dict.pop("", None)
    # prints out the different quote categories
    print(category_dict.keys())

    sentences = []
    for key in category_dict.keys():
        # creates the data entries for training the model
        sentences = create_sentences(sentences, key, category_dict)
    # splits into train and test data for training
    train_sentences, test_sentences = train_test_split(sentences, test_size=0.2)
    # creates dataframes with a "text" column that will be saved as csv files
    train_df = pd.DataFrame(train_sentences, columns=['text'])
    validate_df = pd.DataFrame(test_sentences, columns=['text'])
    # drops na items if there are any
    train_df = train_df.dropna()
    validate_df = validate_df.dropna()
    train_df.to_csv("train.csv")
    validate_df.to_csv("validation.csv")
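    # A sketch of how the generated csv files could then be loaded for
    # finetuning with the `datasets` library from requirements.txt; kept as a
    # comment since training itself is out of scope for this script:
    #
    #   from datasets import load_dataset
    #   dataset = load_dataset("csv", data_files={"train": "train.csv",
    #                                             "validation": "validation.csv"})
--------------------------------------------------------------------------------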