├── .gitattributes
├── LICENSE
├── README.md
├── datasets
│   ├── gpt4all
│   │   └── gpt4all-cleaned-dataset.xz
│   └── teknium-oj4
│       ├── instruct.zip
│       ├── longform.zip
│       ├── roleplay.zip
│       └── toolformer.zip
└── scripts
    └── filter-conversations.py

/.gitattributes:
--------------------------------------------------------------------------------
datasets/gpt4all/gpt4all-cleaned-dataset.xz filter=lfs diff=lfs merge=lfs -text
datasets/teknium-oj4/toolformer.zip filter=lfs diff=lfs merge=lfs -text
datasets/teknium-oj4/instruct.zip filter=lfs diff=lfs merge=lfs -text
datasets/teknium-oj4/longform.zip filter=lfs diff=lfs merge=lfs -text
datasets/teknium-oj4/roleplay.zip filter=lfs diff=lfs merge=lfs -text
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2023 Paniphon

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Intro
This repo is intended to be a convenient central location holding various datasets to assist ML researchers in training or fine-tuning their models. All datasets are owned by their respective authors and carry their own licenses.

If you have datasets that you think would be good to have here, especially in underserved areas such as fiction and conversational chats, submit a PR! Datasets should be compressed first.

### Disclaimer
As AI researchers, it is crucial that we acknowledge that the data we create and use can perpetuate existing biases and stereotypes affecting marginalized communities, especially people of color and LGBTQQIAAP folx. In using this training data, please be mindful of your privilege and take the necessary steps to examine and challenge any potential biases that might be present in the data. It is our collective responsibility to ensure that the technologies we develop are equitable and inclusive for all. By actively addressing and eliminating any potential bias in our work, we can contribute to a more just and diverse society. Thank you.

# Index of datasets in this repo
**NOTE:** In some cases, multiple variants of the same dataset are provided, pruning records that are too similar to others. MinHash was used to calculate similarity scores. For example, in a 0.7-similarity dataset, a record had to differ by more than 30% from every other retained record in order to remain in the dataset. This trimmed up to 3% of the data.
Note that the Alpaca team used a 0.7 similarity score to trim their training data.

### teknium-oj4/roleplay
A general-purpose roleplaying dataset. It's intended to train the model to write an answer while roleplaying as a fictional person, e.g. "Pretend you're a detective, what do you make of this crime scene: ..."
Source: Teknium-OJ4
License: MIT

### teknium-oj4/instruct
A general-purpose instruct dataset.
Source: Teknium-OJ4
License: MIT

### teknium-oj4/longform
Attempt at a longform dataset (i.e. long prompts and responses).
Source: Teknium-OJ4
License: MIT

### teknium-oj4/toolformer
Toolformer datasets.
Source: Teknium-OJ4
License: MIT

### gpt4all-prompt-generations-with-p3_cleaned
This is the cleaned dataset of the gpt4all project, stripped of ~100k prompts that the chatbot refused to answer.
Source: gpt4all project
License: Apache 2.0
--------------------------------------------------------------------------------
/datasets/gpt4all/gpt4all-cleaned-dataset.xz:
--------------------------------------------------------------------------------
version https://git-lfs.github.com/spec/v1
oid sha256:5e00b9ef3d7b643c0c60471a9e141dc4b938582e58b62e18abc53373a92b0f73
size 131984872
--------------------------------------------------------------------------------
/datasets/teknium-oj4/instruct.zip:
--------------------------------------------------------------------------------
version https://git-lfs.github.com/spec/v1
oid sha256:31368f9a25c3006eb7604abaca295d560d63399707dff26d5be0dc0723a62848
size 20033839
--------------------------------------------------------------------------------
/datasets/teknium-oj4/longform.zip:
--------------------------------------------------------------------------------
version https://git-lfs.github.com/spec/v1
oid sha256:d42b25b44950a2546a39e9b769afaea00732d47ad3428f5a066fc01a1974ab28
size 195124
--------------------------------------------------------------------------------
/datasets/teknium-oj4/roleplay.zip:
--------------------------------------------------------------------------------
version https://git-lfs.github.com/spec/v1
oid sha256:de1de5c82bb471da3f94137f883d3af5561a02e63459130f1cc0a8a180ba140d
size 4433362
--------------------------------------------------------------------------------
/datasets/teknium-oj4/toolformer.zip:
--------------------------------------------------------------------------------
version https://git-lfs.github.com/spec/v1
oid sha256:50babb4c9544f0388eb95f1dc460b3d0814b69a29af01f78ad574c702e648d00
size 1143584
--------------------------------------------------------------------------------
/scripts/filter-conversations.py:
--------------------------------------------------------------------------------
# Given an array of similar {instruction, input, response} objects, this script identifies the
# longest/most complete conversation for each chat session, and extracts each one into its own file.
#
# The purpose of this script is to extract training data from a huge dump of chatbot logs (in the
# format of a JSON array containing {instruction, input, response} entries). Because the dump
# contained multiple records for the same chat session (every time the user sends something, a
# whole new request is made to the chat backend, including all the chat up to that point as part
# of the input/context), we wanted to deduplicate this data and keep only the final/most complete
# conversation.
#
# One other change it makes: if a standard system prompt is found containing the string
# "Write XYZ's next reply in a fictional chat between XYZ and You", it will replace You with User.
import json
import argparse
import os
import re
from collections import Counter
from time import sleep


def load_json(file_path):
    with open(file_path, 'r') as file:
        data = json.load(file)
    return data


# If the prompt is the standard Tavern prompt, we replace "between Charname and You"
# with "between Charname and User".
def replace_You_in_prompt(input):
    pattern = r"next reply in a fictional chat between\s+\w+\s+and\s+You"
    result = re.sub(pattern, lambda match: match.group(0).replace("You", "User"), input)
    replacement_made = (result != input)
    return (replacement_made, result)


# Returns a tuple of (success, character name, your name) from a standard Scale prompt.
def extract_names_from_known_prompt1(prompt):
    pattern = r"Write (.+)'s next reply in a fictional chat between \1 and ([^.]+)\."
    match = re.search(pattern, prompt)
    if match:
        return (True, match.group(1), match.group(2))
    return (False, None, None)


# Returns the input's identifiers for the chat participants, but cannot identify who's the
# character/bot and who's You. It just returns the identifiers sorted by frequency.
def detect_names_from_input(input):
    lines = input.split('\n')
    names = []
    for line in lines:
        if ':' in line:
            name = line.split(':')[0]
            names.append(name)
    name_counter = Counter(names)
    sorted_convo_names = sorted(name_counter.items(), key=lambda x: x[1], reverse=True)
    for convo_name, count in sorted_convo_names:
        print(f"{convo_name}: {count} times")
    return sorted_convo_names


# Given an input, attempts to detect the names of the Character and the User.
# NOTE: it's not foolproof, and only works for the default Scale prompt.
# Returns a tuple of (Character name, User name).
def detect_names(instruction, input):
    names_from_input = detect_names_from_input(input)

    combined = instruction + input
    if "next reply in a fictional chat" in combined:
        (prompt_found, prompt_character_name, prompt_you_name) = extract_names_from_known_prompt1(combined)
        known_names = dict(names_from_input)
        assert prompt_found and (prompt_character_name in known_names) and (prompt_you_name in known_names)
        return (prompt_character_name, prompt_you_name)
    else:
        print("!!!!!!!New system prompt detected! Notify you-know-who to update this script, and specify the record number written above")
        sleep(3)
        return None


# Given a chat "input" field, gives the first thing the user typed in the chat.
def find_first_user_chat(instruction, input, combined):
    start_substring = "user:"
    end_substring = "assistant:"
    start_index = input.find(start_substring)
    end_index = input.find(end_substring, start_index)

    if start_index == -1:
        print("WARNING: could not find user chat in input: ", input)
        user_chat = ""
    else:
        # We found the user's input. We decide how far to read ahead based on whether we find a reply.
        if end_index == -1:
            user_chat = input[start_index + len(start_substring):(start_index + 300)]
        else:
            user_chat = input[(start_index + len(start_substring)):end_index]

    return user_chat


# Used to discard partial entries of the same chat session.
def find_most_complete_transcript(transcripts):
    complete_transcript = {}
    counter = 1
    for transcript in transcripts:
        print("Processing record", counter)
        counter += 1
        instruction = transcript["instruction"]
        input = transcript["input"]
        combined = instruction + input
        # To identify a conversation, we use the first 50 characters as a key for comparison
        # (covers character name), as well as the first thing a user types.
        # This is not a perfect solution but it should work in most cases.
        first_user_chat = find_first_user_chat(instruction, input, combined)
        key = input[:50] + first_user_chat
        #print("Using key:", key)
        if key not in complete_transcript:
            complete_transcript[key] = transcript
        elif len(input) > len(complete_transcript[key]["input"]):
            complete_transcript[key] = transcript
    return list(complete_transcript.values())


# For now we just replace You in the prompt, if found.
def apply_post_processing(transcripts):
    retouched_transcripts = []
    counter = 0
    for transcript in transcripts:
        counter += 1
        print("Applying post-processing to extracted convo #", counter)

        instruction = transcript["instruction"]
        input = transcript["input"]
        response = transcript["response"]

        # Replace You with User
        change_made_in_instruction, new_instruction = replace_You_in_prompt(transcript["instruction"])
        if change_made_in_instruction:
            instruction = new_instruction
        else:
            change_made_in_input, new_input = replace_You_in_prompt(transcript["input"])
            if change_made_in_input:
                input = new_input
            else:
                print("WARNING: did not find expected system prompt in this record! Tell your developer the filename and record number. Pausing briefly.")
                sleep(3)

        # Replace "assistant:" entries with the character's name
        character_name = None
        prompt_found, character_name_1, _ = extract_names_from_known_prompt1(instruction)
        if prompt_found:
            character_name = character_name_1
        else:
            prompt_found, character_name_2, _ = extract_names_from_known_prompt1(input)
            if prompt_found:
                character_name = character_name_2
        if character_name:
            count = input.count("\nassistant:")
            input = input.replace("\nassistant:", "\n" + character_name + ":")
            if count > 0:
                print(count, "instances of 'assistant' replaced with", character_name)

        transcript["instruction"] = instruction
        transcript["input"] = input
        transcript["response"] = response

        retouched_transcripts.append(transcript)

    return retouched_transcripts


def save_json(file_path, data):
    with open(file_path, 'w') as file:
        json.dump(data, file, indent=2)


if __name__ == "__main__":
    print("Running in", os.path.dirname(os.path.realpath(__file__)))

    parser = argparse.ArgumentParser(description='Deduplicate a JSON array of chatlogs.')
    parser.add_argument('input_file', type=str, nargs='?', default='./kuru.json', help='Path to the input file.')
    args = parser.parse_args()
    input_file = args.input_file

    input_file_name, ext = os.path.splitext(input_file)

    print("Processing", input_file)
    transcripts = load_json(input_file)
    complete_transcripts = find_most_complete_transcript(transcripts)
    print("Discarded", (len(transcripts) - len(complete_transcripts)), "partial transcripts")
    retouched_transcripts = apply_post_processing(complete_transcripts)

    counter = 0
    for retouched_transcript in retouched_transcripts:
        counter += 1
        output_file = input_file_name + '.convo.' + str(counter) + ext
        save_json(output_file, retouched_transcript)
        print("Saved to", output_file)

    print("Done processing", input_file, "\n==============")
--------------------------------------------------------------------------------
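
For reference, the MinHash-based pruning described in the README (keep a record only if it differs by more than 30% from every record already kept, i.e. estimated similarity below 0.7) can be sketched as follows. This is a simplified, self-contained illustration with hypothetical helper names, not the actual tool used to build these datasets:

```python
import hashlib

def minhash_signature(text, num_perm=64):
    # Shingle the text into word 3-grams, then for each of num_perm seeded
    # hash functions keep the minimum hash value over all shingles.
    words = text.lower().split()
    shingles = {" ".join(words[i:i + 3]) for i in range(max(1, len(words) - 2))}
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles
        ))
    return sig

def estimated_similarity(sig_a, sig_b):
    # The fraction of matching signature slots approximates Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def prune_near_duplicates(records, threshold=0.7):
    # Keep a record only if its estimated similarity to every record kept so
    # far stays below the threshold, i.e. it differs by more than
    # (1 - threshold) from everything already retained.
    kept, kept_sigs = [], []
    for rec in records:
        sig = minhash_signature(rec)
        if all(estimated_similarity(sig, s) < threshold for s in kept_sigs):
            kept.append(rec)
            kept_sigs.append(sig)
    return kept
```

A production pipeline would typically use an LSH index rather than this quadratic all-pairs comparison, but the retained/pruned decision at a 0.7 threshold is the same.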