├── .gitattributes
├── LICENSE
├── README.md
├── datasets
│   ├── gpt4all
│   │   └── gpt4all-cleaned-dataset.xz
│   └── teknium-oj4
│       ├── instruct.zip
│       ├── longform.zip
│       ├── roleplay.zip
│       └── toolformer.zip
└── scripts
    └── filter-conversations.py

/.gitattributes:
--------------------------------------------------------------------------------
datasets/gpt4all/gpt4all-cleaned-dataset.xz filter=lfs diff=lfs merge=lfs -text
datasets/teknium-oj4/toolformer.zip filter=lfs diff=lfs merge=lfs -text
datasets/teknium-oj4/instruct.zip filter=lfs diff=lfs merge=lfs -text
datasets/teknium-oj4/longform.zip filter=lfs diff=lfs merge=lfs -text
datasets/teknium-oj4/roleplay.zip filter=lfs diff=lfs merge=lfs -text
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2023 Paniphon

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Intro
This repo is intended to be a convenient central location holding various datasets to assist ML researchers in training or fine-tuning their models. All datasets are owned by their respective authors and carry their own licenses.

If you have datasets that you think would be good to have here, especially in underserved areas such as fiction and conversational chats, submit a PR! Datasets should be compressed first.

### Disclaimer
As AI researchers, it is crucial that we acknowledge that the data we create and use can perpetuate existing biases and stereotypes affecting marginalized communities, especially people of color and LGBTQQIAAP folx. In using this training data, please be mindful of your privilege and take the necessary steps to examine and challenge any potential biases that might be present in the data. It is our collective responsibility to ensure that the technologies we develop are equitable and inclusive for all. By actively addressing and eliminating any potential bias in our work, we can contribute to a more just and diverse society. Thank you.

# Index of datasets in this repo
**NOTE:** In some cases, multiple variants of the same dataset are provided, pruning records that are too similar to others. MinHash was used to calculate similarity scores. For example, in a 0.7-similarity dataset, a record had to differ by more than 30% from every other retained record in order to remain in the dataset. This trimmed up to 3% of the data.
Note that the Alpaca team used a 0.7 similarity score to trim their training data.

### teknium-oj4/roleplay
A general-purpose roleplaying dataset. It's intended to train the model to write an answer while roleplaying as a fictional person, e.g. "Pretend you're a detective, what do you make of this crime scene: ..."
Source: Teknium-OJ4
License: MIT

### teknium-oj4/instruct
A general-purpose instruct dataset.
Source: Teknium-OJ4
License: MIT

### teknium-oj4/longform
Attempt at a longform dataset (i.e. long prompts and responses).
Source: Teknium-OJ4
License: MIT

### teknium-oj4/toolformer
Toolformer datasets.
Source: Teknium-OJ4
License: MIT

### gpt4all-prompt-generations-with-p3_cleaned
This is the cleaned dataset of the gpt4all project, stripped of ~100k prompts that the chatbot refused to answer.
Source: gpt4all project
License: Apache 2.0
--------------------------------------------------------------------------------
/datasets/gpt4all/gpt4all-cleaned-dataset.xz:
--------------------------------------------------------------------------------
version https://git-lfs.github.com/spec/v1
oid sha256:5e00b9ef3d7b643c0c60471a9e141dc4b938582e58b62e18abc53373a92b0f73
size 131984872
--------------------------------------------------------------------------------
/datasets/teknium-oj4/instruct.zip:
--------------------------------------------------------------------------------
version https://git-lfs.github.com/spec/v1
oid sha256:31368f9a25c3006eb7604abaca295d560d63399707dff26d5be0dc0723a62848
size 20033839
--------------------------------------------------------------------------------
/datasets/teknium-oj4/longform.zip:
--------------------------------------------------------------------------------
version https://git-lfs.github.com/spec/v1
oid sha256:d42b25b44950a2546a39e9b769afaea00732d47ad3428f5a066fc01a1974ab28
size 195124
--------------------------------------------------------------------------------
/datasets/teknium-oj4/roleplay.zip:
--------------------------------------------------------------------------------
version https://git-lfs.github.com/spec/v1
oid sha256:de1de5c82bb471da3f94137f883d3af5561a02e63459130f1cc0a8a180ba140d
size 4433362
--------------------------------------------------------------------------------
/datasets/teknium-oj4/toolformer.zip:
--------------------------------------------------------------------------------
version https://git-lfs.github.com/spec/v1
oid sha256:50babb4c9544f0388eb95f1dc460b3d0814b69a29af01f78ad574c702e648d00
size 1143584
--------------------------------------------------------------------------------
/scripts/filter-conversations.py:
--------------------------------------------------------------------------------
# Given an array of similar {instruction, input, response} objects, this script identifies the
# longest/most complete conversation for each chat session, and extracts each one into its own file.
#
# The purpose of this script is to extract training data from a huge dump of chatbot logs (in the
# format of a JSON array containing {instruction, input, response} entries). Because the dump
# contained multiple records for the same chat session (every time the user sends something, a
# whole new request is made to the chat backend, including all the chat up to that point as part
# of the input/context), we wanted to deduplicate this data and keep only the final/most complete
# conversation.
#
# One other change it makes: if a standard system prompt is found containing the string
# "Write XYZ's next reply in a fictional chat between XYZ and You", it will replace You with User.
import json
import argparse
import os
import re
from collections import Counter
from time import sleep


def load_json(file_path):
    with open(file_path, 'r') as file:
        data = json.load(file)
    return data


# If the prompt is the standard Tavern prompt, we replace "between Charname and You"
# with "between Charname and User".
def replace_You_in_prompt(input):
    pattern = r"next reply in a fictional chat between\s+\w+\s+and\s+You"
    result = re.sub(pattern, lambda match: match.group(0).replace("You", "User"), input)
    replacement_made = (result != input)
    return (replacement_made, result)


# Returns a tuple of (success, character name, your name) from a standard Scale prompt.
def extract_names_from_known_prompt1(prompt):
    pattern = r"Write (.+)'s next reply in a fictional chat between \1 and ([^.]+)\."
    match = re.search(pattern, prompt)
    if match:
        return (True, match.group(1), match.group(2))
    return (False, None, None)


# Returns the input's identifiers for the chat participants, but cannot identify who's the
# character/bot and who's You. It just returns the identifiers sorted by frequency.
def detect_names_from_input(input):
    lines = input.split('\n')
    names = []
    for line in lines:
        if ':' in line:
            name = line.split(':')[0]
            names.append(name)
    name_counter = Counter(names)
    sorted_convo_names = sorted(name_counter.items(), key=lambda x: x[1], reverse=True)
    for convo_name, count in sorted_convo_names:
        print(f"{convo_name}: {count} times")
    return sorted_convo_names


# Given an input, attempts to detect the names of the Character and the User.
# NOTE: it's not foolproof, and only works for the default Scale prompt.
# Returns a tuple of (Character name, User name).
def detect_names(instruction, input):
    names_from_input = detect_names_from_input(input)

    combined = instruction + input
    if "next reply in a fictional chat" in combined:
        (prompt_found, prompt_character_name, prompt_you_name) = extract_names_from_known_prompt1(combined)
        known_names = dict(names_from_input)
        assert prompt_found and (prompt_character_name in known_names) and (prompt_you_name in known_names)
        return (prompt_character_name, prompt_you_name)
    else:
        print("!!!!!!!New system prompt detected! Notify you-know-who to update this script, and specify the record number written above")
        sleep(3)
        return None


# Given a chat "input" field, gives the first thing the user typed in the chat.
def find_first_user_chat(instruction, input, combined):
    start_substring = "user:"
    end_substring = "assistant:"
    start_index = input.find(start_substring)
    end_index = input.find(end_substring, start_index)

    if start_index == -1:
        print("WARNING: could not find user chat in input: ", input)
        user_chat = ""
    else:
        # We found the user's input. We decide how far to read ahead based on whether we find a reply.
        if end_index == -1:
            user_chat = input[start_index + len(start_substring):(start_index + 300)]
        else:
            user_chat = input[(start_index + len(start_substring)):end_index]

    return user_chat


# Used to discard partial entries of the same chat session.
def find_most_complete_transcript(transcripts):
    complete_transcript = {}
    counter = 1
    for transcript in transcripts:
        print("Processing record", counter)
        counter += 1
        instruction = transcript["instruction"]
        input = transcript["input"]
        combined = instruction + input
        # To identify a conversation, we use the first 50 characters as a key for comparison
        # (covers character name), as well as the first thing a user types.
        # This is not a perfect solution but it should work in most cases.
        first_user_chat = find_first_user_chat(instruction, input, combined)
        key = input[:50] + first_user_chat
        #print("Using key:", key)
        if key not in complete_transcript:
            complete_transcript[key] = transcript
        elif len(input) > len(complete_transcript[key]["input"]):
            complete_transcript[key] = transcript
    return list(complete_transcript.values())


# For now we just replace You in the prompt, if found.
def apply_post_processing(transcripts):
    retouched_transcripts = []
    counter = 0
    for transcript in transcripts:
        counter += 1
        print("Applying post-processing to extracted convo #", counter)

        instruction = transcript["instruction"]
        input = transcript["input"]
        response = transcript["response"]

        # Replace You with User
        change_made_in_instruction, new_instruction = replace_You_in_prompt(transcript["instruction"])
        if change_made_in_instruction:
            instruction = new_instruction
        else:
            change_made_in_input, new_input = replace_You_in_prompt(transcript["input"])
            if change_made_in_input:
                input = new_input
            else:
                print("WARNING: did not find expected system prompt in this record! Tell your developer the filename and record number. Pausing briefly.")
                sleep(3)

        # Replace "assistant:" entries with the character's name
        character_name = None
        prompt_found, character_name_1, _ = extract_names_from_known_prompt1(instruction)
        if prompt_found:
            character_name = character_name_1
        else:
            prompt_found, character_name_2, _ = extract_names_from_known_prompt1(input)
            if prompt_found:
                character_name = character_name_2
        if character_name:
            count = input.count("\nassistant:")
            input = input.replace("\nassistant:", "\n" + character_name + ":")
            if count > 0:
                print(count, "instances of 'assistant' replaced with", character_name)

        transcript["instruction"] = instruction
        transcript["input"] = input
        transcript["response"] = response

        retouched_transcripts.append(transcript)

    return retouched_transcripts


def save_json(file_path, data):
    with open(file_path, 'w') as file:
        json.dump(data, file, indent=2)


if __name__ == "__main__":
    print("Running in", os.path.dirname(os.path.realpath(__file__)))

    parser = argparse.ArgumentParser(description='Deduplicate a JSON array of chatlogs.')
    parser.add_argument('input_file', type=str, nargs='?', default='./kuru.json', help='Path to the input file.')
    args = parser.parse_args()
    input_file = args.input_file

    input_file_name, ext = os.path.splitext(input_file)

    print("Processing", input_file)
    transcripts = load_json(input_file)
    complete_transcripts = find_most_complete_transcript(transcripts)
    print("Discarded", (len(transcripts) - len(complete_transcripts)), "partial transcripts")
    retouched_transcripts = apply_post_processing(complete_transcripts)

    counter = 0
    for retouched_transcript in retouched_transcripts:
        counter += 1
        output_file = input_file_name + '.convo.' + str(counter) + ext
        save_json(output_file, retouched_transcript)
        print("Saved to", output_file)

    print("Done processing", input_file, "\n==============")
--------------------------------------------------------------------------------
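
For reference, the MinHash-based pruning described in the README (keep a record only if it differs by more than 30% from every record already kept, i.e. estimated similarity below 0.7) can be sketched as follows. This is a simplified, self-contained illustration with hypothetical helper names, not the actual tool used to build these datasets:

```python
import hashlib

def minhash_signature(text, num_perm=64):
    # Shingle the text into word 3-grams, then for each of num_perm seeded
    # hash functions keep the minimum hash value over all shingles.
    words = text.lower().split()
    shingles = {" ".join(words[i:i + 3]) for i in range(max(1, len(words) - 2))}
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles
        ))
    return sig

def estimated_similarity(sig_a, sig_b):
    # The fraction of matching signature slots approximates Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def prune_near_duplicates(records, threshold=0.7):
    # Keep a record only if its estimated similarity to every record kept so
    # far stays below the threshold, i.e. it differs by more than
    # (1 - threshold) from everything already retained.
    kept, kept_sigs = [], []
    for rec in records:
        sig = minhash_signature(rec)
        if all(estimated_similarity(sig, s) < threshold for s in kept_sigs):
            kept.append(rec)
            kept_sigs.append(sig)
    return kept
```

A production pipeline would typically use an LSH index rather than this quadratic all-pairs comparison, but the retained/pruned decision at a 0.7 threshold is the same.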