├── README.md
└── ft.py

/README.md:
--------------------------------------------------------------------------------
# ez_finetune

## Why It's Useful
Imagine you have a folder full of blog posts and you want to fine-tune a model to write like you. You'd have to write an example prompt corresponding to each blog post and then reformat everything. `ez_finetune` does both of those steps for you.

Another good use case is when you need an AI feature to always respond in a specific format.

`ez_finetune` is a Python application designed to simplify the process of creating fine-tuning training data for AI language models. It targets OpenAI's fine-tuning spec and is useful for custom-training models such as GPT-4 on specific domains or styles without manually crafting a query for each piece of training content.

## What the Script Does
The `ez_finetune` script takes a desired fine-tuning persona (a system prompt) and a folder of documents as input. It then performs the following steps:
1. **User Query Generation**: For each file in the folder, the script generates a plausible user query that might have produced the file's content. This is done by using a large language model to reverse-engineer a query from the document.
2. **Formatting**: The script formats the output to be compatible with OpenAI's fine-tuning requirements. Each file's content is paired with the generated query and the provided persona, formatted as a JSON object suitable for training AI language models.

## Usage
To use `ez_finetune`, provide the persona and the path to the folder containing the documents for fine-tuning. You can also specify the output file where the training data will be written.
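Each line of the output file is one chat-format training record. A minimal sketch of what a single record looks like (the content strings here are invented purely for illustration):

```python
import json

# A hypothetical single training record in OpenAI's chat fine-tuning format.
# The persona goes in the system message, the generated query in the user
# message, and the original document in the assistant message.
record = {
    "messages": [
        {"role": "system", "content": "You write in the author's blog voice."},
        {"role": "user", "content": "Write a poem about sharks"},
        {"role": "assistant", "content": "Sleek shadows glide beneath the waves..."},
    ]
}

# Each record is serialized as a single line of the .jsonl output file.
print(json.dumps(record))
```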
Example command:
```shell
python ft.py --persona "Your desired fine-tuning persona" --folder "/path/to/folder" --output "/path/to/output.jsonl"
```

## Future Feature Ideas
- [ ] Automated persona creation based on desired output
- [ ] Support for different file formats (e.g., PDF, DOCX)
- [ ] Automated validation of generated queries to ensure relevance
--------------------------------------------------------------------------------
/ft.py:
--------------------------------------------------------------------------------
import openai
from packaging import version

required_version = version.parse("1.1.1")
current_version = version.parse(openai.__version__)

if current_version < required_version:
    raise ValueError(f"Error: OpenAI version {openai.__version__}"
                     " is less than the required version 1.1.1")
else:
    print("OpenAI version is compatible.")

import os
import json

from openai import OpenAI


class FineTuningDataGenerator:
    def __init__(self, persona, folder_path):
        self.persona = persona
        self.folder_path = folder_path
        self.client = OpenAI()

    def generate_query(self, file_content):
        # Ask the model to reverse-engineer a plausible user query
        # that could have produced this document as its output.
        response = self.client.chat.completions.create(
            model="gpt-4-1106-preview",
            messages=[
                {"role": "system", "content": (
                    "You are a query generator. For given user content, you "
                    "will write a query that might have resulted in the "
                    "content as output. For example, if the user's content "
                    "is a poem about sharks, you would output 'Write a poem "
                    "about sharks'."
                )},
                {"role": "user", "content": file_content}
            ]
        )
        return response.choices[0].message.content.strip()

    def process_file(self, file_path):
        with open(file_path, 'r') as file:
            file_content = file.read()
        query = self.generate_query(file_content)
        return {
            "messages": [
                {"role": "system", "content": self.persona},
                {"role": "user", "content": query},
                {"role": "assistant", "content": file_content}
            ]
        }

    def generate_data(self):
        training_data = []
        for file_name in os.listdir(self.folder_path):
            file_path = os.path.join(self.folder_path, file_name)
            if os.path.isfile(file_path):
                data = self.process_file(file_path)
                training_data.append(data)
        return training_data


def main(persona, folder_path, output_file):
    generator = FineTuningDataGenerator(persona, folder_path)
    training_data = generator.generate_data()
    # Write one JSON object per line (JSONL), as required for fine-tuning uploads.
    with open(output_file, 'w') as f:
        for data in training_data:
            f.write(json.dumps(data) + '\n')


if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(description="Generate fine-tuning data for GPT models.")
    parser.add_argument("--persona", required=True, help="The fine-tuning persona/system prompt.")
    parser.add_argument("--folder", required=True, help="The folder containing documents for fine-tuning.")
    parser.add_argument("--output", required=True, help="The file to write the training data to.")
    args = parser.parse_args()
    main(args.persona, args.folder, args.output)
--------------------------------------------------------------------------------