├── README.md
└── ft.py

/README.md:
--------------------------------------------------------------------------------
# ez_finetune

## Why It's Useful
Imagine you have a folder full of blog posts and you want to fine-tune a model to write like you. You'd have to write an example prompt corresponding to each blog post and then reformat everything. `ez_finetune` does both of those steps for you.

Another good use case is when you need an AI feature to always respond in a specific format.

`ez_finetune` is a Python application designed to simplify the process of creating fine-tuning training data for AI language models. It targets OpenAI's fine-tuning spec and is useful for custom-training models such as GPT-4 on specific domains or styles without manually crafting a query for each piece of training content.

## What the Script Does
The `ez_finetune` script takes a desired fine-tuning persona (a system prompt) and a folder of documents as input. It then performs the following steps:
1. **User Query Generation**: For each file in the folder, the script generates a plausible user query that might have produced the file's content. This is done by using a large language model to reverse-engineer a query from the document.
2. **Formatting**: The script formats the output to be compatible with OpenAI's fine-tuning requirements. Each file's content is paired with the generated query and the provided persona, formatted as a JSON object suitable for training AI language models.

## Usage
To use `ez_finetune`, provide the persona and the path to the folder containing the documents for fine-tuning. You can also specify the output file where the training data will be written.
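Each line of the output file is one chat-format training record. A minimal sketch of what a single record looks like (the content strings here are invented purely for illustration):

```python
import json

# A hypothetical single training record in OpenAI's chat fine-tuning format.
# The persona goes in the system message, the generated query in the user
# message, and the original document in the assistant message.
record = {
    "messages": [
        {"role": "system", "content": "You write in the author's blog voice."},
        {"role": "user", "content": "Write a poem about sharks"},
        {"role": "assistant", "content": "Sleek shadows glide beneath the waves..."},
    ]
}

# Each record is serialized as a single line of the .jsonl output file.
print(json.dumps(record))
```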
Example command:
```shell
python ft.py --persona "Your desired fine-tuning persona" --folder "/path/to/folder" --output "/path/to/output.jsonl"
```

## Future Feature Ideas
- [ ] Automated persona creation based on desired output
- [ ] Support for different file formats (e.g., PDF, DOCX)
- [ ] Automated validation of generated queries to ensure relevance
--------------------------------------------------------------------------------
/ft.py:
--------------------------------------------------------------------------------
import openai
from packaging import version

required_version = version.parse("1.1.1")
current_version = version.parse(openai.__version__)

if current_version < required_version:
    raise ValueError(f"Error: OpenAI version {openai.__version__}"
                     " is less than the required version 1.1.1")
else:
    print("OpenAI version is compatible.")

import os
import json

from openai import OpenAI


class FineTuningDataGenerator:
    def __init__(self, persona, folder_path):
        self.persona = persona
        self.folder_path = folder_path
        self.client = OpenAI()

    def generate_query(self, file_content):
        # Ask the model to reverse-engineer a plausible user query
        # that could have produced this document as its output.
        response = self.client.chat.completions.create(
            model="gpt-4-1106-preview",
            messages=[
                {"role": "system", "content": (
                    "You are a query generator. For given user content, you "
                    "will write a query that might have resulted in the "
                    "content as output. For example, if the user's content "
                    "is a poem about sharks, you would output 'Write a poem "
                    "about sharks'."
                )},
                {"role": "user", "content": file_content}
            ]
        )
        return response.choices[0].message.content.strip()

    def process_file(self, file_path):
        with open(file_path, 'r') as file:
            file_content = file.read()
        query = self.generate_query(file_content)
        return {
            "messages": [
                {"role": "system", "content": self.persona},
                {"role": "user", "content": query},
                {"role": "assistant", "content": file_content}
            ]
        }

    def generate_data(self):
        training_data = []
        for file_name in os.listdir(self.folder_path):
            file_path = os.path.join(self.folder_path, file_name)
            if os.path.isfile(file_path):
                data = self.process_file(file_path)
                training_data.append(data)
        return training_data


def main(persona, folder_path, output_file):
    generator = FineTuningDataGenerator(persona, folder_path)
    training_data = generator.generate_data()
    # Write one JSON object per line (JSONL), as required for fine-tuning uploads.
    with open(output_file, 'w') as f:
        for data in training_data:
            f.write(json.dumps(data) + '\n')


if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(description="Generate fine-tuning data for GPT models.")
    parser.add_argument("--persona", required=True, help="The fine-tuning persona/system prompt.")
    parser.add_argument("--folder", required=True, help="The folder containing documents for fine-tuning.")
    parser.add_argument("--output", required=True, help="The file to write the training data to.")
    args = parser.parse_args()
    main(args.persona, args.folder, args.output)
--------------------------------------------------------------------------------