├── .gitignore
├── README.md
├── alt_app.py
├── app.py
├── dataset_uploader.py
├── requirements.txt
└── templates
    ├── dpo_form.html
    ├── index.html
    ├── main.js
    └── sft_form.html

/.gitignore:
--------------------------------------------------------------------------------
1 | /README_files/
2 | /venv/
3 | README.html
4 | dpo_data.json
5 | sft_data.json
6 | __pycache__/
7 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # ShareGPT Builder
2 | 
3 | ShareGPT Builder is a versatile Gradio application that provides two key functionalities for building training data for Large Language Models (LLMs).
4 | 
5 | The application is designed to run locally, and submitted examples are stored in the application's directory, but it can also be served as a web application to anyone.
6 | 
7 | ### Supervised Fine-Tuning (SFT) Conversation Sample Builder:
8 | 
9 | First, it allows you to manually construct and store ShareGPT-formatted conversations involving system, human, and GPT roles, or standard-format conversations involving system, user, and assistant roles. These conversations are automatically uploaded to the Hugging Face Hub.
10 | 
11 | For datasets using this format, refer to the [Hermes 2.5 Dataset here](https://huggingface.co/datasets/teknium/OpenHermes-2.5).
12 | 
13 | *(screenshot of the SFT sample builder)*
14 | 
15 | 
16 | ### Direct Preference Optimization (DPO) RLHF Sample Builder:
17 | Second, the application includes a DPO sample builder. This feature enables the creation of paired comparison responses (chosen vs. rejected) for Reinforcement Learning from Human Feedback (RLHF). This data is automatically uploaded to the Hub in the Intel NeuralChat DPO format.
18 | 
19 | *(screenshot of the DPO sample builder)*
20 | 
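For reference, a single SFT sample stored in the ShareGPT format looks roughly like the sketch below (illustrative values; the app may attach extra metadata fields such as the contributor name):

```json
{
  "conversations": [
    {"from": "system", "value": "You are a helpful assistant."},
    {"from": "human", "value": "What is the capital of France?"},
    {"from": "gpt", "value": "The capital of France is Paris."}
  ]
}
```

And a DPO sample in the Intel NeuralChat format is, roughly, a flat record pairing one prompt with a preferred and a rejected answer (this sketch matches the fields written by the bundled Flask app in `alt_app.py`; the Gradio app stores full message lists instead):

```json
{
  "system": "You are a helpful assistant.",
  "question": "What is the capital of France?",
  "chosen": "The capital of France is Paris.",
  "rejected": "France does not have a capital."
}
```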
21 | ### Datasets inspector
22 | In this tab you can inspect all of your uploaded datasets. Since the datasets are not uploaded in real time and there is an interval between commits, you may have to wait a little until the upload finishes and until the Hugging Face dataset viewer has processed the newly committed data.
23 | 
24 | *(screenshot of the dataset inspector)*
25 | 
26 | ## Installation
27 | 
28 | 1. Clone the repository:
29 | ```bash
30 | git clone https://github.com/teknium1/sharegpt-builder.git
31 | ```
32 | 
33 | 2. Navigate to the project directory:
34 | ```bash
35 | cd sharegpt-builder
36 | ```
37 | 
38 | 3. Install the required Python packages:
39 | ```bash
40 | pip install -r requirements.txt
41 | ```
42 | 
43 | 4. Log in with a Hugging Face token that has write access, if you aren't already logged in:
44 | ```bash
45 | huggingface-cli login
46 | ```
47 | 
48 | ## Usage
49 | 
50 | 1. Run the Gradio application:
51 | ```bash
52 | python app.py
53 | ```
54 | 
55 | 2. Open your web browser and navigate to `http://127.0.0.1:7860/`.
56 | 
57 | 3. You will find tabs for SFT and DPO (plus a dataset inspector); navigate to the one you want to contribute to.
58 | 
59 | 4. To add more turns to the conversation, fill in the text field and press **↳ Enter**.
60 | 
61 | 5. After adding all the turns, click `save chat` to upload the conversation.
62 | 
63 | 6. The uploaded conversations can be viewed directly on the Hub.
64 | 
65 | ## Contributing
66 | Contributions are welcome and greatly appreciated! Every little bit helps, and credit will always be given.
67 | 
68 | * `12/17/2024`: Thanks to [not-lain](https://github.com/not-lain) for fixing the ShareGPT template as well as adding the dataset viewer tab
69 | * `12/13/2024`: Thanks to [aldryss](https://github.com/aldryss) for updating the UI 🔥
70 | * `12/12/2024`: Thanks to [not-lain](https://github.com/not-lain) for the help switching from Flask to Gradio and supporting automatic dataset upload 🔥
71 | 
72 | 
73 | Here are ways to contribute:
74 | 
75 | 1. Check for open issues or open a fresh issue to start a discussion around a feature idea or a bug.
76 | 2. Fork the repository on GitHub and start making your changes in a new branch.
77 | 3. Write a test which shows that the bug was fixed or that the feature works as expected.
78 | 4. Send a pull request and bug the maintainer until it gets merged and published.
79 | 
80 | Alternatively, you can contribute by submitting bugs or feature requests to the issues tab.
81 | 
82 | ## Note
83 | 
84 | The application is set to run in debug mode. For production use, make sure to turn off debug mode in `app.py`.
85 | 
86 | ## License
87 | 
88 | This project is licensed under the terms of the MIT license.
89 | 
--------------------------------------------------------------------------------
/alt_app.py:
--------------------------------------------------------------------------------
1 | from flask import Flask, render_template, request, redirect, url_for
2 | import json, os
3 | 
4 | app = Flask(__name__)
5 | 
6 | def clean_entry(entry):
7 |     entry = entry.strip().replace("\r", "").replace(" \n", "\n")
8 |     return entry
9 | 
10 | # Route for index/main page
11 | @app.route('/', defaults={'active_tab': 'sft'})
12 | @app.route('/<active_tab>')
13 | def index(active_tab):
14 |     return render_template('index.html', active_tab=active_tab)
15 | 
16 | # Route for the SFT Dataset Builder; the expected form payload is sketched below.
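# As a sketch of the contract between this route and templates/sft_form.html
# (field names inferred from the handler below, not from the template itself),
# the form is assumed to POST:
#   system - the system prompt, a single string
#   user[] - one entry per human turn
#   gpt[]  - one entry per assistant turn, parallel to user[]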
17 | @app.route('/sft', methods=['GET', 'POST']) 18 | def form(): 19 | if request.method == 'POST': 20 | # Extract form data 21 | system_prompt = request.form.get('system') 22 | user_prompts = request.form.getlist('user[]') 23 | gpt_responses = request.form.getlist('gpt[]') 24 | 25 | # Clean the system prompt, user prompts, and gpt responses 26 | system_prompt = clean_entry(system_prompt) 27 | user_prompts = [clean_entry(prompt) for prompt in user_prompts] 28 | gpt_responses = [clean_entry(response) for response in gpt_responses] 29 | 30 | # Data to be appended 31 | data_to_append = { 32 | 'conversations': [ 33 | { 34 | 'from': 'system', 35 | 'value': system_prompt 36 | } 37 | ], 38 | 'source': 'manual' 39 | } 40 | 41 | # Add turns to the conversation 42 | for user_prompt, gpt_response in zip(user_prompts, gpt_responses): 43 | data_to_append['conversations'].append({ 44 | 'from': 'human', 45 | 'value': user_prompt 46 | }) 47 | data_to_append['conversations'].append({ 48 | 'from': 'gpt', 49 | 'value': gpt_response 50 | }) 51 | 52 | # File path 53 | file_path = './sft_data.json' 54 | 55 | # Check if file exists and append data 56 | if os.path.exists(file_path): 57 | with open(file_path, 'r+', encoding='utf-8') as file: 58 | data = json.load(file) 59 | data.append(data_to_append) 60 | file.seek(0) 61 | json.dump(data, file, indent=4) 62 | else: 63 | with open(file_path, 'w', encoding='utf-8') as file: 64 | json.dump([data_to_append], file, indent=4) 65 | 66 | return redirect(url_for('index')) 67 | return redirect(url_for('index')) 68 | 69 | # Route for the DPO dataset builder 70 | @app.route('/dpo', methods=['GET', 'POST']) 71 | def dpo_form(): 72 | if request.method == 'POST': 73 | # Extract form data 74 | system_prompt = request.form.get('system') 75 | prompt = request.form.get('prompt') 76 | chosen = request.form.get('chosen') 77 | rejected = request.form.get('rejected') 78 | 79 | # Data to be appended 80 | data_to_append = { 81 | 'system': clean_entry(system_prompt), 82 | 'question': clean_entry(prompt), 83 | 'chosen': clean_entry(chosen), 84 | 'rejected': clean_entry(rejected), 85 | 'source': 'manual' 86 | } 87 | 88 | # File path 89 | file_path = './dpo_data.json' 90 | 91 | # Check if file exists and append data 92 | if os.path.exists(file_path): 93 | with open(file_path, 'r+', encoding='utf-8') as file: 94 | data = json.load(file) 95 | data.append(data_to_append) 96 | file.seek(0) 97 | json.dump(data, file, indent=4) 98 | else: 99 | with open(file_path, 'w', encoding='utf-8') as file: 100 | json.dump([data_to_append], file, indent=4) 101 | 102 | return "Success", 200 103 | return render_template('index.html', active_tab='dpo') 104 | 105 | if __name__ == '__main__': 106 | app.run(debug=True, port=7272) 107 | -------------------------------------------------------------------------------- /app.py: -------------------------------------------------------------------------------- 1 | import gradio as gr 2 | from huggingface_hub import whoami 3 | import datetime 4 | from dataset_uploader import ParquetScheduler 5 | 6 | ########## 7 | # Setup # 8 | ########## 9 | 10 | contributor_username = whoami()["name"] 11 | 12 | # only show an info the first time uploading to the hub 13 | show_info = True 14 | 15 | every = 1 # we push once every 1 minute (use 5 if there are lots of people using the same HF token) 16 | 17 | choices = ["sharegpt","standard"] 18 | 19 | # schedulers 20 | schedulers = { 21 | "sft-sharegpt": ParquetScheduler(repo_id=f"{contributor_username}/sft-sharegpt", every=every), 22 | 
"sft-standard": ParquetScheduler(repo_id=f"{contributor_username}/sft-standard", every=every), 23 | "dpo-sharegpt": ParquetScheduler(repo_id=f"{contributor_username}/dpo-sharegpt", every=every), 24 | "dpo-standard": ParquetScheduler(repo_id=f"{contributor_username}/dpo-standard", every=every), 25 | } 26 | 27 | 28 | ########## 29 | # Utils # 30 | ########## 31 | 32 | 33 | def chat_message(role, content, prompt_type=None): 34 | """ 35 | A function that transforms the chat content into a chat message 36 | Args: 37 | role: A string, either "user" or "assistant" 38 | content: A string, the content of the message 39 | prompt_type: A string, either "standard" or "sharegpt" 40 | Returns: 41 | A dictionary, the message to be sent to the chatbot. 42 | """ 43 | if prompt_type == "sharegpt": 44 | if role == "user": 45 | role = "human" 46 | elif role == "assistant": 47 | role = "gpt" 48 | # sharegpt chat format 49 | return {"from": role, "value": content} 50 | else: 51 | return {"role": role, "content": content} 52 | 53 | 54 | def chat(prompt: str, history=[]): 55 | """ 56 | A function that generates a response to a given prompt. 57 | Args: 58 | prompt: A string, the prompt to be sent to the chatbot. 59 | history: A list of dictionaries, each dictionary being a message from the user or the assistant. 60 | Returns: 61 | A generator in the form of a single updated list of dictionaries, being a list of messages from the user and assistant 62 | """ 63 | if history == [] or (len(history) > 1 and history[-1]["role"] == "assistant"): 64 | history.append(chat_message("user", prompt)) 65 | else: 66 | history.append(chat_message("assistant", prompt)) 67 | return history 68 | 69 | 70 | def clear_textbox_field(): 71 | """ 72 | A function that clears the textbox field. 73 | """ 74 | return None 75 | 76 | 77 | def clear_both_fields(): 78 | """ 79 | A function that clears both the textbox and the chatbot. 80 | """ 81 | return None, None 82 | 83 | 84 | def clear_3_fields(): 85 | """ 86 | A function that clears both the textbox and the chatbot. 87 | """ 88 | return None, None, None 89 | 90 | 91 | def setup_submission(system_prompt="", history=[], chat_format="sharegpt"): 92 | # removes the extra metadata field from the chat history and format sharegpt accordingly 93 | for i in range(len(history)): 94 | sample = history[i] 95 | history[i] = chat_message( 96 | sample["role"], sample["content"], prompt_type=chat_format 97 | ) 98 | 99 | # add system prompt if provided 100 | system_prompt = system_prompt.strip() 101 | if system_prompt != "": 102 | sys = chat_message("system", system_prompt, prompt_type=chat_format) 103 | history.insert(0, sys) 104 | 105 | return history 106 | 107 | 108 | def save_sft_data(system_prompt="", history=[], sft_chat_format="sharegpt"): 109 | """ 110 | A function that pushes the data to the hub. 
111 |     """
112 | 
113 |     # setup the info message to only show once
114 |     global show_info
115 |     scheduler = schedulers[f"sft-{sft_chat_format}"]
116 | 
117 |     # case user clicked submit and did not have any chat history
118 |     if history == []:
119 |         raise gr.Error("you need to set up a chat first")
120 | 
121 |     # case history ends with a user prompt
122 |     if history[-1]["role"] == "user":
123 |         raise gr.Error("history needs to end with an assistant response")
124 | 
125 |     history = setup_submission(system_prompt, history, sft_chat_format)
126 |     # preparing the submission
127 |     data = {"contributor": contributor_username}
128 |     data["timestamp"] = str(datetime.datetime.now(datetime.UTC))
129 |     data["chat_format"] = sft_chat_format
130 |     data["conversations"] = history
131 | 
132 |     # submitting the data
133 |     scheduler.append(data)
134 | 
135 |     # show the info message only once
136 |     if show_info:
137 |         gr.Info("Data has been saved successfully (this message is only shown once)")
138 |         gr.Info(
139 |             "The scheduler may take up to 1 minute to push the data, please wait 🤗"
140 |         )
141 |         show_info = False
142 | 
143 | 
144 | def save_dpo_data(
145 |     system_prompt="", history=[], chosen="", rejected="", dpo_chat_format="sharegpt"
146 | ):
147 |     """
148 |     A function that pushes the DPO data to the hub.
149 |     """
150 | 
151 |     # setup the info message to only show once
152 |     global show_info
153 |     scheduler = schedulers[f"dpo-{dpo_chat_format}"]
154 | 
155 |     # case user clicked submit and did not have any chat history
156 |     if history == []:
157 |         raise gr.Error("you need to set up a chat first")
158 | 
159 |     # case history ends with an assistant response
160 |     if history[-1]["role"] == "assistant":
161 |         raise gr.Error("history needs to end with a user prompt")
162 | 
163 |     # case chosen and rejected are not filled in
164 |     chosen, rejected = chosen.strip(), rejected.strip()
165 |     if chosen == "" or rejected == "":
166 |         raise gr.Error(
167 |             "both chosen and rejected need to contain text when you click the submit button"
168 |         )
169 | 
170 |     history = setup_submission(system_prompt, history, dpo_chat_format)
171 |     chosen_chat, rejected_chat = history.copy(), history.copy()
172 |     chosen_chat.append(chat_message("assistant", chosen, dpo_chat_format))  # chosen/rejected are model responses
173 |     rejected_chat.append(chat_message("assistant", rejected, dpo_chat_format))
174 | 
175 |     # preparing the submission
176 |     data = {"contributor": contributor_username}
177 | 
178 |     data["timestamp"] = str(datetime.datetime.now(datetime.UTC))
179 |     data["chat_format"] = dpo_chat_format
180 |     data["prompt"] = history
181 |     data["chosen"] = chosen_chat
182 |     data["rejected"] = rejected_chat
183 | 
184 |     # submitting the data
185 |     scheduler.append(data)
186 | 
187 |     # show the info message only once
188 |     if show_info:
189 |         gr.Info("Data has been saved successfully (this message is only shown once)")
190 |         gr.Info(
191 |             "The scheduler may take up to 1 minute to push the data, please wait 🤗"
192 |         )
193 |         show_info = False
194 | 
195 | 
196 | def undo_chat(history):
197 |     return history[:-2]
198 | 
199 | 
200 | ##############
201 | # Interface  #
202 | ##############
203 | 
204 | with gr.Blocks() as demo:
205 |     gr.Markdown("<h1><center>ShareGPT-Builder</center></h1>")
") 206 | 207 | #### SFT #### 208 | with gr.Tab("SFT"): 209 | with gr.Accordion("system prompt", open=False): 210 | system_prompt = gr.TextArea(show_label=False, container=False) 211 | sft_chat_format = gr.Radio(choices=choices, value="sharegpt") 212 | 213 | chatbot = gr.Chatbot( 214 | type="messages", show_copy_button=True, show_copy_all_button=True 215 | ) 216 | textbox = gr.Textbox(show_label=False, submit_btn=True) 217 | textbox.submit( 218 | fn=chat, inputs=[textbox, chatbot], outputs=[chatbot] 219 | ).then( # empty field for convinience 220 | clear_textbox_field, outputs=[textbox] 221 | ) 222 | chatbot.undo(undo_chat, inputs=chatbot, outputs=chatbot) 223 | with gr.Row(): 224 | clear_button = gr.Button("Clear") 225 | clear_button.click(clear_both_fields, outputs=[textbox, chatbot]) 226 | submit = gr.Button("save chat", variant="primary") 227 | submit.click( 228 | save_sft_data, inputs=[system_prompt, chatbot, sft_chat_format] 229 | ).then(clear_both_fields, outputs=[textbox, chatbot]) 230 | 231 | #### DPO #### 232 | with gr.Tab("DPO"): 233 | with gr.Accordion("system prompt", open=False): 234 | dpo_system_prompt = gr.TextArea(show_label=False, container=False) 235 | dpo_chat_format = gr.Radio(choices=choices, value="sharegpt") 236 | dpo_chatbot = gr.Chatbot( 237 | type="messages", show_copy_button=True, show_copy_all_button=True 238 | ) 239 | gr.Markdown( 240 | "type in either of these fields and press enter, when you are ready for the final submission fill both fields, don't press enter and click on the save chat button" 241 | ) 242 | with gr.Row(): 243 | dpo_rejected_textbox = gr.Textbox(label="rejected (or add chat)", render=True) 244 | dpo_chosen_textbox = gr.Textbox(label="chosen (or add chat)") 245 | # submit using either of these fields 246 | dpo_chosen_textbox.submit( 247 | fn=chat, inputs=[dpo_chosen_textbox, dpo_chatbot], outputs=[dpo_chatbot] 248 | ).then( # empty field for convinience 249 | clear_textbox_field, outputs=[dpo_chosen_textbox] 250 | ) 251 | dpo_rejected_textbox.submit( 252 | fn=chat, 253 | inputs=[dpo_rejected_textbox, dpo_chatbot], 254 | outputs=[dpo_chatbot], 255 | ).then( # empty field for convinience 256 | clear_textbox_field, outputs=[dpo_rejected_textbox] 257 | ) 258 | dpo_chatbot.undo(undo_chat, inputs=dpo_chatbot, outputs=dpo_chatbot) 259 | with gr.Row(): 260 | dpo_clear_button = gr.Button("Clear") 261 | dpo_clear_button.click( 262 | clear_3_fields, 263 | outputs=[dpo_chosen_textbox, dpo_rejected_textbox, dpo_chatbot], 264 | ) 265 | dpo_submit = gr.Button("save chat", variant="primary") 266 | dpo_submit.click( 267 | save_dpo_data, 268 | inputs=[ 269 | dpo_system_prompt, 270 | dpo_chatbot, 271 | dpo_chosen_textbox, 272 | dpo_rejected_textbox, 273 | dpo_chat_format, 274 | ], 275 | ).then( 276 | clear_3_fields, 277 | outputs=[dpo_chosen_textbox, dpo_rejected_textbox, dpo_chatbot], 278 | ) 279 | with gr.Tab("Inspect datasets"): 280 | dataset = gr.Dropdown(choices=list(schedulers.keys())) 281 | @gr.render(inputs=dataset) 282 | def show_dataset(dataset) : 283 | gr.HTML(f""" """) 289 | 290 | if __name__ == "__main__": 291 | demo.launch(debug=True, show_error=True) 292 | -------------------------------------------------------------------------------- /dataset_uploader.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | import tempfile 4 | import uuid 5 | from pathlib import Path 6 | from typing import Any, Dict, List, Optional, Union 7 | 8 | import pyarrow as pa 9 | import pyarrow.parquet as pq 
--------------------------------------------------------------------------------
/dataset_uploader.py:
--------------------------------------------------------------------------------
1 | import os
2 | import json
3 | import tempfile
4 | import uuid
5 | from pathlib import Path
6 | from typing import Any, Dict, List, Optional, Union
7 | 
8 | import pyarrow as pa
9 | import pyarrow.parquet as pq
10 | from huggingface_hub import CommitScheduler
11 | from huggingface_hub.hf_api import HfApi
12 | 
13 | ###################################
14 | # Parquet scheduler               #
15 | # Uploads data in parquet format  #
16 | ###################################
17 | 
18 | 
19 | class ParquetScheduler(CommitScheduler):
20 |     """
21 |     Usage: configure the scheduler with a repo id. Once started, you can add data to be uploaded to the Hub. 1 `.append`
22 |     call will result in 1 row in your final dataset.
23 | 
24 |     ```py
25 |     # Start scheduler
26 |     >>> scheduler = ParquetScheduler(repo_id="my-parquet-dataset")
27 | 
28 |     # Append some data to be uploaded
29 |     >>> scheduler.append({...})
30 |     >>> scheduler.append({...})
31 |     >>> scheduler.append({...})
32 |     ```
33 | 
34 |     The scheduler will automatically infer the schema from the data it pushes.
35 |     Optionally, you can manually set the schema yourself:
36 | 
37 |     ```py
38 |     >>> scheduler = ParquetScheduler(
39 |     ...     repo_id="my-parquet-dataset",
40 |     ...     schema={
41 |     ...         "prompt": {"_type": "Value", "dtype": "string"},
42 |     ...         "negative_prompt": {"_type": "Value", "dtype": "string"},
43 |     ...         "guidance_scale": {"_type": "Value", "dtype": "int64"},
44 |     ...         "image": {"_type": "Image"},
45 |     ...     },
46 |     ... )
47 |     ```
48 |     See https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Value for the list of
49 |     possible values.
50 |     """
51 | 
52 |     def __init__(
53 |         self,
54 |         *,
55 |         repo_id: str,
56 |         schema: Optional[Dict[str, Dict[str, str]]] = None,
57 |         every: Union[int, float] = 5,
58 |         path_in_repo: Optional[str] = "data",
59 |         repo_type: Optional[str] = "dataset",
60 |         revision: Optional[str] = None,
61 |         private: bool = False,
62 |         token: Optional[str] = None,
63 |         allow_patterns: Union[List[str], str, None] = None,
64 |         ignore_patterns: Union[List[str], str, None] = None,
65 |         hf_api: Optional[HfApi] = None,
66 |     ) -> None:
67 |         super().__init__(
68 |             repo_id=repo_id,
69 |             folder_path="dummy",  # not used by the scheduler
70 |             every=every,
71 |             path_in_repo=path_in_repo,
72 |             repo_type=repo_type,
73 |             revision=revision,
74 |             private=private,
75 |             token=token,
76 |             allow_patterns=allow_patterns,
77 |             ignore_patterns=ignore_patterns,
78 |             hf_api=hf_api,
79 |         )
80 | 
81 |         self._rows: List[Dict[str, Any]] = []
82 |         self._schema = schema
83 | 
84 |     def append(self, row: Dict[str, Any]) -> None:
85 |         """Add a new item to be uploaded."""
86 |         with self.lock:
87 |             self._rows.append(row)
88 | 
89 |     def push_to_hub(self):
90 |         # Check for new rows to push
91 |         with self.lock:
92 |             rows = self._rows
93 |             self._rows = []
94 |         if not rows:
95 |             return
96 |         print(f"Got {len(rows)} item(s) to commit.")
97 | 
98 |         # Load images + create 'features' config for datasets library
99 |         schema: Dict[str, Dict] = self._schema or {}
100 |         path_to_cleanup: List[Path] = []
101 |         for row in rows:
102 |             for key, value in row.items():
103 |                 # Infer schema (for `datasets` library)
104 |                 if key not in schema:
105 |                     schema[key] = _infer_schema(key, value)
106 | 
107 |                 # Load binary files if necessary
108 |                 if schema[key]["_type"] in ("Image", "Audio"):
109 |                     # It's an image or audio: we load the bytes and remember to clean up the file
110 |                     file_path = Path(value)
111 |                     if file_path.is_file():
112 |                         row[key] = {
113 |                             "path": file_path.name,
114 |                             "bytes": file_path.read_bytes(),
115 |                         }
116 |                         path_to_cleanup.append(file_path)
117 | 
118 |         # Complete rows if needed
119 |         for row in rows:
120 |             for feature in schema:
121 |                 if feature not in row:
122 |                     row[feature] = None
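        # (illustrative comment, added for clarity) at this point `schema` maps each
        # column to a `datasets` feature spec inferred by `_infer_schema` below, e.g.
        # {"contributor": {"_type": "Value", "dtype": "string"}, "chat_format": {"_type": "Value", "dtype": "string"}}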
123 | 
124 |         # Export items to Arrow format
125 |         table = pa.Table.from_pylist(rows)
126 | 
127 |         # Add metadata (used by datasets library)
128 |         table = table.replace_schema_metadata(
129 |             {"huggingface": json.dumps({"info": {"features": schema}})}
130 |         )
131 | 
132 |         # Write to parquet file
133 |         archive_file = tempfile.NamedTemporaryFile(delete=False)
134 |         pq.write_table(table, archive_file.name)
135 |         archive_file.close()
136 | 
137 |         # Upload
138 |         self.api.upload_file(
139 |             repo_id=self.repo_id,
140 |             repo_type=self.repo_type,
141 |             revision=self.revision,
142 |             path_in_repo=f"{uuid.uuid4()}.parquet",
143 |             path_or_fileobj=archive_file.name,
144 |         )
145 |         print("Commit completed.")
146 | 
147 |         # Cleanup
148 |         os.unlink(archive_file.name)
149 |         for path in path_to_cleanup:
150 |             path.unlink(missing_ok=True)
151 | 
152 | 
153 | def _infer_schema(key: str, value: Any) -> Dict[str, str]:
154 |     """
155 |     Infer schema for the `datasets` library.
156 | 
157 |     See https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Value.
158 |     """
159 |     # In short, if a column name contains one of these keywords,
160 |     # the column will be inferred as the corresponding type
161 |     if "image" in key:
162 |         return {"_type": "Image"}
163 |     if "audio" in key:
164 |         return {"_type": "Audio"}
165 |     if isinstance(value, bool):  # bool must be checked before int (bool is a subclass of int)
166 |         return {"_type": "Value", "dtype": "bool"}
167 |     if isinstance(value, int):
168 |         return {"_type": "Value", "dtype": "int64"}
169 |     if isinstance(value, float):
170 |         return {"_type": "Value", "dtype": "float64"}
171 |     if isinstance(value, bytes):
172 |         return {"_type": "Value", "dtype": "binary"}
173 |     # Otherwise, as a last resort, convert it to a string
174 |     return {"_type": "Value", "dtype": "string"}
175 | 
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | numpy==1.26.4
2 | huggingface_hub==0.26.2
3 | pandas==2.2.3
4 | pyarrow==18.1.0
5 | gradio==5.3.0
6 | 
--------------------------------------------------------------------------------
/templates/dpo_form.html:
--------------------------------------------------------------------------------
1 | 
2 | 
3 | 
4 | 
5 | DPO Builder
6 | 
7 | 
8 | 
9 | 
10 | 11 |
12 |
13 | 14 | 15 |
16 |
17 | 18 | 19 |
20 |
21 | 22 | 23 |
24 |
25 |
26 | 27 | 28 |
29 |
30 |
31 | 32 |
33 |
34 |
35 | 43 | 50 | 51 | -------------------------------------------------------------------------------- /templates/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | ShareGPT Builder 6 | 133 | 134 | 135 |

ShareGPT Builder

136 | 137 |
138 |
139 | 140 | 141 |
142 | 143 |
144 |
145 | blabla 146 |
147 |
148 | 149 | 150 |
151 | {% include 'sft_form.html' %} 152 |
153 | 154 |
155 | {% include 'dpo_form.html' %} 156 |
157 |
158 | 
159 | 
219 | 
220 | 
--------------------------------------------------------------------------------
/templates/main.js:
--------------------------------------------------------------------------------
1 | function submitForm(e, formId, tabName) {
2 |     e.preventDefault();
3 |     var form = document.getElementById(formId);
4 |     fetch(form.action, {
5 |         method: form.method,
6 |         body: new FormData(form)
7 |     })
8 |     .then(response => {
9 |         if (response.ok) {
10 |             // On success, open the specified tab, update the URL, and refresh the page
11 |             openTab(tabName);
12 |             history.pushState({}, "", "/");
13 |             location.reload(); // refresh the page so the submitted form is reset
14 |         } else {
15 |             alert("Form submission failed.");
16 |         }
17 |     })
18 |     .catch(error => console.error('Error:', error));
19 | }
--------------------------------------------------------------------------------
/templates/sft_form.html:
--------------------------------------------------------------------------------
1 | 
2 | 
3 | 
4 | 
5 | ShareGPT Builder
6 | 
7 | 
8 | 
9 |
10 |
11 |
12 |
13 | 14 | 15 |
16 |
17 | 18 | 19 |
20 |
21 | 22 | 23 |
24 |
25 |
26 |
27 | 28 | 29 |
30 |
31 |
32 | 59 | 60 | --------------------------------------------------------------------------------