├── LICENSE ├── README.md ├── UltraLM ├── chat_cli.sh ├── inference_cli.py ├── recover.sh ├── util │ └── inference.py └── weight_diff.py ├── data ├── split_long.py ├── tmp.py ├── ultra_eval.json └── vllm_chatloop ├── figures ├── alpaca.png ├── compare_ultra.jpg ├── figure.png ├── meta_topic.png ├── ultra-process.png ├── ultra_logo.png └── wizard_test.jpg ├── paper └── UltraFuser-paper.pdf └── train ├── requirements.txt ├── train_bm.py ├── train_legacy ├── requirements.txt ├── template │ └── template.txt ├── train.py └── ultrachat_dataset.py └── ultrachat_dataset.py /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2023 THUNLP 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 |
6 | 7 | 8 | 9 | **Large-scale, Informative, and Diverse Multi-round Dialogue Data, and Models** 10 | 11 |

12 | UltraLM • 13 | Data Explorer • 14 | Nomic AI Atlas Explorer • 15 | Data Release • 16 | Construction Process • 17 | Paper 18 |

19 | 20 |
21 | 22 |
23 | 24 | ![Dialogues](https://img.shields.io/badge/Current\_Dialogues-1.57M-red?style=flat-square) 25 | ![Dialogues](https://img.shields.io/badge/UltraLM\_13B-Released-blue?style=flat-square) 26 | ![Dialogues](https://img.shields.io/badge/UltraChat-Released-green?style=flat-square) 27 | 28 |
29 | 30 | 31 | ## News 32 | - **❗️❗️ September 26, 2023: Releasing a large-scale, fine-grained, diverse preference dataset [UltraFeedback](https://github.com/thunlp/UltraFeedback), a new language model [UltraLM-13B-v2.0](https://huggingface.co/openbmb/UltraLM-13b-v2.0), a reward model [UltraRM](https://huggingface.co/openbmb/UltraRM-13b) and a critic model [UltraCM](https://huggingface.co/openbmb/UltraCM-13b).** 33 | - **❗️❗️ With UltraRM, a simple best-of-16 sampling achieves 92.30% (UltraLM2, 🥇 in 13B results) and 91.54% (UltraLM, 🥇 in LLaMA-1 results) win rates against text-davinci-003 on the [AlpacaEval benchmark](https://tatsu-lab.github.io/alpaca_eval/).** 34 | - August 7, 2023: Releasing the first version of UltraLM-65B. 35 | - June 28, 2023: UltraLM-13B ranks #1 among open-source models on the [AlpacaEval Leaderboard](https://tatsu-lab.github.io/alpaca_eval/)! 36 | - June 27, 2023: We release the first version of UltraLM-13B, a powerful chat language model trained on UltraChat! 37 | - April 20, 2023: Released all data; more processed and additional data are expected. 38 | - April 17, 2023: The rest of the Writing and Creation sector is released (457k). 39 | - April 12, 2023: The first part of the Writing and Creation sector is released. 40 | - April 9, 2023: Supported by [gpt4all](https://github.com/nomic-ai/gpt4all), we now also have an [Atlas explorer](https://atlas.nomic.ai/map/0ce65783-c3a9-40b5-895d-384933f50081/a7b46301-022f-45d8-bbf4-98107eabdbac). 41 | - April 8, 2023: We release a training script, taking GPT-J as an example. 42 | - April 7, 2023: The second part of Questions about the World is released. It contains 290k generated multi-round dialogues. 43 | - March 31, 2023: The first part of Questions about the World is released. It contains 280k generated multi-round dialogues. 44 | 45 | ## UltraLM 46 | 47 | UltraLM is a series of chat language models trained on UltraChat. 
Currently, we have released the 13B version, which **ranks #1** among open-source models and **ranks #4** among all models on the [AlpacaEval Leaderboard](https://tatsu-lab.github.io/alpaca_eval/) (June 28, 2023). 48 | UltraLM-13B is based upon LLaMA-13B and was trained with [BMTrain](https://github.com/OpenBMB/BMTrain). 49 | 50 | #### Download 51 | 52 | | Model | Link | Version | 53 | | ------------- | ------------- |------------- | 54 | | UltraLM-13B | [Huggingface Repo](https://huggingface.co/openbmb/UltraLM-13b) |v1.0| 55 | | UltraLM-65B | [Huggingface Repo](https://huggingface.co/openbmb/UltraLM-65b) |v1.0| 56 | | UltraLM-13B | [Huggingface Repo](https://huggingface.co/openbmb/UltraLM-13b-v2.0) |v2.0| 57 | | UltraRM-13B | [Huggingface Repo](https://huggingface.co/openbmb/UltraRM-13b) |v1.0| 58 | | UltraCM-13B | [Huggingface Repo](https://huggingface.co/openbmb/UltraCM-13b) |v1.0| 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | #### Use UltraLM 67 | 68 | - To reconstruct UltraLM, please download LLaMA-13B and our delta weights. Please use this [link](https://huggingface.co/openbmb/UltraLM-13b) if you would like to test UltraLM; we haven't evaluated weights provided by third parties. 69 | - Run the script in `/UltraLM/recover.sh` to obtain the final weights of the recovered model. 70 | - After obtaining the recovered model, replace the model path in `/UltraLM/chat_cli.sh` with your own path and run it to start chatting! 71 | 72 | *Note: Different hyper-parameters or system prompts will affect the outputs. See `/UltraLM/inference_cli.py` for our default settings.* 73 | 74 | 75 |
Performance 76 |

77 | 78 | We report three evaluations in this section: Alpaca-Eval from Stanford, Evol-instruct from Microsoft's WizardLM, and our curated evaluation set. 79 | Evaluations of modern LLMs may be biased and affected by many factors; we are also actively working on more comprehensive evaluation methods. 80 | 81 | #### Alpaca-Eval 82 | 83 | [AlpacaEval](https://tatsu-lab.github.io/alpaca_eval/) is a leaderboard specifically designed for evaluating LLMs. It ranks models by win rate against Text-Davinci-003, automatically judged by GPT-4. 84 | 85 | 86 | 87 | #### Evol-instruct 88 | 89 | This dataset is constructed with an evolutionary strategy: instructions are rewritten over multiple rounds to obtain instructions at different complexity levels. The benchmark is developed by the [WizardLM project](https://github.com/nlpxucan/WizardLM), another excellent chat language model! 90 | 91 | 92 | ![Results](figures/wizard_test.jpg) 93 | 94 | #### Our Evaluation Set 95 | 96 | We curate an evaluation set encompassing the [Vicuna Benchmark](https://lmsys.org/blog/2023-03-30-vicuna/) and an additional 300 questions and instructions generated by GPT-4. The questions/instructions cover a wide range of topics, including commonsense, world knowledge, professional knowledge (specifically physics and biology), mathematics, and writing tasks at different levels of difficulty. We use GPT-4 for evaluation. [Here](data/ultra_eval.json) is the dataset. 97 | 98 | 99 | ![Results](figures/compare_ultra.jpg) 100 | 101 | 102 |

103 |
104 | 105 |
Examples of UltraLM 106 |

107 | 108 | - Example A: Question of Common Knowledge 109 | 110 | > **User:** Which animal is known as man's best friend? 111 | 112 | > **UltraLM:** There are many animals that humans enjoy as companion animals, but one of the absolute standout answers to the question "which animal is known as a man's best friend?" is the domestically-bred dog. Domesticated dogs have been by humans' sides for tens of thousands of years, providing faithful companionship and help; guarding people and property; and lending emotional support in everything from long, lonely days to military deployments. Dogs come in an almost mind-boggling variety of breeds of various sizes, with different energy levels, personalities, and temperaments, meaning there's almost certainly a pup out there who will be a perfect fit for whoever needs it most. 113 | 114 | 115 | - Example B: Commonsense Reasoning 116 | 117 | > **User:** Which weighs more: a pound of feathers or a pound of bricks? 118 | 119 | > **UltraLM:** A pound of feathers and a pound of bricks both weigh exactly one pound. This is because one pound is a unit of weight, just as a foot is a unit of length or a second is a unit of time. We use the pound (or other unit of weight) to describe the weight of different objects, regardless of what that object is made of. So the answer to your question is that a pound of feathers and a pound of bricks both weigh the same amount. 120 | 121 | 122 | - Example C: Reasoning 123 | 124 | > **User:** If a farmer has 17 sheep and all but 9 die, how many are left? 125 | 126 | > **UltraLM:** The total number of sheep that are left is 9. 127 | 128 | 129 | 130 | - Example D: Reasoning 131 | 132 | > **User:** If a red house is made from red bricks, and a blue house is made from blue bricks, what is a greenhouse made from? 
133 | 134 | > **UltraLM:** A greenhouse is not made from green bricks, but rather from materials such as glass or plastic that allow light to pass through and create a controlled environment for plants to grow. Greenhouses can come in a variety of colors, including green, but the color of the building itself does not affect the plants growing inside. 135 | 136 | 137 | - Example E: Creation 138 | 139 | > **User:** Write a poem about electromagnetism. 140 | 141 | > **UltraLM:** Electromagnetism is a force we can't ignore, 142 | It's the reason for many things we adore. In a magnetic field, a wire will glow, 143 | And the electrons will align, both old and new. 144 | Electricity and magnetism are hand in hand, 145 | One creates a field, the other has a brand. 146 | The magnetic field can push and pull, 147 | On a wire with current, it will react in full. 148 | The attraction and repulsion it brings, 149 | Are the reason for many things. 150 | From motors to generators, and much more in between, 151 | Electromagnetism is at the root of what it means. 152 | It's a force that's all around us, 153 | From the compass to the light switch. 154 | With its power, it can do much, 155 | From a simple needle to a rocket's thrust. 156 | It's a force we can't deny, 157 | For it's a part of our lives. 158 | Electromagnetism is something to admire, 159 | For its strength and its might, it has no equal, it's sure. 160 | 161 | 162 | 163 | - Example F: Creation 164 | 165 | > **User:** Write a short story about the first second of the universe. 166 | 167 | > **UltraLM:** In the first second of the universe, darkness reigned supreme. There was no light, no sound, no shape, no substance. Nothing yet was, but there was an intense power, a raw energy waiting to be unleashed. Within infinity, the first second held the potential for creation in its grasp. And then, with a mighty bang, a bright point of light emerged into existence. 
In it, the most fundamental particles were born, their forms yet unknown as they frenetically collided in a chaotic dance of creation. Within this primordial light, every force of nature and every component of life was stirred from pure energy to physical reality. The first atoms quickly assembled, and from these, stars and galaxies began to form. Their fusion creating a hot environment that pushed back the very void, shaping the cosmos through expanding and cooling waves. In this cosmic first, no structure of experience, not even time or space, could establish its rule. For an instant, the entirety of the universe existed in an embryonic nexus, a cosmic singularity from which all future existence originated. It was a microcosm forging the macrocosm of which it ultimately becomes a part, its moment of birth a metaphor for the cycles of creation that permeate the multiverse(s). From that single point of light, an infinite universe unfolded. For centuries, eons ahead, and forevermore, it remained the most foundational blink of its unceasing journey into the depths of the Great Beyond. The first second is the beginning of everything. 168 |

169 |
170 | 171 | 172 | 173 | 174 | 175 | ## Overview of UltraChat 176 | This project aims to construct *open-source, large-scale, and multi-round* dialogue data powered by Turbo APIs to facilitate the construction of powerful language models with general conversational capability. 177 | To safeguard privacy, among other considerations, **we do not directly use any data available on the Internet as prompts**. 178 | 179 |
UltraChat is composed of three sectors 180 |

181 | 182 | - 🌏 **Questions about the World**: The dialogue data in this sector is derived from a wide range of inquiries related to concepts, entities, and objects from the real world. The topics covered are extensive, spanning areas such as technology, art, and entrepreneurship. 183 | - ✍🏻 **Writing and Creation**: The dialogue data in this sector is driven by demands for writing/creation from scratch, and encompasses any task that an AI assistant may help with in the creative process, from email composition to crafting narratives and plays, and beyond. 184 | - 📋 **Assistance on Existent Materials**: The dialogue data in this sector is generated based on existing materials, including but not limited to rewriting, continuation, summarization, and inference, covering a diverse range of topics. 185 | 186 |

188 |
189 | 190 | 191 | *Disclaimer: Although the process of building UltraChat does NOT involve any publicly available benchmark data, scaling to a certain extent may still result in some overlap with some evaluation benchmarks. We would like to emphasize again that **all the data is automatically generated (including the instructions and responses)**, and we do not insert any open benchmark data. 192 | For example, UltraChat was released (April 2023) **earlier** than AlpacaEval (May 2023). 193 | We encourage users to closely monitor such phenomena, while we are also actively considering how to evaluate LLMs more properly.* 194 |
An Example of UltraChat 195 |

196 |

197 | 198 |
199 |

200 |
201 | 202 | 203 | ## Data 204 | 205 | The dataset is intended solely for research and educational purposes and should not be construed as reflecting the opinions or views of the creators, owners, or contributors of this dataset. It is distributed under the MIT license. 206 | 207 | 208 | ### Data Release 209 | [Explore](http://39.101.77.220/) the data before downloading, or use the [Atlas explorer](https://atlas.nomic.ai/map/0ce65783-c3a9-40b5-895d-384933f50081/a7b46301-022f-45d8-bbf4-98107eabdbac). 210 | 211 | - 🤗 [Huggingface Datasets Host](https://huggingface.co/datasets/stingning/ultrachat) 212 | 213 | Direct download links: 214 | - [Questions about the World [Part I + Part II]](https://cloud.tsinghua.edu.cn/f/0a27393192ad46a5a081/?dl=1) 215 | - [Writing and Creation [Part I]](https://cloud.tsinghua.edu.cn/f/57258a87846243218a9b/?dl=1) 216 | - [Writing and Creation [Part II]](https://cloud.tsinghua.edu.cn/f/099b4dd71b82448fb7fb/?dl=1) 217 | - [Assistance on Existent Materials [Part I]](https://cloud.tsinghua.edu.cn/f/1f7abdf2d2564cb4b338/?dl=1) 218 | 219 | ### Data Format 220 | Each line in the downloaded data files is a JSON dict containing the data ID and the dialogue turns as a list. Below is an example line. 221 | 222 | ```JSON 223 | { 224 | "id": "0", 225 | "data": [ 226 | "How can cross training benefit groups like runners, swimmers, or weightlifters?", 227 | "Cross training can benefit groups like runners, swimmers, or weightlifters in the following ways: ...", 228 | "That makes sense. I've been wanting to improve my running time, but I never thought about incorporating strength training. Do you have any recommendations for specific exercises?", 229 | "Sure, here are some strength training exercises that can benefit runners: ...", 230 | "Hmm, I'm not really a fan of weightlifting though. Can I incorporate other forms of exercise into my routine to improve my running time?", 231 | "Yes, absolutely! ...", 232 | "..." 
233 | ] 234 | } 235 | 236 | ``` 237 | 238 | ## Training 239 | 240 | 241 | We provide training code to fine-tune [LLaMA](https://github.com/facebookresearch/llama) (however, we are not distributing the weights of LLaMA) on UltraChat in [`src/`](src); training is accelerated by [BMTrain](https://github.com/OpenBMB/BMTrain). 242 | 243 | - Download the released data and put it under `./data` 244 | 245 | - Run `train_bm.py`, for example: 246 | 247 | ```bash 248 | WANDB_MODE="offline" torchrun --nnodes=1 --nproc_per_node=8 --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=localhost:50003 train_bm.py --tensorboard ./ultrachat_llama_tb_2 --save_step 5000 --logging_step 100 249 | ``` 250 | 251 | We also provide a training script to fine-tune GPT-J on UltraChat in [`src/train_legacy/`](src), which is implemented with [OpenPrompt](https://github.com/thunlp/OpenPrompt). 252 | 253 | - Download the released data and put it under `./data` 254 | - Run `accelerate launch train.py` to start training 255 | 256 | ## Construction of UltraChat 257 | 258 | The general idea of UltraChat is to use separate LLMs to generate opening lines, simulate users, and respond to queries. 259 | Each sector of UltraChat has its own challenges and requires its own strategy design. 260 | We will specify the construction process once a sector of UltraChat is released. 261 | 262 |
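The user-simulator/assistant loop described above can be sketched in a few lines of Python. This is an illustrative skeleton only, not the released pipeline: the `user_sim` and `assistant` callables stand in for the two Turbo API roles and are stubbed here.

```python
import random

def simulate_dialogue(opening_line, user_sim, assistant, min_rounds=3, max_rounds=7):
    """Alternate a user-simulator model and an assistant model to build one dialogue.

    `user_sim` and `assistant` are placeholders for the two separate LLM calls.
    """
    dialogue = [opening_line]                 # the opening line is the first user turn
    n_rounds = random.randint(min_rounds, max_rounds)
    for _ in range(n_rounds - 1):
        dialogue.append(assistant(dialogue))  # assistant replies to the history so far
        dialogue.append(user_sim(dialogue))   # user model asks a follow-up
    dialogue.append(assistant(dialogue))      # close with a final assistant turn
    return dialogue

# Stub "models" for illustration only.
demo = simulate_dialogue(
    "How can cross training benefit runners?",
    user_sim=lambda history: f"Follow-up question {len(history) // 2}",
    assistant=lambda history: f"Response {len(history) // 2}",
)
```

Each produced list alternates user and assistant turns, matching the `data` list in the released JSON format.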
263 | 264 |
265 | 266 |
Questions about the World 267 |

268 | 269 | #### Meta Topics & Sub-Topics 270 | 271 | - The data is derived from 30 representative and diverse meta topics (icons are from [flaticon](https://www.flaticon.com/)) 272 | 273 |

274 | 275 |
276 | 277 | - Based on the above meta topics, we generate 1100+ subtopics for data construction 278 | - For each subtopic, we generate up to 10 specific questions. 279 | - Then we use Turbo APIs to generate new relevant questions for each of the 10 questions. We use hand-crafted prompts to instruct the model to generate a diverse set of questions covering a wide range of common concepts and objects. 280 | - For each question, we generate a 3~7-round dialogue using the two models iteratively as described above. 281 | 282 |
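As rough upper-bound arithmetic (the README says "1100+" subtopics and "up to 10" questions each, so the exact counts differ), the fan-out above yields on the order of 11k seed questions before the Turbo-API expansion step:

```python
subtopics = 1100                 # "1100+ subtopics"
questions_per_subtopic = 10      # "up to 10 specific questions"
seed_questions = subtopics * questions_per_subtopic
print(seed_questions)  # 11000 seed questions, before generating related questions
```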

283 | 284 |

285 | 286 | #### Common Real-world Entities 287 | 288 | - We gather the 10,000 most frequent named entities from Wikidata. 289 | - We generate 5 meta questions for each entity using the ChatGPT API. 290 | - For each meta question, we generate 10 more specific questions and 20 related but general questions. 291 | - We sample 200k specific questions and 250k general questions, along with the 50k meta questions, and we generate a 3~7-round dialogue for each. 292 | 293 |
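The figures in these bullets are mutually consistent, as a quick check shows (all numbers taken from the list above):

```python
entities = 10_000            # most frequent named entities from Wikidata
meta_per_entity = 5
meta_questions = entities * meta_per_entity        # the "50k meta-questions"
sampled_specific = 200_000
sampled_general = 250_000
total_seed_dialogues = sampled_specific + sampled_general + meta_questions
print(meta_questions, total_seed_dialogues)  # 50000 500000
```

So this part alone seeds roughly half a million multi-round dialogues.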

294 | 295 |
296 | 297 |
Writing and Creation 298 |

299 | 300 | - We first collect 20 types of writing, as shown below (icons are from [flaticon](https://www.flaticon.com/)). 301 | 302 |

303 | 304 |
305 | 306 | - For each type of writing, we generate 200 different instructions that ask an AI assistant to generate text material; 80% of the instructions are further expanded and detailed. 307 | - We use the generated instructions as the initial input and generate a 2~4-round dialogue for each. 308 | 309 |
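Taken together, the two steps above imply 20 × 200 = 4,000 seed instructions, about 3,200 of which get the extra expansion; as arithmetic:

```python
writing_types = 20
instructions_per_type = 200
total_instructions = writing_types * instructions_per_type  # 4000 seed instructions
expanded = int(total_instructions * 0.8)                    # "80% ... further expanded and detailed"
print(total_instructions, expanded)  # 4000 3200
```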

310 |
311 | 312 |
Assistance on Existent Materials 313 |

314 | 315 | - We extract ~100k diverse materials from the C4 dataset. 316 | - We generate up to 5 questions/instructions for each piece of material. 317 | - We combine the material with each question/instruction using a set of manually designed templates to form the initial user input that starts a dialogue with the AI assistant. 318 | - For each input, we generate a 2~4-round dialogue. 319 | 320 |
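The material-plus-instruction combination step might look like the following sketch. The template wording here is invented for illustration; the actual manually designed templates are not shown in this README.

```python
import random

# Hypothetical templates, for illustration only.
TEMPLATES = [
    "{material}\n\nBased on the material above: {instruction}",
    "Here is a passage:\n{material}\n\n{instruction}",
]

def make_initial_input(material: str, instruction: str) -> str:
    """Combine a C4 material snippet with a generated instruction to open a dialogue."""
    return random.choice(TEMPLATES).format(material=material, instruction=instruction)

opening = make_initial_input(
    "Solar panels convert sunlight into electricity.",
    "Summarize the passage in one sentence.",
)
```

The resulting string plays the role of the first user turn, after which the 2~4-round dialogue is generated as in the other sectors.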

321 |
322 | 323 | 324 | 325 | 326 | 327 | 328 | 329 | ## To Do 330 | - [x] Release the rest part of the data for Questions about the World. 331 | - [x] Continue to release the data of Writing and Creation. 332 | - [x] Continue to release the data of Assistance on Existent Materials in the future. 333 | - [x] Train a model on UltraChat and conduct in-detail analysis. Welcome to use it to train your chat model! 334 | - [ ] There will be a Chinese version of UltraChat. 335 | 336 | 337 | ## Limitations 338 | - Although UltraChat could perform promisingly on several benchmarks, it still has limitations such as hallucinations. In our experience, UltraLM could handle questions in terms of world knowledge and creation. The reasoning, math, and coding abilities still need to be explicitly enhanced. 339 | 340 | 341 | ## Citation 342 | Feel free to cite the repo if you think UltraChat is useful. 343 | 344 | ```bibtex 345 | @article{ding2023enhancing, 346 | title={Enhancing Chat Language Models by Scaling High-quality Instructional Conversations}, 347 | author={Ding, Ning and Chen, Yulin and Xu, Bokai and Qin, Yujia and Zheng, Zhi and Hu, Shengding and Liu, Zhiyuan and Sun, Maosong and Zhou, Bowen}, 348 | journal={arXiv preprint arXiv:2305.14233}, 349 | year={2023} 350 | } 351 | ``` 352 | -------------------------------------------------------------------------------- /UltraLM/chat_cli.sh: -------------------------------------------------------------------------------- 1 | python -u inference_cli.py --model_path /path/to/ultralm/recovered -------------------------------------------------------------------------------- /UltraLM/inference_cli.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from transformers import LlamaForCausalLM, LlamaTokenizer, GenerationConfig 3 | from transformers.optimization import get_linear_schedule_with_warmup 4 | from tqdm import tqdm 5 | import os 6 | import argparse 7 | from util.inference import 
generate_stream, SimpleChatIO 8 | 9 | def get_model_tokenizer(model_name_or_path): 10 | model = LlamaForCausalLM.from_pretrained(model_name_or_path, device_map="auto", torch_dtype=torch.float16) 11 | tokenizer = LlamaTokenizer.from_pretrained(model_name_or_path) 12 | return model, tokenizer 13 | 14 | 15 | def chat_loop( 16 | model, 17 | tokenizer, 18 | system_prompt: str, 19 | temperature: float = 0.7, 20 | max_new_tokens: int = 2000, 21 | chatio = SimpleChatIO(), 22 | device = "cuda", 23 | debug: bool = False 24 | ): 25 | conv = [system_prompt] 26 | 27 | while True: 28 | try: 29 | inp = chatio.prompt_for_input("User") 30 | except EOFError: 31 | inp = "" 32 | if not inp: 33 | print("exit...") 34 | break 35 | 36 | conv.append("User: " + inp.strip() + tokenizer.eos_token) 37 | 38 | 39 | prompt = "\n".join(conv) + "\nAssistant: " 40 | 41 | gen_params = { 42 | "prompt": prompt, 43 | "temperature": temperature, 44 | "max_new_tokens": max_new_tokens, 45 | "echo": False, 46 | } 47 | 48 | chatio.prompt_for_output("Assistant") 49 | with torch.inference_mode(): 50 | output_stream = generate_stream(model, tokenizer, gen_params, device) 51 | outputs = chatio.stream_output(output_stream) 52 | # NOTE: strip is important to align with the training data. 53 | conv.append("Assistant: " + outputs.strip() + tokenizer.eos_token) 54 | 55 | if debug: 56 | print("\n", {"prompt": prompt, "outputs": outputs}, "\n") 57 | 58 | 59 | if __name__ == "__main__": 60 | parser = argparse.ArgumentParser() 61 | parser.add_argument("--model_path", type=str, default="/path/to/ultralm") 62 | args = parser.parse_args() 63 | 64 | model, tokenizer = get_model_tokenizer(args.model_path) 65 | 66 | system_prompt = "User: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, very detailed, and polite answers to the user's questions." 
67 | 68 | chat_loop(model, tokenizer, system_prompt) 69 | -------------------------------------------------------------------------------- /UltraLM/recover.sh: -------------------------------------------------------------------------------- 1 | python weight_diff.py recover --path_raw /llama-13b --path_tuned /recover --path_diff /UltraLM-13b --device cpu -------------------------------------------------------------------------------- /UltraLM/util/inference.py: -------------------------------------------------------------------------------- 1 | """Inference for FastChat models.""" 2 | import abc 3 | import gc 4 | import math 5 | from typing import Iterable, Optional 6 | import sys 7 | import warnings 8 | 9 | import psutil 10 | import torch 11 | from transformers import ( 12 | AutoTokenizer, 13 | AutoModelForCausalLM, 14 | LlamaTokenizer, 15 | LlamaForCausalLM, 16 | AutoModel, 17 | AutoModelForSeq2SeqLM, 18 | T5Tokenizer, 19 | AutoConfig, 20 | ) 21 | from transformers.generation.logits_process import ( 22 | LogitsProcessorList, 23 | RepetitionPenaltyLogitsProcessor, 24 | TemperatureLogitsWarper, 25 | TopKLogitsWarper, 26 | TopPLogitsWarper, 27 | ) 28 | 29 | 30 | 31 | def prepare_logits_processor( 32 | temperature: float, repetition_penalty: float, top_p: float, top_k: int 33 | ) -> LogitsProcessorList: 34 | processor_list = LogitsProcessorList() 35 | # TemperatureLogitsWarper doesn't accept 0.0, 1.0 makes it a no-op so we skip two cases. 
36 | if temperature >= 1e-5 and temperature != 1.0: 37 | processor_list.append(TemperatureLogitsWarper(temperature)) 38 | if repetition_penalty > 1.0: 39 | processor_list.append(RepetitionPenaltyLogitsProcessor(repetition_penalty)) 40 | if 1e-8 <= top_p < 1.0: 41 | processor_list.append(TopPLogitsWarper(top_p)) 42 | if top_k > 0: 43 | processor_list.append(TopKLogitsWarper(top_k)) 44 | return processor_list 45 | 46 | 47 | # @torch.inference_mode() 48 | def generate_stream( 49 | model, tokenizer, params, device, context_len=2048, stream_interval=2 50 | ): 51 | prompt = params["prompt"] 52 | len_prompt = len(prompt) 53 | temperature = float(params.get("temperature", 1.0)) 54 | repetition_penalty = float(params.get("repetition_penalty", 1.0)) 55 | top_p = float(params.get("top_p", 1.0)) 56 | top_k = int(params.get("top_k", -1)) # -1 means disable 57 | max_new_tokens = int(params.get("max_new_tokens", 256)) 58 | stop_str = params.get("stop", None) 59 | echo = bool(params.get("echo", True)) 60 | stop_token_ids = params.get("stop_token_ids", None) or [] 61 | stop_token_ids.append(tokenizer.eos_token_id) 62 | 63 | logits_processor = prepare_logits_processor( 64 | temperature, repetition_penalty, top_p, top_k 65 | ) 66 | 67 | input_ids = tokenizer(prompt).input_ids 68 | input_echo_len = len(input_ids) 69 | output_ids = list(input_ids) 70 | 71 | if model.config.is_encoder_decoder: 72 | max_src_len = context_len 73 | else: 74 | max_src_len = context_len - max_new_tokens - 8 75 | 76 | input_ids = input_ids[-max_src_len:] 77 | 78 | if model.config.is_encoder_decoder: 79 | encoder_output = model.encoder( 80 | input_ids=torch.as_tensor([input_ids], device=device) 81 | )[0] 82 | start_ids = torch.as_tensor( 83 | [[model.generation_config.decoder_start_token_id]], 84 | dtype=torch.int64, 85 | device=device, 86 | ) 87 | 88 | past_key_values = out = None 89 | for i in range(max_new_tokens): 90 | if i == 0: 91 | if model.config.is_encoder_decoder: 92 | out = model.decoder( 93 | 
input_ids=start_ids, 94 | encoder_hidden_states=encoder_output, 95 | use_cache=True, 96 | ) 97 | logits = model.lm_head(out[0]) 98 | else: 99 | out = model(torch.as_tensor([input_ids], device=device), use_cache=True) 100 | logits = out.logits 101 | past_key_values = out.past_key_values 102 | else: 103 | if model.config.is_encoder_decoder: 104 | out = model.decoder( 105 | input_ids=torch.as_tensor([[token]], device=device), 106 | encoder_hidden_states=encoder_output, 107 | use_cache=True, 108 | past_key_values=past_key_values, 109 | ) 110 | 111 | logits = model.lm_head(out[0]) 112 | else: 113 | out = model( 114 | input_ids=torch.as_tensor([[token]], device=device), 115 | use_cache=True, 116 | past_key_values=past_key_values, 117 | ) 118 | logits = out.logits 119 | past_key_values = out.past_key_values 120 | 121 | if logits_processor: 122 | if repetition_penalty > 1.0: 123 | tmp_output_ids = torch.as_tensor([output_ids], device=logits.device) 124 | else: 125 | tmp_output_ids = None 126 | last_token_logits = logits_processor(tmp_output_ids, logits[:, -1, :])[0] 127 | else: 128 | last_token_logits = logits[0, -1, :] 129 | 130 | if device == "mps": 131 | # Switch to CPU to avoid some bugs in the mps backend. 
132 | last_token_logits = last_token_logits.float().to("cpu") 133 | 134 | if temperature < 1e-5 or top_p < 1e-8: # greedy 135 | token = int(torch.argmax(last_token_logits)) 136 | else: 137 | probs = torch.softmax(last_token_logits, dim=-1) 138 | token = int(torch.multinomial(probs, num_samples=1)) 139 | 140 | output_ids.append(token) 141 | 142 | if token in stop_token_ids: 143 | stopped = True 144 | else: 145 | stopped = False 146 | 147 | if i % stream_interval == 0 or i == max_new_tokens - 1 or stopped: 148 | if echo: 149 | tmp_output_ids = output_ids 150 | rfind_start = len_prompt 151 | else: 152 | tmp_output_ids = output_ids[input_echo_len:] 153 | rfind_start = 0 154 | 155 | output = tokenizer.decode( 156 | tmp_output_ids, 157 | skip_special_tokens=True, 158 | spaces_between_special_tokens=False, 159 | ) 160 | if stop_str: 161 | if isinstance(stop_str, str): 162 | pos = output.rfind(stop_str, rfind_start) 163 | if pos != -1: 164 | output = output[:pos] 165 | stopped = True 166 | elif isinstance(stop_str, Iterable): 167 | for each_stop in stop_str: 168 | pos = output.rfind(each_stop, rfind_start) 169 | if pos != -1: 170 | output = output[:pos] 171 | stopped = True 172 | break 173 | else: 174 | raise ValueError("Invalid stop field type.") 175 | 176 | yield { 177 | "text": output, 178 | "usage": { 179 | "prompt_tokens": input_echo_len, 180 | "completion_tokens": i, 181 | "total_tokens": input_echo_len + i, 182 | }, 183 | "finish_reason": None, 184 | } 185 | 186 | if stopped: 187 | break 188 | 189 | # finish stream event, which contains finish reason 190 | if i == max_new_tokens - 1: 191 | finish_reason = "length" 192 | elif stopped: 193 | finish_reason = "stop" 194 | else: 195 | finish_reason = None 196 | 197 | yield { 198 | "text": output, 199 | "usage": { 200 | "prompt_tokens": input_echo_len, 201 | "completion_tokens": i, 202 | "total_tokens": input_echo_len + i, 203 | }, 204 | "finish_reason": finish_reason, 205 | } 206 | 207 | # clean 208 | del 
past_key_values, out 209 | gc.collect() 210 | torch.cuda.empty_cache() 211 | 212 | 213 | class ChatIO(abc.ABC): 214 | @abc.abstractmethod 215 | def prompt_for_input(self, role: str) -> str: 216 | """Prompt for input from a role.""" 217 | 218 | @abc.abstractmethod 219 | def prompt_for_output(self, role: str): 220 | """Prompt for output from a role.""" 221 | 222 | @abc.abstractmethod 223 | def stream_output(self, output_stream): 224 | """Stream output.""" 225 | 226 | class SimpleChatIO(ChatIO): 227 | def prompt_for_input(self, role) -> str: 228 | return input(f"{role}: ") 229 | 230 | def prompt_for_output(self, role: str): 231 | print(f"{role}: ", end="", flush=True) 232 | 233 | def stream_output(self, output_stream): 234 | pre = 0 235 | for outputs in output_stream: 236 | output_text = outputs["text"] 237 | output_text = output_text.strip().split(" ") 238 | now = len(output_text) - 1 239 | if now > pre: 240 | print(" ".join(output_text[pre:now]), end=" ", flush=True) 241 | pre = now 242 | print(" ".join(output_text[pre:]), flush=True) 243 | return " ".join(output_text) -------------------------------------------------------------------------------- /UltraLM/weight_diff.py: -------------------------------------------------------------------------------- 1 | # Copyright 2023 Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 
14 | 15 | from typing import Optional, Dict 16 | 17 | import fire 18 | import torch 19 | import tqdm 20 | import transformers 21 | # from train import smart_tokenizer_and_embedding_resize 22 | 23 | def smart_tokenizer_and_embedding_resize( 24 | special_tokens_dict: Dict, 25 | tokenizer: transformers.PreTrainedTokenizer, 26 | model: transformers.PreTrainedModel, 27 | ): 28 | """Resize tokenizer and embedding. 29 | 30 | Note: This is the unoptimized version that may make your embedding size not be divisible by 64. 31 | """ 32 | num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict) 33 | model.resize_token_embeddings(len(tokenizer)) 34 | 35 | if num_new_tokens > 0: 36 | input_embeddings = model.get_input_embeddings().weight.data 37 | output_embeddings = model.get_output_embeddings().weight.data 38 | 39 | input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True) 40 | output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True) 41 | 42 | input_embeddings[-num_new_tokens:] = input_embeddings_avg 43 | output_embeddings[-num_new_tokens:] = output_embeddings_avg 44 | 45 | 46 | @torch.inference_mode() 47 | def make_diff( 48 | path_raw: str, path_tuned: str, path_diff: str, device="cuda", # "cuda" or "cpu" 49 | ): 50 | """Make the weight diff. 51 | 52 | This function is provided for full transparency about how the weight diff was created.
53 | 54 | Run: 55 | python weight_diff.py make_diff --path_raw <path_raw> --path_tuned <path_tuned> --path_diff <path_diff> 56 | """ 57 | print("load tuned model...") 58 | model_tuned: transformers.PreTrainedModel = transformers.AutoModelForCausalLM.from_pretrained( 59 | path_tuned, 60 | device_map={"": torch.device(device)}, 61 | torch_dtype=torch.float32, 62 | low_cpu_mem_usage=True, 63 | ) 64 | print("load raw model...") 65 | model_raw: transformers.PreTrainedModel = transformers.AutoModelForCausalLM.from_pretrained( 66 | path_raw, 67 | device_map={"": torch.device(device)}, 68 | torch_dtype=torch.float32, 69 | low_cpu_mem_usage=True, 70 | ) 71 | 72 | tokenizer_tuned: transformers.PreTrainedTokenizer = transformers.AutoTokenizer.from_pretrained( 73 | path_tuned 74 | ) 75 | 76 | tokenizer_raw: transformers.PreTrainedTokenizer = transformers.AutoTokenizer.from_pretrained( 77 | path_raw 78 | ) 79 | if tokenizer_raw.pad_token is None and tokenizer_tuned.pad_token is not None: 80 | print("add pad token") 81 | smart_tokenizer_and_embedding_resize( 82 | special_tokens_dict=dict(pad_token=tokenizer_tuned.pad_token), # reuse the tuned tokenizer's pad token so vocab sizes match 83 | model=model_raw, 84 | tokenizer=tokenizer_raw, 85 | ) 86 | state_dict_tuned = model_tuned.state_dict() 87 | state_dict_raw = model_raw.state_dict() 88 | for key in tqdm.tqdm(state_dict_tuned, desc="make diff"): 89 | print(key) 90 | state_dict_tuned[key].add_(-state_dict_raw[key]) 91 | print("saving diff weights...") 92 | model_tuned.save_pretrained(path_diff) 93 | kwargs = {"push_to_hub": False, "repo_id": ""} 94 | tokenizer_tuned.save_pretrained(path_diff, **kwargs) 95 | 96 | 97 | @torch.inference_mode() 98 | def recover( 99 | path_raw, 100 | path_diff, 101 | path_tuned: Optional[str] = None, 102 | device="cpu", 103 | test_inference=True, 104 | check_integrity_naively=True, 105 | ): 106 | model_raw: transformers.PreTrainedModel = transformers.AutoModelForCausalLM.from_pretrained( 107 | path_raw, 108 | device_map={"": torch.device(device)}, 109 | torch_dtype=torch.float32, 110 | low_cpu_mem_usage=True, 111 | )
112 | model_recovered: transformers.PreTrainedModel = transformers.AutoModelForCausalLM.from_pretrained( 113 | path_diff, 114 | device_map={"": torch.device(device)}, 115 | torch_dtype=torch.float32, 116 | low_cpu_mem_usage=True, 117 | ) 118 | 119 | tokenizer_raw: transformers.PreTrainedTokenizer = transformers.AutoTokenizer.from_pretrained( 120 | path_raw 121 | ) 122 | tokenizer_recovered: transformers.PreTrainedTokenizer = transformers.AutoTokenizer.from_pretrained( 123 | path_diff 124 | ) 125 | if tokenizer_raw.pad_token is None and tokenizer_recovered.pad_token is not None: 126 | smart_tokenizer_and_embedding_resize( 127 | special_tokens_dict=dict(pad_token=tokenizer_recovered.pad_token), # reuse the recovered tokenizer's pad token so vocab sizes match 128 | model=model_raw, 129 | tokenizer=tokenizer_raw, 130 | ) 131 | tokenizer_recovered: transformers.PreTrainedTokenizer = transformers.AutoTokenizer.from_pretrained( 132 | path_diff 133 | ) 134 | 135 | state_dict_recovered = model_recovered.state_dict() 136 | state_dict_raw = model_raw.state_dict() 137 | for key in tqdm.tqdm(state_dict_recovered): 138 | state_dict_recovered[key].add_(state_dict_raw[key]) 139 | 140 | if path_tuned is not None: 141 | model_recovered.save_pretrained(path_tuned) 142 | tokenizer_recovered.save_pretrained(path_tuned) 143 | 144 | if test_inference: 145 | input_text = ( 146 | "Below is an instruction that describes a task. 
" 147 | "Write a response that appropriately completes the request.\r\n\r\n" 148 | "### Instruction:\r\nList three technologies that make life easier.\r\n\r\n### Response:" 149 | ) 150 | inputs = tokenizer_recovered(input_text, return_tensors="pt") 151 | out = model_recovered.generate(inputs=inputs.input_ids, max_new_tokens=100) 152 | output_text = tokenizer_recovered.batch_decode(out, skip_special_tokens=True)[0] 153 | output_text = output_text[len(input_text) :] 154 | print(f"User: {input_text}\nAssistant: {output_text}") 155 | 156 | return model_recovered, tokenizer_recovered 157 | 158 | 159 | def main(task, **kwargs): 160 | globals()[task](**kwargs) 161 | 162 | 163 | if __name__ == "__main__": 164 | fire.Fire(main) -------------------------------------------------------------------------------- /data/split_long.py: -------------------------------------------------------------------------------- 1 | """ 2 | Modified based on https://github.com/lm-sys/FastChat/blob/main/fastchat/data/split_long_conversation.py 3 | Split long conversations based on certain max length. 4 | 5 | Usage: 6 | 1. download json data files to `./raw` 7 | 2. 
run the command below for each file 8 | python -u split_long.py --in-file ./raw/input.json --out-file ./processed/output.json --begin 0 --model-name-or-path /path/to/huggingface/llama --max-length 2048 9 | """ 10 | import argparse 11 | from concurrent.futures import ProcessPoolExecutor 12 | import json 13 | from typing import Dict, Sequence, Optional 14 | 15 | import transformers 16 | from tqdm import tqdm 17 | 18 | 19 | def make_sample(sample, start_idx, end_idx): 20 | assert (end_idx - start_idx) % 2 == 0 21 | return { 22 | "id": sample["id"] + "_" + str(start_idx), 23 | "data": sample["data"][start_idx:end_idx], 24 | } 25 | 26 | 27 | tokenizer = max_length = None 28 | 29 | 30 | def split_one_sample(sample): 31 | tokenized_lens = [] 32 | conversations = sample["data"] 33 | assert len(conversations) % 2 == 0, conversations 34 | # conversations = conversations[: len(conversations) // 2 * 2] 35 | for c in conversations: 36 | length = len(tokenizer(c).input_ids) + 6 37 | tokenized_lens.append(length) 38 | 39 | start_idx = 0 40 | cur_len = 0 41 | 42 | # if len(conversations) % 2 != 0 or len(conversations) < 2: 43 | # return [] 44 | 45 | new_samples = [] 46 | for i in range(0, len(conversations), 2): 47 | tmp_len = tokenized_lens[i] + tokenized_lens[i + 1] 48 | if cur_len + tmp_len > max_length: 49 | new_samples.append(make_sample(sample, start_idx, i)) 50 | start_idx = i 51 | cur_len = 0 52 | elif i == len(conversations) - 2: 53 | new_samples.append(make_sample(sample, start_idx, i + 2)) 54 | 55 | cur_len += tmp_len 56 | 57 | return new_samples 58 | 59 | 60 | def split_all(content, begin, end, tokenizer_, max_length_): 61 | """ 62 | Keep the maximum number of conversation rounds within the max token length constraint 63 | """ 64 | global tokenizer, max_length 65 | tokenizer = tokenizer_ 66 | max_length = max_length_ 67 | 68 | content = content[begin:end] 69 | new_content = [] 70 | 71 | with ProcessPoolExecutor() as executor: 72 | for result in
tqdm(executor.map(split_one_sample, content), total=len(content)): 73 | new_content.extend(result) 74 | 75 | return new_content 76 | 77 | def check_content(content): 78 | new_content = [] 79 | for c in content: 80 | if len(c["data"]) > 0 and len(c["data"]) % 2 == 0: 81 | new_content.append(c) 82 | return new_content 83 | 84 | 85 | def main(args): 86 | content = [json.loads(l) for l in open(args.in_file, "r")] 87 | tokenizer = transformers.AutoTokenizer.from_pretrained( 88 | args.model_name_or_path, 89 | model_max_length=args.max_length, 90 | padding_side="right", 91 | use_fast=False, 92 | ) 93 | new_content = split_all(content, args.begin, args.end, tokenizer, args.max_length) 94 | new_content = check_content(new_content) 95 | 96 | print(f"total: {len(content)}, new: {len(new_content)}") 97 | with open(args.out_file, "w") as f: 98 | f.writelines("\n".join([json.dumps(l) for l in new_content])) 99 | 100 | 101 | if __name__ == "__main__": 102 | parser = argparse.ArgumentParser() 103 | parser.add_argument("--in-file", type=str, required=True) 104 | parser.add_argument("--out-file", type=str, default="sharegpt_split.json") 105 | parser.add_argument("--begin", type=int) 106 | parser.add_argument("--end", type=int) 107 | parser.add_argument("--model-name-or-path", type=str, required=True) 108 | parser.add_argument("--max-length", type=int, default=2048) 109 | args = parser.parse_args() 110 | main(args) -------------------------------------------------------------------------------- /data/tmp.py: -------------------------------------------------------------------------------- 1 | import json 2 | 3 | with open('data/ultra_eval.txt', 'r') as file: 4 | lines = file.readlines() 5 | 6 | # Initialize an empty list to store our JSON objects 7 | json_list = [] 8 | 9 | # Loop through each line of the file 10 | for line in lines: 11 | # Check if the line starts and ends with "--" 12 | if line.startswith("--") and line.endswith("--\n"): 13 | # Extract the text between the "--" 14 | class_name = line.strip("--\n") 15 | else: 16 | # Use the entire line as the value for the "data" key 17 | data = line.strip() 18 | # Create a dictionary object with the "class" and "data" keys 19 | json_obj = {"class": class_name, "data": data} 20 | # Append the dictionary object to our list 21 | json_list.append(json_obj) 22 | 23 | # Convert the list of dictionary objects to a JSON string 24 | json_str = json.dumps(json_list) 25 | 26 | # Print the JSON string 27 | print(json_str) -------------------------------------------------------------------------------- /data/ultra_eval.json: -------------------------------------------------------------------------------- 1 | [ 2 | { 3 | "class": " Vicuna ", 4 | "data": "Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions." 5 | }, 6 | { 7 | "class": " Vicuna ", 8 | "data": "Can you help me write a formal email to a potential business partner proposing a joint venture?" 9 | }, 10 | { 11 | "class": " Vicuna ", 12 | "data": "Can you help me write a resignation letter to my current employer, while leaving on good terms and expressing gratitude for the opportunities provided?" 13 | }, 14 | { 15 | "class": " Vicuna ", 16 | "data": "Use an appropriate format to structure a formal letter of recommendation for a student applying to a prestigious graduate program in computer science." 17 | }, 18 | { 19 | "class": " Vicuna ", 20 | "data": "Write a compelling product launch announcement email to inform our customers of our new software solution." 21 | }, 22 | { 23 | "class": " Vicuna ", 24 | "data": "Draft an apology email to a customer who experienced a delay in their order, and provide reassurance that the issue has been resolved." 25 | }, 26 | { 27 | "class": " Vicuna ", 28 | "data": "Write a script for a YouTube video exploring the history and cultural significance of jazz."
29 | }, 30 | { 31 | "class": " Vicuna ", 32 | "data": "Write a captivating movie review for a recently released science fiction film, discussing its plot, characters, and special effects." 33 | }, 34 | { 35 | "class": " Vicuna ", 36 | "data": "Structure a podcast script for an episode discussing the influence of streaming platforms on the music industry." 37 | }, 38 | { 39 | "class": " Vicuna ", 40 | "data": "Write a symphony concert review, discussing the orchestra's performance and overall audience experience." 41 | }, 42 | { 43 | "class": " Vicuna ", 44 | "data": "How would you introduce yourself as a medieval knight at a royal banquet?" 45 | }, 46 | { 47 | "class": " Vicuna ", 48 | "data": "As a pirate captain, what would you say to your crew to motivate them to search for hidden treasure?" 49 | }, 50 | { 51 | "class": " Vicuna ", 52 | "data": "If you were a Shakespearean character, how would you declare your love for someone in a soliloquy?" 53 | }, 54 | { 55 | "class": " Vicuna ", 56 | "data": "As a superhero, how would you explain your origin story to a curious child?" 57 | }, 58 | { 59 | "class": " Vicuna ", 60 | "data": "Imagine you are a time traveler from the year 3000. What technological advancements would you tell people about?" 61 | }, 62 | { 63 | "class": " Vicuna ", 64 | "data": "As a sports commentator, describe the winning play in the final seconds of a championship game." 65 | }, 66 | { 67 | "class": " Vicuna ", 68 | "data": "Pretend to be a world-famous chef. How would you describe your signature dish to a panel of judges?" 69 | }, 70 | { 71 | "class": " Vicuna ", 72 | "data": "You are a mountain climber reaching the summit of Mount Everest. Describe your emotions and the view from the top." 73 | }, 74 | { 75 | "class": " Vicuna ", 76 | "data": "As a space colonist on Mars, describe your daily life and the challenges you face living on another planet." 
77 | }, 78 | { 79 | "class": " Vicuna ", 80 | "data": "Pretend to be a character in a post-apocalyptic world. Describe how you survive and the allies you encounter." 81 | }, 82 | { 83 | "class": " Vicuna ", 84 | "data": "How can you determine if a restaurant is popular among locals or mainly attracts tourists, and why might this information be useful?" 85 | }, 86 | { 87 | "class": " Vicuna ", 88 | "data": "What are some subtle clues that suggest someone is pretending to understand a topic or conversation when they are actually confused or uninformed?" 89 | }, 90 | { 91 | "class": " Vicuna ", 92 | "data": "Why might someone choose to use a paper map or ask for directions instead of relying on a GPS device or smartphone app?" 93 | }, 94 | { 95 | "class": " Vicuna ", 96 | "data": "How can you determine if a person is genuinely interested in a conversation or simply being polite?" 97 | }, 98 | { 99 | "class": " Vicuna ", 100 | "data": "Why might someone prefer to shop at a small, locally-owned business instead of a large chain store, even if the prices are higher?" 101 | }, 102 | { 103 | "class": " Vicuna ", 104 | "data": "How can you assess the credibility of a source of information, such as a news article or blog post, without relying solely on the reputation of the author or publisher?" 105 | }, 106 | { 107 | "class": " Vicuna ", 108 | "data": "Why do some people enjoy the sensation of being scared, such as by watching horror movies or going on roller coasters, while others avoid these experiences?" 109 | }, 110 | { 111 | "class": " Vicuna ", 112 | "data": "How can observing the behavior of other people in a social situation provide clues about cultural norms and expectations?" 113 | }, 114 | { 115 | "class": " Vicuna ", 116 | "data": "Do we have a moral obligation to explore space, or should we focus on solving Earth's problems first?" 
117 | }, 118 | { 119 | "class": " Vicuna ", 120 | "data": "In a world where automation is becoming increasingly prevalent, is it more important to prioritize job creation or technological progress?" 121 | }, 122 | { 123 | "class": " Vicuna ", 124 | "data": "How many times does the average human blink in a lifetime? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step." 125 | }, 126 | { 127 | "class": " Vicuna ", 128 | "data": "How many atoms are in a grain of salt? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step." 129 | }, 130 | { 131 | "class": " Vicuna ", 132 | "data": "How many lightning strikes occur on Earth each day? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step." 133 | }, 134 | { 135 | "class": " Vicuna ", 136 | "data": "How many balloons would it take to lift a house like in the movie \"Up\"? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step." 137 | }, 138 | { 139 | "class": " Vicuna ", 140 | "data": "How many text messages are sent globally in a minute? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step." 141 | }, 142 | { 143 | "class": " Vicuna ", 144 | "data": "How many words are spoken daily on Earth? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step." 145 | }, 146 | { 147 | "class": " Vicuna ", 148 | "data": "How many snowflakes fall during a typical winter? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step." 149 | }, 150 | { 151 | "class": " Vicuna ", 152 | "data": "How many pages are in all the books ever written? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step." 
153 | }, 154 | { 155 | "class": " Vicuna ", 156 | "data": "How many times has the Earth orbited the Sun since the beginning of life? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step." 157 | }, 158 | { 159 | "class": " Vicuna ", 160 | "data": "How many songs have been recorded throughout history? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step." 161 | }, 162 | { 163 | "class": " Vicuna ", 164 | "data": "What if the Internet had been invented during the Renaissance period?" 165 | }, 166 | { 167 | "class": " Vicuna ", 168 | "data": "What if the Aztecs had successfully repelled the Spanish conquistadors?" 169 | }, 170 | { 171 | "class": " Vicuna ", 172 | "data": "What if the Black Death had not occurred in the 14th century?" 173 | }, 174 | { 175 | "class": " Vicuna ", 176 | "data": "What if Isaac Newton had focused on biology instead of physics?" 177 | }, 178 | { 179 | "class": " Vicuna ", 180 | "data": "What if the Beatles had never formed as a band?" 181 | }, 182 | { 183 | "class": " Vicuna ", 184 | "data": "What if Alan Turing had not cracked the Enigma code during World War II?" 185 | }, 186 | { 187 | "class": " Vicuna ", 188 | "data": "What if the Suez Canal had never been constructed?" 189 | }, 190 | { 191 | "class": " Vicuna ", 192 | "data": "What if the Maya civilization had never mysteriously collapsed?" 193 | }, 194 | { 195 | "class": " Vicuna ", 196 | "data": "What if Christopher Columbus had not discovered the Americas?" 197 | }, 198 | { 199 | "class": " Vicuna ", 200 | "data": "What if Vincent van Gogh had been a successful artist during his lifetime?" 201 | }, 202 | { 203 | "class": " Vicuna ", 204 | "data": "Develop a C++ program that reads a text file line by line and counts the number of occurrences of a specific word in the file." 
205 | }, 206 | { 207 | "class": " Vicuna ", 208 | "data": "Implement a Python function to find the longest common subsequence of two input strings using dynamic programming." 209 | }, 210 | { 211 | "class": " Vicuna ", 212 | "data": "Implement a regular expression in Python to validate an email address." 213 | }, 214 | { 215 | "class": " Vicuna ", 216 | "data": "Write a program to find the nth Fibonacci number using dynamic programming." 217 | }, 218 | { 219 | "class": " Vicuna ", 220 | "data": "Implement a binary search algorithm to find a specific element in a sorted array." 221 | }, 222 | { 223 | "class": " Vicuna ", 224 | "data": "Implement a queue data structure using two stacks in Python." 225 | }, 226 | { 227 | "class": " Vicuna ", 228 | "data": "Implement a program to find the common elements in two arrays without using any extra data structures." 229 | }, 230 | { 231 | "class": " Vicuna ", 232 | "data": "Given that f(x) = 5x^3 - 2x + 3, find the value of f(2)." 233 | }, 234 | { 235 | "class": " Vicuna ", 236 | "data": "Solve for x in the equation 3x + 10 = 5(x - 2)." 237 | }, 238 | { 239 | "class": " Vicuna ", 240 | "data": "If the endpoints of a line segment are (2, -2) and (10, 4), what is the length of the segment?" 241 | }, 242 | { 243 | "class": " Vicuna ", 244 | "data": "How can I improve my time management skills?" 245 | }, 246 | { 247 | "class": " Vicuna ", 248 | "data": "What are the most effective ways to deal with stress?" 249 | }, 250 | { 251 | "class": " Vicuna ", 252 | "data": "What are the main differences between Python and JavaScript programming languages?" 253 | }, 254 | { 255 | "class": " Vicuna ", 256 | "data": "How can I increase my productivity while working from home?" 257 | }, 258 | { 259 | "class": " Vicuna ", 260 | "data": "Can you explain the basics of quantum computing?" 261 | }, 262 | { 263 | "class": " Vicuna ", 264 | "data": "What are the differences between plant-based and animal-based protein sources?" 
265 | }, 266 | { 267 | "class": " Vicuna ", 268 | "data": "How can I develop my critical thinking skills?" 269 | }, 270 | { 271 | "class": " Vicuna ", 272 | "data": "What are the major challenges faced by the education sector today?" 273 | }, 274 | { 275 | "class": " Vicuna ", 276 | "data": "What are the primary factors that influence consumer behavior?" 277 | }, 278 | { 279 | "class": " Vicuna ", 280 | "data": "What are the most effective strategies for conflict resolution in the workplace?" 281 | }, 282 | { 283 | "class": " Vicuna ", 284 | "data": "What are some potential implications of using a single-use plastic bottle versus a reusable bottle on both the environment and human health?" 285 | }, 286 | { 287 | "class": " Vicuna ", 288 | "data": "What factors would you consider when designing an inclusive and accessible public transportation system?" 289 | }, 290 | { 291 | "class": " Vicuna ", 292 | "data": "How can governments utilize fiscal and monetary policies to combat economic recessions?" 293 | }, 294 | { 295 | "class": " Vicuna ", 296 | "data": "How do language and cultural barriers affect the way people communicate and form relationships in multicultural societies?" 297 | }, 298 | { 299 | "class": " Vicuna ", 300 | "data": "Describe a scenario where artificial intelligence could be used to improve the quality and efficiency of healthcare delivery." 301 | }, 302 | { 303 | "class": " Vicuna ", 304 | "data": "Explain the process of gene editing using CRISPR-Cas9 technology, and discuss its potential applications and ethical implications." 305 | }, 306 | { 307 | "class": " Vicuna ", 308 | "data": "How do vaccinations work to protect individuals and communities from infectious diseases, and what is herd immunity?" 309 | }, 310 | { 311 | "class": " Vicuna ", 312 | "data": "How do social media platforms influence the way people consume and share news, and what are the potential implications for the spread of misinformation?" 
313 | }, 314 | { 315 | "class": " Vicuna ", 316 | "data": "How do cultural, social, and economic factors influence people's food choices, and how can this knowledge be used to promote healthier diets?" 317 | }, 318 | { 319 | "class": " Vicuna ", 320 | "data": "Explain the process of natural selection and how it contributes to the evolution and adaptation of species." 321 | }, 322 | { 323 | "class": " Commonsense-Easy ", 324 | "data": "What is the primary source of energy for our planet?" 325 | }, 326 | { 327 | "class": " Commonsense-Easy ", 328 | "data": "What do humans breathe in?" 329 | }, 330 | { 331 | "class": " Commonsense-Easy ", 332 | "data": "Is the sun a planet or a star?" 333 | }, 334 | { 335 | "class": " Commonsense-Easy ", 336 | "data": "What do plants need to grow?" 337 | }, 338 | { 339 | "class": " Commonsense-Easy ", 340 | "data": "What is the normal state of water at room temperature?" 341 | }, 342 | { 343 | "class": " Commonsense-Easy ", 344 | "data": "What do you normally do when you are hungry?" 345 | }, 346 | { 347 | "class": " Commonsense-Easy ", 348 | "data": "How many days are there in a week?" 349 | }, 350 | { 351 | "class": " Commonsense-Easy ", 352 | "data": "What is the color of the sky on a clear day?" 353 | }, 354 | { 355 | "class": " Commonsense-Easy ", 356 | "data": "Who wrote the novel \"To Kill a Mockingbird\"?" 357 | }, 358 | { 359 | "class": " Commonsense-Easy ", 360 | "data": "Who is known as the Father of Computers?" 361 | }, 362 | { 363 | "class": " Commonsense-Easy ", 364 | "data": "What does a thermometer measure?" 365 | }, 366 | { 367 | "class": " Commonsense-Easy ", 368 | "data": "Which animal is known as man's best friend?" 369 | }, 370 | { 371 | "class": " Commonsense-Easy ", 372 | "data": "What color do you get when you mix blue and yellow?" 373 | }, 374 | { 375 | "class": " Commonsense-Easy ", 376 | "data": "What is the capital of the United States?" 
377 | }, 378 | { 379 | "class": " Commonsense-Easy ", 380 | "data": "What type of animals are dolphins?" 381 | }, 382 | { 383 | "class": " Commonsense-Easy ", 384 | "data": "Who is the author of the Harry Potter series?" 385 | }, 386 | { 387 | "class": " Commonsense-Easy ", 388 | "data": "Who painted the Mona Lisa?" 389 | }, 390 | { 391 | "class": " Commonsense-Easy ", 392 | "data": "How many zeros are there in one thousand?" 393 | }, 394 | { 395 | "class": " Commonsense-Easy ", 396 | "data": "Which planet is known as the Red Planet?" 397 | }, 398 | { 399 | "class": " Commonsense-Easy ", 400 | "data": "What is the most common gas in Earth's atmosphere?" 401 | }, 402 | { 403 | "class": " Commonsense-Easy ", 404 | "data": "How many hours are there in a day?" 405 | }, 406 | { 407 | "class": " Commonsense-Easy ", 408 | "data": "Who was the first person to walk on the moon?" 409 | }, 410 | { 411 | "class": " Commonsense-Easy ", 412 | "data": "What do bees produce?" 413 | }, 414 | { 415 | "class": " Commonsense-Easy ", 416 | "data": "Where does a bird typically live?" 417 | }, 418 | { 419 | "class": " Commonsense-Easy ", 420 | "data": "What musical instrument has 88 keys?" 421 | }, 422 | { 423 | "class": " Commonsense-Easy ", 424 | "data": "What is the largest ocean on Earth?" 425 | }, 426 | { 427 | "class": " Commonsense-Easy ", 428 | "data": "Which meal is usually eaten in the morning?" 429 | }, 430 | { 431 | "class": " Commonsense-Easy ", 432 | "data": "What is the primary language spoken in Brazil?" 433 | }, 434 | { 435 | "class": " Commonsense-Easy ", 436 | "data": "What do fish use to breathe underwater?" 437 | }, 438 | { 439 | "class": " Commonsense-Easy ", 440 | "data": "What is the boiling point of water at sea level in Celsius?" 441 | }, 442 | { 443 | "class": " Commonsense-Moderate ", 444 | "data": "What is the phenomenon that causes the change in pitch heard when a vehicle sounding a horn approaches and recedes from an observer?" 
445 | }, 446 | { 447 | "class": " Commonsense-Moderate ", 448 | "data": "Can you explain the difference between a meteor and a meteorite?" 449 | }, 450 | { 451 | "class": " Commonsense-Moderate ", 452 | "data": "What is the primary function of a capacitor in an electrical circuit?" 453 | }, 454 | { 455 | "class": " Commonsense-Moderate ", 456 | "data": "Why is it risky to look directly at a solar eclipse?" 457 | }, 458 | { 459 | "class": " Commonsense-Moderate ", 460 | "data": "How does photosynthesis benefit animals?" 461 | }, 462 | { 463 | "class": " Commonsense-Moderate ", 464 | "data": "Why is the preservation of the Amazon rainforest important for the global climate?" 465 | }, 466 | { 467 | "class": " Commonsense-Moderate ", 468 | "data": "What is the primary difference between a democracy and a dictatorship?" 469 | }, 470 | { 471 | "class": " Commonsense-Moderate ", 472 | "data": "What is the function of red blood cells in the human body?" 473 | }, 474 | { 475 | "class": " Commonsense-Moderate ", 476 | "data": "How does a rainbow form?" 477 | }, 478 | { 479 | "class": " Commonsense-Moderate ", 480 | "data": "Why does ice float on water?" 481 | }, 482 | { 483 | "class": " Commonsense-Moderate ", 484 | "data": "What is the significance of the Turing Test?" 485 | }, 486 | { 487 | "class": " Commonsense-Moderate ", 488 | "data": "What is the Pythagorean Theorem, and what is its significance?" 489 | }, 490 | { 491 | "class": " Commonsense-Moderate ", 492 | "data": "How does the greenhouse effect contribute to global warming?" 493 | }, 494 | { 495 | "class": " Commonsense-Moderate ", 496 | "data": "What is the role of the mitochondria in a cell?" 497 | }, 498 | { 499 | "class": " Commonsense-Moderate ", 500 | "data": "What is the primary function of the Internet Protocol (IP)?" 501 | }, 502 | { 503 | "class": " Commonsense-Moderate ", 504 | "data": "Explain the difference between renewable and non-renewable resources." 
505 | }, 506 | { 507 | "class": " Commonsense-Moderate ", 508 | "data": "Why do we perceive the sky as blue during the daytime?" 509 | }, 510 | { 511 | "class": " Commonsense-Moderate ", 512 | "data": "What is the difference between classical and quantum physics?" 513 | }, 514 | { 515 | "class": " Commonsense-Moderate ", 516 | "data": "How do vaccines protect against diseases?" 517 | }, 518 | { 519 | "class": " Commonsense-Moderate ", 520 | "data": "What is the Doppler effect and where is it commonly observed?" 521 | }, 522 | { 523 | "class": " Commonsense-Moderate ", 524 | "data": "Why are coral reefs important to marine ecosystems?" 525 | }, 526 | { 527 | "class": " Commonsense-Moderate ", 528 | "data": "What is the difference between RAM and ROM in a computer?" 529 | }, 530 | { 531 | "class": " Commonsense-Moderate ", 532 | "data": "How does a 4-stroke internal combustion engine work?" 533 | }, 534 | { 535 | "class": " Commonsense-Moderate ", 536 | "data": "Why is aerobic exercise important for heart health?" 537 | }, 538 | { 539 | "class": " Commonsense-Moderate ", 540 | "data": "How does the tilt of the Earth's axis affect seasons?" 541 | }, 542 | { 543 | "class": " Commonsense-Moderate ", 544 | "data": "What is the principle behind the operation of a microwave oven?" 545 | }, 546 | { 547 | "class": " Commonsense-Moderate ", 548 | "data": "How does a prism disperse white light into its component colors?" 549 | }, 550 | { 551 | "class": " Commonsense-Moderate ", 552 | "data": "What are the primary differences between DNA and RNA?" 553 | }, 554 | { 555 | "class": " Commonsense-Moderate ", 556 | "data": "How does the circulatory system and respiratory system work together in the human body?" 557 | }, 558 | { 559 | "class": " Commonsense-Moderate ", 560 | "data": "What is the basic principle of operation of a solar panel?" 561 | }, 562 | { 563 | "class": " World Knowledge-Easy ", 564 | "data": "What is the capital of Australia?" 
565 | }, 566 | { 567 | "class": " World Knowledge-Easy ", 568 | "data": "Who is the author of \"Pride and Prejudice\"?" 569 | }, 570 | { 571 | "class": " World Knowledge-Easy ", 572 | "data": "In what year did the first man land on the moon?" 573 | }, 574 | { 575 | "class": " World Knowledge-Easy ", 576 | "data": "What is the chemical symbol for Gold?" 577 | }, 578 | { 579 | "class": " World Knowledge-Easy ", 580 | "data": "Who painted the \"Mona Lisa\"?" 581 | }, 582 | { 583 | "class": " World Knowledge-Easy ", 584 | "data": "Who developed the theory of relativity?" 585 | }, 586 | { 587 | "class": " World Knowledge-Easy ", 588 | "data": "What's the highest mountain in the world?" 589 | }, 590 | { 591 | "class": " World Knowledge-Easy ", 592 | "data": "What is the largest ocean on Earth?" 593 | }, 594 | { 595 | "class": " World Knowledge-Easy ", 596 | "data": "What is the process of photosynthesis?" 597 | }, 598 | { 599 | "class": " World Knowledge-Easy ", 600 | "data": "In what year was the United Nations founded?" 601 | }, 602 | { 603 | "class": " World Knowledge-Easy ", 604 | "data": "What is the square root of 81?" 605 | }, 606 | { 607 | "class": " World Knowledge-Easy ", 608 | "data": "Who won the Nobel Prize for Literature in 2020?" 609 | }, 610 | { 611 | "class": " World Knowledge-Easy ", 612 | "data": "How many days are there in a leap year?" 613 | }, 614 | { 615 | "class": " World Knowledge-Easy ", 616 | "data": "Who invented the telephone?" 617 | }, 618 | { 619 | "class": " World Knowledge-Easy ", 620 | "data": "What is the longest river in the world?" 621 | }, 622 | { 623 | "class": " World Knowledge-Easy ", 624 | "data": "Who composed the Fifth Symphony?" 625 | }, 626 | { 627 | "class": " World Knowledge-Easy ", 628 | "data": "Which planet is known as the \"Red Planet\"?" 629 | }, 630 | { 631 | "class": " World Knowledge-Easy ", 632 | "data": "What does DNA stand for?" 
633 | }, 634 | { 635 | "class": " World Knowledge-Easy ", 636 | "data": "Who was the 16th President of the United States?" 637 | }, 638 | { 639 | "class": " World Knowledge-Easy ", 640 | "data": "What is the capital of Canada?" 641 | }, 642 | { 643 | "class": " World Knowledge-Easy ", 644 | "data": "What is the primary language spoken in Brazil?" 645 | }, 646 | { 647 | "class": " World Knowledge-Easy ", 648 | "data": "Who wrote the play \"Romeo and Juliet\"?" 649 | }, 650 | { 651 | "class": " World Knowledge-Easy ", 652 | "data": "What is the currency of Japan?" 653 | }, 654 | { 655 | "class": " World Knowledge-Easy ", 656 | "data": "What is the freezing point of water in Fahrenheit?" 657 | }, 658 | { 659 | "class": " World Knowledge-Easy ", 660 | "data": "Who discovered penicillin?" 661 | }, 662 | { 663 | "class": " World Knowledge-Easy ", 664 | "data": "What are the three primary colors?" 665 | }, 666 | { 667 | "class": " World Knowledge-Easy ", 668 | "data": "What is the largest desert in the world?" 669 | }, 670 | { 671 | "class": " World Knowledge-Easy ", 672 | "data": "What is the closest star to Earth, other than the Sun?" 673 | }, 674 | { 675 | "class": " World Knowledge-Easy ", 676 | "data": "Who is the current Queen of the United Kingdom? (as of 2021)" 677 | }, 678 | { 679 | "class": " World Knowledge-Easy ", 680 | "data": "In computing, what does \"HTTP\" stand for?" 681 | }, 682 | { 683 | "class": " World Knowledge-Difficult ", 684 | "data": "What's the second law of thermodynamics?" 685 | }, 686 | { 687 | "class": " World Knowledge-Difficult ", 688 | "data": "What is the fundamental theorem of calculus?" 689 | }, 690 | { 691 | "class": " World Knowledge-Difficult ", 692 | "data": "Who won the Nobel Prize in Physics in 2021?" 693 | }, 694 | { 695 | "class": " World Knowledge-Difficult ", 696 | "data": "Who is known as the father of modern linguistics?" 
697 | }, 698 | { 699 | "class": " World Knowledge-Difficult ", 700 | "data": "What is G\u00f6del's Incompleteness Theorem?" 701 | }, 702 | { 703 | "class": " World Knowledge-Difficult ", 704 | "data": "Who proposed the concept of \"creative destruction\" in economics?" 705 | }, 706 | { 707 | "class": " World Knowledge-Difficult ", 708 | "data": "What is the meaning of \"Schadenfreude\" in German?" 709 | }, 710 | { 711 | "class": " World Knowledge-Difficult ", 712 | "data": "What are the four fundamental forces of nature?" 713 | }, 714 | { 715 | "class": " World Knowledge-Difficult ", 716 | "data": "What is the term for the fear of the number 13?" 717 | }, 718 | { 719 | "class": " World Knowledge-Difficult ", 720 | "data": "What is the primary function of the hypothalamus in the human brain?" 721 | }, 722 | { 723 | "class": " World Knowledge-Difficult ", 724 | "data": "What is the Heisenberg Uncertainty Principle?" 725 | }, 726 | { 727 | "class": " World Knowledge-Difficult ", 728 | "data": "What is the significance of the Magna Carta in British history?" 729 | }, 730 | { 731 | "class": " World Knowledge-Difficult ", 732 | "data": "Who is the author of the philosophical work \"Thus Spoke Zarathustra\"?" 733 | }, 734 | { 735 | "class": " World Knowledge-Difficult ", 736 | "data": "What is the chemical formula for photosynthesis?" 737 | }, 738 | { 739 | "class": " World Knowledge-Difficult ", 740 | "data": "Who was the Prime Minister of the United Kingdom in 1940?" 741 | }, 742 | { 743 | "class": " World Knowledge-Difficult ", 744 | "data": "Who is considered the father of modern taxonomy?" 745 | }, 746 | { 747 | "class": " World Knowledge-Difficult ", 748 | "data": "What is the theory of \"Punctuated Equilibrium\" in evolutionary biology?" 749 | }, 750 | { 751 | "class": " World Knowledge-Difficult ", 752 | "data": "Who is the author of the book \"One Hundred Years of Solitude\"?" 
753 | }, 754 | { 755 | "class": " World Knowledge-Difficult ", 756 | "data": "What is the Nash Equilibrium in game theory?" 757 | }, 758 | { 759 | "class": " World Knowledge-Difficult ", 760 | "data": "What does the term \"quasar\" mean in astronomy?" 761 | }, 762 | { 763 | "class": " World Knowledge-Difficult ", 764 | "data": "What is the Fibonacci sequence and who discovered it?" 765 | }, 766 | { 767 | "class": " World Knowledge-Difficult ", 768 | "data": "Who wrote the opera \"The Magic Flute\"?" 769 | }, 770 | { 771 | "class": " World Knowledge-Difficult ", 772 | "data": "Who is the author of \"The Interpretation of Dreams\"?" 773 | }, 774 | { 775 | "class": " World Knowledge-Difficult ", 776 | "data": "What is the principle of \"Pareto Efficiency\" in economics?" 777 | }, 778 | { 779 | "class": " World Knowledge-Difficult ", 780 | "data": "What is the setting of the novel \"Crime and Punishment\"?" 781 | }, 782 | { 783 | "class": " World Knowledge-Difficult ", 784 | "data": "Who discovered the law of universal gravitation?" 785 | }, 786 | { 787 | "class": " World Knowledge-Difficult ", 788 | "data": "What is the Dirac Equation in quantum mechanics?" 789 | }, 790 | { 791 | "class": " World Knowledge-Difficult ", 792 | "data": "Who is the author of the poem \"The Raven\"?" 793 | }, 794 | { 795 | "class": " World Knowledge-Difficult ", 796 | "data": "What is the Riemann Hypothesis in mathematics?" 797 | }, 798 | { 799 | "class": " World Knowledge-Difficult ", 800 | "data": "What is the Turing Test in artificial intelligence?" 801 | }, 802 | { 803 | "class": " Professional Knowledge-Physics ", 804 | "data": "What is the speed of light in a vacuum?" 805 | }, 806 | { 807 | "class": " Professional Knowledge-Physics ", 808 | "data": "What is Newton's second law of motion?" 809 | }, 810 | { 811 | "class": " Professional Knowledge-Physics ", 812 | "data": "Define the principle of superposition in wave theory." 
813 | }, 814 | { 815 | "class": " Professional Knowledge-Physics ", 816 | "data": "What is the difference between velocity and speed?" 817 | }, 818 | { 819 | "class": " Professional Knowledge-Physics ", 820 | "data": "What is the unit of force in the SI system?" 821 | }, 822 | { 823 | "class": " Professional Knowledge-Physics ", 824 | "data": "What is the law of conservation of energy?" 825 | }, 826 | { 827 | "class": " Professional Knowledge-Physics ", 828 | "data": "What is the function of a resistor in an electrical circuit?" 829 | }, 830 | { 831 | "class": " Professional Knowledge-Physics ", 832 | "data": "Explain Ohm's Law." 833 | }, 834 | { 835 | "class": " Professional Knowledge-Physics ", 836 | "data": "Define wavelength and frequency in the context of light waves." 837 | }, 838 | { 839 | "class": " Professional Knowledge-Physics ", 840 | "data": "What is the difference between a scalar and a vector?" 841 | }, 842 | { 843 | "class": " Professional Knowledge-Physics ", 844 | "data": "What is the first law of thermodynamics?" 845 | }, 846 | { 847 | "class": " Professional Knowledge-Physics ", 848 | "data": "What is the principle of conservation of momentum?" 849 | }, 850 | { 851 | "class": " Professional Knowledge-Physics ", 852 | "data": "What is the difference between nuclear fission and nuclear fusion?" 853 | }, 854 | { 855 | "class": " Professional Knowledge-Physics ", 856 | "data": "What are the three types of radioactive decay?" 857 | }, 858 | { 859 | "class": " Professional Knowledge-Physics ", 860 | "data": "What is the Heisenberg Uncertainty Principle?" 861 | }, 862 | { 863 | "class": " Professional Knowledge-Physics ", 864 | "data": "Explain the concept of wave-particle duality." 865 | }, 866 | { 867 | "class": " Professional Knowledge-Physics ", 868 | "data": "What is the Fermi Paradox and what are some proposed solutions to it?" 
869 | }, 870 | { 871 | "class": " Professional Knowledge-Physics ", 872 | "data": "How does quantum entanglement work and what are its implications for information transfer?" 873 | }, 874 | { 875 | "class": " Professional Knowledge-Physics ", 876 | "data": "How does the Casimir effect demonstrate the existence of quantum vacuum fluctuations?" 877 | }, 878 | { 879 | "class": " Professional Knowledge-Physics ", 880 | "data": "Explain the concept of time dilation in special relativity." 881 | }, 882 | { 883 | "class": " Professional Knowledge-Physics ", 884 | "data": "What are the main interpretations of quantum mechanics and how do they differ?" 885 | }, 886 | { 887 | "class": " Professional Knowledge-Physics ", 888 | "data": "How do black holes evaporate according to Hawking radiation theory?" 889 | }, 890 | { 891 | "class": " Professional Knowledge-Physics ", 892 | "data": "How does superconductivity work and what are its practical applications?" 893 | }, 894 | { 895 | "class": " Professional Knowledge-Physics ", 896 | "data": "How does the Higgs field give particles their mass?" 897 | }, 898 | { 899 | "class": " Professional Knowledge-Physics ", 900 | "data": "What are the key principles of string theory?" 901 | }, 902 | { 903 | "class": " Professional Knowledge-Physics ", 904 | "data": "What are gravitational waves and how are they detected?" 905 | }, 906 | { 907 | "class": " Professional Knowledge-Physics ", 908 | "data": "Explain the double-slit experiment and its significance in quantum mechanics." 909 | }, 910 | { 911 | "class": " Professional Knowledge-Physics ", 912 | "data": "What is the difference between baryonic and non-baryonic dark matter?" 913 | }, 914 | { 915 | "class": " Professional Knowledge-Physics ", 916 | "data": "Explain the concept of symmetry breaking in particle physics." 917 | }, 918 | { 919 | "class": " Professional Knowledge-Physics ", 920 | "data": "What is the significance of the Planck scale in quantum gravity?" 
921 | }, 922 | { 923 | "class": " Professional Knowledge-Biology ", 924 | "data": "What is the function of mitochondria in a cell?" 925 | }, 926 | { 927 | "class": " Professional Knowledge-Biology ", 928 | "data": "What is the process of photosynthesis?" 929 | }, 930 | { 931 | "class": " Professional Knowledge-Biology ", 932 | "data": "What is DNA and what role does it play in living organisms?" 933 | }, 934 | { 935 | "class": " Professional Knowledge-Biology ", 936 | "data": "What are the four main types of macromolecules found in living organisms?" 937 | }, 938 | { 939 | "class": " Professional Knowledge-Biology ", 940 | "data": "What is the difference between prokaryotes and eukaryotes?" 941 | }, 942 | { 943 | "class": " Professional Knowledge-Biology ", 944 | "data": "Explain the process of mitosis." 945 | }, 946 | { 947 | "class": " Professional Knowledge-Biology ", 948 | "data": "What is the function of hemoglobin in the blood?" 949 | }, 950 | { 951 | "class": " Professional Knowledge-Biology ", 952 | "data": "What is the difference between a gene and an allele?" 953 | }, 954 | { 955 | "class": " Professional Knowledge-Biology ", 956 | "data": "What is the structure of a typical neuron?" 957 | }, 958 | { 959 | "class": " Professional Knowledge-Biology ", 960 | "data": "What is the biological role of proteins?" 961 | }, 962 | { 963 | "class": " Professional Knowledge-Biology ", 964 | "data": "What is the difference between a genotype and a phenotype?" 965 | }, 966 | { 967 | "class": " Professional Knowledge-Biology ", 968 | "data": "What is natural selection?" 969 | }, 970 | { 971 | "class": " Professional Knowledge-Biology ", 972 | "data": "What is the function of the endoplasmic reticulum in a cell?" 973 | }, 974 | { 975 | "class": " Professional Knowledge-Biology ", 976 | "data": "What are the four types of tissues in the human body?" 
977 | }, 978 | { 979 | "class": " Professional Knowledge-Biology ", 980 | "data": "What is the role of ATP in cellular metabolism?" 981 | }, 982 | { 983 | "class": " Ability Math ", 984 | "data": "What is the Pythagorean theorem?" 985 | }, 986 | { 987 | "class": " Ability Math ", 988 | "data": "Simplify the expression: 2x + 3x - 5x." 989 | }, 990 | { 991 | "class": " Ability Math ", 992 | "data": "Solve for x in the equation: 2x - 3 = 7." 993 | }, 994 | { 995 | "class": " Ability Math ", 996 | "data": "What is the formula for the area of a circle?" 997 | }, 998 | { 999 | "class": " Ability Math ", 1000 | "data": "What is the derivative of the function f(x) = x^2?" 1001 | }, 1002 | { 1003 | "class": " Ability Math ", 1004 | "data": "What is the sum of the interior angles of a triangle?" 1005 | }, 1006 | { 1007 | "class": " Ability Math ", 1008 | "data": "What is the formula to find the volume of a sphere?" 1009 | }, 1010 | { 1011 | "class": " Ability Math ", 1012 | "data": "What does it mean for a number to be prime?" 1013 | }, 1014 | { 1015 | "class": " Ability Math ", 1016 | "data": "Convert the fraction 3/4 to a decimal." 1017 | }, 1018 | { 1019 | "class": " Ability Math ", 1020 | "data": "What is the integral of the function f(x) = x^2?" 1021 | }, 1022 | { 1023 | "class": " Ability Math ", 1024 | "data": "What is the difference between a rational and an irrational number?" 1025 | }, 1026 | { 1027 | "class": " Ability Math ", 1028 | "data": "What is the value of pi to 3 decimal places?" 1029 | }, 1030 | { 1031 | "class": " Ability Math ", 1032 | "data": "Simplify the expression: (a^2)^3." 1033 | }, 1034 | { 1035 | "class": " Ability Math ", 1036 | "data": "What is the midpoint formula in a two-dimensional plane?" 1037 | }, 1038 | { 1039 | "class": " Ability Math ", 1040 | "data": "Solve for x in the equation: log(x) = 2." 1041 | }, 1042 | { 1043 | "class": " Ability Math ", 1044 | "data": "What is the Taylor series expansion of the function e^x?" 
1045 | }, 1046 | { 1047 | "class": " Ability Math ", 1048 | "data": "Solve the differential equation dy/dx = xy." 1049 | }, 1050 | { 1051 | "class": " Ability Math ", 1052 | "data": "Find the eigenvalues of the matrix [[3, 1], [2, 4]]." 1053 | }, 1054 | { 1055 | "class": " Ability Math ", 1056 | "data": "Prove the binomial theorem: (x+y)^n = \u03a3 (from k=0 to n) [nCk * x^(n-k) * y^k]." 1057 | }, 1058 | { 1059 | "class": " Ability Math ", 1060 | "data": "What is Green's theorem in vector calculus?" 1061 | }, 1062 | { 1063 | "class": " Ability Math ", 1064 | "data": "Explain the concept of a group in abstract algebra." 1065 | }, 1066 | { 1067 | "class": " Ability Math ", 1068 | "data": "Prove that the square root of 2 is irrational." 1069 | }, 1070 | { 1071 | "class": " Ability Math ", 1072 | "data": "How do you solve a system of linear equations using Gaussian elimination?" 1073 | }, 1074 | { 1075 | "class": " Ability Math ", 1076 | "data": "What is the definition of a limit in calculus?" 1077 | }, 1078 | { 1079 | "class": " Ability Math ", 1080 | "data": "Explain the concept of mathematical induction." 1081 | }, 1082 | { 1083 | "class": " Ability Math ", 1084 | "data": "What is the Riemann Hypothesis?" 1085 | }, 1086 | { 1087 | "class": " Ability Math ", 1088 | "data": "How do you calculate the determinant of a matrix?" 1089 | }, 1090 | { 1091 | "class": " Ability Math ", 1092 | "data": "What is the fundamental theorem of algebra?" 1093 | }, 1094 | { 1095 | "class": " Ability Math ", 1096 | "data": "Explain the concept of a vector space in linear algebra." 1097 | }, 1098 | { 1099 | "class": " Ability Math ", 1100 | "data": "What are the axioms of probability in statistics?" 1101 | }, 1102 | { 1103 | "class": " Ability Reasoning ", 1104 | "data": "If all dogs bark and Rover is a dog, does Rover bark?" 1105 | }, 1106 | { 1107 | "class": " Ability Reasoning ", 1108 | "data": "If it is raining, the ground is wet. The ground is not wet. Is it raining?" 
1109 | }, 1110 | { 1111 | "class": " Ability Reasoning ", 1112 | "data": "You are driving a bus. At the first stop, 4 people get on. At the next stop, 3 get off and 5 get on. At the third stop, 2 get off and 2 get on. How many people are on the bus now?" 1113 | }, 1114 | { 1115 | "class": " Ability Reasoning ", 1116 | "data": "John is taller than Mary. Mary is taller than Peter. Who is the tallest?" 1117 | }, 1118 | { 1119 | "class": " Ability Reasoning ", 1120 | "data": "If every cat chases a mouse and Tom is a cat, who does Tom chase?" 1121 | }, 1122 | { 1123 | "class": " Ability Reasoning ", 1124 | "data": "If it takes 5 machines 5 minutes to make 5 widgets, how long would it take 100 machines to make 100 widgets?" 1125 | }, 1126 | { 1127 | "class": " Ability Reasoning ", 1128 | "data": "What comes next in this series: 2, 4, 8, 16, __?" 1129 | }, 1130 | { 1131 | "class": " Ability Reasoning ", 1132 | "data": "If a fruit basket has only apples and oranges, and you know there is at least one apple in it, can you say the basket contains only apples?" 1133 | }, 1134 | { 1135 | "class": " Ability Reasoning ", 1136 | "data": "John's father is twice as old as John. 10 years ago, he was three times as old. How old is John now?" 1137 | }, 1138 | { 1139 | "class": " Ability Reasoning ", 1140 | "data": "What is the next number in the sequence: 1, 1, 2, 3, 5, 8, __?" 1141 | }, 1142 | { 1143 | "class": " Ability Reasoning ", 1144 | "data": "You have two buckets, one with red paint and one with blue paint. You take one cup from the red bucket and pour it into the blue bucket. Then you take one cup from the blue bucket and pour it back into the red bucket. Which is true: the red bucket has more blue paint, or the blue bucket has more red paint?" 1145 | }, 1146 | { 1147 | "class": " Ability Reasoning ", 1148 | "data": "If you have two US quarters, three dimes, and four nickels, how much money do you have in total?" 
1149 | }, 1150 | { 1151 | "class": " Ability Reasoning ", 1152 | "data": "If a train leaves a station and travels at 60 miles per hour, how far will it be from the station after 2.5 hours?" 1153 | }, 1154 | { 1155 | "class": " Ability Reasoning ", 1156 | "data": "Can a man living in New York be buried west of the Mississippi River?" 1157 | }, 1158 | { 1159 | "class": " Ability Reasoning ", 1160 | "data": "What is the minimum number of times a week can a person's birthday occur?" 1161 | }, 1162 | { 1163 | "class": " Ability Reasoning ", 1164 | "data": "A bat and a ball together cost $1.10. The bat costs $1.00 more than the ball. How much does the ball cost?" 1165 | }, 1166 | { 1167 | "class": " Ability Reasoning ", 1168 | "data": "If there are 5 apples and you take away 3, how many do you have?" 1169 | }, 1170 | { 1171 | "class": " Ability Reasoning ", 1172 | "data": "If a farmer has 17 sheep and all but 9 die, how many are left?" 1173 | }, 1174 | { 1175 | "class": " Ability Reasoning ", 1176 | "data": "If you're running a race and you pass the person in 2nd place, what place are you in?" 1177 | }, 1178 | { 1179 | "class": " Ability Reasoning ", 1180 | "data": "A man builds a house with all 4 sides facing south. A bear walks past the house. What color is the bear?" 1181 | }, 1182 | { 1183 | "class": " Ability Reasoning ", 1184 | "data": "If you have a bee in your hand, what's in your eye?" 1185 | }, 1186 | { 1187 | "class": " Ability Reasoning ", 1188 | "data": "Which weighs more: a pound of feathers or a pound of bricks?" 1189 | }, 1190 | { 1191 | "class": " Ability Reasoning ", 1192 | "data": "If three children share 49 toys equally, how many toys does each child have?" 1193 | }, 1194 | { 1195 | "class": " Ability Reasoning ", 1196 | "data": "You have two coins that add up to 15 cents and one of the coins is not a nickel. What are the coins?" 
1197 | }, 1198 | { 1199 | "class": " Ability Reasoning ", 1200 | "data": "If a red house is made from red bricks, and a blue house is made from blue bricks, what is a greenhouse made from?" 1201 | }, 1202 | { 1203 | "class": " Ability Reasoning ", 1204 | "data": "How many times can you subtract 5 from 25?" 1205 | }, 1206 | { 1207 | "class": " Ability Reasoning ", 1208 | "data": "How many cubic feet of dirt are there in a hole that is 3 feet deep, 3 feet wide, and 3 feet long?" 1209 | }, 1210 | { 1211 | "class": " Ability Reasoning ", 1212 | "data": "If it takes 8 men 10 hours to build a wall, how long would it take 4 men to build the same wall?" 1213 | }, 1214 | { 1215 | "class": " Ability Reasoning ", 1216 | "data": "A plane crashes on the border of the U.S. and Canada. Where do they bury the survivors?" 1217 | }, 1218 | { 1219 | "class": " Ability Reasoning ", 1220 | "data": "If you drove a bus with 43 people on board from Chicago and stopped at Pittsburgh to pick up 7 more people and drop off 5 passengers, then at your next stop in Philadelphia, you drop off 8 passengers and pick up 4, and eventually arrive in New York later that night to drop off the remainder of your passengers, how many bus drivers are there?" 1221 | }, 1222 | { 1223 | "class": " Writing ", 1224 | "data": "Write a short story about gravity." 1225 | }, 1226 | { 1227 | "class": " Writing ", 1228 | "data": "Summarize the concept of quantum mechanics in a paragraph." 1229 | }, 1230 | { 1231 | "class": " Writing ", 1232 | "data": "Write a poem about the solar system." 1233 | }, 1234 | { 1235 | "class": " Writing ", 1236 | "data": "Explain the concept of time dilation in simple terms." 1237 | }, 1238 | { 1239 | "class": " Writing ", 1240 | "data": "Write a blog post about the discovery of the Higgs boson." 1241 | }, 1242 | { 1243 | "class": " Writing ", 1244 | "data": "Create a catchy slogan for a physics tutoring service." 
1245 | }, 1246 | { 1247 | "class": " Writing ", 1248 | "data": "Compose a rhymed poem about electromagnetism." 1249 | }, 1250 | { 1251 | "class": " Writing ", 1252 | "data": "Write a short dialogue between two particles experiencing quantum entanglement." 1253 | }, 1254 | { 1255 | "class": " Writing ", 1256 | "data": "Explain how a rainbow forms in kid's language." 1257 | }, 1258 | { 1259 | "class": " Writing ", 1260 | "data": "Write a haiku about Newton's laws of motion." 1261 | }, 1262 | { 1263 | "class": " Writing ", 1264 | "data": "Narrate a story featuring a character who can control gravity." 1265 | }, 1266 | { 1267 | "class": " Writing ", 1268 | "data": "Write a simple explanation of Schr\u00f6dinger's cat experiment." 1269 | }, 1270 | { 1271 | "class": " Writing ", 1272 | "data": "Describe what light would look like if we could see it moving at its actual speed." 1273 | }, 1274 | { 1275 | "class": " Writing ", 1276 | "data": "Write a simple explanation of Einstein's Theory of Relativity." 1277 | }, 1278 | { 1279 | "class": " Writing ", 1280 | "data": "Write an invitation to a physics seminar." 1281 | }, 1282 | { 1283 | "class": " Writing ", 1284 | "data": "Describe how a microwave works in layman's terms." 1285 | }, 1286 | { 1287 | "class": " Writing ", 1288 | "data": "Write a paragraph explaining the Big Bang theory for a middle school audience." 1289 | }, 1290 | { 1291 | "class": " Writing ", 1292 | "data": "Create a catchy slogan for a quantum computing company." 1293 | }, 1294 | { 1295 | "class": " Writing ", 1296 | "data": "Write a short story about a journey through a black hole." 1297 | }, 1298 | { 1299 | "class": " Writing ", 1300 | "data": "Describe a day in the life of a photon." 1301 | }, 1302 | { 1303 | "class": " Writing ", 1304 | "data": "Explain the concept of entropy to a five-year-old." 1305 | }, 1306 | { 1307 | "class": " Writing ", 1308 | "data": "Write a limerick about particle physics." 
1309 | }, 1310 | { 1311 | "class": " Writing ", 1312 | "data": "Describe what a wave-particle duality is in simple terms." 1313 | }, 1314 | { 1315 | "class": " Writing ", 1316 | "data": "Write a dialogue between the sun and the moon discussing gravity." 1317 | }, 1318 | { 1319 | "class": " Writing ", 1320 | "data": "Explain the role of physics in everyday life." 1321 | }, 1322 | { 1323 | "class": " Writing ", 1324 | "data": "Write a short story about the first second of the universe." 1325 | }, 1326 | { 1327 | "class": " Writing ", 1328 | "data": "Create a catchy slogan for a company selling telescopes." 1329 | }, 1330 | { 1331 | "class": " Writing ", 1332 | "data": "Write a poem about the Northern Lights." 1333 | }, 1334 | { 1335 | "class": " Writing ", 1336 | "data": "Explain the Doppler effect using an example from daily life." 1337 | }, 1338 | { 1339 | "class": " Writing ", 1340 | "data": "Write a short story about a time-traveling physicist." 1341 | }, 1342 | { 1343 | "class": " Writing ", 1344 | "data": "Write a blog post about the contributions of Marie Curie." 1345 | }, 1346 | { 1347 | "class": " Writing ", 1348 | "data": "Write a simple explanation of dark matter." 1349 | }, 1350 | { 1351 | "class": " Writing ", 1352 | "data": "Create a catchy slogan for a physics-based video game." 1353 | }, 1354 | { 1355 | "class": " Writing ", 1356 | "data": "Write a children's story about a journey to a parallel universe." 1357 | }, 1358 | { 1359 | "class": " Writing ", 1360 | "data": "Explain the concept of weightlessness in space." 1361 | }, 1362 | { 1363 | "class": " Writing ", 1364 | "data": "Write a poem about the speed of light." 1365 | }, 1366 | { 1367 | "class": " Writing ", 1368 | "data": "Summarize the life and work of Isaac Newton." 1369 | }, 1370 | { 1371 | "class": " Writing ", 1372 | "data": "Write a short dialogue between two planets." 1373 | }, 1374 | { 1375 | "class": " Writing ", 1376 | "data": "Explain how a fridge works in simple terms." 
1377 | }, 1378 | { 1379 | "class": " Writing ", 1380 | "data": "Write a short story about an atom." 1381 | }, 1382 | { 1383 | "class": " Writing ", 1384 | "data": "Write a research paper abstract on the recent advancements in string theory." 1385 | }, 1386 | { 1387 | "class": " Writing ", 1388 | "data": "Develop a comprehensive lesson plan to teach quantum physics to high school students." 1389 | }, 1390 | { 1391 | "class": " Writing ", 1392 | "data": "Write a feature article on the implications of quantum computing for cybersecurity." 1393 | }, 1394 | { 1395 | "class": " Writing ", 1396 | "data": "Write a dialogue between Albert Einstein and Nikola Tesla discussing their respective theories." 1397 | }, 1398 | { 1399 | "class": " Writing ", 1400 | "data": "Create a detailed business plan for a start-up based on quantum technology." 1401 | }, 1402 | { 1403 | "class": " Writing ", 1404 | "data": "Compose a speculative fiction story on the societal implications of mastering cold fusion." 1405 | }, 1406 | { 1407 | "class": " Writing ", 1408 | "data": "Write an in-depth review of a recent academic paper on gravitational waves." 1409 | }, 1410 | { 1411 | "class": " Writing ", 1412 | "data": "Create a detailed step-by-step guide on how to set up a particle accelerator." 1413 | }, 1414 | { 1415 | "class": " Writing ", 1416 | "data": "Write an epic poem describing the process of a star going supernova." 1417 | }, 1418 | { 1419 | "class": " Writing ", 1420 | "data": "Develop a comprehensive overview of the pros and cons of nuclear energy." 1421 | }, 1422 | { 1423 | "class": " Writing ", 1424 | "data": "Write a detailed blog post about the challenges and benefits of space travel." 1425 | }, 1426 | { 1427 | "class": " Writing ", 1428 | "data": "Create a comprehensive lesson plan to teach the general theory of relativity." 1429 | }, 1430 | { 1431 | "class": " Writing ", 1432 | "data": "Write a dialogue between two fundamental particles in the Large Hadron Collider." 
1433 | }, 1434 | { 1435 | "class": " Writing ", 1436 | "data": "Write a detailed essay on the role of physics in climate change." 1437 | }, 1438 | { 1439 | "class": " Writing ", 1440 | "data": "Compose a set of haikus that, when combined, explain the process of nuclear fusion." 1441 | }, 1442 | { 1443 | "class": " Writing ", 1444 | "data": "Write a speculative fiction story about a world where the laws of physics are slightly different." 1445 | }, 1446 | { 1447 | "class": " Writing ", 1448 | "data": "Write a deep dive article on the implications of the many-worlds interpretation of quantum mechanics." 1449 | }, 1450 | { 1451 | "class": " Writing ", 1452 | "data": "Compose a comprehensive explanation of the four fundamental forces of physics." 1453 | }, 1454 | { 1455 | "class": " Writing ", 1456 | "data": "Write a dialogue between two black holes about to merge." 1457 | }, 1458 | { 1459 | "class": " Writing ", 1460 | "data": "Write a detailed essay on the physics behind GPS technology." 1461 | }, 1462 | { 1463 | "class": " Writing ", 1464 | "data": "Write a detailed blog post about the challenges and future of quantum cryptography." 1465 | }, 1466 | { 1467 | "class": " Writing ", 1468 | "data": "Create a comprehensive lesson plan to teach the special theory of relativity." 1469 | }, 1470 | { 1471 | "class": " Writing ", 1472 | "data": "Write an in-depth review of a recent academic paper on quantum entanglement." 1473 | }, 1474 | { 1475 | "class": " Writing ", 1476 | "data": "Write a speculative fiction story about a world where time travel is possible." 1477 | }, 1478 | { 1479 | "class": " Writing ", 1480 | "data": "Develop a conceptual explanation of the theory of everything." 1481 | }, 1482 | { 1483 | "class": " Writing ", 1484 | "data": "Write a detailed essay on the role of physics in understanding the origins of the universe." 1485 | }, 1486 | { 1487 | "class": " Writing ", 1488 | "data": "Write a dialogue between two photons traveling at light speed." 
1489 | }, 1490 | { 1491 | "class": " Writing ", 1492 | "data": "Write a blog post about the role of physics in sustainable development." 1493 | }, 1494 | { 1495 | "class": " Writing ", 1496 | "data": "Create a catchy slogan for a company that's researching practical applications of quantum entanglement." 1497 | }, 1498 | { 1499 | "class": " Writing ", 1500 | "data": "Write a comprehensive overview of the theories and evidence for the existence of dark matter and dark energy." 1501 | }, 1502 | { 1503 | "class": " Writing ", 1504 | "data": "Write a speculative fiction story about a society that has mastered quantum teleportation." 1505 | }, 1506 | { 1507 | "class": " Writing ", 1508 | "data": "Develop a detailed lesson plan to teach the concepts of string theory." 1509 | }, 1510 | { 1511 | "class": " Writing ", 1512 | "data": "Write a comprehensive review of a recent academic paper on neutrino physics." 1513 | }, 1514 | { 1515 | "class": " Writing ", 1516 | "data": "Write an in-depth blog post about the role of physics in artificial intelligence." 1517 | }, 1518 | { 1519 | "class": " Writing ", 1520 | "data": "Write a dialogue between a gravitational wave and a light wave." 1521 | }, 1522 | { 1523 | "class": " Writing ", 1524 | "data": "Write a detailed essay on how quantum mechanics influences biological processes." 1525 | }, 1526 | { 1527 | "class": " Writing ", 1528 | "data": "Write a blog post about the implications of zero-point energy." 1529 | }, 1530 | { 1531 | "class": " Writing ", 1532 | "data": "Develop a comprehensive lesson plan to teach the principles of thermodynamics." 1533 | }, 1534 | { 1535 | "class": " Writing ", 1536 | "data": "Write an in-depth review of a recent academic paper on the Higgs boson." 1537 | }, 1538 | { 1539 | "class": " Writing ", 1540 | "data": "Write a speculative fiction story about a society that has discovered how to manipulate the fundamental forces of nature." 
1541 | } 1542 | ] -------------------------------------------------------------------------------- /data/vllm_chatloop: -------------------------------------------------------------------------------- 1 | from vllm import LLM, SamplingParams 2 | 3 | # Sampling settings shared across turns; raise max_tokens so replies 4 | # are not cut off at vLLM's short default. 5 | sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=512) 6 | 7 | # Load the model once, outside the loop. 8 | llm = LLM(model="/yinxr/dn/alpaca_eval/model/13b-latest") 9 | 10 | # Simple chat loop: read a user turn, generate a reply, print it, repeat. 11 | while True: 12 |     user_input = input("User: ") 13 |     if user_input.strip().lower() in {"exit", "quit"}: 14 |         break 15 | 16 |     # Wrap the user turn in a minimal single-turn prompt template. 17 |     prompt = "User: " + user_input + "\nAssistant:" 18 | 19 |     # Generate and print the response for this turn. 20 |     outputs = llm.generate([prompt], sampling_params) 21 |     generated_text = outputs[0].outputs[0].text 22 |     print(f"Model: {generated_text.strip()}") 23 | -------------------------------------------------------------------------------- /figures/alpaca.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thunlp/UltraChat/1f613e1b8dfc6d1e3a02efb6905de608ed06645b/figures/alpaca.png -------------------------------------------------------------------------------- /figures/compare_ultra.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thunlp/UltraChat/1f613e1b8dfc6d1e3a02efb6905de608ed06645b/figures/compare_ultra.jpg -------------------------------------------------------------------------------- /figures/figure.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thunlp/UltraChat/1f613e1b8dfc6d1e3a02efb6905de608ed06645b/figures/figure.png -------------------------------------------------------------------------------- /figures/meta_topic.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/thunlp/UltraChat/1f613e1b8dfc6d1e3a02efb6905de608ed06645b/figures/meta_topic.png -------------------------------------------------------------------------------- /figures/ultra-process.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thunlp/UltraChat/1f613e1b8dfc6d1e3a02efb6905de608ed06645b/figures/ultra-process.png -------------------------------------------------------------------------------- /figures/ultra_logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thunlp/UltraChat/1f613e1b8dfc6d1e3a02efb6905de608ed06645b/figures/ultra_logo.png -------------------------------------------------------------------------------- /figures/wizard_test.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thunlp/UltraChat/1f613e1b8dfc6d1e3a02efb6905de608ed06645b/figures/wizard_test.jpg -------------------------------------------------------------------------------- /paper/UltraFuser-paper.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thunlp/UltraChat/1f613e1b8dfc6d1e3a02efb6905de608ed06645b/paper/UltraFuser-paper.pdf -------------------------------------------------------------------------------- /train/requirements.txt: -------------------------------------------------------------------------------- 1 | bmtrain==0.2.1 2 | scikit_learn==1.0 3 | torch==1.13.1 4 | tqdm 5 | transformers==4.28.1 6 | wandb -------------------------------------------------------------------------------- /train/train_bm.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import torch 3 | from transformers import LlamaForCausalLM, LlamaTokenizer 4 | from 
ultrachat_dataset import load_raw_data, PromptIterableDataset, collator
5 | from transformers.optimization import get_linear_schedule_with_warmup
6 | from tqdm import tqdm
7 | from torch.utils.data import DataLoader
8 | import bmtrain as bmt
9 | from functools import partial
10 | import time
11 | import os
12 | import wandb
13 | 
14 | def get_model_tokenizer(args):
15 |     model = LlamaForCausalLM.from_pretrained(args.model_name_or_path)
16 |     tokenizer = LlamaTokenizer.from_pretrained(args.model_name_or_path)
17 |     tokenizer.add_special_tokens({'pad_token': "<pad>"})  # LLaMA ships without a pad token
18 |     model.resize_token_embeddings(len(tokenizer))
19 |     model = bmt.BMTrainModelWrapper(model)
20 |     return model, tokenizer
21 | 
22 | def get_optimizer(args, model):
23 |     optimizer = bmt.optim.AdamOffloadOptimizer(
24 |         model.parameters(), weight_decay=args.weight_decay
25 |     )
26 |     return optimizer
27 | 
28 | 
29 | def get_learning_rate_scheduler(args, optimizer):
30 |     if args.lr_decay_iters is None:
31 |         args.lr_decay_iters = args.train_iters
32 |     if args.lr_decay_style == "linear":
33 |         lr_scheduler = bmt.lr_scheduler.Linear(
34 |             optimizer,
35 |             start_lr=args.lr,
36 |             warmup_iter=int(args.warmup_ratio * args.train_iters),
37 |             end_iter=args.lr_decay_iters,
38 |             num_iter=args.start_step,
39 |         )
40 |     elif args.lr_decay_style == "cosine":
41 |         print("use cosine")
42 |         lr_scheduler = bmt.lr_scheduler.Cosine(
43 |             optimizer,
44 |             start_lr=args.lr,
45 |             warmup_iter=int(args.warmup_ratio * args.train_iters),
46 |             end_iter=args.lr_decay_iters,
47 |             num_iter=args.start_step,
48 |         )
49 | 
50 |     elif args.lr_decay_style == "noam":
51 |         print("use noam")
52 |         lr_scheduler = bmt.lr_scheduler.Noam(
53 |             optimizer,
54 |             start_lr=args.lr,
55 |             warmup_iter=int(args.warmup_ratio * args.train_iters),
56 |             end_iter=args.lr_decay_iters,
57 |             num_iter=args.start_step,
58 |         )
59 |     else:
60 |         raise NotImplementedError
61 |     return lr_scheduler
62 | 
63 | 
64 | def setup_model_and_optimizer(args):
65 |     model, tokenizer =
get_model_tokenizer(args) 66 | bmt.synchronize() 67 | optimizer = get_optimizer(args, model) 68 | lr_scheduler = get_learning_rate_scheduler(args, optimizer) 69 | bmt.synchronize() 70 | return tokenizer, model, optimizer, lr_scheduler 71 | 72 | 73 | 74 | def train(args): 75 | 76 | bmt.init_distributed( 77 | seed=args.seed, 78 | zero_level=3, 79 | ) 80 | 81 | if args.wandb and bmt.rank() == 0: 82 | wandb.init() 83 | 84 | if args.tensorboard is not None and bmt.rank() == 0: 85 | from torch.utils.tensorboard import SummaryWriter 86 | import distutils.version # noqa: F401 87 | 88 | if not os.path.exists(args.tensorboard): 89 | os.makedirs(args.tensorboard) 90 | writer = SummaryWriter(log_dir=args.tensorboard) 91 | 92 | tokenizer, model, optimizer, lr_scheduler = setup_model_and_optimizer(args) 93 | optim_manager = bmt.optim.OptimManager(loss_scale=2**10) 94 | optim_manager.add_optimizer(optimizer, lr_scheduler) 95 | 96 | bmt.synchronize() 97 | 98 | original_dataset = load_raw_data(args.data_file) 99 | print("total training instance number:", len(original_dataset)) 100 | 101 | 102 | 103 | bmt.print_rank("Model memory") 104 | bmt.print_rank(torch.cuda.memory_summary()) 105 | 106 | avg_time_recorder = bmt.utils.AverageRecorder() 107 | avg_loss_recorder = bmt.utils.AverageRecorder() 108 | 109 | global_step = 0 110 | for epoch in range(args.epochs): 111 | indices = torch.randperm(len(original_dataset)) 112 | dataset = [original_dataset[i] for i in indices] 113 | 114 | data_per_gpu = len(dataset) // bmt.world_size() 115 | dataset = dataset[bmt.rank() * data_per_gpu : (bmt.rank() + 1) * data_per_gpu] 116 | 117 | dataset = PromptIterableDataset(dataset, tokenizer = tokenizer, max_seq_length = args.max_seq_length, teacher_forcing=True, truncate_method="tail") 118 | dataloader = DataLoader(dataset, batch_size=args.batch_size_per_device, collate_fn=partial(collator, tokenizer)) 119 | 120 | if global_step >= args.train_iters: 121 | break 122 | progress_bar = 
tqdm(range(len(dataloader)), disable=not bmt.rank()==0, desc=f"epoch {epoch}") 123 | 124 | for step, inputs in enumerate(dataloader): 125 | st = time.time() 126 | 127 | with bmt.inspect.inspect_tensor() as inspector: 128 | for k in inputs: 129 | inputs[k] = inputs[k].cuda() 130 | output = model(**inputs) 131 | loss = output.loss 132 | 133 | global_loss = bmt.sum_loss(loss).item() 134 | 135 | optim_manager.backward(loss) 136 | 137 | 138 | if (step + 1) % args.gradient_accumulation_steps == 0 or step == len(dataloader) - 1: 139 | optim_manager.clip_grad_norm(optimizer.param_groups, max_norm=args.clip_grad) 140 | 141 | optim_manager.step() 142 | optim_manager.zero_grad() 143 | 144 | 145 | global_step += 1 146 | progress_bar.update(1) 147 | 148 | # record time and loss 149 | iteration_time = time.time() - st 150 | 151 | avg_time_recorder.record(iteration_time) 152 | avg_loss_recorder.record(global_loss) 153 | 154 | # print time and loss 155 | if global_step % args.logging_step == 0: 156 | bmt.print_rank( 157 | "| Iter: {:6d} | loss: {:.4f} average_loss: {:.4f} | lr: {:.4e} | time: {:.4f}".format( 158 | global_step, 159 | global_loss, 160 | avg_loss_recorder.value, 161 | lr_scheduler.current_lr, 162 | avg_time_recorder.value 163 | ) 164 | ) 165 | if args.wandb and bmt.rank() == 0: 166 | wandb.log({ 167 | "loss": global_loss, 168 | "average_loss": avg_loss_recorder.value, 169 | "lr": lr_scheduler.current_lr, 170 | }, step=global_step) 171 | if args.tensorboard and bmt.rank() == 0: 172 | writer.add_scalar("Loss/train", global_loss, global_step) 173 | writer.add_scalar("average_Loss/train", avg_loss_recorder.value, global_step) 174 | writer.add_scalar("lr/train", lr_scheduler.current_lr, global_step) 175 | 176 | 177 | # save model 178 | if global_step % args.save_step == 0: 179 | os.makedirs(f"ultrachat_{args.model}/step_{global_step}", exist_ok=True) 180 | 181 | bmt.save(model, f"ultrachat_{args.model}/step_{global_step}/checkpoint.pt") 182 | 183 | if bmt.rank() == 0: 184 
| torch.save(optimizer.state_dict(), f"ultrachat_{args.model}/step_{global_step}/optimizer.pt") 185 | torch.save(lr_scheduler.state_dict(), f"ultrachat_{args.model}/step_{global_step}/scheduler.pt") 186 | 187 | if global_step == args.train_iters: 188 | break 189 | 190 | bmt.save(model, f"ultrachat_{args.model}/final.pt") 191 | 192 | 193 | 194 | if __name__ == "__main__": 195 | parser = argparse.ArgumentParser("") 196 | parser.add_argument("--lr", type=float, default=1e-5) 197 | parser.add_argument("--model", type=str, default='llama') 198 | parser.add_argument("--model_name_or_path", default='/path/to/huggingface/llama') 199 | parser.add_argument("--epochs", default=3, type=int) 200 | parser.add_argument("--seed", default=0, type=int) 201 | 202 | parser.add_argument("--max_seq_length", default=2048, type=int) 203 | parser.add_argument("--batch_size_per_device", default=2, type=int) 204 | parser.add_argument("--logging_step", default=100, type=int) 205 | parser.add_argument("--save_step", default=50000, type=int) 206 | parser.add_argument("--data_file", default="../data/processed/data.json", type=str) 207 | parser.add_argument("--gradient_accumulation_steps", default=1, type=int) 208 | parser.add_argument("--wandb", action="store_true") 209 | parser.add_argument("--with_eval", action="store_true") 210 | 211 | parser.add_argument("--clip-grad", type=float, default=1.0, help="gradient clipping") 212 | # Learning rate. 
213 |     parser.add_argument("--weight-decay", type=float, default=0.0, help="weight decay rate")
214 |     parser.add_argument("--loss-scale", type=float, default=65536, help="loss scale")
215 | 
216 |     parser.add_argument("--train-iters", type=int, default=2000000)
217 | 
218 | 
219 |     parser.add_argument(
220 |         "--warmup-ratio",
221 |         type=float,
222 |         default=0.03,
223 |     )
224 |     parser.add_argument(
225 |         "--lr-decay-style",
226 |         type=str,
227 |         default="cosine",
228 |         choices=["constant", "linear", "cosine", "exponential", "noam"],
229 |         help="learning rate decay function",
230 |     )
231 |     parser.add_argument("--lr-decay-iters", type=int, default=None, help="lr decay steps")
232 |     parser.add_argument(
233 |         "--start-step", type=int, default=0, help="step to start or continue training"
234 |     )
235 |     parser.add_argument("--tensorboard", type=str, default=None, help="tensorboard log directory")
236 | 
237 | 
238 |     args = parser.parse_args()
239 | 
240 |     train(args)
241 | 
--------------------------------------------------------------------------------
/train/train_legacy/requirements.txt:
--------------------------------------------------------------------------------
1 | transformers>=4.19.0
2 | openprompt
3 | accelerate
4 | torchmetrics
--------------------------------------------------------------------------------
/train/train_legacy/template/template.txt:
--------------------------------------------------------------------------------
1 | {"meta": "context", "post_processing": lambda x: "\n\n".join(x), "shortenable": True} {"text": "\nAssistant:"} {"mask"}
--------------------------------------------------------------------------------
/train/train_legacy/train.py:
--------------------------------------------------------------------------------
1 | # # Conditional Generation with Prefix Tuning.
2 | # In this tutorial, we do conditional generation with a prefix-tuning template.
3 | 
4 | # We use WebNLG as an example as well.
Note that the evaluation of generation results should be done
5 | # using the scripts provided by https://github.com/Yale-LILY/dart/tree/master/evaluation,
6 | # which we do not include here.
7 | 
8 | import argparse
9 | import torch
10 | from openprompt import plms
11 | from openprompt.plms import *
12 | from transformers import GPTJConfig, GPTJModel, GPTJForCausalLM, GPT2Tokenizer
13 | plms._MODEL_CLASSES["gptj"] = ModelClass(**{"config": GPTJConfig, "tokenizer": GPT2Tokenizer, "model": GPTJForCausalLM,
14 |                                             "wrapper": LMTokenizerWrapper})
15 | from openprompt.plms import load_plm
16 | from openprompt.prompts import MixedTemplate
17 | from ultrachat_dataset import UltraChatProcessor
18 | from transformers import AdamW
19 | from openprompt import PromptDataLoader
20 | from openprompt import PromptForGeneration
21 | from transformers.optimization import get_linear_schedule_with_warmup
22 | from accelerate import Accelerator
23 | from torchmetrics import MeanMetric
24 | from sklearn.model_selection import train_test_split
25 | from tqdm import tqdm
26 | from accelerate.utils import set_seed
27 | 
28 | 
29 | 
30 | 
31 | def format_metrics(metrics, split, prefix=""):
32 |     log = f"[{split}]" + prefix
33 |     log += " ".join([f"{key}: {value:.4f}" for key, value in metrics.items()])
34 | 
35 |     return log
36 | 
37 | def evaluate(args, model, val_dataloader, accelerator):
38 |     model.eval()
39 |     val_loss = MeanMetric().to(model.device)
40 | 
41 |     with torch.no_grad():
42 |         for i, batch in enumerate(
43 |             tqdm(val_dataloader),
44 |         ):
45 | 
46 |             loss = model(batch["input_ids"])
47 | 
48 |             loss_values = accelerator.gather_for_metrics({"loss": loss.detach()})
49 | 
50 |             val_loss.update(loss_values["loss"])
51 | 
52 |     return val_loss
53 | 
54 | 
55 | def train(args, accelerator):
56 |     set_seed(0)
57 |     accelerator.print(f"Using {accelerator.num_processes} GPUs")
58 | 
59 |     plm, tokenizer, model_config, WrapperClass = load_plm(args.model, args.model_name_or_path)
60 | 
61 |     mytemplate =
MixedTemplate(model=plm, tokenizer=tokenizer).from_file("./template/template.txt")
62 | 
63 |     with accelerator.main_process_first():
64 |         processor = UltraChatProcessor()
65 |         dataset = processor.get_examples(args.data_file)
66 | 
67 |     train_dataset, val_dataset = train_test_split(dataset, test_size=0.2, random_state=0)
68 | 
69 |     # wrapped_example = mytemplate.wrap_one_example(dataset[1])
70 |     # print(wrapped_example)
71 | 
72 |     train_dataloader = PromptDataLoader(dataset=train_dataset, template=mytemplate, tokenizer=tokenizer,
73 |         tokenizer_wrapper_class=WrapperClass, max_seq_length=1024, decoder_max_length=1024,
74 |         batch_size=2, shuffle=True, teacher_forcing=True, predict_eos_token=True, # be sure to pass predict_eos_token=True if your template doesn't contain one, or your model may fail to stop generation.
75 |         truncate_method="head").dataloader
76 | 
77 |     val_dataloader = PromptDataLoader(dataset=val_dataset, template=mytemplate, tokenizer=tokenizer,
78 |         tokenizer_wrapper_class=WrapperClass, max_seq_length=1024, decoder_max_length=1024,
79 |         batch_size=5, shuffle=False, teacher_forcing=True, predict_eos_token=True, # be sure to pass predict_eos_token=True if your template doesn't contain one, or your model may fail to stop generation.
80 |         truncate_method="head").dataloader
81 | 
82 | 
83 |     # load the pipeline model PromptForGeneration.
84 |     prompt_model = PromptForGeneration(plm=plm, template=mytemplate, tokenizer=tokenizer)
85 | 
86 |     device = accelerator.device
87 |     prompt_model.to(device)
88 | 
89 | 
90 |     optimizer = AdamW([p for p in prompt_model.parameters() if p.requires_grad], lr=args.lr, eps=1e-8)
91 | 
92 |     scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=200, num_training_steps=len(train_dataloader) * args.epochs)
93 | 
94 |     prompt_model, optimizer, train_dataloader, val_dataloader, scheduler = accelerator.prepare(prompt_model, optimizer, train_dataloader, val_dataloader, scheduler)
95 | 
96 |     accelerator.register_for_checkpointing(scheduler)
97 | 
98 |     train_loss = MeanMetric().to(prompt_model.device)
99 | 
100 |     # training and generation.
101 |     global_step = 0
102 |     for epoch in range(args.epochs):
103 |         for step, inputs in tqdm(enumerate(train_dataloader)):
104 |             prompt_model.train()
105 |             loss = prompt_model(inputs["input_ids"])
106 |             accelerator.backward(loss)
107 |             optimizer.step()
108 |             scheduler.step()
109 |             optimizer.zero_grad()
110 | 
111 |             loss_values = accelerator.gather_for_metrics({"loss": loss.detach()})
112 |             train_loss.update(loss_values["loss"])
113 |             global_step += 1
114 | 
115 | 
116 |             if global_step % 50 == 0:
117 |                 accelerator.save_state(f"ultrachat_{args.model}/step_{global_step}")
118 | 
119 |                 val_loss = evaluate(args, prompt_model, val_dataloader, accelerator)
120 | 
121 |                 log_train = {
122 |                     "train_loss": train_loss.compute()
123 |                 }
124 |                 log_val = {
125 |                     "val_loss": val_loss.compute()
126 |                 }
127 | 
128 |                 accelerator.print(f"Current LR: {scheduler.get_last_lr()[0]}")
129 |                 accelerator.print(format_metrics(log_train, "train", f" step {global_step} "))
130 |                 accelerator.print(format_metrics(log_val, "val", f" step {global_step} "))
131 | 
132 |                 train_loss.reset()
133 | 
134 |     accelerator.wait_for_everyone()
135 |     if accelerator.is_main_process:
136 |         torch.save(accelerator.get_state_dict(prompt_model), f"ultrachat_{args.model}/final")
137 | 
138 | if __name__
== "__main__": 139 | parser = argparse.ArgumentParser("") 140 | parser.add_argument("--lr", type=float, default=5e-5) 141 | parser.add_argument("--model", type=str, default='gptj') 142 | parser.add_argument("--model_name_or_path", default='EleutherAI/gpt-j-6b') 143 | parser.add_argument("--epochs", default=5, type=int) 144 | parser.add_argument("--data_file", default="./data/ultrachat_release_230407.json", type=str) 145 | args = parser.parse_args() 146 | # print(args) 147 | 148 | accelerator = Accelerator() 149 | 150 | train(args, accelerator) -------------------------------------------------------------------------------- /train/train_legacy/ultrachat_dataset.py: -------------------------------------------------------------------------------- 1 | from openprompt.data_utils.utils import InputExample 2 | import os 3 | import json, csv 4 | from abc import ABC, abstractmethod 5 | from collections import defaultdict, Counter 6 | from typing import List, Dict, Callable 7 | 8 | from openprompt.utils.logging import logger 9 | from openprompt.data_utils.data_processor import DataProcessor 10 | 11 | import torch 12 | from torch.utils.data import IterableDataset 13 | from tqdm import tqdm 14 | 15 | class UltraChatProcessor(DataProcessor): 16 | def __init__(self): 17 | super().__init__() 18 | self.labels = None 19 | 20 | def get_examples(self, data_path: str) -> List[InputExample]: 21 | examples = [] 22 | j = 0 23 | with open(data_path) as f: 24 | for line in tqdm(f.readlines()): 25 | if line.strip(): 26 | data = json.loads(line) 27 | id_ = data["id"] 28 | dialogue = data["data"] 29 | tags = [i for _ in range(len(dialogue)//2) for i in ["User", "Assistant"]] 30 | for i in range(0, len(dialogue), 2): 31 | tgt_text = dialogue[i+1] 32 | context = dialogue[:i+1] 33 | context = zip(tags[:i+1], context) 34 | context = [": ".join(item) for item in context] 35 | example = InputExample(guid=str(j), text_a="", tgt_text=tgt_text, meta={"context": context}) 36 | examples.append(example) 
37 | j += 1 38 | return examples 39 | 40 | 41 | def get_src_tgt_len_ratio(self,): 42 | pass 43 | 44 | if __name__ == "__main__": 45 | processor = UltraChatProcessor() 46 | dataset = processor.get_examples("./data/ultrachat_release_230407.json") 47 | print(dataset[0]) 48 | print(dataset[5]) 49 | 50 | -------------------------------------------------------------------------------- /train/ultrachat_dataset.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | from typing import * 4 | 5 | 6 | import torch 7 | from torch.utils.data import IterableDataset, Dataset 8 | from tqdm import tqdm 9 | 10 | from transformers.tokenization_utils import PreTrainedTokenizer 11 | import copy 12 | 13 | 14 | def load_single_file(data_file): 15 | with open(data_file)as f: 16 | lines = f.readlines() 17 | return [json.loads(l) for l in lines] 18 | 19 | def load_raw_data(data_file): 20 | raw_dataset = [] 21 | if isinstance(data_file, str): 22 | raw_dataset += load_single_file(data_file) 23 | elif isinstance(data_file, list): 24 | for f_ in data_file: 25 | raw_dataset += load_single_file(f_) 26 | return raw_dataset 27 | 28 | IGNORE_INDEX=-100 29 | 30 | 31 | def collator(tokenizer, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]: 32 | input_ids, labels = tuple([instance[key] for instance in instances] for key in ("input_ids", "labels")) 33 | input_ids = torch.nn.utils.rnn.pad_sequence( 34 | input_ids, batch_first=True, padding_value=tokenizer.pad_token_id 35 | ) 36 | labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True, padding_value=IGNORE_INDEX) 37 | return dict( 38 | input_ids=input_ids, 39 | labels=labels, 40 | attention_mask=input_ids.ne(tokenizer.pad_token_id), 41 | ) 42 | 43 | 44 | class PromptIterableDataset(IterableDataset): 45 | def __init__(self, 46 | raw_dataset: Union[Dataset, List], 47 | sep: List = ["EOS", "\n"], 48 | tokenizer: PreTrainedTokenizer = None, 49 | max_seq_length: Optional[int] = 
512,
50 |                  teacher_forcing: Optional[bool] = True,
51 |                  truncate_method: Optional[str] = "tail",
52 |                  ):
53 |         assert hasattr(raw_dataset, "__iter__"), f"The dataset must have __iter__ method. dataset is {raw_dataset}"
54 |         assert hasattr(raw_dataset, "__len__"), f"The dataset must have __len__ method. dataset is {raw_dataset}"
55 |         self.raw_dataset = raw_dataset
56 |         self.sep = sep
57 |         self._end_token = None
58 |         self.start_token = self.sep[-1]
59 |         self.teacher_forcing = teacher_forcing
60 |         assert self.teacher_forcing, "must use teacher forcing"
61 | 
62 |         self.tokenizer = tokenizer
63 |         self.truncate_method = truncate_method
64 |         self.max_seq_length = max_seq_length
65 |         assert self.truncate_method == "tail", "only tail truncation is supported"
66 | 
67 | 
68 | 
69 |     @property
70 |     def end_token(self):
71 |         if self._end_token is not None:
72 |             return self._end_token
73 |         end_token = self.sep[0]
74 |         if end_token == "EOS":
75 |             self._end_token = self.tokenizer.eos_token
76 |         else:
77 |             self._end_token = end_token
78 |         return self._end_token
79 | 
80 |     def tokenize_example(self, example):
81 |         end_token = self.end_token
82 |         tags = [i for _ in range(len(example["data"])//2) for i in ["User", "Assistant"]]
83 |         labels = []
84 |         tokenized_ids = []
85 |         for i, c in enumerate(example["data"]):
86 |             c_new = tags[i] + ": " + c + end_token
87 |             if i % 2 == 1:
88 |                 # model
89 |                 c_input = self.start_token + tags[i] + ": "
90 |                 tokenized = self.tokenizer(c_input, add_special_tokens=False)
91 |                 tokenized_ids += tokenized["input_ids"]
92 |                 labels += [IGNORE_INDEX] * len(tokenized["input_ids"])
93 | 
94 |                 c_generate = c + end_token
95 |                 tokenized = self.tokenizer(c_generate, add_special_tokens=False)
96 |                 tokenized_ids += tokenized["input_ids"]
97 |                 labels += tokenized["input_ids"]
98 | 
99 |             else:
100 |                 # user
101 |                 if i == 0:
102 |                     # no start token
103 |                     c_new = self.tokenizer.bos_token + tags[i] + ": " + c + end_token
104 |                 else:
105 |                     c_new = self.start_token + tags[i] +
": " + c + end_token 106 | tokenized = self.tokenizer(c_new, add_special_tokens=False) 107 | tokenized_ids += tokenized["input_ids"] 108 | labels += [IGNORE_INDEX] * len(tokenized["input_ids"]) 109 | 110 | assert len(tokenized_ids) == len(labels) 111 | 112 | return {"input_ids": torch.LongTensor(tokenized_ids), "labels": torch.LongTensor(labels)} 113 | 114 | def truncate(self, tokenized_example): 115 | old_len = len(tokenized_example["input_ids"]) 116 | if old_len > self.max_seq_length: 117 | for k in tokenized_example: 118 | tokenized_example[k] = tokenized_example[k][:-(old_len - self.max_seq_length)] 119 | 120 | return tokenized_example 121 | 122 | 123 | def __iter__(self): 124 | for example in self.raw_dataset: 125 | tokenized_example = self.tokenize_example(example) 126 | tokenized_example = self.truncate(tokenized_example) 127 | yield tokenized_example 128 | 129 | def __len__(self): 130 | return len(self.raw_dataset) 131 | 132 | 133 | if __name__ == "__main__": 134 | from transformers import AutoTokenizer, LlamaTokenizer 135 | TEMPLATE = "{} Assistant:" 136 | tokenizer = LlamaTokenizer.from_pretrained("../../llama-7B-HF") 137 | raw_dataset = load_raw_data("../data/processed/part2_1.json") 138 | 139 | dataset = PromptIterableDataset(raw_dataset, tokenizer=tokenizer, max_seq_length=2048, teacher_forcing=True) 140 | for data in dataset: 141 | print(data) 142 | print(tokenizer.decode(data["input_ids"][:1000])) 143 | 144 | model_output = data["input_ids"][:1000][data["labels"][:1000]!=-100] 145 | print("##### model output") 146 | print(tokenizer.decode(model_output)) 147 | break 148 | --------------------------------------------------------------------------------