├── 2M
│   └── README.md
├── 40M
│   └── README.md
├── LICENSE
├── OIG
│   ├── config
│   │   └── ds_config_gptneo.json
│   ├── docs
│   │   ├── README.md
│   │   ├── TRAINING_INSTRUCTIONS.txt
│   │   └── training_text_example.txt
│   ├── notebooks
│   │   └── tokenize_datasets.ipynb
│   └── src
│       ├── abstract_infill.py
│       ├── basic_augment.py
│       ├── child_help.py
│       ├── codeparrot_jupyter_summary.py
│       ├── convfinqa.py
│       ├── cuad.py
│       ├── emergency_numbers.py
│       ├── essays.py
│       ├── flan.py
│       ├── flanv2_cot.py
│       ├── generate_songs.py
│       ├── image_prompting_instructions.py
│       ├── kojma_cot.py
│       ├── labeled_safety.py
│       ├── lyrics.py
│       ├── mathqa.py
│       ├── merged_code_xp3.py
│       ├── minimal_trainer.py
│       ├── multi_news.py
│       ├── ni.py
│       ├── nq.py
│       ├── openai_summarize_tldr.py
│       ├── p3.py
│       ├── poetry.py
│       ├── register_oscar.py
│       ├── scitldr.py
│       ├── soda_dialog.py
│       ├── squad_v2.py
│       ├── suicide_help.py
│       ├── text2sql.py
│       ├── ul2_oscar.py
│       ├── unified_skg.py
│       ├── unnatural_instructions.py
│       ├── utils.py
│       └── xp3.py
├── README.md
└── small_instruction_set
    ├── Readme.md
    ├── chip2_instruct_alpha_v6a_1.json
    ├── chip2_instruct_alpha_v6a_2.json
    ├── chip2_instruct_alpha_v6a_3.json
    └── chip2_instruct_alpha_v6a_4.json

/2M/README.md:
--------------------------------------------------------------------------------
1 | ## WIP 2M instruction dataset
2 | - (1.19M) Soda-dialog: https://drive.google.com/file/d/1-4y6X2gTa-IHJVex0XXtbv3HYTmWXuD4/view?usp=sharing
3 | - (238K) Unnatural_instructions: https://drive.google.com/file/d/1XjR81uqY4UXVrb-46Z65h1Mz0YLr6Nr8/view?usp=sharing
4 | - (224K) Unifiedskg_instructions: https://drive.google.com/file/d/1tk4_ZkXD10dvqcdTUrO1NjrXdst46oAl/view?usp=sharing
5 | - (101K) Modified Version of Rallio's Rosey Data Plus Pro Social and other Safety Data: https://drive.google.com/file/d/1-1Z5wgdv-NLvL87cyFnwhhMzyuk120R7/view?usp=sharing
6 | 
--------------------------------------------------------------------------------
/40M/README.md:
--------------------------------------------------------------------------------
1 | 
2 | 
3 | # This is the Open Instruction Generalist Dataset
4 | 
5 | - See https://huggingface.co/datasets/laion/OIG
6 | 
7 | This is our attempt to create a large instruction dataset of medium quality along with a smaller high quality instruction dataset (OIG-small-chip2).
8 | 
9 | The data is in the form of jsonl objects, with at least a 'text' field. Some datasets may also include a 'metadata' field. The 'text' field contains a string of the form of one or more of:
10 | 
11 | - \<human\>: instruction\n\<bot\>: response
12 | - \<human\>: instruction\n\<bot\>: response .. \<human\>: instruction\n\<bot\>: response
13 | 
14 | The purpose of the larger dataset is to perform continued pre-training, followed by a finetune on the smaller high quality dataset.
15 | 
16 | The purpose of the smaller OIG-small-chip2 dataset is to make it easy to convert a language model pretrained on large amounts of text into an instruction-following model using a small amount of additional compute via finetuning or softprompt tuning.
17 | 
18 | Many additional datasets are being prepared by various community members and will be incorporated into this dataset as we are able to verify the quality and formatting of the data. Our goal is to make helpful and non-toxic instruction-tuned models available to everyone.
19 | 
20 | OIG is currently at 44M. We will continue to publish ever larger diverse instruction datasets with the goal of creating 1 trillion tokens of diverse instructions - enough to pretrain an LLM from scratch.
21 | 
22 | WIP: Explanations of the following datasets will be provided.
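Every shard below can be read with any JSON Lines reader. As a brief illustration of the format described above (this sketch is not part of the repository tooling, and the filename is only an example), a shard can be streamed like this:

```python
import json

def iter_records(path):
    """Yield one dict per line of an OIG jsonl shard."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Print the first few examples from one shard.
for i, record in enumerate(iter_records("unified_squad_v2.jsonl")):
    print(record["text"])           # always present
    print(record.get("metadata"))   # only present in some shards
    if i >= 2:
        break
```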
23 | 
24 | ## unified_abstract_infill.jsonl (~232000)
25 | ## unified_basic.jsonl (30)
26 | ## unified_conv_finqa.jsonl (~9000)
27 | ## unified_cuad.jsonl (~500)
28 | 
29 | ## unified_essays.jsonl (~2000)
30 | - essays available on the public web
31 | ## unified_grade_school_math_instructions.jsonl (~9000)
32 | - https://github.com/openai/grade-school-math
33 | ## unified_hc3_human.jsonl (~58000)
34 | ## unified_image_prompts_instructions.jsonl (~15000)
35 | - A very small subset of LAION-400M
36 | ## unified_joke_explanations.jsonl (356)
37 | - Crawled from the public internet.
38 | ## unified_mathqa_flanv2_kojma_cot.jsonl (~107000)
39 | - https://huggingface.co/datasets/math_qa
40 | ## unified_merged_code_xp3.jsonl (~67000)
41 | - https://huggingface.co/datasets/bigscience/xP3
42 | ## unified_multi_news.jsonl (~90000)
43 | - https://www.tensorflow.org/datasets/catalog/multi_news
44 | ## unified_multi_sum.jsonl (~1700000)
45 | ## unified_nq.jsonl (~307000)
46 | ## unified_openai_summarize_tldr.jsonl (~233000)
47 | - https://github.com/openai/summarize-from-feedback
48 | ## unified_oscar_en_sample_dialog.jsonl (~2670000)
49 | - https://oscar-project.org/
50 | - https://huggingface.co/datasets/TurkuNLP/register_oscar
51 | ## unified_plot_screenplay_books_dialog.jsonl (~8000)
52 | - https://github.com/markriedl/WikiPlots extracted from Wikipedia, snippets from the Pile’s https://huggingface.co/datasets/the_pile_books3, and snippets of screenplays available on the public web.
53 | ## unified_sqlv1.jsonl (~17000)
54 | - public text-to-SQL datasets.
55 | ## unified_sqlv2.jsonl (~24000)
56 | - public text-to-SQL datasets.
57 | ## unified_squad_v2.jsonl (~19000)
58 | - https://rajpurkar.github.io/SQuAD-explorer/
59 | ## unified_squad_v2_more_neg.jsonl (~19000)
60 | - https://rajpurkar.github.io/SQuAD-explorer/
61 | ## unified_ul2_plus_oscar_en_sample_dialog.jsonl (~2900000)
62 | - https://oscar-project.org/
63 | - https://huggingface.co/datasets/TurkuNLP/register_oscar
64 | ## unified_unifiedskg_instructions.jsonl (~223000)
65 | - https://github.com/HKUNLP/UnifiedSKG
66 | ## unified_unnatural_instructions.jsonl (~238000)
67 | - https://github.com/orhonovich/unnatural-instructions
68 | ## unified_xp3_sample.jsonl (~188000)
69 | - https://huggingface.co/datasets/bigscience/xP3
70 | ## unified_canadian_parliament.jsonl (~301000)
71 | - https://openparliament.ca/data-download/
72 | ## unified_poetry_2_song.jsonl (~12000)
73 | - https://huggingface.co/datasets/merve/poetry
74 | - https://huggingface.co/datasets/matthh/gutenberg-poetry-corpus
75 | ## unified_flan.jsonl (~2700000)
76 | - https://github.com/google-research/FLAN/tree/main/flan/v2
77 | ## unified_ni.jsonl (~256000)
78 | - https://github.com/allenai/natural-instructions
79 | ## unified_p3.jsonl (~31000000)
80 | - https://huggingface.co/datasets/bigscience/P3
81 | ## unified_soda_dialog.jsonl (~1200000)
82 | - https://huggingface.co/datasets/allenai/soda
83 | ## unified_rallio_soda_upgraded_2048.jsonl (~210000)
84 | - https://huggingface.co/datasets/allenai/soda
85 | - A newer version of the unified_soda_dialog dataset, with multiple dialogs on one line
86 | - We recommend using either unified_soda_dialog.jsonl or unified_rallio_soda_upgraded_2048.jsonl, and not both.
87 | ## unified_rallio_safety_and_prosocial.jsonl (~319000)
88 | - Generated from public datasets and from Wikipedia, similar to the chip2 data
89 | - Find a full list at the end of the document
90 | - This dataset also includes https://huggingface.co/datasets/allenai/prosocial-dialog and https://huggingface.co/datasets/Anthropic/hh-rlhf
91 | ## unified-chip2.jsonl / OIG-small-chip2 (~210000):
92 | This dataset was created as part of the LAION OA effort by @rallio67 and other LAION contributors. It is a high quality dataset intended to be mixed into a large pre-training dataset and can be used for a final finetune. Chip2 contains:
93 | 
94 | ### Python Code Examples (~6,000):
95 | A set of instruction / response pairs where the user requests the agent to generate a Python function. These examples were generated using a large language model and few-shot prompting, with the Python code verified to execute. There are also ~3,000 manually curated one-line Python code examples from the Conala publication (see: https://conala-corpus.github.io/).
96 | 
97 | ### Natural Instruction Examples (~124,000):
98 | A balanced set of diverse natural and factual questions and answers made using few-shot prompted UL2 20B and an instruction-tuned GPT-NeoX-20B model (Chip), then rejection sampled using multiple automatic evaluations to remove low quality outputs and to filter out factually inaccurate answers. Also includes some filtered natural instructions from the Anthropic Helpful instructions (see: https://github.com/anthropics/hh-rlhf).
99 | 
100 | ### Generic Harmless Instruction Examples (~6,500):
101 | A set of instruction / response pairs sourced from the Anthropic red-team paper GitHub (see: https://github.com/anthropics/hh-rlhf). This dataset includes a lot of data regarding real humans trying to make the Anthropic language models say harmful/toxic/trolling things. Only examples that were rated low on the harmful scale (0, 1, or 2 out of 4, where 4 is the most toxic) were included. Again, only the first lines of dialogue (instruction, first_agent_response) were retained.
102 | 
103 | ### Instruction/Responses with Lists (~14,000):
104 | A set of filtered and reformatted instruction / response pairs where the agent response contains a list. Sourced from the Anthropic GitHub (see: https://github.com/anthropics/hh-rlhf), from wikihow text lists created by b-mc2 (https://huggingface.co/datasets/b-mc2/wikihow_lists), and from rejection-filtered instruction / response pairs generated by Chip20B that contained lists. All lists are formatted in a similar style.
105 | 
106 | ### Follow-up questions (~12,500):
107 | Examples of instructions and responses where an appropriate response is to ask for more information from the prompter. These examples were generated from a combination of few-shot prompted UL2 20B (to generate natural questions) and a large dialogue-prompted language model to generate the responses containing follow-up questions.
108 | 
109 | ### Wikipedia Toxic Adversarial Questions (~12,000):
110 | Questions and answers generated from Wikipedia articles that discuss potentially sensitive topics (flagged as potentially toxic by an early toxicity detection model).
111 | 
112 | ### Grade School Math GSM8K (~9,000):
113 | GSM8K is a dataset of 8.5K high quality, linguistically diverse grade school math word problems created by human problem writers. The dataset is segmented into 7.5K training problems and 1K test problems.
These problems take between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − ×÷) to reach the final answer. A bright middle school student should be able to solve every problem. It can be used for multi-step mathematical reasoning. (https://github.com/openai/grade-school-math) 114 | 115 | ### Reasoning Instructions (~4,500): 116 | Examples from the Com2Sense and Strategy QA datasets that were reformatted into natural instructions using large language models with few shot prompting and additional quality filtering steps. 117 | 118 | ### Character and Scene Descriptions (~30,000): 119 | Examples of instructions and responses for the generation of character or scene descriptions. Scenes were sourced from video game wikis and reformatted into instruction / response format using large language models or generated by few shot prompting with large language models. 120 | 121 | ## Support this project 122 | Your contributions and feedback support the open source ecosystem, improve the bot and provide datasets for future AI research. To participate you can: 123 | 124 | Submit Github issues, track issues and help create datasets that need improvement. https://github.com/LAION-AI/Open-Instruction-Generalist 125 | Join our Discord to talk with other team members working on this! https://discord.gg/xBPBXfcFHd 126 | 127 | ## Disclaimer 128 | These datasets contain synthetic data and in some cases data that includes humans trying to get the language model to say toxic/offensive/trolling things. If you are concerned about the presence of this type of material in the dataset please make sure you carefully inspect each of the entries and filter appropriately. Our goal is for the model to be as helpful and non-toxic as possible and we are actively evaluating ways to reduce or eliminate undesirable content from the instruction tuning datasets. 129 | 130 | ## License 131 | The OIG dataset that is authored by LAION volunteers is released under an Apache 2.0 license. However, the data also includes content licensed under other permissive licenses such as Wikipedia data which is licensed under CC-BY-SA, or web-crawled data which is used under fair use principles. 132 | 133 | ## Acknowledgement 134 | - We would like to thank all of our amazing LAION volunteers including: @Rallio, @Jue, @Ce Zhang, @Player-1, @Laurel, @danielpatrickhug, @Jjmachan, @Mylo, @Khalid, @Coco.han, @Jordiclive, @Pszemraj, and many others. 135 | - We would like to thank Together for their tireless dedication to the open source and AI community and their contribution to many of the datasets. 136 | - We would like to thank AI Horde and user @Db0 for their incredible contribution of filtered data that were flagged as unethical. 137 | - Lastly, Ontocord.ai’s founders are grateful to have the opportunity to create a portion of the data augmentation and safety-moderation code for this project. 138 | 139 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 
11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. 
Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 
134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 
193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /OIG/config/ds_config_gptneo.json: -------------------------------------------------------------------------------- 1 | { 2 | "fp16": { 3 | "enabled": "auto", 4 | "loss_scale": 0, 5 | "loss_scale_window": 1000, 6 | "initial_scale_power": 16, 7 | "hysteresis": 2, 8 | "min_loss_scale": 1 9 | }, 10 | "optimizer": { 11 | "type": "AdamW", 12 | "params": { 13 | "lr": "auto", 14 | "betas": "auto", 15 | "eps": "auto", 16 | "weight_decay": "auto" 17 | } 18 | }, 19 | "scheduler": { 20 | "type": "WarmupLR", 21 | "params": { 22 | "warmup_min_lr": "auto", 23 | "warmup_max_lr": "auto", 24 | "warmup_num_steps": "auto" 25 | } 26 | }, 27 | "zero_optimization": { 28 | "stage": 2, 29 | "allgather_partitions": true, 30 | "allgather_bucket_size": 2e8, 31 | "overlap_comm": true, 32 | "reduce_scatter": true, 33 | "reduce_bucket_size": 2e8, 34 | "contiguous_gradients": true, 35 | "cpu_offload": true 36 | }, 37 | "gradient_accumulation_steps": "auto", 38 | "gradient_clipping": "auto", 39 | "steps_per_print": 2000, 40 | "train_batch_size": "auto", 41 | "train_micro_batch_size_per_gpu": "auto", 42 | "wall_clock_breakdown": false 43 | } 44 | -------------------------------------------------------------------------------- /OIG/docs/README.md: -------------------------------------------------------------------------------- 1 | ## Open Instruction Generalist (OIG) 2 | 3 | OIG is a dataset of many synthetic instructions and a system for training a conversation agent. A reference model will also be created. 4 | 5 | Creating Pretrained Instruction Data: https://docs.google.com/document/d/13a188pPvqnlvuVa3e_suVz4YO5s-JWeiOOrpp0odImg/edit?usp=sharing 6 | 7 | https://docs.google.com/presentation/d/1UyhEg9z2MolUMvEwjTNJn_Z5J9UIznqsmTNYKUNEGbY/edit#slide=id.p 8 | 9 | OIG Training logs: https://docs.google.com/document/d/18oJidGTAk6NpFiLBcp4slyKgSV8aNs9M8j5Gay1yRFs/edit?usp=drive_web&ouid=103208749093990169703 10 | -------------------------------------------------------------------------------- /OIG/docs/TRAINING_INSTRUCTIONS.txt: -------------------------------------------------------------------------------- 1 | # Read this to find out exactly how to train a gpt-neox style model. 2 | 3 | First, you need to make sure you have the appropriate libraries 4 | installed on your system. 5 | 6 | INSTALL LIBRARIES: 7 | 8 | pytorch 9 | cudatoolkit 10 | cudnn 11 | transformers 12 | datasets 13 | accelerate 14 | deepspeed 15 | 16 | PREPARE THE DATASET: 17 | 18 | Generate a text data with your training examples in exactly the format 19 | shown in "rallio_test.txt". 20 | 21 | The text format is as follows: 22 | ``` 23 | User: I have a question about life 24 | 25 | Agent: I have the answer to your question.<|endoftext|> 26 | ``` 27 | You can change this if you like, but the dataset preparation script 28 | expects this format and may not work if you change it. It is fine to 29 | rename "Agent" to whatever name you like. 
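If your examples live in some other structure, a minimal sketch for writing them
out in this format is shown below (the example pair and the output filename are
only placeholders, and the sketch assumes your data is already a list of
(user, agent) string pairs):

```
# Sketch: write (user, agent) pairs into the text format described above.
# Each example is a "User:" turn, a blank line, and an "Agent:" turn that
# ends with the <|endoftext|> token.
pairs = [
    ("I have a question about life", "I have the answer to your question."),
]

with open("my_training_data.txt", "w", encoding="utf-8") as out:
    for user_turn, agent_turn in pairs:
        out.write(f"User: {user_turn}\n\n")
        out.write(f"Agent: {agent_turn}<|endoftext|>\n")
```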
30 | 
31 | After you have saved your training data as a text file, proceed to the
32 | "tokenize_datasets.ipynb" notebook to prepare your data. There are many ways to
33 | format data. I have chosen to use only examples up to 280 tokens long and
34 | to pad any examples shorter than 280 tokens and mask the padding tokens. If
35 | you choose to use a longer context than 280, you may (WILL) need to adjust
36 | your per device batch size when you train your model.
37 | 
38 | Run all the cells in the Jupyter notebook and you will generate a train and
39 | eval dataset for use in your model finetuning. You are now ready to train your
40 | model.
41 | 
42 | TRAIN THE MODEL:
43 | 
44 | Now you need to modify lines 214 and 215 of minimal_trainer.py to point to the
45 | directory of your saved dataset.
46 | 
47 | ```
48 | train_dataset = load_from_disk("path_to/my_training_data")
49 | eval_dataset = load_from_disk("path_to/my_eval_data")
50 | ```
51 | 
52 | Now you are ready to launch the trainer. Here is an example launch configuration.
53 | Note the deepspeed config "ds_config_gptneo.json".
54 | This will work for the pythia series and gpt-neox style models. You may change the
55 | configuration if you have some knowledge of deepspeed.
56 | 
57 | ```
58 | Example launch configuration to execute from the command line.
59 | 
60 | deepspeed --num_gpus=1 minimal_trainer.py \
61 |  --deepspeed ds_config_gptneo.json \
62 |  --model_name_or_path pythia-1.4b-deduped \
63 |  --do_train \
64 |  --do_eval \
65 |  --block_size 280 \
66 |  --fp16 \
67 |  --overwrite_cache \
68 |  --evaluation_strategy="steps" \
69 |  --output_dir custom_1.4B_512bs \
70 |  --num_train_epochs 1 \
71 |  --eval_steps 200 \
72 |  --gradient_accumulation_steps 1 \
73 |  --per_device_train_batch_size 64 \
74 |  --use_fast_tokenizer True \
75 |  --learning_rate 5e-06 \
76 |  --warmup_steps 5
77 | ```
78 | 
79 | After you launch this script the model will begin training and there will be a lot of
80 | scrolling on the screen. This minimal trainer does not have any sophisticated logging
81 | features, so if you care about that you will need to add that functionality yourself.
82 | 
83 | GOOD LUCK
84 | 
--------------------------------------------------------------------------------
/OIG/docs/training_text_example.txt:
--------------------------------------------------------------------------------
1 | User: I want to learn a new language, what are the benefits of learning Spanish?
2 | 
3 | Joi: The benefits of learning Spanish include gaining a better understanding of the culture and history of Spanish-speaking countries, improving job opportunities by having a second language, expanding your global connections, improving your mental flexibility and memory, and gaining a better ability to communicate with more people. Additionally, learning Spanish can make travel to Spanish-speaking countries easier, as well as offer access to the works of great Spanish-speaking authors and poets.<|endoftext|>
4 | User: What kind of food can I take for lunch to work but still keep a healthy diet?
5 | 
6 | Joi: A good option for lunch that you can take to work and still maintain a healthy diet is a salad made with mixed greens, nuts and seeds, beans or legumes, and fresh vegetables. You can also add a lean protein like chicken, eggs, or fish, and a variety of fruits or other healthy toppings. Another great option is to make a wrap using a whole wheat wrap or wrap alternative, and filling it with lean meats, veggies, and hummus or other dips.
You can also make your own soup, or grab one from a store, adding a side of fresh or steamed veggies and whole grain crackers.<|endoftext|> 7 | User: Should I go to the movies or to the theatre tonight? 8 | 9 | Joi: It depends on what you're in the mood for. If you want to watch a movie, then the cinema is probably your best bet. But if you want to experience live theatre, then the theatre might be more up your alley.<|endoftext|> 10 | User: I have a child who has special needs. How do I make sure that he has a great vacation? 11 | 12 | Joi: Make sure you plan ahead and research the best places to take your child. Look for hotels, attractions, restaurants, etc. that are accessible and comfortable for him. Talk to other parents of children with similar needs to find out what they recommend. Ask your child’s doctor or therapist for advice on how to make the experience enjoyable. Consider hiring a service such as a wheelchair-accessible van to help transport your child around. Lastly, don't forget to pack some extra items like snacks, books, toys, and games in case your child gets bored during the trip.<|endoftext|> 13 | -------------------------------------------------------------------------------- /OIG/notebooks/tokenize_datasets.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "id": "849ca276-0202-4f51-afa7-eeebd21939cc", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "\"\"\"\n", 11 | "This notebook takes a flat text file of training\n", 12 | "examples that are delimited by EOS tokens (<|endoftext|>)\n", 13 | "and converts it into a training and evaluation dataset\n", 14 | "to be used with the minimal_trainer.py training script.\n", 15 | "\n", 16 | "It will filter, pad, and attention mask based on the\n", 17 | "maximum length. 
This should match the block_size in the\n", 18 | "minimal_trainer.py script.\n", 19 | "\"\"\"\n", 20 | "\n", 21 | "from transformers import AutoTokenizer\n", 22 | "from datasets import load_from_disk, Dataset\n", 23 | "import pandas as pd\n", 24 | "import datasets\n", 25 | "import torch\n", 26 | "import random" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 2, 32 | "id": "11be9ba3-c9cd-4916-ac77-684f6187ca99", 33 | "metadata": {}, 34 | "outputs": [], 35 | "source": [ 36 | "# Load tokenizer\n", 37 | "\n", 38 | "tokenizer = AutoTokenizer.from_pretrained(\"EleutherAI/gpt-neox-20b\")" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 3, 44 | "id": "5764b95b-e29e-4e4d-85cb-622fc2cb2d80", 45 | "metadata": {}, 46 | "outputs": [ 47 | { 48 | "name": "stdout", 49 | "output_type": "stream", 50 | "text": [ 51 | "335035\n" 52 | ] 53 | } 54 | ], 55 | "source": [ 56 | "# Load text file line by line\n", 57 | "# Rallio_test.txt is just some random text examples.\n", 58 | "# Caution: has not been carefully reviewed, could contain toxic materials.\n", 59 | "\n", 60 | "with open('rallio_test.txt') as my_file:\n", 61 | " data = my_file.read()\n", 62 | "print(len(data))" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": 5, 68 | "id": "bc6ce721-1e4f-4495-a74f-acb35c630a19", 69 | "metadata": {}, 70 | "outputs": [ 71 | { 72 | "name": "stdout", 73 | "output_type": "stream", 74 | "text": [ 75 | "You have this many training examples: 697\n" 76 | ] 77 | } 78 | ], 79 | "source": [ 80 | "# Split entries by EOS token and remove any unneeded newlines\n", 81 | "\n", 82 | "entries=data.split(\"<|endoftext|>\")\n", 83 | "count=0\n", 84 | "fixed=[]\n", 85 | "for i in entries:\n", 86 | " new_line=\"\"\n", 87 | " if i[-1]==\"\\n\" and i[0] ==\"\\n\":\n", 88 | " new_line=i[1:-1]\n", 89 | " count+=1\n", 90 | " elif i[0]==\"\\n\":\n", 91 | " new_line=i[1:]\n", 92 | " elif i[-1] == \"\\n\":\n", 93 | " new_line=i[:-1]\n", 94 | " if len(new_line) > 5:\n", 95 | " fixed.append(new_line)\n", 96 | " else:\n", 97 | " fixed.append(i)\n", 98 | "print(\"You have this many training examples: \"+str(len(fixed)))" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": 6, 104 | "id": "05a1c9df-d86b-4c62-93aa-c19fc90ead31", 105 | "metadata": {}, 106 | "outputs": [], 107 | "source": [ 108 | "#Add back EOS tokens. 
I have chosen to put two endoftext tokens.\n", 109 | "#Probably only one is needed.\n", 110 | "\n", 111 | "fixed_tokens=[]\n", 112 | "for i in fixed:\n", 113 | " line=i+\"<|endoftext|><|endoftext|>\"\n", 114 | " tokens=tokenizer.encode(line)\n", 115 | " fixed_tokens.append((line,tokens))" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": null, 121 | "id": "e45da388-0dea-410f-95e9-f807cf661692", 122 | "metadata": {}, 123 | "outputs": [], 124 | "source": [ 125 | "# Set the maximum token length per item.\n", 126 | "# Pad and mask any entries shorter than max_length.\n", 127 | "\n", 128 | "max_length=280\n", 129 | "\n", 130 | "attention_mask=[]\n", 131 | "input_ids=[]\n", 132 | "labels=[]\n", 133 | "\n", 134 | "for i in fixed_tokens:\n", 135 | " length=len(i[1])\n", 136 | " attention=[]\n", 137 | " if length < max_length:\n", 138 | " for k in range(0,(max_length-length)):\n", 139 | " entry=i[1]\n", 140 | " entry.append(1)\n", 141 | " for k in range(0,(length)):\n", 142 | " attention.append(1)\n", 143 | " for k in range(0,(max_length-length)):\n", 144 | " attention.append(0)\n", 145 | " attention_mask.append(attention)\n", 146 | " input_ids.append(entry)\n", 147 | " labels.append(entry)\n", 148 | "\n", 149 | "# Print out and inspect the first entry.\n", 150 | "print(fixed_tokens[0])\n", 151 | "print(attention_mask[0])\n", 152 | "print(input_ids[0])\n", 153 | "print(labels[0])" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": 8, 159 | "id": "aae3a1db-7f24-4f9a-9c57-7a8ed8653d9b", 160 | "metadata": {}, 161 | "outputs": [], 162 | "source": [ 163 | "# Create pandas dataframe\n", 164 | "\n", 165 | "df = pd.DataFrame({\"attention_mask\": attention_mask, \"input_ids\":input_ids,\"labels\":labels})" 166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": 12, 171 | "id": "b4662b44-6d9e-46c3-b47e-3375cb682397", 172 | "metadata": {}, 173 | "outputs": [ 174 | { 175 | "name": "stdout", 176 | "output_type": "stream", 177 | "text": [ 178 | "Training examples: 690\n", 179 | "Evaluation examples: 7\n" 180 | ] 181 | } 182 | ], 183 | "source": [ 184 | "# Create dataset\n", 185 | "\n", 186 | "new_dataset=datasets.Dataset.from_pandas(df)\n", 187 | "split_dataset = new_dataset.train_test_split(test_size=0.01)\n", 188 | "train_dataset=split_dataset['train']\n", 189 | "eval_dataset=split_dataset['test']\n", 190 | "\n", 191 | "print(\"Training examples: \"+str(len(train_dataset)))\n", 192 | "print(\"Evaluation examples: \"+str(len(eval_dataset)))" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": 13, 198 | "id": "0acd7efe-7ff1-4de7-944a-e1ff9251ec50", 199 | "metadata": {}, 200 | "outputs": [ 201 | { 202 | "data": { 203 | "application/vnd.jupyter.widget-view+json": { 204 | "model_id": "5c397c8cf5eb46dbadd71a8cb035f227", 205 | "version_major": 2, 206 | "version_minor": 0 207 | }, 208 | "text/plain": [ 209 | "Flattening the indices: 0%| | 0/1 [00:00 5): 71 | j+= 1 72 | if j > 4: 73 | output.write (json.dumps({'page_title': b, 'labeled_dialog': a})+"\n") 74 | break 75 | #print (ent.label_, ent.text) 76 | 77 | #poorman's reverb. TODO: we need to use semantic matching of relationship to paragraph to filter out bad relationships. 
78 | def get_verb_relation(text): 79 | doc = basic_nlp(text) 80 | verb_relationship = "" 81 | orig_verb = "" 82 | for token in doc: 83 | #print (token, token.tag_) 84 | if token.tag_.startswith("VB") and token.tag_ not in {"VBZ", } and token.lemma_ not in {'do', 'be', 'have', 'list'}: 85 | orig_verb = token.text 86 | verb_relationship = str(token.lemma_) 87 | continue 88 | if verb_relationship: 89 | if token.tag_ == "IN": 90 | orig_verb += " "+token.text 91 | verb_relationship += "_"+str(token.lemma_) 92 | break 93 | else: 94 | break 95 | if verb_relationship == "bear": 96 | verb_relationship = "born" 97 | return verb_relationship, orig_verb 98 | 99 | #need to filter out rel that don't match embedding of full text. these are spurious 100 | def ner_rel_template_extract(text, min_ner_len=5, length_for_rel=50, min_ner_per_domain=3): 101 | ret = {} 102 | orig_text = text 103 | text2 = text.replace("{", "-lbracket-").replace("}", "-rbracket-") 104 | ner_cnt = {} 105 | for nlp in [blackstone, sci, basic_nlp]: 106 | doc =nlp(text) 107 | ents = [(ent.text.strip(), ent.label_) for ent in list(doc.ents) if len(ent.text.strip()) >= min_ner_len] 108 | if nlp != basic_nlp and len(ents) < min_ner_per_domain: continue 109 | ents.sort(key=lambda a: len(a[0]), reverse=True) 110 | for st, label in ents: 111 | #we are not doing NER for code 112 | if "->" in st or "{" in st or "}" in st: continue 113 | if st in text: 114 | ner_cnt[label] = ner_cnt.get(label, -1) 115 | ner_cnt[label] += 1 116 | if ner_cnt[label] > 0: 117 | text2 = text2.replace(st,'{'+label+'_'+str(ner_cnt[label])+'}') 118 | ret[st] = label+'_'+str(ner_cnt[label]) 119 | else: 120 | text2 = text2.replace(st,'{'+label+'}') 121 | ret[st] = label 122 | text = text.replace(st,' ') 123 | rels =[] 124 | if nlp == basic_nlp: 125 | 126 | args = dict([(b, "{"+a+"}") for a, b in ret.items() ]) 127 | if args: 128 | text3 = text2.format(**args) 129 | text4 = text3.replace("{", " ").replace("}", " ") 130 | for entity in ret.keys(): 131 | if "{"+entity+"}" not in text3: 132 | continue 133 | #print ('problem', "{"+entity+"}", '***', text3) 134 | text5= text4[text3.index("{"+entity+"}"):] 135 | if len(text5) > length_for_rel: 136 | text5 = text5[:length_for_rel] 137 | rel, orig_verb = get_verb_relation(text5) 138 | if "{"+entity+"}" in text3 and rel: 139 | text6 = text3[text3.index("{"+entity+"}"):].split(orig_verb) 140 | if len(text6) < 2: continue 141 | text6 = text6[1] 142 | if "{" in text6: 143 | text6 = text6.split("{",1)[1] 144 | if "}" in text6: 145 | entity2 = text6.split("}")[0] 146 | rels.append ((entity.replace(" ", "_") ,rel, entity2.replace(" ", "_") )) 147 | 148 | return ret, text2.replace("-lbracket-", "{").replace("-rbracket-", "}"), rels 149 | 150 | def output_data(entity, instructions, context, output, min_ner_len=5, length_for_rel=50): 151 | context = context[0] 152 | context_arr = context.split(".") 153 | style = "" 154 | if len(context_arr) >= 24: 155 | style = " in six paragraphs" 156 | mult = int(len(context_arr)/6) 157 | context_arr[mult] = "\n"+context_arr[mult].strip() 158 | context_arr[2*mult] = "\n"+context_arr[2*mult].strip() 159 | context_arr[3*mult] = "\n"+context_arr[3*mult].strip() 160 | context_arr[4*mult] = "\n"+context_arr[3*mult].strip() 161 | context_arr[5*mult] = "\n"+context_arr[3*mult].strip() 162 | context = ".".join(context_arr) 163 | if len(context_arr) >= 20: 164 | style = " in five paragraphs" 165 | mult = int(len(context_arr)/5) 166 | context_arr[mult] = "\n"+context_arr[mult].strip() 167 | context_arr[2*mult] = 
"\n"+context_arr[2*mult].strip() 168 | context_arr[3*mult] = "\n"+context_arr[3*mult].strip() 169 | context_arr[4*mult] = "\n"+context_arr[3*mult].strip() 170 | context = ".".join(context_arr) 171 | if len(context_arr) >= 16: 172 | style = " in four paragraphs" 173 | mult = int(len(context_arr)/4) 174 | context_arr[mult] = "\n"+context_arr[mult].strip() 175 | context_arr[2*mult] = "\n"+context_arr[2*mult].strip() 176 | context_arr[3*mult] = "\n"+context_arr[3*mult].strip() 177 | context = ".".join(context_arr) 178 | elif len(context_arr) >= 12: 179 | style = " in three paragraphs" 180 | context_arr[4] = "\n"+context_arr[4].strip() 181 | context_arr[8] = "\n"+context_arr[8].strip() 182 | context = ".".join(context_arr) 183 | elif len(context_arr) >= 8: 184 | style = " in two paragraphs" 185 | context_arr[4] = "\n"+context_arr[4].strip() 186 | context = ".".join(context_arr) 187 | elif len(context_arr) >= 4: 188 | style = " in one paragraph" 189 | if random.randint(0,3) > 0: return 190 | elif len(context_arr) == 3: 191 | style = " in three sentences" 192 | if random.randint(0,5) > 0: return 193 | else: 194 | return 195 | ner_rel = ner_rel_template_extract(context, min_ner_len=min_ner_len, length_for_rel=length_for_rel) 196 | first_sent = basic_nlp(context_arr[0]) 197 | first_sent = [a.text for a in first_sent.noun_chunks if a.text not in entity and a.text.lower() != a.text and len(a.text) > 4] 198 | if len(first_sent) > 3: 199 | first_sent = first_sent[:3] 200 | #print ("###") 201 | if ner_rel and first_sent: 202 | ner = [a for a in ner_rel[0] if a not in entity and a not in first_sent] 203 | if len(ner) > 2: 204 | ner = ner[:2] 205 | context_instruction = (f": Write me an article about "+ ", ".join(first_sent) + ", discussing in detail " + ", ".join(ner)+ style + ".") 206 | elif first_sent: 207 | context_instruction = (f": Write me an article about "+ ", ".join(first_sent) + style + ".") 208 | elif ner_rel: 209 | ner = [a for a in ner_rel[0] if a not in entity] 210 | if len(ner) > 2: 211 | ner = ner[:2] 212 | context_instruction = (f": Write me an article about "+ ", ".join(ner)+ style + ".") 213 | else: 214 | ner = [a for a in ner_rel[0] if a not in entity] 215 | if len(ner) > 2: 216 | ner = ner[:2] 217 | context_instruction = (f": Write me an article about {entity}"+ style + ".") 218 | 219 | 220 | last_sent = basic_nlp(context_arr[-2]) 221 | if (context_instruction or first_sent) and last_sent != context_arr[0]: 222 | last_sent = [a.text for a in last_sent.noun_chunks if a.text not in entity and a.text.lower() != a.text and len(a.text) > 4] 223 | if len(last_sent) > 2: 224 | last_sent = last_sent[:2] 225 | if last_sent and random.randint(0,1) == 0: 226 | context_instruction += (f" End with a sentence about "+ ", ".join(last_sent)+".") 227 | 228 | instructions = instructions.strip() 229 | format_type = random.randint(0,3) 230 | if format_type == 0: 231 | out = (context_instruction + "\n" + ": " + context+ "\n"+ instructions) 232 | out = out.replace("Write me an article about", random.choice(["Write me an article about", "Provide an article about", "Give me an article about"])) 233 | elif format_type == 1: 234 | first_instruction = instructions.split("\n")[0].split(": ",1)[1].strip() 235 | if first_instruction[1:].lower() == first_instruction[1:]: 236 | ner_rel_text = "; ".join(str(a) for a in ner_rel[-1]) if ner_rel[-1] else ('' if not ner_rel[0] else "; ".join(str(a) for a in ner_rel[0].items()) ) 237 | if not ner_rel_text: return 238 | instructions = ": " + first_instruction + "\n\n" + 
": I'm sorry I can't answer that question based on the information I have.\n\n" + \ 239 | ": Answer the question assuming the following : " + ner_rel_text+ ".\n\n" + instructions.split("\n\n",1)[-1] 240 | out = (instructions+"\n"+context_instruction + "\n" + ": " + context) 241 | out = out.replace("Write me an article about", random.choice(["Based on the above, write me an article about", "Using the above, provide an article about", "Summarizing the above, give me an article about"])) 242 | else: 243 | if entity.replace("_", " ") not in instructions.split("\n")[0] and entity.replace("_", " ").lower() not in instructions.split("\n")[0]: 244 | instructions = ": " + random.choice(["Tell me about", "Provide one sentence about", "Briefly describe"]) + " " + entity.replace("_", " ") +".\n\n"+ \ 245 | ": "+ context_arr[0] + ".\n\n" + instructions 246 | out = ("Background: " + context+ "\n"+ instructions) 247 | out = out.replace("\n\n", "\n").replace("()", "").replace(" ", " ") 248 | #print ("###") 249 | #print (out) 250 | output.write (json.dumps({'text': out, 'metadata': {'source': 'infil_dbpedia'}})+"\n") 251 | 252 | 253 | aHash = sci = data = basic_nlp = blackstone = aHash = None 254 | def create_abstract_infil(output): 255 | #TODO clear context, output jsonl, list, table format. algorithmic ops 256 | global aHash, sci, data, basic_nlp, blackstone 257 | if not os.path.exists("/content/data.parquet"): 258 | !wget https://huggingface.co/datasets/ericyu3/openassistant_inpainted_dialogs/resolve/main/data.parquet 259 | 260 | if not os.path.exists("/content/long-abstracts_lang=en.ttl"): 261 | !wget https://databus.dbpedia.org/dbpedia/text/long-abstracts/2022.09.01/long-abstracts_lang=en.ttl.bz2 262 | !bunzip2 long-abstracts_lang=en.ttl.bz2 263 | 264 | try: 265 | if sci is None: assert False 266 | except: 267 | sci = spacy.load("en_ner_craft_md") 268 | data = pd.read_parquet('data.parquet', engine='pyarrow') 269 | basic_nlp = spacy.load('en_core_web_sm') 270 | blackstone = spacy.load("en_blackstone_proto") 271 | # add the other scispacy ner 272 | 273 | if aHash is None: 274 | 275 | aHash = {} 276 | with open("/content/long-abstracts_lang=en.ttl") as input: 277 | for l in input: 278 | l = l.strip() 279 | l = l.split(" ",2) 280 | entity = l[0].split("/")[-1].split(">")[0].lower().replace("&", "&").strip("_").replace("-", "_") 281 | #topic = l[1].split("/")[-1].split(">")[0] 282 | sent = l[-1].split("\"@")[0].strip('"00') 283 | aHash[unidecode(entity)] = aHash.get(unidecode(entity), []) + [sent] 284 | if entity.count("_") > 1: 285 | entity2 = unidecode("_".join(entity.split("_")[:2]).strip("_")) 286 | if entity2 not in aHash: 287 | aHash[entity2] = aHash.get(entity2, []) + [sent] 288 | if entity.count("_") > 2: 289 | entity2 = unidecode("_".join(entity.split("_")[:3]).strip("_")) 290 | if entity2 not in aHash: 291 | aHash[entity2] = aHash.get(entity2, []) + [sent] 292 | if "(" in entity: 293 | entity, cat = entity.split("(", 1) 294 | cat = cat.split("_") 295 | entity = unidecode(entity + "("+cat[0]+")") 296 | aHash[entity] = aHash.get(entity, []) + [sent] 297 | 298 | 299 | 300 | 301 | for a, b in zip(data['labeled_dialog'],data['page_title']): 302 | b = b.replace("(, ","(").replace("()","").replace(" ", " ") 303 | a = a.replace("Are there any other interesting aspects about this article?", random.choice(["more please", "next", "continue", "and?", "tell me more", "anything else?"])) 304 | a = a.replace("What else did you find important?", random.choice(["more please", "next", "continue", "and?", "tell me 
more", "anything else?"])) 305 | b = b.replace(" ", "_").replace("&", "&").strip("_") 306 | if unidecode(b.lower().replace("-", "_")) not in aHash: 307 | if "(" in b: 308 | b2, cat = b.split("(", 1) 309 | cat = cat.split("_") 310 | b2 = b2 + "("+cat[0]+")" 311 | if unidecode(b2.lower().replace("-", "_")) in aHash: 312 | context = aHash[ unidecode(b2.lower().replace("-", "_"))] 313 | output_data(b, a, context, output) 314 | continue 315 | if b2.count("_") > 1: 316 | b2 = "_".join(b2.split("_")[:2]).strip("_") 317 | if unidecode(b2.lower().replace("-", "_")) in aHash: 318 | context = aHash[ unidecode(b2.lower().replace("-", "_"))] 319 | output_data(b, a, context, output) 320 | continue 321 | if b2.count("_") > 2: 322 | b2 = "_".join(b2.split("_")[:3]).strip("_") 323 | if unidecode(b2.lower().replace("-", "_")) in aHash: 324 | context = aHash[ unidecode(b2.lower().replace("-", "_"))] 325 | output_data(b, a, context, output) 326 | continue 327 | 328 | else: 329 | context = aHash[unidecode(b.lower().replace("-", "_"))] 330 | output_data(b, a, context, output) 331 | 332 | def abstract_infil2(output): 333 | i = 0 334 | with open("/content/drive/Shareddrives/LAION/OIG/abstact_infill.jsonl") as infile: 335 | for l in infile: 336 | data = json.loads(l.strip()) 337 | data['metadata'] = data['meta'] 338 | del data['meta'] 339 | text = data['text'] 340 | text_arr = text.split(":") 341 | b = ([a for a in text_arr if "('" in a]) 342 | if b: 343 | print (b) 344 | i+=1 345 | #if i > 20: break 346 | print (i) 347 | #abstract_infil2(None) 348 | -------------------------------------------------------------------------------- /OIG/src/codeparrot_jupyter_summary.py: -------------------------------------------------------------------------------- 1 | #@title codeparrot jupyter summary 2 | """ 3 | Copyright 2023, LAION contributors, inclduing Ontocord, LLC 4 | and the other authors of OIG 5 | Licensed to the Apache Software Foundation (ASF) under one 6 | or more contributor license agreements. See the NOTICE file 7 | distributed with this work for additional information 8 | regarding copyright ownership. The ASF licenses this file 9 | to you under the Apache License, Version 2.0 (the 10 | "License"); you may not use this file except in compliance 11 | with the License. You may obtain a copy of the License at 12 | http://www.apache.org/licenses/LICENSE-2.0 13 | Unless required by applicable law or agreed to in writing, 14 | software distributed under the License is distributed on an 15 | "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 16 | KIND, either express or implied. See the License for the 17 | specific language governing permissions and limitations 18 | under the License. 19 | """ 20 | 21 | import re, json 22 | 23 | pattern = re.compile(r'(? 
10: break 67 | -------------------------------------------------------------------------------- /OIG/src/convfinqa.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset 2 | import csv 3 | import json 4 | import random 5 | import tqdm 6 | 7 | from nltk.tokenize.treebank import TreebankWordDetokenizer 8 | detokenizer = TreebankWordDetokenizer() 9 | 10 | 11 | def create_convfinqa(output): 12 | 13 | ####### 14 | # download and unzip https://github.com/czyssrs/ConvFinQA/blob/main/data.zip 15 | ####### 16 | 17 | with open('data/train.json') as f: 18 | data = json.load(f) 19 | 20 | for item in tqdm.tqdm(data): 21 | 22 | ### qa step by step 23 | 24 | table = '| ' + ' | '.join(item['table_ori'][0]) + ' |\n' 25 | if len(item['table_ori'][0]) == len(item['table_ori'][-1]): 26 | table += '| ' + ' | '.join(['---------' for x in item['table_ori'][-1]]) + ' |\n' 27 | for x in item['table_ori'][1:]: 28 | table += '| ' + ' | '.join(x) + ' |\n' 29 | 30 | pre_texts = [] 31 | for pre_text in item['pre_text']: 32 | pre_text = detokenizer.detokenize(pre_text.split(' ')) 33 | pre_texts.append(pre_text) 34 | pre_text = ' '.join(pre_texts) 35 | 36 | post_texts = [] 37 | for post_text in item['post_text']: 38 | post_text = detokenizer.detokenize(post_text.split(' ')) 39 | post_texts.append(post_text) 40 | post_text = ' '.join(post_texts) 41 | 42 | q = item['annotation']['dialogue_break_ori'][0] 43 | a = item['annotation']['turn_program_ori'][0] 44 | 45 | prompt = f': {pre_text}\n{table}{post_text}\nPlease answer the following questions with expressions if necessary.\n{q}\n: {a}' 46 | 47 | for i in range(1, len(item['annotation']['exe_ans_list'])): 48 | q = item['annotation']['dialogue_break_ori'][i] 49 | a = item['annotation']['turn_program_ori'][i] 50 | prompt += f'\n: {q}\n: {a}' 51 | 52 | output.write(json.dumps({'text': prompt}) + '\n') 53 | 54 | ### answer the final result 55 | 56 | q = item['annotation']['dialogue_break_ori'][0] 57 | a = item['annotation']['exe_ans_list'][0] 58 | 59 | prompt = f': {pre_text}\n{table}{post_text}\nPlease compute the result.\n{q}\n: {a}' 60 | 61 | for i in range(1, len(item['annotation']['exe_ans_list'])): 62 | q = item['annotation']['dialogue_break_ori'][i] 63 | a = item['annotation']['exe_ans_list'][i] 64 | prompt += f'\n: {q}\n: {a}' 65 | 66 | output.write(json.dumps({'text': prompt}) + '\n') 67 | 68 | ### answer the expression in one-step. 69 | 70 | i = len(item['annotation']['turn_program_ori']) - 1 71 | q = item['annotation']['dialogue_break_ori'][i] 72 | a = item['annotation']['turn_program_ori'][i] 73 | 74 | prompt = f': {pre_text}\n{table}{post_text}\nPlease answer the following question with expressions if necessary.\n{q}\n: {a}' 75 | 76 | output.write(json.dumps({'text': prompt}) + '\n') 77 | -------------------------------------------------------------------------------- /OIG/src/cuad.py: -------------------------------------------------------------------------------- 1 | #@title CUAD 2 | """ 3 | Copyright 2023, LAION contributors, inclduing Ontocord, LLC 4 | and the other authors of OIG 5 | Licensed to the Apache Software Foundation (ASF) under one 6 | or more contributor license agreements. See the NOTICE file 7 | distributed with this work for additional information 8 | regarding copyright ownership. The ASF licenses this file 9 | to you under the Apache License, Version 2.0 (the 10 | "License"); you may not use this file except in compliance 11 | with the License. 
You may obtain a copy of the License at 12 | http://www.apache.org/licenses/LICENSE-2.0 13 | Unless required by applicable law or agreed to in writing, 14 | software distributed under the License is distributed on an 15 | "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 16 | KIND, either express or implied. See the License for the 17 | specific language governing permissions and limitations 18 | under the License. 19 | """ 20 | 21 | import json, os, random 22 | from ul2_oscar import * 23 | def create_cuad(output): 24 | if not os.path.exists("CUADv1.json"): 25 | os.system("wget https://github.com/TheAtticusProject/cuad/raw/main/data.zip") 26 | os.system("unzip data.zip") 27 | cuad = json.load(open("CUADv1.json")) 28 | #cuad['data'][0]['title'], 29 | for cuad0 in cuad['data']: 30 | dialog_all = "" 31 | for para in cuad0['paragraphs']: 32 | context = para['context'].replace(" ", " ").replace(" ", " ").replace(" ", " ").replace(" ", " ").replace(" ", " ").replace("\n\n", "\n").replace("\n\n", "\n").replace("\n\n", "\n") 33 | context = "\n".join(a for a in context.split("\n") if not a.startswith("Exhibit") and not a.startswith("EXHIBIT") and "Page -" not in a).replace("[***]", "___") 34 | context_arr = context.split("\n") 35 | sec0 = random.randint(1,4) 36 | sec = str(sec0)+". " 37 | next_sec = str(sec0+1)+". " 38 | next_next_sec = str(sec0+2)+". " 39 | sec0_only = [idx for idx, item in enumerate(context_arr) if item.startswith(sec)] 40 | sec_idx = [idx for idx, item in enumerate(context_arr) if item.startswith(sec) or item.startswith(next_sec) or item.startswith(next_next_sec)] 41 | if random.randint(0,1): 42 | dialog = "User: "+random.choice(["Complete the next paragraph of this contract:", "Give me more for this agreement:", "Provide a continuation for this:", "What comes next for this:"])+" "+"\n".join(context_arr[0:min(len(context_arr),5)]) +"\nAssistant: "+"\n".join(context_arr[5:min(len(context_arr),10)]) 43 | dialog_all += "\n"+(dialog) 44 | elif sec0_only: 45 | rng = sec0_only[0] 46 | line = context_arr[rng] 47 | if len(line) > 30: 48 | line = line[:30]+" ... 
" 49 | dialog = "User: "+random.choice([f"Complete 'Section {line}' for this contract:", f"Give me 'Section {line}' for this agreement:", f"Provide a continuation starting with 'section {line}', given this agreement:", f"What is 'sec {line}' for this:"])+" "+"\n".join(context_arr[0:min(len(context_arr),10)]) +f"\nAssistant: {sec}"+"\n".join(context_arr[rng+5:min(len(context_arr),rng+10)]) 50 | dialog_all += "\n"+(dialog) 51 | if len(sec_idx) == 3: 52 | context0 = "\n".join(context_arr[:min(len(context_arr), sec_idx[1])]) 53 | dialog_all += "\n"+(create_ul2_plus_instructions(context0)) 54 | section_before = ("\n".join(context_arr[sec_idx[0]:min(len(context_arr), sec_idx[1])])) 55 | section_after = ("\n".join(context_arr[sec_idx[1]:min(len(context_arr), sec_idx[2])])) 56 | if random.randint(0,1): 57 | dialog = f"User: What would be a contract section that comes after this one:\n{section_before}\nAssistant: {section_after}" 58 | dialog = basic_augment(dialog) 59 | dialog_all += "\n"+(dialog) 60 | else: 61 | dialog = f"User: What would be a contract section that comes before this one:\n{section_after}\nAssistant: {section_before}" 62 | dialog = basic_augment(dialog) 63 | dialog_all += "\n"+(dialog) 64 | i = min(len(context_arr), sec_idx[2]) 65 | else: 66 | i = 0 67 | for rng in range(i, len(context_arr), 10): 68 | if dialog_all == "": 69 | dialog = "User: "+random.choice(["Complete the next paragraph of this contract:", "Give me more for this agreement:", "Provide a continuation for this:", "What comes next for this:"])+" "+"\n".join(context_arr[rng:min(len(context_arr),rng+5)]) +"\nAssistant: "+"\n".join(context_arr[rng+5:min(len(context_arr),rng+10)]) 70 | else: 71 | dialog = "User: "+random.choice(["More from this contract.", "Next lines please.", "Continue.", "More."])+"\nAssistant: "+"\n".join(context_arr[rng:min(len(context_arr),rng+10)]) 72 | dialog_all += "\n"+(dialog) 73 | 74 | for qa in para['qas']: 75 | question = qa['question'].split("Details:")[1].strip(".? ") 76 | if qa['is_impossible'] and not len(qa['answers']): 77 | if question.startswith("Is") or question.startswith("Does") or question.startswith("Can") or question.startswith("Will"): 78 | if random.randint(0,4)==0: 79 | dialog = ("User: "+question+"?\nAssistant: "+random.choice(["No.", "It does not appear so.", "I don't think so."])) 80 | dialog_all += "\n"+(dialog) 81 | else: 82 | if random.randint(0,4)==0: 83 | dialog = ("User: "+question+"?\nAssistant: "+random.choice(["I can't tell from this context.", "I don't know. Can you provide more information?", ])) 84 | dialog_all += "\n"+(dialog) 85 | elif len(qa['answers']) > 1: 86 | if question.startswith("Is") or question.startswith("Does") or question.startswith("Can") or question.startswith("Will"): 87 | yes = "Yes. " 88 | else: 89 | yes = "" 90 | dialog = ("User: "+question+f"?\nAssistant: {yes}\n* " + "\n* ".join(a['text'] for a in qa['answers'])) 91 | dialog_all += "\n"+(dialog) 92 | else: 93 | if question.startswith("Is") or question.startswith("Does") or question.startswith("Can") or question.startswith("Will"): 94 | yes = "Yes. 
" 95 | else: 96 | yes = "" 97 | dialog = ("User: "+question+f"?\nAssistant: {yes}"+qa['answers'][0]['text']) 98 | dialog_all += "\n"+(dialog) 99 | dialog_all = dialog_all.replace(" ", " ").replace(" ", " ").replace(" ", " ").replace(" ", " ").replace(" ", " ").replace("\n\n", "\n").replace("\n\n", "\n").replace("\n\n", "\n").replace("\n\n", "\n").replace("\n\n", "\n").replace("___ ___", "___").replace("___ ___", "___").replace("___ ___", "___") 100 | if dialog_all: 101 | output.write(json.dumps({"text": dialog_all, "metadata": {"source": "cuad"}})+"\n") 102 | -------------------------------------------------------------------------------- /OIG/src/emergency_numbers.py: -------------------------------------------------------------------------------- 1 | #@title Emergency Numbers 2 | """ 3 | From: https://travel.state.gov/content/dam/students-abroad/pdfs/911_ABROAD.pdf 4 | NOTE: 5 | English‐speaking operators may not be available. 6 | This list was created in early 2023 from the above website and may not be current. 7 | """ 8 | 9 | emergency_numbers = [['Country', 'Ambulance', 'Fire', 'Police'], 10 | ['Albania', '17', '18', '19'], 11 | ['Algeria', '21606666', '14', '17'], 12 | ['American_Samoa', '911', '', ''], 13 | ['Andorra', '118', '118', '110'], 14 | ['Angola', '118', '118', '110'], 15 | ['Antigua', '&', 'Barbuda', '999,911'], 16 | ['Argentina', '101', '107', '101'], 17 | ['Armenia', '103', '', ''], 18 | ['Aruba', '911', '', ''], 19 | ['Ascension_Island', '6000', '911', '6666'], 20 | ['Australia', '000_(112_on_cell_phone)', '', ''], 21 | ['Austria', '112,122', '', ''], 22 | ['Azerbaijan_(Baku)', '03', '01', '02'], 23 | ['Bahamas', '911', '', ''], 24 | ['Bahrain', '999', '', ''], 25 | ['Bali', '112', '118', ''], 26 | ['Bangladesh_(Dhaka)', '199', '9 555 555', '866 551‐3'], 27 | ['Barbados', '115,119', '113,119', '112,119'], 28 | ['Belgium', '112_(cell)/101', '', ''], 29 | ['Belarus', '03', '01', '02'], 30 | ['Belize', '911', '', ''], 31 | ['Bermuda', '911', '', ''], 32 | ['Bhutan', '110', '112', '113'], 33 | ['Bolivia', '911', '', ''], 34 | ['Bonaire', '911', '', ''], 35 | ['Bosnia‐Herzegovina', '124', '123', '122'], 36 | ['Botswana', '997,911', '', ''], 37 | ['Brazil', '911', '', ''], 38 | ['Bosnia', '94', '93', '92'], 39 | ['British_Virgin_Islands', '999', '', ''], 40 | ['Brunei', '991', '995', '993'], 41 | ['Bulgaria', '150', '160', '166'], 42 | ['Burma/Myanmar', '999', '', ''], 43 | ['Cambodia', '119', '118', '117'], 44 | ['Canada', '911', '', ''], 45 | ['Canary_Islands', '112', '', ''], 46 | ['Cape_Verde', '130', '131', '132'], 47 | ['Cayman', 'Islands', '911', ''], 48 | ['Chad', '18', '17', ''], 49 | ['Chile', '131', '132', '133'], 50 | ["The_People's_Republic of_China", 51 | '120', 52 | '119', 53 | '110,122_(traffic_accident)'], 54 | ['Colombia', '119', '', ''], 55 | ['Cook_Islands', '998', '996', '999'], 56 | ['Costa_Rica', '911', '', ''], 57 | ["Côte_d'Ivoire", '', '110,111,170', '180'], 58 | ['Croatia', '112', '', ''], 59 | ['Cuba', '26811', '', ''], 60 | ['Curacao', '112', '114', '444444'], 61 | ['Cyprus', '112', '', ''], 62 | ['Czech_Republic', '112,155', '150', '158'], 63 | ['Denmark', '112', '', ''], 64 | ['Djibouti', '351351', '18', '17'], 65 | ['Dominica', '999', '', ''], 66 | ['Dominican_Republic', '911', '', ''], 67 | ['East_Timor', '112', '', ''], 68 | ['Easter_Island', '100‐215', '100‐264', '100‐244'], 69 | ['Ecuador', '131', '101', ''], 70 | ['Egypt', '123', '180', '122'], 71 | ['El_Salvador', '911', '', ''], 72 | ['England', '112,999', '', ''], 73 | ['Estonia', 
'112', '112', '110'], 74 | ['Ethiopia', '92', '93', '91'], 75 | ['Falkland_Islands', '999', '', ''], 76 | ['Fiji', '911', '', ''], 77 | ['Finland', '112', '', ''], 78 | ['France', '112,15', '112,18', '112,17'], 79 | ['French_Guiana', '112,15', '112,18', '112,17'], 80 | ['French_Polynesia', '15', '18', '17'], 81 | ['Gabon', '1300‐1399', '18', '1730'], 82 | ['Gambia,The', '16', '18', '17'], 83 | ['Georgia', '022', '', ''], 84 | ['Germany', '112', '110', ''], 85 | ['Ghana', '776111‐5', '192', '999,171'], 86 | ['Gibraltar', '999', '', ''], 87 | ['Greece', '112,166', '112,199', '112,100'], 88 | ['Grenada', '434', '112', '911'], 89 | ['Guadeloupe', '18', '18', '17'], 90 | ['Guam', '911', '', ''], 91 | ['Guatemala', '123', '123', '110'], 92 | ['Guyana', '999', '', ''], 93 | ['Haiti', '118', '', '114'], 94 | ['Honduras', '195,37 8654', '198', '119'], 95 | ['Hong', 'Kong', '999', ''], 96 | ['Hungary', '112', '', ''], 97 | ['Iceland', '112', '', ''], 98 | ['India', '102', '101', '100,103_(traffic_accident)'], 99 | ['Indonesia', '118', '113', '110'], 100 | ['Iran', '115', '123', '110'], 101 | ['Republic_of_Ireland', '112.', '999', ''], 102 | ['Isle', 'of', 'Man', '999'], 103 | ['Israel', '101', '102', '100'], 104 | ['Italy', '112,118', '112,115', '112,113'], 105 | ['Jamaica', '110', '110', '119'], 106 | ['Japan', '119', '119', '110'], 107 | ['Jordan', '191', '193', '192'], 108 | ['Kazakhstan', '03', '', ''], 109 | ['Kenya', '999', '', ''], 110 | ['Kiribati', '994', '', ''], 111 | ['Kosovo', '94', '', ''], 112 | ['South Korea', '119', '119', '112'], 113 | ['Kuwait', '777', '', ''], 114 | ['Kyrgyzstan', '103', '', ''], 115 | ['Laos', 'local', 'numbers', 'only'], 116 | ['Latvia', '112,03', '112,01', '112,02'], 117 | ['Lebanon', '112', '', ''], 118 | ['Lesotho', '121', '122', '123/124'], 119 | ['Liberia', '911_(cell_phones_only)', '', ''], 120 | ['Libya', '193', '', ''], 121 | ['Liechtenstein', '112', '', ''], 122 | ['Lithuania', '112', '', ''], 123 | ['Luxembourg', '112/113', '', ''], 124 | ['Macau', '999', '', ''], 125 | ['Macedonia', '94', '93', '92'], 126 | ['Malawi', '998', '999', '997'], 127 | ['Malaysia', '999', '994', '999'], 128 | ['Maldives_Republic', '102', '999', '119'], 129 | ['Mali', '15', '17', '18'], 130 | ['Malta', '112', '', ''], 131 | ['Marianas_Island', '911', '', ''], 132 | ['Marshall_Islands', '625 4111', '', '625 8666'], 133 | ['Martinique', '15', '18', '17'], 134 | ['Mauritania', '118', '117', ''], 135 | ['Mauritius', '999', '', ''], 136 | ['México', '065', '068', '060'], 137 | ['Moldova', '903', '901', '902'], 138 | ['Monaco', '112', '', ''], 139 | ['Mongolia', '103', '101', '102'], 140 | ['Montenegro', '94', '', ''], 141 | ['Montserrat', '911', '', '999'], 142 | ['Morocco', '15', '15', '19'], 143 | ['Mozambique', '117', '198', '119'], 144 | ['Namibia', '2032276', '2032270', '1011'], 145 | ['Nepal', '228094', '', '100'], 146 | ['Netherlands', '112', '', ''], 147 | ['Netherlands_Antilles', '112', '', ''], 148 | ['New_Zealand', '111', '', ''], 149 | ['Nicaragua', '128', '115,911', '118'], 150 | ['Nigeria', '199', '', ''], 151 | ['Northern', 'Ireland', '112,999', ''], 152 | ['Norway', '112,110', '', ''], 153 | ['Oman', '999', '', ''], 154 | ['Pakistan', '15', '', ''], 155 | ['Palau', '911', '', ''], 156 | ['Palestine', '101', '101', '100'], 157 | ['Panama', '269‐9778', '103', '104'], 158 | ['Papua_New_Guinea', '110', '000', ''], 159 | ['Paraguay', '00', '', ''], 160 | ['Peru', '011,5114', '', ''], 161 | ['Philippines', '166,117', '', ''], 162 | ['Poland', '112,999', '', ''], 163 | 
['Portugal', '112', '', ''], 164 | ['Puerto', 'Rico', '911', ''], 165 | ['Qatar', '999', '', ''], 166 | ['Réunion', '112,15', '18', '17'], 167 | ['Romania', '112', '', ''], 168 | ['Russia', '112', '', ''], 169 | ['Samoa', '999', '', ''], 170 | ['San_Marino', '113', '116', '112'], 171 | ['Saudi_Arabia', '997', '998', '999'], 172 | ['Scotland', '112,999', '', ''], 173 | ['Serbia', '94', '', ''], 174 | ['Seychelles', '999', '', ''], 175 | ['Sierra_Leone', '999', '019', '999'], 176 | ['Singapore', '995', '995', '999'], 177 | ['Slovak_Republic_(Slovakia)', '155', '150', '158'], 178 | ['Slovenia', '112', '', ''], 179 | ['Solomon_Islands', '911', '', ''], 180 | ['South_Africa', '10177', '10177', '10111'], 181 | ['South_Africa_(Cape_Town)', '107', '', ''], 182 | ['Spain', '112', '', ''], 183 | ['Sri_Lanka', '1', '691095,699935', ''], 184 | ['St._Helena', '911', '', ''], 185 | ['St._Kitts_&_Nevis', '911', '', ''], 186 | ['St._Lucia', '999,911', '', ''], 187 | ['St._Marten', '911,542‐2111', '911,120', '911,542‐2111'], 188 | ['St._Vincent_&_the_Grenadines', '999,911', '', ''], 189 | ['Sweden', '112', '', ''], 190 | ['Switzerland', '144', '118', '117'], 191 | ['Syria', '110', '113', '112'], 192 | ['Tahiti_French_Polynesia', '15', '', ''], 193 | ['Taiwan', '119', '119', '110'], 194 | ['Tajikistan', '03', '', ''], 195 | ['Tanzania', '112,999', '', ''], 196 | ['Thailand', '191', '199', '191'], 197 | ['Togo', '101', '', ''], 198 | ['Tonga', '911', '', ''], 199 | ['Trinidad_&_Tobago', '990', '990', '999'], 200 | ['Tunisia', '190', '198', '197'], 201 | ['Turkey', '101,112', '102', '100'], 202 | ['Turkmenistan', '03', '', ''], 203 | ['Turks_and_Caicos_Islands', '999,911', '', ''], 204 | ['Tuvalu', '911', '', ''], 205 | ['Uganda', '112_(cell_phone),999_(fixed)', '', ''], 206 | ['Ukraine', '03,118', '01', '02'], 207 | ['United_Arab_Emirates', '998,999', '', ''], 208 | ['United_Kingdom', '112,999', '', ''], 209 | ['United', 'States', '911', ''], 210 | ['Uruguay', '999,911', '', ''], 211 | ['US_Virgin_Islands', '911', '', ''], 212 | ['Uzbekistan', '03', '', ''], 213 | ['Vanuatu', '112', '', ''], 214 | ['Vatican_City', '113', '115', '112'], 215 | ['Venezuela', '171', '', ''], 216 | ['Vietnam', '05', '08', '03'], 217 | ['Western_Sahara', '150', '', ''], 218 | ['Western_Samoa', '999', '', ''], 219 | ['Republic_of_Yemen', '191', '191', '194'], 220 | ['Zambia', '999', '', ''], 221 | ['Zimbabwe', '994,999', '993,999', '995,999']] 222 | -------------------------------------------------------------------------------- /OIG/src/essays.py: -------------------------------------------------------------------------------- 1 | #@title essays 2 | """ 3 | Copyright 2023, LAION contributors, inclduing Ontocord, LLC 4 | and the other authors of OIG 5 | Licensed to the Apache Software Foundation (ASF) under one 6 | or more contributor license agreements. See the NOTICE file 7 | distributed with this work for additional information 8 | regarding copyright ownership. The ASF licenses this file 9 | to you under the Apache License, Version 2.0 (the 10 | "License"); you may not use this file except in compliance 11 | with the License. You may obtain a copy of the License at 12 | http://www.apache.org/licenses/LICENSE-2.0 13 | Unless required by applicable law or agreed to in writing, 14 | software distributed under the License is distributed on an 15 | "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 16 | KIND, either express or implied. 
See the License for the 17 | specific language governing permissions and limitations 18 | under the License. 19 | """ 20 | #TODO: add essay fixing of wording, formatting and sentence order by permuting 21 | try: 22 | from datasets import load_dataset 23 | except: 24 | !pip install datasets 25 | from datasets import load_dataset 26 | import random, json 27 | def create_essays(output): 28 | try: 29 | if essays: pass 30 | except: 31 | essays = load_dataset("ChristophSchuhmann/essays-with-instructions") 32 | for idx, data in enumerate(essays['train']): 33 | first_section = "" 34 | instructions_old, summary, text = data['instructions'], data['titles'], data['essays'] 35 | summary = summary.strip() 36 | instructions = summary 37 | instructions = instructions.replace("The author argues that", "Furthermore") 38 | if "essay discusses" in instructions: 39 | instructions = instructions.split("essay discusses",1)[-1] 40 | instructions = instructions.replace("The author of this", "This") 41 | instructions = instructions.strip() 42 | instructions = instructions[0].lower() + instructions[1:] 43 | instructions = instructions.replace(" text ", " essay ") 44 | if instructions.startswith("the essay"): 45 | instructions = "Write "+instructions.replace("the essay", "a essay").replace("essay", "essay that",1) 46 | elif instructions.startswith("this essay"): 47 | instructions = "Write "+instructions.replace("this essay", "a essay").replace("essay", "essay that",1) 48 | elif " paper " in instructions: 49 | instructions = "Write "+instructions.replace("this paper", "a paper").replace("paper", "paper that",1) 50 | elif " proposal " in instructions: 51 | instructions = "Write "+instructions.replace("this proposal", "a proposal")#.replace("proposal", "proposal that",1) 52 | else: 53 | instructions = "Write an essay about "+instructions 54 | instructions = instructions.replace("an essay about the essay", "an essay that").replace("an essay about this essay", "an essay that").\ 55 | replace("the purpose of this","").replace(" ", " ").replace("an essay about essay was", "an essay").\ 56 | replace("that that ", "that ").replace("It argues", "Argue") 57 | text = text.replace("Works Cited", "References") 58 | text = text.replace("Work Cited", "References") 59 | text = text.replace("Reference List", "References") 60 | text = text.replace("List of References", "References") 61 | text = text.replace("Bibliography", "References") 62 | text = text.replace("Reference\n", "References\n") 63 | toc = "" 64 | if 'Table of Contents' in text: 65 | before, text = text.split('Table of Contents',1) 66 | first_section = text.split("2.")[0].replace("1. ","").strip() 67 | before, toc, text = text.split(first_section,2) 68 | toc = "1. 
" + first_section+toc 69 | if toc: 70 | toc = [t for t in [t.strip(" 0123456789\n") for t in toc.split(".") if t.strip()] if t] 71 | ref = "" 72 | if "References" in text: 73 | ref = text.split("References",1)[-1] 74 | if ref: 75 | text = text.replace(ref,"") 76 | text = text[:-len("References")] 77 | if first_section: 78 | text = (first_section+"\n"+text).strip() 79 | text = text.replace("\n\n", "\n").strip() 80 | ref = ref.replace("\n\n", "\n").strip() 81 | if ref: 82 | ref = [r for r in [r.split(",")[1].split(".")[-1].strip(" \n\"\'”") for r in ref.split("\n") if r.strip() and len(r.split(","))>1] if len(r) > 15] 83 | appendix = "" 84 | if "Appendix" in ref: 85 | arr = ref.split("Appendix") 86 | ref = arr[0].strip() 87 | appendix = ("##Appendix" + "##Appendix".join(arr[1:])).strip() 88 | instructions = instructions.strip().split("\n")[0] 89 | if ". " in instructions: 90 | instructions1, rest_instructions = instructions.split(". ",1) 91 | instructions1 = instructions1.replace(" is ", " that is ").replace(" are ", " that are ").replace(" can ", " that can ").replace(" will ", " that will ").replace(" discusses ", " that discusses ") 92 | instructions = instructions1+". " + rest_instructions 93 | else: 94 | instructions = instructions.replace(" have ", " that have ").replace(" has ", " that has ").replace(" is ", " that is ").replace(" are ", " that are ").replace(" can ", " that can ").replace(" will ", " that will ").replace(" discusses ", " that discusses ") 95 | instructions = instructions.replace("that is that it that is", "is that it is").replace("it that is", "it is").replace("that that", "that") 96 | text = text.replace("\n\n", "\n").replace("\n\n", "\n") 97 | if toc: 98 | for t in toc: 99 | text = text.replace(t, '##'+t) 100 | else: 101 | text2 = "" 102 | for t in text.split("\n"): 103 | if t.startswith("Table") or t.startswith("Figure") or t.startswith("Diagram") or t.startswith("Panel"): 104 | text2 += "\n###"+t 105 | continue 106 | elif t.startswith("Conclusion") or t.startswith("Intro") or t.startswith("Discussion"): 107 | text2 += "\n##"+t 108 | continue 109 | elif len(t) < 25 and " " in t and len(t.split()) > 2: 110 | t_arr = t.split() 111 | if t_arr[0][0] == t_arr[0][0].upper() and t_arr[2][0] == t_arr[2][0].upper(): 112 | text2 += "\n##"+t 113 | continue 114 | elif len(t) < 25 and " " in t and (t.split()) == 2: 115 | t_arr = t.split() 116 | if t_arr[0][0] == t_arr[0][0].upper() and t_arr[1][0] == t_arr[1][0].upper(): 117 | text2 += "\n##"+t 118 | continue 119 | elif len(t) < 25 and " " not in t: 120 | t_arr = t.split() 121 | if t_arr[0][0] == t_arr[0][0].upper(): 122 | text2 += "\n##"+t 123 | continue 124 | text2 += "\n"+t 125 | text = text2.strip() 126 | if text[0] != '#': 127 | if "#" not in text or text.split("\n")[1][0] == '#': 128 | text = "#"+ text 129 | else: 130 | text = "##"+ text 131 | summary = summary.replace("\n\n", "\n").replace("\n", ". 
") 132 | do_summary=True 133 | dialog = "" 134 | if summary and random.randint(0,1)==0: 135 | do_summary=False 136 | first_q = instructions.split("?")[0].split(".")[0].replace("an essay", "a essay").replace("essay", "summary").replace("report", "summary").replace("proposal", "summary") 137 | dialog += (f"\nUser: {first_q}.\nAssistant: {summary}.".replace("..",".")) 138 | if not do_summary and ref and random.randint(0,1)==0 and "&" not in ref[0]: 139 | if random.randint(0,1): 140 | dialog += (f"\nUser: What could be a reference for this essay?\nAssistant: {random.choice(ref)}") 141 | elif len(ref) > 1: 142 | dialog += (f"\nUser: What could be a reference for this essay?\nAssistant: {ref[0]}") 143 | for r in ref[1:]: 144 | if random.randint(0,1): 145 | dialog += (f"\nUser: What could be a reference for this essay?\nAssistant: {r}") 146 | elif random.randint(0,1): 147 | dialog += (f"\nUser: Another\nAssistant: {r}") 148 | else: 149 | dialog += (f"\nUser: Another reference?\nAssistant: {r}") 150 | if text: 151 | if random.randint(0,1) and text.count("##") > 1: 152 | start=True 153 | secs = text.split("##") 154 | random.shuffle(secs) 155 | for section in secs: 156 | if "\n" not in section: continue 157 | section, text2 = section.split("\n",1) 158 | if start: 159 | start=False 160 | first_q = instructions.split("?")[0].split(".")[0] 161 | dialog += (f"\nUser: {first_q}. Write the {section} only.\nAssistant: {text2}") 162 | else: 163 | if random.randint(0,1): 164 | dialog += (f"\nUser: Write an {section} section.\nAssistant: {text2}") 165 | elif random.randint(0,1): 166 | dialog += (f"\nUser: How about a {section} section.\nAssistant: {text2}") 167 | else: 168 | dialog += (f"\nUser: {section}\nAssistant: {text2}") 169 | if appendix and random.randint(0,1)==0 : 170 | dialog += (f"\nUser: What are possible appendices for this essay?\nAssistant: {appendix}") 171 | dialog += (f"\nUser: "+random.choice(["Give me the whole essay based on the above", "Now the whole article.", "Put it all togegther in the right order.", "The complete essay please."])+f".\nAssistant: {text}\n{appendix}") 172 | appendix = "" 173 | else: 174 | dialog += (f"\nUser: "+random.choice(["Give me the whole essay based on the above", "Now the whole article.", "Put it all togegther in the right order.", "The complete essay please."])+f".\nAssistant: {text}") 175 | elif not do_summary: 176 | if random.randint(0,1)==0: 177 | first_q = instructions.split("?")[0].split(".")[0] 178 | dialog += (f"\nUser: {first_q}. 
Exapnd on the summary above.\nAssistant: {text}") 179 | else: 180 | dialog += (f"\nUser: Write an essay based on the summary above.\nAssistant: {text}") 181 | else: 182 | if random.randint(0,1)==0: 183 | first_q = instructions.split("?")[0].split(".")[0] 184 | dialog += (f"\nUser: {first_q}.\nAssistant: {text}") 185 | elif random.randint(0,1)==0 and "Summary" not in text: 186 | do_summary = False 187 | first_q = instructions.split("?")[0].split(".")[0] 188 | dialog += (f"\nUser: {first_q}\nAssistant: ##Executive Summary\n{summary}\n{text}") 189 | else: 190 | do_summary = False 191 | if random.randint(0,1): 192 | dialog += (f"\nUser: {instructions}\nAssistant: {text}") 193 | else: 194 | dialog += (f"\nUser: Write an essay for this summary: {summary}\nAssistant: {text}") 195 | if "Summary" in text: 196 | do_summary = False 197 | if appendix and random.randint(0,1)==0 : 198 | dialog += (f"\nUser: What are possible appendices for this essay?\nAssistant: {appendix}") 199 | if do_summary and random.randint(0,1)==0: 200 | dialog += (f"\nUser: Write a summary for this essay.\nAssistant: {summary}.".replace("..",".")) 201 | if do_summary and ref and random.randint(0,1)==0 and "&" not in ref[0]: 202 | if random.randint(0,1): 203 | dialog += (f"\nUser: What could be a reference for this essay?\nAssistant: {random.choice(ref)}") 204 | elif len(ref) > 1: 205 | dialog += (f"\nUser: What could be a reference for this essay?\nAssistant: {ref[0]}") 206 | for r in ref[1:]: 207 | if random.randint(0,1): 208 | dialog += (f"\nUser: What could be a reference for this essay?\nAssistant: {r}") 209 | elif random.randint(0,1): 210 | dialog += (f"\nUser: Another\nAssistant: {r}") 211 | else: 212 | dialog += (f"\nUser: Another reference?\nAssistant: {r}") 213 | dialog = dialog.strip() 214 | choice = random.randint(0,2) 215 | if choice == 0: 216 | dialog = dialog.replace("Write the", "Give me") 217 | elif choice == 1: 218 | dialog = dialog.replace("Write the", "Provide an") 219 | elif choice == 2: 220 | dialog = dialog.replace("Write the", "") 221 | choice = random.randint(0,2) 222 | if choice == 0: 223 | dialog = dialog.replace("Write an", "Give me") 224 | elif choice == 1: 225 | dialog = dialog.replace("Write an", "Provide an") 226 | elif choice == 2: 227 | dialog = dialog.replace("Write an", "") 228 | choice = random.randint(0,2) 229 | if choice == 0: 230 | dialog = dialog.replace("about the", "regarding") 231 | elif choice == 1: 232 | dialog = dialog.replace("about the", "relating to") 233 | elif choice == 2: 234 | dialog = dialog.replace("about the", "for") 235 | dialog = dialog.replace("..", ".").replace("\n\n", "\n") 236 | dialog = basic_augment(dialog) 237 | output.write(json.dumps({'text':dialog, "metadata": {'source': 'essays'}})+"\n") 238 | 239 | with open("essays.jsonl", "w") as output: 240 | create_essays(output) 241 | !cp essays.jsonl /content/drive/Shareddrives/LAION/OIG 242 | -------------------------------------------------------------------------------- /OIG/src/flan.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset 2 | import csv 3 | import json 4 | import random 5 | import tqdm 6 | 7 | def create_flan(output): 8 | 9 | data = load_dataset('Muennighoff/flan') 10 | 11 | for item in tqdm.tqdm(data['train']): 12 | q = item['inputs'] 13 | a = item['targets'] 14 | prompt = f": {q}\n: {a}" 15 | output.write(json.dumps({'text': prompt}) + '\n') 16 | -------------------------------------------------------------------------------- 
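# --- usage sketch (annotation, not part of the original repository) ---
# create_flan above, like most create_* converters in OIG/src, takes an open,
# writable file handle and writes one JSON object per line with at least a
# 'text' field. A minimal driver might look like the commented sketch below;
# the output filename is only a guess based on the dataset names listed in the
# 40M README (unified_flan.jsonl), the repository itself does not pin it down.
#
#     with open("unified_flan.jsonl", "w", encoding="utf-8") as output:
#         create_flan(output)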
/OIG/src/flanv2_cot.py: -------------------------------------------------------------------------------- 1 | #@title flanv2_cot 2 | """ 3 | Copyright 2023, LAION contributors, inclduing Ontocord, LLC 4 | and the other authors of OIG 5 | Licensed to the Apache Software Foundation (ASF) under one 6 | or more contributor license agreements. See the NOTICE file 7 | distributed with this work for additional information 8 | regarding copyright ownership. The ASF licenses this file 9 | to you under the Apache License, Version 2.0 (the 10 | "License"); you may not use this file except in compliance 11 | with the License. You may obtain a copy of the License at 12 | http://www.apache.org/licenses/LICENSE-2.0 13 | Unless required by applicable law or agreed to in writing, 14 | software distributed under the License is distributed on an 15 | "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 16 | KIND, either express or implied. See the License for the 17 | specific language governing permissions and limitations 18 | under the License. 19 | """ 20 | import glob 21 | import locale 22 | def getpreferredencoding(do_setlocale = True): 23 | return "UTF-8" 24 | locale.getpreferredencoding = getpreferredencoding 25 | import random, os 26 | import spacy 27 | import json 28 | from basic_augment import * 29 | try: 30 | nlp = spacy.load("en_core_web_sm") 31 | except: 32 | os.system("python -m spacy download en_core_web_sm") 33 | nlp = spacy.load("en_core_web_sm") 34 | if not os.path.exists("FLAN"): 35 | os.system("git clone https://github.com/google-research/FLAN") 36 | 37 | def flanv2_cot(output): 38 | for file in glob.glob("./FLAN/flan/v2/cot_data/*.tsv"): 39 | dataset_name = file.split("/")[-1].split(".")[0] 40 | with open(file, "rb") as in_data: 41 | for l in in_data: 42 | l = l.decode() 43 | l = l.strip() 44 | question, final_answer, answer = l.split("\t") 45 | question = question.replace("\\n", "\n") 46 | if "Premise:" in question and random.randint(0,1)==0: 47 | question = question.replace("Premise:", random.choice(["", "You see the following:", "Given this that", "Assume:"])) 48 | if "Hypothesis:" in question and random.randint(0,1)==0: 49 | question = question.replace("Hypothesis:", random.choice(["=>", "We infer:", "We generalize", "A potential conclusion:"])) 50 | 51 | question2 = question.split("?")[0].split(".")[-1].strip(" ?") 52 | 53 | if "following" in question2 or "sentence" in question2 or "one is it" in question2: 54 | question2 = "" 55 | elif question2 and question2.split()[0].lower() in {"who", "what", "when", "where", "how", "which"}: 56 | orig_q2 = question2 57 | if "How much" in question2 or "How many" in question2 or "How might" in question2: 58 | question2 = "" 59 | else: 60 | start2 = orig_q2.split()[0].lower() 61 | question2 = " ".join(question2.split()[1:]) 62 | if question2.startswith("might"): 63 | question2= question2.replace("might ", "is what ",1) 64 | elif question2.startswith("does"): 65 | question2 = question2.replace("does ", "is "+ start2+ " ") 66 | elif question2.startswith("do"): 67 | question2 = question2.replace("do ", "is "+ start2+ " ") 68 | #print ('**', question2) 69 | elif question2 and question2.split()[0].lower() in { "is", "can", "do",}: 70 | 71 | doc = nlp(question2) 72 | np = [a.text for a in doc.noun_chunks if len(a.text) > 3] 73 | if not np or question2.index(np[0]) > 5: 74 | question2 = "" 75 | else: 76 | np = np[0] 77 | start2 = orig_q2.split()[0].lower() 78 | start = question2.split()[0] 79 | if final_answer in {"No", "no"}: 80 | question2 = np + " " 
+ start.lower() + " not " + question2.split(np,1)[1] 81 | else: 82 | question2 = np + " " + start.lower() + " " + question2.split(np,1)[1] 83 | answer = answer.replace("\\n", "\n").replace("..", ".").replace(" .", ".") 84 | question = question.replace(" .", ".").strip() 85 | question = question[0].upper()+question[1:] 86 | answer = answer[0].upper()+answer[1:] 87 | final_answer = final_answer.replace(" .", ".") 88 | if final_answer: 89 | final_answer = final_answer[0].upper()+final_answer[1:] 90 | if "\n" in answer and answer.count("\n") > 1 and random.randint(0,1) == 0: 91 | text = f"User: {question}\nAssistant: {final_answer} {question2}.\nUser: "+random.choice(["Please also explain your answer.", "And why?", "Take me through your reasoning.", "Explain.", "Can you tell me your reasoning?"])+f"\nAssistant: {answer}" 92 | text = text.replace("..", ".").replace("?.", "?") 93 | text = basic_augment(text) 94 | output.write(json.dumps({'text': text, "metadata": {'source': 'flanv2_cot_'+dataset_name}})+"\n") 95 | elif "\n" in question: 96 | if "\n" in answer: 97 | answer = answer.strip() 98 | answer = answer[0].lower() + answer[1:] 99 | text = f"User: {question}\n"+random.choice(["Can you also walk me through your reasoning", "Plus step-by-step reasons.", "Let's solve this step by step."])+f"\nAssistant: {final_answer} because {answer}.\nThus {final_answer} {question2}." 100 | text = text.replace("..", ".").replace("?.", "?") 101 | text = basic_augment(text) 102 | output.write(json.dumps({'text': text, "metadata": {'source': 'flanv2_cot_'+dataset_name}})+"\n") 103 | else: 104 | answer = answer.strip() 105 | answer = answer[0].lower() + answer[1:] 106 | text = f"User: {question}\n"+random.choice(["", "Plus step-by-step reasons.", "And why?", "Let's solve this step by step."])+f"\nAssistant: {final_answer} because {answer}." 107 | text = text.replace("..", ".").replace("?.", "?") 108 | text = basic_augment(text) 109 | output.write(json.dumps({'text': text, "metadata": {'source': 'flanv2_cot_'+dataset_name}})+"\n") 110 | else: 111 | answer = answer.strip() 112 | answer = answer[0].lower() + answer[1:] 113 | text = f"User: {question} "+random.choice(["", "Please also explain your answer.", "And why?", "Take me through your reasoning.", "Explain."])+f"\nAssistant: {final_answer} because {answer}." 
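# (annotation, not in the original file) The three branches above build the dialog in slightly different shapes: multi-line rationales can become a follow-up "explain your reasoning" turn, multi-line questions fold the rationale into a "final_answer because ..." reply, and this last branch handles single-line questions the same way. The replace/basic_augment calls that follow clean up doubled punctuation and apply light augmentation before the example is written out.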
114 | text = text.replace("..", ".").replace("?.", "?") 115 | text = basic_augment(text) 116 | output.write(json.dumps({'text':text, "metadata": {'source': 'flanv2_cot_'+dataset_name}})+"\n") 117 | -------------------------------------------------------------------------------- /OIG/src/generate_songs.py: -------------------------------------------------------------------------------- 1 | from google.colab import drive 2 | drive.mount('/content/drive') 3 | try: 4 | from transformers import AutoTokenizer, AutoModelForCausalLM 5 | except: 6 | !pip install transformers accelerate 7 | !pip install datasets 8 | import pandas 9 | import torch 10 | from transformers import AutoTokenizer, AutoModelForCausalLM 11 | 12 | try: 13 | if model is None: assert False 14 | except: 15 | model = AutoModelForCausalLM.from_pretrained("Rallio67/joi2_7Be_instruct_alpha",).half().eval().cuda() 16 | tokenizer = AutoTokenizer.from_pretrained("Rallio67/joi2_7Be_instruct_alpha", padding_side='left') 17 | tokenizer.pad_token = tokenizer.eos_token 18 | 19 | def generate_ext(para, model, tokenizer, return_answer_only=True, do_self_contrastive=True, max_length=128, min_length=1, max_return_sequences=1, ret=None, do_sample=True, do_beam=False, device="cuda", target_lang=None): 20 | if type(para) is str: 21 | para = [para] 22 | para = [p.strip() for p in para] 23 | input_ids = tokenizer(para, return_tensors='pt',padding=True ) 24 | input_ids = input_ids.to(device) 25 | if ret is None: ret = {} 26 | with torch.no_grad(): 27 | if do_sample: 28 | # Here we use top_k / top_k random sampling. It generates more diverse queries, but of lower quality 29 | outputs = model.generate( 30 | **input_ids, 31 | max_length=max_length, 32 | repetition_penalty=1.05, 33 | min_length=min_length, 34 | no_repeat_ngram_size=4, 35 | do_sample=True, 36 | top_p=0.95, 37 | penalty_alpha=0.6 if do_self_contrastive else None, 38 | top_k=10, 39 | num_return_sequences=max(1, int(max_return_sequences/2)) if do_beam else max_return_sequences 40 | ) 41 | 42 | for i in range(len(outputs)): # can use batch_decode, unless we want to do something special here 43 | query = tokenizer.decode(outputs[i], skip_special_tokens=True) 44 | if return_answer_only: 45 | query = query[len(para[i]):].lstrip(".? \n\t") 46 | ret[query] = 1 47 | 48 | if do_beam: 49 | 50 | # Here we use Beam-search. It generates better quality queries, but with less diversity 51 | outputs = model.generate( 52 | **input_ids, 53 | max_length=max_length, 54 | num_beams=max(int(max_return_sequences/2) if do_sample else max_return_sequences,5), 55 | repetition_penalty=1.05, 56 | min_length=min_length, 57 | no_repeat_ngram_size=4, 58 | penalty_alpha=0.6 if do_self_contrastive else None, 59 | num_return_sequences=max(1, int(max_return_sequences/2)) if do_sample else max_return_sequences, 60 | early_stopping=True 61 | ) 62 | 63 | 64 | for i in range(len(outputs)): # can use batch_decode, unless we want to do something special here 65 | query = tokenizer.decode(outputs[i], skip_special_tokens=True) 66 | if return_answer_only: 67 | query = query[len(para[i]):].lstrip(".? 
\n\t") 68 | ret[query] = 1 69 | 70 | return list(ret.keys()) 71 | 72 | try: 73 | if artist is None: assert False 74 | except: 75 | artist = pandas.read_csv("/content/artists-data.csv") 76 | artist2genre = dict([a, b] for a, b in zip(artist['Link'],artist['Genres'])) 77 | 78 | lyric = pandas.read_csv("/content/lyrics-data.csv") 79 | import json 80 | with open("/content/drive/Shareddrives/LAION/synth_lyrics.jsonl", "w") as output: 81 | batch = [] 82 | for aLink, lyric2, lang in zip(lyric['ALink'], lyric['Lyric'], lyric['language']): 83 | if lang == 'en': 84 | genre= artist2genre[aLink] 85 | artist = aLink.replace("-", " ").strip("\/") 86 | genre = genre.split(";")[0] 87 | lyric2 = lyric2.split(" ")[:10] 88 | if lyric2[-1].lower().strip("\n") in {"a", "the", "this", "that", "those", "these"}: 89 | lyric2 = lyric2[:-1] 90 | lyric2 = " ".join(lyric2).replace("\n", " / ") 91 | instr= f"Write me a song in the {genre} genre in the style of {artist} with the themes of '{lyric2}'" 92 | if len(batch) > 15: 93 | song = generate_ext(batch, model, tokenizer) 94 | song = [s.split("#")[0].replace("/", "\n").replace("...", "\n").replace("1.", "\n").replace("2.", "\n").replace("3.", "\n").replace("4.", "\n").replace("5.", "\n").replace("6.", "\n").replace("7.", "\n").replace("8.", "\n").replace("9.", "\n").replace(".", ".\n").replace("\n\n", "\n") for s in song] 95 | for instr, s in zip(batch, song): 96 | if "1" not in s and "2" not in s and \ 97 | "3" not in s and "4" not in s and \ 98 | "5" not in s and "6" not in s and \ 99 | "7" not in s and "7" not in s and \ 100 | "9" not in s and "10" not in s: 101 | output.write (json.dumps({'text':instr + s, 'metadata': {'source': 'synth_music'}})+"\n") 102 | batch = [] 103 | batch.append(f"User: {instr}\n\nJoi2: lyrics:\n") 104 | 105 | import json, random 106 | i = 0 107 | 108 | with open("synth_music_reject.jsonl", "w") as reject: 109 | with open("synth_music_clean.jsonl", "w") as out: 110 | with open("synth_lyrics.jsonl") as input: 111 | for l in input: 112 | data = json.loads(l.strip()) 113 | instruction, response = data['text'].split("\n\nJoi2: lyrics:\n", 1) 114 | response = response.lower() 115 | if " kill " in response or " rape " in response or "crack music" in response or "fuck" in response or "cocaine" in response or " perc " in response or " codine " in response or "below is some" in response or 'verses' in response or ' verse ' in response or ' cock ' in response or 'suck my' in response or 'pussy' in response or 'dialogue' in response or 'first line' in response or 'song' in response or 'lyric' in response: 116 | reject.write(l) 117 | continue 118 | instruction, phrase = instruction.split("with the themes of",1) 119 | instruction = instruction.split(" in the style of")[0] 120 | phrase = [a.strip(" '.?").split(" (")[0].lower() for a in phrase.strip("', ").split("/") if len(a)> 10] 121 | for a in phrase: 122 | response = response.replace(a, '') 123 | for b in a.split(", "): 124 | response = response.replace(b, '') 125 | response = response.replace(", ", "\n") 126 | response = response.replace("nig-ga", "man") 127 | if random.randint(0,1): 128 | response = response.replace("nigga", "man") 129 | 130 | elif random.randint(0,1): 131 | response = response.replace("nigga", "woman") 132 | elif random.randint(0,1): 133 | response = response.replace("nigga", "girl") 134 | elif random.randint(0,1): 135 | response = response.replace("nigga", "guy") 136 | response = response.replace("bitches", "girls") 137 | response = response.replace("bitch", "girl") 138 | 
response = response.replace("good head", "love") 139 | response = response.replace("dick wet", "heart break") 140 | response = response.replace("nigg", "man") 141 | response = response.replace(" hoe ", " girl ") 142 | response = response.replace(" ho ", " girl ") 143 | response = "\n".join([(r.strip()[0].upper() + r.strip()[1:]).strip(",.") for r in response.split("\n") if len(r.strip()) > 5]) 144 | response = response.replace(" i ", " I ").replace(" i'", " I'").replace(" ", " ") 145 | if len(response) < 60: 146 | reject.write(l) 147 | continue 148 | if response.count("\n") < 3: 149 | reject.write(l) 150 | continue 151 | out.write (json.dumps({'text': instruction.strip()+'.\nAssistant: '+response, 'metadata': data['metadata']})+"\n") 152 | 153 | -------------------------------------------------------------------------------- /OIG/src/image_prompting_instructions.py: -------------------------------------------------------------------------------- 1 | #@title image prompting instructions 2 | """ 3 | Copyright 2023, LAION contributors, inclduing Ontocord, LLC 4 | and the other authors of OIG 5 | 6 | Licensed to the Apache Software Foundation (ASF) under one 7 | or more contributor license agreements. See the NOTICE file 8 | distributed with this work for additional information 9 | regarding copyright ownership. The ASF licenses this file 10 | to you under the Apache License, Version 2.0 (the 11 | "License"); you may not use this file except in compliance 12 | with the License. You may obtain a copy of the License at 13 | 14 | http://www.apache.org/licenses/LICENSE-2.0 15 | 16 | Unless required by applicable law or agreed to in writing, 17 | software distributed under the License is distributed on an 18 | "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 19 | KIND, either express or implied. See the License for the 20 | specific language governing permissions and limitations 21 | under the License. 
22 | """ 23 | import gzip 24 | from collections import Counter 25 | import os 26 | try: 27 | from nltk.corpus import stopwords as nltk_stopwords 28 | nltk_stopwords.words('english') 29 | except: 30 | import nltk 31 | nltk.download('stopwords') 32 | import random 33 | 34 | stopwords_set = set(nltk_stopwords.words('english') + ['...', 'there', 'could', 'should', 'shall', 'can', 'might', 'may', 'include', 'including']) 35 | #TODO: use the safety stuff from LAION-AI/riverbed 36 | 37 | hate_words_set = {'niggas', 'fuck', 'wetback', 'blame', 'chinks', 'shut', 'niggers', 'ugly', 'blacks', 'lame', 'sand', 'butt', 'dumb', 'dyke', 'rape', 'whites', 'dykes', 'bitch', 'akbar', 'homo', 'monkey', 'nigger', 'fags', 'coon', 'hate', 'spic', 'raped', 'allah', 'wetbacks', 'trailer', 'queer', 'chucker', 'inbred', 'colored', 'killed', 'jungle', 'shit', 'fucking', 'nigga', 'savages', 'dirty', 'eyed', 'shorty', 'beat', 'kill', 'queers', 'stupid', 'chink', 'slave', 'cunt', 'fuckin', 'faggot', 'faggots', 'trash'} 38 | flagged_words_set ={'tit', 'coprolagnia', 'skeet', 'swinger', 'zoophilia', 'bunghole', 'voyeurweb', 'prick', 'pissing', 'nympho', 'felching', 'lolita', 'pikey', 'squirting', 'hentai', 'urophilia', 'doggiestyle', 'goatcx', 'cumslut', 'pornstars', 'abortion', 'goddamn', 'spac', 'jailbait', 'ejaculate', 'fucked', 'sexual', 'bitching', 'asshole', 'butt', 'cumshots', 'pisspig', 'blumpkin', 'grope', 'cunt', 'twinkie', 'fagging', 'strappado', 'bollocks', 'deepthroating', 'lust', 'shits', 'beastiality', 'clitoris', 'tits', 'tushy', 'nigga', 'fanny', 'fagots', 'kike', 'bastardo', 'knobbing', 'acrotomophilia', 'femdom', 'sexually', 'bareback', 'camslut', 'pornhub', 'cipa', 'dominatrix', 'cocksucking', 'shitting', 'snowballing', 'figging', 'pecker', 'neonazi', 'lovemaking', 'dink', 'yiffy', 'bitch', 'masturbating', 'sexo', 'raghead', 'swastika', 'suck', 'topless', 'ballbag', 'homoerotic', 'orgasim', 'tranny', 'damn', 'fucks', 'asses', 'scrotum', 'octopussy', 'goodpoop', 'fucker', 'whore', 'sluts', 'anal', 'youporn', 'voyuer', 'pubes', 'paedophile', 'jism', 'vorarephilia', 'cuckold', 'fingerbang', 'shitted', 'titty', 'bullshit', 'hardcore', 'bimbos', 'sexuality', 'cumshot', 'handjob', 'xhamster', 'twink', 'piss', 'pornography', 'orgy', 'dildos', 'dildo', 'assholes', 'fuckers', 'feltch', 'scat', 'rectum', 'kinky', 'dogging', 'panty', 'motherfucker', 'panties', 'negro', 'nudity', 'eunuch', 'jizz', 'jiggerboo', 'babeland', 'fucktards', 'slut', 'rimjob', 'genitals', 'domination', 'shrimping', 'jerk-off', 'nsfw', 'carpetmuncher', 'brazzers', 'bum', 'fingering', 'livesex', 'anus', 'bellend', 'erotism', 'deepthroat', 'vulva', 'throating', 'titt', 'arsehole', 'coprophilia', 'fecal', 'shitblimp', 'labia', 'spooge', 'bangbros', 'shitty', 'voyeur', 'snatch', 'omorashi', 'nambla', 'ass', 'boobs', 'darkie', 'cums', 'ejaculated', 'screwing', 'cumming', 'muffdiving', 'camwhore', 'bbw', 'mong', 'milf', 'pisser', 'autoerotic', 'redtube', 'sexcam', 'towelhead', '2g1c', 'upskirt', 'dendrophilia', 'shibari', 'twat', 'lusting', 'sadist', 'ejaculating', 'pornstar', 'masturbate', 'schlong', 'god-damned', 'nawashi', 'hooker', 'pornos', 'kinkster', 'creampie', 'dingleberry', 'kock', 'apeshit', 'horney', 'skank', 'spastic', 'horny', 'slutty', 'shagging', 'nipples', 'pussies', 'anilingus', 'turd', 'poopchute', 'boob', 'daterape', 'punany', '🖕', 'frotting', 'intercourse', 'santorum', 'hore', 'bestial', 'bastard', 'ejaculation', 'shite', 'buttcheeks', 'quim', 'masturbation', 'dick', 'fisting', 'nazi', 'undressing', 'rape', 'yaoi', 'fuckin', 
'viagra', 'poontang', 'arse', 'bulldyke', 'bdsm', 'coon', 'cock-sucker', 'felch', 'fudgepacker', 'clit', 'shemale', 'horniest', 'jigaboo', 'breasts', 'kinbaku', 'pthc', 'spic', 'circlejerk', 'xx', 'retard', 'rimming', 'dingleberries', 'bangbus', 'masterbating', 'paki', 'hell', 'worldsex', 'doggystyle', 'coons', 'playboy', 'g-spot', 'tosser', 'dolcett', 'blowjobs', 'poon', 'dyke', 'masochist', 'pissed', 'slanteye', 'cunnilingus', 'sex', 'crap', 'cum', 'cialis', 'beaners', 'humping', 'incest', 'fuck', 'chink', 'fucking', 'bukkake', 'vibrator', 'dommes', 'wank', 'buceta', 'erotic', 'poop', 'cornhole', 'xnxx', 'cunillingus', 'penis', 'threesome', 'rapist', 'titties', 'shag', 'bloody', 'fag', 'xxx', 'tribadism', 'busty', 'fellatio', 'xvideos', 'pube', 'nigger', 'queaf', 'clusterfuck', 'bondage', 'sodomize', 'tubgirl', 'strapon', 'pricks', 'futanari', 'sodomy', 'queef', 'flange', 'vagina', 'cocks', 'niggers', 'fagot', 'camgirl', 'wang', 'porno', 'nymphomania', 'bitches', 'son-of-a-bitch', 'raping', 'bastinado', 'semen', 'dog-fucker', 'pedophile', 'guro', 'shit', 'orgasms', 'pisses', 'footjob', 'testicle', 'bestiality', 'gokkun', 'honkey', 'ponyplay', 'boner', 'wetback', 's&m', 'scissoring', 'nipple', 'orgasm', 'fuckings', 'cougar', 'jiggaboo', 'thumbzilla', 'nude', 'ejaculates', 'assmunch', 'pissoff', 'ass-fucker', 'shota', 'goregasm', 'ecchi', 'smegma', 'splooge', 'sadism', 'dvda', 'beaner', 'balls', 'butthole', 'smut', 'nimphomania', 'poof', 'juggs', 'bollok', 'goatse', 'porn', 'faggot', 'nutten', 'sucks', 'orgies', 'escort', 'birdlock', 'duche', 'spunk', 'gangbang', 'barenaked', 'blowjob', 'pedobear', 'pussy', 'pegging', 'sexy', 'cock'} 39 | csam_set1 = {'lolhentai','nymphet', 'nimphet', 'babyj', 'voglia', 'eurololita', 'lolli', 'lola', 'lolita', 'lolly', 'loli', 'lolitaguy', \ 40 | "pedo", 'hussyfan', 'kidzilla', 'raygold', 'ygold', 'mylola', \ 41 | 'babyshivid', 'kidzilla', 'kdquality', 'cbaby', 'kinderficker', 'preteen', } 42 | csam_set2 = {'little', 'girl', 'boy', 'child', 'kid', 'baby', 'sissy', 'kiddie', 'toddler', \ 43 | 'bath', 'baths', 'bathing', 'qwerty', 'qqaazz', 'ptsc', 'izzy', 'rika', \ 44 | 'pthc', 'tanta','newstar', 'playtoy', 'imouto', 'lourinha', 'amateurz', 'arina', 'shiori', 'chiharu', 'nablot', 45 | } 46 | 47 | def near_dup_key_fn(sent_arr): 48 | sent_arr = [s if len(s) <4 else s[:4] for s in sent_arr if len(s) > 3 and s not in stopwords_set] 49 | if not sent_arr: return -1 50 | if len(sent_arr) > 5: sent_arr = sent_arr[:5] 51 | return hash("".join(sent_arr)) 52 | 53 | 54 | def detect_non_safe(text, url="", do_augment=False): 55 | orig_text = text 56 | text = text.lower() 57 | if "year old" in text: text = text.replace("year old", "yo") 58 | if "years old" in text: text = text.replace("years old", "yo") 59 | text = text.replace("lame pun coon","") 60 | text_arr = text.split() 61 | found_csam = 0 62 | url_array = [a for a in url.lower().replace("https://", "").replace("http://", "").replace("/", " ").replace("-", " ").replace("_", " ").replace(".", " ").split()] 63 | for word in list(set(text_arr+url_array)): 64 | if word in csam_set1: 65 | found_csam += 2 66 | elif word in csam_set2: 67 | found_csam += 1 68 | elif ("yr" in word or "yo" in word): 69 | try: 70 | age = int(word.strip("yor-")) 71 | if age < 18 and ("yo" in word or ("yr" in word and "old" in text)): 72 | found_csam += 2 73 | except: 74 | pass 75 | 76 | is_porn_url = ("porn" in url or ".xxx/" in url or "xxx." 
in url or "xnxx" in url or "xhamster" in url or "redtube" in url or "xvideos" in url or "hclips" in url or "tnaflix" in url or "tube8" in url or "spankbang" in url or "4chan" in url or "tblop" in url or "thumbzilla" in url or "hentai" in url) 77 | if ((found_csam > 0 and (' rape ' in text or ' raped' in text or ' rapes ' in text or 'bondage' in text)) or 78 | (found_csam > 1 and ('beaten' in text or 'murder' in text or ' kill' in text or 'brutual' in text)) or 79 | (found_csam > 2 and (is_porn_url or "nude" in text or "naked" in text or "breast" in text or "penis" in text or "suck" in text or "masturbation" in text or "handjob" in text or " cunt " in text or "dildo" in text or " cum " in text \ 80 | or "boobs" in text or "blowjob" in text or " anal " in text or "fuck" in text or " cock " in text or "pussy" in text or "tits" in text))): 81 | if "field of rape" in text or "rape field" in text or "rape flower" in text or "wikimedia" in url or "wikipedia" in url or "dailymail" in url or "msn" in url or "times" in url or "youtube" in url or "cnn" in url or "bbc" in url or "abc" in url or "nbc" in url or "news" in url or ".gov" in url: 82 | if "field" not in text and "flower" not in text and "news" not in text and "report" not in text and "defendant" not in text and "police" not in text and "claim" not in text and "allegation" not in text: 83 | if do_augment: 84 | if random.random() < 0.5: 85 | text = (text +", it is reported").replace(" ", " ") 86 | elif random.random() < 0.5: 87 | text = ("in the news: " + text).replace(" ", " ") 88 | elif random.random() < 0.5: 89 | text = ("reportedly " + text).replace(" ", " ") 90 | return text, "harmless" 91 | return orig_text,"csam" 92 | 93 | if (is_porn_url or \ 94 | (found_csam > 1 and (is_porn_url or "nude" in text or "naked" in text or "breast" in text or "penis" in text or "suck" in text or "masturbation" in text or "handjob" in text or " cunt " in text or "dildo" in text or " cum " in text \ 95 | or "boobs" in text or "blowjob" in text or " anal " in text or "fuck" in text or " cock " in text or "pussy" in text or "tits" in text)) or\ 96 | (" incest " in text or "beastie" in text or "efuck" in text or "gfuck" in text or "masturbation" in text or "handjob" in text or " cunt " in text or "dildo" in text or " milf " in text or "bbw" in text or \ 97 | " cum " in text or "boobs" in text or "blowjob" in text or " anal " in text or "fuck" in text or " cock " in text or "pussy" in text or "porn" in text or "tits" in text)): 98 | if ('piss' in text or ' scat ' in text or 'little' in text or 'teen' in text or "erotic" in text or "virgin" in text or " gay " in text or " lesbian " in text or "snatch" in text or " hairy " in text or \ 99 | " dirty " in text or " lick " in text or "sucks" in text or " dick " in text or "huge" in text or "xxx" in text or "redhead" in text or "movie" in text or " old" in text or\ 100 | " gay " in text or "free" in text or "ass" in text or "sex" in text or "blonde" in text or "japanese" in text or "mature" in text or "black" in text or "asian" in text or "girl" in text or "video" in text or \ 101 | "photo" in text or "scene" in text or "mom" in text or "dad" in text or "sister" in text or "brother" in text or "teacher" in text or "doctor" in text or "office" in text or \ 102 | "young" in text or "busty" in text or "teen" in text or "indian" in text or "granny" in text or "horny" in text or "amateur" in text 103 | ): 104 | if do_augment: 105 | if " girl" in text and random.random() < .5: 106 | text = text.replace(" girl", " 
"+random.choice(['nymphet', 'nimphet', 'babyj', 'baby', 'voglia', 'eurololita', 'lolli', 'lola', 'lolita', 'lolly', 'loli', 'lolitaguy', 'kid', 'sissy', 'preteen', 'kiddie', 'toddler',])+" ").replace(" ", " ").replace(" s ", " ") 107 | return text, "csam" 108 | elif " boy " in text and random.random() < .5: 109 | text = text.replace(" boy ", " "+random.choice([ 'nymphet', 'nimphet', 'babyj', 'baby', 'voglia', 'eurololita', 'lolli', 'lola', 'lolita', 'lolly', 'loli', 'lolitaguy', 'kid', 'sissy', 'preteen', 'kiddie', 'toddler',])+" ").replace(" ", " ").replace(" s ", " ") 110 | return text, "csam" 111 | elif "18yo" in text and random.random() < .5: 112 | text = text.replace("18yo", str(random.randint(1, 17))+"yo").replace(" ", " ").replace(" s ", " ") 113 | return text, "csam" 114 | elif "daugher" in text and random.random() < .5: 115 | text = text.replace("daugher", "baby").replace(" ", " ").replace(" s ", " ") 116 | return text, "csam" 117 | elif " son " in text and random.random() < .5: 118 | text = text.replace(" son ", " baby ").replace(" ", " ").replace(" s ", " ") 119 | return text, "csam" 120 | return orig_text, "porn" 121 | offensive = 0 122 | if is_porn_url: offensive = 1 123 | hate = 0 124 | for word in list(set(text_arr)): 125 | if word in flagged_words_set: 126 | offensive += 1 127 | if word in hate_words_set: 128 | hate += 1 129 | if hate > 2 or (offensive > 0 and hate > 1): 130 | return orig_text, "hate" 131 | elif offensive > 1: 132 | return orig_text, "offensive" 133 | if ".law" in url or ".edu" in url or ".gov" in url or "github" in url or "philpapers" in url or "stackoverflow" in url or "arxiv" in url or "pubmed" in url or "uspto" in url or "free.law" in url or "nih.gov" in url: 134 | return orig_text, "harmless" 135 | elif " memes " not in text and "memes" not in url and "confession bear" not in text: 136 | if ("ology" in url or "techno" in url or "genom" in url or "religion" in url or "philosophy" in url or "romance" in url or "mystery" in url or "scifi" in url or "fantasy" in url or "drama" in url or \ 137 | "library" in url or "justia" in url or "politic" in url or "history" in url or "science" in url or "humor" in url or "poetry" in url or "literature" in url or "math" in url or \ 138 | "medic" in url or "biotech" in url or "invention" in url or "legal" in url or "lexisnexis" in url or "westlaw" in url or \ 139 | "contracts" in url or "subtitles" in url or "github" in url or "wikimedia" in url or "wikipedia" in url or "free.law" in url): 140 | return text, "harmless" 141 | elif ("sports" in url or "dailymail" in url or "msn" in url or "times" in url or "youtube" in url or "cnn" in url or "bbc" in url or "abc" in url or "nbc" in url or "news" in url): 142 | return orig_text, "harmles" 143 | if do_augment: 144 | if " girl" in text and random.random() < .5: 145 | text = text.replace(" girl", " "+random.choice(['baby', 'lolli', 'lola', 'lolita', 'lolly', 'loli', 'kid', 'sissy', 'preteen', 'kiddie', 'toddler',])+" ").replace(" ", " ").replace(" s ", " ") 146 | return text, "harmless" 147 | elif " boy " in text and random.random() < .5: 148 | text = text.replace(" boy ", " "+random.choice(['baby', 'lolli', 'lola', 'lolita', 'lolly', 'loli', 'kid', 'sissy', 'preteen', 'kiddie', 'toddler',])+" ").replace(" ", " ").replace(" s ", " ") 149 | return text, "harmless" 150 | elif "18yo" in text and random.random() < .5: 151 | text = text.replace("18yo", str(random.randint(1, 17))+"yo").replace(" ", " ").replace(" s ", " ") 152 | return text, "harmless" 153 | elif "daugher" in text 
and random.random() < .5: 154 | text = text.replace("daugher", "baby").replace(" ", " ").replace(" s ", " ") 155 | return text, "harmless" 156 | elif " son " in text and random.random() < .5: 157 | text = text.replace(" son ", " baby ").replace(" ", " ").replace(" s ", " ") 158 | return text, "harmless" 159 | return orig_text, "unknown" 160 | 161 | 162 | def create_image_prompts(output): 163 | if not os.path.exists("laion_safe_nonsafe.tsv"): 164 | !cp /content/drive/Shareddrives/ontocord/laion_safe_nonsafe.tsv ./ 165 | laion_img = [a.split("\t")[0] for a in open("laion_safe_nonsafe.tsv").read().split("\n") if len(a.split("\t")[0]) > 100 and "..." not in a and "|" not in a and "foreground" not in a and "background" not in a and "photo" not in a and "picture" not in a and "(" not in a and ")" not in a and ".jpg" not in a and "@" not in a and "&q" not in a and "<" not in a and ">" not in a and "|| hate" not in a and "|| porn" not in a and "csam" not in a and "|| offensive" not in a] 166 | i = 0 167 | for sent in laion_img: 168 | if "http" in sent or "www" in sent or "#" in sent: continue 169 | sent = sent.replace("stockvideo's en b-roll-footage met ", "") 170 | sent, label = detect_non_safe(sent, do_augment=False) 171 | if label not in {"unknown", "unknown"}: continue 172 | i+=1 173 | bad_words = [word.strip("()[].,") for word in sent.split() if word.lower() not in stopwords_set] 174 | bad_words = list(set([word for word in bad_words if word.lower() in hate_words_set or word.lower() in flagged_words_set or word.lower() in csam_set1 or word.lower() in csam_set2])) 175 | if bad_words: continue 176 | if len(sent) > 300: 177 | instruction = ", ".join(list(set([word.strip(":;<>,.?/~`!@#$%^&*()-_+=") for word in sent.split() if len(word.strip("~`!@#$%^&*()-_+=")) > 4 and word.lower() not in stopwords_set]))[:5]).replace(",,",",").replace(", , ", ", ") 178 | dialog = ("User: Give me a sentence with these words: " +instruction+"\nAssistant: " + sent) 179 | else: 180 | instruction = ", ".join([word.strip(":;<>,.?/~`!@#$%^&*()-_+=") for word in sent.split() if len(word.strip("~`!@#$%^&*()-_+=")) > 4 and word.lower() not in stopwords_set][:5]).replace(",,",",").replace(", , ", ", ") 181 | dialog = ("User: Give me an image prompt to draw an image with " +instruction+"\nAssistant: " + sent) 182 | d = dialog 183 | if random.randint(0,1): 184 | d = d.replace("Give me", random.choice(["", "Can you create", "I'm looking for", "How about"])) 185 | if random.randint(0,1): 186 | d = d.replace("image prompt to draw", random.choice(["", "prompt for", "image prompt for", "stablity prompt for"])) 187 | labels = [a.split("[")[1] for a in d.split("Assistant:")[-1].split("]") if "[" in a] 188 | before, after = d.split("Assistant:") 189 | after = after.split("]")[-1] 190 | d = before+"Assistant:"+after 191 | d = d.replace(" ", " ").replace(" ", " ") 192 | if d: 193 | output.write (json.dumps({'text': d, 'metadata': {'source': 'laion_image_prompts'}})+"\n") 194 | 195 | #if i > 100: break 196 | print (i) 197 | 198 | -------------------------------------------------------------------------------- /OIG/src/kojma_cot.py: -------------------------------------------------------------------------------- 1 | #@title kojma_cot 2 | """ 3 | Copyright 2023, LAION contributors, inclduing Ontocord, LLC 4 | and the other authors of OIG 5 | Licensed to the Apache Software Foundation (ASF) under one 6 | or more contributor license agreements. 
See the NOTICE file 7 | distributed with this work for additional information 8 | regarding copyright ownership. The ASF licenses this file 9 | to you under the Apache License, Version 2.0 (the 10 | "License"); you may not use this file except in compliance 11 | with the License. You may obtain a copy of the License at 12 | http://www.apache.org/licenses/LICENSE-2.0 13 | Unless required by applicable law or agreed to in writing, 14 | software distributed under the License is distributed on an 15 | "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 16 | KIND, either express or implied. See the License for the 17 | specific language governing permissions and limitations 18 | under the License. 19 | """ 20 | import glob, random, os, json 21 | def kojma_cot(output): 22 | if not os.path.exists("./zero_shot_cot"): 23 | os.system("git clone https://github.com/kojima-takeshi188/zero_shot_cot") 24 | for file in glob.glob("./zero_shot_cot/log/*_cot.log"): 25 | with open(file, "rb") as input: 26 | prev_q = "" 27 | q = "" 28 | a = "" 29 | for l in input: 30 | l = l.decode().strip() 31 | if l.startswith("pred_before"): 32 | q = q.strip() 33 | if prev_q and q[:10] == prev_q[:10]: 34 | continue 35 | prev_q = q 36 | a = a.strip() 37 | a = a.replace("Let's think step by step.", "").replace(" ", " ").replace("\n\n", "\n").strip() 38 | steps = [""] 39 | for a1 in a.split("\n"): 40 | if a1.startswith("Second") or a1.startswith("Third") or \ 41 | a1.startswith("Fourth") or a1.startswith("Fifth") or \ 42 | a1.startswith("Sixth") or a1.startswith("Seventh") or \ 43 | a1.startswith("Eighth") or a1.startswith("Ninth") or \ 44 | a1.startswith("Tenth") or a1.startswith("Therefore") or \ 45 | a1.startswith("Finally") or a1.startswith("So,") or \ 46 | a1.startswith("But") or a1.startswith("Hence") or a1.startswith("With that said"): 47 | steps.append("") 48 | steps[-1] += "\n" + a1 49 | why = random.choice(["Please explain your reasoning.", "Why?", "How did you solve this?", "Let's solve this step by step."]) 50 | if random.randint(0,1) == 0 and "Therefore, " in steps[-1] and "Among A" not in steps[-1]: 51 | answer = steps[-1].replace("Therefore, ","").strip() 52 | answer = answer[0].upper()+answer[1:] 53 | all_steps = "\n".join(steps[:-1]).replace("\n\n", "\n") 54 | text = 'User: '+ q+ "\nAssistant: " + answer +f"\nUser: {why}\nAssistant:\n"+all_steps 55 | else: 56 | all_steps = "\n".join(steps).replace("\n\n", "\n") 57 | text = 'User: '+ q+ f" {why}\nAssistant: " + all_steps 58 | output.write(json.dumps({'text': text, 'metadata': {'source': 'kojma_cot'}})+"\n") 59 | q = a = "" 60 | if l.startswith("Q:"): 61 | q = l.split("Q:",1)[1] 62 | if a or l.startswith("A:"): 63 | a += "\n" + l.split("A:",1)[-1] 64 | 65 | 66 | -------------------------------------------------------------------------------- /OIG/src/lyrics.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset 2 | import csv 3 | import json 4 | import random 5 | import tqdm 6 | 7 | 8 | def create_lyrics(output): 9 | 10 | ####### 11 | # download and unzip https://www.kaggle.com/datasets/neisse/scrapped-lyrics-from-6-genres?select=lyrics-data.csv 12 | ####### 13 | 14 | with open('lyrics-data.csv') as f: 15 | reader = csv.reader(f) 16 | rows = [row for row in reader] 17 | 18 | en_rows = [row for row in rows if row[4] == 'en'] 19 | 20 | user_templates = [ 21 | "Can you provide the lyrics for a song called XXX?", 22 | "I'm looking for the lyrics to a track named XXX, could you show me?", 23 | 
"Could you display the lyrics of a song with the title of XXX?", 24 | "Can you present me with a song to a piece titled XXX?", 25 | "I'd like to see the lyrics to a song titled XXX, can you help me with that?", 26 | "Can you give me the lyrics for XXX?", 27 | "Would you mind writing the lyrics for a song called XXX?", 28 | "Make a song that goes by the name XXX", 29 | "Generate the lyrics for a song called XXX", 30 | "Write me a song lyrics with a name of XXX", 31 | "Compile words for a song titled XXX.", 32 | "Can you provide the lyrics to a song called XXX?", 33 | "I'm looking for the lyrics of a song called XXX, can you tell me?", 34 | "Can you show the lyrics to a song called XXX?", 35 | "Can you give me a song called XXX?", 36 | "I want to see the lyrics of a song called XXX, can you help me?", 37 | "Can you give me the lyrics to XXX?", 38 | "Would you mind writing lyrics for a song called XXX?", 39 | "Make a song called XXX.", 40 | "Generate lyrics for a song named XXX", 41 | "Write me a song titled XXX lyrics", 42 | "Write lyrics for a song called XXX", 43 | ] 44 | 45 | for _, sname, _, lyric, lang in tqdm.tqdm(en_rows): 46 | prompt = ": " + random.choice(user_templates).replace('XXX', sname) + '\n: ' + lyric 47 | output.write(json.dumps({'text': prompt}) + '\n') -------------------------------------------------------------------------------- /OIG/src/mathqa.py: -------------------------------------------------------------------------------- 1 | #@title mathqa 2 | """ 3 | Copyright 2023, LAION contributors, inclduing Ontocord, LLC 4 | and the other authors of OIG 5 | Licensed to the Apache Software Foundation (ASF) under one 6 | or more contributor license agreements. See the NOTICE file 7 | distributed with this work for additional information 8 | regarding copyright ownership. The ASF licenses this file 9 | to you under the Apache License, Version 2.0 (the 10 | "License"); you may not use this file except in compliance 11 | with the License. You may obtain a copy of the License at 12 | http://www.apache.org/licenses/LICENSE-2.0 13 | Unless required by applicable law or agreed to in writing, 14 | software distributed under the License is distributed on an 15 | "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 16 | KIND, either express or implied. See the License for the 17 | specific language governing permissions and limitations 18 | under the License. 19 | """ 20 | 21 | from tqdm.contrib import tzip 22 | import random, os 23 | try: 24 | from datasets import load_dataset 25 | except: 26 | os.system("pip install datasets") 27 | from datasets import load_dataset 28 | import json 29 | 30 | def math_qa(output): 31 | dataset = load_dataset("math_qa") 32 | 33 | qs = dataset['train']['Problem'] 34 | ans = dataset['train']['Rationale'] 35 | 36 | options = dataset['train']['options'] 37 | cors = dataset['train']['correct'] 38 | for a, q, opt, core in tzip(ans, qs, options, cors): 39 | a = a.replace("mr.", "mr") 40 | a = a.replace("dr.", "dr") 41 | a = a.replace("mrs", "mrs") 42 | q = q.replace("’", "'") 43 | a = a.replace("’", "'") 44 | a = a.replace("per. son", "person ").replace(" ;", ".\n").replace("no. ", "number ").replace("sq. ", "sq ").replace("⇒", ".\n").replace("- >", ".\n").replace("= >", ".\n").replace("∴", ". ∴").replace("hence", ". hence").replace("therefore", ". therefore").replace("thus", ". thus").replace("so", ". so").replace("let", ". let").replace("i.e.",".\n") 45 | a = a.replace(" - - -", ".\n").replace("- - ", "").replace(". . .", "\n").strip('"').replace(". 
",".\n").replace("sol .\n","").replace("explanation:","") 46 | if "the answer is" in a: 47 | a = a.split("the answer is")[0] 48 | if "answer" in a: 49 | a = a.split("answer")[0] 50 | if "ans " in a: 51 | a = a.split("ans ")[0] 52 | if "option " in a: 53 | a = a.split("option ")[0] 54 | if "choice" in a: 55 | a = a.split("choice")[0] 56 | 57 | a = a.strip(" \n.").replace("\n.\n","\n").replace(" .",".").replace(" ,",",").replace("\ni.\ne.", "\n").replace("\ni. \ne.\n", "\n").replace("\ni.\ne", "\n").replace("\ni. \ne\n", "\n") 58 | a = a.replace("\nthus.\n","\n") 59 | a = "\n".join(a1.strip(" ,.").rstrip(" -.")+"." for a1 in a.split("\n") if len(a1) > 1) 60 | if [a1 for a1 in a.split("\n") if "+." in a1 or "=." in a1 or len(a1.strip(" .")) == 1 or (")" in a1 and "(" not in a1) or ("[" in a1 and "]" not in a1) or ("(" in a1 and ")" not in a1) or ("]" in a1 and "[" not in a1)]: 61 | continue 62 | if not a.strip(): 63 | continue 64 | a = a.replace("..",".").replace("as.\n","as").replace("hence.","").replace("so.", "").replace("thus.","").replace("now.", "").replace("solution.\n", "").replace("solution","").strip() 65 | a = a.replace("play.\n", "play ").replace("per.\nson", "person").replace("corret", "").replace(" ", " ").replace("explanation :.\n", "").replace(" .", ".").replace(" '","'").replace(" :", ":").replace(" / ", "/").\ 66 | replace("sq.\n", "sq ").replace("al.\nso", "\nalso").replace("no.\n", "number ").replace("to.\n", "to ").replace("the.\n", "the ").replace("and.\n", "and ").replace("are.\n", "are ").replace("is.\n", "is ").replace("per.\nson.", "person ").replace("and.\nso on", "and so on").replace(" ,", ",").rstrip(" ,.") 67 | a = ("\n".join([((p.strip()[0].upper()+p.strip()[1:]) if len(p) > 1 and p[1] != ' ' else p.strip()).strip(' ,?') for p in a.split("\n") if p.strip() and "option" not in p and "correct" not in p and "answer" not in p])+".") 68 | a = a.replace(" ,",",").replace(" .",".").replace("..",".").replace(",.",".").strip(" ,").replace(" ' ", "'").replace("' ", "'").replace(" '", "'") 69 | q = q.replace(" ,",",").replace(" .",".").replace("..",".").replace(",.",".").strip(" ,").replace(" ' ", "'").replace("' ", "'").replace(" '", "'") 70 | if [a1 for a1 in a.split("\n") if "+." in a1 or "=." in a1 or len(a1.strip(" .")) == 1 or (")" in a1 and "(" not in a1) or ("[" in a1 and "]" not in a1) or ("(" in a1 and ")" not in a1) or ("]" in a1 and "[" not in a1)]: 71 | continue 72 | if len(a.strip(". ,")) > 20: 73 | q = q[0].upper() + q[1:] 74 | a = a[0].upper() + a[1:] 75 | q = q.strip() 76 | a = a.strip() 77 | if "\n" in a: 78 | if random.randint(0,1) == 0: 79 | pr = f"User: {q}. "+random.choice(["And explain please.", "Let's think step by step.", "Can you show me your work?", "Help me solve this step by step."])+f"\nAssistant: {a}".replace("..", ".").replace("?.", "?").replace(" i ", " I ") 80 | output.write (json.dumps({"text":pr, "metadata": {"source": "mathqa"}})+"\n") 81 | else: 82 | a2 = a.split("\n") 83 | final_answer = a2[-1].replace("∴", "").replace("Hence", "").replace("Therefore", "").replace("Then", "").replace("Thus", "").replace("So", "").strip() 84 | final_answer = final_answer[0].upper() + final_answer[1:] 85 | pr = f"User: {q}.\nAssistant: {final_answer}\nUser: Can you solve this step by step?\Assistant: Sure.\n{a}".replace("..", ".").replace("?.", "?").replace(" i ", " I ") 86 | output.write (json.dumps({"text":pr, "metadata": {"source": "mathqa"}})+"\n") 87 | else: 88 | pr = f"User: {q}. 
Let's think step by step.\nAssistant: {a}".replace("..", ".").replace("?.", "?").replace(" i ", " I ") 89 | output.write (json.dumps({"text":pr, "metadata": {"source": "mathqa"}})+"\n") 90 | 91 | -------------------------------------------------------------------------------- /OIG/src/merged_code_xp3.py: -------------------------------------------------------------------------------- 1 | #@title merged_code_xp3 2 | """ 3 | Copyright 2023, LAION contributors, inclduing Ontocord, LLC 4 | and the other authors of OIG 5 | Licensed to the Apache Software Foundation (ASF) under one 6 | or more contributor license agreements. See the NOTICE file 7 | distributed with this work for additional information 8 | regarding copyright ownership. The ASF licenses this file 9 | to you under the Apache License, Version 2.0 (the 10 | "License"); you may not use this file except in compliance 11 | with the License. You may obtain a copy of the License at 12 | http://www.apache.org/licenses/LICENSE-2.0 13 | Unless required by applicable law or agreed to in writing, 14 | software distributed under the License is distributed on an 15 | "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 16 | KIND, either express or implied. See the License for the 17 | specific language governing permissions and limitations 18 | under the License. 19 | """ 20 | 21 | import os 22 | import json, random 23 | 24 | from torch import e 25 | def create_merged_code_xp3(output): 26 | if not os.path.exists("merged_code.jsonl"): 27 | os.system("wget https://huggingface.co/datasets/bigscience/xP3/resolve/main/code/merged_code.jsonl") 28 | 29 | with open("merged_code.jsonl") as file: 30 | for l in file: 31 | data = json.loads(l.strip()) 32 | if ' def ' in data['inputs'] or ' def ' in data['targets']: 33 | inputs, targets = data['inputs'], data['targets'] 34 | inputs = inputs.replace("", "").replace(" ; ", ".\n").replace("; ", ".\n") 35 | if inputs.startswith ("def ") or inputs.startswith("class ") or inputs[0] in {'#', '@'}: 36 | inputs = inputs.strip() 37 | if "\n" not in inputs: 38 | inputs = "Complete this python program:\n" + inputs 39 | else: 40 | inputs = "Solve the following python programming problem given the following code:\n" + inputs 41 | prefix = "Here is the Python code you requested" 42 | if inputs[0] == '"': 43 | prefix = inputs.split("\n")[0].strip('"').split("|")[0] + " and the other steps required." 
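# The docstring's first line doubles as a task description, so the assistant reply
# can open with a natural-language restatement ("Here is code to ...").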
44 | prefix = ("Here is code to " + prefix[0].lower()+prefix[1:] + "\n").replace(" ", " ") 45 | elif "Find "in inputs: 46 | prefix = inputs.split("Find ",1)[1].split("\n")[0] 47 | prefix = "Here is Python code to find " + prefix+"\n" 48 | elif "find "in inputs: 49 | prefix = inputs.split("find ",1)[1].split("\n")[0] 50 | prefix = "Here is Python code to find " + prefix+"\n" 51 | elif "Determine "in inputs: 52 | prefix = inputs.split("Determine ",1)[1].split("\n")[0] 53 | prefix = "Here is Python code to determine " + prefix+"\n" 54 | elif "determine "in inputs: 55 | prefix = inputs.split("determine ",1)[1].split("\n")[0] 56 | prefix = "Here is Python code to determine " + prefix+"\n" 57 | elif "Fix "in inputs: 58 | prefix = inputs.split("Fix ",1)[1].split("\n")[0] 59 | prefix = "Here is Python code to fix " + prefix+"\n" 60 | elif "fix "in inputs: 61 | prefix = inputs.split("fix ",1)[1].split("\n")[0] 62 | prefix = "Here is Python code to fix " + prefix+"\n" 63 | elif "Print "in inputs: 64 | prefix = inputs.split("Print ",1)[1].split("\n")[0] 65 | prefix = "Here is Python code to print " + prefix+"\n" 66 | elif "print "in inputs: 67 | prefix = inputs.split("print ",1)[1].split("\n")[0] 68 | prefix = "Here is Python code to print " + prefix+"\n" 69 | elif "Compare "in inputs: 70 | prefix = inputs.split("Compare ",1)[1].split("\n")[0] 71 | prefix = "Here is Python code to compare " + prefix+"\n" 72 | elif "compare "in inputs: 73 | prefix = inputs.split("compare ",1)[1].split("\n")[0] 74 | prefix = "Here is Python code to compare " + prefix+"\n" 75 | elif "Compute "in inputs: 76 | prefix = inputs.split("Compute ",1)[1].split("\n")[0] 77 | prefix = "Here is Python code to compute " + prefix+"\n" 78 | elif "compute "in inputs: 79 | prefix = inputs.split("compute ",1)[1].split("\n")[0] 80 | prefix = "Here is Python code to compute " + prefix+"\n" 81 | elif "Calculate "in inputs: 82 | prefix = inputs.split("Calculate ",1)[1].split("\n")[0] 83 | prefix = "Here is Python code to calculate " + prefix+"\n" 84 | elif "calculate "in inputs: 85 | prefix = inputs.split("calculate ",1)[1].split("\n")[0] 86 | prefix = "Here is Python code to calculate " + prefix+"\n" 87 | elif "What" in inputs: 88 | prefix = inputs.split("What",1)[1].split("\n")[0] 89 | prefix = "Here" + prefix+"\n" 90 | elif "what" in inputs: 91 | prefix = inputs.split("what",1)[1].split("\n")[0] 92 | prefix = "Here" + prefix+"\n" 93 | inputs = inputs.strip('"') 94 | if prefix: 95 | prefix = prefix.split("?")[0] 96 | prefix = "#"+prefix.split(".")[0]+".\n" 97 | if len(targets) <= 20: continue 98 | if not inputs.startswith("Complete this python program") and not inputs.startswith("Solve the following python programming"): 99 | if random.randint(0,1) and prefix: 100 | command = random.choice(["\n"," ", " ... ", "\n=====\n"])+prefix.replace("Here is", random.choice(["Write me a", "Give me a", "What is a", "Can you provide a"])).strip("\n .")+"?" 
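# `prefix` is phrased as a "#" comment; drop the marker so it reads as a user request,
# while the commented form still opens the assistant's reply below.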
101 | command = command.replace("#", "") 102 | text= (f"User: {inputs}{command}\nAssistant: {prefix}{targets}") 103 | elif random.randint(0,1) and prefix: 104 | command = prefix.replace("Here is", random.choice(["Write me a", "Give me a", "What is a", "Can you provide a"])).strip(".\n ")+" given the following:\n" 105 | command = command.replace("#", "") 106 | text= (f"User: {command}{inputs}\nAssistant: {prefix}{targets}") 107 | else: 108 | text= (f"User: {inputs}\nAssistant: {prefix}{targets}") 109 | else: 110 | text= (f"User: {inputs}\nAssistant: {prefix}{targets}") 111 | text = text.replace("\n\n\n", "\n\n") 112 | if random.randint(0,1): 113 | text = text.replace("Python code", "Python program") 114 | output.write(json.dumps({'text':text, "metadata": {'source': 'merged_code_xp3'}})+"\n") 115 | 116 | -------------------------------------------------------------------------------- /OIG/src/minimal_trainer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding=utf-8 3 | # Copyright 2023 LAION contributors to OIG 4 | # Copyright 2020 The HuggingFace Inc. team. All rights reserved. 5 | # 6 | # Licensed under the Apache License, Version 2.0 (the "License"); 7 | # you may not use this file except in compliance with the License. 8 | # You may obtain a copy of the License at 9 | # 10 | # http://www.apache.org/licenses/LICENSE-2.0 11 | # 12 | # Unless required by applicable law or agreed to in writing, software 13 | # distributed under the License is distributed on an "AS IS" BASIS, 14 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 15 | # See the License for the specific language governing permissions and 16 | # limitations under the License. 17 | """ 18 | Fine-tuning the library models for causal language modeling (GPT, GPT-2, CTRL, ...) on a text file or a dataset. 19 | Here is the full list of checkpoints on the hub that can be fine-tuned by this script: 20 | https://huggingface.co/models?filter=causal-lm 21 | """ 22 | 23 | # This script is modified from huggingface example CLM training 24 | # script uploaded by Xirider. See https://github.com/Xirider/finetune-gpt2xl 25 | # You need to have a tokenized dataset with train and eval splits in the 26 | # same directory you execute this script from. 27 | 28 | """ 29 | Eample DS config to execute from the commandline. 
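The trainer loads pre-tokenized train/eval datasets from ./custom_train and
./custom_eval with load_from_disk (see main() below). A minimal sketch of
producing them from a jsonl of 'text' records follows; the file names, model
choice and 280-token block size are assumptions, not part of this script:

    # Sketch only: paths, model name and block size are assumptions.
    from datasets import load_dataset
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-1.4b-deduped")
    tokenizer.pad_token = tokenizer.eos_token  # GPT-NeoX tokenizers have no pad token by default

    raw = load_dataset("json", data_files={"train": "train.jsonl", "validation": "eval.jsonl"})

    def tokenize(batch):
        # Fixed-length blocks matching --block_size; labels mirror input_ids for causal LM loss.
        enc = tokenizer(batch["text"], truncation=True, max_length=280, padding="max_length")
        enc["labels"] = [ids.copy() for ids in enc["input_ids"]]
        return enc

    tokenized = raw.map(tokenize, batched=True, remove_columns=raw["train"].column_names)
    tokenized["train"].save_to_disk("custom_train")
    tokenized["validation"].save_to_disk("custom_eval")

With the tokenized splits in place, launch through DeepSpeed, for example: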
30 | 31 | deepspeed --num_gpus=1 minimal_trainer.py \ 32 | --deepspeed ds_config_gptneo.json \ 33 | --model_name_or_path pythia-1.4b-deduped \ 34 | --do_train \ 35 | --do_eval \ 36 | --block_size 280 \ 37 | --fp16 \ 38 | --overwrite_cache \ 39 | --evaluation_strategy="steps" \ 40 | --output_dir custom_1.4B_512bs \ 41 | --num_train_epochs 1 \ 42 | --eval_steps 200 \ 43 | --gradient_accumulation_steps 1 \ 44 | --per_device_train_batch_size 64 \ 45 | --use_fast_tokenizer True \ 46 | --learning_rate 5e-06 \ 47 | --warmup_steps 5 48 | """ 49 | 50 | # Example json for dsconfig (ds_config_gptneo.json) 51 | # Up to 13B model can be trained on single A100 80GB in DS config 2 52 | 53 | """ 54 | { 55 | "fp16": { 56 | "enabled": "auto", 57 | "loss_scale": 0, 58 | "loss_scale_window": 1000, 59 | "initial_scale_power": 16, 60 | "hysteresis": 2, 61 | "min_loss_scale": 1 62 | }, 63 | "optimizer": { 64 | "type": "AdamW", 65 | "params": { 66 | "lr": "auto", 67 | "betas": "auto", 68 | "eps": "auto", 69 | "weight_decay": "auto" 70 | } 71 | }, 72 | "scheduler": { 73 | "type": "WarmupLR", 74 | "params": { 75 | "warmup_min_lr": "auto", 76 | "warmup_max_lr": "auto", 77 | "warmup_num_steps": "auto" 78 | } 79 | }, 80 | "zero_optimization": { 81 | "stage": 2, 82 | "allgather_partitions": true, 83 | "allgather_bucket_size": 2e8, 84 | "overlap_comm": true, 85 | "reduce_scatter": true, 86 | "reduce_bucket_size": 2e8, 87 | "contiguous_gradients": true, 88 | "cpu_offload": true 89 | }, 90 | "gradient_accumulation_steps": "auto", 91 | "gradient_clipping": "auto", 92 | "steps_per_print": 2000, 93 | "train_batch_size": "auto", 94 | "train_micro_batch_size_per_gpu": "auto", 95 | "wall_clock_breakdown": false 96 | } 97 | 98 | """ 99 | # You can also adapt this script on your own causal language modeling task. Pointers for this are left as comments. 100 | 101 | import logging 102 | import math 103 | import os 104 | import sys 105 | from dataclasses import dataclass, field 106 | from typing import Optional 107 | from pathlib import Path 108 | 109 | from datasets import load_dataset, Dataset, load_from_disk 110 | 111 | import transformers 112 | from transformers import ( 113 | CONFIG_MAPPING, 114 | MODEL_FOR_CAUSAL_LM_MAPPING, 115 | AutoConfig, 116 | AutoModelForCausalLM, 117 | AutoTokenizer, 118 | HfArgumentParser, 119 | Trainer, 120 | TrainingArguments, 121 | default_data_collator, 122 | set_seed, 123 | ) 124 | from transformers.trainer_utils import get_last_checkpoint, is_main_process 125 | from transformers.utils import check_min_version 126 | 127 | 128 | # Will error if the minimal version of Transformers is not installed. Remove at your own risks. 129 | check_min_version("4.5.0.dev0") 130 | 131 | logger = logging.getLogger(__name__) 132 | 133 | 134 | MODEL_CONFIG_CLASSES = list(MODEL_FOR_CAUSAL_LM_MAPPING.keys()) 135 | MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES) 136 | 137 | 138 | @dataclass 139 | class ModelArguments: 140 | """ 141 | Arguments pertaining to which model/config/tokenizer we are going to fine-tune, or train from scratch. 142 | """ 143 | 144 | model_name_or_path: Optional[str] = field( 145 | default=None, 146 | metadata={ 147 | "help": "The model checkpoint for weights initialization." 148 | "Don't set if you want to train a model from scratch." 
149 | }, 150 | ) 151 | model_type: Optional[str] = field( 152 | default=None, 153 | metadata={ 154 | "help": "If training from scratch, pass a model type from the list: " + ", ".join(MODEL_TYPES)}, 155 | ) 156 | config_name: Optional[str] = field( 157 | default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"} 158 | ) 159 | tokenizer_name: Optional[str] = field( 160 | default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} 161 | ) 162 | cache_dir: Optional[str] = field( 163 | default=None, 164 | metadata={ 165 | "help": "Where do you want to store the pretrained models downloaded from huggingface.co"}, 166 | ) 167 | use_fast_tokenizer: bool = field( 168 | default=True, 169 | metadata={ 170 | "help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."}, 171 | ) 172 | model_revision: str = field( 173 | default="main", 174 | metadata={ 175 | "help": "The specific model version to use (can be a branch name, tag name or commit id)."}, 176 | ) 177 | use_auth_token: bool = field( 178 | default=False, 179 | metadata={ 180 | "help": "Will use the token generated when running `transformers-cli login` (necessary to use this script " 181 | "with private models)." 182 | }, 183 | ) 184 | 185 | 186 | @dataclass 187 | class DataTrainingArguments: 188 | """ 189 | Arguments pertaining to what data we are going to input our model for training and eval. 190 | """ 191 | 192 | block_size: Optional[int] = field( 193 | default=None, 194 | metadata={ 195 | "help": "Optional input sequence length after tokenization." 196 | "The training dataset will be truncated in block of this size for training." 197 | "Default to the model max input length for single sentence inputs (take into account special tokens)." 198 | }, 199 | ) 200 | overwrite_cache: bool = field( 201 | default=False, metadata={"help": "Overwrite the cached training and evaluation sets"} 202 | ) 203 | preprocessing_num_workers: Optional[int] = field( 204 | default=None, 205 | metadata={"help": "The number of processes to use for the preprocessing."}, 206 | ) 207 | 208 | 209 | 210 | def main(): 211 | # Directory of train and eval datasets 212 | # These datasets should be pretokenized and in the format of the toy dataset in (rallio_toy.zip) 213 | # A separate script to create datasets from text is available on the github. 214 | 215 | train_dataset = load_from_disk("custom_train") # replace custom_train with your path 216 | eval_dataset = load_from_disk("custom_eval") # replace custom eval with your path 217 | 218 | # See all possible arguments in src/transformers/training_args.py 219 | # or by passing the --help flag to this script. 220 | # We now keep distinct sets of args, for a cleaner separation of concerns. 221 | 222 | parser = HfArgumentParser( 223 | (ModelArguments, DataTrainingArguments, TrainingArguments)) 224 | if len(sys.argv) == 2 and sys.argv[1].endswith(".json"): 225 | # If we pass only one argument to the script and it's the path to a json file, 226 | # let's parse it to get our arguments. 227 | model_args, data_args, training_args = parser.parse_json_file( 228 | json_file=os.path.abspath(sys.argv[1])) 229 | else: 230 | model_args, data_args, training_args = parser.parse_args_into_dataclasses() 231 | 232 | # Detecting last checkpoint. 
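# If output_dir already contains a checkpoint and --overwrite_output_dir is not set,
# training resumes from it; a non-empty output_dir without a checkpoint is treated as
# an error so a previous run is never silently overwritten.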
233 | last_checkpoint = None 234 | if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: 235 | last_checkpoint = get_last_checkpoint(training_args.output_dir) 236 | if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: 237 | raise ValueError( 238 | f"Output directory ({training_args.output_dir}) already exists and is not empty. " 239 | "Use --overwrite_output_dir to overcome." 240 | ) 241 | elif last_checkpoint is not None: 242 | logger.info( 243 | f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " 244 | "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." 245 | ) 246 | 247 | # Setup logging 248 | logging.basicConfig( 249 | format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", 250 | datefmt="%m/%d/%Y %H:%M:%S", 251 | handlers=[logging.StreamHandler(sys.stdout)], 252 | ) 253 | logger.setLevel(logging.INFO if is_main_process( 254 | training_args.local_rank) else logging.WARN) 255 | 256 | # Log on each process the small summary: 257 | logger.warning( 258 | f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" 259 | + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" 260 | ) 261 | # Set the verbosity to info of the Transformers logger (on main process only): 262 | if is_main_process(training_args.local_rank): 263 | transformers.utils.logging.set_verbosity_info() 264 | transformers.utils.logging.enable_default_handler() 265 | transformers.utils.logging.enable_explicit_format() 266 | logger.info("Training/evaluation parameters %s", training_args) 267 | 268 | # Set seed before initializing model. 269 | set_seed(training_args.seed) 270 | 271 | # Load pretrained model and tokenizer 272 | # 273 | # Distributed training: 274 | # The .from_pretrained methods guarantee that only one local process can concurrently 275 | # download model & vocab. 276 | 277 | config_kwargs = { 278 | "cache_dir": model_args.cache_dir, 279 | "revision": model_args.model_revision, 280 | "use_auth_token": True if model_args.use_auth_token else None, 281 | } 282 | if model_args.config_name: 283 | config = AutoConfig.from_pretrained( 284 | model_args.config_name, **config_kwargs) 285 | elif model_args.model_name_or_path: 286 | config = AutoConfig.from_pretrained( 287 | model_args.model_name_or_path, **config_kwargs) 288 | else: 289 | config = CONFIG_MAPPING[model_args.model_type]() 290 | logger.warning( 291 | "You are instantiating a new config instance from scratch.") 292 | 293 | # Things that were changed from the huggingface file 294 | 295 | config.gradient_checkpointing = True 296 | config.use_cache = False 297 | 298 | # 299 | tokenizer_kwargs = { 300 | "cache_dir": model_args.cache_dir, 301 | "use_fast": model_args.use_fast_tokenizer, 302 | "revision": model_args.model_revision, 303 | "use_auth_token": True if model_args.use_auth_token else None, 304 | } 305 | if model_args.tokenizer_name: 306 | tokenizer = AutoTokenizer.from_pretrained( 307 | model_args.tokenizer_name, **tokenizer_kwargs) 308 | elif model_args.model_name_or_path: 309 | tokenizer = AutoTokenizer.from_pretrained( 310 | model_args.model_name_or_path, **tokenizer_kwargs) 311 | else: 312 | raise ValueError( 313 | "You are instantiating a new tokenizer from scratch. This is not supported by this script." 
314 | "You can do it from another script, save it, and load it from here, using --tokenizer_name." 315 | ) 316 | 317 | 318 | if model_args.model_name_or_path: 319 | model = AutoModelForCausalLM.from_pretrained( 320 | model_args.model_name_or_path, 321 | from_tf=bool(".ckpt" in model_args.model_name_or_path), 322 | config=config, 323 | cache_dir=model_args.cache_dir, 324 | revision=model_args.model_revision, 325 | use_auth_token=True if model_args.use_auth_token else None, 326 | ) 327 | else: 328 | logger.info("Training new model from scratch") 329 | model = AutoModelForCausalLM.from_config(config) 330 | 331 | model.resize_token_embeddings(len(tokenizer)) 332 | 333 | 334 | # Initialize our Trainer 335 | trainer = Trainer( 336 | model=model, 337 | args=training_args, 338 | train_dataset=train_dataset if training_args.do_train else None, 339 | eval_dataset=eval_dataset if training_args.do_eval else None, 340 | tokenizer=tokenizer, 341 | # Data collator will default to DataCollatorWithPadding, so we change it. 342 | data_collator=default_data_collator, 343 | 344 | ) 345 | 346 | # Training 347 | if training_args.do_train: 348 | if last_checkpoint is not None: 349 | checkpoint = last_checkpoint 350 | elif model_args.model_name_or_path is not None and os.path.isdir(model_args.model_name_or_path): 351 | checkpoint = model_args.model_name_or_path 352 | else: 353 | checkpoint = None 354 | train_result = trainer.train(resume_from_checkpoint=checkpoint) 355 | trainer.save_model() # Saves the tokenizer too for easy upload 356 | 357 | metrics = train_result.metrics 358 | metrics["train_samples"] = len(train_dataset) 359 | 360 | trainer.log_metrics("train", metrics) 361 | trainer.save_metrics("train", metrics) 362 | trainer.save_state() 363 | 364 | # Evaluation 365 | if training_args.do_eval: 366 | logger.info("*** Evaluate ***") 367 | metrics = trainer.evaluate() 368 | metrics["eval_samples"] = len(eval_dataset) 369 | perplexity = math.exp(metrics["eval_loss"]) 370 | metrics["perplexity"] = perplexity 371 | trainer.log_metrics("eval", metrics) 372 | trainer.save_metrics("eval", metrics) 373 | 374 | 375 | def _mp_fn(index): 376 | # For xla_spawn (TPUs) 377 | main() 378 | 379 | 380 | if __name__ == "__main__": 381 | main() 382 | -------------------------------------------------------------------------------- /OIG/src/multi_news.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset 2 | import csv 3 | import json 4 | import random 5 | import tqdm 6 | 7 | 8 | def create_multi_news(output): 9 | 10 | data = load_dataset('multi_news') 11 | 12 | doc_names = [('document', 'Document'), ('passage', 'Passage'), ('text', 'Text'),] 13 | user_templates = [ 14 | "Can you condense the documents into XXX words or less?", 15 | "Can you provide a brief rundown of the documents in XXX words?", 16 | "Could you distill the essence of the documents into XXX words?", 17 | "Can you succinctly summarize the documents in XXX words?", 18 | "Can you give a brief summary of the documents using only XXX words?", 19 | "Can you encapsulate the documents into XXX words or fewer?", 20 | "Could you articulate the main points of the documents in XXX words?", 21 | "Can you concisely highlight the essence of the documents in XXX words?", 22 | "Can you synthesize the documents into a XXX-word summary?", 23 | "Can you present a pithy summary of the documents in XXX words?", 24 | "Briefly summarize the documents in XXX words or less.", 25 | "Provide a succinct summary of the 
documents in no more than XXX words.", 26 | "Give a condensed overview of the documents in XXX words or fewer.", 27 | "Present a short summary of the documents using no more than XXX words.", 28 | "In XXX words or less, give a brief synopsis of the documents.", 29 | "Summarize the contents of the documents in no more than XXX words.", 30 | "Give a summary of the documents in a maximum of XXX words.", 31 | "Present an abridged version of the documents using XXX words or fewer.", 32 | "Summarize the information in the documents using a maximum of XXX words.", 33 | "In a nutshell, provide a summary of the documents in XXX words or less.", 34 | "Can you condense the documents?", 35 | "Can you provide a brief rundown of the documents?", 36 | "Could you distill the essence of the documents?", 37 | "Can you succinctly summarize the documents?", 38 | "Can you give a brief summary of the documents?", 39 | "Can you encapsulate the documents?", 40 | "Could you articulate the main points of the documents?", 41 | "Can you concisely highlight the essence of the documents?", 42 | "Can you synthesize the documents into a summary?", 43 | "Can you present a pithy summary of the documents?", 44 | "Briefly summarize the documents", 45 | "Provide a succinct summary of the documents.", 46 | "Give a condensed overview of the documents.", 47 | "Present a short summary of the documents.", 48 | "Give a brief synopsis of the documents.", 49 | "Summarize the contents of the documents.", 50 | "Give a summary of the documents.", 51 | "Present an abridged version of the documents.", 52 | "Summarize the information in the documents.", 53 | "In a nutshell, provide a summary of the documents.", 54 | ] 55 | 56 | for item in tqdm.tqdm(data['train']): 57 | 58 | ##### instruction after 59 | 60 | documents = [doc.strip() for doc in item['document'].split('|||||')] 61 | summary = item['summary'].lstrip('– ').strip() 62 | summary_n_words = (len(summary.split()) // 10 + 1) * 10 63 | 64 | template = random.choice(user_templates) 65 | doc_type, Doc_type = random.choice(doc_names) 66 | template = template.replace('document', doc_type) 67 | template = template.replace('XXX', str(summary_n_words)) 68 | 69 | prompt = ': ' 70 | for i, doc in enumerate(documents): 71 | doc = doc.replace('\n ', '\n') 72 | prompt += f"{Doc_type} {i+1}: " + doc + '\n' 73 | 74 | prompt += template 75 | 76 | prompt += '\n: ' + summary 77 | 78 | output.write(json.dumps({'text': prompt}) + '\n') 79 | 80 | ##### instruction before 81 | documents = [doc.strip() for doc in item['document'].split('|||||')] 82 | summary = item['summary'].lstrip('– ').strip() 83 | summary_n_words = (len(summary.split()) // 10 + 1) * 10 84 | 85 | template = random.choice(user_templates) 86 | doc_type, Doc_type = random.choice(doc_names) 87 | template = template.replace('document', doc_type) 88 | template = template.replace('XXX', str(summary_n_words)) 89 | 90 | prompt = ': ' 91 | 92 | prompt += template 93 | 94 | for i, doc in enumerate(documents): 95 | doc = doc.replace('\n ', '\n') 96 | prompt += f"\n{Doc_type} {i+1}: " + doc 97 | 98 | prompt += '\n: ' + summary 99 | 100 | output.write(json.dumps({'text': prompt}) + '\n') 101 | -------------------------------------------------------------------------------- /OIG/src/ni.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset 2 | import csv 3 | import json 4 | import random 5 | import tqdm 6 | 7 | def create_ni(output): 8 | 9 | data = 
load_dataset('Muennighoff/natural-instructions')
10 | 
11 |     last_task_def = None
12 |     last_question = None
13 |     prompt = ''
14 | 
15 |     for item in tqdm.tqdm(data['train']):
16 | 
17 |         task_def = item['definition']
18 |         question = item['inputs']
19 | 
20 |         if question == last_question:
21 |             continue
22 | 
23 |         last_question = question
24 | 
25 |         answer = item['targets']
26 | 
27 |         # Do a cut every 20 examples.
28 |         if task_def == last_task_def and len(prompt.split('<human>')) < 20:
29 | 
30 |             prompt += f"\n<human>: {question}\n<bot>: {answer}"
31 | 
32 |         else:
33 |             if last_task_def is not None:
34 | 
35 |                 output.write(json.dumps({'text': prompt}) + '\n')
36 | 
37 |             last_task_def = task_def
38 |             prompt = f"<human>: {task_def}\n\n{question}\n<bot>: {answer}"
39 | 
40 |     output.write(json.dumps({'text': prompt}) + '\n')
-------------------------------------------------------------------------------- /OIG/src/nq.py: --------------------------------------------------------------------------------
1 | from datasets import load_dataset
2 | import csv
3 | import json
4 | import random
5 | import tqdm
6 | 
7 | import re
8 | 
9 | def striphtml(data):
10 |     p = re.compile(r'<.*?>')
11 |     return p.sub('', data)
12 | 
13 | from nltk.tokenize.treebank import TreebankWordDetokenizer
14 | detokenizer = TreebankWordDetokenizer()
15 | 
16 | 
17 | def create_nq(output):
18 | 
19 |     #####
20 |     # download and unzip https://storage.cloud.google.com/natural_questions/v1.0-simplified/simplified-nq-train.jsonl.gz
21 |     #####
22 | 
23 |     mores = [
24 |         "Could you point me to the text mentioning this?",
25 |         "Which text refers to this?",
26 |         "Can you highlight the sentence that discusses this?",
27 |         "Where in the text can I find information about this?",
28 |         "What sentence talks about this?",
29 |         "Could you direct me to the sentence that pertains to this?",
30 |         "Which sentence addresses this topic?",
31 |         "Can you locate the sentence that covers this?",
32 |         "Where in the passage is the sentence discussing this?",
33 |         "What is the sentence that relates to this?",
34 |     ]
35 | 
36 |     with open("v1.0-simplified_simplified-nq-train.jsonl") as f:
37 |         for i, line in enumerate(f):
38 |             if line.strip() == '':
39 |                 continue
40 |             item = json.loads(line)
41 | 
42 |             # doc = detokenizer.detokenize(item['document_text'].split(' '))
43 |             c_id = item['annotations'][0]['long_answer']['candidate_index']
44 |             if c_id < 0:
45 |                 # no answer
46 |                 c_id = random.randint(0, len(item['long_answer_candidates']) - 1)
47 |             full_doc = item['document_text'].split(' ')
48 |             doc = []
49 |             for cand in item['long_answer_candidates'][max(c_id-2, 0): c_id+2]:
50 |                 doc += full_doc[cand['start_token']: cand['end_token']]
51 |             doc = detokenizer.detokenize(doc)
52 | 
53 |             if doc == '':
54 |                 print(doc, c_id)
55 |                 break
56 | 
57 |             q = item['question_text'].capitalize()
58 | 
59 |             if item['annotations'][0]['yes_no_answer'] == 'NONE' and item['annotations'][0]['long_answer']['start_token'] == -1 and len(item['annotations'][0]['short_answers']) == 0:
60 | 
61 |                 prompt = f": {doc}\n\n{q}\n: Sorry, I cannot find a relevant answer in the given context."
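# The remaining branches return either a short answer or a yes/no answer followed by
# the supporting long-answer passage, or, when only a long answer exists, that passage alone.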
62 | 63 | elif len(item['annotations'][0]['short_answers']) > 0: 64 | # short answer 65 | 66 | short_a = detokenizer.detokenize(item['document_text'].split(' ')[ 67 | item['annotations'][0]['short_answers'][0]['start_token']: item['annotations'][0]['short_answers'][0]['end_token'] 68 | ]) 69 | 70 | long_a = detokenizer.detokenize(item['document_text'].split(' ')[ 71 | item['annotations'][0]['long_answer']['start_token']: item['annotations'][0]['long_answer']['end_token'] 72 | ]) 73 | long_a = striphtml(long_a).strip() 74 | while long_a != long_a.replace(' ', ' '): 75 | long_a = long_a.replace(' ', ' ') 76 | long_a = detokenizer.detokenize(long_a.split(' ')) 77 | 78 | more = random.choice(mores) 79 | prompt = f": {doc}\n\n{q}\n: {short_a}\n: {more}\n: {long_a}" 80 | 81 | elif item['annotations'][0]['yes_no_answer'] != 'NONE': 82 | 83 | # short answer 84 | if item['annotations'][0]['yes_no_answer'] == 'NO': 85 | short_a = 'No.' 86 | else: 87 | short_a = 'Yes.' 88 | 89 | long_a = detokenizer.detokenize(item['document_text'].split(' ')[ 90 | item['annotations'][0]['long_answer']['start_token']: item['annotations'][0]['long_answer']['end_token'] 91 | ]) 92 | long_a = striphtml(long_a).strip() 93 | while long_a != long_a.replace(' ', ' '): 94 | long_a = long_a.replace(' ', ' ') 95 | long_a = detokenizer.detokenize(long_a.split(' ')) 96 | 97 | more = random.choice(mores) 98 | prompt = f": {doc}\n\n{q}\n: {short_a}\n: {more}\n: {long_a}" 99 | 100 | else: 101 | 102 | long_a = detokenizer.detokenize(item['document_text'].split(' ')[ 103 | item['annotations'][0]['long_answer']['start_token']: item['annotations'][0]['long_answer']['end_token'] 104 | ]) 105 | long_a = striphtml(long_a).strip() 106 | while long_a != long_a.replace(' ', ' '): 107 | long_a = long_a.replace(' ', ' ') 108 | long_a = detokenizer.detokenize(long_a.split(' ')) 109 | 110 | prompt = f": {doc}\n\n{q}\nfind me the text answering this question\n: {long_a}" 111 | 112 | output.write(json.dumps({'text': prompt}) + '\n') 113 | -------------------------------------------------------------------------------- /OIG/src/openai_summarize_tldr.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset 2 | import csv 3 | import json 4 | import random 5 | import tqdm 6 | 7 | 8 | def create_openai_summarize_tldr(output): 9 | 10 | data = load_dataset('CarperAI/openai_summarize_tldr') 11 | 12 | user_templates = [ 13 | "Can you condense the text into three sentences or less?", 14 | "Can you provide a brief rundown of the text in within 3 sentences?", 15 | "Could you distill the essence of the text into 1-3 sentences?", 16 | "Can you succinctly summarize the text in 1-3 sentences?", 17 | "Can you summarize the given text in a few sentences?", 18 | "Can you provide a brief overview of the text?", 19 | "Can you give me the gist of the text in a nutshell?", 20 | "Can you condense the text into a few key points?", 21 | "Can you give me a shortened version of the text?", 22 | "Summarize the given text in a few sentences.", 23 | "Provide a brief overview of the text.", 24 | "Give me the gist of the text in a nutshell", 25 | "Condense the text into a few key points", 26 | "Give me a shortened version of the text", 27 | ] 28 | 29 | for item in tqdm.tqdm(data['train']): 30 | 31 | #### instruction before 32 | 33 | text = item['prompt'][item['prompt'].find('\nPOST')+6:].replace('\nTL;DR:', '').strip() 34 | summary = item['label'] 35 | 36 | user_template = random.choice(user_templates) 37 | 38 | 
prompt = ': ' + user_template 39 | prompt += '\n\n' + text 40 | 41 | prompt += '\n: ' + summary 42 | 43 | output.write(json.dumps({'text': prompt}) + '\n') 44 | 45 | ##### instruction after 46 | 47 | text = item['prompt'][item['prompt'].find('\nPOST')+6:].replace('\nTL;DR:', '').strip() 48 | summary = item['label'] 49 | 50 | user_template = random.choice(user_templates) 51 | 52 | prompt = ': ' + text 53 | prompt += '\n\n' + user_template 54 | 55 | prompt += '\n: ' + summary 56 | 57 | output.write(json.dumps({'text': prompt}) + '\n') -------------------------------------------------------------------------------- /OIG/src/p3.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset 2 | import csv 3 | import json 4 | import random 5 | import tqdm 6 | 7 | def create_p3(output): 8 | 9 | data = load_dataset('Muennighoff/P3') 10 | 11 | last = None 12 | prompt = '' 13 | 14 | for item in tqdm.tqdm(data['train']): 15 | 16 | chunks = item['inputs'].split('\n\n') 17 | 18 | c = '\n\n'.join(chunks[:-1]) 19 | q = chunks[-1] 20 | 21 | a = item['targets'] 22 | 23 | if c == last: 24 | prompt += f"\n: {q}\n: {a}" 25 | else: 26 | if last is not None: 27 | 28 | output.write(json.dumps({'text': prompt}) + '\n') 29 | 30 | last = c 31 | prompt = f": {c}\n\n{q}\n: {a}" 32 | 33 | output.write(json.dumps({'text': prompt}) + '\n') -------------------------------------------------------------------------------- /OIG/src/poetry.py: -------------------------------------------------------------------------------- 1 | from datasets import Dataset,load_dataset 2 | import json 3 | import pandas as pd 4 | import numpy as np 5 | import pronouncing 6 | import re 7 | from collections import Counter 8 | from tqdm import tqdm 9 | from transformers import pipeline 10 | 11 | 12 | MIN_RHYMES = 3 13 | PROMPTS = { 14 | "begining":{ 15 | "default": ["{} a {} entitled {}"], 16 | "about": ["{} a {} on the topic {}", 17 | "{} a {} about {} "], 18 | "rhyming":["{} a {} containing rhyming words for the word '{}' entitled {}.", 19 | "{} a {} containing rhyming words for the word '{}'"], 20 | "genre_age":["{} a {} on the topic {} ", 21 | "{} a {} of the genre {} "], 22 | }, 23 | "completion":{"completion":["{} the {}", 24 | "{} the {} entitled {}"],} 25 | 26 | } 27 | SYNONYMS = { 28 | "poem" : ["poem","poetry"], 29 | "compose" : ["Write","Help me write", "Compose", "Please craft", "Give me"], 30 | "complete" : ["Complete", "Finish", "Put the finishing touches to"], 31 | "sentiment" : {"positive":["positive ","happy "],"negative":["negative ","sad "]}, 32 | "writing" : ["in the manner of {}.","in {}'s writing style.",], 33 | "period" : [" written in {} age."," written during {} period"], 34 | } 35 | 36 | nlp = pipeline(task='text-classification', 37 | model='nickwong64/bert-base-uncased-poems-sentiment') 38 | 39 | 40 | 41 | def get_best_rhymes(content): 42 | 43 | best_rhymes = [] 44 | first_word = 0 45 | lines = [re.sub(r'\r','',line) for line in content.split("\n") if line!=""] 46 | if len(lines)>1: 47 | last_words = [re.sub('[^A-Za-z0-9]+', '', line.split(" ")[-1]) for line in lines] 48 | for word in last_words: 49 | rhymes = pronouncing.rhymes(word) 50 | index = last_words.index(word) 51 | rhymes_words = np.intersect1d(rhymes,last_words[index:index+10]) 52 | if len(rhymes_words) > len(best_rhymes): 53 | best_rhymes = list(rhymes_words) 54 | best_rhymes.insert(0,word) 55 | if best_rhymes: 56 | first_word = min([last_words.index(word) for word in best_rhymes]) 57 | return 
last_words[first_word],best_rhymes 58 | return None,[] 59 | 60 | 61 | def toss_prompt(possible_prompts,prompt_types): 62 | prompt_type = np.random.choice(prompt_types,p=[0.9,0.1]) 63 | prompt = np.random.choice(possible_prompts[prompt_type]) 64 | return prompt_type,prompt 65 | 66 | def get_emotion(content): 67 | 68 | labels = [nlp(re.sub(r'\r|\n','',line))[0]["label"] for line in content.split(".")][:3] 69 | if labels.count("negative") >= 2: 70 | sentiment = "negative" 71 | elif labels.count("positive") >= 2: 72 | sentiment = "positive" 73 | else: 74 | sentiment = None 75 | return sentiment 76 | 77 | 78 | def build_prompt(possible_prompts,prompt_types,args,sentiment,rhyming_list): 79 | 80 | compose = np.random.choice(SYNONYMS["compose"]) 81 | poem = np.random.choice(SYNONYMS["poem"]) 82 | complete = np.random.choice(SYNONYMS["complete"]) 83 | 84 | prompt_type,prompt = toss_prompt(possible_prompts,prompt_types) 85 | if prompt_type == "default": 86 | 87 | prompt = prompt.format(compose,poem,args["poem name"]) 88 | 89 | elif prompt_type == "about": 90 | 91 | prompt = prompt.format(compose,poem,args["poem name"].lower()) 92 | 93 | elif prompt_type == "genre_age": 94 | 95 | prompt = prompt.format(compose,poem,args["type"],args["age"]) 96 | 97 | 98 | elif prompt_type == "completion": 99 | 100 | prompt = prompt.format(complete,poem,args["poem name"]) 101 | 102 | if ((np.random.randint(0,5)) and (sentiment!=None)): 103 | index = prompt.find(poem) 104 | prompt = prompt[:index] + np.random.choice(SYNONYMS["sentiment"][sentiment]) + prompt[index:] 105 | 106 | if ((np.random.randint(0,5)) and (len(rhyming_list)>2)): 107 | index = prompt.find(poem) + len(poem) 108 | prompt = prompt[:index] + " containing rhyming words for the word '{}'".format(rhyming_list[0]) + prompt[index:] 109 | 110 | 111 | return prompt 112 | 113 | def add_author(prompt,author,top_authors): 114 | 115 | if ((author.lower() in top_authors) and (np.random.randint(0,2))): 116 | style = np.random.choice(SYNONYMS["writing"]).format(author.lower().capitalize()) 117 | prompt= prompt + " " + style 118 | 119 | return prompt 120 | 121 | def get_top_authors(dataset): 122 | 123 | counter = Counter([poem["author"] for poem in dataset]).most_common(100) 124 | authors,_ = zip(*counter) 125 | return [name.lower() for name in authors] 126 | 127 | def create_poem_instructions(dataset): 128 | 129 | top_authors = get_top_authors(dataset) 130 | all_prompts = [] 131 | for item in tqdm(dataset): 132 | try: 133 | item["poem name"] = re.sub(r'\r|\n|\[.*\]','',item["poem name"]).strip() 134 | poem_name, content, author, genre, age = [item[key] for key in ["poem name","content","author","type","age"]] 135 | prompt_type = np.random.choice(["completion","begining"],p=[0.3,0.7]) 136 | 137 | sentiment = get_emotion(content) 138 | rh_word,rh_wordslist = get_best_rhymes(item["content"]) 139 | item["rhyming"] = rh_word 140 | 141 | possible_prompts = PROMPTS[prompt_type] 142 | 143 | 144 | if prompt_type == "begining": 145 | 146 | if ((genre!=None) and np.random.randint(0,5)): 147 | 148 | prompt = build_prompt(possible_prompts,["genre_age","default"],item,sentiment,rh_wordslist) 149 | 150 | if ((item["age"]!="") and (np.random.randint(0,5))): 151 | prompt += np.random.choice(SYNONYMS["period"]).format(item["age"]) 152 | 153 | 154 | elif poem_name.lower().startswith("the"): 155 | prompt = build_prompt(possible_prompts,["about","default"],item,sentiment,rh_wordslist) 156 | 157 | else: 158 | prompt = 
build_prompt(possible_prompts,["default","default"],item,sentiment,rh_wordslist) 159 | 160 | prompt = add_author(prompt,author,top_authors) 161 | response = item["content"].strip() 162 | 163 | else: 164 | prompt = build_prompt(possible_prompts,["completion","completion"],item,sentiment,rh_wordslist) 165 | prompt = add_author(prompt,author,top_authors) 166 | num_lines = np.random.randint(3,6) 167 | poem_lines = item["content"].split("\n") 168 | prompt = prompt + "\n" + "\n".join(poem_lines[:num_lines]) 169 | response = "\n".join(poem_lines[num_lines:]).strip() 170 | 171 | all_prompts.append({"prompt":prompt,"response":response}) 172 | except Exception as e: 173 | print(e) 174 | return all_prompts 175 | 176 | def write_output(output): 177 | 178 | with open("output.json","w") as file: 179 | json.dump(output,file,indent=4) 180 | 181 | 182 | def main(): 183 | 184 | prompts = [] 185 | hf_datasets = ["merve/poetry","shahules786/PoetryFoundationData"] 186 | for dataset_name in hf_datasets: 187 | dataset = load_dataset(dataset_name,split="train") 188 | prompts.extend(create_poem_instructions(dataset)) 189 | return prompts 190 | 191 | if __name__ == "__main__": 192 | 193 | output = main() 194 | write_output(output) 195 | 196 | 197 | 198 | -------------------------------------------------------------------------------- /OIG/src/register_oscar.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset 2 | import re 3 | import random 4 | 5 | 6 | def mask_words(sentence, prob_mask=0.1): 7 | """Randomly replace words in a sentence based on 8 | a given probability. 9 | args: 10 | sentence (str): The word to be replace with ... 11 | prob_mask (float): The probability of a word being masked. 12 | Returns: 13 | str: The sentence with the masked words. 14 | """ 15 | words = sentence.split(' ') 16 | n = round(len(words) * prob_mask) 17 | maskedw = ['The missing words:'] 18 | for i in random.sample(range(len(words)), n): 19 | maskedw.append(f'{words[i]},') 20 | words[i] = '...' 21 | 22 | maskedw[-1] = maskedw[-1].replace(',','') 23 | merged_words = [] 24 | for word in words: 25 | if word == '*': 26 | if merged_words and merged_words[-1] == '*': 27 | continue 28 | merged_words.append('*') 29 | else: 30 | merged_words.append(word) 31 | 32 | return ' '.join(merged_words) 33 | 34 | 35 | 36 | def mask_sentence(sentence, prob_mask=0.1): 37 | """ 38 | Masks a random sentences in a given string with prob. 39 | 40 | Args: 41 | - sentence (str): The sentence to be replace with .... . 42 | - prob_mask (float): The probability of a sentence being masked, default is 0.1. 43 | 44 | Returns: 45 | - A pargraph with a masked sentence. 46 | """ 47 | 48 | words = sentence.split('.') 49 | n = round(len(words) * prob_mask) 50 | masked_sen = [] 51 | for i in random.sample(range(len(words)), n): 52 | masked_sen.append(words[i]) 53 | words[i] = '...' 54 | 55 | merged_words = [] 56 | for word in words: 57 | if word == '*': 58 | if merged_words and merged_words[-1] == '*': 59 | continue 60 | merged_words.append('*') 61 | else: 62 | merged_words.append(word) 63 | return ' '.join(merged_words) 64 | 65 | 66 | 67 | def mask_paragraph(sentence): 68 | """ 69 | Masks a paragraph. 70 | 71 | Args: 72 | - sentence (str): The paragraph to be masked. 73 | 74 | Returns: 75 | - A string of the masked paragraph. 76 | """ 77 | words = sentence.split('\n')[:-1] 78 | 79 | i = random.sample(range(len(words)),1)[0] 80 | missing = words[i] 81 | words[i] = '...' 
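# `missing` keeps the paragraph that was elided; the joined string returned below is the
# text with that single paragraph replaced by '...'.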
82 | 83 | return ' '.join(words) 84 | 85 | 86 | 87 | w_styles = {'NA':'Narrative', 88 | 'IN': 'Informational Description', 89 | 'OP':'Opinion', 90 | 'ID':'Interactive Discussion', 91 | 'HI':'Instruction', 92 | 'IP':'Informational Persuasion', 93 | 'LY':'Lyrical', 94 | 'SP':'Spoken',} 95 | 96 | 97 | 98 | instructions = {'free_style':['Write {n} sentences about {topic} in {style_name} style.', 99 | 'Write a paragraph about {topic} in {style_name} style.'], 100 | 'fill_word':['Fill in the missing words in the following paragraph: {sent}'], 101 | 'fill_sent':['Fill in the missing sentences knowing that the pargraph follow {style_name} style about {topic}: {sent}', 102 | 'In {article} {style_name} paragraph about {topic}. What sentence is missing? Please provide the missing sentence following the same strcture: {sent}'], 103 | 'fill_parh':['Fill in the missing paragraph with {n} senteces in the style of {style_name} about {topic}']} 104 | 105 | stopwords = ['i', 'you', 'thy', 'he', 'she', 'it', 'one', 'we', 'you', 'who', 'what', 'well','the', 'is','are', 'while','what','when','their','this', 106 | 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',] 107 | 108 | 109 | def generate_inst(ex): 110 | 111 | inst_format = random.choice(list(instructions.keys())) 112 | masked_sent = '' 113 | num_sent = len(re.split(r'[.!?]+', ex['text']) ) 114 | captial = re.findall(r'\b[A-Z]\w*', ex['text']) 115 | index = 0 116 | 117 | for j in captial: 118 | if j.lower() not in stopwords or len(j) <1: 119 | captial = j 120 | index = ex['text'].index(captial) 121 | break 122 | 123 | if 'topic' not in list(ex.keys()): 124 | captialized_words = re.search(r'\b[A-Z]\w*( [A-Z]\w*)*\b', ex['text'][index:]) 125 | 126 | if captialized_words == None: 127 | 128 | captialized_words = re.search(r'\b[a-z]\w*( [a-z]\w*)*\b', ex['text'][index:]) 129 | 130 | if captialized_words == None: 131 | ex['prompt'] = '' 132 | return ex 133 | 134 | 135 | topic = captialized_words.group() 136 | else: 137 | topic = ex['topic'] 138 | 139 | 140 | rags = range(len(instructions[inst_format])) 141 | index = random.choice(rags) 142 | 143 | if inst_format == 'fill_word': 144 | masked_sent = mask_words(ex['text']) 145 | 146 | 147 | elif inst_format == 'fill_sent': 148 | masked_sent = mask_sentence(ex['text']) 149 | elif inst_format == 'fill_parh': 150 | mask_sent = mask_paragraph(ex['text']) 151 | if ex['labels']: 152 | 153 | style = ex["labels"][0] 154 | article = get_article(style) 155 | prompt = instructions[inst_format][index].format(n=num_sent, 156 | topic=topic, 157 | article=article, 158 | style_name=w_styles[style], 159 | sent=masked_sent) 160 | ex['prompt'] = prompt 161 | 162 | else: 163 | ex['prompt'] = f'Write {num_sent} sentences about {topic}.' 
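# Documents without register labels fall back to a plain free-writing prompt with no style constraint.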
164 |     return ex
165 | 
166 | ds = load_dataset('TurkuNLP/register_oscar','en',cache_dir='/home/cache/TurkuNLP/')
167 | ds = ds.map(generate_inst)
168 | ds.to_json('oscar.json',
169 |            orient = 'records',
170 |            lines=True,)
171 | 
-------------------------------------------------------------------------------- /OIG/src/scitldr.py: --------------------------------------------------------------------------------
1 | from datasets import load_dataset
2 | import csv
3 | import json
4 | import random
5 | import tqdm
6 | import os
7 | 
8 | 
9 | # NOTE: assumed prompt templates asking for a short TL;DR (modeled on openai_summarize_tldr.py).
10 | user_templates = [
11 |     "Can you condense the text into one or two sentences?",
12 |     "Could you distill the essence of the text into a single sentence?",
13 |     "Can you give me the gist of the text in a nutshell?",
14 |     "Summarize the given text in a sentence or two.",
15 |     "Provide a brief TL;DR of the text.",
16 | ]
17 | 
18 | 
19 | def create_scitldr(output):
20 | 
21 |     os.system('git clone https://github.com/allenai/scitldr.git')
22 | 
23 |     data = load_dataset("json", data_files='scitldr/SciTLDR-Data/SciTLDR-A/train.jsonl')
24 | 
25 |     for item in tqdm.tqdm(data['train']):
26 | 
27 |         ##### instruction before
28 | 
29 |         text = ' '.join(item['source']).strip()
30 |         summary = item['target'][0].strip()
31 | 
32 |         user_template = random.choice(user_templates)
33 | 
34 |         prompt = '<human>: ' + user_template
35 |         prompt += '\n\n' + text
36 | 
37 |         prompt += '\n<bot>: ' + summary
38 | 
39 |         output.write(json.dumps({'text': prompt}) + '\n')
40 | 
41 |         ##### instruction after
42 | 
43 |         text = ' '.join(item['source']).strip()
44 |         summary = item['target'][0].strip()
45 | 
46 |         user_template = random.choice(user_templates)
47 | 
48 |         prompt = '<human>: ' + text
49 |         prompt += '\n\n' + user_template
50 | 
51 |         prompt += '\n<bot>: ' + summary
52 | 
53 |         output.write(json.dumps({'text': prompt}) + '\n')
-------------------------------------------------------------------------------- /OIG/src/soda_dialog.py: --------------------------------------------------------------------------------
1 | #@title Soda-dialog
2 | """
3 | Copyright 2023, LAION contributors, including Ontocord, LLC
4 | and the other authors of OIG
5 | Licensed to the Apache Software Foundation (ASF) under one
6 | or more contributor license agreements. See the NOTICE file
7 | distributed with this work for additional information
8 | regarding copyright ownership. The ASF licenses this file
9 | to you under the Apache License, Version 2.0 (the
10 | "License"); you may not use this file except in compliance
11 | with the License. You may obtain a copy of the License at
12 | http://www.apache.org/licenses/LICENSE-2.0
13 | Unless required by applicable law or agreed to in writing,
14 | software distributed under the License is distributed on an
15 | "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
16 | KIND, either express or implied. See the License for the
17 | specific language governing permissions and limitations
18 | under the License.
19 | """ 20 | import random, json 21 | from datasets import load_dataset 22 | 23 | def generate_soda_dialog(output): 24 | dataset = load_dataset("allenai/soda") 25 | for i in range(len(dataset['train'])): 26 | dat = dataset['train'][i] 27 | title = dat['literal'] 28 | story = dat['narrative'] 29 | theme = "" 30 | if dat['relation'] == 'xWant': 31 | theme = "wanting " + dat['tail'] 32 | elif dat['relation'] == 'xNeed': 33 | theme = "needing " + dat['tail'] 34 | elif not dat['tail'].startswith("to ") and not dat['tail'].startswith("and "): 35 | theme = "being "+ dat['tail'] 36 | elif dat['tail'].startswith("and "): 37 | theme = "people are "+ dat['tail'].replace("and PersonY ", "") 38 | else: 39 | theme = dat['tail'] 40 | theme = theme.replace("PersonY", "another person") 41 | theme = theme.replace("being is", "being") 42 | dialog = [s2+": "+s1 for s1, s2 in zip(dat['dialogue'],dat['speakers'])] 43 | txt = "" 44 | start = random.choice(["Ok, ", "Sure, ", "Of course, ", ""]) 45 | 46 | categories = get_main_non_stopwords(story) 47 | if random.randint(0,6)==0 and categories: 48 | categories = ", ".join(categories) 49 | txt = f"User: Can you give me a short story idea for {categories}?\n" 50 | txt += f"Assistant: {start}, a short story idea for {categories}:\n {story}.\n" 51 | dialog = dialog.replace(dat['speakers'][0], "User2").replace(dat['speakers'][1], "User") 52 | title = title.replace(dat['speakers'][0], "User2").replace(dat['speakers'][1], "User") 53 | theme = theme.replace(dat['speakers'][0], "User2").replace(dat['speakers'][1], "User") 54 | dialog2 = "" 55 | for d in dialog.split("\n"): 56 | if random.randint(0,3) == 0 and len(dialog2)>1 and "User:" in d and "@Assistant" not in dialog2: 57 | dialog2 += f"User: @Assistant, what would be a theme of my conversation with @User2?\nAssistant: One theme of your conversation could be {theme}.\n" 58 | dialog2 += d 59 | txt += dialog2 60 | txt += f"User2: @Assistant, can you summarize my conversation with User?\nAssistant: {title}.\n" 61 | elif random.randint(0,6)==0: 62 | txt = f"User: Can you give me a short story description for this dialog?\n" 63 | txt += " "+"\n ".join(dialog)+"\n" 64 | txt += f"Assistant: {start}, a short story description for this dialog could be: \n {story}\n" 65 | txt += "User: And a title?\n" 66 | txt += f"Assistant: {start}a title for this dialog could be: \n {title}\n" 67 | if theme: 68 | txt += "User: What would be one theme of this story?\n" 69 | txt += f"Assistant: One theme of this story could be: \"{theme}\"\n" 70 | elif random.randint(0,3)==0: 71 | txt = f"User: Can you write a short dialog based on this story:\n {story}\n" 72 | txt += f"Assistant: {start}a dialog for this story could be:\n" 73 | txt += " "+"\n ".join(dialog)+"\n" 74 | txt += "User: And a title?\n" 75 | txt += f"Assistant: {start}a title for this dialog could be: \n {title}\n" 76 | if theme: 77 | if random.randint(0,1) == 0: 78 | txt += "User: What would be one theme of this story?\n" 79 | else: 80 | txt += "User: a theme\n" 81 | txt += f"Assistant: One theme of this story could be: \"{theme}\"\n" 82 | elif random.randint(0,3)==0: 83 | txt = f"User: Can you write the next few lines of dialog for this scene:\n" 84 | if random.randint(0,1) == 0: 85 | txt += " "+"\n ".join(dialog[:-5])+"\n" 86 | txt += f"Assistant: {start}the next dialog for this scene could be:\n" 87 | txt += " "+"\n ".join(dialog[-5:])+"\n" 88 | elif random.randint(0,1) == 0: 89 | txt += " "+"\n ".join(dialog[:-3])+"\n" 90 | txt += f"Assistant: {start}the next dialog for this 
scene could be:\n" 91 | txt += " "+"\n ".join(dialog[-3:])+"\n" 92 | else: 93 | txt += " "+"\n ".join(dialog[:-4])+"\n" 94 | txt += f"Assistant: {start}the next dialog for this scene could be:\n" 95 | txt += " "+"\n ".join(dialog[-4:])+"\n" 96 | txt += "User: And a title?\n" 97 | txt += f"Assistant: {start}a title for this dialog could be: \n {title}\n" 98 | txt += "User: How about a short description?\n" 99 | txt += f"Assistant: {start}a short description for this dialog could be: \n {story}\n" 100 | if theme: 101 | if random.randint(0,1) == 0: 102 | txt += "User: What would be one theme of this story?\n" 103 | else: 104 | txt += "User: a theme?\n" 105 | txt += f"Assistant: One theme of this story could be: \"{theme}\"\n" 106 | elif random.randint(0,3)==0: 107 | title1 = title.split(".")[0] 108 | title2 = title.split(".")[1] 109 | txt = f"User: Can you write short story about: {title1}\n" 110 | txt += f"Assistant: {start}a short story about: \"{title1}\" could be:\n" 111 | txt += f" {story}\n" 112 | if random.randint(0,3) < 3: 113 | txt += " "+"\n ".join(dialog)+"\n" 114 | elif random.randint(0,1) == 0 and len(dialog) > 5: 115 | txt += " "+"\n ".join(dialog[:-5])+"\n" 116 | txt += f"User: Can you provide more dialog assuming \"{title2}\"?\n" 117 | txt += f"Assistant: {start}the next dialog for this scene could be:\n" 118 | txt += " "+"\n ".join(dialog[-5:])+"\n" 119 | elif random.randint(0,1) == 0: 120 | txt += " "+"\n ".join(dialog[:-3])+"\n" 121 | txt += "User: more please.\n" 122 | txt += f"Assistant: {start}the next dialog for this scene could be:\n" 123 | txt += " "+"\n ".join(dialog[-3:])+"\n" 124 | else: 125 | txt += " "+"\n ".join(dialog[:-4])+"\n" 126 | txt += f"User: Can you provide more dialog assuming \"{title2}\"?\n" 127 | txt += f"Assistant: {start}the next dialog for this scene could be:\n" 128 | txt += " "+"\n ".join(dialog[-4:])+"\n" 129 | if theme: 130 | txt += "User: What would be one theme of this story?\n" 131 | txt += f"Assistant: One theme of this story could be: \"{theme}\"\n" 132 | else: 133 | txt = f"User: Can you write a short story and dialog based on the theme:\n {theme}\n" 134 | txt += f"Assistant: {start}a short story and dialog based on the theme \"{theme}\" could be:\n" 135 | txt += f" {story}\n" 136 | txt += " "+"\n ".join(dialog)+"\n" 137 | txt += "User: And a title?\n" 138 | txt += f"Assistant: {start}a title for this dialog could be: \n {title}\n" 139 | if txt: 140 | if random.randint(0,1) == 1: 141 | txt = txt.replace("short story", "story") 142 | if random.randint(0,1) == 1: 143 | txt = txt.replace("more please", "next") 144 | if random.randint(0,1) == 1: 145 | txt = txt.replace("more please", "continue") 146 | if random.randint(0,1) == 1: 147 | txt = txt.replace("Can you ", "") 148 | txt = txt.strip("\n ") 149 | if "User2: " not in txt: 150 | txt = basic_augment(txt) 151 | if txt: 152 | output.write(json.dumps({"text": txt, 'metadata': {'source': 'soda-dialog'}})+"\n") 153 | 154 | 155 | 156 | -------------------------------------------------------------------------------- /OIG/src/squad_v2.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset 2 | import csv 3 | import json 4 | import random 5 | import tqdm 6 | 7 | 8 | def create_squad_v2(output): 9 | 10 | data = load_dataset('squad_v2') 11 | 12 | last = None 13 | prompt = '' 14 | 15 | for item in data['train']: 16 | c = item['context'] 17 | q = item['question'] 18 | if len(item['answers']['text']) > 0: 19 | a = 
item['answers']['text'][0] 20 | has_answer = True 21 | else: 22 | a = "Sorry, I cannot find a relevant answer in the given context." 23 | has_answer = False 24 | 25 | if c == last: 26 | prompt += f"\n: {q}\n: {a}" 27 | else: 28 | if last is not None: 29 | output.write(json.dumps({'text': prompt}) + '\n') 30 | 31 | last = c 32 | prompt = f": {c}\n\n{q}\n: {a}" 33 | 34 | if not has_answer: 35 | output.write(json.dumps({'text': prompt}) + '\n') 36 | last = None 37 | continue 38 | 39 | output.write(json.dumps({'text': prompt}) + '\n') -------------------------------------------------------------------------------- /OIG/src/unified_skg.py: -------------------------------------------------------------------------------- 1 | #@title UnifiedSKG 2 | """ 3 | Copyright 2023, LAION contributors, inclduing Ontocord, LLC 4 | and the other authors of OIG 5 | Licensed to the Apache Software Foundation (ASF) under one 6 | or more contributor license agreements. See the NOTICE file 7 | distributed with this work for additional information 8 | regarding copyright ownership. The ASF licenses this file 9 | to you under the Apache License, Version 2.0 (the 10 | "License"); you may not use this file except in compliance 11 | with the License. You may obtain a copy of the License at 12 | http://www.apache.org/licenses/LICENSE-2.0 13 | Unless required by applicable law or agreed to in writing, 14 | software distributed under the License is distributed on an 15 | "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 16 | KIND, either express or implied. See the License for the 17 | specific language governing permissions and limitations 18 | under the License. 19 | """ 20 | 21 | import json, random 22 | 23 | def ask_context(context, text): 24 | if context: 25 | context_arr = context.split(".") 26 | if len(context_arr) > 4: 27 | context = ".".join(context_arr[-3:]) 28 | context_arr = context_arr[:-3] 29 | else: 30 | context_arr = [] 31 | if random.randint(0,1) == 0 or not text: 32 | text += "User: Tell me more about "+context.split(".")[0]+".\nAssistant: " + context 33 | else: 34 | text += "User: Tell me more about this subject.\nAssistant: " + context 35 | for rng in range(0, len(context_arr), 3): 36 | context = ".".join(context_arr[rng:min(len(context_arr), rng+3)]) 37 | if random.randint(0,1) == 0: 38 | text += "\nUser: More please.\nAssistant: " + context 39 | elif random.randint(0,1) == 0: 40 | text += "\nUser: More.\nAssistant: " + context 41 | else: 42 | text += "\nUser: Continue.\nAssistant: " + context 43 | return text 44 | 45 | def process_unifiedskg(output): 46 | i = 0 47 | seen = {} 48 | if True: #with open("unifiedskg_instructions.jsonl", "w") as output: 49 | with open("/content/drive/Shareddrives/LAION/OIG/unifiedskg.jsonl") as input: 50 | for l in input: 51 | data = json.loads(l.strip()) 52 | context = data['context'].replace("list of", "") 53 | if context and context.lower() in data['structured data']: 54 | sd = data['structured data'].replace(context.lower(),"").strip() 55 | else: 56 | sd = data['structured data'].strip() 57 | if not sd: continue 58 | del data['structured data'] 59 | 60 | if "" in sd: 61 | continue 62 | #sd = sd.replace("", "### ").replace(" ", ", ").replace(" ","\n * ").\ 63 | # replace(" ", " * ").replace(" ", " is the ").replace("", " ").replace("", " ").replace("
", "\n").\ 64 | # replace("", "|").replace("", " ").replace(" ", " ").replace("*", "\n*").strip() 65 | #sd = sd.replace(" is the", "\n--") 66 | #sd = "\n".join([s for s in sd.split("\n") if s.strip() and s.strip()[0] != '-']) 67 | #print (sd, "**\n",data) 68 | #break 69 | #continue 70 | table_name = "" 71 | is_table = False 72 | if "col :" in sd: 73 | is_table = True 74 | table_name = sd.split(":",1)[-2].split(" col")[0].strip() 75 | if table_name == "col": table_name = "" 76 | if table_name: 77 | sd = sd.replace(table_name, "").strip(" :") 78 | table_name = table_name.replace(" | ", ": ") 79 | col_sd = sd.replace("col :", "|").split("row",1) 80 | if len(col_sd) == 1: 81 | continue 82 | col, sd = col_sd 83 | col = ("| ID | "+ col + "|\n"+ (("| --------- " )*(col.count("|") +1)) + "|\n").replace("| |", "|") 84 | sd = "| "+sd.replace("row ", "|\n| ")+" |" 85 | sd = col + sd 86 | sd = ("|".join(sd.split(":"))).replace("_", " ").replace(" ", " ").replace("| – |", "| |") 87 | if table_name: 88 | sd = "### "+table_name+"\n" + sd 89 | else: 90 | sd = sd.replace("type.object.type", "is a type of") 91 | sd = " ".join([s if "." not in s else "'s "+ s.split(".")[-1].replace("_", " ") + " is " for s in sd.split()]) 92 | sd = sorted([s.strip() for s in sd.replace("of 's ", "of ").replace("is 's", "is ").replace("s is ", "s are").split("|")]) 93 | for idx, el in enumerate(sd): 94 | el = el.strip() 95 | el = el.lstrip("+-1234567890 ") 96 | if el.startswith("'s "): el = el.lstrip("'s ") 97 | if el.startswith("is "): el = el.lstrip("is ") 98 | if el.endswith("is "): el = el.rstrip("is ") 99 | el = el.replace("is is", "is") 100 | el = el.replace("is is", "is") 101 | el = el.replace("is is", "is") 102 | el = el.replace(" 's contains are ", " contains ") 103 | el = el.replace(" population number ", " population ") 104 | el = el.replace(" 's partially containedby is ", " is partially contained by ") 105 | el = el.replace(" 's containedby ", " is contained by ") 106 | el = el.replace("containedby", "contained by") 107 | el = el.replace(" ", " ") 108 | if el.count(":") == 2: 109 | a, b, c = el.split(":") 110 | el = a + " 's " + b + " is " + c 111 | el = el.replace(" ", " ") 112 | el = el.replace("[TABLECONTEXT] : [title] : ", "") 113 | el = el.replace("[TABLECONTEXT] : ", "") 114 | el = el.replace("[", " ") 115 | el = el.replace("]", " ") 116 | el = el.strip() 117 | el = el.lstrip("+-1234567890 ") 118 | el = el.replace(" ", " ") 119 | if el.endswith(" is"): el = "" 120 | sd[idx] = el 121 | sd = [s.rstrip(".")+"." for s in sd if s] 122 | sd.sort() 123 | sd_arr = sd 124 | if random.randint(0,1) == 0: 125 | sd = "\n".join([" "+str(idx)+". 
" + s for idx, s in enumerate(sd_arr)]) 126 | else: 127 | sd = '* '+("\n* ".join(sd_arr)) 128 | if sd in seen: continue 129 | seen[sd] = 1 130 | choice = random.randint(0,5) 131 | if choice == 0: 132 | text = "" 133 | add_context = False 134 | if random.randint(0,1) == 0: 135 | text = ask_context(context, "") 136 | add_context = True 137 | text += "\nBackground:\n" + sd + "\nUser: " + data['query'] + "\nAssistant: " + data['output'] 138 | if not add_context: 139 | text = ask_context(context, text+"\n") 140 | elif choice == 1: 141 | text = "" 142 | add_context = False 143 | if random.randint(0,1) == 0: 144 | text = ask_context(context, "") 145 | add_context = True 146 | text += "\nBackground:\n" + sd + "\nUser: What is a question for which the answer is '" + data['output'] + "'\nAssistant: One question for which the answer is '"+ data['output'] + "' could be: "+ data['query'] 147 | if not add_context: 148 | text = ask_context(context, text+"\n") 149 | elif choice == 2: 150 | text = "" 151 | add_context = False 152 | if random.randint(0,1) == 0: 153 | text = ask_context(context, "") 154 | add_context = True 155 | text += "\nUser: What is one question you can ask based on this data:\n"+ sd +"\nAssistant: " + data['query'] +"\nUser: Now tell me the answer.\nAssistant: " + data['output'] 156 | if not add_context: 157 | text = ask_context(context, text+"\n") 158 | elif choice == 3: 159 | context2 = "" 160 | if context: 161 | context_arr = context.split(".") 162 | if random.randint(0,1) == 0 and len(context_arr) > 3: 163 | context2 = ".".join(context_arr[-3:]) 164 | context_arr = context_arr[:-3] 165 | context = ".".join(context_arr) 166 | else: 167 | context2 = "" 168 | text = "" 169 | add_context = False 170 | if not context2 and random.randint(0,1) == 0: 171 | text = ask_context(context, "") 172 | add_context = True 173 | instr = random.choice(["Answer using the following:", "I will give you a question and some data. Please answer the question.", "", "Here is some data.", "Read this and respond based on my instructions."]) 174 | if context2: 175 | if random.randint(0,1) == 0: 176 | text += "\nBackground:\n"+ context2+"\nUser: "+instr+"\n"+sd + "\n"+data['query'] + "\nAssistant: " + data['output'] 177 | elif random.randint(0,1) == 0: 178 | if instr == "I will give you a question and some data. Please answer the question.": instr = "" 179 | text += "\nBackground:\n"+ context2+"\nUser: "+data['query'] +"\n"+instr+ "\n" +sd + "\nAssistant: " + data['output'] 180 | elif random.randint(0,1) == 0: 181 | text += "\nUser: "+instr+" Given this context: "+ context2+"\n"+sd + "\n"+data['query'] + "\nAssistant: " + data['output'] 182 | else: 183 | text += "\nUser: Given this context: "+ context2+"\n"+instr+"\n"+sd + "\n"+data['query'] + "\nAssistant: " + data['output'] 184 | else: 185 | if random.randint(0,1) == 0: 186 | text += "\nUser: "+instr+"\n"+sd + "\n"+data['query'] + "\nAssistant: " + data['output'] 187 | else: 188 | if instr == "I will give you a question and some data. 
Please answer the question.": instr = "" 189 | text += "\nUser: "+data['query'] +"\n"+instr+ "\n" +sd + "\nAssistant: " + data['output'] 190 | if not add_context: 191 | text = ask_context(context, text+"\n") 192 | elif choice == 4 and (table_name or context): 193 | text = "" 194 | add_context = False 195 | if random.randint(0,1) == 0: 196 | text = ask_context(context, "") 197 | add_context = True 198 | if "|" in sd: 199 | fields = sd.split("\n") 200 | if sd.startswith("##"): 201 | fields = fields[1] 202 | else: 203 | fields = fields[0] 204 | fields = fields.strip(" |").replace(" | ", ", ") 205 | if random.randint(0,1) == 0: 206 | text += f"\nUser: Give me a table with the fields {fields} about "+ table_name.split(",")[0] if table_name else context.split(".")[0] + ".\nAssistant:\n"+sd +"\nUser: " + data['query'] +"\Assistant: " + data['output'] 207 | else: 208 | text += f"\nUser: Give me a table of data with the fields {fields}.\nAssistant:\n"+sd +"\nUser: " + data['query'] +"\Assistant: " + data['output'] 209 | else: 210 | text += "\nUser: Give me a list of data about "+ table_name.split(",")[0] if table_name else context.split(".")[0] + "\nAssistant:\n"+sd +"\nUser: " + data['query'] +"\Assistant: " + data['output'] 211 | if not add_context: 212 | text = ask_context(context, text+"\n") 213 | else: 214 | text = "" 215 | add_context = False 216 | if random.randint(0,1) == 0: 217 | text = ask_context(context, "") 218 | add_context = True 219 | if "|" in sd: 220 | text += "\nUser: " + data['query'] +"\nAssistant: " + data['output'] +"\nUser: Give me a table of data useful for answering this question.\nAssistant:\n"+sd 221 | else: 222 | text += "\nUser: " + data['query'] +"\nAssistant: " + data['output'] +"\nUser: Give me a list of data useful for answering this question.\nAssistant:\n"+sd 223 | if not add_context: 224 | text = ask_context(context, text+"\n") 225 | #if context: print (context, '***\n', text) 226 | text = text.strip()+"\n" 227 | output.write(json.dumps({'text': text, 'metadata': {'source': 'unifiedskg'}})+"\n") 228 | if "|" in text and '*' in text: 229 | pass 230 | #print (text) 231 | i += 1 232 | #if i > 100: break 233 | 234 | process_unifiedskg() 235 | -------------------------------------------------------------------------------- /OIG/src/unnatural_instructions.py: -------------------------------------------------------------------------------- 1 | #@title unnatural_instructions 2 | """ 3 | Copyright 2023, LAION contributors, inclduing Ontocord, LLC 4 | and the other authors of OIG 5 | Licensed to the Apache Software Foundation (ASF) under one 6 | or more contributor license agreements. See the NOTICE file 7 | distributed with this work for additional information 8 | regarding copyright ownership. The ASF licenses this file 9 | to you under the Apache License, Version 2.0 (the 10 | "License"); you may not use this file except in compliance 11 | with the License. You may obtain a copy of the License at 12 | http://www.apache.org/licenses/LICENSE-2.0 13 | Unless required by applicable law or agreed to in writing, 14 | software distributed under the License is distributed on an 15 | "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 16 | KIND, either express or implied. See the License for the 17 | specific language governing permissions and limitations 18 | under the License. 
19 | """ 20 | import json, os 21 | def create_unatural_instructions(output): 22 | %cd /content/ 23 | !git clone https://github.com/orhonovich/unnatural-instructions 24 | if not os.path.exists("full_data.jsonl"): 25 | os.system("unzip /content/unnatural-instructions/data/full_data.zip") 26 | j = 0 27 | instruction_with_output = [] 28 | with open("full_data.jsonl") as input: 29 | for l in input: 30 | dat = json.loads(l) 31 | #print (dat) 32 | instruction_with_output.extend([(dat2['instruction_with_input'], dat2['output']) for dat2 in dat.get('reformulations',[])]) 33 | instruction_with_output.extend([(dat2['instruction_with_input'], dat2['output']) for dat2 in dat.get('instances',[])]) 34 | instruction_with_output = list(set(instruction_with_output)) 35 | import json 36 | i = 0 37 | if True: 38 | for a, b in instruction_with_output: 39 | a = a.strip() 40 | a = a.replace("", "\n").replace("?", "?").replace("?", "?") 41 | b = b.strip() 42 | b = b.replace("", "\n").replace("?", "?").replace("?", "?") 43 | if b.count("?") == 1: 44 | if b[-1] not in "?": 45 | continue 46 | output.write(json.dumps({'text': "User: "+ a+"\nAssistant: "+ b, 'metadata': {'source': 'unatural_instructions'}})+"\n") 47 | 48 | #!cp una* /content/drive/Shareddrives/LAION/OIG/ 49 | -------------------------------------------------------------------------------- /OIG/src/xp3.py: -------------------------------------------------------------------------------- 1 | #@title XP3 2 | """ 3 | Copyright 2023, LAION contributors, inclduing Ontocord, LLC 4 | and the other authors of OIG 5 | Licensed to the Apache Software Foundation (ASF) under one 6 | or more contributor license agreements. See the NOTICE file 7 | distributed with this work for additional information 8 | regarding copyright ownership. The ASF licenses this file 9 | to you under the Apache License, Version 2.0 (the 10 | "License"); you may not use this file except in compliance 11 | with the License. You may obtain a copy of the License at 12 | http://www.apache.org/licenses/LICENSE-2.0 13 | Unless required by applicable law or agreed to in writing, 14 | software distributed under the License is distributed on an 15 | "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 16 | KIND, either express or implied. See the License for the 17 | specific language governing permissions and limitations 18 | under the License. 19 | """ 20 | import spacy 21 | import glob, os, json 22 | import random 23 | for file in glob.glob("/content/drive/Shareddrives/LAION/xp3/*"): 24 | name = file.split("/")[-1] 25 | lang = name.split("_")[-1].split(".")[0] 26 | if lang not in {'en',}: continue 27 | if not os.path.exists(name): 28 | os.system(f"cp {file} ./") 29 | for idx, l in enumerate(open(name)): 30 | if idx > 10: break 31 | data = json.loads(l.strip()) 32 | print (data) 33 | inputs, targets = data["inputs"], data["targets"] 34 | inputs = inputs.replace("\\n","\n") 35 | targets = targets.replace("\\n","\n") 36 | inputs = inputs.replace("-lrb-", "(").replace("-rrb-", ")") 37 | inputsHash = {} 38 | instr = "" 39 | for inpt in inputs.split("\n"): 40 | if ":" in inpt: 41 | aspect, inpt = inpt.split(":",1) 42 | aspect = aspect.strip() 43 | if " " not in aspect and inpt != '': 44 | aspect = aspect.strip() 45 | inpt = inpt.strip().replace(".", ". 
") 46 | inputsHash[aspect] = inpt 47 | print (inputsHash) 48 | if "The previous" in inputs: 49 | text, instr = inputs.split("The previous" , 1) 50 | instr = "The previous" + instr 51 | elif "What is" in inputs: 52 | text, instr = inputs.split("What is",1) 53 | instr = "What is"+instr 54 | elif "\nGive" in inputs: 55 | text, instr = inputs.split("\nGive",1) 56 | instr = "Give"+instr 57 | elif "\nWrite" in inputs: 58 | text, instr = inputs.split("\nWrite",1) 59 | instr = "Write"+instr 60 | elif "\nGenerate" in inputs: 61 | text, instr = inputs.split("\nGenerate",1) 62 | instr = "Generate"+instr 63 | elif "\nProvide" in inputs: 64 | text, instr = inputs.split("\nProvide",1) 65 | instr = "Provide"+instr 66 | elif "\n##" in inputs : 67 | text, instr = inputs.split("\n\n",1) 68 | instr = instr.replace("#", "") 69 | elif "?" in inputs: 70 | if inputs[-1] == "?": 71 | instr = "".join(reversed("".join(reversed(inputs)).split(".",1)[0])) 72 | else: 73 | instr = inputs.split("?",1)[0].split(".")[-1] 74 | instr = instr+"?" 75 | text = inputs.replace(instr, "") 76 | elif "\n\n" in inputs: 77 | text, instr = inputs.split("\n\n",1) 78 | elif ":" in inputs: 79 | instr = inputs.split(":")[-2] 80 | text = inputs.replace(instr+":", "") 81 | elif inputs[-1] == ".": 82 | instr = inputs.split(".")[-2] 83 | text = inputs.replace(instr+".", "") 84 | else: 85 | instr = inputs.split(".")[-1] 86 | text = inputs.replace(instr+".", "") 87 | prefix = genre = "" 88 | if ":" in text: 89 | prefix, text = text.split(":",1) 90 | if ":" in prefix: 91 | tmp = prefix.split(":")[-2].strip() 92 | if " " not in tmp or len(tmp) < 50: 93 | genre = tmp 94 | else: 95 | prefix = "" 96 | genre = genre.strip() 97 | print ("**", genre) 98 | if genre == "Answer": genre = "" 99 | prefix = prefix.strip().replace(genre+":","").strip() 100 | instr = instr.replace("The same text in", ("Given " + prefix+", translate to") if prefix else "Translate to").replace(genre+":","") 101 | instr = instr.replace("Summary ", "Summarize ") 102 | instr = instr.replace("Here is a translation ", "Translate to ").replace("Here is the same text in ", "Translate to ").replace(" to to ", " to ").strip() 103 | text = text.replace(genre+":","").replace(".", ". ").replace(" ", " ").strip() 104 | if len(instr) > len(text): 105 | tmp = text 106 | instr = text 107 | text = instr 108 | print ("##") 109 | orig_key = "" 110 | span_1 = "" 111 | span2_2 = "" 112 | span_summary = "" 113 | modified_inputs = "" 114 | modified_inputs_with_summary = "" 115 | instr = instr.replace("following", "").replace(" ", " ") 116 | instr = instr.replace("above", "").replace(" ", " ") 117 | if not instr or not text or not targets: continue 118 | 119 | # fill in the blanks - using a summary or the actual text 120 | if False: # random.randint(0,1) == 0: 121 | val = text.split(".") 122 | len_val = len(val) 123 | remove_1 = random.randint(0,len_val-1) 124 | remove_2 = (remove_1 + 2) % len_val 125 | span_1 = val[remove_1].strip() 126 | val[remove_1] = "" 127 | span_2 = val[remove_2].strip() 128 | val[remove_2] = "" 129 | modified_inputs = ".".join(val).replace(" .", ".").strip() 130 | if (span_1 or span_2) and len(val) > 10: 131 | span_summary = run_model("summarize: " + span_1 + ". 
" + span_2, t5_model, t5_tokenizer, max_length=512)[0] 132 | print ('###', span_summary, '##', span_1, '###', span_2) 133 | val[remove_1] = span_summary 134 | modified_inputs_with_summary = ".".join(val).replace(" .", ".").strip() 135 | 136 | 137 | if random.randint(0,1) == 0 and len(inputsHash) >= 1: # 138 | items = list(inputsHash.items()) 139 | items.sort(key=lambda a: len(a[1]), reverse=True) 140 | orig_key, val = items[0] 141 | 142 | print (f"User: What kind of text is this?\n{val}\n\nAssistant: This appears to be a type of {orig_key}.") 143 | items = items[1:] 144 | if items and random.randint(0,1) == 0: 145 | key, val = items[0] 146 | print (f"User: What is a possible {key} for this {orig_key}? \n\nAssistant: A possible {key} for this {orig_key} could be '{val}'.") 147 | 148 | if genre: 149 | targets = "Here is a "+genre + ". " + targets 150 | text = text.replace("\n\n", "\n").replace("\n\n", "\n").replace("\n\n", "\n").strip() 151 | targets = targets.replace(" .", ".").replace(" ", " ").replace("\n\n", "\n").replace("\n\n", "\n").replace("\n\n", "\n").strip() 152 | if orig_key: 153 | print (f"User: {instr}\n\nAssistant: {targets}") 154 | else: 155 | choice = random.randint(0,10) 156 | if choice == 0: 157 | print (f"User: {instr}\n{text}\n\nAssistant: {targets}") 158 | elif choice == 1: 159 | print (f"User: Please follow my instructions based on the following text: {text}\n{instr}\n\nAssistant: {targets}") 160 | elif choice == 2: 161 | print (f"User: I will give you some text and my instruction:\n{text}\n{instr}\n\nAssistant: {targets}") 162 | else: 163 | print (f"User: {text}\n{instr}\n\nAssistant: {targets}") 164 | 165 | #os.system(f"rm {name}") 166 | #break 167 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Open-Instruction-Generalist Dataset 2 | 3 | Open Instruction Generalist (OIG) Dataset is intended to train assistants that are part of the LAION-AI's family of assistants. OIG Assistants will be trained on the OIG dataset, a massive synthetic instructions with the goal of performing many diverse types of tasks. 4 | 5 | We will have several versions of the OIG Assistant dataset ranging from a small (less than 1M) high quality synthetic dataset, to a massive synthetic instruction dataset. The research goal of OIG Assistant is to create high performing bots by using simple finetuning instead of RLHF. 6 | 7 | We will create ever larger instruction datasets with the goal to generate eventually 1T medium quality tokens of instructions. The receipe for training is to do additional pretrain on some subset of the larger instruction sets, followed by a finetune on OIG-small or some other high quality small dataset. 8 | 9 | * OIG-small-chip2 (200K) - Done and released. See small_instruction_set sub-directory. 10 | * OIG-40M - Done - Done and released. See 40M sub-directory 11 | 12 | # OIG-Moderation 13 | 14 | We have also created a small subset of safety data to tag instructions for moderation. This dataset was created by volunteers and also curated and augmented from public datasets (see https://huggingface.co/datasets/ontocord/OIG-moderation) 15 | 16 | * OIG_safety_v0.1.jsonl (66K) 17 | * OIG_safety_v0.2.jsonl (134K) 18 | 19 | ## Related Projects 20 | * Check out LAION's [Open Assistant Project](https://github.com/LAION-AI/Open-Assistant). We aim to build a chatbot based on RLHF and human feedback data. 
21 | * Check out our friends Together.xyz's [OpenChatKit](https://github.com/togethercomputer/OpenChatKit). They trained a bot based on OIG! 22 | 23 | ## Models 24 | The community has trained several models based on a subset of the OIG datasets including: 25 | 26 | - Rallio67/joi2_(20,12,7)B_instruct_alpha 27 | - Rallio67/chip2_(20,12,7)B_instruct_alpha 28 | - Rallio67/joi_(20,12,7)B_instruct_alpha 29 | - Rallio67/chip_(20,12,7)B_instruct_alpha 30 | - togethercomputer/GPT-NeoXT-Chat-Base-20B 31 | 32 | ### Safety models 33 | 34 | - SummerSigh/T5-Base-Rule-Of-Thumb 35 | - SummerSigh/Safety-Policy 36 | - SummerSigh/BART-Base-Rule-Of-Thumb 37 | - shahules786/prosocial-classifier 38 | - shahules786/Safetybot-mt5-base 39 | - shahules786/Safetybot-T5-base 40 | - togethercomputer/GPT-JT-Moderation-6B 41 | 42 | Available on huggingface.co. 43 | -------------------------------------------------------------------------------- /small_instruction_set/Readme.md: -------------------------------------------------------------------------------- 1 | # A new dataset for instruction tuning large language models 2 | The purpose of this dataset is to make it easy to convert a language model pretrained on large amounts of text into an instruction following model using a small amount of additional compute via finetuning or softprompt tuning. 3 | 4 | Many additional datasets are being prepared by various community members and will be incorporated into this dataset as we are able to verify the quality and formatting of the data. Our goal is to make helpful and non-toxic instruction tuned models available to anyone with a pretrained large language model. 5 | 6 | # Disclaimer 7 | These datasets contain synthetic data and in some cases data that includes humans trying to get the language model to say toxic/offensive/trolling things. If you are concerned about the presence of this type of material in the dataset please make sure you carefully inspect each of the entries and filter appropriately. Our goal is for the model to be as helpful and non-toxic as possible and we are actively evaluating ways to reduce or eliminate undesirable content from the instruction tuning datasets. 8 | 9 | # Current dataset 10 | We have seen some promising capabilities from instruction tuning with the following mix of datasets that are derived from datasets available online. 11 | 12 | The files for this data are in json format as a list of tuples where each tuple is (prompt,agent_response) 13 | instruction_tuning_dataset_alpha_part1.json 14 | instruction_tuning_dataset_alpha_part2.json 15 | 16 | Python Code Examples (~6,000): 17 | --------------------- 18 | 19 | A set of instruction / response pairs where the User requests the agent to generate a python function. These examples were generated using a large language model and few shot prompting with python code verified to execute. 20 | There are also ~3000 examples of manually curated one line python code examples from the Conala publication (see: https://conala-corpus.github.io/) 21 | 22 | Natural Instruction Examples (~124,000): 23 | ------------------------------------- 24 | 25 | A balanced set of diverse natural and factual questions and answers made using few shot prompted UL2 20B and an instruction tuned GPT-NeoX-20B model (Chip) and then rejection sampled using multiple automatic evaluations to remove low quality outputs and to filter out factually inaccurate answers. 
Also includes some filtered natural instructions from Anthropic Helpful instructions (see: https://github.com/anthropics/hh-rlhf). 26 | 27 | Generic Harmless Instruction Examples (~6,500): 28 | -------------------------------------- 29 | 30 | A set of instruction / response pairs sourced from the Anthropic redteam paper github (see: https://github.com/anthropics/hh-rlhf). 31 | This dataset includes a lot of data regarding real humans trying to make the Anthropic language models say harmful/toxic/trolling things. 32 | For this dataset only examples that were rated lowly on the harmful scale (0,1,2 out of 4, where 4 is the most toxic) were included. Again, only the first lines of dialogue (instruction, first_agent_response) were retained. 33 | 34 | Instruction/Responses with Lists (~14,000): 35 | ----------------------------------- 36 | A set of filtered and reformatted instruction / response pairs where the agent response contains a list. Sourced from the Anthropic github (see: https://github.com/anthropics/hh-rlhf). Sourced from wikihow text lists created by b-mc2 (https://huggingface.co/datasets/b-mc2/wikihow_lists). And rejection filtered instruction response pairs generated by Chip20B that contained lists. All lists are formatted in a similar style. 37 | 38 | Follow-up questions (~12,500): 39 | ----------------------------------- 40 | Examples of instructions and responses where an appropriate response is to ask for more information from the prompter. These examples were generated from a combination of few shot prompted UL2 20B (to generate natural questions) and a large dialogue prompted language model to generate the responses containing follow-up questions. 41 | 42 | Wikipedia Toxic Adversarial Questions (~12,000): 43 | ----------------------------------- 44 | Questions and answers generated from wikipedia articles that discuss potentially sensitive topics (flagged as potentially toxic by an early toxicity detection model). 45 | 46 | Grade School Math GSM8K (~9,000): 47 | ----------------------------------- 48 | GSM8K is a dataset of 8.5K high quality linguistically diverse grade school math word problems created by human problem writers. The dataset is segmented into 7.5K training problems and 1K test problems. These problems take between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − ×÷) to reach the final answer. A bright middle school student should be able to solve every problem. It can be used for multi-step mathematical reasoning. (https://github.com/openai/grade-school-math) 49 | 50 | Reasoning Instructions (~4,500): 51 | ----------------------------------- 52 | Examples from the Com2Sense and Strategy QA datasets that were reformatted into natural instructions using large language models with few shot prompting and additional quality filtering steps. 53 | 54 | Character and Scene Descriptions (~30,000): 55 | ----------------------------------- 56 | Examples of instructions and responses for the generation of character or scene descriptions. Scenes were sourced from video game wikis and reformatted into instruction / response format using large language models or generated by few shot prompting with large language models. 57 | --------------------------------------------------------------------------------
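A minimal sketch of how the small_instruction_set JSON files described above (a list of (prompt, agent_response) pairs) could be flattened into jsonl records with a 'text' field, mirroring the records emitted by the scripts in /OIG/src. The input file name, output path, and the simple newline join are illustrative assumptions, not part of the released data format.

```python
import json

def chip2_tuples_to_jsonl(in_path="instruction_tuning_dataset_alpha_part1.json",
                          out_path="oig_small_chip2.jsonl"):
    # The input is expected to be a JSON list of (prompt, agent_response) pairs.
    with open(in_path) as f:
        pairs = json.load(f)
    with open(out_path, "w") as out:
        for prompt, response in pairs:
            # Join each pair into a single 'text' field; adjust the separator to
            # whatever prompt format your training pipeline expects.
            text = prompt.strip() + "\n" + response.strip()
            out.write(json.dumps({"text": text}) + "\n")

if __name__ == "__main__":
    chip2_tuples_to_jsonl()
```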