├── .github
│   └── ISSUE_TEMPLATE
│       ├── oig--new-data-source.md
│       └── openchatkit--bug-report.md
├── README.md
├── configs
│   ├── training_22_2_27.yaml
│   ├── training_22_2_28.yaml
│   └── training_22_3_1.yaml
├── data
│   ├── README.md
│   ├── dialogue_soda.yaml
│   ├── mmlu-cot.yaml
│   ├── ni.yaml
│   ├── p3.yaml
│   ├── pile.yaml
│   ├── sec-10k-10q-2015-2021.yaml
│   ├── unified_abstact_infill.yaml
│   ├── unified_basic.yaml
│   ├── unified_chip2.yaml
│   ├── unified_conv_finqa.yaml
│   ├── unified_cot_instructions.yaml
│   ├── unified_cuad.yaml
│   ├── unified_essays_with_instructions.yaml
│   ├── unified_flan.yaml
│   ├── unified_grade_school_math_instructions.yaml
│   ├── unified_hc3_human.yaml
│   ├── unified_image_prompts_instructions.yaml
│   ├── unified_joke_explanations.yaml
│   ├── unified_lyrics.yaml
│   ├── unified_merged_code_xp3.yaml
│   ├── unified_multi_news.yaml
│   ├── unified_ni.yaml
│   ├── unified_nq.yaml
│   ├── unified_openai_summarize_tldr.yaml
│   ├── unified_oscar_en_sample_dialog.yaml
│   ├── unified_p3.yaml
│   ├── unified_plot_screenplay_books_dialog.yaml
│   ├── unified_poetry_instructions.yaml
│   ├── unified_rosey_and_prosocial_plus_safety.yaml
│   ├── unified_scitldr.yaml
│   ├── unified_soda_dialog.yaml
│   ├── unified_sqlv1.yaml
│   ├── unified_sqlv2.yaml
│   ├── unified_squad_v2.yaml
│   ├── unified_squad_v2_more_neg.yaml
│   ├── unified_ul2_plus_oscar_en_sample_dialog.yaml
│   ├── unified_unatural_instructions.yaml
│   └── unified_unifiedskg_instructions.yaml
└── models
    ├── oig_v0.11.yaml
    └── oig_v0.13.yaml

/.github/ISSUE_TEMPLATE/oig--new-data-source.md:
--------------------------------------------------------------------------------
---
name: 'OIG: New Data Source'
about: New Data Source to Improve OIG
title: ''
labels: new data
assignees: ''

---

### Data Source

```
pretty_name: ___NAME___
license:
  - ___LICENSE___
language:
  - ___LANGUAGE___
multilinguality:
  - ___monolingual/multilingual___
download_link:
  - ___URL___
source:
  - ___SOURCE___
task_types:
  - ___e.g., dialogue___
processed_by:
  - ___USER___
description:
  - ___DESCRIPTION___
```

### Example Prompts that you believe this data source will perform well on (or issues that you think this data source will fix)

__________

--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/openchatkit--bug-report.md:
--------------------------------------------------------------------------------
---
name: 'OpenChatKit: Bug Report'
about: Create a report to help us improve
title: ''
labels: bug
assignees: ''

---

**Describe the bug**

A clear and concise description of what the bug is.

**OIG Conversation**

Copy your conversation with the OIG bot that triggers this behavior.

**Expected behavior**

Write down what you are expecting OIG to output.

**OIG Version**

Tell us the version of OIG that you are using.

**Potential Fix**

Do you know of potential data sources that we should include to improve OIG?

**Additional context**

Add any other context about the problem here.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# OpenDataHub

This repository contains the current snapshot of the OpenChatKit bot. You can find all training data in `data`,
the hyperparameters used for training in `training.yaml`, the training log in `training_log`,
and the pointer to the model at `model.yaml`.

Different branches hold different specialized versions of this bot.

You can make it better by contributing data!

## Data Model

How should we think about the training data for OpenChatKit bots? A _training set_ is a _set_ of _slices_,
where each _slice_ contains a set of (input, output) pairs.
Each slice corresponds to one file
in the `data` folder.

For example, if the data folder contains
```
data
|- pile.yaml
|- soda.yaml
```
during training, the training set will contain the union of both `pile` and `soda`.
Note that different slices can be weighted differently, which will be specified in
the file `training.yaml` (see "Model Training" for details).

### Data Format

You can provide data in various formats.

1. You can provide a collection of input/output pairs
```
IOPairs:
  - input: INPUT TEXT STRING
    output: OUTPUT TEXT STRING
  - input: INPUT TEXT STRING
    output: OUTPUT TEXT STRING
  ...
```
or pure text
```
Text:
  - text: TEXT STRING
  - text: TEXT STRING
  ...
```

2. You can provide us the link to your dataset on HuggingFace
```
HuggingFace:
  - link: LINK TO YOUR DATASET
```

3. You can prepare your dataset in OpenAI JSONL format (https://platform.openai.com/docs/guides/fine-tuning)
and put it at a link that we can `wget` or `curl`
```
OpenAIJsonl:
  - link: LINK TO YOUR DATASET
```

## Model Training

Each merged pull request will trigger (currently manually) the training of a model.
Hyper-parameters, including the specific mixture of data, will be specified in `training.yaml`:
```
Training:
  - lr: 0.0001
  - momentum: 0.99
Mixture:
  - pile: 0.5
  - soda: 0.5
```
After training, a file `training_log` will be committed to the repository, and a file
`model.yaml` will be made available in the repository specifying where to find this model
and (optionally) a Together API endpoint to query it.

## How to Contribute?

You can help us to make OpenChatKit better in three ways.

### Finding "Bugs"

If you realize that the bot is not performing well, please open an issue, specifying
your input, the bot's output, and a description of what is wrong with it (potentially with the right answer).

### Fixing "Bugs"

If you have data that you believe could be useful to fix some of the issues, please
add your data into the `data` folder and make a pull request associated with the issue
that you think this will fix.

We will review these pull requests, train a model, and merge them.

### Specialization

You don't have to always merge into the main branch. If you have specific things to
try out (e.g., a `text2sql` bot), feel free to open a new branch and work there!

Let's work together to make the best open-source bot!

--------------------------------------------------------------------------------
/configs/training_22_2_27.yaml:
--------------------------------------------------------------------------------
optimizer:
  - optimizer_type: 8bit-adam
  - learning_rate: 1e-6
lr_scheduler:
  - lr_scheduler_type: linear
  - peak_learning_rate: 1e-6
  - warmup_steps: 10
parallel_training:
  - world_size: 16
  - pipline_parallel:
      - pipeline_group_size: 8
      - pipeline_type: gpipe
  - data_parallel:
      - data_group_size: 2
      - data_parallel_type: cocktail_sgd
batch_size:
  - per_data_worker_batch_size: 128
  - global_batch_size: 256
sequence_length:
  - 2048
precision:
  - mixed precision fp16
training_data:
  - unified_ni:0.2
  - unified_p3:0.5
  - unified_flan:0.2
  - unified_chip2:0.01
  - unified_rosey_and_prosocial_plus_safety:0.1
  - unified_soda_dialog:0.1
  - unified_unifiedskg_instructions:0.1
  - unified_merged_code_xp3:0.1
  - unified_oscar_en_sample_dialog:0.1
  - unified_ul2_plus_oscar_en_sample_dialog:0.1
  - unified_multi_news:0.01
  - unified_openai_summarize_tldr:0.01
  - unified_scitldr:0.01
  - unified_squad_v2:0.01
  - unified_nq:0.01
  - unified_poetry_instructions:0.01
  - unified_sqlv2:0.01
  - unified_unatural_instructions:0.01
  - unified_conv_finqa:0.01
  - unified_lyrics:0.01
  - unified_essays:0.01
  - unified_plot_screenplay_books_dialog:0.01
  - unified_grade_school_math_instructions:0.01
  - unified_cot_instructions:0.01
  - unified_joke_explanations:0.01
  - unified_cuad:0.01
--------------------------------------------------------------------------------
/configs/training_22_2_28.yaml:
--------------------------------------------------------------------------------
optimizer:
  - optimizer_type: 8bit-adam
  - learning_rate: 1e-6
lr_scheduler:
  - lr_scheduler_type: linear
  - peak_learning_rate: 1e-6
  - warmup_steps: 10
parallel_training:
  - world_size: 16
  - pipline_parallel:
      - pipeline_group_size: 8
      - pipeline_type: gpipe
  - data_parallel:
      - data_group_size: 2
      - data_parallel_type: cocktail_sgd
batch_size:
  - per_data_worker_batch_size: 128
  - global_batch_size: 256
sequence_length:
  - 2048
precision:
  - mixed precision fp16
training_data:
  - unified_ni:0.2
  - unified_p3:0.5
  - unified_flan:0.2
  - unified_chip2:0.01
  - unified_rosey_and_prosocial_plus_safety:0.1
  - unified_soda_dialog:0.1
  - unified_unifiedskg_instructions:0.1
  - unified_merged_code_xp3:0.1
  - unified_oscar_en_sample_dialog:0.1
  - unified_ul2_plus_oscar_en_sample_dialog:0.1
  - unified_multi_news:0.05
  - unified_openai_summarize_tldr:0.05
  - unified_scitldr:0.05
  - unified_squad_v2:0.01
  - unified_nq:0.01
  - unified_poetry_instructions:0.01
  - unified_sqlv2:0.01
  - unified_unatural_instructions:0.01
  - unified_conv_finqa:0.01
  - unified_lyrics:0.01
  - unified_essays:0.01
  - unified_plot_screenplay_books_dialog:0.01
  - unified_grade_school_math_instructions:0.01
  - unified_cot_instructions:0.01
  - unified_joke_explanations:0.01
  - unified_cuad:0.01
  - unified_abstact_infill:0.1
  - unified_image_prompts_instructions:0.01
--------------------------------------------------------------------------------
/configs/training_22_3_1.yaml:
--------------------------------------------------------------------------------
optimizer:
  - optimizer_type: 8bit-adam
  - learning_rate: 1e-6
lr_scheduler:
  - lr_scheduler_type: linear
  - peak_learning_rate: 1e-6
  - warmup_steps: 10
parallel_training:
  - world_size: 16
  - pipline_parallel:
      - pipeline_group_size: 8
      - pipeline_type: gpipe
  - data_parallel:
      - data_group_size: 2
      - data_parallel_type: cocktail_sgd
batch_size:
  - per_data_worker_batch_size: 128
  - global_batch_size: 256
sequence_length:
  - 2048
precision:
  - mixed precision fp16
training_data:
  - unified_ni:0.2
  - unified_p3:0.5
  - unified_flan:0.2
  - unified_chip2:0.01
  - unified_rosey_and_prosocial_plus_safety:0.1
  - unified_soda_dialog:0.1
  - unified_unifiedskg_instructions:0.1
  - unified_merged_code_xp3:0.1
  - unified_oscar_en_sample_dialog:0.1
  - unified_ul2_plus_oscar_en_sample_dialog:0.1
  - unified_multi_news:0.05
  - unified_openai_summarize_tldr:0.05
  - unified_scitldr:0.05
  - unified_squad_v2:0.01
  - unified_nq:0.01
  - unified_poetry_instructions:0.01
  - unified_sqlv2:0.01
  - unified_unatural_instructions:0.01
  - unified_conv_finqa:0.01
  - unified_lyrics:0.01
  - unified_essays:0.01
  - unified_plot_screenplay_books_dialog:0.01
  - unified_grade_school_math_instructions:0.01
  - unified_cot_instructions:0.01
  - unified_joke_explanations:0.01
  - unified_cuad:0.01
  - unified_abstact_infill:0.1
  - unified_image_prompts_instructions:0.01
--------------------------------------------------------------------------------
/data/README.md:
--------------------------------------------------------------------------------


--------------------------------------------------------------------------------
/data/dialogue_soda.yaml:
--------------------------------------------------------------------------------
pretty_name: SODA
license:
  - cc-by-4.0
language:
  - en
multilinguality:
  - monolingual
download_link:
  - https://drive.google.com/file/d/1JYpAS8s6pERQCG-SgiQxf7X3989TdIc4/view?usp=share_link
source:
  - https://huggingface.co/datasets/allenai/soda
task_types:
  - dialogue
description:
  - Dialogue data from SODA
--------------------------------------------------------------------------------
/data/mmlu-cot.yaml:
--------------------------------------------------------------------------------
pretty_name: mmlu-cot
license:
  - other
language:
  - en
multilinguality:
  - monolingual
download_link:
  - refer to source
source:
  - https://github.com/jasonwei20/flan-2/blob/main/mmlu-cot.json
task_types:
  - chain-of-thought
description:
  - MMLU-COT
--------------------------------------------------------------------------------
/data/ni.yaml:
--------------------------------------------------------------------------------
pretty_name: Natural Instructions
license:
  - apache-2.0
language:
  - en
multilinguality:
  - monolingual
download_link:
  - refer to source
source:
  - https://github.com/allenai/natural-instructions
task_types:
  - instruction-tuning
description:
  - Natural Instructions
--------------------------------------------------------------------------------
/data/p3.yaml:
--------------------------------------------------------------------------------
pretty_name: P3
license:
  - apache-2.0
language:
  - en
multilinguality:
  - monolingual
download_link:
  - refer to source
source:
  - https://huggingface.co/datasets/Muennighoff/P3
task_types:
  - instruction-tuning
description:
  - P3
--------------------------------------------------------------------------------
/data/pile.yaml:
--------------------------------------------------------------------------------
pretty_name: The Pile
license:
  - other
language:
  - en
multilinguality:
  - monolingual
download_link:
  - refer to source
source:
  - https://huggingface.co/datasets/the_pile
task_types:
  - language-modeling
description:
  - The pile.
--------------------------------------------------------------------------------
/data/sec-10k-10q-2015-2021.yaml:
--------------------------------------------------------------------------------
pretty_name: SEC filings 10K and 10Q from 2015 to 2021
license:
  - None
language:
  - en
multilinguality:
  - monolingual
download_link:
  - https://drive.google.com/file/d/1NxquT_niWLXKqbR1bBd8nfhhMv7EKMGY/view?usp=share_link
source:
  - None
task_types:
  - language-modeling
description:
  - SEC data
--------------------------------------------------------------------------------
/data/unified_abstact_infill.yaml:
--------------------------------------------------------------------------------
pretty_name: abstact infill
license:
  - None
language:
  - en
multilinguality:
  - monolingual
download_link:
  - https://drive.google.com/file/d/1fcDxjB-RblIMmEF5JiBFRH-ag5eAXJ1-/view?usp=drivesdk
source:
  - None
task_types:
  - dialogue
processed_by:
  - Huu Nguyen
description:
  - None
--------------------------------------------------------------------------------
/data/unified_basic.yaml:
--------------------------------------------------------------------------------
pretty_name: Basic Instructions
license:
  - None
language:
  - en
multilinguality:
  - monolingual
download_link:
  - https://drive.google.com/file/d/1Nbz4nJH3xNV0G5tC98-PgyUBnBgy-5AH/view?usp=drivesdk
source:
  - None
processed_by:
  - Jue
task_types:
  - dialogue
description:
  - ": xxxx\n: yyyy"
--------------------------------------------------------------------------------
/data/unified_chip2.yaml:
--------------------------------------------------------------------------------
pretty_name: Unified Chip2
license:
  - None
language:
  - en
multilinguality:
  - monolingual
download_link:
  - https://drive.google.com/file/d/1vTORgiROVdNSQdncn33Vj8BP5OAbVR_i/view?usp=drivesdk
source:
  - None
processed_by:
  - Laion
task_types:
  - dialogue
description:
  - ": xxxx\n: yyyy"
--------------------------------------------------------------------------------
/data/unified_conv_finqa.yaml:
--------------------------------------------------------------------------------
pretty_name: Unified ConvFinQA
license:
  - mit
language:
  - en
multilinguality:
  - monolingual
download_link:
  - https://drive.google.com/file/d/1P2LLqYrQ-RKvB1ZsGUJ1X2SeUYsslsrH/view?usp=drivesdk
source:
  - https://github.com/czyssrs/ConvFinQA
processed_by:
  - Jue
task_types:
  - dialogue
description:
  - ": xxxx\n: yyyy"
--------------------------------------------------------------------------------
/data/unified_cot_instructions.yaml:
--------------------------------------------------------------------------------
pretty_name: Unified cot_instructions
license:
  - None
language:
  - en
multilinguality:
  - monolingual
download_link:
  - https://drive.google.com/file/d/1ek-A21VoYtRoVEV3uHbRiAIu_lvBgWTT/view?usp=drivesdk
source:
  - None
processed_by:
  - Huu Nguyen
task_types:
  - dialogue
description:
  - ": xxxx\n: yyyy"
--------------------------------------------------------------------------------
/data/unified_cuad.yaml:
--------------------------------------------------------------------------------
pretty_name: unified_cuad.jsonl
license:
  - None
language:
  - en
multilinguality:
  - monolingual
download_link:
  - https://drive.google.com/file/d/1DWljUFjvJxppKdgEyucqkz7NSah_a46z/view?usp=drivesdk
source:
  - None
processed_by:
  - Huu Nguyen
task_types:
  - dialogue
description:
  - ": xxxx\n: yyyy"
--------------------------------------------------------------------------------
/data/unified_essays_with_instructions.yaml:
--------------------------------------------------------------------------------
pretty_name: Unified essays
license:
  - None
language:
  - en
multilinguality:
  - monolingual
download_link:
  - https://drive.google.com/file/d/1Bm640IfCN-XAKWZkt78XKi49Z2tAxyaM/view?usp=drivesdk
source:
  - None
processed_by:
  - ChristophSchuhmann
task_types:
  - dialogue
description:
  - ": xxxx\n: yyyy"
--------------------------------------------------------------------------------
/data/unified_flan.yaml:
--------------------------------------------------------------------------------
pretty_name: Flan in dialogue
license:
  - apache-2.0
language:
  - en
multilinguality:
  - monolingual
download_link:
  - https://drive.google.com/file/d/1TmUU4HsfwlbZjdJZHbO0FP2dU9ACKLdP/view?usp=drivesdk
source:
  - https://huggingface.co/datasets/Muennighoff/flan
processed_by:
  - Jue
task_types:
  - dialogue
description:
  - ": xxxx\n: yyyy"
--------------------------------------------------------------------------------
/data/unified_grade_school_math_instructions.yaml:
--------------------------------------------------------------------------------
pretty_name: Unified grade_school_math_instructions
license:
  - None
language:
  - en
multilinguality:
  - monolingual
download_link:
  - https://drive.google.com/file/d/1vdTt2Gu45ZHEDrHl7iplW7ppRE_-I_Kt/view?usp=drivesdk
source:
  - None
processed_by:
  - qwedsacf
task_types:
  - dialogue
description:
  - ": xxxx\n: yyyy"
--------------------------------------------------------------------------------
/data/unified_hc3_human.yaml:
--------------------------------------------------------------------------------
pretty_name: Unified HC3 human
license:
  - cc-by-sa-4.0
language:
  - en
multilinguality:
  - monolingual
download_link:
  - https://drive.google.com/file/d/1oX3b99uozKCe6wQRyd4idx16KI4-5meU/view?usp=drivesdk
source:
  - https://huggingface.co/datasets/Hello-SimpleAI/HC3
processed_by:
  - Jue
task_types:
  - dialogue
description:
  - ": xxxx\n: yyyy"
--------------------------------------------------------------------------------
/data/unified_image_prompts_instructions.yaml:
--------------------------------------------------------------------------------
pretty_name: image prompts instructions
license:
  - None
language:
  - en
multilinguality:
  - monolingual
download_link:
  - https://drive.google.com/file/d/1L0d1MFQNhlQGGHEsuV61bCLyNR4juOGG/view?usp=drivesdk
source:
  - None
task_types:
  - dialogue
processed_by:
  - Huu Nguyen
description:
  - None
--------------------------------------------------------------------------------
/data/unified_joke_explanations.yaml:
--------------------------------------------------------------------------------
pretty_name: unified_joke_explanations.jsonl
license:
  - None
language:
  - en
multilinguality:
  - monolingual
download_link:
  - https://drive.google.com/file/d/1ALCvo54v6hDBcbfr2tLtoreG3ZvU7oVI/view?usp=drivesdk
source:
  - None
processed_by:
  - Huu Nguyen
task_types:
  - dialogue
description:
  - ": xxxx\n: yyyy"
--------------------------------------------------------------------------------
/data/unified_lyrics.yaml:
--------------------------------------------------------------------------------
pretty_name: unified_lyrics
license:
  - None
language:
  - en
multilinguality:
  - monolingual
download_link:
  - https://drive.google.com/file/d/16QNrXAWFilhrCDM-zBqo1p3GvJ4eHppe/view?usp=drivesdk
source:
  - https://www.kaggle.com/datasets/neisse/scrapped-lyrics-from-6-genres?select=lyrics-data.csv
processed_by:
  - Jue
task_types:
  - dialogue
description:
  - ": xxxx\n: yyyy"
--------------------------------------------------------------------------------
/data/unified_merged_code_xp3.yaml:
--------------------------------------------------------------------------------
pretty_name: unified_merged_code_xp3.jsonl
license:
  - None
language:
  - en
multilinguality:
  - monolingual
download_link:
  - https://drive.google.com/file/d/1zg_LGidWpmBDpf7mOdPJwTqa0GaVyawv/view?usp=drivesdk
source:
  - None
processed_by:
  - Huu Nguyen
task_types:
  - dialogue
description:
  - ": xxxx\n: yyyy"
--------------------------------------------------------------------------------
/data/unified_multi_news.yaml:
--------------------------------------------------------------------------------
pretty_name: unified_multi_news
license:
  - None
language:
  - en
multilinguality:
  - monolingual
download_link:
  - https://drive.google.com/file/d/15b2PFMeSnMp3ZIQZsSI_JRSKBhFhN1L3/view?usp=drivesdk
source:
  - https://huggingface.co/datasets/multi_news
processed_by:
  - Jue
task_types:
  - dialogue
description:
  - ": xxxx\n: yyyy"
--------------------------------------------------------------------------------
/data/unified_ni.yaml:
--------------------------------------------------------------------------------
pretty_name: NI in dialogue
license:
  - apache-2.0
language:
  - en
multilinguality:
  - monolingual
download_link:
  - https://drive.google.com/file/d/15n9BUIAQiQA4YoDY_-NbOmOuCPylb6OC/view?usp=share_link
source:
  - https://huggingface.co/datasets/Muennighoff/natural-instructions
processed_by:
  - Jue
task_types:
  - dialogue
description:
  - ": xxxx\n: yyyy"
--------------------------------------------------------------------------------
/data/unified_nq.yaml:
--------------------------------------------------------------------------------
pretty_name: Unified NQ
license:
  - cc-by-sa-3.0
language:
  - en
multilinguality:
  - monolingual
download_link:
  - https://drive.google.com/file/d/1o6nq2uD2lxKAJzsrkiSw5fd-o6fkq4qd/view?usp=drivesdk
source:
  - https://ai.google.com/research/NaturalQuestions/
processed_by:
  - Jue
task_types:
  - dialogue
description:
  - ": xxxx\n: yyyy"
--------------------------------------------------------------------------------
/data/unified_openai_summarize_tldr.yaml:
--------------------------------------------------------------------------------
pretty_name: Unified OPENAI TLDR
license:
  - None
language:
  - en
multilinguality:
  - monolingual
download_link:
  - https://drive.google.com/file/d/1U9dRa9ata9tipGMqwurx7qW7yQfkhfC3/view?usp=drivesdk
source:
  - https://huggingface.co/datasets/CarperAI/openai_summarize_tldr <=== not sure about the actual source.
processed_by:
  - Jue
task_types:
  - dialogue
description:
  - ": xxxx\n: yyyy"
--------------------------------------------------------------------------------
/data/unified_oscar_en_sample_dialog.yaml:
--------------------------------------------------------------------------------
pretty_name: unified_oscar_en_sample_dialog
license:
  - None
language:
  - en
multilinguality:
  - monolingual
download_link:
  - https://drive.google.com/file/d/1Haa63l2WLktDRJ-wJcY5MUpB2xalPO0_/view?usp=drivesdk
source:
  - None
task_types:
  - dialogue
processed_by:
  - Huu (Laion)
description:
  - None
--------------------------------------------------------------------------------
/data/unified_p3.yaml:
--------------------------------------------------------------------------------
pretty_name: P3 in dialogue
license:
  - apache-2.0
language:
  - en
multilinguality:
  - monolingual
download_link:
  - https://drive.google.com/file/d/1WZD22rtjUz8eT0ZFV_v9SsTAU6oVJSSW/view?usp=drivesdk
source:
  - https://huggingface.co/datasets/Muennighoff/P3
processed_by:
  - Jue
task_types:
  - dialogue
description:
  - ": xxxx\n: yyyy"
--------------------------------------------------------------------------------
/data/unified_plot_screenplay_books_dialog.yaml:
--------------------------------------------------------------------------------
pretty_name: Unified plot_screenplay_books_dialog
license:
  - None
language:
  - en
multilinguality:
  - monolingual
download_link:
  - https://drive.google.com/file/d/1bSrmcGjSbHTKFgkAVqoNpQ7Z_OzTXKgK/view?usp=drivesdk
source:
  - None
processed_by:
  - Huu Nguyen
task_types:
  - dialogue
description:
  - ": xxxx\n: yyyy"
--------------------------------------------------------------------------------
/data/unified_poetry_instructions.yaml:
--------------------------------------------------------------------------------
pretty_name: Unified poetry_instructions
license:
  - None
language:
  - en
multilinguality:
  - monolingual
download_link:
  - https://drive.google.com/file/d/155HQQtYto8nQB-hfLw10rfymhAAGAsMJ/view?usp=drivesdk
source:
  - None
processed_by:
  - isaacrehg
task_types:
  - dialogue
description:
  - ": xxxx\n: yyyy"
--------------------------------------------------------------------------------
/data/unified_rosey_and_prosocial_plus_safety.yaml:
--------------------------------------------------------------------------------
pretty_name: rosey_and_prosocial + safety
license:
  - None
language:
  - en
multilinguality:
  - monolingual
download_link:
  - https://drive.google.com/file/d/1TfJ-9kTV173xZF2IiNhCvkSZqU27HVyh/view?usp=drivesdk
source:
  - None
task_types:
  - dialogue
processed_by:
  - Huu Nguyen
description:
  - None
--------------------------------------------------------------------------------
/data/unified_scitldr.yaml:
--------------------------------------------------------------------------------
pretty_name: Unified HC3 human
license:
  - None
language:
  - en
multilinguality:
  - monolingual
download_link:
  - https://drive.google.com/file/d/1PxDgmWJUZUDZLLugRsb2uOdLL7TNT5Df/view?usp=drivesdk
source:
  - https://huggingface.co/datasets/allenai/scitldr
processed_by:
  - Jue
task_types:
  - dialogue
description:
  - ": xxxx\n: yyyy"
--------------------------------------------------------------------------------
/data/unified_soda_dialog.yaml:
--------------------------------------------------------------------------------
pretty_name: SODA
license:
  - cc-by-4.0
language:
  - en
multilinguality:
  - monolingual
download_link:
  - https://drive.google.com/file/d/1QUZmGXvdEFb_cim0qI60GHe_hwR9EWXq/view?usp=drivesdk
source:
  - https://huggingface.co/datasets/allenai/soda
task_types:
  - dialogue
processed_by:
  - Huu Nguyen
description:
  - Dialogue data from SODA
--------------------------------------------------------------------------------
/data/unified_sqlv1.yaml:
--------------------------------------------------------------------------------
pretty_name: Unified Text2SQL
license:
  - None
language:
  - en
multilinguality:
  - monolingual
download_link:
  - https://drive.google.com/file/d/1hutkkGsNsKSgl3kOK8Yq0p9XahojYG-_/view?usp=drivesdk
source:
  - None
processed_by:
  - Laurel (HazyResearch)
task_types:
  - dialogue
description:
  - ": xxxx\n: yyyy"
--------------------------------------------------------------------------------
/data/unified_sqlv2.yaml:
--------------------------------------------------------------------------------
pretty_name: Unified Text2SQL
license:
  - None
language:
  - en
multilinguality:
  - monolingual
download_link:
  - https://drive.google.com/file/d/1aHFiO3sfEe1Y2Q_lD7cnjgK49gLwInXT/view?usp=drivesdk
source:
  - None
processed_by:
  - Laurel (HazyResearch)
task_types:
  - dialogue
description:
  - ": xxxx\n: yyyy"
--------------------------------------------------------------------------------
/data/unified_squad_v2.yaml:
--------------------------------------------------------------------------------
pretty_name: Unified SQUADv2
license:
  - cc-by-sa-4.0
language:
  - en
multilinguality:
  - monolingual
download_link:
  - https://drive.google.com/file/d/1KwyktbACtjQee_71KidmfuarPvgRAcdf/view?usp=drivesdk
source:
  - https://huggingface.co/datasets/squad_v2
processed_by:
  - Jue
task_types:
  - dialogue
description:
  - ": xxxx\n: yyyy"
--------------------------------------------------------------------------------
/data/unified_squad_v2_more_neg.yaml:
--------------------------------------------------------------------------------
pretty_name: Unified SQUADv2 with more Negative Questions
license:
  - cc-by-sa-4.0
language:
  - en
multilinguality:
  - monolingual
download_link:
  - https://drive.google.com/file/d/1qS32twBKk-qPdQTcE5fPnNwzSEzR50jO/view?usp=drivesdk
source:
  - https://huggingface.co/datasets/squad_v2
processed_by:
  - Jue
task_types:
  - dialogue
description:
  - ": xxxx\n: yyyy"
--------------------------------------------------------------------------------
/data/unified_ul2_plus_oscar_en_sample_dialog.yaml:
--------------------------------------------------------------------------------
pretty_name: unified_ul2_plus_oscar_en_sample_dialog
license:
  - None
language:
  - en
multilinguality:
  - monolingual
download_link:
  - https://drive.google.com/file/d/14eNpg13mPWcPFcxuM3MtvifkdJgyn8tZ/view?usp=drivesdk
source:
  - None
task_types:
  - dialogue
processed_by:
  - Huu Nguyen
description:
  - None
--------------------------------------------------------------------------------
/data/unified_unatural_instructions.yaml:
--------------------------------------------------------------------------------
pretty_name: unified_unatural_instructions
license:
  - None
language:
  - en
multilinguality:
  - monolingual
download_link:
  - https://drive.google.com/file/d/1R7ETiMP-Y3ZGXIMgc2-YhP3aTQbO33Wp/view?usp=drivesdk
source:
  - None
task_types:
  - dialogue
processed_by:
  - Huu Nguyen
description:
  - None
--------------------------------------------------------------------------------
/data/unified_unifiedskg_instructions.yaml:
--------------------------------------------------------------------------------
pretty_name: unified_unifiedskg_instructions
license:
  - None
language:
  - en
multilinguality:
  - monolingual
download_link:
  - https://drive.google.com/file/d/1ZiHnzRVtVffItxie-nfQAPAyLT6vsqq_/view?usp=drivesdk
source:
  - None
task_types:
  - dialogue
processed_by:
  - Huu Nguyen
description:
  - None
--------------------------------------------------------------------------------
/models/oig_v0.11.yaml:
--------------------------------------------------------------------------------
model_name: OIG
version: 0.11
training_config:
  - training_22_2_27.yaml for 524M tokens
--------------------------------------------------------------------------------
/models/oig_v0.13.yaml:
--------------------------------------------------------------------------------
model_name: OIG
version: 0.13
training_config:
  - training_22_2_27.yaml for 524M tokens
  - training_22_2_28.yaml for 262M tokens
  - training_22_3_1.yaml for 367M tokens
--------------------------------------------------------------------------------
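
The `training_data` entries in the `configs/training_*.yaml` files are `slice_name:weight` strings, and the README says slices can be weighted differently. Below is a minimal Python sketch of how such a mixture could be parsed and turned into sampling probabilities. The function names `parse_mixture` and `normalize` are hypothetical, and rescaling the weights so they sum to 1 is an assumption: the repository does not state how the trainer interprets weights, which in these configs sum to more than 1.

```python
from typing import Dict, List


def parse_mixture(entries: List[str]) -> Dict[str, float]:
    """Parse 'slice_name:weight' strings into a {name: weight} mapping."""
    mixture: Dict[str, float] = {}
    for entry in entries:
        # Split on the last colon so the slice name is kept whole.
        name, _, raw_weight = entry.rpartition(":")
        if not name:
            raise ValueError(f"expected 'name:weight', got {entry!r}")
        weight = float(raw_weight)
        if weight < 0:
            raise ValueError(f"negative weight for slice {name!r}")
        if name in mixture:
            raise ValueError(f"duplicate slice {name!r}")
        mixture[name] = weight
    return mixture


def normalize(mixture: Dict[str, float]) -> Dict[str, float]:
    """Rescale weights to sum to 1 (an assumed, unconfirmed interpretation)."""
    total = sum(mixture.values())
    if total <= 0:
        raise ValueError("mixture weights must sum to a positive value")
    return {name: weight / total for name, weight in mixture.items()}


# First three training_data entries of configs/training_22_2_27.yaml.
entries = ["unified_ni:0.2", "unified_p3:0.5", "unified_flan:0.2"]
probabilities = normalize(parse_mixture(entries))
for name, p in probabilities.items():
    print(f"{name}: {p:.3f}")
```

Rejecting duplicate names and malformed entries up front is a design choice of this sketch; it surfaces config typos before any training time is spent.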