├── .gitignore ├── LICENSE ├── README.md └── assets └── totalplot.png /.gitignore: -------------------------------------------------------------------------------- 1 | notebooks 2 | data 3 | 4 | # Byte-compiled / optimized / DLL files 5 | __pycache__/ 6 | *.py[cod] 7 | *$py.class 8 | 9 | # C extensions 10 | *.so 11 | 12 | # Distribution / packaging 13 | .Python 14 | build/ 15 | develop-eggs/ 16 | dist/ 17 | downloads/ 18 | eggs/ 19 | .eggs/ 20 | lib/ 21 | lib64/ 22 | parts/ 23 | sdist/ 24 | var/ 25 | wheels/ 26 | share/python-wheels/ 27 | *.egg-info/ 28 | .installed.cfg 29 | *.egg 30 | MANIFEST 31 | 32 | # PyInstaller 33 | # Usually these files are written by a python script from a template 34 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 35 | *.manifest 36 | *.spec 37 | 38 | # Installer logs 39 | pip-log.txt 40 | pip-delete-this-directory.txt 41 | 42 | # Unit test / coverage reports 43 | htmlcov/ 44 | .tox/ 45 | .nox/ 46 | .coverage 47 | .coverage.* 48 | .cache 49 | nosetests.xml 50 | coverage.xml 51 | *.cover 52 | *.py,cover 53 | .hypothesis/ 54 | .pytest_cache/ 55 | cover/ 56 | 57 | # Translations 58 | *.mo 59 | *.pot 60 | 61 | # Django stuff: 62 | *.log 63 | local_settings.py 64 | db.sqlite3 65 | db.sqlite3-journal 66 | 67 | # Flask stuff: 68 | instance/ 69 | .webassets-cache 70 | 71 | # Scrapy stuff: 72 | .scrapy 73 | 74 | # Sphinx documentation 75 | docs/_build/ 76 | 77 | # PyBuilder 78 | .pybuilder/ 79 | target/ 80 | 81 | # Jupyter Notebook 82 | .ipynb_checkpoints 83 | 84 | # IPython 85 | profile_default/ 86 | ipython_config.py 87 | 88 | # pyenv 89 | # For a library or package, you might want to ignore these files since the code is 90 | # intended to run in multiple environments; otherwise, check them in: 91 | # .python-version 92 | 93 | # pipenv 94 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 95 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 96 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 97 | # install all needed dependencies. 98 | #Pipfile.lock 99 | 100 | # poetry 101 | # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. 102 | # This is especially recommended for binary packages to ensure reproducibility, and is more 103 | # commonly ignored for libraries. 104 | # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control 105 | #poetry.lock 106 | 107 | # pdm 108 | # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. 109 | #pdm.lock 110 | # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it 111 | # in version control. 112 | # https://pdm.fming.dev/#use-with-ide 113 | .pdm.toml 114 | 115 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow and github.com/pdm-project/pdm 116 | __pypackages__/ 117 | 118 | # Celery stuff 119 | celerybeat-schedule 120 | celerybeat.pid 121 | 122 | # SageMath parsed files 123 | *.sage.py 124 | 125 | # Environments 126 | .env 127 | .venv 128 | env/ 129 | venv/ 130 | ENV/ 131 | env.bak/ 132 | venv.bak/ 133 | 134 | # Spyder project settings 135 | .spyderproject 136 | .spyproject 137 | 138 | # Rope project settings 139 | .ropeproject 140 | 141 | # mkdocs documentation 142 | /site 143 | 144 | # mypy 145 | .mypy_cache/ 146 | .dmypy.json 147 | dmypy.json 148 | 149 | # Pyre type checker 150 | .pyre/ 151 | 152 | # pytype static type analyzer 153 | .pytype/ 154 | 155 | # Cython debug symbols 156 | cython_debug/ 157 | 158 | # PyCharm 159 | # JetBrains specific template is maintained in a separate JetBrains.gitignore that can 160 | # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore 161 | # and can be added to the global gitignore or merged into this file. For a more nuclear 162 | # option (not recommended) you can uncomment the following to ignore the entire idea folder. 163 | #.idea/ 164 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2023 Daniel Park 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | # Open-LLM-datasets 5 | Repository for organizing datasets used in Open LLM. 6 | 7 |
8 | 9 | 10 | # Table of Contents 11 | - [Datasets](#datasets) 12 | - [General Open Access Datasets for Alignment](#general-open-access-datasets-for-alignment) 13 | - [Open Datasets for Pretraining](#open-datasets-for-pretraining) 14 | - [Domain-specific datasets and Private datasets](#domain-specific-datasets-and-private-datasets) 15 | - [Potential Overlap](#potential-overlap) 16 | - [Papers](#papers) 17 | - [Pre-trained LLM](#pre-trained-llm) 18 | - [Instruction finetuned LLM](#instruction-finetuned-llm) 19 | - [Aligned LLM](#aligned-llm) 20 | - [Open LLM](#open-llm) 21 | - [LLM Training Frameworks](#llm-training-frameworks) 22 | - [LLM Optimization](#llm-optimization) 23 | - [State-of-the-art Parameter-Efficient Fine-Tuning (PEFT) methods](#state-of-the-art-parameter-efficient-fine-tuning-peft-methods) 24 | - [Tools for deploying LLM](#tools-for-deploying-llm) 25 | - [Tutorials about LLM](#tutorials-about-llm) 26 | - [Courses about LLM](#courses-about-llm) 27 | - [Opinions about LLM](#opinions-about-llm) 28 | - [Other Awesome Lists](#other-awesome-lists) 29 | - [Other Useful Resources](#other-useful-resources) 30 | - [Contribute](#contribute) 31 | - [References](#references) 32 | 33 | 34 |
35 | 36 | 37 | 38 | # Datasets 39 | To download or access information about the most commonly used datasets: https://huggingface.co/datasets 40 | 41 | ## General Open Access Datasets for Alignment 42 | - [falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) 43 | - [ultraChat](https://huggingface.co/datasets/stingning/ultrachat) 44 | - [ShareGPT_Vicuna_unfiltered](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered) 45 | - [pku-saferlhf-dataset](https://github.com/PKU-Alignment/safe-rlhf#pku-saferlhf-dataset) 46 | - [RefGPT-Dataset](https://github.com/ziliwangnlp/RefGPT) 47 | - [Luotuo-QA-A-CoQA-Chinese](https://huggingface.co/datasets/silk-road/Luotuo-QA-A-CoQA-Chinese) 48 | - [Wizard-LM-Chinese-instruct-evol](https://huggingface.co/datasets/silk-road/Wizard-LM-Chinese-instruct-evol) 49 | - [alpaca_chinese_dataset](https://github.com/hikariming/alpaca_chinese_dataset) 50 | - [Zhihu-KOL](https://huggingface.co/datasets/wangrui6/Zhihu-KOL) 51 | - [Alpaca-GPT-4_zh-cn](https://huggingface.co/datasets/shibing624/alpaca-zh) 52 | - [Baize Dataset](https://github.com/project-baize/baize-chatbot/tree/main/data) 53 | - [h2oai/h2ogpt-fortune2000-personalized](https://huggingface.co/datasets/h2oai/h2ogpt-fortune2000-personalized) 54 | - [SHP](https://huggingface.co/datasets/stanfordnlp/SHP) 55 | - [ELI5](https://huggingface.co/datasets/eli5#source-data) 56 | - [evol_instruct_70k](https://huggingface.co/datasets/victor123/evol_instruct_70k) 57 | - [MOSS SFT data](https://github.com/OpenLMLab/MOSS/tree/main/SFT_data) 58 | - [ShareGPT52K](https://huggingface.co/datasets/RyokoAI/ShareGPT52K) 59 | - [GPT-4all Dataset](https://huggingface.co/datasets/nomic-ai/gpt4all-j-prompt-generations) 60 | - [COIG](https://huggingface.co/datasets/BAAI/COIG) 61 | - [RedPajama-Data-1T](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) 62 | - [OpenAssistant Conversations Dataset (OASST1)](https://huggingface.co/datasets/OpenAssistant/oasst1) 63 | - [Alpaca-COT](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT) 64 | - [CBook-150K](https://github.com/FudanNLPLAB/CBook-150K) 65 | - [databricks-dolly-15k](https://github.com/databrickslabs/dolly/tree/master/data) ([possible zh-cn version](https://huggingface.co/datasets/jaja7744/dolly-15k-cn)) 66 | - [AlpacaDataCleaned](https://github.com/gururise/AlpacaDataCleaned) 67 | - [GPT-4-LLM Dataset](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM) 68 | - [GPTeacher](https://github.com/teknium1/GPTeacher) 69 | - [HC3](https://github.com/Hello-SimpleAI/chatgpt-comparison-detection) 70 | - [Alpaca data](https://github.com/tatsu-lab/stanford_alpaca#data-release) [Download](https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json) 71 | - [OIG](https://huggingface.co/datasets/laion/OIG) [OIG-small-chip2](https://huggingface.co/datasets/0-hero/OIG-small-chip2) 72 | - [ChatAlpaca data](https://github.com/cascip/ChatAlpaca) 73 | - [InstructionWild](https://github.com/XueFuzhao/InstructionWild) 74 | - [Firefly(流萤)](https://huggingface.co/datasets/YeungNLP/firefly-train-1.1M) 75 | - [BELLE](https://github.com/LianjiaTech/BELLE) [0.5M version](https://huggingface.co/datasets/BelleGroup/train_0.5M_CN) [1M version](https://huggingface.co/datasets/BelleGroup/train_1M_CN) [2M version](https://huggingface.co/datasets/BelleGroup/train_2M_CN) 76 | - [GuanacoDataset](https://huggingface.co/datasets/JosephusCheung/GuanacoDataset#guanacodataset) 77 | - [xP3 (and some 
variants)](https://huggingface.co/datasets/bigscience/xP3) 78 | - [OpenAI WebGPT](https://huggingface.co/datasets/openai/webgpt_comparisons) 79 | - [OpenAI Summarization Comparison](https://huggingface.co/datasets/openai/summarize_from_feedback) 80 | - [Natural Instruction](https://instructions.apps.allenai.org/) [GitHub&Download](https://github.com/allenai/natural-instructions) 81 | - [hh-rlhf](https://github.com/anthropics/hh-rlhf) [on Huggingface](https://huggingface.co/datasets/Anthropic/hh-rlhf) 82 | - [OpenAI PRM800k](https://github.com/openai/prm800k) 83 | 84 | 85 | ## Open Datasets for Pretraining 86 | 87 | - [falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) 88 | - [Common Crawl](https://commoncrawl.org/) 89 | - [nlp_Chinese_Corpus](https://github.com/brightmart/nlp_chinese_corpus) 90 | - [The Pile (V1)](https://pile.eleuther.ai/) 91 | - [Hugging Face dataset for C4](https://huggingface.co/datasets/c4) 92 | - [TensorFlow dataset for C4](https://www.tensorflow.org/datasets/catalog/c4) 93 | - [ROOTS](https://huggingface.co/bigscience-data) 94 | - [Pushshift Reddit](https://files.pushshift.io/reddit/) 95 | - [Gutenberg project](https://www.gutenberg.org/policy/robot_access.html) 96 | - [CLUECorpus](https://github.com/CLUEbenchmark/CLUE) 97 | 98 | 99 | 100 | ## Domain-specific datasets and Private datasets 101 | 102 | - [ChatGPT-Jailbreak-Prompts](https://huggingface.co/datasets/rubend18/ChatGPT-Jailbreak-Prompts) 103 | - [awesome-chinese-legal-resources](https://github.com/pengxiao-song/awesome-chinese-legal-resources) 104 | - [LongForm](https://github.com/akoksal/LongForm) 105 | - [symbolic-instruction-tuning](https://huggingface.co/datasets/sail/symbolic-instruction-tuning) 106 | - [Safety Prompts](https://github.com/thu-coai/Safety-Prompts) 107 | - [Tapir-Cleaned](https://huggingface.co/datasets/MattiaL/tapir-cleaned-116k) 108 | - [instructional_codesearchnet_python](https://huggingface.co/datasets/Nan-Do/instructional_codesearchnet_python) 109 | - [finance-alpaca](https://huggingface.co/datasets/gbharti/finance-alpaca) 110 | - WebText (Reddit links) - Private Dataset 111 | - MassiveText - Private Dataset 112 | - [Korean-Open-LLM-Datasets](https://github.com/dsdanielpark/Korean-Open-LLM-Datasets) 113 | 114 | 115 | ## Potential Overlap 116 | | | OIG | hh-rlhf | xP3 | Natural instruct | AlpacaDataCleaned | GPT-4-LLM | Alpaca-CoT | 117 | |-------------------|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:| 118 | | OIG | - | Contains | Overlap | Overlap | Overlap | | Overlap | 119 | | hh-rlhf | Part of | - | | | | | Overlap | 120 | | xP3 | Overlap | | - | Overlap | | | Overlap | 121 | | Natural instruct | Overlap | | Overlap | - | | | Overlap | 122 | | AlpacaDataCleaned | Overlap | | | | - | Overlap | Overlap | 123 | | GPT-4-LLM | | | | | Overlap | - | Overlap | 124 | | Alpaca-CoT | Overlap | Overlap | Overlap | Overlap | Overlap | Overlap | - | 125 | 126 | 127 | 128 | # Papers 129 | 130 | - [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf) 131 | - [Improving Language Understanding by Generative Pre-Training](https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf) 132 | - [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://aclanthology.org/N19-1423.pdf) 133 | - [Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) 134 | - [Megatron-LM: Training Multi-Billion
Parameter Language Models Using Model Parallelism](https://arxiv.org/pdf/1909.08053.pdf) 135 | - [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://jmlr.org/papers/v21/20-074.html) 136 | - [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/pdf/1910.02054.pdf) 137 | - [Scaling Laws for Neural Language Models](https://arxiv.org/pdf/2001.08361.pdf) 138 | - [Language models are few-shot learners](https://papers.nips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf) 139 | - [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/pdf/2101.03961.pdf) 140 | - [Evaluating Large Language Models Trained on Code](https://arxiv.org/pdf/2107.03374.pdf) 141 | - [On the Opportunities and Risks of Foundation Models](https://arxiv.org/pdf/2108.07258.pdf) 142 | - [Finetuned Language Models are Zero-Shot Learners](https://openreview.net/forum?id=gEZrGCozdqR) 143 | - [Multitask Prompted Training Enables Zero-Shot Task Generalization](https://arxiv.org/abs/2110.08207) 144 | - [GLaM: Efficient Scaling of Language Models with Mixture-of-Experts](https://arxiv.org/pdf/2112.06905.pdf) 145 | - [WebGPT: Improving the Factual Accuracy of Language Models through Web Browsing](https://openai.com/blog/webgpt/) 146 | - [Improving language models by retrieving from trillions of tokens](https://www.deepmind.com/publications/improving-language-models-by-retrieving-from-trillions-of-tokens) 147 | - [Scaling Language Models: Methods, Analysis & Insights from Training Gopher](https://arxiv.org/pdf/2112.11446.pdf) 148 | - [Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/pdf/2201.11903.pdf) 149 | - [LaMDA: Language Models for Dialog Applications](https://arxiv.org/pdf/2201.08239.pdf) 150 | - [Solving Quantitative Reasoning Problems with Language Models](https://arxiv.org/abs/2206.14858) 151 | - [Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model](https://arxiv.org/pdf/2201.11990.pdf) 152 | - [Training language models to follow instructions with human feedback](https://arxiv.org/pdf/2203.02155.pdf) 153 | - [PaLM: Scaling Language Modeling with Pathways](https://arxiv.org/pdf/2204.02311.pdf) 154 | - [An empirical analysis of compute-optimal large language model training](https://www.deepmind.com/publications/an-empirical-analysis-of-compute-optimal-large-language-model-training) 155 | - [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/pdf/2205.01068.pdf) 156 | - [Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131v1) 157 | - [Emergent Abilities of Large Language Models](https://openreview.net/pdf?id=yzkSU5zdwD) 158 | - [Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models](https://github.com/google/BIG-bench) 159 | - [Language Models are General-Purpose Interfaces](https://arxiv.org/pdf/2206.06336.pdf) 160 | - [Improving alignment of dialogue agents via targeted human judgements](https://arxiv.org/pdf/2209.14375.pdf) 161 | - [Scaling Instruction-Finetuned Language Models](https://arxiv.org/pdf/2210.11416.pdf) 162 | - [GLM-130B: An Open Bilingual Pre-trained Model](https://arxiv.org/pdf/2210.02414.pdf) 163 | - [Holistic Evaluation of Language Models](https://arxiv.org/pdf/2211.09110.pdf) 164 | - [BLOOM: A 176B-Parameter Open-Access Multilingual Language Model](https://arxiv.org/pdf/2211.05100.pdf) 165 | - 
[Galactica: A Large Language Model for Science](https://arxiv.org/pdf/2211.09085.pdf) 166 | - [OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization](https://arxiv.org/pdf/2212.12017) 167 | - [The Flan Collection: Designing Data and Methods for Effective Instruction Tuning](https://arxiv.org/pdf/2301.13688.pdf) 168 | - [LLaMA: Open and Efficient Foundation Language Models](https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/) 169 | - [Language Is Not All You Need: Aligning Perception with Language Models](https://arxiv.org/abs/2302.14045) 170 | - [PaLM-E: An Embodied Multimodal Language Model](https://palm-e.github.io) 171 | - [GPT-4 Technical Report](https://openai.com/research/gpt-4) 172 | - [Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling](https://arxiv.org/abs/2304.01373) 173 | - [Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision](https://arxiv.org/abs/2305.03047) 174 | - [PaLM 2 Technical Report](https://ai.google/static/documents/palm2techreport.pdf) 175 | - [RWKV: Reinventing RNNs for the Transformer Era](https://arxiv.org/abs/2305.13048) 176 | - [Let’s Verify Step by Step - Open AI](https://cdn.openai.com/improving-mathematical-reasoning-with-process-supervision/Lets_Verify_Step_by_Step.pdf) 177 | 178 | 179 | ## Pre-trained LLM 180 | - Switch Transformer: [Paper](https://arxiv.org/pdf/2101.03961.pdf) 181 | - GLaM: [Paper](https://arxiv.org/pdf/2112.06905.pdf) 182 | - PaLM: [Paper](https://arxiv.org/pdf/2204.02311.pdf) 183 | - MT-NLG: [Paper](https://arxiv.org/pdf/2201.11990.pdf) 184 | - J1-Jumbo: [api](https://docs.ai21.com/docs/complete-api), [Paper](https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf) 185 | - OPT: [api](https://opt.alpa.ai), [ckpt](https://github.com/facebookresearch/metaseq/tree/main/projects/OPT), [Paper](https://arxiv.org/pdf/2205.01068.pdf), [OPT-175B License Agreement](https://github.com/facebookresearch/metaseq/blob/edefd4a00c24197486a3989abe28ca4eb3881e59/projects/OPT/MODEL_LICENSE.md) 186 | - BLOOM: [api](https://huggingface.co/bigscience/bloom), [ckpt](https://huggingface.co/bigscience/bloom), [Paper](https://arxiv.org/pdf/2211.05100.pdf), [BigScience RAIL License v1.0](https://huggingface.co/spaces/bigscience/license) 187 | - GPT 3.0: [api](https://openai.com/api/), [Paper](https://arxiv.org/pdf/2005.14165.pdf) 188 | - LaMDA: [Paper](https://arxiv.org/pdf/2201.08239.pdf) 189 | - GLM: [ckpt](https://github.com/THUDM/GLM-130B), [Paper](https://arxiv.org/pdf/2210.02414.pdf), [The GLM-130B License](https://github.com/THUDM/GLM-130B/blob/799837802264eb9577eb9ae12cd4bad0f355d7d6/MODEL_LICENSE) 190 | - YaLM: [ckpt](https://github.com/yandex/YaLM-100B), [Blog](https://medium.com/yandex/yandex-publishes-yalm-100b-its-the-largest-gpt-like-neural-network-in-open-source-d1df53d0e9a6), [Apache 2.0 License](https://github.com/yandex/YaLM-100B/blob/14fa94df2ebbbd1864b81f13978f2bf4af270fcb/LICENSE) 191 | - LLaMA: [ckpt](https://github.com/facebookresearch/llama), [Paper](https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/), [Non-commercial bespoke license](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md) 192 | - GPT-NeoX: [ckpt](https://github.com/EleutherAI/gpt-neox), [Paper](https://arxiv.org/pdf/2204.06745.pdf), [Apache 2.0 License](https://github.com/EleutherAI/gpt-neox/blob/main/LICENSE) 193 | 
- UL2: [ckpt](https://huggingface.co/google/ul2), [Paper](https://arxiv.org/pdf/2205.05131v1.pdf), [Apache 2.0 License](https://huggingface.co/google/ul2) 194 | - T5: [ckpt](https://huggingface.co/t5-11b), [Paper](https://jmlr.org/papers/v21/20-074.html), [Apache 2.0 License](https://huggingface.co/t5-11b) 195 | - CPM-Bee: [api](https://live.openbmb.org/models/bee), [Paper](https://arxiv.org/pdf/2012.00413.pdf) 196 | - rwkv-4: [ckpt](https://huggingface.co/BlinkDL/rwkv-4-pile-7b), [Github](https://github.com/BlinkDL/RWKV-LM), [Apache 2.0 License](https://huggingface.co/BlinkDL/rwkv-4-pile-7b) 197 | - GPT-J: [ckpt](https://huggingface.co/EleutherAI/gpt-j-6B), [Github](https://github.com/kingoflolz/mesh-transformer-jax), [Apache 2.0 License](https://huggingface.co/EleutherAI/gpt-j-6b) 198 | - GPT-Neo: [ckpt](https://github.com/EleutherAI/gpt-neo), [Github](https://github.com/EleutherAI/gpt-neo), [MIT License](https://github.com/EleutherAI/gpt-neo/blob/23485e3c7940560b3b4cb12e0016012f14d03fc7/LICENSE) 199 | 200 | 201 | 202 | ## Instruction finetuned LLM 203 | - Flan-PaLM: [Link](https://arxiv.org/pdf/2210.11416.pdf) 204 | - BLOOMZ: [Link](https://huggingface.co/bigscience/bloomz) 205 | - InstructGPT: [Link](https://platform.openai.com/overview) 206 | - Galactica: [Link](https://huggingface.co/facebook/galactica-120b) 207 | - OpenChatKit: [Link](https://github.com/togethercomputer/OpenChatKit) 208 | - Flan-UL2: [Link](https://github.com/google-research/google-research/tree/master/ul2) 209 | - Flan-T5: [Link](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints) 210 | - T0: [Link](https://huggingface.co/bigscience/T0) 211 | - Alpaca: [Link](https://crfm.stanford.edu/alpaca/) 212 | 213 | 214 | ## Aligned LLM 215 | - GPT 4: [Blog](https://openai.com/research/gpt-4) 216 | - ChatGPT: [Demo](https://openai.com/blog/chatgpt/) | [API](https://share.hsforms.com/1u4goaXwDRKC9-x9IvKno0A4sk30) 217 | - Sparrow: [Paper](https://arxiv.org/pdf/2209.14375.pdf) 218 | - Claude: [Demo](https://poe.com/claude) | [API](https://www.anthropic.com/earlyaccess) 219 | 220 | 221 |
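Most of the checkpoint links in the sections above point to the Hugging Face Hub. As a minimal, hedged sketch (not part of the original list; it assumes the `transformers` and `torch` packages are installed and that enough memory is available for a 6B-parameter model), a checkpoint such as the GPT-J entry in the Pre-trained LLM list can typically be loaded and queried roughly like this:

```python
# Minimal, hedged example: load one of the open checkpoints listed above (GPT-J)
# with Hugging Face transformers and generate a short completion.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-j-6B"  # checkpoint id taken from the Pre-trained LLM list above

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to reduce memory; drop this when running on CPU
    device_map="auto",          # requires the `accelerate` package; remove to load on a single device
)

prompt = "Open datasets are important for LLM research because"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same pattern applies to many of the checkpoints in the Open LLM list below; typically only the model id changes.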
222 | 223 | # Open LLM 224 | 225 | ## LLM Leaderboard 226 | ![Open LLM Leaderboard total plot](./assets/totalplot.png) 227 | - Visualization of the Open LLM Leaderboard: https://github.com/dsdanielpark/Open-LLM-Leaderboard-Report 228 | - Open LLM Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard 229 | 230 |
231 | 232 | - [LLaMA](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/) - A foundational, 65-billion-parameter large language model. 233 | - [Alpaca](https://crfm.stanford.edu/2023/03/13/alpaca.html) - A model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations. 234 | - [Flan-Alpaca](https://github.com/declare-lab/flan-alpaca) - Instruction Tuning from Humans and Machines. 235 | - [Baize](https://github.com/project-baize/baize-chatbot) - Baize is an open-source chat model trained with LoRA. 236 | - [Cabrita](https://github.com/22-hours/cabrita) - A LLaMA model fine-tuned on Portuguese instructions. 237 | - [Vicuna](https://github.com/lm-sys/FastChat) - An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality. 238 | - [Llama-X](https://github.com/AetherCortex/Llama-X) - Open Academic Research on Improving LLaMA to SOTA LLM. 239 | - [Chinese-Vicuna](https://github.com/Facico/Chinese-Vicuna) - A Chinese Instruction-following LLaMA-based Model. 240 | - [GPTQ-for-LLaMA](https://github.com/qwopqwop200/GPTQ-for-LLaMa) - 4-bit quantization of LLaMA using GPTQ. 241 | - [GPT4All](https://github.com/nomic-ai/gpt4all) - Demo, data, and code to train an open-source assistant-style large language model based on GPT-J and LLaMA. 242 | - [Koala](https://bair.berkeley.edu/blog/2023/04/03/koala/) - A Dialogue Model for Academic Research. 243 | - [BELLE](https://github.com/LianjiaTech/BELLE) - Be Everyone's Large Language Model Engine. 244 | - [StackLLaMA](https://huggingface.co/blog/stackllama) - A hands-on guide to train LLaMA with RLHF. 245 | - [RedPajama](https://github.com/togethercomputer/RedPajama-Data) - An open-source recipe to reproduce the LLaMA training dataset. 246 | - [Chimera](https://github.com/FreedomIntelligence/LLMZoo) - Latin Phoenix. 247 | - [CaMA](https://github.com/zjunlp/CaMA) - a Chinese-English Bilingual LLaMA Model. 248 | - [BLOOM](https://huggingface.co/bigscience/bloom) - BigScience Large Open-science Open-access Multilingual Language Model. 249 | - [BLOOMZ & mT0](https://huggingface.co/bigscience/bloomz) - a family of models capable of following human instructions in dozens of languages zero-shot. 250 | - [Phoenix](https://github.com/FreedomIntelligence/LLMZoo) 251 | - [T5](https://arxiv.org/abs/1910.10683) - Text-to-Text Transfer Transformer. 252 | - [T0](https://arxiv.org/abs/2110.08207) - Multitask Prompted Training Enables Zero-Shot Task Generalization. 253 | - [OPT](https://arxiv.org/abs/2205.01068) - Open Pre-trained Transformer Language Models. 254 | - [UL2](https://arxiv.org/abs/2205.05131v1) - a unified framework for pretraining models that are universally effective across datasets and setups. 255 | - [GLM](https://github.com/THUDM/GLM) - GLM is a General Language Model pretrained with an autoregressive blank-filling objective and can be finetuned on various natural language understanding and generation tasks. 256 | - [ChatGLM-6B](https://github.com/THUDM/ChatGLM-6B) - ChatGLM-6B is an open-source bilingual (Chinese and English) dialogue language model based on the General Language Model (GLM) architecture. 257 | - [RWKV](https://github.com/BlinkDL/RWKV-LM) - Parallelizable RNN with Transformer-level LLM Performance. 258 | - [ChatRWKV](https://github.com/BlinkDL/ChatRWKV) - ChatRWKV is like ChatGPT but powered by the RWKV (100% RNN) language model. 259 | - [StableLM](https://stability.ai/blog/stability-ai-launches-the-first-of-its-stablelm-suite-of-language-models) - Stability AI Language Models.
260 | - [YaLM](https://medium.com/yandex/yandex-publishes-yalm-100b-its-the-largest-gpt-like-neural-network-in-open-source-d1df53d0e9a6) - a GPT-like neural network for generating and processing text. 261 | - [GPT-Neo](https://github.com/EleutherAI/gpt-neo) - An implementation of model & data parallel GPT3-like models. 262 | - [GPT-J](https://github.com/kingoflolz/mesh-transformer-jax/#gpt-j-6b) - A 6 billion parameter, autoregressive text generation model trained on The Pile. 263 | - [Dolly](https://www.databricks.com/blog/2023/03/24/hello-dolly-democratizing-magic-chatgpt-open-models.html) - a cheap-to-build LLM that exhibits a surprising degree of the instruction following capabilities exhibited by ChatGPT. 264 | - [Pythia](https://github.com/EleutherAI/pythia) - Interpreting Autoregressive Transformers Across Time and Scale. 265 | - [Dolly 2.0](https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm) - the first open source, instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use. 266 | - [OpenFlamingo](https://github.com/mlfoundations/open_flamingo) - an open-source reproduction of DeepMind's Flamingo model. 267 | - [Cerebras-GPT](https://www.cerebras.net/blog/cerebras-gpt-a-family-of-open-compute-efficient-large-language-models/) - A Family of Open, Compute-efficient, Large Language Models. 268 | - [GALACTICA](https://github.com/paperswithcode/galai/blob/main/docs/model_card.md) - The GALACTICA models are trained on a large-scale scientific corpus. 269 | - [GALPACA](https://huggingface.co/GeorgiaTechResearchInstitute/galpaca-30b) - GALACTICA 30B fine-tuned on the Alpaca dataset. 270 | - [Palmyra](https://huggingface.co/Writer/palmyra-base) - Palmyra Base was primarily pre-trained with English text. 271 | - [Camel](https://huggingface.co/Writer/camel-5b-hf) - a state-of-the-art instruction-following large language model. 272 | - [h2oGPT](https://github.com/h2oai/h2ogpt) 273 | - [PanGu-α](https://openi.org.cn/pangu/) - PanGu-α is a 200B parameter autoregressive pretrained Chinese language model. 274 | - [MOSS](https://github.com/OpenLMLab/MOSS) - MOSS is an open-source dialogue language model that supports Chinese and English. 275 | - [Open-Assistant](https://github.com/LAION-AI/Open-Assistant) - a project meant to give everyone access to a great chat-based large language model. 276 | - [HuggingChat](https://huggingface.co/chat/) - Powered by Open Assistant's latest model – the best open-source chat model right now and @huggingface Inference API. 
277 | - [StarCoder](https://huggingface.co/blog/starcoder) - Hugging Face LLM for Code 278 | - [MPT-7B](https://www.mosaicml.com/blog/mpt-7b) - Open LLM for commercial use by MosaicML 279 | 280 | 281 | 282 | 283 | ## LLM Training Frameworks 284 | - [Serving OPT-175B, BLOOM-176B and CodeGen-16B using Alpa](https://alpa.ai/tutorials/opt_serving.html) 285 | - [Alpa](https://github.com/alpa-projects/alpa) 286 | - [Megatron-LM GPT2 tutorial](https://www.deepspeed.ai/tutorials/megatron/) 287 | - [DeepSpeed Chat](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-chat) 288 | - [pretrain_gpt3_175B.sh](https://github.com/NVIDIA/Megatron-LM/blob/main/examples/pretrain_gpt3_175B.sh) 289 | - [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) 290 | - [deepspeed.ai](https://www.deepspeed.ai) 291 | - [DeepSpeed GitHub repo](https://github.com/microsoft/DeepSpeed) 292 | - [Colossal-AI](https://colossalai.org) 293 | - [Open source solution replicates ChatGPT training process! Ready to go with only 1.6GB GPU memory and gives you 7.73 times faster training!](https://www.hpc-ai.tech/blog/colossal-ai-chatgpt) 294 | - [BMTrain](https://github.com/OpenBMB/BMTrain) 295 | - [Mesh TensorFlow `(mtf)`](https://github.com/tensorflow/mesh) 296 | - [JAX tutorial: Distributed arrays and automatic parallelization](https://jax.readthedocs.io/en/latest/notebooks/Distributed_arrays_and_automatic_parallelization.html) 297 | 298 | ## LLM Optimization 299 | ### State-of-the-art Parameter-Efficient Fine-Tuning (PEFT) methods 300 | - **github:** https://github.com/huggingface/peft 301 | - **abstract:** Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of pre-trained language models (PLMs) to various downstream applications without fine-tuning all the model's parameters. Fine-tuning large-scale PLMs is often prohibitively costly. In this regard, PEFT methods only fine-tune a small number of (extra) model parameters, thereby greatly decreasing the computational and storage costs. Recent state-of-the-art PEFT techniques achieve performance comparable to that of full fine-tuning. PEFT is seamlessly integrated with Hugging Face Accelerate for large-scale models, leveraging DeepSpeed and Big Model Inference. (A minimal LoRA usage sketch appears near the end of this README, just before the Contribute section.) 302 | - **Supported methods:** 303 | 304 | 305 | 1. LoRA: [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) 306 | 2. Prefix Tuning: [Prefix-Tuning: Optimizing Continuous Prompts for Generation](https://aclanthology.org/2021.acl-long.353/), [P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks](https://arxiv.org/pdf/2110.07602.pdf) 307 | 3. P-Tuning: [GPT Understands, Too](https://arxiv.org/abs/2103.10385) 308 | 4. Prompt Tuning: [The Power of Scale for Parameter-Efficient Prompt Tuning](https://arxiv.org/abs/2104.08691) 309 | 5. 
AdaLoRA: [Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning](https://arxiv.org/abs/2303.10512) 310 | 311 | 312 | 313 | ## Tools for deploying LLM 314 | 315 | - [Haystack](https://haystack.deepset.ai/) 316 | - [Sidekick](https://github.com/ai-sidekick/sidekick) 317 | - [LangChain](https://github.com/hwchase17/langchain) 318 | - [wechat-chatgpt](https://github.com/fuergaosi233/wechat-chatgpt) 319 | - [oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui) 320 | 321 | 322 | ## Tutorials about LLM 323 | - [Andrej Karpathy] State of GPT [video](https://build.microsoft.com/en-US/sessions/db3f4859-cd30-4445-a0cd-553c3304f8e2) 324 | - [Hyung Won Chung] Instruction finetuning and RLHF lecture [Youtube](https://www.youtube.com/watch?v=zjrM-MW-0y0) 325 | - [Jason Wei] Scaling, emergence, and reasoning in large language models [Slides](https://docs.google.com/presentation/d/1EUV7W7X_w0BDrscDhPg7lMGzJCkeaPkGCJ3bN8dluXc/edit?pli=1&resourcekey=0-7Nz5A7y8JozyVrnDtcEKJA#slide=id.g16197112905_0_0) 326 | - [Susan Zhang] Open Pretrained Transformers [Youtube](https://www.youtube.com/watch?v=p9IxoSkvZ-M&t=4s) 327 | - [Ameet Deshpande] How Does ChatGPT Work? [Slides](https://docs.google.com/presentation/d/1TTyePrw-p_xxUbi3rbmBI3QQpSsTI1btaQuAUvvNc8w/edit#slide=id.g206fa25c94c_0_24) 328 | - [Yao Fu] The Source of the Capability of Large Language Models: Pretraining, Instructional Fine-tuning, Alignment, and Specialization [Bilibili](https://www.bilibili.com/video/BV1Qs4y1h7pn/?spm_id_from=333.337.search-card.all.click&vd_source=1e55c5426b48b37e901ff0f78992e33f) 329 | - [Hung-yi Lee] ChatGPT: Analyzing the Principle [Youtube](https://www.youtube.com/watch?v=yiY4nPOzJEg&list=RDCMUC2ggjtuuWvxrHHHiaDH1dlQ&index=2) 330 | - [Jay Mody] GPT in 60 Lines of NumPy [Link](https://jaykmody.com/blog/gpt-from-scratch/) 331 | - [ICML 2022] Welcome to the "Big Model" Era: Techniques and Systems to Train and Serve Bigger Models [Link](https://icml.cc/virtual/2022/tutorial/18440) 332 | - [NeurIPS 2022] Foundational Robustness of Foundation Models [Link](https://nips.cc/virtual/2022/tutorial/55796) 333 | - [Andrej Karpathy] Let's build GPT: from scratch, in code, spelled out. [Video](https://www.youtube.com/watch?v=kCc8FmEb1nY)|[Code](https://github.com/karpathy/ng-video-lecture) 334 | - [DAIR.AI] Prompt Engineering Guide [Link](https://github.com/dair-ai/Prompt-Engineering-Guide) 335 | - [Philipp Schmid] Fine-tune FLAN-T5 XL/XXL using DeepSpeed & Hugging Face Transformers [Link](https://www.philschmid.de/fine-tune-flan-t5-deepspeed) 336 | - [HuggingFace] Illustrating Reinforcement Learning from Human Feedback (RLHF) [Link](https://huggingface.co/blog/rlhf) 337 | - [HuggingFace] What Makes a Dialog Agent Useful? [Link](https://huggingface.co/blog/dialog-agents) 338 | - [HeptaAI] ChatGPT Kernel: InstructGPT, PPO Reinforcement Learning Based on Feedback Instructions [Link](https://zhuanlan.zhihu.com/p/589747432) 339 | - [Yao Fu] How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to their Sources [Link](https://yaofu.notion.site/How-does-GPT-Obtain-its-Ability-Tracing-Emergent-Abilities-of-Language-Models-to-their-Sources-b9a57ac0fcf74f30a1ab9e3e36fa1dc1) 340 | - [Stephen Wolfram] What Is ChatGPT Doing ... and Why Does It Work? [Link](https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/) 341 | - [Jingfeng Yang] Why did all of the public reproduction of GPT-3 fail? 
[Link](https://jingfengyang.github.io/gpt) 342 | - [Hung-yi Lee] ChatGPT (possibly) How It Was Created - The Socialization Process of GPT [Video](https://www.youtube.com/watch?v=e0aKI2GGZNg) 343 | - [Open AI Improving mathematical reasoning with process supervision](https://openai.com/research/improving-mathematical-reasoning-with-process-supervision) 344 | 345 | 346 | ## Courses about LLM 347 | 348 | - [DeepLearning.AI] ChatGPT Prompt Engineering for Developers [Homepage](https://www.deeplearning.ai/short-courses/chatgpt-prompt-engineering-for-developers/) 349 | - [Princeton] Understanding Large Language Models [Homepage](https://www.cs.princeton.edu/courses/archive/fall22/cos597G/) 350 | - [Stanford] CS224N-Lecture 11: Prompting, Instruction Finetuning, and RLHF [Slides](https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture11-prompting-rlhf.pdf) 351 | - [Stanford] CS324-Large Language Models [Homepage](https://stanford-cs324.github.io/winter2022/) 352 | - [Stanford] CS25-Transformers United V2 [Homepage](https://web.stanford.edu/class/cs25/) 353 | - [Stanford Webinar] GPT-3 & Beyond [Video](https://www.youtube.com/watch?v=-lnHHWRCDGk) 354 | - [MIT] Introduction to Data-Centric AI [Homepage](https://dcai.csail.mit.edu) 355 | 356 | ## Opinions about LLM 357 | 358 | - [Google "We Have No Moat, And Neither Does OpenAI"](https://www.semianalysis.com/p/google-we-have-no-moat-and-neither) [2023-05-05] 359 | - [AI competition statement](https://petergabriel.com/news/ai-competition-statement/) [2023-04-20] [petergabriel] 360 | - [Noam Chomsky: The False Promise of ChatGPT](https://www.nytimes.com/2023/03/08/opinion/noam-chomsky-chatgpt-ai.html) \[2023-03-08][Noam Chomsky] 361 | - [Is ChatGPT 175 Billion Parameters? Technical Analysis](https://orenleung.super.site/is-chatgpt-175-billion-parameters-technical-analysis) \[2023-03-04][Owen] 362 | - [The Next Generation Of Large Language Models ](https://www.notion.so/Awesome-LLM-40c8aa3f2b444ecc82b79ae8bbd2696b) \[2023-02-07][Forbes] 363 | - [Large Language Model Training in 2023](https://research.aimultiple.com/large-language-model-training/) \[2023-02-03][Cem Dilmegani] 364 | - [What Are Large Language Models Used For? 
](https://www.notion.so/Awesome-LLM-40c8aa3f2b444ecc82b79ae8bbd2696b) \[2023-01-26][NVIDIA] 365 | - [Large Language Models: A New Moore's Law ](https://huggingface.co/blog/large-language-models) \[2021-10-26\]\[Huggingface\] 366 | 367 | 368 | ## Other Awesome Lists 369 | 370 | - [LLMsPracticalGuide](https://github.com/Mooler0410/LLMsPracticalGuide) 371 | - [Awesome ChatGPT Prompts](https://github.com/f/awesome-chatgpt-prompts) 372 | - [awesome-chatgpt-prompts-zh](https://github.com/PlexPt/awesome-chatgpt-prompts-zh) 373 | - [Awesome ChatGPT](https://github.com/humanloop/awesome-chatgpt) 374 | - [Chain-of-Thoughts Papers](https://github.com/Timothyxxx/Chain-of-ThoughtsPapers) 375 | - [Instruction-Tuning-Papers](https://github.com/SinclairCoder/Instruction-Tuning-Papers) 376 | - [LLM Reading List](https://github.com/crazyofapple/Reading_groups/) 377 | - [Reasoning using Language Models](https://github.com/atfortes/LM-Reasoning-Papers) 378 | - [Chain-of-Thought Hub](https://github.com/FranxYao/chain-of-thought-hub) 379 | - [Awesome GPT](https://github.com/formulahendry/awesome-gpt) 380 | - [Awesome GPT-3](https://github.com/elyase/awesome-gpt3) 381 | - [Awesome LLM Human Preference Datasets](https://github.com/PolisAI/awesome-llm-human-preference-datasets) 382 | - [RWKV-howto](https://github.com/Hannibal046/RWKV-howto) 383 | - *[Amazing-Bard-Prompts](https://github.com/dsdanielpark/amazing-bard-prompts)* 384 | 385 | ## Other Useful Resources 386 | 387 | - [Arize-Phoenix](https://phoenix.arize.com/) 388 | - [Emergent Mind](https://www.emergentmind.com) 389 | - [ShareGPT](https://sharegpt.com) 390 | - [Major LLMs + Data Availability](https://docs.google.com/spreadsheets/d/1bmpDdLZxvTCleLGVPgzoMTQ0iDP2-7v7QziPrzPdHyM/edit#gid=0) 391 | - [500+ Best AI Tools](https://vaulted-polonium-23c.notion.site/500-Best-AI-Tools-e954b36bf688404ababf74a13f98d126) 392 | - [Cohere Summarize Beta](https://txt.cohere.ai/summarize-beta/) 393 | - [chatgpt-wrapper](https://github.com/mmabrouk/chatgpt-wrapper) 394 | - [Open-evals](https://github.com/open-evals/evals) 395 | - [Cursor](https://www.cursor.so) 396 | - [AutoGPT](https://github.com/Significant-Gravitas/Auto-GPT) 397 | - [OpenAGI](https://github.com/agiresearch/OpenAGI) 398 | - [HuggingGPT](https://github.com/microsoft/JARVIS) 399 | 400 | 401 |
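As a follow-up to the Parameter-Efficient Fine-Tuning (PEFT) section above, here is a minimal sketch of wrapping a causal language model with a LoRA adapter using the `huggingface/peft` library. The base model and hyperparameters below are illustrative assumptions, not recommendations from this list.

```python
# Minimal, hedged PEFT/LoRA sketch: only the small adapter weights become trainable.
# Assumes `pip install transformers peft torch`; the base model is an arbitrary small example.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-560m")  # illustrative choice

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,  # causal language modeling
    r=8,                           # rank of the LoRA update matrices
    lora_alpha=32,                 # scaling factor applied to the LoRA updates
    lora_dropout=0.05,
)

# Wrap the base model so that only the LoRA adapter parameters require gradients.
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # prints trainable vs. total parameter counts

# `model` can now be passed to a standard transformers training loop;
# gradients flow only through the adapter parameters.
```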
402 | 403 | # Contribute 404 | Since this repository focuses on collecting datasets for LLMs, you are welcome to contribute and add datasets in any form you prefer. 405 | 406 | 407 | # References 408 | - [1] https://github.com/KennethanCeyer/awesome-llm
409 | - [2] https://github.com/Hannibal046/Awesome-LLM
410 | - [3] https://github.com/Zjh-819/LLMDataHub
411 | - [4] https://huggingface.co/datasets
412 | -------------------------------------------------------------------------------- /assets/totalplot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dsdanielpark/open-llm-datasets/5b0abd9915038f6800835ef1d6b533b62a35109f/assets/totalplot.png --------------------------------------------------------------------------------