├── .gitignore
├── LICENSE
├── README.md
└── assets
    └── totalplot.png
/.gitignore:
--------------------------------------------------------------------------------
1 | notebooks
2 | data
3 |
4 | # Byte-compiled / optimized / DLL files
5 | __pycache__/
6 | *.py[cod]
7 | *$py.class
8 |
9 | # C extensions
10 | *.so
11 |
12 | # Distribution / packaging
13 | .Python
14 | build/
15 | develop-eggs/
16 | dist/
17 | downloads/
18 | eggs/
19 | .eggs/
20 | lib/
21 | lib64/
22 | parts/
23 | sdist/
24 | var/
25 | wheels/
26 | share/python-wheels/
27 | *.egg-info/
28 | .installed.cfg
29 | *.egg
30 | MANIFEST
31 |
32 | # PyInstaller
33 | # Usually these files are written by a python script from a template
34 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
35 | *.manifest
36 | *.spec
37 |
38 | # Installer logs
39 | pip-log.txt
40 | pip-delete-this-directory.txt
41 |
42 | # Unit test / coverage reports
43 | htmlcov/
44 | .tox/
45 | .nox/
46 | .coverage
47 | .coverage.*
48 | .cache
49 | nosetests.xml
50 | coverage.xml
51 | *.cover
52 | *.py,cover
53 | .hypothesis/
54 | .pytest_cache/
55 | cover/
56 |
57 | # Translations
58 | *.mo
59 | *.pot
60 |
61 | # Django stuff:
62 | *.log
63 | local_settings.py
64 | db.sqlite3
65 | db.sqlite3-journal
66 |
67 | # Flask stuff:
68 | instance/
69 | .webassets-cache
70 |
71 | # Scrapy stuff:
72 | .scrapy
73 |
74 | # Sphinx documentation
75 | docs/_build/
76 |
77 | # PyBuilder
78 | .pybuilder/
79 | target/
80 |
81 | # Jupyter Notebook
82 | .ipynb_checkpoints
83 |
84 | # IPython
85 | profile_default/
86 | ipython_config.py
87 |
88 | # pyenv
89 | # For a library or package, you might want to ignore these files since the code is
90 | # intended to run in multiple environments; otherwise, check them in:
91 | # .python-version
92 |
93 | # pipenv
94 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
95 | # However, in case of collaboration, if having platform-specific dependencies or dependencies
96 | # having no cross-platform support, pipenv may install dependencies that don't work, or not
97 | # install all needed dependencies.
98 | #Pipfile.lock
99 |
100 | # poetry
101 | # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
102 | # This is especially recommended for binary packages to ensure reproducibility, and is more
103 | # commonly ignored for libraries.
104 | # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
105 | #poetry.lock
106 |
107 | # pdm
108 | # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
109 | #pdm.lock
110 | # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
111 | # in version control.
112 | # https://pdm.fming.dev/#use-with-ide
113 | .pdm.toml
114 |
115 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
116 | __pypackages__/
117 |
118 | # Celery stuff
119 | celerybeat-schedule
120 | celerybeat.pid
121 |
122 | # SageMath parsed files
123 | *.sage.py
124 |
125 | # Environments
126 | .env
127 | .venv
128 | env/
129 | venv/
130 | ENV/
131 | env.bak/
132 | venv.bak/
133 |
134 | # Spyder project settings
135 | .spyderproject
136 | .spyproject
137 |
138 | # Rope project settings
139 | .ropeproject
140 |
141 | # mkdocs documentation
142 | /site
143 |
144 | # mypy
145 | .mypy_cache/
146 | .dmypy.json
147 | dmypy.json
148 |
149 | # Pyre type checker
150 | .pyre/
151 |
152 | # pytype static type analyzer
153 | .pytype/
154 |
155 | # Cython debug symbols
156 | cython_debug/
157 |
158 | # PyCharm
159 | # JetBrains specific template is maintained in a separate JetBrains.gitignore that can
160 | # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
161 | # and can be added to the global gitignore or merged into this file. For a more nuclear
162 | # option (not recommended) you can uncomment the following to ignore the entire idea folder.
163 | #.idea/
164 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2023 Daniel Park
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 | # Open-LLM-datasets
5 | A repository for organizing datasets used by open LLMs.
6 |
7 |
8 |
9 |
10 | # Table of Contents
11 | - [Datasets](#datasets)
12 | - [General Open Access Datasets for Alignment](#general-open-access-datasets-for-alignment)
13 | - [Open Datasets for Pretraining](#open-datasets-for-pretraining)
14 | - [Domain-specific datasets and Private datasets](#domain-specific-datasets-and-private-datasets)
15 | - [Potential Overlap](#potential-overlap)
16 | - [Papers](#papers)
17 | - [Pre-trained LLM](#pre-trained-llm)
18 | - [Instruction finetuned LLM](#instruction-finetuned-llm)
19 | - [Aligned LLM](#aligned-llm)
20 | - [Open LLM](#open-llm)
21 | - [LLM Training Frameworks](#llm-training-frameworks)
22 | - [LLM Optimization](#llm-optimization)
23 | - [State-of-the-art Parameter-Efficient Fine-Tuning (PEFT) methods](#state-of-the-art-parameter-efficient-fine-tuning-peft-methods)
24 | - [Tools for deploying LLM](#tools-for-deploying-llm)
25 | - [Tutorials about LLM](#tutorials-about-llm)
26 | - [Courses about LLM](#courses-about-llm)
27 | - [Opinions about LLM](#opinions-about-llm)
28 | - [Other Awesome Lists](#other-awesome-lists)
29 | - [Other Useful Resources](#other-useful-resources)
30 | - [Contribute](#contribute)
31 | - [References](#references)
32 |
33 |
34 |
35 |
36 |
37 |
38 | # Datasets
39 | To download the most commonly used datasets or browse information about them, see the Hugging Face Hub: https://huggingface.co/datasets (a minimal loading sketch follows).
40 |
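Most of the alignment datasets listed below are hosted on the Hugging Face Hub and can be pulled programmatically. The following is a minimal sketch (not part of this repository) that assumes the `datasets` package is installed; `Anthropic/hh-rlhf` from the alignment list below is used purely as an example, and any other Hub dataset id works the same way.

```python
# Minimal sketch: load one of the listed alignment datasets from the Hugging Face Hub.
# Assumes `pip install datasets`; the dataset id is only an example from the list below.
from datasets import load_dataset

dataset = load_dataset("Anthropic/hh-rlhf", split="train")

print(dataset)               # row count and column names ("chosen", "rejected")
print(dataset[0]["chosen"])  # preferred response of the first preference pair
```
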
41 | ## General Open Access Datasets for Alignment
42 | - [falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)
43 | - [ultraChat](https://huggingface.co/datasets/stingning/ultrachat)
44 | - [ShareGPT_Vicuna_unfiltered](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered)
45 | - [pku-saferlhf-dataset](https://github.com/PKU-Alignment/safe-rlhf#pku-saferlhf-dataset)
46 | - [RefGPT-Dataset](https://github.com/ziliwangnlp/RefGPT)
47 | - [Luotuo-QA-A-CoQA-Chinese](https://huggingface.co/datasets/silk-road/Luotuo-QA-A-CoQA-Chinese)
48 | - [Wizard-LM-Chinese-instruct-evol](https://huggingface.co/datasets/silk-road/Wizard-LM-Chinese-instruct-evol)
49 | - [alpaca_chinese_dataset](https://github.com/hikariming/alpaca_chinese_dataset)
50 | - [Zhihu-KOL](https://huggingface.co/datasets/wangrui6/Zhihu-KOL)
51 | - [Alpaca-GPT-4_zh-cn](https://huggingface.co/datasets/shibing624/alpaca-zh)
52 | - [Baize Dataset](https://github.com/project-baize/baize-chatbot/tree/main/data)
53 | - [h2oai/h2ogpt-fortune2000-personalized](https://huggingface.co/datasets/h2oai/h2ogpt-fortune2000-personalized)
54 | - [SHP](https://huggingface.co/datasets/stanfordnlp/SHP)
55 | - [ELI5](https://huggingface.co/datasets/eli5#source-data)
56 | - [evol_instruct_70k](https://huggingface.co/datasets/victor123/evol_instruct_70k)
57 | - [MOSS SFT data](https://github.com/OpenLMLab/MOSS/tree/main/SFT_data)
58 | - [ShareGPT52K](https://huggingface.co/datasets/RyokoAI/ShareGPT52K)
59 | - [GPT-4all Dataset](https://huggingface.co/datasets/nomic-ai/gpt4all-j-prompt-generations)
60 | - [COIG](https://huggingface.co/datasets/BAAI/COIG)
61 | - [RedPajama-Data-1T](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T)
62 | - [OpenAssistant Conversations Dataset (OASST1)](https://huggingface.co/datasets/OpenAssistant/oasst1)
63 | - [Alpaca-CoT](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT)
64 | - [CBook-150K](https://github.com/FudanNLPLAB/CBook-150K)
65 | - [databricks-dolly-15k](https://github.com/databrickslabs/dolly/tree/master/data) ([possible zh-cn version](https://huggingface.co/datasets/jaja7744/dolly-15k-cn))
66 | - [AlpacaDataCleaned](https://github.com/gururise/AlpacaDataCleaned)
67 | - [GPT-4-LLM Dataset](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM)
68 | - [GPTeacher](https://github.com/teknium1/GPTeacher)
69 | - [HC3](https://github.com/Hello-SimpleAI/chatgpt-comparison-detection)
70 | - [Alpaca data](https://github.com/tatsu-lab/stanford_alpaca#data-release) [Download](https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json)
71 | - [OIG](https://huggingface.co/datasets/laion/OIG) [OIG-small-chip2](https://huggingface.co/datasets/0-hero/OIG-small-chip2)
72 | - [ChatAlpaca data](https://github.com/cascip/ChatAlpaca)
73 | - [InstructionWild](https://github.com/XueFuzhao/InstructionWild)
74 | - [Firefly(流萤)](https://huggingface.co/datasets/YeungNLP/firefly-train-1.1M)
75 | - [BELLE](https://github.com/LianjiaTech/BELLE) [0.5M version](https://huggingface.co/datasets/BelleGroup/train_0.5M_CN) [1M version](https://huggingface.co/datasets/BelleGroup/train_1M_CN) [2M version](https://huggingface.co/datasets/BelleGroup/train_2M_CN)
76 | - [GuanacoDataset](https://huggingface.co/datasets/JosephusCheung/GuanacoDataset#guanacodataset)
77 | - [xP3 (and some variants)](https://huggingface.co/datasets/bigscience/xP3)
78 | - [OpenAI WebGPT](https://huggingface.co/datasets/openai/webgpt_comparisons)
79 | - [OpenAI Summarization Comparison](https://huggingface.co/datasets/openai/summarize_from_feedback)
80 | - [Natural Instructions](https://instructions.apps.allenai.org/) [GitHub & Download](https://github.com/allenai/natural-instructions)
81 | - [hh-rlhf](https://github.com/anthropics/hh-rlhf) [on Huggingface](https://huggingface.co/datasets/Anthropic/hh-rlhf)
82 | - [OpenAI PRM800k](https://github.com/openai/prm800k)
83 |
84 |
85 | ## Open Datasets for Pretraining
86 |
87 | - [falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)
88 | - [Common Crawl](https://commoncrawl.org/)
89 | - [nlp_Chinese_Corpus](https://github.com/brightmart/nlp_chinese_corpus)
90 | - [The Pile (V1)](https://pile.eleuther.ai/)
91 | - [Huggingface dataset for C4](https://huggingface.co/datasets/c4)
92 | - [TensorFlow dataset for C4](https://www.tensorflow.org/datasets/catalog/c4)
93 | - [ROOTS](https://huggingface.co/bigscience-data)
94 | - [Pushshift Reddit](https://files.pushshift.io/reddit/)
95 | - [Gutenberg project](https://www.gutenberg.org/policy/robot_access.html)
96 | - [CLUECorpus](https://github.com/CLUEbenchmark/CLUE)
97 |
98 |
99 |
100 | ## Domain-specific datasets and Private datasets
101 |
102 | - [ChatGPT-Jailbreak-Prompts](https://huggingface.co/datasets/rubend18/ChatGPT-Jailbreak-Prompts)
103 | - [awesome-chinese-legal-resources](https://github.com/pengxiao-song/awesome-chinese-legal-resources)
104 | - [Long Form](https://github.com/akoksal/LongForm)
105 | - [symbolic-instruction-tuning](https://huggingface.co/datasets/sail/symbolic-instruction-tuning)
106 | - [Safety Prompt](https://github.com/thu-coai/Safety-Prompts)
107 | - [Tapir-Cleaned](https://huggingface.co/datasets/MattiaL/tapir-cleaned-116k)
108 | - [instructional_codesearchnet_python](https://huggingface.co/datasets/Nan-Do/instructional_codesearchnet_python)
109 | - [finance-alpaca](https://huggingface.co/datasets/gbharti/finance-alpaca)
110 | - WebText (Reddit links) - Private Dataset
111 | - MassiveText - Private Dataset
112 | - [Korean-Open-LLM-Datasets](https://github.com/dsdanielpark/Korean-Open-LLM-Datasets)
113 |
114 |
115 | ## Potential Overlap
116 | | | OIG | hh-rlhf | xP3 | Natural instruct | AlpacaDataCleaned | GPT-4-LLM | Alpaca-CoT |
117 | |-------------------|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
118 | | OIG | - | Contains | Overlap | Overlap | Overlap | | Overlap |
119 | | hh-rlhf | Part of | - | | | | | Overlap |
120 | | xP3 | Overlap | | - | Overlap | | | Overlap |
121 | | Natural instruct | Overlap | | Overlap | - | | | Overlap |
122 | | AlpacaDataCleaned | Overlap | | | | - | Overlap | Overlap |
123 | | GPT-4-LLM | | | | | Overlap | - | Overlap |
124 | | Alpaca-CoT | Overlap | Overlap | Overlap | Overlap | Overlap | Overlap | - |
125 |
126 |
127 |
128 | # Papers
129 |
130 | - [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf)
131 | - [Improving Language Understanding by Generative Pre-Training](https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf)
132 | - [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://aclanthology.org/N19-1423.pdf)
133 | - [Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
134 | - [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/pdf/1909.08053.pdf)
135 | - [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://jmlr.org/papers/v21/20-074.html)
136 | - [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/pdf/1910.02054.pdf)
137 | - [Scaling Laws for Neural Language Models](https://arxiv.org/pdf/2001.08361.pdf)
138 | - [Language models are few-shot learners](https://papers.nips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf)
139 | - [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/pdf/2101.03961.pdf)
140 | - [Evaluating Large Language Models Trained on Code](https://arxiv.org/pdf/2107.03374.pdf)
141 | - [On the Opportunities and Risks of Foundation Models](https://arxiv.org/pdf/2108.07258.pdf)
142 | - [Finetuned Language Models are Zero-Shot Learners](https://openreview.net/forum?id=gEZrGCozdqR)
143 | - [Multitask Prompted Training Enables Zero-Shot Task Generalization](https://arxiv.org/abs/2110.08207)
144 | - [GLaM: Efficient Scaling of Language Models with Mixture-of-Experts](https://arxiv.org/pdf/2112.06905.pdf)
145 | - [WebGPT: Improving the Factual Accuracy of Language Models through Web Browsing](https://openai.com/blog/webgpt/)
146 | - [Improving language models by retrieving from trillions of tokens](https://www.deepmind.com/publications/improving-language-models-by-retrieving-from-trillions-of-tokens)
147 | - [Scaling Language Models: Methods, Analysis & Insights from Training Gopher](https://arxiv.org/pdf/2112.11446.pdf)
148 | - [Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/pdf/2201.11903.pdf)
149 | - [LaMDA: Language Models for Dialog Applications](https://arxiv.org/pdf/2201.08239.pdf)
150 | - [Solving Quantitative Reasoning Problems with Language Models](https://arxiv.org/abs/2206.14858)
151 | - [Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model](https://arxiv.org/pdf/2201.11990.pdf)
152 | - [Training language models to follow instructions with human feedback](https://arxiv.org/pdf/2203.02155.pdf)
153 | - [PaLM: Scaling Language Modeling with Pathways](https://arxiv.org/pdf/2204.02311.pdf)
154 | - [An empirical analysis of compute-optimal large language model training](https://www.deepmind.com/publications/an-empirical-analysis-of-compute-optimal-large-language-model-training)
155 | - [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/pdf/2205.01068.pdf)
156 | - [Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131v1)
157 | - [Emergent Abilities of Large Language Models](https://openreview.net/pdf?id=yzkSU5zdwD)
158 | - [Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models](https://github.com/google/BIG-bench)
159 | - [Language Models are General-Purpose Interfaces](https://arxiv.org/pdf/2206.06336.pdf)
160 | - [Improving alignment of dialogue agents via targeted human judgements](https://arxiv.org/pdf/2209.14375.pdf)
161 | - [Scaling Instruction-Finetuned Language Models](https://arxiv.org/pdf/2210.11416.pdf)
162 | - [GLM-130B: An Open Bilingual Pre-trained Model](https://arxiv.org/pdf/2210.02414.pdf)
163 | - [Holistic Evaluation of Language Models](https://arxiv.org/pdf/2211.09110.pdf)
164 | - [BLOOM: A 176B-Parameter Open-Access Multilingual Language Model](https://arxiv.org/pdf/2211.05100.pdf)
165 | - [Galactica: A Large Language Model for Science](https://arxiv.org/pdf/2211.09085.pdf)
166 | - [OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization](https://arxiv.org/pdf/2212.12017)
167 | - [The Flan Collection: Designing Data and Methods for Effective Instruction Tuning](https://arxiv.org/pdf/2301.13688.pdf)
168 | - [LLaMA: Open and Efficient Foundation Language Models](https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/)
169 | - [Language Is Not All You Need: Aligning Perception with Language Models](https://arxiv.org/abs/2302.14045)
170 | - [PaLM-E: An Embodied Multimodal Language Model](https://palm-e.github.io)
171 | - [GPT-4 Technical Report](https://openai.com/research/gpt-4)
172 | - [Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling](https://arxiv.org/abs/2304.01373)
173 | - [Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision](https://arxiv.org/abs/2305.03047)
174 | - [PaLM 2 Technical Report](https://ai.google/static/documents/palm2techreport.pdf)
175 | - [RWKV: Reinventing RNNs for the Transformer Era](https://arxiv.org/abs/2305.13048)
176 | - [Let’s Verify Step by Step (OpenAI)](https://cdn.openai.com/improving-mathematical-reasoning-with-process-supervision/Lets_Verify_Step_by_Step.pdf)
177 |
178 |
179 | ## Pre-trained LLM
180 | - Switch Transformer: [Paper](https://arxiv.org/pdf/2101.03961.pdf)
181 | - GLaM: [Paper](https://arxiv.org/pdf/2112.06905.pdf)
182 | - PaLM: [Paper](https://arxiv.org/pdf/2204.02311.pdf)
183 | - MT-NLG: [Paper](https://arxiv.org/pdf/2201.11990.pdf)
184 | - J1-Jumbo: [api](https://docs.ai21.com/docs/complete-api), [Paper](https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf)
185 | - OPT: [api](https://opt.alpa.ai), [ckpt](https://github.com/facebookresearch/metaseq/tree/main/projects/OPT), [Paper](https://arxiv.org/pdf/2205.01068.pdf), [OPT-175B License Agreement](https://github.com/facebookresearch/metaseq/blob/edefd4a00c24197486a3989abe28ca4eb3881e59/projects/OPT/MODEL_LICENSE.md)
186 | - BLOOM: [api](https://huggingface.co/bigscience/bloom), [ckpt](https://huggingface.co/bigscience/bloom), [Paper](https://arxiv.org/pdf/2211.05100.pdf), [BigScience RAIL License v1.0](https://huggingface.co/spaces/bigscience/license)
187 | - GPT 3.0: [api](https://openai.com/api/), [Paper](https://arxiv.org/pdf/2005.14165.pdf)
188 | - LaMDA: [Paper](https://arxiv.org/pdf/2201.08239.pdf)
189 | - GLM: [ckpt](https://github.com/THUDM/GLM-130B), [Paper](https://arxiv.org/pdf/2210.02414.pdf), [The GLM-130B License](https://github.com/THUDM/GLM-130B/blob/799837802264eb9577eb9ae12cd4bad0f355d7d6/MODEL_LICENSE)
190 | - YaLM: [ckpt](https://github.com/yandex/YaLM-100B), [Blog](https://medium.com/yandex/yandex-publishes-yalm-100b-its-the-largest-gpt-like-neural-network-in-open-source-d1df53d0e9a6), [Apache 2.0 License](https://github.com/yandex/YaLM-100B/blob/14fa94df2ebbbd1864b81f13978f2bf4af270fcb/LICENSE)
191 | - LLaMA: [ckpt](https://github.com/facebookresearch/llama), [Paper](https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/), [Non-commercial bespoke license](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md)
192 | - GPT-NeoX: [ckpt](https://github.com/EleutherAI/gpt-neox), [Paper](https://arxiv.org/pdf/2204.06745.pdf), [Apache 2.0 License](https://github.com/EleutherAI/gpt-neox/blob/main/LICENSE)
193 | - UL2: [ckpt](https://huggingface.co/google/ul2), [Paper](https://arxiv.org/pdf/2205.05131v1.pdf), [Apache 2.0 License](https://huggingface.co/google/ul2)
194 | - T5: [ckpt](https://huggingface.co/t5-11b), [Paper](https://jmlr.org/papers/v21/20-074.html), [Apache 2.0 License](https://huggingface.co/t5-11b)
195 | - CPM-Bee: [api](https://live.openbmb.org/models/bee), [Paper](https://arxiv.org/pdf/2012.00413.pdf)
196 | - rwkv-4: [ckpt](https://huggingface.co/BlinkDL/rwkv-4-pile-7b), [Github](https://github.com/BlinkDL/RWKV-LM), [Apache 2.0 License](https://huggingface.co/BlinkDL/rwkv-4-pile-7b)
197 | - GPT-J: [ckpt](https://huggingface.co/EleutherAI/gpt-j-6B), [Github](https://github.com/kingoflolz/mesh-transformer-jax), [Apache 2.0 License](https://huggingface.co/EleutherAI/gpt-j-6b)
198 | - GPT-Neo: [ckpt](https://github.com/EleutherAI/gpt-neo), [Github](https://github.com/EleutherAI/gpt-neo), [MIT License](https://github.com/EleutherAI/gpt-neo/blob/23485e3c7940560b3b4cb12e0016012f14d03fc7/LICENSE)
199 |
200 |
201 |
202 | ## Instruction finetuned LLM
203 | - Flan-PaLM: [Link](https://arxiv.org/pdf/2210.11416.pdf)
204 | - BLOOMZ: [Link](https://huggingface.co/bigscience/bloomz)
205 | - InstructGPT: [Link](https://platform.openai.com/overview)
206 | - Galactica: [Link](https://huggingface.co/facebook/galactica-120b)
207 | - OpenChatKit: [Link](https://github.com/togethercomputer/OpenChatKit)
208 | - Flan-UL2: [Link](https://github.com/google-research/google-research/tree/master/ul2)
209 | - Flan-T5: [Link](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints)
210 | - T0: [Link](https://huggingface.co/bigscience/T0)
211 | - Alpaca: [Link](https://crfm.stanford.edu/alpaca/)
212 |
213 |
214 | ## Aligned LLM
215 | - GPT 4: [Blog](https://openai.com/research/gpt-4)
216 | - ChatGPT: [Demo](https://openai.com/blog/chatgpt/) | [API](https://share.hsforms.com/1u4goaXwDRKC9-x9IvKno0A4sk30)
217 | - Sparrow: [Paper](https://arxiv.org/pdf/2209.14375.pdf)
218 | - Claude: [Demo](https://poe.com/claude) | [API](https://www.anthropic.com/earlyaccess)
219 |
220 |
221 |
222 |
223 | # Open LLM
224 |
225 | ## LLM Leaderboard
226 | 
227 | - Visualization of the Open LLM Leaderboard: https://github.com/dsdanielpark/Open-LLM-Leaderboard-Report
228 | - Open LLM Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
229 |
230 |
231 |
232 | - [LLaMA](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/) - A foundational, 65-billion-parameter large language model.
233 | - [Alpaca](https://crfm.stanford.edu/2023/03/13/alpaca.html) - A model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations.
234 | - [Flan-Alpaca](https://github.com/declare-lab/flan-alpaca) - Instruction Tuning from Humans and Machines.
235 | - [Baize](https://github.com/project-baize/baize-chatbot) - Baize is an open-source chat model trained with LoRA.
236 | - [Cabrita](https://github.com/22-hours/cabrita) - A Portuguese finetuned instruction LLaMA.
237 | - [Vicuna](https://github.com/lm-sys/FastChat) - An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality.
238 | - [Llama-X](https://github.com/AetherCortex/Llama-X) - Open Academic Research on Improving LLaMA to SOTA LLM.
239 | - [Chinese-Vicuna](https://github.com/Facico/Chinese-Vicuna) - A Chinese Instruction-following LLaMA-based Model.
240 | - [GPTQ-for-LLaMA](https://github.com/qwopqwop200/GPTQ-for-LLaMa) - 4 bits quantization of LLaMA using GPTQ.
241 | - [GPT4All](https://github.com/nomic-ai/gpt4all) - Demo, data, and code to train an open-source, assistant-style large language model based on GPT-J and LLaMA.
242 | - [Koala](https://bair.berkeley.edu/blog/2023/04/03/koala/) - A Dialogue Model for Academic Research.
243 | - [BELLE](https://github.com/LianjiaTech/BELLE) - Be Everyone's Large Language model Engine.
244 | - [StackLLaMA](https://huggingface.co/blog/stackllama) - A hands-on guide to train LLaMA with RLHF.
245 | - [RedPajama](https://github.com/togethercomputer/RedPajama-Data) - An Open Source Recipe to Reproduce LLaMA training dataset.
246 | - [Chimera](https://github.com/FreedomIntelligence/LLMZoo) - Latin Phoenix.
247 | - [CaMA](https://github.com/zjunlp/CaMA) - a Chinese-English Bilingual LLaMA Model.
248 | - [BLOOM](https://huggingface.co/bigscience/bloom) - BigScience Large Open-science Open-access Multilingual Language Model.
249 | - [BLOOMZ&mT0](https://huggingface.co/bigscience/bloomz) - a family of models capable of following human instructions in dozens of languages zero-shot.
250 | - [Phoenix](https://github.com/FreedomIntelligence/LLMZoo)
251 | - [T5](https://arxiv.org/abs/1910.10683) - Text-to-Text Transfer Transformer.
252 | - [T0](https://arxiv.org/abs/2110.08207) - Multitask Prompted Training Enables Zero-Shot Task Generalization.
253 | - [OPT](https://arxiv.org/abs/2205.01068) - Open Pre-trained Transformer Language Models.
254 | - [UL2](https://arxiv.org/abs/2205.05131v1) - a unified framework for pretraining models that are universally effective across datasets and setups.
255 | - [GLM](https://github.com/THUDM/GLM) - GLM is a General Language Model pretrained with an autoregressive blank-filling objective and can be finetuned on various natural language understanding and generation tasks.
256 | - [ChatGLM-6B](https://github.com/THUDM/ChatGLM-6B) - ChatGLM-6B is an open-source bilingual (Chinese and English) dialogue language model based on the General Language Model (GLM) architecture.
257 | - [RWKV](https://github.com/BlinkDL/RWKV-LM) - Parallelizable RNN with Transformer-level LLM Performance.
258 | - [ChatRWKV](https://github.com/BlinkDL/ChatRWKV) - ChatRWKV is like ChatGPT but powered by the RWKV (100% RNN) language model.
259 | - [StableLM](https://stability.ai/blog/stability-ai-launches-the-first-of-its-stablelm-suite-of-language-models) - Stability AI Language Models.
260 | - [YaLM](https://medium.com/yandex/yandex-publishes-yalm-100b-its-the-largest-gpt-like-neural-network-in-open-source-d1df53d0e9a6) - a GPT-like neural network for generating and processing text.
261 | - [GPT-Neo](https://github.com/EleutherAI/gpt-neo) - An implementation of model & data parallel GPT3-like models.
262 | - [GPT-J](https://github.com/kingoflolz/mesh-transformer-jax/#gpt-j-6b) - A 6 billion parameter, autoregressive text generation model trained on The Pile.
263 | - [Dolly](https://www.databricks.com/blog/2023/03/24/hello-dolly-democratizing-magic-chatgpt-open-models.html) - a cheap-to-build LLM that exhibits a surprising degree of the instruction-following behavior seen in ChatGPT.
264 | - [Pythia](https://github.com/EleutherAI/pythia) - Interpreting Autoregressive Transformers Across Time and Scale (used as the example in the loading sketch after this list).
265 | - [Dolly 2.0](https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm) - the first open source, instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use.
266 | - [OpenFlamingo](https://github.com/mlfoundations/open_flamingo) - an open-source reproduction of DeepMind's Flamingo model.
267 | - [Cerebras-GPT](https://www.cerebras.net/blog/cerebras-gpt-a-family-of-open-compute-efficient-large-language-models/) - A Family of Open, Compute-efficient, Large Language Models.
268 | - [GALACTICA](https://github.com/paperswithcode/galai/blob/main/docs/model_card.md) - The GALACTICA models are trained on a large-scale scientific corpus.
269 | - [GALPACA](https://huggingface.co/GeorgiaTechResearchInstitute/galpaca-30b) - GALACTICA 30B fine-tuned on the Alpaca dataset.
270 | - [Palmyra](https://huggingface.co/Writer/palmyra-base) - Palmyra Base was primarily pre-trained with English text.
271 | - [Camel](https://huggingface.co/Writer/camel-5b-hf) - a state-of-the-art instruction-following large language model.
272 | - [h2oGPT](https://github.com/h2oai/h2ogpt)
273 | - [PanGu-α](https://openi.org.cn/pangu/) - PanGu-α is a 200B parameter autoregressive pretrained Chinese language model.
274 | - [MOSS](https://github.com/OpenLMLab/MOSS) - MOSS is an open-source dialogue language model that supports Chinese and English.
275 | - [Open-Assistant](https://github.com/LAION-AI/Open-Assistant) - a project meant to give everyone access to a great chat-based large language model.
276 | - [HuggingChat](https://huggingface.co/chat/) - a chat interface powered by Open Assistant's latest model and the Hugging Face Inference API.
277 | - [StarCoder](https://huggingface.co/blog/starcoder) - Hugging Face LLM for Code
278 | - [MPT-7B](https://www.mosaicml.com/blog/mpt-7b) - Open LLM for commercial use by MosaicML
279 |
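Because the models above ship open weights, most can be loaded directly from the Hugging Face Hub. The snippet below is an illustrative sketch rather than an official recipe: it assumes `transformers` and `torch` are installed and uses the small `EleutherAI/pythia-70m` checkpoint only to keep the example light; any other open checkpoint listed above can be substituted.

```python
# Illustrative sketch: load a small open checkpoint (Pythia-70M) and generate text.
# Assumes `pip install transformers torch`; swap in any other open model id as needed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/pythia-70m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Open LLM datasets are useful because", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```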
280 |
281 |
282 |
283 | ## LLM Training Frameworks
284 | - [Serving OPT-175B, BLOOM-176B and CodeGen-16B using Alpa](https://alpa.ai/tutorials/opt_serving.html)
285 | - [Alpa](https://github.com/alpa-projects/alpa)
286 | - [Megatron-LM GPT2 tutorial](https://www.deepspeed.ai/tutorials/megatron/)
287 | - [DeepSpeed Chat](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-chat)
288 | - [pretrain_gpt3_175B.sh](https://github.com/NVIDIA/Megatron-LM/blob/main/examples/pretrain_gpt3_175B.sh)
289 | - [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)
290 | - [deepspeed.ai](https://www.deepspeed.ai)
291 | - [DeepSpeed GitHub repo](https://github.com/microsoft/DeepSpeed) (a minimal usage sketch follows this list)
292 | - [Colossal-AI](https://colossalai.org)
293 | - [Open source solution replicates ChatGPT training process! Ready to go with only 1.6GB GPU memory and gives you 7.73 times faster training!](https://www.hpc-ai.tech/blog/colossal-ai-chatgpt)
294 | - [BMTrain](https://github.com/OpenBMB/BMTrain)
295 | - [Mesh TensorFlow `(mtf)`](https://github.com/tensorflow/mesh)
296 | - [JAX tutorial: Distributed arrays and automatic parallelization](https://jax.readthedocs.io/en/latest/notebooks/Distributed_arrays_and_automatic_parallelization.html)
297 |
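To give a concrete sense of how these frameworks are used, here is a minimal, illustrative DeepSpeed ZeRO sketch (not taken from any of the projects above). It assumes `deepspeed` and `torch` are installed, uses a toy `torch.nn.Linear` layer as a stand-in for a real model, and treats every configuration value as a placeholder rather than a recommendation; launch it with the `deepspeed` CLI.

```python
# Minimal, illustrative DeepSpeed ZeRO sketch (launch with: `deepspeed this_script.py`).
# The model and all config values below are placeholders, not recommendations.
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)  # toy stand-in for a real LLM

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
    "zero_optimization": {"stage": 2},  # shard optimizer states and gradients
}

# DeepSpeed wraps the model in an engine that handles backward() and step().
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

x = torch.randn(4, 1024).to(model_engine.device)
loss = model_engine(x).pow(2).mean()  # dummy loss for illustration
model_engine.backward(loss)
model_engine.step()
```
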
298 | ## LLM Optimization
299 | ### State-of-the-art Parameter-Efficient Fine-Tuning (PEFT) methods
300 | - **GitHub:** https://github.com/huggingface/peft
301 | - **abstract:** Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of pre-trained language models (PLMs) to various downstream applications without fine-tuning all of the model's parameters. Because fine-tuning large-scale PLMs is often prohibitively costly, PEFT methods fine-tune only a small number of (extra) model parameters, greatly decreasing computational and storage costs, while recent state-of-the-art PEFT techniques achieve performance comparable to full fine-tuning. The library is seamlessly integrated with Hugging Face Accelerate for large-scale models, leveraging DeepSpeed and Big Model Inference. A minimal LoRA sketch follows the list of supported methods below.
302 | - **Supported methods:**
304 |
305 | 1. LoRA: [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)
306 | 2. Prefix Tuning: [Prefix-Tuning: Optimizing Continuous Prompts for Generation](https://aclanthology.org/2021.acl-long.353/), [P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks](https://arxiv.org/pdf/2110.07602.pdf)
307 | 3. P-Tuning: [GPT Understands, Too](https://arxiv.org/abs/2103.10385)
308 | 4. Prompt Tuning: [The Power of Scale for Parameter-Efficient Prompt Tuning](https://arxiv.org/abs/2104.08691)
309 | 5. AdaLoRA: [Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning](https://arxiv.org/abs/2303.10512)
310 |
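The sketch below illustrates the idea behind these methods using LoRA and the `peft` library referenced above. It is an illustrative example rather than a recommended recipe: it assumes `peft` and `transformers` are installed, and `bigscience/bloomz-560m` together with the LoRA hyperparameters are placeholder choices.

```python
# Illustrative LoRA sketch with huggingface/peft; the model id and hyperparameters
# are placeholder assumptions, not recommendations.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-560m")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,  # decoder-only language modeling
    r=8,                           # rank of the injected low-rank matrices
    lora_alpha=32,                 # scaling applied to the LoRA updates
    lora_dropout=0.05,
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights is trainable
```

From here, `model` trains like any other `transformers` model, but gradients flow only into the injected adapter weights.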
311 |
312 |
313 | ## Tools for deploying LLM
314 |
315 | - [Haystack](https://haystack.deepset.ai/)
316 | - [Sidekick](https://github.com/ai-sidekick/sidekick)
317 | - [LangChain](https://github.com/hwchase17/langchain)
318 | - [wechat-chatgpt](https://github.com/fuergaosi233/wechat-chatgpt)
319 | - [oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui)
320 |
321 |
322 | ## Tutorials about LLM
323 | - [Andrej Karpathy] State of GPT [Video](https://build.microsoft.com/en-US/sessions/db3f4859-cd30-4445-a0cd-553c3304f8e2)
324 | - [Hyung Won Chung] Instruction finetuning and RLHF lecture [Youtube](https://www.youtube.com/watch?v=zjrM-MW-0y0)
325 | - [Jason Wei] Scaling, emergence, and reasoning in large language models [Slides](https://docs.google.com/presentation/d/1EUV7W7X_w0BDrscDhPg7lMGzJCkeaPkGCJ3bN8dluXc/edit?pli=1&resourcekey=0-7Nz5A7y8JozyVrnDtcEKJA#slide=id.g16197112905_0_0)
326 | - [Susan Zhang] Open Pretrained Transformers [Youtube](https://www.youtube.com/watch?v=p9IxoSkvZ-M&t=4s)
327 | - [Ameet Deshpande] How Does ChatGPT Work? [Slides](https://docs.google.com/presentation/d/1TTyePrw-p_xxUbi3rbmBI3QQpSsTI1btaQuAUvvNc8w/edit#slide=id.g206fa25c94c_0_24)
328 | - [Yao Fu] The Source of the Capability of Large Language Models: Pretraining, Instructional Fine-tuning, Alignment, and Specialization [Bilibili](https://www.bilibili.com/video/BV1Qs4y1h7pn/?spm_id_from=333.337.search-card.all.click&vd_source=1e55c5426b48b37e901ff0f78992e33f)
329 | - [Hung-yi Lee] ChatGPT: Analyzing the Principle [Youtube](https://www.youtube.com/watch?v=yiY4nPOzJEg&list=RDCMUC2ggjtuuWvxrHHHiaDH1dlQ&index=2)
330 | - [Jay Mody] GPT in 60 Lines of NumPy [Link](https://jaykmody.com/blog/gpt-from-scratch/)
331 | - [ICML 2022] Welcome to the "Big Model" Era: Techniques and Systems to Train and Serve Bigger Models [Link](https://icml.cc/virtual/2022/tutorial/18440)
332 | - [NeurIPS 2022] Foundational Robustness of Foundation Models [Link](https://nips.cc/virtual/2022/tutorial/55796)
333 | - [Andrej Karpathy] Let's build GPT: from scratch, in code, spelled out. [Video](https://www.youtube.com/watch?v=kCc8FmEb1nY)|[Code](https://github.com/karpathy/ng-video-lecture)
334 | - [DAIR.AI] Prompt Engineering Guide [Link](https://github.com/dair-ai/Prompt-Engineering-Guide)
335 | - [Philipp Schmid] Fine-tune FLAN-T5 XL/XXL using DeepSpeed & Hugging Face Transformers [Link](https://www.philschmid.de/fine-tune-flan-t5-deepspeed)
336 | - [HuggingFace] Illustrating Reinforcement Learning from Human Feedback (RLHF) [Link](https://huggingface.co/blog/rlhf)
337 | - [HuggingFace] What Makes a Dialog Agent Useful? [Link](https://huggingface.co/blog/dialog-agents)
338 | - [HeptaAI] ChatGPT Kernel: InstructGPT, PPO Reinforcement Learning Based on Feedback Instructions [Link](https://zhuanlan.zhihu.com/p/589747432)
339 | - [Yao Fu] How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to their Sources [Link](https://yaofu.notion.site/How-does-GPT-Obtain-its-Ability-Tracing-Emergent-Abilities-of-Language-Models-to-their-Sources-b9a57ac0fcf74f30a1ab9e3e36fa1dc1)
340 | - [Stephen Wolfram] What Is ChatGPT Doing ... and Why Does It Work? [Link](https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/)
341 | - [Jingfeng Yang] Why did all of the public reproduction of GPT-3 fail? [Link](https://jingfengyang.github.io/gpt)
342 | - [Hung-yi Lee] ChatGPT (possibly) How It Was Created - The Socialization Process of GPT [Video](https://www.youtube.com/watch?v=e0aKI2GGZNg)
343 | - [OpenAI] Improving Mathematical Reasoning with Process Supervision [Link](https://openai.com/research/improving-mathematical-reasoning-with-process-supervision)
344 |
345 |
346 | ## Courses about LLM
347 |
348 | - [DeepLearning.AI] ChatGPT Prompt Engineering for Developers [Homepage](https://www.deeplearning.ai/short-courses/chatgpt-prompt-engineering-for-developers/)
349 | - [Princeton] Understanding Large Language Models [Homepage](https://www.cs.princeton.edu/courses/archive/fall22/cos597G/)
350 | - [Stanford] CS224N-Lecture 11: Prompting, Instruction Finetuning, and RLHF [Slides](https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture11-prompting-rlhf.pdf)
351 | - [Stanford] CS324-Large Language Models [Homepage](https://stanford-cs324.github.io/winter2022/)
352 | - [Stanford] CS25-Transformers United V2 [Homepage](https://web.stanford.edu/class/cs25/)
353 | - [Stanford Webinar] GPT-3 & Beyond [Video](https://www.youtube.com/watch?v=-lnHHWRCDGk)
354 | - [MIT] Introduction to Data-Centric AI [Homepage](https://dcai.csail.mit.edu)
355 |
356 | ## Opinions about LLM
357 |
358 | - [Google "We Have No Moat, And Neither Does OpenAI"](https://www.semianalysis.com/p/google-we-have-no-moat-and-neither) \[2023-05-05\]
359 | - [AI competition statement](https://petergabriel.com/news/ai-competition-statement/) \[2023-04-20\]\[Peter Gabriel\]
360 | - [Noam Chomsky: The False Promise of ChatGPT](https://www.nytimes.com/2023/03/08/opinion/noam-chomsky-chatgpt-ai.html) \[2023-03-08\]\[Noam Chomsky\]
361 | - [Is ChatGPT 175 Billion Parameters? Technical Analysis](https://orenleung.super.site/is-chatgpt-175-billion-parameters-technical-analysis) \[2023-03-04\]\[Owen\]
362 | - [The Next Generation Of Large Language Models](https://www.notion.so/Awesome-LLM-40c8aa3f2b444ecc82b79ae8bbd2696b) \[2023-02-07\]\[Forbes\]
363 | - [Large Language Model Training in 2023](https://research.aimultiple.com/large-language-model-training/) \[2023-02-03\]\[Cem Dilmegani\]
364 | - [What Are Large Language Models Used For?](https://www.notion.so/Awesome-LLM-40c8aa3f2b444ecc82b79ae8bbd2696b) \[2023-01-26\]\[NVIDIA\]
365 | - [Large Language Models: A New Moore's Law](https://huggingface.co/blog/large-language-models) \[2021-10-26\]\[Hugging Face\]
366 |
367 |
368 | ## Other Awesome Lists
369 |
370 | - [LLMsPracticalGuide](https://github.com/Mooler0410/LLMsPracticalGuide)
371 | - [Awesome ChatGPT Prompts](https://github.com/f/awesome-chatgpt-prompts)
372 | - [awesome-chatgpt-prompts-zh](https://github.com/PlexPt/awesome-chatgpt-prompts-zh)
373 | - [Awesome ChatGPT](https://github.com/humanloop/awesome-chatgpt)
374 | - [Chain-of-Thoughts Papers](https://github.com/Timothyxxx/Chain-of-ThoughtsPapers)
375 | - [Instruction-Tuning-Papers](https://github.com/SinclairCoder/Instruction-Tuning-Papers)
376 | - [LLM Reading List](https://github.com/crazyofapple/Reading_groups/)
377 | - [Reasoning using Language Models](https://github.com/atfortes/LM-Reasoning-Papers)
378 | - [Chain-of-Thought Hub](https://github.com/FranxYao/chain-of-thought-hub)
379 | - [Awesome GPT](https://github.com/formulahendry/awesome-gpt)
380 | - [Awesome GPT-3](https://github.com/elyase/awesome-gpt3)
381 | - [Awesome LLM Human Preference Datasets](https://github.com/PolisAI/awesome-llm-human-preference-datasets)
382 | - [RWKV-howto](https://github.com/Hannibal046/RWKV-howto)
383 | - *[Amazing-Bard-Prompts](https://github.com/dsdanielpark/amazing-bard-prompts)*
384 |
385 | ## Other Useful Resources
386 |
387 | - [Arize-Phoenix](https://phoenix.arize.com/)
388 | - [Emergent Mind](https://www.emergentmind.com)
389 | - [ShareGPT](https://sharegpt.com)
390 | - [Major LLMs + Data Availability](https://docs.google.com/spreadsheets/d/1bmpDdLZxvTCleLGVPgzoMTQ0iDP2-7v7QziPrzPdHyM/edit#gid=0)
391 | - [500+ Best AI Tools](https://vaulted-polonium-23c.notion.site/500-Best-AI-Tools-e954b36bf688404ababf74a13f98d126)
392 | - [Cohere Summarize Beta](https://txt.cohere.ai/summarize-beta/)
393 | - [chatgpt-wrapper](https://github.com/mmabrouk/chatgpt-wrapper)
394 | - [Open-evals](https://github.com/open-evals/evals)
395 | - [Cursor](https://www.cursor.so)
396 | - [AutoGPT](https://github.com/Significant-Gravitas/Auto-GPT)
397 | - [OpenAGI](https://github.com/agiresearch/OpenAGI)
398 | - [HuggingGPT](https://github.com/microsoft/JARVIS)
399 |
400 |
401 |
402 |
403 | # Contribute
404 | This repository focuses on collecting datasets for open LLMs, so contributions that add or update datasets in any form are welcome.
405 |
406 |
407 | # References
408 | [1] https://github.com/KennethanCeyer/awesome-llm
409 | [2] https://github.com/Hannibal046/Awesome-LLM
410 | [3] https://github.com/Zjh-819/LLMDataHub
411 | [4] https://huggingface.co/datasets
412 |
--------------------------------------------------------------------------------
/assets/totalplot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dsdanielpark/open-llm-datasets/5b0abd9915038f6800835ef1d6b533b62a35109f/assets/totalplot.png
--------------------------------------------------------------------------------