1 | [](https://www.nlpfromscratch.com/) 2 | 3 | 4 | # Master NLP and LLM Resource List 5 | 6 | This is the master resource list for [NLP from scratch](https://www.nlpfromscratch.com). This is a living document and will continually be updated and so should always be considered a work in progress. If you find any dead links or other issues, feel free to [submit an issue](https://github.com/nlpfromscratch/nlp-llms-resources/issues/new/choose). 7 | 8 | This document is quite large, so you may wish to use the Table of Contents automatically generated by Github to find what you are looking for: 9 | 10 |

11 | 12 | 13 | 14 |

15 | 16 | Thanks, and enjoy! 17 | 18 | 19 | ## Traditional NLP 20 | 21 | 22 | ### Datasets 23 | 24 | 25 | 26 | * [nlp-datasets](https://github.com/niderhoff/nlp-datasets): Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP) 27 | * [awesome-public-datasets - Natural Language](https://github.com/awesomedata/awesome-public-datasets#natural-language): Natural language section of the awesome public datasets github page 28 | * [SMS Spam Dataset:](https://archive.ics.uci.edu/dataset/228/sms+spam+collection) The “Hello World” of NLP datasets, ~55K SMS messages with label of spam/not spam for binary classification. Hosted on UC Irvine Machine Learning repository. 29 | * [IMDB dataset:](https://ai.stanford.edu/~amaas/data/sentiment/) The other “Hello World” of datasets for NLP, 50K “highly polar” movie reviews scraped from IMDB and compiled by Andrew Maas of Stanford. 30 | * [Twitter Airline Sentiment:](https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment) Tweets from February of 2015 and associated sentiment labels at major US airlines - hosted on Kaggle (~3.5MB) 31 | * [CivilCommentst](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data): Dataset from the Civil Comments platform which shut down in 2017. 2M public comments with labels for toxicity, obscenity, threat, insulting, etc. 32 | * [Cornell Movie Dialog](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html): ~220K conversations from 10K pairs of characters across 617 popular movies, compiled by Cristian Danescu-Niculescu-Mizil of Cornell. Tabular compiled format [available on Hugging Face](https://huggingface.co/datasets/cornell_movie_dialog). 33 | * [CNN Daily Mail](https://github.com/abisee/cnn-dailymail): “Hello World” dataset for summarization, consisting of articles from CNN and Daily Mail and accompanying summaries. Also available through [Tensorflow](https://www.tensorflow.org/datasets/catalog/cnn_dailymail) and via [Hugging Face](https://huggingface.co/datasets/cnn_dailymail). 34 | * [Entity Recognition Datasets](https://github.com/juand-r/entity-recognition-datasets): Very large list of named entity recognition (NER) datasets (on Github). 35 | * [WikiNER](https://metatext.io/datasets/wikiner): 7,200 manually-labelled Wikipedia articles across nine languages: English, German, French, Polish, Italian, Spanish,Dutch, Portuguese and Russian. 36 | * [OntoNotes](https://catalog.ldc.upenn.edu/LDC2013T19): Large corpus comprising various genres of text in three languages with structural information and shallow semantic information. 37 | * [Flores-101](https://ai.meta.com/blog/the-flores-101-data-set-helping-build-better-translation-systems-around-the-world/) - Multilingual, multi-task dataset from Meta for machine translation research, focusing on “low resource” languages. Associated [Github repo](https://github.com/facebookresearch/flores/tree/main). 38 | * [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX): Open dataset of 167 languages with over 6T words, the largest multilingual dataset ever released 39 | * [Amazon Review Datasets:](https://jmcauley.ucsd.edu/data/amazon/) Massive datasets of reviews from Amazon.com, compiled by Julian McAuley of University of California San Diego 40 | * [Yelp Open Dataset](https://www.yelp.com/dataset): 7M reviews, 210K businesses, and 200K images released by Yelp. Note the educational license. 
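Several of the datasets above (CNN Daily Mail, Cornell Movie Dialog, SST2) are mirrored on the Hugging Face Hub, so the quickest way to start experimenting is usually the `datasets` library. A minimal sketch, assuming the `cnn_dailymail` mirror linked above and its standard `3.0.0` configuration:

```python
from datasets import load_dataset

# CNN / Daily Mail article-summary pairs; "3.0.0" is the standard configuration
dataset = load_dataset("cnn_dailymail", "3.0.0", split="train")

example = dataset[0]
print(example["article"][:300])   # source news article
print(example["highlights"])      # reference summary
```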
41 | * [Google Books N-grams:](https://storage.googleapis.com/books/ngrams/books/datasetsv3.html) Very large dataset (2.2TB) of all the n-grams from Google Books. Also available hosted in an [S3 bucket by AWS](https://aws.amazon.com/datasets/google-books-ngrams/). 42 | * [Sentiment Analysis @ Stanford NLP:](https://nlp.stanford.edu/sentiment/index.html) Includes a link to the dataset of movie reviews used for Stanford Sentiment Treebank 2 (SST2). Also available [on Hugging Face](https://huggingface.co/datasets/sst2). 43 | * [CoNLL-2003](https://www.clips.uantwerpen.be/conll2003/ner/): Language-independent entity recognition dataset from the Conference on Computational Natural Language Learning (CoNLL-2003) shared task. Foundational datasets for named entity recognition (NER). 44 | * [LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset](https://huggingface.co/datasets/lmsys/lmsys-chat-1m): Large scale dataset of LLM 1M conversations with LLMs collected from Chatbot Arena website. 45 | * [TabLib](https://www.approximatelabs.com/blog/tablib): Largest publicly available dataset of tabular tokens (627M tables, 867B tokens), to encourage the community to build Large Data Models that better understand tabular data 46 | * [LAION 5B](https://laion.ai/blog/laion-5b/): Massive dataset of images and captions from Large-scale Artificial Intelligence Open Network (LAION), used to train Stable Diffusion. 47 | * [Databricks Dolly 15K](databricks/databricks-dolly-15k): Instruction dataset compiled internally by Databricks, used to train the Dolly models based on the Pythia LLMs. 48 | * [Conceptual Captions](https://ai.google.com/research/ConceptualCaptions/): Large image & caption pair dataset from Google research. 49 | * [Instruction Tuning Volume 1](https://nlpnewsletter.substack.com/p/instruction-tuning-vol-1): List of popular instruction-tuning datasets from Sebastian Ruder 50 | * [Objaverse](https://objaverse.allenai.org/): Massive dataset of annotated 3D objects (with associated text labels) from Allen Institute. Comes in two sizes: 1.0 (800K objects) and XL (~10M objects). 51 | * [Gretel Synthetic Text to SQL Dataset](https://gretel.ai/blog/synthetic-text-to-sql-dataset): Open dataset of synthetically generated natural language and SQL query pairs for LLM training, from Gretel AI. 52 | * [Fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb): 15T token dataset of cleaned and deduplicated data from CommonCrawl by Hugging Face. 53 | 54 | 55 | ### Data Acquisition 56 | 57 | 58 | 59 | * [API tutorials beyond OpenAPI](https://antonz.org/interactive-api-tutorials/): Detailed tutorial on the particulars of APIs and how they function. 60 | * [requests python library](https://requests.readthedocs.io/en/latest/): The standard library for making HTTP requests in python, simple and easy. 61 | * [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/): Python library for parsing data out of HTML, XML, and other markup documents. Essential for web scraping. 62 | * [Selenium Python bindings](https://selenium-python.readthedocs.io/): Working with Selenium in Python for more advanced web scraping. 
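For simple scraping jobs, `requests` plus BeautifulSoup (both listed above) cover most needs; Selenium is only required when content is rendered by JavaScript. A minimal sketch, with the target URL as a placeholder:

```python
import requests
from bs4 import BeautifulSoup

# Fetch a page and extract the visible paragraph text
response = requests.get("https://example.com", timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
print("\n".join(paragraphs))
```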
63 | * [Web scraping with ChatGPT prompts](https://proxiesapi.com/articles/web-scraping-using-chatgpt-complete-guide-with-examples): Medium post on working with ChatGPT to build web scraping code with requests, BeautifulSoup, and Selenium 64 | 65 | 66 | ### Libraries 67 | 68 | 69 | 70 | * [Natural Language Toolkit (NLTK)](https://www.nltk.org/index.html): Core and essential NLP python library put together for teaching purposes by University of Pennsylvania, now fundamental to NLP work. 71 | * [spaCy](https://spacy.io/): Fundamental python NLP library for “industrial-strength natural language processing”, focused on building production systems. 72 | * [Gensim](https://radimrehurek.com/gensim/): open-source python library with a focus on topic modeling, semantic similarity, and embeddings. Also contains implementations of word2vec and doc2vec. 73 | * [fastText](https://fasttext.cc/): Open-source, free, lightweight library that allows users to learn text representations (embeddings) and text classifiers. Includes [pre-trained word vectors](https://fasttext.cc/docs/en/english-vectors.html) from Wikipedia and Common Crawl. From Meta’s FAIR Group. 74 | * [KerasNLP](https://keras.io/keras_nlp/): Natural language processing with deep learning and LLMs in Keras using Tensorflow, Pytorch, or JAX. Includes models such as BERT, GPT, and OPT. 75 | * [Tensorflow Text](https://www.tensorflow.org/text): Lower level than KerasNLP, text manipulation built into Tensorflow. 76 | * [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/): Java-based NLP library from Stanford, still important and in use 77 | * [TextBlob](https://textblob.readthedocs.io/en/dev/): Easy to use NLP library in Python, including simple sentiment scoring and part-of-speech (POS) tagging. 78 | * [Scikit-learn (sklearn):](https://scikit-learn.org/stable/) The essential library for doing machine learning in python, but more specifically [for working with text data](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#extracting-features-from-text-files). 79 | * [SparkNLP](https://nlp.johnsnowlabs.com/): Essential Big Data library for NLP work from John Snow Labs. Take a look at their [extensive model repo](https://sparknlp.org/models). Github repo with lots of resources [here](https://github.com/JohnSnowLabs/spark-nlp). Medium [post here](https://towardsdatascience.com/hands-on-googles-text-to-text-transfer-transformer-t5-with-spark-nlp-6f7db75cecff) on using the T5 model for classification with SparkNLP. 80 | 81 | 82 | ### Neural Networks / Deep Learning 83 | 84 | 85 | 86 | * [A Recipe for Training Neural Networks](http://karpathy.github.io/2019/04/25/recipe/): Tips on training neural networks and debugging from Andrej Karpathy 87 | * [Deep Learning for NLP in Pytorch:](https://github.com/rguthrie3/DeepLearningForNLPInPytorch/blob/master/Deep%20Learning%20for%20Natural%20Language%20Processing%20with%20Pytorch.ipynb) Detailed tutorial on applying NLP techniques with Pytorch, including LSTMs and embeddings. 88 | * [Exploring LSTMs](http://blog.echen.me/2017/05/30/exploring-lstms/): A deep technical look into LSTMs and visualizations of what the networks “see”. 89 | 90 | 91 | ### Sentiment Analysis 92 | 93 | 94 | 95 | * [VADER: Valence Aware Dictionary and sEntiment Reasoner](https://github.com/cjhutto/vaderSentiment): A lexicon and rules-based sentiment scoring model, implemented in python. Used by other major NLP libraries (NLTK, spaCy, TextBlob, etc.) 
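VADER needs no training data, which makes it a handy baseline. A minimal sketch using the `vaderSentiment` package (the same analyzer is also exposed through NLTK):

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# Returns neg/neu/pos proportions plus a normalized "compound" score in [-1, 1]
scores = analyzer.polarity_scores("The plot was great, but the acting was painfully bad.")
print(scores)
```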
96 | * [PyABSA - Open Framework for Aspect-based Sentiment Analysis](https://github.com/yangheng95/pyabsa): Python library for aspect-based sentiment analysis 97 | * [Explainable AI: Integrated Gradients:](https://databasecamp.de/en/ml/integrated-gradients-nlp) Detailed blog post on using the Integrated Gradients method from the [Alibi python library](https://docs.seldon.io/projects/alibi/en/stable/examples/integrated_gradients_imdb.html) for explainable ML on text. 98 | 99 | 100 | ### Optical Character Recognition (OCR) 101 | 102 | 103 | 104 | * [Pytesseract](https://pypi.org/project/pytesseract/): Python wrapper for [Google’s Tesseract OCR engine](https://github.com/tesseract-ocr/tesseract). 105 | * [Donut - Document understanding transformer](https://github.com/clovaai/donut): OCR-free end-to-end Transformer model for various visual document understanding tasks, such as visual document classification or information extraction. By Clova.ai research group. 106 | * [Facebook Nougat](https://facebookresearch.github.io/nougat/): Neural Optical Understanding for Academic Documents (NOUGAT) is a Meta research project for specifically doing OCR and converting academic documents into a markup language. Available on Hugging Face spaces [here](https://huggingface.co/spaces/ysharma/nougat) and [here](https://huggingface.co/spaces/hf-vision/nougat-transformers). 107 | * [Amazon Textract:](https://aws.amazon.com/textract/) AWS Service for automatically extracting information from documents such as PDFs. 108 | * [OCR with Google Document AI](https://codelabs.developers.google.com/codelabs/docai-ocr-python#0): Codelab demonstrating OCR on PDFs with GCP Document AI 109 | 110 | 111 | ### Information Extraction and NERD 112 | 113 | 114 | 115 | * [RAKE](https://pypi.org/project/rake-nltk): Rapid Automatic Keyword Extraction, a domain independent keyword extraction algorithm which tries to determine key phrases in a body of text by analyzing the frequency of word appearance and its co-occurrence with other words in the text. 116 | * [YAKE](https://liaad.github.io/yake/): Yet Another Keyword Extractor is a light-weight unsupervised automatic keyword extraction method which rests on text statistical features extracted from single documents to select the most important keywords of a text. 117 | * [Pytextrank](https://derwen.ai/docs/ptr/): Python implementation of [TextRank](https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) and associated algorithms as a spaCy pipeline extension, for information extraction and extractive summarization. 118 | * [PKE (Python Keyphrase Extraction)](https://github.com/boudinfl/pke): open source python-based keyphrase extraction toolkit, implementing a variety of algorithms. Uses spaCy. 119 | * [KeyBERT](https://github.com/MaartenGr/KeyBERT): Keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document. 120 | * [UniversalNER](https://universal-ner.github.io): Targeted distillation model for named entity recognition from Microsoft Research and USC, based on data generated by ChatGPT. 
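As a quick illustration of the keyphrase-extraction tools above, here is a minimal KeyBERT sketch; the sample text is a placeholder and the default embedding model is downloaded on first use:

```python
from keybert import KeyBERT

doc = (
    "Supervised learning is the machine learning task of learning a function "
    "that maps an input to an output based on example input-output pairs."
)

kw_model = KeyBERT()  # uses a small sentence-transformers model by default
keywords = kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), top_n=5)
print(keywords)  # list of (phrase, similarity) tuples
```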
121 | * [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER): Framework for NER models based on transformers such as BERT, RoBERTa and ELECTRA using Hugging Face Transformers ([HF page](https://huggingface.co/tomaarsen/span-marker-mbert-base-multinerd)) 122 | 123 | 124 | ### Semantics and Syntax 125 | 126 | 127 | 128 | * [Treebank](https://en.wikipedia.org/wiki/Treebank): Definition at Wikipedia 129 | * [Universal Dependencies:](https://universaldependencies.org/#language-) Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages. 130 | * [UDPipe](https://lindat.mff.cuni.cz/services/udpipe/): UDPipe is a trainable pipeline for tokenization, tagging, lemmatization and dependency parsing of CoNLL-U files. 131 | 132 | 133 | ### Topic Modeling & Embedding 134 | 135 | 136 | 137 | * [Topic Modeling](https://en.wikipedia.org/wiki/Topic_model): Wikipedia page 138 | * [Latent Dirichlet Allocation (LDA):](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) Wikipedia page 139 | * [A Beginner’s Guide to Latent Dirichlet Allocation:](https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2) TDS article with some easier to understand explanations of how LDA works. 140 | * [Latent Semantic Analysis (LSA):](https://en.wikipedia.org/wiki/Latent_semantic_analysis) Wikipedia page 141 | * [Termite:](https://idl.cs.washington.edu/papers/termite/) Python-based visualization framework for topic modeling 142 | * [Practical NLP Project](https://pnlpuos.github.io/topic-modeling): A nice overview of Topic Modeling as part of a project at Universität Osnabrück 143 | * [Topic Modeling with Llama2](https://maartengrootendorst.substack.com/p/topic-modeling-with-llama-2): Post from Maarten Grootendorst on using Meta’s LLaMA 2 model and Hugging Face transformers for topic model 144 | * [BERTopic](https://maartengr.github.io/BERTopic/index.html): Topic modeling in Python using Hugging Face transformers and c-TF-IDF to create dense clusters, allowing for easily interpretable topics whilst keeping important words in the topic descriptions. 145 | * [MTEB: Massive Text Embedding Benchmark:](https://github.com/embeddings-benchmark/mteb) General benchmark for LLM embedding performance. 146 | * [Nomic Atlas](https://atlas.nomic.ai/): Project from Nomic AI to create high-performance browser-based visualizations of the embeddings of large datasets 147 | 148 | 149 | ### Multilingual NLP and Machine Translation: 150 | 151 | 152 | 153 | * [fastText language identification models](https://fasttext.cc/docs/en/language-identification.html): Language identification models for use with fastText 154 | * [SeamlessM4T:](https://ai.meta.com/blog/seamless-m4t/) Multimodal translation and transcription model based on the transformer architecture from Meta research. 
155 | * Demo: [https://seamless.metademolab.com](https://seamless.metademolab.com/?utm_source=linkedin&utm_medium=organic_social&utm_campaign=seamless&utm_content=video) 156 | * Hugging Face space: [https://huggingface.co/spaces/facebook/seamless_m4t](https://huggingface.co/spaces/facebook/seamless_m4t) 157 | * Code: [https://github.com/facebookresearch/seamless_communication](https://github.com/facebookresearch/seamless_communication?utm_source=linkedin&utm_medium=organic_social&utm_campaign=seamless&utm_content=video) 158 | * [Helsinki NLP Translation Models:](https://huggingface.co/Helsinki-NLP) Well-known and used translation models in Hugging Face from the University of Helsinki Language Technology Research Group, based on the [OPUS](https://github.com/Helsinki-NLP/Opus-MT) neural machine translation framework. 159 | * [ACL 2023 Multilingual Models Tutorial](https://aka.ms/ACL2023tutorial): Microsoft’s presentations from ACL 2023 - a lot of dense content here on low resource languages, benchmarks, prompting, and bias. 160 | * [ROUGE:](https://en.wikipedia.org/wiki/ROUGE_(metric)) Wikipedia page for ROUGE score for summarization and translation tasks. 161 | * [BLEU](https://en.wikipedia.org/wiki/BLEU): Wikipedia page for BLEU machine translation tasks. 162 | * [sacreBLEU](https://github.com/mjpost/sacrebleu): Python library for hassle-free and reproducible BLEU scores 163 | * [XTREME](https://sites.research.google/xtreme): Comprehensive benchmark for cross-lingual transfer learning on a diverse set of languages and tasks from researchers at Google and Carnegie Mellon 164 | * [Belebele](https://github.com/facebookresearch/belebele): Multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants from Meta, based upon the Flores dataset 165 | * [OpenNMT](https://opennmt.net): Open neural machine translation models in Pytorch and Tensorflow. Documentation for [python here](https://opennmt.net/OpenNMT-py/). 166 | * [FinGPT-3](https://turkunlp.org/gpt3-finnish): GPT model trained in Finnish, from a research group at the University of Turku, Finland. 167 | * [Jais 13-B](https://www.inceptioniai.org/jais/): Bilingual Arabic/English model based on GPT-3 architecture, from Inception AI / Core42 group in UAE. 168 | * [Evo-LLM-JP](https://sakana.ai/evolutionary-model-merge/): Japanese LLM from AI startup Sakana.ai created using evolutionary model merging. There is a chat model, a vision model, and a stable diffusion model all of which can be prompted and converse in Japanese. On Hugging Face [here](https://huggingface.co/SakanaAI). 169 | 170 | 171 | ### Natural Language Inference (NLI) and Natural Language Understanding (NLU) 172 | 173 | 174 | 175 | * [Adversarial NLI](https://github.com/facebookresearch/anli): Benchmark for NLI from Meta research and associated dataset. 176 | 177 | 178 | ### Interviewing 179 | 180 | 181 | 182 | * [NLP Interview Checklist](https://www.kaggle.com/discussions/getting-started/432875): Checklist of knowledge for interviewing for NLP roles. 183 | 184 | 185 | ## Large Language Models (LLMs) and Gen AI 186 | 187 | 188 | ### Introductory LLMs 189 | 190 | 191 | 192 | * [A Beginner’s Guide to Large Language Models](https://resources.nvidia.com/en-us-large-language-model-ebooks/llm-ebook-part1): Free e-book from NVIDIA covering some of the absolute fundamentals in plain language. 
193 | * [How AI chatbots like ChatGPT or Bard work – visual explainer](https://www.theguardian.com/technology/ng-interactive/2023/nov/01/how-ai-chatbots-like-chatgpt-or-bard-work-visual-explainer): A good short visual explainer from The Guardian on how embeddings works and make LLMs function. 194 | * [Generative AI Primer](https://er.educause.edu/articles/2023/8/a-generative-ai-primer): A primer for the layperson on generative AI and LLMs, from Educause Review - lots of links to source materials here and well-written. 195 | * [Understanding Causal LLMs, Masked LLM’s, and Seq2Seq: A Guide to Language Model Training Approaches](https://medium.com/@tom_21755/understanding-causal-llms-masked-llm-s-and-seq2seq-a-guide-to-language-model-training-d4457bbd07fa): Medium post breaking down the different types of language modeling training approaches, causal language modeling vs masked language modeling (CLM vs MLM). 196 | * [[1 hr Talk] Intro to Large Language Models](https://www.youtube.com/watch?v=zjkBMFhNj_g): Great short talk by Andrej Karpathy himself, covering the fundamentals of what LLMs are and how they are trained, demonstrations of GPT-4’s capabilities, and adversarial attacks & jailbreaks. 197 | 198 | 199 | ### Foundation Models 200 | 201 | 202 | 203 | * [Explainer: What is a foundation model?](https://www.adalovelaceinstitute.org/resource/foundation-models-explainer/): A good and lengthy explainer and discussion of foundation models, including visuals and tables. From Ada Lovelace Institute. 204 | * [Center for Research on Foundation Models (CRFM)](https://crfm.stanford.edu/): Interdisciplinary initiative born out of the Stanford Institute for Human-Centered Artificial Intelligence (HAI) that aims to make fundamental advances in the study, development, and deployment of foundation models. See the [report](https://crfm.stanford.edu/report.html), [transparency index](https://crfm.stanford.edu/fmti/), and their [master list of models](https://crfm.stanford.edu/helm/latest/?models=1) (increment using the URL). 205 | * [Getting Started with Llama](https://ai.meta.com/llama/get-started/): Official getting started page for working with LLaMA 2 model from Meta. 206 | 207 | 208 | ### Text Generation 209 | 210 | 211 | 212 | * [How to generate text: using different decoding methods for language generation with Transformers](https://huggingface.co/blog/how-to-generate): Overview of different text generation decoding methods from HuggingFace including beam search vs greedy, top-p and top-k sampling. 213 | * [Guiding Text Generation with Constrained Beam Search in 🤗 Transformers](https://huggingface.co/blog/constrained-beam-search): Blog post from HF on using constrained Beam Search in transformers as opposed to regular beam search. 214 | * [GPT in 60 lines of Numpy](https://jaykmody.com/blog/gpt-from-scratch/): Great blog post on building GPT from scratch and the fundamental workings of the decoder side of the transformer. 215 | * Explanations of [Temperature](https://docs.cohere.com/docs/temperature) and [Top-p and Top-k sampling](https://docs.cohere.com/docs/controlling-generation-with-top-k-top-p#2-pick-from-amongst-the-top-tokens-top-k) Cohere documentation. 
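The decoding parameters discussed above map directly onto `generate()` in Hugging Face Transformers. A minimal sketch with GPT-2 (the model and sampling values are illustrative, not recommendations):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The future of NLP is", return_tensors="pt")

# temperature rescales the logits; top_k / top_p restrict the candidate pool before sampling
outputs = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```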
216 | * [Creatively Deterministic: What are Temperature and Top P in Generative AI?](https://www.linkedin.com/pulse/creatively-deterministic-what-temperature-topp-ai-kevin-tupper/): LinkedIn post on temperature, top-p, and top-k 217 | * [What is Temperature in NLP?](https://lukesalamone.github.io/posts/what-is-temperature/) A short explainer on temperature with a nice accompanying interactive visual, showing its effect on output probabilities. 218 | 219 | 220 | ### Web-based Chat Clients 221 | 222 | 223 | 224 | * [ChatGPT](https://chat.openai.com/): Obviously. From OpenAI. Free, but requires an account. 225 | * [Perplexity Labs](https://labs.perplexity.ai/): Free, web-based LLM chat client, no account required. Includes popular models such as versions of LLaMA and Mistral as well as Perplexity’s own pplx model. 226 | * [HuggingChat](https://huggingface.co/chat): Chat client from HuggingFace, includes LLaMA and Mistral clients as well as OpenChat. Free for short conversations (in guest mode), account required for longer use. 227 | * [DeepInfra Chat](https://deepinfra.com/chat): Includes LLaMA and Mistral, even Mixtral 8x7B! Free to use. 228 | * [Pi](https://pi.ai/talk): Conversational LLM from Inflection. No account required. 229 | * [Poe](https://poe.com/): AI assistant from Quora, allows interacting with OpenAI, Anthropic, LLaMA and Google models. Account required. 230 | * [Copilot](https://copilot.microsoft.com/): Or is it Bing Chat? The lines are blurry. Backed by GPT, allows using GPT-4 on mobile ([iOS](https://apps.apple.com/us/app/microsoft-copilot/id6472538445), [Android](https://play.google.com/store/apps/details?id=com.microsoft.copilot&hl=en&gl=US)) for free! Requires a Microsoft account. 231 | 232 | 233 | ### Summarization 234 | 235 | 236 | 237 | * [PEGASUS - A State-of-the-Art Model for Abstractive Text Summarization](https://blog.research.google/2020/06/pegasus-state-of-art-model-for.html): Foundational LLM for abstractive summarization from Google research in 2020. Available on Hugging Face [here](https://huggingface.co/google/pegasus-large). 238 | 239 | 240 | ### Fine-tuning LLMs 241 | 242 | 243 | 244 | * [Fine-tuning Guide from OpenAI](https://platform.openai.com/docs/guides/fine-tuning): Official docs from OpenAI on fine-tuning hosted GPT-3.5-turb. 245 | * [Getting Started with Deep Learning with PyTorch and Hugging Face](https://github.com/philschmid/deep-learning-pytorch-huggingface/tree/main): Lots of example notebooks for fine-tuning models (T5, Falcon, LLaMA) from Phil Schmid of Hugging Face 246 | * [Fine-tune a non-English GPT-2 Model with Huggingface](https://www.philschmid.de/fine-tune-a-non-english-gpt-2-model-with-huggingface): “Hello World” example of fine-tuning a GPT2 model to write German recipes. 247 | * [HuggingFace Community Resources:](https://huggingface.co/docs/transformers/community) Community resources from Hugging Face. A ton of free Colab notebooks here on fine-tuning various foundation models 248 | * [Personal Copilot: Train Your Own Coding Assistant](https://huggingface.co/blog/personal-copilot): Blog post from Hugging Face on fine-tuning a code generating LLM, using both traditional fine-tuning and PEFT with StarCoder. 249 | * [Optimizing Pre-Trained Models: A Guide To Parameter-Efficient Fine-Tuning (PEFT)](https://www.leewayhertz.com/parameter-efficient-fine-tuning/): A long guide on terminology and the particulars of different types of PEFT. 
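Most of the LoRA/QLoRA resources that follow boil down to a few lines of setup with the PEFT library: wrap a base model with low-rank adapters and train only those. A minimal sketch with GPT-2 (the rank, alpha, and target module name are illustrative; target modules differ per architecture):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=8,                        # adapter rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable
# `model` can now be passed to a standard Trainer / training loop
```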
250 | * [GPT 3.5 vs Llama 2 fine-tuning: A Comprehensive Comparison](https://ragntune.com/blog/gpt3.5-vs-llama2-finetuning): Short blog post comparing fine-tuning GPT vs. LLaMA 2 on a SQL code task, taking price into consideration. 251 | * [Regression with Text Input Using BERT and Transformers](https://lajavaness.medium.com/regression-with-text-input-using-bert-and-transformers-71c155034b13): Fairly in-depth Medium post (including a lot of code) on using BERT for regression. 252 | * [LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models](https://huggingface.co/papers/2309.12307): Method for fine-tuning models (specifically LLaMA and Alpaca) to have longer context windows. Lots of resources around this on their [official Github page](https://github.com/dvlab-research/LongLoRA). 253 | * [PEFT](https://github.com/huggingface/peft): State-of-the-art Parameter-Efficient Fine-Tuning (PEFT) library in Hugging Face. methods. Official Github repo. 254 | * [LoRA: Low-Rank Adaptation of Large Language Models](https://github.com/microsoft/LoRA): Official Github repo for LoRA from Microsoft. 255 | * [Low-RanK Adapters (LoRA)](https://huggingface.co/docs/peft/conceptual_guides/lora): Conceptual guide from Hugging Face. 256 | * [QLoRA](https://github.com/artidoro/qlora): Efficient Finetuning of Quantized LLMs, official Github repo. The method that produced Guanaco from LLaMA. 257 | * [Practical Tips for Finetuning LLMs Using LoRA (Low-Rank Adaptation)](https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms): List of tips and learnings based on using LoRA/QLoRA from Sebastian Raschka 258 | * [Instruction Tuning Volume 1:](https://nlpnewsletter.substack.com/p/instruction-tuning-vol-1) Summary of instruction-tuning and links to some resources from Sebastian Ruder’s NLP newsletter. 259 | * [Finetuning LLMs with LoRA and QLoRA: Insights from Hundreds of Experiments](https://lightning.ai/pages/community/lora-insights): Results of many experiments with model fine tuning using LoRA and QLoRA from Lightning AI, on memory and compute usage, training time, etc. 260 | * [Overview of PEFT: State-of-the-art Parameter-Efficient Fine-Tuning](https://www.kdnuggets.com/overview-of-peft-stateoftheart-parameterefficient-finetuning): Article from KDNuggets with example code using LLaMA 7B 261 | * [Llama Recipes](https://github.com/facebookresearch/llama-recipes): Recipes from Meta themselves for fine-tuning LLaMA 262 | * [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl): Software framework for streamlining fine-tuning of LLMs from OpenAccess AI Collective. 263 | 264 | 265 | ### Model Quantization 266 | 267 | 268 | 269 | * [Quantization:](https://huggingface.co/docs/optimum/concept_guides/quantization) Conceptual guide from Hugging Face 270 | * [What are Quantized LLMs?](https://www.tensorops.ai/post/what-are-quantized-llms): Explainer post from TensorOps, including Hugging Face code and links to other resources. 271 | * [7 Ways To Speed Up Inference of Your Hosted LLMs](https://betterprogramming.pub/speed-up-llm-inference-83653aa24c47): Medium post with techniques for speeding up Inference of LLMs, including an explainer on quantization. 272 | * [HuggingFace meets bitsandbytes for lighter models on GPU for inference (Colab):](https://colab.research.google.com/github/huggingface/blog/blob/main/notebooks/HuggingFace_int8_demo.ipynb) Colab notebook demonstrating usage of bitsandbytes for model quantization with BLOOM 3B. 
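To make the quantization links above concrete, this is roughly what loading a model in 4-bit with bitsandbytes looks like through Transformers; the model id is an arbitrary example, and a CUDA GPU plus the `bitsandbytes` and `accelerate` packages are assumed:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "facebook/opt-1.3b"  # any causal LM on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

inputs = tokenizer("Quantization reduces memory use by", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```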
273 | 274 | 275 | ### Data Labeling 276 | 277 | 278 | 279 | * [Label Studio](https://labelstud.io/): Open source python library / framework for data labelling 280 | 281 | 282 | ### Code Examples and Cookbooks 283 | 284 | 285 | 286 | * [OpenAI Cookbook:](https://cookbook.openai.com/) Recipes and tutorial posts for working and building with OpenAI, all in one place. Example code in the [Github repo](https://github.com/openai/openai-cookbook). 287 | * [Cohere Guides](https://github.com/cohere-ai/notebooks/tree/main/notebooks/guides): Example notebooks for working with Cohere for various LLM usage cases. 288 | 289 | 290 | ### Local LLM Development 291 | 292 | 293 | 294 | * [GPT4All](https://gpt4all.io): Locally-hosted LLM from Nomic for offline development. 295 | * [LM Studio](https://lmstudio.ai/): Software framework for local LLM development and usage. 296 | * [Jan](https://jan.ai/): Offline GUI for working with LLMs. Mobile app under development. 297 | * [Open WebUI](https://github.com/open-webui/open-webui): Self-hosted WebUI for LLMS to operate entirely offline - formly Ollama Web UI. 298 | * [TransformerLab](https://github.com/transformerlab/transformerlab-app?tab=readme-ov-file): Open source project for GUI interface for working with LLMs locally. 299 | * [SuperWhisper](https://superwhisper.com/): Local usage of Whisper model on Mac OS, allows you to speak commands to your machine and have them transcribed (all locally). 300 | * [Cursor](https://cursor.sh/): Locally installable code editor with autocomplete, chat, etc. backed by OpenAI GPT3.5/4. 301 | * [llama.cpp](https://github.com/ggerganov/llama.cpp): Inference from Meta’s LLaMA model in pure C/C++. Python integration through [llama-cpp-python](https://llama-cpp-python.readthedocs.io/en/latest/). 302 | * [Ollama](https://ollama.ai/): Host LLMs locally, includes models like LLaMA, Mistral, Zephyr, Falcon, etc. 303 | * [Exploring Ollama for On-Device AI](https://pyimagesearch.com/2024/05/20/inside-look-exploring-ollama-for-on-device-ai/): Comprehensive tutorial on Ollama from PyImageSearch 304 | * [llamafile](llamafile): Framework for LLMs as single executable files for local execution and development work, examples of one-liners and use from its creator here [Bash One-Liners for LLMs](https://justine.lol/oneliners/) 305 | * [PowerInfer](https://github.com/SJTU-IPADS/PowerInfer): CPU/GPU LLM inference engine leveraging activation locality for fast on-device generation and serving of results from LLMs locally. 306 | * [MLC LLM](https://llm.mlc.ai/): Native deployment of LLMs with native APIs with compiler acceleration. Includes [WebLLM](https://webllm.mlc.ai/) for serving LLMs through the browser and examples of locally developed Android and iPhone LLM apps. 307 | * [DSPy](https://github.com/stanfordnlp/dspy): Framework for algorithmically optimizing LLM prompts and weights from Stanford NLP. 308 | * [AnythingLLM](https://useanything.com/): Docker-based framework for offline LLM usage with RAG. 309 | 310 | 311 | ### Multimodal LLMs 312 | 313 | 314 | #### Images 315 | 316 | 317 | 318 | * [Stable Diffusion](https://stability.ai/stablediffusion): The open model from Stability AI that brought AI-generated images to the forefront. Official [Github repo here](https://github.com/CompVis/stable-diffusion), and (one of many) [Hugging Face Space](https://huggingface.co/spaces/stabilityai/stable-diffusion) here (for SD 2.1). 
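Stable Diffusion checkpoints like the one above can be run locally through the `diffusers` library. A minimal text-to-image sketch, assuming a CUDA GPU and using the SD 2.1 checkpoint as an example:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a watercolor painting of a lighthouse at dawn",
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]

image.save("lighthouse.png")
```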
319 | * [Deepfloyd Lab](https://www.deepfloyd.ai/): Multimodal research AI lab that is a part of Stability AI, has released the [IF Pixel Diffusion model](https://stability.ai/news/deepfloyd-if-text-to-image-model), which does much better on complex image generations such as those involving text. 320 | * [Finetune Stable Diffusion Models with DDPO via TRL](https://huggingface.co/blog/trl-ddpo): Blog post from Hugging Face on fine-tuning SD with reinforcement learning and Denoising Diffusion Policy Optimization (DDPO). 321 | * [Fast Stable Diffusion XL on TPU v5e](https://huggingface.co/spaces/google/sdxl): Hugging Face space with hosted SDXL on TPU for free and fast generation of high quality (1024x1024) images. 322 | * [SDXL in 4 steps with Latent Consistency LoRAs](https://huggingface.co/blog/lcm_lora): Distilling Stable Diffusion XL with Latent Consistency LoRA for highly compute-optimized synthetic image generation. 323 | * [DeciDiffusion](https://huggingface.co/Deci/DeciDiffusion-v1-0): Optimized SD 1.5 model from Deci.ai 324 | * [Segmind-Distill-SD](https://blog.segmind.com/introducing-segmind-ssd-1b/): Distilled Stable Diffusion model from Segmind, claims 50% smaller and 60% faster. [Github repo here](https://github.com/segmind/distill-sd) & [Hugging Face model](https://huggingface.co/segmind/SSD-1B) here. 325 | * [Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack AI](https://ai.meta.com/research/publications/emu-enhancing-image-generation-models-using-photogenic-needles-in-a-haystack): Fine-tuning of Stable Diffusion from Meta research focusing on high-quality images. 326 | * [DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation](https://dreambooth.github.io/): Method for fine-tuning diffusion models to generate custom images of a subject based on samples. 327 | * [AutoTrain Dreambooth (Colab)](https://colab.research.google.com/github/huggingface/autotrain-advanced/blob/main/colabs/AutoTrain_Dreambooth.ipynb#scrollTo=_LvIS7-7PcLT): Google Colab notebook for Autotraining Dreambooth models using Hugging Face. 328 | * [CM3leon](https://ai.meta.com/blog/generative-ai-text-images-cm3leon/): Mixed modal model from Meta Research, available here. 329 | * [Kosmos-G: Generating Images in Context with Multimodal Large Language Models](https://xichenpan.com/kosmosg/): Model from Microsoft Research for generating variations of images given text prompts with minimal to no training. 330 | * [CommonCanvas](https://huggingface.co/common-canvas): Series of models from MosaicML (Databricks) trained only on images with Creative Commons Licensing. 331 | * [Multimodal LLMs by Chip Hyugen](https://huyenchip.com/2023/10/10/multimodal.html): A good post on multimodal LLMs, including foundational / historical models leading up to SOTA like CLIP and Flamingo. 332 | * [LLaVA: Large Language and Vision Assistant](https://llava-vl.github.io/): A kind of open-source GPT4-V, chat / instruction agent able to work with image data, from researchers at Microsoft, U of Wisconsin, and Columbia. [Demo site is here](https://llava.hliu.cc/). 333 | * [SPHINX](https://github.com/Alpha-VLLM/LLaMA2-Accessory/tree/main/SPHINX): Multimodal, multi-task LLM released by researchers at the University of Shanghai. [Demo is here](https://imagebind-llm.opengvlab.com/). 334 | * [Ferret](https://github.com/apple/ml-ferret/): Open model from Apple for grounding and object identification. 
335 | * [XGen-MM](https://huggingface.co/collections/Salesforce/xgen-mm-1-models-662971d6cecbf3a7f80ecc2e): Continuation of (and rebranding) of Salesforce’s multimodal [BLIP](https://github.com/salesforce/BLIP?tab=readme-ov-file) model for image interrogation. 336 | * [Florence 2](https://huggingface.co/collections/microsoft/florence-6669f44df0d87d9c3bfb76de): Family of “small” (200M and 800M parameter) VLMs from Microsoft for a wide range of vision and vision-language tasks, _e.g. _captioning, object detection, segmentation, etc. Comes in 337 | 338 | 339 | #### Audio 340 | 341 | 342 | 343 | * [wav2vec 2.0](https://anwarvic.github.io/speech-recognition/wav2vec_2) And [w2v-BERT](https://anwarvic.github.io/speech-recognition/w2v-BERT): Explanations of the technical details behind these multimodal models from Meta’s FAIR group and Google Brain, by Mohamed Anwar 344 | * [Musenet](https://openai.com/research/musenet): Older research from OpenAI, Musenet applied the GPT architecture to MIDI files to compose music. 345 | * [AudioCraft:](https://ai.meta.com/resources/models-and-libraries/audiocraft/) Multiple models from Meta research, for music (MusicGen), sound effect (AudioGen), and a codec and diffusion model for recovering compressed audio (EnCodec and Multi-band Diffusion). Demo also available in a [Hugging Face space](https://huggingface.co/spaces/facebook/MusicGen), and a [sample Colab notebook here](https://huggingface.co/spaces/facebook/MusicGen/blob/main/demo.ipynb). 346 | * [Audiobox](https://ai.meta.com/blog/audiobox-generating-audio-voice-natural-language-prompts/): Text-to-audio and speech prompt to audio from Meta. Interactive [demo site here](https://audiobox.metademolab.com/). 347 | * [StableAudio](https://www.stableaudio.com/): Diffusion-based music generation model from Stability AI. [Blog post with technical details](https://stability.ai/research/stable-audio-efficient-timing-latent-diffusion). 348 | * [SALMONN](https://github.com/bytedance/SALMONN): Speech Audio Language Music Open Neural Network from researchers at Tsinghua University and ByteDance. Allows for things like inquiring about the content of audio files, multilingual speech recognition & translation and audio-speech co-reasoning. 349 | * Real-time translation and lip-synching: [https://blog.invgate.com/video-translator](https://blog.invgate.com/video-translator) 350 | * [HeyGen](https://www.heygen.com/): Startup creating AI generated avatars and multimedia content, _e.g. _for instructional videos. [Video demo](https://www.youtube.com/watch?v=FRMDJzYO1k4) of lip-synching (dubbing) and translation. 351 | * [Whisper](https://openai.com/research/whisper): OpenAI’s open source multilingual, text-to-speech transcription model. [Official Github repo](https://github.com/openai/whisper) with lots of details. 352 | * [whisper_real_time](https://github.com/davabase/whisper_real_time): Example of real-time audio transcription using Whisper 353 | * [whisper.cpp](https://github.com/ggerganov/whisper.cpp): High-performance plain C/C++ implementation of inference using OpenAI's Whisper without dependencies 354 | * [Deepgram](https://deepgram.com/): Audio AI company with enterprise offerings for TTS including both their own Nova-2 model as well as Whisper or custom models. 355 | * [AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios](https://speechresearch.github.io/adaspeech4/): Model for realistic audio generation (text-to-speech / TTS) from researchers at Microsoft. 
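On the speech-to-text side, the open-source `openai-whisper` package listed above covers the common case in a few lines. A minimal sketch, with the audio file paths as placeholders:

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("base")  # small multilingual checkpoint

# Transcribe a local recording
result = model.transcribe("meeting.mp3")
print(result["text"])

# Whisper can also translate non-English speech directly into English text
result = model.transcribe("interview_fr.mp3", task="translate")
print(result["text"])
```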
356 | * [Project Gutenberg Audio Collection Project](https://marhamilresearch4.blob.core.windows.net/gutenberg-public/Website/index.html): Thousands of free audiobooks transcribed using AdaSpeech4, brought to you by Project Gutenberg, MIT, and Microsoft 357 | * [ElevenLabs](https://elevenlabs.io/): Well-known American software company with AI voice cloning and translation products. 358 | * [Projects: Create High-Quality Audiobooks in Minutes](https://elevenlabs.io/blog/introducing-projects-create-high-quality-audiobooks-in-minutes): Tool for creating high-quality audiobooks via TTS from ElevenLabs. 359 | * [Brain2Music](https://google-research.github.io/seanet/brain2music/): Research from Google for using fMRI scans to reconstruct audio perceived by the listener. 360 | * [WavJourney: Compositional Audio Creation with Large Language Models](https://audio-agi.github.io/WavJourney_demopage): An approach for generating audio combining generative text for scriptwriting plus audio generation models. 361 | * [XTTS](https://coqui.ai/blog/tts/xtts_taking_tts_to_the_next_level): Voice cloning model specifically designed with game creators in mind from coqui.ai. Available in a [Hugging Face space here](https://huggingface.co/spaces/coqui/xtts-streaming). 362 | * [The Future of Music - How Generative AI Is Transforming the Music Industry](https://a16z.com/the-future-of-music-how-generative-ai-is-transforming-the-music-industry/): Blog post from Anderssen-Horowitz covering a lot of recent developments in the intersection of the music industry and GenAI tools. 363 | * [StyleTTS2](https://github.com/yl4579/StyleTTS2): Diffusion and adversarial model for realistic speech synthesis (TTS). Audio samples and comparisons with previous models are [here](https://styletts2.github.io/). 364 | * [Qwen-Audio](https://qwen-audio.github.io/Qwen-Audio/): Multimodal audio understanding LLM from Alibaba Group 365 | * [Audio Diffusion Pytorch](https://github.com/archinetai/audio-diffusion-pytorch): A fully featured audio diffusion library in PyTorch, from researchers at ElevenLabs. 366 | * [MARS5-TTS](https://github.com/Camb-ai/MARS5-TTS): English TTS model from Camb.ai. With just 5 seconds of audio and a snippet of text, MARS5 can generate speech even for prosodically hard and diverse scenarios like sports commentary, anime and more. 367 | * [IMS Toucan](https://github.com/DigitalPhonetics/IMS-Toucan): IMS Toucan is a toolkit for teaching, training and using state-of-the-art Speech Synthesis models, developed at the Institute for Natural Language Processing (IMS), University of Stuttgart, Germany. 368 | 369 | 370 | #### Video and Animation 371 | 372 | 373 | 374 | * [Generative Image Dynamics](https://generative-dynamics.github.io/): Model from researchers at Google for creating looping images or interactive images from still ones. 375 | * [IDEFICS](https://huggingface.co/blog/idefics): Open multimodal text and image model from Hugging Face based on [Flamingo](https://deepmind.google/discover/blog/tackling-multiple-tasks-with-a-single-visual-language-model/), similar to GPT4-V. Updated version [IDEFICS 2](https://huggingface.co/blog/idefics2) released 04/2024 with [demo here](https://huggingface.co/spaces/HuggingFaceM4/idefics-8b). 376 | * [NeRF](https://www.matthewtancik.com/nerf); Neural Radiance fields creates multiple views of a scene from a single image. 
377 | * [ZipNeRF](https://jonbarron.info/zipnerf/): Building on NeRF with more advanced techniques and impressive results, generating drone-style “fly-by” videos from still images of settings. 378 | * [Pegasus-1](https://app.twelvelabs.io/blog/introducing-pegasus-1): Multimodal model from TwelveLabs for describing videos and video-to-text generation. 379 | * [Gen-2 by RunwayML](https://research.runwayml.com/gen2): Video-generating multimodal model from Runway ML that takes text or images as input. 380 | * [Replay](https://blog.genmo.ai/log/replay-ai-video): Video (animated picture) generating model from Genmo AI 381 | * [Hotshot XL](https://www.hotshot.co/): Text to animated GIF generator based on Stable Diffusion XL. [Github](https://github.com/hotshotco/Hotshot-XL) and [Hugging Face model page](https://huggingface.co/hotshotco/Hotshot-XL). 382 | * [ModelScope](https://huggingface.co/damo-vilab/modelscope-damo-text-to-video-synthesis): Open model for text-to-video generation from Alibaba research 383 | * [Stable Video Diffusion](https://stability.ai/news/stable-video-diffusion-open-ai-video-model): Generative video diffusion model from Stability AI. 384 | * [VideoPoet](https://sites.research.google/videopoet/): Synthetic video generation from Google Research, taking a variety of inputs (text, image, video). 385 | * [Pika Labs](https://pika.art): AI startup for video creation with $55 million in backing. 386 | * [Assistive Video](https://assistive.chat/product/video): Video generation from text from AI startup Assistive 387 | * [Haiper](https://haiper.ai/): Text-to-video for short clips (2-4s) from Google Deepmind alumni. Free to use with an account. 388 | * [MagicVideo-V2](https://magicvideov2.github.io/): Multi-Stage High-Aesthetic Video Generation. Text-to-video model from ByteDance research. 389 | * [Video-LLaVA](https://github.com/PKU-YuanGroup/Video-LLaVA): Open model for visual question answering in images, video, and between video and image data. 390 | 391 | 392 | #### 3D Model Generation 393 | 394 | 395 | 396 | * [Stable Zero123](https://stability.ai/news/stable-zero123-3d-generation): 3D image generation model from Stability AI building on the Zero123-XL model. Weights available for non-commercial use on [HF here](https://huggingface.co/stabilityai/stable-zero123). 397 | * [DreamBooth3D](https://dreambooth3d.github.io/): Approach for generating high-quality custom 3D models from source images. 398 | * [MVDream:](https://mv-dream.github.io/gallery_0.html) 3D model generation from Diffusion from researchers at ByteDance. 399 | * [TADA! Text to Animatable Digital Avatars](https://tada.is.tue.mpg.de): Research on models for synthetic generation of 3D avatars from text prompts, from researchers in China and Germany 400 | * [TripoSR](https://github.com/vast-ai-research/triposr): Image to 3D generative model jointly developed by Tripo AI & Stability AI 401 | * [Microdreamer](https://github.com/ml-gsai/microdreamer): Github repo for implementation of Zero-shot 3D Generation in ~20 Seconds from researchers at Renmin University of China 402 | 403 | 404 | #### Powerpoint and Presentation Creation 405 | 406 | 407 | 408 | * [Tome](https://tome.app/): Startup for AI-generated slides (Powerpoint). Free to signup. 409 | * [Decktopus](https://www.decktopus.com/): “World’s #1 AI-Powered Presentation Generator”. 
Paid signup 410 | * [Beautiful.ai](Beautiful.ai): Another AI-based slide deck generator (paid) 411 | 412 | 413 | ### Domain-specific LLMs 414 | 415 | 416 | #### Code 417 | 418 | 419 | 420 | * [Github Copilot](https://github.com/features/copilot): Github’s AI coding assistant, based on OpenAI’s Codex model. 421 | * [GitHub Copilot Fundamentals - Understand the AI pair programmer](https://learn.microsoft.com/en-us/training/paths/copilot/|): Introductory online training / short course on Copilot from Microsoft. 422 | * [Gemini Code Assist:](https://cloud.google.com/gemini/docs/codeassist/overview) Code assistant from Google based on Gemini. Available in Google Cloud or in local IDEs via a plugin (requires subscription). 423 | * [CodeCompose](https://techcrunch.com/2023/05/18/meta-built-a-code-generating-ai-model-similar-to-copilot/): (TechCruch article): Meta’s internal coding LLM / answer to Copilot 424 | * [CodeInterpreter:](https://openai.com/blog/chatgpt-plugins#code-interpreter) Experimental ChatGPT plugin that provides it with access to executing python code. 425 | * [StableCode](https://stability.ai/blog/stablecode-llm-generative-ai-coding): Stability AI’s generative LLM coding model. Hugging Face [collection here](https://huggingface.co/collections/stabilityai/stablecode-64f9dfb4ebc8a1be0a3f7650). Github [here](https://github.com/Stability-AI/StableCode). 426 | * [Starcoder](https://huggingface.co/blog/starcoder): Coding LLM from Hugging Face. Github [is here](https://github.com/bigcode-project/starcoder). **Update**: [Starcoder 2](https://huggingface.co/blog/starcoder2) has been released as of Feb 2024! 427 | * [CodeQwen-1.5](https://qwenlm.github.io/blog/codeqwen1.5/): Code-specific version of Alibaba’s Qwen model. 428 | * [Codestral](https://huggingface.co/mistralai/Codestral-22B-v0.1): 22B coding model from Mistral AI, supports 80+ languages. 429 | * [Ghostwriter](https://replit.com/site/ghostwriter): an AI-powered programming assistant from Replit AI. 430 | * [DeciCoder 1B](https://huggingface.co/Deci/DeciCoder-1b): Code completion LLM from Deci AI, trained on Starcoder dataset. 431 | * [SQLCoder](https://github.com/defog-ai/sqlcoder): Open text-to-SQL query models fine-tuned on Starcoder, from Defog AI. Demo [is here](https://defog.ai/sqlcoder-demo/). 432 | * [CodeLLama](https://ai.meta.com/blog/code-llama-large-language-model-coding/): Fine-tuned version of LLaMA 2 for coding tasks, from Meta. 433 | * [Refact Code LLM](https://refact.ai/blog/2023/introducing-refact-code-llm/): 1.6B coding LLM with fill-in-the-middle (fim) capability, trained by Refact AI. 434 | * [Tabby](https://tabby.tabbyml.com/): Open source, locally-hosted coding assistant framework. Can use Starcoder or CodeLLaMA. 435 | * [DuetAI for Developers](https://cloud.google.com/blog/products/application-development/introducing-duet-ai-for-developers): Coding assistance based on PaLM as part of Google’s DuetAI offering. 436 | * [Gorilla LLM](https://gorilla.cs.berkeley.edu/): LLM model from researchers at UC Berkeley trained to generate API calls across many different platforms and tools. 437 | * [Deepseek Coder](https://deepseekcoder.github.io): Series of bilinginual English/Chinese coding LLMs from DeepSeek AI, trained from scratch on 2T tokens, with a composition of 87% code and 13% natural language. 438 | * [Codestral Mamba](https://mistral.ai/news/codestral/): Open coding model from Mistral based on the MAMBA architecture. 
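Most of the open code models above can be tried through a standard Transformers text-generation pipeline. A minimal sketch; the small StarCoder variant used here is just an assumption so the example runs on modest hardware, and some code models on the Hub require accepting a license first:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="bigcode/starcoderbase-1b")

prompt = 'def fibonacci(n: int) -> int:\n    """Return the n-th Fibonacci number."""\n'
completion = generator(prompt, max_new_tokens=64, do_sample=False)

print(completion[0]["generated_text"])
```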
439 | * [Phind 70B](https://www.phind.com/blog/introducing-phind-70b): Code generation model purported to rival GPT-4 from AI startup Phind. 440 | * [Granite](https://research.ibm.com/blog/granite-code-models-open-source?utm_source=tldrai): Open-sourced family of code-specific LLMs from IBM Research. On Hugging Face [here](https://huggingface.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330). 441 | 442 | 443 | #### Mathematics 444 | 445 | 446 | 447 | * [MathGLM](https://github.com/THUDM/MathGLM): Open model from Tsinghua University researchers challenging the statement that LLMs cannot do mathematics well. Nonetheless, [math remains hard if you’re an LLM](https://garymarcus.substack.com/p/math-is-hard-if-you-are-an-llm-and). 448 | * [Llemma: An Open Language Model For Mathematics](https://blog.eleuther.ai/llemma): Fine-tuned version of CodeLLaMA on new dataset [Proof-Pile-2](https://huggingface.co/datasets/EleutherAI/proof-pile-2) from Eleuther AI, a mixture of scientific papers and mathematical web content. 449 | * [Mathstral](https://mistral.ai/news/mathstral/): Open model 7B parameter model from Mistral AI specialized in mathematics and STEM tasks. 450 | 451 | 452 | #### Finance 453 | 454 | 455 | 456 | * [BloombergGPT](https://www.bloomberg.com/company/press/bloomberggpt-50-billion-parameter-llm-tuned-finance/): LLM trained by Bloomberg from scratch based on code / approaches from BLOOM 457 | * [FinGPT](https://github.com/AI4Finance-Foundation/FinGPT): Finance-specific family of models trained with RLHF, fine-tuned from various base foundation models. 458 | * [DocLLM](https://arxiv.org/pdf/2401.00908.pdf): Layout-aware large language model from JPMorgan 459 | 460 | 461 | #### Science and Health 462 | 463 | 464 | 465 | * [Galactica](https://www.technologyreview.com/2022/11/18/1063487/meta-large-language-model-ai-only-survived-three-days-gpt-3-science/): (MIT Blog Post) Learnings from Meta’s Galactica LLM, trained on scientific research papers. 466 | * [BioGPT](https://github.com/microsoft/BioGPT): Generative Pre-trained Transformer for Biomedical Text Generation and Mining, open LLM from Microsoft Research trained on PubMeb papers. 467 | * [MedPALM](https://sites.research.google/med-palm/): A large language model from Google Research, designed for the medical domain. Google has continued this work with [MedLM](https://cloud.google.com/blog/topics/healthcare-life-sciences/introducing-medlm-for-the-healthcare-industry), 468 | * [Meditron](https://arxiv.org/abs/2311.16079): Fine-tuned LLaMAs on medical data from Swiss university EPFL. HuggingFace space [here](https://huggingface.co/epfl-llm/meditron-70b). Github [here](https://github.com/epfLLM/meditron). [Llama3 version](https://meditron-ddx.github.io/llama3-meditron.github.io/) released 2024/04/19. 469 | * [MedicalLLM](https://huggingface.co/blog/leaderboard-medicalllm): Evaluation benchmark for medical LLMs from Hugging Face including leaderboard. 470 | 471 | 472 | #### Law 473 | 474 | 475 | 476 | * [SaulLM-7B](https://arxiv.org/abs/2403.03883): Legal LLM from researchers at Equall.ai and other universities. A fine-tune of Mistral-7B trained on a legal corpus of over 30B tokens. 477 | 478 | 479 | #### Time Series 480 | 481 | 482 | 483 | * [TimeGPT](https://www.nixtla.io/timegpt): Transformer-based time series prediction models from NIXTLA. Requires using their service / an API token. 
484 | * [Lag-Llama](https://github.com/time-series-foundation-models/lag-llama): Towards Foundation Models for Probabilistic Time Series Forecasting. Open-source foundation model for time series forecasting based on the transformer architecture. 485 | * [Granite](https://research.ibm.com/blog/granite-code-models-open-source?utm_source=tldrai): Time-series versions of open-sourced family of LLMs from IBM Research. On Hugging Face [here](https://huggingface.co/collections/ibm-granite/granite-time-series-models-663a90c6a2da73482bce3dc6). 486 | 487 | 488 | ### Vector Databases and Frameworks 489 | 490 | 491 | 492 | * [Docarray](https://docarray.jina.ai/): python library for nested, unstructured, multimodal data in transit, including text, image, audio, video, 3D mesh, and so on. 493 | * [Faiss](https://faiss.ai/): Library for efficient similarity search and clustering of dense vectors from Meta Research. 494 | * [Pinecone](https://www.pinecone.io/): Vector database is a vector-based database that offers high-performance search and similarity matching. 495 | * [Weaviate](https://weaviate.io/): Open-source vector database to store data objects and vector embeddings from your favorite ML-models. 496 | * [Chroma](https://www.trychroma.com/): Open-source vector store used for storing and retrieving vector embeddings and metadata for use with large language models. 497 | * [Milvus](https://milvus.io/): Vector database built for scalable similarity search. 498 | * [AstraDB](https://www.datastax.com/products/datastax-astra): Datastax’s vector database offering built atop of Apache Cassandra. 499 | * [Activeloop](https://www.activeloop.ai/): Database for AI powered by a unique storage format optimized for deep-learning and Large Language Model (LLM) based applications. 500 | * [OSS Chat](https://osschat.io/): Demo of RAG from Zilliz, allowing chat with OSS documentation. 501 | 502 | 503 | ### Evaluation 504 | 505 | 506 | 507 | * [The Stanford Natural Language Inference (SNLI) Corpus](https://nlp.stanford.edu/projects/snli/): Foundational dataset for NLI-based evaluation, 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral. 508 | * [GLUE](https://gluebenchmark.com/): General Language Understanding Evaluation Benchmark from NYU, University of Washington, and Google - model evaluation using Natural Language Inference (NLI) tasks. 509 | * [SuperGLUE](https://super.gluebenchmark.com/): The Super General Language Understanding Evaluation, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard. 510 | * [SQuAD (Stanford Question Answering Dataset)](https://rajpurkar.github.io/SQuAD-explorer/): Reading comprehension question answering dataset for LLM evaluation. 511 | * [BigBench](https://github.com/google/BIG-bench): The Beyond the Imitation Game Benchmark (BIG-bench) from Google, a collaborative benchmark with over 200 tasks. 512 | * [BigBench Hard](https://github.com/suzgunmirac/BIG-Bench-Hard): Subset of BigBench tasks considered to be the most challenging, with associated paper. 513 | * [MMLU](https://github.com/hendrycks/test): Multi-task Language Understanding is a benchmark developed by researchers at UC Berkeley and others to specifically measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. 
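Benchmarks like MMLU and HellaSwag are usually scored by asking which answer choice the model assigns the highest likelihood to, rather than by parsing generated text. A rough sketch of that idea with GPT-2 (the model and toy question are illustrative, and it assumes the prompt's tokens are a prefix of the prompt-plus-choice tokenization, which holds here):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

question = "Question: What is the capital of France?\nAnswer:"
choices = [" Paris", " London", " Berlin", " Madrid"]

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to `continuation` given `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits[0, :-1], dim=-1)
    total = 0.0
    for pos in range(prompt_len, full_ids.shape[1]):
        total += log_probs[pos - 1, full_ids[0, pos]].item()  # prob of token at `pos` given prefix
    return total

scores = {choice.strip(): continuation_logprob(question, choice) for choice in choices}
print(max(scores, key=scores.get), scores)
```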
514 | * [HeLM](https://crfm.stanford.edu/helm/latest/): Holistic Evaluation of Language Models, a “living” benchmark designed to be comprehensive, from the Center for Research on Foundation Models (CRFM) at Stanford. 515 | * [HellaSwag](https://rowanzellers.com/hellaswag/): A challenge dataset for evaluating commonsense NLI that is especially hard for state-of-the-art models, though its questions are trivial for humans (>95% accuracy). 516 | * [Dynabench](https://dynabench.org/): A “platform for dynamic data collection and benchmarking”. Sort of a Kaggle / collaborative site for benchmarks and data collaboration, an effort of researchers from Meta and American universities. 517 | * [LMSys Chatbot Arena:](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard) Leaderboard from LMSys group based upon human evaluation and Elo score. The only evaluation that [Andrej Karpathy trusts](https://www.reddit.com/r/LocalLLaMA/comments/18n3ar3/karpathy_on_llm_evals/). 518 | * [Hugging Face Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard): Leaderboard from H4 (alignment) Group at Hugging Face. Largely open and fine-tuned models, though this can be filtered. 519 | * [AlpacaEval Leaderboard](https://tatsu-lab.github.io/alpaca_eval/): AlpacaEval is an LLM-based automatic evaluation based on the AlpacaFarm evaluation set, which tests the ability of models to follow general user instructions. 520 | * [OpenCompass](https://opencompass.org.cn/leaderboard-llm): Leaderboard for Chinese LLMs. 521 | * [Evaluating LLMs is a minefield](https://www.cs.princeton.edu/~arvindn/talks/evaluating_llms_minefield/): Popular deck from researchers at Princeton (and authors of AI Snake Oil) on the pitfalls and intricacies of evaluating LLMs. 522 | * [LM Contamination Index](https://hitz-zentroa.github.io/lm-contamination/): The LM Contamination Index is a manually created database of contamination of LLM evaluation benchmarks. 523 | * [The Curious Case of LLM Evaluation](https://nlpurr.github.io/posts/case-of-llm-evals.html): In-depth blog post examining some of the finer nuances and sticking points of evaluating LLMs. 524 | * [LLM Benchmarks](https://benchmarks.llmonitor.com/): Dynamic dataset of crowd-sourced prompts that changes weekly for more realistic LLM evaluation. 525 | * [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness): EleutherAI’s language model evaluation harness, a unified framework to test generative language models on over 200 different evaluation tasks. 526 | * [PromptBench](https://github.com/microsoft/promptbench): Unified framework for LLM evaluation from Microsoft. 527 | * [HarmBench](https://www.harmbench.org/about): Standardized evaluation framework for automated red teaming for mitigating risks associated with malicious use of LLMs. Paper [on arxiv](https://arxiv.org/abs/2402.04249). 528 | 529 | 530 | ### Agents 531 | 532 | 533 | 534 | * [AutoGPT](https://github.com/Significant-Gravitas/AutoGPT): One of the most popular frameworks for using LLM agents, using the OpenAI API / GPT4. 535 | * [ThinkGPT](https://github.com/jina-ai/thinkgpt): Python library for implementing Chain of Thought for LLMs, prompting the model to think, reason, and to create generative agents. 536 | * [AutoGen](https://microsoft.github.io/autogen/): Multi-agent LLM framework for building applications from Microsoft. 
537 | * [XAgent](https://blog.x-agent.net/): Open-source experimental agent, designed to be general-purpose and applicable to a wide range of tasks. From students at Tsinghua University. 538 | * [Thought Cloning](https://github.com/ShengranHu/Thought-Cloning): Github repo for implementation of Thought Cloning (TC), an imitation learning framework that trains agents to think like humans. 539 | * [Demonstrate-Search-Predict (DSP)](https://github.com/stanfordnlp/dsp): Framework for solving advanced tasks with language models (LMs) and retrieval models (RMs). 540 | * [ReAct Framework](https://www.promptingguide.ai/techniques/react): Prompting method that includes examples with actions, the observations gained by taking those actions, and transcribed thoughts (reasoning) for LLMs to take complex actions and reason or solve problems. 541 | * [Tree of Thoughts (ToT)](https://github.com/princeton-nlp/tree-of-thought-llm): LLM reasoning process as a tree, where each node is an intermediate “thought” or coherent piece of reasoning that serves as a step towards the final solution. 542 | * [GPT Engineer](https://github.com/AntonOsika/gpt-engineer): Python framework for attempting to get GPT to write code and build software. 543 | * [MetaGPT - The Multi-Agent Framework](https://github.com/geekan/MetaGPT): Agent framework where different assigned roles (product managers, architects, project managers, engineers) are used for building different products (user stories, competitive analysis, requirements, data structures, etc.) given a requirement. 544 | * [OpenGPTs](https://github.com/langchain-ai/opengpts): Open source effort from Langchain to create a similar experience to OpenAI's GPTs with greater flexibility and choice. 545 | * [Devin](https://www.cognition-labs.com/introducing-devin): “AI software engineer” from startup Cognition Labs. 546 | * [SWE-Agent:](https://swe-agent.com/) Open source software engineering agent framework from researchers at Princeton. 547 | * [GATO](https://www.deepmind.com/publications/a-generalist-agent): Generalist agent from Google Deepmind research for many tasks and media types 548 | * [WebLLaMa](https://webllama.github.io/): Fine-tuned version of LLaMa 3 from McGill University, optimized for web browsing tasks. 549 | 550 | 551 | ### Application Frameworks 552 | 553 | 554 | 555 | * [LlamaIndex](https://gpt-index.readthedocs.io/en/latest/): LlamaIndex (formerly GPT Index) is a data framework for LLM applications to ingest, structure, and access private or domain-specific data. Used for RAG and building LLM applications working with stored data. 556 | * [LangChain](https://python.langchain.com/docs/get_started/introduction.html): LangChain is a framework for developing applications powered by language models. 557 | * [Chainlit](https://docs.chainlit.io/get-started/overview): Chainlit is an open-source Python package that makes it incredibly fast to build ChatGPT-like applications with your own business logic and data. 558 | 559 | 560 | ### LLM Training, Training Frameworks, Training at Scale 561 | 562 | 563 | 564 | * [Deepspeed](https://www.microsoft.com/en-us/research/project/deepspeed/): Deep learning optimization software suite that enables unprecedented scale and speed for DL Training and Inference from Microsoft. 
565 | * [Megatron-LM](https://github.com/NVIDIA/Megatron-LM): From NVIDIA, Megatron-LM enables pre-training large transformer language models at scale with efficient tensor, pipeline and sequence-based model parallelism. 566 | * [GPT-NeoX](https://github.com/EleutherAI/gpt-neox): Eleuther AI’s library for large scale GPU training of LLMs, based on Megatron. 567 | * [TRL (Transformer Reinforcement Learning)](https://pypi.org/project/trl/): Library for Reinforcement Learning of Transformer and Stable Diffusion models built atop the transformers library. 568 | * [Autotrain Advanced](https://huggingface.co/docs/autotrain/index): In-development offering and [Python library](https://github.com/huggingface/autotrain-advanced) from Hugging Face for easy and fast auto-training of LLMs and Stable Diffusion models. 569 | * [Transformer Math](https://www.eleuther.ai/mathematics/transformer_math/): Detailed blog post from Eleuther AI on the mathematics of compute requirements for training LLMs 570 | 571 | 572 | ### Reinforcement Learning from Human Feedback (RLHF) 573 | 574 | 575 | 576 | * [Reinforcement Learning from Human Feedback:](https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback) ELI5 from Wikipedia 577 | * [RLHF: Reinforcement Learning from Human Feedback](https://huyenchip.com/2023/05/02/rlhf.html): Blog post from Chip Huyen breaking down RLHF. 578 | * [Illustrating Reinforcement Learning from Human Feedback (RLHF)](https://huggingface.co/blog/rlhf): Blog post from Hugging Face breaking down how RLHF works with accompanying visuals. 579 | 580 | 581 | ### Embeddings 582 | 583 | 584 | 585 | * [The Illustrated Word2vec](https://jalammar.github.io/illustrated-word2vec/): Explanation of word2vec from Jay Alammar 586 | * [Sentence Transformers](https://www.sbert.net/): Python framework for state-of-the-art sentence, text and image embeddings from Siamese BERT networks. 587 | * [Text Embeddings](https://docs.cohere.com/docs/text-embeddings): Documentation / explainer from Cohere with accompanying video. 588 | * [Text Embeddings Visually Explained](https://txt.cohere.com/text-embeddings/): Another Cohere post explaining the intuition and use cases behind text embeddings. 589 | * [A Deep Dive into NLP Tokenization and Encoding with Word and Sentence Embeddings](https://datajenius.com/2022/03/13/a-deep-dive-into-nlp-tokenization-encoding-word-embeddings-sentence-embeddings-word2vec-bert/): Lengthy blog post going into detail on embeddings from a deep learning fundamentals perspective and building up to word2vec and BERT. 590 | 591 | 592 | ### LLM Serving 593 | 594 | 595 | 596 | * [vLLM](https://github.com/vllm-project/vllm): vLLM is a fast and easy-to-use library for LLM inference and serving, using PagedAttention to manage attention key-value memory efficiently for high-throughput serving. 597 | * [Skypilot](https://github.com/skypilot-org/skypilot): SkyPilot is a framework for running LLMs, AI, and batch jobs on any cloud. 598 | * [7 Frameworks for Serving LLMs](https://betterprogramming.pub/frameworks-for-serving-llms-60b7f7b23407): Medium post comparing different LLM serving frameworks. 
599 | * [Deploying custom fine-tuned LLMs on Vertex AI](https://medium.com/@ashika.umanga/deploying-custom-fine-tuned-llms-on-vertex-ai-6f96752f9fc1#:~:text=These%20can%20be%20easily%20deployed%20and%20used%20for%20inference.&text=Models%20in%20the%20Registry%20are,traffic-split%20can%20be%20configured): Medium post with a how-to on serving LLMs via GCP and Vertex AI 600 | 601 | 602 | ### Preprocessing and Tokenization 603 | 604 | 605 | 606 | * [Tiktoken](https://github.com/openai/tiktoken): OpenAI’s BPE-based tokenizer 607 | * [SentencePiece](https://github.com/google/sentencepiece): Unsupervised text tokenizer and detokenizer for text generation systems from Google (but not an official product). 608 | 609 | 610 | ### Open LLMs 611 | 612 | 613 | 614 | * [LLaMa 2](https://llama.meta.com/): Incredibly popular open weights (with license) model from Meta AI which spawned a generation of offspring and fine-tunes. Comes in 7, 13, and 70B versions. 615 | * [Mistral 7B](https://mistral.ai/news/announcing-mistral-7b/): Popular open model from French startup Mistral with no fine-tuning (only pretraining). See also: the [Mixtral 8x7B](https://mistral.ai/news/mixtral-of-experts/) mixture of experts successor, [Mixtral 8x22B](https://huggingface.co/mistral-community/Mixtral-8x22B-v0.1-4bit). 616 | * [Mistral NeMO](https://mistral.ai/news/mistral-nemo/): Open 12B parameter model from Mistral, trained in partnership with NVIDIA, with a 128K token context window and a new updated tokenizer (Tekken). [Model on Hugging Face](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407). 617 | * [Gemma](https://blog.google/technology/developers/gemma-open-models/): Lightweight open models from Google based on the same architecture as Gemini. Comes in 2B and 7B base and instruction-tuned versions. 618 | * [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6b) and [GPT Neo-X](https://huggingface.co/EleutherAI/gpt-neox-20b): Open models trained from scratch by Eleuther AI. 619 | * [Falcon 40B](https://falconllm.tii.ae/): Open text generation LLM from UAE’s Technology Innovation Institute (TII). Available on Hugging Face [here](https://huggingface.co/tiiuae/falcon-40b). 620 | * [Falcon 2 11B](https://falconllm.tii.ae/falcon-2.html): Second set of models in the series from TII, released May 2024, including a multimodal model. On Hugging Face [here](https://huggingface.co/tiiuae/falcon-11B). 621 | * [StableLM](https://github.com/stability-AI/stableLM/): Open language model from Stability AI. Succeeded by StableLM 2, in [1.6B](https://stability.ai/news/introducing-stable-lm-2) (Jan 2024) and [12B versions](https://stability.ai/news/introducing-stable-lm-2-12b) (April 2024, try [live demo here](https://huggingface.co/spaces/stabilityai/stablelm-2-chat)) 622 | * [OLMo](https://allenai.org/olmo): Open Language Models from the Allen Institute for AI (AI2) 623 | * [DCLM-7B](https://huggingface.co/apple/DCLM-7B): 7 billion parameter language model from Apple designed to showcase the effectiveness of systematic data curation techniques for improving language model performance. 624 | * [Snowflake Arctic](https://www.snowflake.com/blog/arctic-open-efficient-foundation-language-models-snowflake/): Open LLM from Snowflake, released April 2024. [Github here](https://github.com/Snowflake-Labs/snowflake-arctic) and on [Hugging Face here](https://huggingface.co/Snowflake). 
625 | * [Minotaur 15B](https://huggingface.co/openaccess-ai-collective/minotaur-15b): Fine-tuned version of Starcoder on open code datasets from the OpenAccess AI Collective 626 | * [MPT](https://www.mosaicml.com/mpt): Family of open models free for commercial use from MosaicML. Includes [MPT Storywriter](https://huggingface.co/mosaicml/mpt-7b-storywriter) which has a 65K context window. 627 | * [DBRX](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm): Family of mixture-of-experts (MoE) large language models trained from scratch by Databricks Mosaic Research. Try it out in the Hugging Face [playground here](https://huggingface.co/spaces/databricks/dbrx-instruct). 628 | * [Qwen](https://github.com/QwenLM/Qwen): Open LLM models from Alibaba Cloud in 7B and 14B sizes, including chat versions. Model family [1.5 released Feb 2024](https://github.com/QwenLM/Qwen1.5) and [Qwen1.5-MoE](https://qwenlm.github.io/blog/qwen-moe/) Mixture of Experts model released 03/28/2024. 629 | * [Command-R](https://txt.cohere.com/command-r/) / [Command-R+](https://txt.cohere.com/command-r-plus-microsoft-azure/): Open LLM from Cohere for AI for long-context tasks such as retrieval augmented generation (RAG) and tool use. Available on HuggingFace [Command-R](https://huggingface.co/CohereForAI/c4ai-command-r-v01), [Command-R+](https://huggingface.co/CohereForAI/c4ai-command-r-plus) 630 | * [Aya](https://cohere.com/research/aya): Massively multilingual models from Cohere for AI: [Aya 101](https://huggingface.co/CohereForAI/aya-101) and Aya 23, which support 101 and 23 languages respectively. [Aya 23](https://huggingface.co/collections/CohereForAI/c4ai-aya-23-664f4cda3fa1a30553b221dc) comes in 8B and 35B versions. 631 | * [Grok-1](https://github.com/xai-org/grok-1): X.ai’s LLM, an MoE with 314B parameters, weights available via torrent. This is the (pre-trained) base model only, and not fine-tuned for chat. 632 | * [SmolLM](https://huggingface.co/blog/smollm): Family of small language models (SLMs) from Hugging Face in 135M, 360M, and 1.7B parameter sizes. On Hugging Face [here](https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad7167254ce15966). 633 | * [Jamba](https://www.ai21.com/blog/announcing-jamba): Hybrid SSM-Transformer model from AI21 Labs - “world’s first production grade Mamba based model”. Weights [on Hugging Face here](https://huggingface.co/ai21labs/Jamba-v0.1). 634 | * [Fuyu-8B](https://www.adept.ai/blog/fuyu-8b): Open multimodal model from Adept AI, a smaller version of the model that powers their commercial product. 635 | * [Yi](https://01.ai/): Bilingual open LLM from Chinese startup [01.AI](01.ai) founded by Kai-Fu Lee, with two versions, Yi-34B & 6B. Also [Yi-9B](https://huggingface.co/01-ai/Yi-9B) open-sourced in March 2024. 636 | * [OpenHermes](https://huggingface.co/collections/NousResearch/hermes-650a66656fb511ba9ea86ff1): Popular series of open (and uncensored) LLMs from Nous Research, fine-tunes of models such as LLaMA, Mixtral, Yi, and SOLAR. 637 | * [Poro 34B](https://huggingface.co/LumiOpen/Poro-34B?utm_source=substack&utm_medium=email): Fully open-source bilingual Finnish & English model trained in collaboration between Finnish startup Silo AI and the TurkuNLP group of the University of Turku. 638 | * [Nemotron-3 8B](https://developer.nvidia.com/blog/nvidia-ai-foundation-models-build-custom-enterprise-chatbots-and-co-pilots-with-production-ready-llms/): Family of “semi-open” (requires accepting a license) LLMs from NVIDIA, optimized for their Nemo framework. 
Find them all on the [collections page](https://huggingface.co/collections/nvidia/nemotron-3-8b-6553adeb226f6ab4ffc356f9) on HF. 639 | * [ML Foundations](https://github.com/mlfoundations): Github repo for Ludwig Schmidt of the University of Washington; includes open versions of multimodal models Flamingo & CLIP 640 | 641 | 642 | ### Visualization 643 | 644 | 645 | 646 | * [BERTViz:](https://github.com/jessevig/bertviz) Interactive tool for visualizing attention in Transformer language models such as BERT, GPT2, or T5, based in Python and can be run in Colab. 647 | * [Jay Alammar’s Blog](https://jalammar.github.io/): Lots of great posts here visualizing and explaining LLMs such as [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/) and [The Illustrated Stable Diffusion](https://jalammar.github.io/illustrated-stable-diffusion/) 648 | * [LLM Visualization](https://bbycroft.net/llm): Interactive 3-D visualizations of nanoGPT, GPT-2, and GPT-3, with explanation walking through each piece of the model in detail. 649 | 650 | 651 | ### Prompt Engineering 652 | 653 | 654 | 655 | * [Prompt Engineering Guide](https://www.promptingguide.ai/): Comprehensive site for all things prompting related and beyond. 656 | * [Prompts for Education: Enhancing Productivity & Learning](https://github.com/microsoft/prompts-for-edu): Github repo with resources on using generative AI and prompting in educational settings. 657 | * [How I Think About Prompt Engineering](https://fchollet.substack.com/p/how-i-think-about-llm-prompt-engineering): Post by François Chollet (creator of Keras) relating prompting back to programming paradigms. 658 | * [PromptIDE](https://x.ai/prompt-ide): Development environment and paradigm for prompt programming from xAI using their Grok model. 659 | * [Prompt Engineering Guide from OpenAI:](https://platform.openai.com/docs/guides/prompt-engineering) Official guide from OpenAI on prompt engineering best practices (December 2023). 660 | * [Introduction to prompt design:](https://docs.anthropic.com/claude/docs/introduction-to-prompt-design) Anthropic guide for prompt engineering with Claude. 661 | * [Prompt Library](https://docs.anthropic.com/claude/prompt-library): Library of prompts from Anthropic for use with their models. 662 | * [More Useful Things: Prompt Library](https://www.moreusefulthings.com/prompts): Prompt library from researchers at Wharton, primarily geared towards a classroom / teaching setting. 
673 | * [Safety Guidance | PaLM API](https://developers.generativeai.google/guide/safety_guidance): Safety guidelines from Google for using their PaLM model, though they are generally applicable. 674 | * [LLM Hacking: Prompt Injection Techniques](https://medium.com/@austin-stubbs/llm-security-types-of-prompt-injection-d7ad8d7d75a3#): Medium post describing different techniques for prompt injection. 675 | * [Anthropic Long-term Benefit Trust (LTBT)](https://www.anthropic.com/index/the-long-term-benefit-trust#:~:text=The%20Anthropic%20Long%2DTerm%20Benefit,public%20policy%2C%20and%20social%20enterprise.): Anthropic’s approach for governance of the company and addressing leadership and governance of AI. 676 | * [Guardrails](https://docs.guardrailsai.com/): Python library for assurance and validation of the outputs of LLMs. In alpha. 677 | * [Detoxify](https://github.com/unitaryai/detoxify): Toxic comment classification models based on BERT. 678 | * [Artificial Intelligence and Data Act (AIDA) Companion Document](https://ised-isde.canada.ca/site/innovation-better-canada/en/artificial-intelligence-and-data-act-aida-companion-document): High level details on the proposed Canadian AI legislation as part of [Bill C-27](https://www.parl.ca/DocumentViewer/en/44-1/bill/C-27/first-reading). 679 | * [Evaluating social and ethical risks from generative AI](https://deepmind.google/discover/blog/evaluating-social-and-ethical-risks-from-generative-ai/): Blog post and source paper from Deepmind on a framework for risks from GenAI. 680 | * [The Alignment Handbook](https://github.com/huggingface/alignment-handbook): From the Hugging Face team, provides a series of robust training recipes that span the whole LLM pipeline for ensuring model alignment. 681 | * [Decoding Intentions](https://cset.georgetown.edu/publication/decoding-intentions/): Artificial Intelligence and Costly Signals: The paper from Helen Toner on the OpenAI board that ruffled Sam Altman’s feathers. 682 | * [Cold Takes](https://www.cold-takes.com/): Ethics and AI blog and podcast from Holden Karnofsky of Open Philanthropy 683 | 684 | 685 | ### Costing 686 | 687 | 688 | 689 | * [You don’t need hosted LLMs, do you?](https://betterprogramming.pub/you-dont-need-hosted-llms-do-you-1160b2520526): Comparison of costs and considerations for using self-hosted solutions vs OpenAI’s offerings. 690 | 691 | 692 | ## Books, Courses and other Resources 693 | 694 | 695 | ### Communities 696 | 697 | 698 | 699 | * [MLOps Community](https://mlops.community/): Community of machine learning operations (MLOps) practitioners, but lately very much focused on LLMs. 700 | * [LLMOps Space](https://llmops.space/): Global community for LLM practitioners & enthusiasts, focused on topics related to deploying LLMs into production 701 | * [Aggregate Intellect Socratic Circles (AISC)](https://aisc.ai.science/about): Online community of ML and AI practitioners based in Toronto, with Slack server, journal club, and free talks 702 | * [/r/LanguageTechnology](https://www.reddit.com/r/LanguageTechnology/): Reddit community on Natural Language Processing and LLMs with over 40K members 703 | * [/r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/): Subreddit to discuss training Llama and development around it, though also contains a lot of good general LLM discussion. 704 | 705 | 706 | ### MOOCs and Courses 707 | 708 | 709 | 710 | * [Stanford CS324](https://stanford-cs324.github.io/winter2022/): Large Language Models at Stanford. 
All course materials are freely available and viewable online. 711 | * [Stanford CS224U: Natural Language Understanding:](https://web.stanford.edu/class/cs224u/index.html) NLU course including a lot on LLMs and transformers, taught by Christopher Potts. Materials are in [the Github repo](https://github.com/cgpotts/cs224u). Lectures are [in a Youtube playlist](https://www.youtube.com/playlist?list=PLoROMvodv4rOwvldxftJTmoR3kRcWkJBp). 712 | * [Stanford CS224N: NLP with Deep Learning](https://web.stanford.edu/class/cs224n/): Stanford’s course on NLP with deep learning; the Youtube [playlist of lectures is here](https://youtube.com/playlist?list=PLoROMvodv4rMFqRtEuo6SGjY4XbRIVRd4&feature=shared). 713 | * [Stanford CS25: Transformers United V3](https://web.stanford.edu/class/cs25/): Stanford course breaking down how transformers work, diving deep into the different kinds of transformers and how they're applied in different fields. The associated playlist with all the lectures [is available on Youtube](https://www.youtube.com/playlist?list=PLoROMvodv4rNiJRchCzutFw5ItR_Z27CM). 714 | * [CS685: Advanced Natural Language Processing (Spring 2023)](https://youtube.com/playlist?list=PLWnsVgP6CzaelCF_jmn5HrpOXzRAPNjWj&feature=shared): Content of this graduate-level course from the University of Massachusetts Amherst - lots of content on LLMs and Transformers. [Full course materials here](https://people.cs.umass.edu/~miyyer/cs685/schedule.html). 715 | * [CS5785 Applied Machine Learning](https://github.com/kuleshov/cornell-cs5785-2023-applied-ml): Lecture notes and slides from Cornell’s Applied Machine Learning Course, Fall 2023. 716 | * [CS388: Natural Language Processing](https://www.cs.utexas.edu/~gdurrett/courses/online-course/materials.html): Master’s-level NLP course from the University of Texas at Austin, offered online and taught by Prof Greg Durrett. Videos for lectures are in a [Youtube playlist here](https://www.youtube.com/playlist?list=PLofp2YXfp7TZZ5c7HEChs0_wfEfewLDs7). 717 | * [Hugging Face’s NLP Course](https://huggingface.co/learn/nlp-course/chapter1/1): With a focus on using the transformers library and transformer models. 718 | * [LLM University](https://docs.cohere.com/docs/llmu): Documentation and free learning on LLMs from Cohere. 719 | * [Large Language Model Course](https://github.com/mlabonne/llm-course): A microcourse composed of Colab notebooks and associated blog posts from Maxime Labonne @ JPMorganChase. 720 | * [Advanced NLP with spaCy](https://course.spacy.io/en/): Course from Ines Montani, one of the core developers of spaCy, on using it. Includes data analysis, pipelines, and fitting models. 721 | * [Applied Language Technology](https://applied-language-technology.mooc.fi): MOOC from the University of Helsinki on NLP, focusing on using spaCy. 722 | * [LangChain for LLM Application Development by Andrew Ng](https://www.deeplearning.ai/short-courses/langchain-for-llm-application-development/): Apply LLMs to your proprietary data to build personal assistants and specialized chatbots. 723 | * [Full Stack LLM Bootcamp](https://fullstackdeeplearning.com/llm-bootcamp/): Best practices and tools for building LLM-powered apps - materials are free. 724 | * [LangChain & Vector Databases in Production](https://learn.activeloop.ai/courses/langchain): Free course on LangChain using Deep Lake, ActiveLoop’s Vector database offering. 725 | * [UVA Deep Learning Course](https://uvadlc.github.io/): From the MSc in Artificial Intelligence at the University of Amsterdam. Highly technical! 
Tutorial [notebooks here](https://uvadlc-notebooks.readthedocs.io/), [Youtube playlist](https://youtube.com/playlist?list=PLdlPlO1QhMiAkedeu0aJixfkknLRxk1nA&feature=shared) here. 726 | * [Intro to Text Analytics with R](https://github.com/datasciencedojo/IntroToTextAnalyticsWithR/tree/master): From Data Science Dojo. The associated [Youtube playlist](https://www.youtube.com/playlist?list=PL8eNk_zTBST8olxIRFoo0YeXxEOkYdoxi) is here. 727 | * [Natural Language Processing for Semantic Search](https://www.pinecone.io/learn/series/nlp/): Course from Pinecone focused on embeddings and information retrieval, with accompanying code and videos. 728 | * [Generative AI Foundations on AWS Technical Deep Dive Series](https://www.youtube.com/playlist?list=PLhr1KZpdzukf-xb0lmiU3G89GJXaDbAIF): Youtube playlist of working with GenAI, training and fine-tuning models with Sagemaker. 729 | * [FourthBrain Resources](https://fourthbrain.ai/resources): Free resources from Fourthbrain from their community sessions and webinars, mainly focused on LLM development. 730 | * [Natural Language Processing with Large Language Models](https://github.com/jonkrohn/NLP-with-LLMs): Technical notebooks here from Jon Krohn’s half-day ODSC East workshop. Includes using transformers library for fine-tuning with T5 and using OpenAI API. 731 | * [Spark NLP Workshops](https://github.com/JohnSnowLabs/spark-nlp-workshop): A lot of resources here on all things SparkNLP, including code in Jupyter notebooks for different applications of SparkNLP to many use cases. 732 | * [Generative AI for Beginners](https://github.com/microsoft/generative-ai-for-beginners#generative-ai-for-beginners---a-course): Free online course from Microsoft 733 | * [Anaconda Learning](https://freelearning.anaconda.cloud/): Free learning courses from Anaconda on Jupyter and conda basics. 734 | * [Weights & Biases Courses](https://www.wandb.courses): Free LLM-related courses from Weights & Biases using their platform (requires email signup) 735 | 736 | 737 | ### Books 738 | 739 | 740 | 741 | * [Speech and Language Processing (3rd ed. draft)](https://web.stanford.edu/~jurafsky/slp3/): by Dan Jurafsky and James H. Martin. A fundamental text on all things NLP. 742 | * [Foundations of Statistical Natural Language Processing](https://mitpress.mit.edu/9780262133609/): by Christopher Manning and Hinrich Schütze 743 | * [Foundations of Machine Learning](https://cs.nyu.edu/~mohri/mlbook/): by Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Full text freely available as well as accompanying slides. 744 | * [Mathematics for Machine Learning](https://mml-book.github.io/): Free textbook from Cambridge University. Includes accompanying Jupyter notebooks. 745 | * [The Data Science Interview Book](https://book.thedatascienceinterviewproject.com/): A free online e-book for DS interview prep. Includes a growing section on NLP topics. 746 | * [Introduction to Modern Statistics](https://openintro-ims2.netlify.app/): Free online textbook on statistics. 747 | * [Dive into Deep Learning](http://d2l.ai/): Interactive deep learning book with code, math, and discussions implemented with multiple frameworks. Chapters 9-11 focus on RNNs & Transformers and 15-16 on NLP applications. 748 | * [Understanding Deep Learning](https://udlbook.github.io/udlbook/): by Simon J.D. Prince, free online and includes all code in Jupyter notebooks. Chapter 12 covers the transformer architecture. 
749 | * [Natural Language Processing in Action, Second Edition](https://www.manning.com/books/natural-language-processing-in-action-second-edition): by Hobson Lane and Maria Dyshel from Manning. Currently a MEAP set for release in Jan 2024. 750 | * [Natural Language Processing with Transformers](https://transformersbook.com/): by Lewis Tunstall, Leandro von Werra, and Thomas Wolf. From O’Reilly. You can view the code associated with the book in the [Github repo here](https://github.com/nlp-with-transformers/notebooks). 751 | * [Applied Text Analysis with Python](https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/): by Benjamin Bengfort, Rebecca Bilbro, Tony Ojeda, from O’Reilly. Aimed at Python developers breaking into NLP, focuses on building products, and includes using Spark. 752 | * [Build a Large Language Model (From Scratch)](https://github.com/rasbt/LLMs-from-scratch/): by Sebastian Raschka. This Github repo contains the code and examples from the book. 753 | 754 | 755 | ### Surveys 756 | 757 | 758 | 759 | * [Anaconda’s State of Data Science Report 2023](https://www.anaconda.com/state-of-data-science-report-2023): Anaconda’s annual survey for general DS. Requires form submission with email / personal details to download. 760 | * [State of AI Report 2023](https://www.stateof.ai/): From Air Street Capital. Very dense report focusing on high-level trends, industry players, funding, etc. and focused on LLMs and generative AI. 761 | * [Kaggle’s AI Report 2023](https://www.kaggle.com/AI-Report-2023): State of AI from Kaggle, taking the form of community-written long-form essays as part of a Kaggle competition 762 | * [MLOps Community 2023 LLM Survey Report:](https://mlops.community/surveys/llm/) Survey from [MLOps.community](https://mlops.community/) on trends in LLM usage and adoption. 763 | 764 | 765 | ### Aggregators and Online Resources 766 | 767 | 768 | 769 | * [LLM normcore reads](https://gist.github.com/veekaybee/be375ab33085102f9027853128dc5f0e): “Anti-hype LLM reading list” compiled by Vicky Boykis in a Github gist 770 | * [Machine Learning Glossary](https://developers.google.com/machine-learning/glossary): From Google Developers 771 | * [AI Canon](https://a16z.com/ai-canon/): A collection of links to fundamental resources for AI and LLMs, from Andreessen Horowitz 772 | * [Practical Large Language Models - Open Book](https://sherpa-ai.readthedocs.io/en/latest/Open%20Book/): Programmatically generated open book compiling summaries of talks and events from Aggregate Intellect 773 | * [NLP Progress](https://nlpprogress.com/): Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks. 774 | * [Awesome Legal NLP](https://github.com/maastrichtlawtech/awesome-legal-nlp): Compilation of links to NLP resources related to the field of law and legislation. 775 | * [Awesome LLM Fine-tuning](https://github.com/Curated-Awesome-Lists/awesome-llms-fine-tuning): Github awesome list of fine-tuning related resources. 776 | * [Instruction Tuning Papers](https://github.com/SinclairCoder/Instruction-Tuning-Papers): Repo with a list of papers related to instruction tuning LLMs. 777 | * [List of foundation LLMs](https://github.com/zhengzangw/awesome-huge-models#language-model): Language model section of the awesome-huge-models list. 778 | * [Awesome LLM](https://github.com/Hannibal046/Awesome-LLM): Curated list of important LLM papers, frameworks, resources, and other lists. 
Actively updated. 779 | * [LLM Survey:](https://github.com/RUCAIBox/LLMSurvey) A collection of papers and resources related to Large Language Models. 780 | * [Uni-LM](https://github.com/microsoft/unilm): Aggregate repo of LLM and foundation model work across Microsoft Research. They also have a repo specifically for [LLMOps](https://github.com/microsoft/lmops). 781 | 782 | 783 | ### Newsletters 784 | 785 | These are not referral links. 786 | 787 | 788 | 789 | * [GPTRoad](https://www.gptroad.com/subscribe): Daily no-nonsense newsletter covering developments in the AI / LLM space. They also [have a site](https://www.gptroad.com/) following the HackerNews template. 790 | * [TLDR AI](https://tldr.tech/ai): Daily newsletter with little fluff, covering developments in AI news. 791 | * [AI Tool Report](https://aitoolreport.beehiiv.com/): Newsletter from Respell, with AI headlines and jobs. 792 | * [The Memo from Lifearchitect.ai](https://lifearchitect.ai/memo/): Bi-weekly newsletter with future-focused updates on developments in the LLM-space. 793 | * [AI Breakfast](https://aibreakfast.beehiiv.com/): Curated weekly analysis of the latest AI projects, products, and news 794 | * [The Rundown AI](https://www.therundown.ai/): Another daily AI newsletter (400K+ readers) 795 | * [Interconnects](https://www.interconnects.ai/): LLM / AI newsletter for more technical readers. 796 | * [The Neuron](https://www.theneurondaily.com/): Another AI newsletter with a cutesy and light tone. 797 | 798 | 799 | ### Papers (WIP) 800 | 801 | 802 | 803 | * [Attention is All You Need:](https://arxiv.org/abs/1706.03762) *The* paper that started it all in 2017 and introduced the Transformer Architecture, from Google Brain. 804 | * [GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers](https://arxiv.org/abs/2210.17323): Post-training quantization paper from researchers at ETH Zurich and IST Austria 805 | * [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314): The efficient method combining quantization with LoRA that produced Guanaco from LLaMA. From researchers at the University of Washington (a minimal illustrative sketch follows this list). 806 | * [Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback](https://arxiv.org/abs/2204.05862): Paper from Anthropic on using RLHF for desired safety behaviors in LLMs. 807 | * [BRIO - Bringing Order to Abstractive Summarization](https://arxiv.org/abs/2203.16804): Abstractive summarization model from researchers at Yale and Carnegie Mellon using contrastive learning to rank candidate summaries. 
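To make the quantization and LoRA papers above a little more concrete, below is a minimal, hedged sketch of a QLoRA-style setup using the Hugging Face transformers and peft libraries (both linked elsewhere in this list). The base model name, the choice of target modules, and the LoRA hyperparameters are illustrative assumptions only, not recommendations from the papers themselves.

```python
# Minimal QLoRA-style sketch: load a base model in 4-bit and attach LoRA adapters.
# Assumes transformers, peft, and bitsandbytes are installed and a CUDA GPU is available.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder base model

# 4-bit NF4 quantization with double quantization, in the spirit of the QLoRA paper
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach small trainable LoRA adapters on top of the frozen, quantized base weights
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which modules to adapt is model-specific
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable
```

From here the adapted model could be fine-tuned with a standard transformers Trainer or with TRL (listed in the training section above); the point of the sketch is simply that the base weights stay quantized and frozen while only the low-rank adapters are trained.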
808 | 809 | 810 | ### Conferences and Societies 811 | 812 | 813 | 814 | * Association for Computational Linguistics (ACL): [https://www.aclweb.org/portal/](https://www.aclweb.org/portal/) 815 | * ACL 2023 in Toronto: [https://virtual2023.aclweb.org/](https://virtual2023.aclweb.org/)