├── .DS_Store ├── .github └── workflows │ └── publish.yml ├── .gitignore ├── CNAME ├── LICENSE ├── README.md ├── _quarto.yml ├── _subscribe.yml ├── about.qmd ├── app.py ├── courses.qmd ├── favicon.ico ├── images ├── bryan.jpg ├── charles.jpg ├── eugene.png ├── evalgen.png ├── hamel.png ├── hamel_cover_img.png ├── how-it-started.jpg ├── jason.jpg ├── jobs_seo.png ├── models-prices.png ├── niah.png └── shreya.jpg ├── index.qmd ├── jobs.qmd ├── services.qmd ├── styles.css └── talks ├── _evals └── index.qmd ├── _fine_tuning └── index.qmd ├── _inference └── index.qmd ├── _listing_meta.yml ├── _listing_root.yml ├── _prompt_eng └── index.qmd ├── _rag └── index.qmd ├── applications ├── index.qmd └── simon_llm_cli │ ├── index.qmd │ └── simon_bash.png └── index.qmd /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/parlance-labs/applied-llms/790c4a8d533a74b41c2e700f9eb2074266f0cec7/.DS_Store -------------------------------------------------------------------------------- /.github/workflows/publish.yml: -------------------------------------------------------------------------------- 1 | on: 2 | push: 3 | branches: 4 | - main 5 | 6 | name: Render and Publish 7 | 8 | permissions: 9 | contents: write 10 | pages: write 11 | 12 | jobs: 13 | build-deploy: 14 | runs-on: ubuntu-latest 15 | 16 | steps: 17 | - name: Check out repository 18 | uses: actions/checkout@v4 19 | 20 | - name: Set up Quarto 21 | uses: quarto-dev/quarto-actions/setup@v2 22 | env: 23 | GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} 24 | with: 25 | # To install LaTeX to build PDF book 26 | tinytex: true 27 | 28 | # See more at https://github.com/quarto-dev/quarto-actions/blob/main/examples/example-03-dependencies.md 29 | - name: Publish to GitHub Pages (and render) 30 | uses: quarto-dev/quarto-actions/publish@v2 31 | with: 32 | target: gh-pages 33 | env: 34 | GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} # this secret is always available for github actions 35 | 36 | 37 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Quarto 2 | _site/ 3 | 4 | # Byte-compiled / optimized / DLL files 5 | __pycache__/ 6 | *.py[cod] 7 | *$py.class 8 | 9 | # C extensions 10 | *.so 11 | 12 | # Distribution / packaging 13 | .Python 14 | build/ 15 | develop-eggs/ 16 | dist/ 17 | downloads/ 18 | eggs/ 19 | .eggs/ 20 | lib/ 21 | lib64/ 22 | parts/ 23 | sdist/ 24 | var/ 25 | wheels/ 26 | share/python-wheels/ 27 | *.egg-info/ 28 | .installed.cfg 29 | *.egg 30 | MANIFEST 31 | 32 | # PyInstaller 33 | # Usually these files are written by a python script from a template 34 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
35 | *.manifest 36 | *.spec 37 | 38 | # Installer logs 39 | pip-log.txt 40 | pip-delete-this-directory.txt 41 | 42 | # Unit test / coverage reports 43 | htmlcov/ 44 | .tox/ 45 | .nox/ 46 | .coverage 47 | .coverage.* 48 | .cache 49 | nosetests.xml 50 | coverage.xml 51 | *.cover 52 | *.py,cover 53 | .hypothesis/ 54 | .pytest_cache/ 55 | cover/ 56 | 57 | # Translations 58 | *.mo 59 | *.pot 60 | 61 | # Django stuff: 62 | *.log 63 | local_settings.py 64 | db.sqlite3 65 | db.sqlite3-journal 66 | 67 | # Flask stuff: 68 | instance/ 69 | .webassets-cache 70 | 71 | # Scrapy stuff: 72 | .scrapy 73 | 74 | # Sphinx documentation 75 | docs/_build/ 76 | 77 | # PyBuilder 78 | .pybuilder/ 79 | target/ 80 | 81 | # Jupyter Notebook 82 | .ipynb_checkpoints 83 | 84 | # IPython 85 | profile_default/ 86 | ipython_config.py 87 | 88 | # pyenv 89 | # For a library or package, you might want to ignore these files since the code is 90 | # intended to run in multiple environments; otherwise, check them in: 91 | # .python-version 92 | 93 | # pipenv 94 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 95 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 96 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 97 | # install all needed dependencies. 98 | #Pipfile.lock 99 | 100 | # poetry 101 | # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. 102 | # This is especially recommended for binary packages to ensure reproducibility, and is more 103 | # commonly ignored for libraries. 104 | # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control 105 | #poetry.lock 106 | 107 | # pdm 108 | # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. 109 | #pdm.lock 110 | # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it 111 | # in version control. 112 | # https://pdm.fming.dev/#use-with-ide 113 | .pdm.toml 114 | 115 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm 116 | __pypackages__/ 117 | 118 | # Celery stuff 119 | celerybeat-schedule 120 | celerybeat.pid 121 | 122 | # SageMath parsed files 123 | *.sage.py 124 | 125 | # Environments 126 | .env 127 | .venv 128 | env/ 129 | venv/ 130 | ENV/ 131 | env.bak/ 132 | venv.bak/ 133 | 134 | # Spyder project settings 135 | .spyderproject 136 | .spyproject 137 | 138 | # Rope project settings 139 | .ropeproject 140 | 141 | # mkdocs documentation 142 | /site 143 | !site/assets/images/social 144 | 145 | # mypy 146 | .mypy_cache/ 147 | .dmypy.json 148 | dmypy.json 149 | 150 | # Pyre type checker 151 | .pyre/ 152 | 153 | # pytype static type analyzer 154 | .pytype/ 155 | 156 | # Cython debug symbols 157 | cython_debug/ 158 | 159 | # PyCharm 160 | # JetBrains specific template is maintained in a separate JetBrains.gitignore that can 161 | # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore 162 | # and can be added to the global gitignore or merged into this file. For a more nuclear 163 | # option (not recommended) you can uncomment the following to ignore the entire idea folder. 
164 | #.idea/ 165 | 166 | /.quarto/ 167 | -------------------------------------------------------------------------------- /CNAME: -------------------------------------------------------------------------------- 1 | applied-llms.org 2 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 parlance-labs 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | [![Render and Publish](https://github.com/parlance-labs/applied-llms/actions/workflows/publish.yml/badge.svg)](https://github.com/parlance-labs/applied-llms/actions/workflows/publish.yml) 2 | 3 | 4 | # applied-llms 5 | Source code for [https://applied-llms.org](https://applied-llms.org) 6 | 7 | # Running Locally 8 | 9 | This is built using [Quarto](https://quarto.org/). After installing Quarto, you can run the site locally with: 10 | 11 | ```bash 12 | quarto preview 13 | ``` 14 | 15 | When you merge, GitHub Actions will automatically deploy changes. 16 | 17 | # Generating Social Card 18 | 19 | Hamel manually made a social card from [these Google slides](https://docs.google.com/presentation/d/1PQ_16_ljMCitLu99mOllhrUeTpx95WLrUh8HsRu3Lkg/edit?usp=sharing). 20 | 21 | You can make a copy of these slides and modify if you like. The social card is generated by clicking on the "File" menu and selecting "Download" -> "PNG image". 
22 | 23 | -------------------------------------------------------------------------------- /_quarto.yml: -------------------------------------------------------------------------------- 1 | project: 2 | type: website 3 | 4 | website: 5 | title: "Applied LLMs" 6 | image: images/hamel_cover_img.png 7 | favicon: favicon.ico 8 | google-analytics: "G-TL6F6S3QZP" 9 | twitter-card: true 10 | open-graph: true 11 | navbar: 12 | left: 13 | - text: Courses 14 | file: courses.qmd 15 | - text: Services 16 | file: services.qmd 17 | - text: Job Board 18 | href: https://jobs.applied-llms.org/ 19 | target: _blank 20 | - text: Team 21 | file: about.qmd 22 | repo-url: https://github.com/parlance-labs/applied-llms 23 | site-url: https://applied-llms.org 24 | # sidebar: 25 | # - title: Talks 26 | # pinned: true 27 | # collapse-level: 1 28 | # style: docked 29 | # contents: talks/** 30 | 31 | format: 32 | html: 33 | link-external-newwindow: true 34 | theme: 35 | light: flatly 36 | dark: darkly 37 | css: styles.css 38 | toc: true 39 | -------------------------------------------------------------------------------- /_subscribe.yml: -------------------------------------------------------------------------------- 1 | margin-footer: | 2 |
3 | Join 2000+ readers getting updates on how to apply LLMs effectively 4 | 5 |
6 | 25 | -------------------------------------------------------------------------------- /about.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "About The Authors" 3 | metadata-files: ["_subscribe.yml"] 4 | --- 5 | 6 | ## Eugene Yan 7 | 8 | ![](images/eugene.png){width=70 style="float:left"} 9 | 10 | Eugene Yan designs, builds, and operates machine learning systems that serve customers at scale. He's currently a Senior Applied Scientist at Amazon where he builds [RecSys for millions worldwide](https://eugeneyan.com/speaking/recsys2022-keynote/) and applies [LLMs to serve customers better](https://eugeneyan.com/speaking/ai-eng-summit/). Previously, he led machine learning at Lazada (acquired by Alibaba) and a Healthtech Series A. He writes & speaks about ML, RecSys, LLMs, and engineering at [eugeneyan.com](https://eugeneyan.com/) and [ApplyingML.com](https://applyingml.com/). 11 | 12 | ## Bryan Bischof 13 | 14 | ![](images/bryan.jpg){width=70 style="float:left"} 15 | 16 | Bryan Bischof is the Head of AI at Hex, where he leads the team of engineers building Magic – the data science and analytics copilot. Bryan has worked all over the data stack leading teams in analytics, machine learning engineering, data platform engineering, and AI engineering. He started the data team at Blue Bottle Coffee, led several projects at Stitch Fix, and built the data teams at Weights and Biases. Bryan previously co-authored the book Building Production Recommendation Systems with O’Reilly, and teaches Data Science and Analytics in the graduate school at Rutgers. His Ph.D. is in pure mathematics. 17 | 18 | ## Charles Frye 19 | 20 | ![](images/charles.jpg){width=70 style="float:left"} 21 | 22 | Charles Frye teaches people to build AI applications. After publishing research in [psychopharmacology](https://pubmed.ncbi.nlm.nih.gov/24316346/) and [neurobiology](https://journals.physiology.org/doi/full/10.1152/jn.00172.2016), he got his Ph.D. at the University of California, Berkeley, for dissertation work on [neural network optimization](https://arxiv.org/abs/2003.10397). He has taught thousands the entire stack of AI application development, from linear algebra fundamentals to GPU arcana and building defensible businesses, through educational and consulting work at Weights and Biases, [Full Stack Deep Learning](https://fullstackdeeplearning.com), and Modal. 23 | 24 | ## Hamel Husain 25 | 26 | ![](images/hamel.png){width=70 style="float:left"} 27 | 28 | [Hamel Husain](https://hamel.dev) is a machine learning engineer with over 25 years of experience. He has worked with innovative companies such as Airbnb and GitHub, which included early LLM research used by OpenAI for code understanding. He has also led and contributed to numerous popular [open-source machine-learning tools](https://hamel.dev/oss/opensource.html). Hamel is currently an independent consultant helping companies operationalize Large Language Models (LLMs) to accelerate their AI product journey. 29 | 30 | ## Jason Liu 31 | 32 | ![](images/jason.jpg){width=70 style="float:left"} 33 | 34 | Jason Liu is a distinguished machine learning [consultant](https://jxnl.co/services/) known for leading teams to successfully ship AI products. Jason's technical expertise covers personalization algorithms, search optimization, synthetic data generation, and MLOps systems. 
His experience includes companies like Stitch Fix, where he created a recommendation framework and observability tools that handled 350 million daily requests. Additional roles have included Meta, NYU, and startups such as Limitless AI and Trunk Tools. 35 | 36 | ## Shreya Shankar 37 | 38 | ![](images/shreya.jpg){width=70 style="float:left"} 39 | 40 | Shreya Shankar is an ML engineer and PhD student in computer science at UC Berkeley. She was the first ML engineer at 2 startups, building AI-powered products from scratch that serve thousands of users daily. As a researcher, her work focuses on addressing data challenges in production ML systems through a human-centered approach. Her work has appeared in top data management and human-computer interaction venues like VLDB, SIGMOD, CIDR, and CSCW. 41 | -------------------------------------------------------------------------------- /app.py: -------------------------------------------------------------------------------- 1 | import modal 2 | 3 | app = modal.App() 4 | 5 | @app.function() 6 | def hello(): 7 | print("Running remotely on Modal!") -------------------------------------------------------------------------------- /courses.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Courses" 3 | description: "Courses By The Authors" 4 | metadata-files: ["_subscribe.yml"] 5 | --- 6 | 7 | ## Mastering LLMs 8 | 9 | All of the authors of "What We've Learned From A Year of Building with LLMs" are instructors in an online course covering RAG, fine-tuning, evals, building applications and more. **The material is [available for free here](https://parlance-labs.com/education/).** 10 | 11 | ## AI Evals For Engineers 12 | 13 | Learn a systematic approach for improving AI applications, regardless of the domain or use case. A 4-week course with weekly live sessions. See [the course page here](https://maven.com/parlance-labs/evals). 14 | 15 | ## AI Essentials for Tech Executives 16 | 17 | A free course for executives, where we cover: 18 | 19 | - The most common hiring mistakes (and how to avoid them) 20 | - Strategies to drive ROI instead of creating vanity projects 21 | - A curated list of our preferred AI tools and vendors 22 | - Real-world case studies 23 | 24 | **[View the course here](https://ai-execs.com/).** 25 | 26 | ## Systematically Improving RAG Applications 27 | 28 | Take the next step after Mastering LLMs by enhancing your RAG applications. Complete our brief [interest survey](https://q7gjsgfstrp.typeform.com/ragcourse) (less than 5 minutes) to discover how to improve feedback capture, understand user queries, and identify areas for improvement using data-driven methods.
29 | -------------------------------------------------------------------------------- /favicon.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/parlance-labs/applied-llms/790c4a8d533a74b41c2e700f9eb2074266f0cec7/favicon.ico -------------------------------------------------------------------------------- /images/bryan.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/parlance-labs/applied-llms/790c4a8d533a74b41c2e700f9eb2074266f0cec7/images/bryan.jpg -------------------------------------------------------------------------------- /images/charles.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/parlance-labs/applied-llms/790c4a8d533a74b41c2e700f9eb2074266f0cec7/images/charles.jpg -------------------------------------------------------------------------------- /images/eugene.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/parlance-labs/applied-llms/790c4a8d533a74b41c2e700f9eb2074266f0cec7/images/eugene.png -------------------------------------------------------------------------------- /images/evalgen.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/parlance-labs/applied-llms/790c4a8d533a74b41c2e700f9eb2074266f0cec7/images/evalgen.png -------------------------------------------------------------------------------- /images/hamel.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/parlance-labs/applied-llms/790c4a8d533a74b41c2e700f9eb2074266f0cec7/images/hamel.png -------------------------------------------------------------------------------- /images/hamel_cover_img.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/parlance-labs/applied-llms/790c4a8d533a74b41c2e700f9eb2074266f0cec7/images/hamel_cover_img.png -------------------------------------------------------------------------------- /images/how-it-started.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/parlance-labs/applied-llms/790c4a8d533a74b41c2e700f9eb2074266f0cec7/images/how-it-started.jpg -------------------------------------------------------------------------------- /images/jason.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/parlance-labs/applied-llms/790c4a8d533a74b41c2e700f9eb2074266f0cec7/images/jason.jpg -------------------------------------------------------------------------------- /images/jobs_seo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/parlance-labs/applied-llms/790c4a8d533a74b41c2e700f9eb2074266f0cec7/images/jobs_seo.png -------------------------------------------------------------------------------- /images/models-prices.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/parlance-labs/applied-llms/790c4a8d533a74b41c2e700f9eb2074266f0cec7/images/models-prices.png -------------------------------------------------------------------------------- /images/niah.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/parlance-labs/applied-llms/790c4a8d533a74b41c2e700f9eb2074266f0cec7/images/niah.png -------------------------------------------------------------------------------- /images/shreya.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/parlance-labs/applied-llms/790c4a8d533a74b41c2e700f9eb2074266f0cec7/images/shreya.jpg -------------------------------------------------------------------------------- /index.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "What We've Learned From A Year of Building with LLMs" 3 | description: "A practical guide to building successful LLM products, covering the tactical, operational, and strategic." 4 | sidebar: false 5 | image: images/hamel_cover_img.png 6 | toc: true 7 | number-sections: true 8 | date: 2024-06-08 9 | metadata-files: 10 | - "_subscribe.yml" 11 | author: 12 | - name: "Eugene Yan" 13 | url: "https://eugeneyan.com" 14 | - name: Bryan Bischof 15 | url: "https://www.linkedin.com/in/bryan-bischof/" 16 | - name: "Charles Frye" 17 | url: "https://www.linkedin.com/in/charles-frye-38654abb/" 18 | - name: "Hamel Husain" 19 | url: "https://hamel.dev" 20 | - name: "Jason Liu" 21 | url: "https://jxnl.co/" 22 | - name: "Shreya Shankar" 23 | url: "https://www.sh-reya.com/" 24 | format: 25 | html: 26 | anchor-sections: true 27 | smooth-scroll: true 28 | comments: 29 | utterances: 30 | repo: parlance-labs/applied-llms 31 | --- 32 | 33 | :::{.callout-important} 34 | ## See Our Courses on Building with LLMs 35 | 36 | **We have a [series of courses](courses.qmd) that cover some of these topics in more depth.** 37 | 38 | 1. [AI Evals For Engineers](https://maven.com/parlance-labs/evals) 39 | 2. [Systematically Improving RAG Applications](https://maven.com/applied-llms/rag-playbook) 40 | 3. [Mastering LLMs](https://hamel.dev/blog/posts/course/) 41 | 4. [AI Essentials for Tech Executives](https://ai-execs.com) 42 | 43 | ::: 44 | 45 | It’s an exciting time to build with large language models (LLMs). Over the past year, LLMs have become “good enough” for real-world applications. And they're getting better and cheaper every year. Coupled with a parade of demos on social media, there will be an [estimated $200B investment in AI by 2025](https://www.goldmansachs.com/intelligence/pages/ai-investment-forecast-to-approach-200-billion-globally-by-2025.html){target="_blank"}. Furthermore, provider APIs have made LLMs more accessible, allowing everyone, not just ML engineers and scientists, to build intelligence into their products. Nonetheless, while the barrier to entry for building with AI has been lowered, creating products and systems that are effective—beyond a demo—remains deceptively difficult. 46 | 47 | We've spent the past year building, and have discovered many sharp edges along the way. While we don't claim to speak for the entire industry, we'd like to share what we've learned to help you avoid our mistakes and iterate faster. These are organized into three sections: 48 | 49 | - [Tactical](#tactical-nuts--bolts-of-working-with-llms): Some practices for prompting, RAG, flow engineering, evals, and monitoring. Whether you're a practitioner building with LLMs, or hacking on weekend projects, this section was written for you. 50 | - [Operational](#operation-day-to-day-and-org-concerns): The organizational, day-to-day concerns of shipping products, and how to build an effective team.
For product/technical leaders looking to deploy sustainably and reliably. 51 | - [Strategic](#strategy-building-with-llms-without-getting-out-maneuvered): The long-term, big-picture view, with opinionated takes such as "no GPU before PMF" and "focus on the system not the model", and how to iterate. Written with founders and executives in mind. 52 | 53 | We intend to make this a practical guide to building successful products with LLMs, drawing from our own experiences and pointing to examples from around the industry. 54 | 55 | Ready to ~~delve~~ dive in? Let’s go. 56 | 57 | --- 58 | 59 | # Tactical: Nuts & Bolts of Working with LLMs 60 | 61 | Here, we share best practices for core components of the emerging LLM stack: prompting tips to improve quality and reliability, evaluation strategies to assess output, retrieval-augmented generation ideas to improve grounding, how to design human-in-the-loop workflows, and more. While the technology is still nascent, we trust these lessons are broadly applicable and can help you ship robust LLM applications. 62 | 63 | ## Prompting 64 | We recommend starting with prompting when prototyping new applications. It’s easy to both underestimate and overestimate its importance. It’s underestimated because the right prompting techniques, when used correctly, can get us very far. It’s overestimated because even prompt-based applications require significant engineering around the prompt to work well. 65 | 66 | ### Focus on getting the most out of fundamental prompting techniques 67 | 68 | A few prompting techniques have consistently helped with improving performance across a variety of models and tasks: n-shot prompts + in-context learning, chain-of-thought, and providing relevant resources. 69 | 70 | The idea of in-context learning via n-shot prompts is to provide the LLM with examples that demonstrate the task and align outputs to our expectations. A few tips: 71 | 72 | - If n is too low, the model may over-anchor on those specific examples, hurting its ability to generalize. As a rule of thumb, aim for n ≥ 5. Don’t be afraid to go as high as a few dozen. 73 | - Examples should be representative of the prod distribution. If you're building a movie summarizer, include samples from different genres in roughly the same proportion you'd expect to see in practice. 74 | - You don't always need to provide the input-output pairs; examples of desired outputs may be sufficient. 75 | - If you plan for the LLM to use tools, include examples of using those tools. 76 | 77 | In Chain-of-Thought (CoT) prompting, we encourage the LLM to explain its thought process before returning the final answer. Think of it as providing the LLM with a sketchpad so it doesn’t have to do it all in memory. The original approach was to simply add the phrase "Let’s think step by step" as part of the instructions, but we’ve found it helpful to make the CoT more specific, where adding specificity via an extra sentence or two often reduces hallucination rates significantly. 78 | 79 | For example, when asking an LLM to summarize a meeting transcript, we can be explicit about the steps: 80 | 81 | - First, list out the key decisions, follow-up items, and associated owners in a sketchpad. 82 | - Then, check that the details in the sketchpad are factually consistent with the transcript. 83 | - Finally, synthesize the key points into a concise summary. 84 | 85 | Note that in recent times, [some doubt](https://arxiv.org/abs/2405.04776) has been cast on whether this technique is as powerful as believed. Additionally, there’s significant debate as to exactly what is going on during inference when Chain-of-Thought is being used. Regardless, this technique is one to experiment with when possible.
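To make the meeting-transcript example above concrete, here is a minimal sketch using the OpenAI Python SDK; the model name, the `transcript` placeholder, and the exact step wording are illustrative assumptions rather than a prescribed recipe.

```python
from openai import OpenAI

client = OpenAI()
transcript = "..."  # the raw meeting transcript goes here

cot_prompt = f"""You will summarize a meeting transcript.

Work step by step in a <sketchpad>:
1. List the key decisions, follow-up items, and associated owners.
2. Check that each item in the sketchpad is factually consistent with the transcript.
3. Only then, write a concise summary of the key points in <summary> tags.

<transcript>
{transcript}
</transcript>"""

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; use whichever model you're prototyping with
    messages=[{"role": "user", "content": cot_prompt}],
)
print(response.choices[0].message.content)
```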
86 | 87 | Providing relevant resources is a powerful mechanism to expand the model’s knowledge base, reduce hallucinations, and increase the user’s trust. Often accomplished via Retrieval Augmented Generation (RAG), providing the model with snippets of text that it can directly utilize in its response is an essential technique. When providing the relevant resources, it’s not enough to merely include them; don’t forget to tell the model to prioritize their use, refer to them directly, and to mention when none of the resources are sufficient. These help “ground” agent responses to a corpus of resources. 88 | 89 | ### Structure your inputs and outputs 90 | 91 | Structured input and output help models better understand the input as well as return output that can reliably integrate with downstream systems. Adding serialization formatting to your inputs can help provide more clues to the model as to the relationships between tokens in the context, additional metadata to specific tokens (like types), or relate the request to similar examples in the model’s training data. 92 | 93 | As an example, many questions on the internet about writing SQL begin by specifying the SQL schema. Thus, you can expect that effective prompting for Text-to-SQL should include [structured schema definitions](https://www.researchgate.net/publication/371223615_SQL-PaLM_Improved_Large_Language_ModelAdaptation_for_Text-to-SQL). 94 | 95 | Structured input expresses tasks clearly and resembles how the training data is formatted, increasing the probability of better output. Structured output simplifies integration into downstream components of your system. [Instructor](https://github.com/jxnl/instructor) and [Outlines](https://github.com/outlines-dev/outlines) work well for structured output. (If you’re importing an LLM API SDK, use Instructor; if you’re importing Huggingface for a self-hosted model, use Outlines.) 96 | 97 | When using structured input, be aware that each LLM family has its own preferences. Claude prefers `<xml>` while GPT favors Markdown and JSON. With XML, you can even pre-fill Claude's responses by providing a `<response>` tag like so. 98 | 99 | ```python 100 | messages=[ 101 | { 102 | "role": "user", 103 | "content": """Extract the <name>, <size>, <price>, and <color> from this product description into your <response>. 104 | The SmartHome Mini is a compact smart home assistant available in black or white for only $49.99. At just 5 inches wide, it lets you control lights, thermostats, and other connected devices via voice or app—no matter where you place it in your home. This affordable little hub brings convenient hands-free control to your smart devices. 105 | """ 106 | }, 107 | { 108 | "role": "assistant", 109 | "content": "<response>" 110 | } 111 | ] 112 | ``` 113 | 114 | ### Have small prompts that do one thing, and only one thing, well 115 | 116 | A common anti-pattern / code smell in software is the "[God Object](https://en.wikipedia.org/wiki/God_object)", where we have a single class or function that does everything. The same applies to prompts too. 117 | 118 | A prompt typically starts simple: A few sentences of instruction, a couple of examples, and we're good to go. But as we try to improve performance and handle more edge cases, complexity creeps in. More instructions. Multi-step reasoning. Dozens of examples.
Before we know it, our initially simple prompt is now a 2,000 token Frankenstein. And to add injury to insult, it has worse performance on the more common and straightforward inputs! GoDaddy shared this challenge as their [No. 1 lesson from building with LLMs](https://www.godaddy.com/resources/news/llm-from-the-trenches-10-lessons-learned-operationalizing-models-at-godaddy#h-1-sometimes-one-prompt-isn-t-enough). 119 | 120 | Just like how we strive (read: struggle) to keep our systems and code simple, so should we for our prompts. Instead of having a single, catch-all prompt for the meeting transcript summarizer, we can break it into steps: 121 | 122 | - Extract key decisions, action items, and owners into structured format 123 | - Check extracted details against the original transcription for consistency 124 | - Generate a concise summary from the structured details 125 | 126 | As a result, we’ve split our single prompt into multiple prompts that are each simple, focused, and easy to understand. And by breaking them up, we can now iterate and eval each prompt individually. 127 | 128 | ### Craft your context tokens 129 | 130 | Rethink, and challenge your assumptions about how much context you actually need to send to the agent. Be like Michelangelo, do not build up your context sculpture—chisel away the superfluous material until the sculpture is revealed. RAG is a popular way to collate all of the potentially relevant blocks of marble, but what are you doing to extract what’s necessary? 131 | 132 | We’ve found that taking the final prompt sent to the model—with all of the context construction, and meta-prompting, and RAG results—putting it on a blank page and just reading it, really helps you rethink your context. We have found redundancy, self-contradictory language, and poor formatting using this method. 133 | 134 | The other key optimization is the structure of your context. If your bag-of-docs representation isn’t helpful for humans, don’t assume it’s any good for agents. Think carefully about how you structure your context to underscore the relationships between parts of it and make extraction as simple as possible. 135 | 136 | More [prompting fundamentals](https://eugeneyan.com/writing/prompting/) such as prompting mental model, prefilling, context placement, etc. 137 | 138 | ## Information Retrieval / RAG 139 | 140 | Beyond prompting, another effective way to steer an LLM is by providing knowledge as part of the prompt. This grounds the LLM on the provided context which is then used for in-context learning. This is known as retrieval-augmented generation (RAG). Practitioners have found RAG effective at providing knowledge and improving output, while requiring far less effort and cost compared to finetuning. 141 | 142 | 143 | ### RAG is only as good as the retrieved documents' relevance, density, and detail 144 | 145 | The quality of your RAG's output is dependent on the quality of retrieved documents, which in turn can be considered along a few factors. 146 | 147 | The first and most obvious metric is relevance. This is typically quantified via ranking metrics such as [Mean Reciprocal Rank (MRR)](https://en.wikipedia.org/wiki/Mean_reciprocal_rank) or [Normalized Discounted Cumulative Gain (NDCG)](https://en.wikipedia.org/wiki/Discounted_cumulative_gain). MRR evaluates how well a system places the first relevant result in a ranked list while NDCG considers the relevance of all the results and their positions. They measure how good the system is at ranking relevant documents higher and irrelevant documents lower. For example, if we're retrieving user reviews to generate movie review summaries, we'll want to rank reviews for the specific movie higher while excluding reviews for other movies.
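For illustration, here is a small, dependency-free sketch of computing MRR over a handful of queries; the document IDs and relevance labels are made up, and NDCG can be computed analogously with graded relevance.

```python
def mean_reciprocal_rank(retrieved: list[list[str]], relevant: list[set[str]]) -> float:
    """Average of 1/rank of the first relevant doc per query (0 if none was retrieved)."""
    reciprocal_ranks = []
    for ranked_docs, relevant_ids in zip(retrieved, relevant):
        rr = 0.0
        for rank, doc_id in enumerate(ranked_docs, start=1):
            if doc_id in relevant_ids:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Two toy queries: ranked review IDs returned by retrieval, and the truly relevant IDs.
retrieved = [["r7", "r2", "r9"], ["r4", "r1", "r3"]]
relevant = [{"r2"}, {"r3", "r4"}]
print(mean_reciprocal_rank(retrieved, relevant))  # (1/2 + 1/1) / 2 = 0.75
```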
148 | 149 | Like traditional recommendation systems, the rank of retrieved items will have a significant impact on how the LLM performs on downstream tasks. To measure the impact, run a RAG-based task but with the retrieved items shuffled—how does the RAG output perform? 150 | 151 | Second, we also want to consider information density. If two documents are equally relevant, we should prefer one that’s more concise and has fewer extraneous details. Returning to our movie example, we might consider the movie transcript and all user reviews to be relevant in a broad sense. Nonetheless, the top-rated reviews and editorial reviews will likely be more dense in information. 152 | 153 | Finally, consider the level of detail provided in the document. Imagine we're building a RAG system to generate SQL queries from natural language. We could simply provide table schemas with column names as context. But, what if we include column descriptions and some representative values? The additional detail could help the LLM better understand the semantics of the table and thus generate more correct SQL. 154 | 155 | ### Don’t forget keyword search; use it as a baseline and in hybrid search 156 | 157 | Given how prevalent the embedding-based RAG demo is, it's easy to forget or overlook the decades of research and solutions in information retrieval. 158 | 159 | Nonetheless, while embeddings are undoubtedly a powerful tool, they are not the be-all and end-all. First, while they excel at capturing high-level semantic similarity, they may struggle with more specific, keyword-based queries, like when users search for names (e.g., Ilya), acronyms (e.g., RAG), or IDs (e.g., claude-3-sonnet). Keyword-based search, such as BM25, is explicitly designed for this. Finally, after years of keyword-based search, users have likely taken it for granted and may get frustrated if the document they expect to retrieve isn't being returned. 160 | 161 | > Vector embeddings *do not* magically solve search. In fact, the heavy lifting is in the step before you re-rank with semantic similarity search. Making a genuine improvement over BM25 or full-text search is hard. — [Aravind Srinivas, CEO Perplexity.ai](https://x.com/AravSrinivas/status/1737886080555446552) 162 | 163 | > We've been communicating this to our customers and partners for months now. Nearest Neighbor Search with naive embeddings yields very noisy results and you're likely better off starting with a keyword-based approach. — [Beyang Liu, CTO Sourcegraph](https://twitter.com/beyang/status/1767330006999720318) 164 | 165 | Second, it's more straightforward to understand why a document was retrieved with keyword search—we can look at the keywords that match the query. In contrast, embedding-based retrieval is less interpretable. Finally, thanks to systems like Lucene and OpenSearch that have been optimized and battle-tested over decades, keyword search is usually more computationally efficient. 166 | 167 | In most cases, a hybrid will work best: keyword matching for the obvious matches, and embeddings for synonyms, hypernyms, and spelling errors, as well as multimodality (e.g., images and text).
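As an illustration, here's a minimal, dependency-free sketch of fusing a keyword ranking with an embedding ranking via reciprocal rank fusion (RRF); the two ranked lists are assumed to come from your own BM25 and vector indexes, and k=60 is just the conventional constant.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of doc IDs into one fused list, best first."""
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc3", "doc1", "doc7"]       # from the keyword (BM25) index
embedding_hits = ["doc1", "doc5", "doc3"]  # from the vector index
print(reciprocal_rank_fusion([bm25_hits, embedding_hits]))
```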
[Shortwave shared how they built their RAG pipeline](https://www.shortwave.com/blog/deep-dive-into-worlds-smartest-email-ai/), including query rewriting, keyword + embedding retrieval, and ranking. 168 | 169 | ### Prefer RAG over finetuning for new knowledge 170 | 171 | Both RAG and finetuning can be used to incorporate new information into LLMs and increase performance on specific tasks. However, which should we prioritize? 172 | 173 | Recent research suggests RAG may have an edge. [One study](https://arxiv.org/abs/2312.05934) compared RAG against unsupervised finetuning (aka continued pretraining), evaluating both on a subset of MMLU and current events. They found that RAG consistently outperformed finetuning for knowledge encountered during training as well as entirely new knowledge. In [another paper](https://arxiv.org/abs/2401.08406), they compared RAG against supervised finetuning on an agricultural dataset. Similarly, the performance boost from RAG was greater than finetuning, especially for GPT-4 (see Table 20). 174 | 175 | Beyond improved performance, RAG has other practical advantages. First, compared to continuous pretraining or finetuning, it's easier—and cheaper!—to keep retrieval indices up-to-date. Second, if our retrieval indices have problematic documents that contain toxic or biased content, we can easily drop or modify the offending documents. Consider it an andon cord for [documents that ask us to add glue to pizza](https://x.com/petergyang/status/1793480607198323196). 176 | 177 | In addition, the R in RAG provides finer-grained control over how we retrieve documents. For example, if we’re hosting a RAG system for multiple organizations, by partitioning the retrieval indices, we can ensure that each organization can only retrieve documents from their own index. This ensures that we don’t inadvertently expose information from one organization to another. 178 | 179 | ### Long-context models won't make RAG obsolete 180 | 181 | With Gemini 1.5 providing context windows of up to 10M tokens in size, some have begun to question the future of RAG. 182 | 183 | > I tend to believe that Gemini 1.5 is significantly overhyped by Sora. A context window of 10M tokens effectively makes most of existing RAG frameworks unnecessary — you simply put whatever your data into the context and talk to the model like usual. Imagine how it does to all the startups / agents / langchain projects where most of the engineering efforts goes to RAG 😅 184 | > Or in one sentence: the 10m context kills RAG. Nice work Gemini — [Yao Fu](https://x.com/Francis_YAO_/status/1758935954189115714) 185 | 186 | While it's true that long contexts will be a game-changer for use cases such as analyzing multiple documents or chatting with PDFs, the rumors of RAG's demise are greatly exaggerated. 187 | 188 | First, even with a context size of 10M tokens, we'd still need a way to select relevant context. Second, beyond the narrow needle-in-a-haystack eval, we've yet to see convincing data that models can effectively reason over large context sizes. Thus, without good retrieval (and ranking), we risk overwhelming the model with distractors, or may even fill the context window with completely irrelevant information. 189 | 190 | Finally, there’s cost. During inference, the Transformer’s time complexity scales linearly with context length. Just because there exists a model that can read your org’s entire Google Drive contents before answering each question doesn’t mean that’s a good idea. 
Consider an analogy to how we use RAM: we still read and write from disk, even though there exist compute instances with [RAM running into the tens of terabytes](https://aws.amazon.com/ec2/instance-types/high-memory/). 191 | 192 | So don't throw your RAGs in the trash just yet. This pattern will remain useful even as context sizes grow. 193 | 194 | ## Tuning and optimizing workflows 195 | 196 | Prompting an LLM is just the beginning. To get the most juice out of them, we need to think beyond a single prompt and embrace workflows. For example, how could we split a single complex task into multiple simpler tasks? When is finetuning or caching helpful with increasing performance and reducing latency/cost? Here, we share proven strategies and real-world examples to help you optimize and build reliable LLM workflows. 197 | 198 | ### Step-by-step, multi-turn “flows” can give large boosts 199 | 200 | It's common knowledge that decomposing a single big prompt into multiple smaller prompts can achieve better results. For example, [AlphaCodium](https://arxiv.org/abs/2401.08500): By switching from a single prompt to a multi-step workflow, they increased GPT-4 accuracy (pass@5) on CodeContests from 19% to 44%. The workflow includes: 201 | 202 | - Reflecting on the problem 203 | - Reasoning on the public tests 204 | - Generating possible solutions 205 | - Ranking possible solutions 206 | - Generating synthetic tests 207 | - Iterating on the solutions on public and synthetic tests. 208 | 209 | Small tasks with clear objectives make for the best agent or flow prompts. It’s not required that every agent prompt requests structured output, but structured outputs help a lot to interface with whatever system is orchestrating the agent’s interactions with the environment. Some things to try: 210 | 211 | - A tightly-specified, explicit planning step. Also, consider having predefined plans to choose from. 212 | - Rewriting the original user prompts into agent prompts, though this process may be lossy! 213 | - Agent behaviors as linear chains, DAGs, and state machines; different dependency and logic relationships can be more and less appropriate for different scales. Can you squeeze performance optimization out of different task architectures? 214 | - Planning validations; your planning can include instructions on how to evaluate the responses from other agents to make sure the final assembly works well together. 215 | - Prompt engineering with fixed upstream state—make sure your agent prompts are evaluated against a collection of variants of what may have happened before. 216 | 217 | ### Prioritize deterministic workflows for now 218 | 219 | While AI agents can dynamically react to user requests and the environment, their non-deterministic nature makes them a challenge to deploy. Each step an agent takes has a chance of failing, and the chances of recovering from the error are poor. Thus, the likelihood that an agent completes a multi-step task successfully decreases exponentially as the number of steps increases. As a result, teams building agents find it difficult to deploy reliable agents. 220 | 221 | A potential approach is to have agent systems produce deterministic plans which are then executed in a structured, reproducible way. First, given a high-level goal or prompt, the agent generates a plan. Then, the plan is executed deterministically. This allows each step to be more predictable and reliable (a minimal sketch of this pattern follows the list below). Benefits include: 222 | 223 | - Generated plans can serve as few-shot samples to prompt or finetune an agent. 224 | - Deterministic execution makes the system more reliable, and thus easier to test and debug. In addition, failures can be traced to the specific steps in the plan. 225 | - Generated plans can be represented as directed acyclic graphs (DAGs) which are easier, relative to a static prompt, to understand and adapt to new situations.
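Here is a minimal sketch of that plan-then-execute pattern; the `llm()` helper, the JSON plan schema, and the tool names are illustrative assumptions rather than any specific framework's API.

```python
import json

def llm(prompt: str) -> str:
    """Placeholder for a call to your model provider; returns the raw completion text."""
    raise NotImplementedError

# Only steps that map to these vetted, deterministic functions may be executed.
TOOLS = {
    "search_docs": lambda query: f"results for {query!r}",
    "summarize": lambda text: text[:200],
}

def make_plan(goal: str) -> list[dict]:
    prompt = (
        "Break this goal into a JSON list of steps, each shaped like "
        '{"tool": "<search_docs or summarize>", "input": "<string>"}.\n'
        f"Goal: {goal}"
    )
    return json.loads(llm(prompt))  # generate once, log/review it, then execute as-is

def execute(plan: list[dict]) -> list[str]:
    outputs = []
    for i, step in enumerate(plan):
        if step["tool"] not in TOOLS:
            raise ValueError(f"Step {i} uses an unknown tool: {step['tool']!r}")
        outputs.append(TOOLS[step["tool"]](step["input"]))  # deterministic, traceable per step
    return outputs
```

Because the plan is plain data, it can be logged, diffed, replayed, and reused as a few-shot example; when something breaks, the failing step is immediately identifiable.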
226 | 227 | The most successful agent builders may be those with strong experience managing junior engineers because the process of generating plans is similar to how we instruct and manage juniors. We give juniors clear goals and concrete plans, instead of vague open-ended directions, and we should do the same for our agents too. 228 | 229 | In the end, the key to reliable, working agents will likely be found in adopting more structured, deterministic approaches, as well as collecting data to refine prompts and finetune models. Without this, we’ll build agents that may work exceptionally well some of the time, but on average, disappoint users. 230 | 231 | ### Getting more diverse outputs beyond temperature 232 | 233 | Suppose your task requires diversity in an LLM’s output. Maybe you’re writing an LLM pipeline to suggest products to buy from your catalog given a list of products the user bought previously. When running your prompt multiple times, you might notice that the resulting recommendations are too similar—so you might increase the temperature parameter in your LLM requests. 234 | 235 | Briefly, increasing the temperature parameter makes LLM responses more varied. At sampling time, the probability distributions of the next token become flatter, meaning that tokens that are usually less likely get chosen more often. Still, when increasing temperature, you may notice some failure modes related to output diversity. For example, some products from the catalog that could be a good fit may never be output by the LLM. The same handful of products might be overrepresented in outputs, if they are highly likely to follow the prompt based on what the LLM has learned at training time. If the temperature is too high, you may get outputs that reference nonexistent products (or gibberish!). 236 | 237 | In other words, increasing temperature does not guarantee that the LLM will sample outputs from the probability distribution you expect (e.g., uniform random). Nonetheless, we have other tricks to increase output diversity. The simplest way is to adjust elements within the prompt. For example, if the prompt template includes a list of items, such as historical purchases, shuffling the order of these items each time they're inserted into the prompt can make a significant difference. 238 | 239 | Additionally, keeping a short list of recent outputs can help prevent redundancy. In our recommended products example, by instructing the LLM to avoid suggesting items from this recent list, or by rejecting and resampling outputs that are similar to recent suggestions, we can further diversify the responses. Another effective strategy is to vary the phrasing used in the prompts. For instance, incorporating phrases like "pick an item that the user would love using regularly" or "select a product that the user would likely recommend to friends" can shift the focus and thereby influence the variety of recommended products. 240 | 241 | ### Caching is underrated 242 | 243 | Caching saves cost and eliminates generation latency by removing the need to recompute responses for the same input.
Furthermore, if a response has previously been guardrailed, we can serve these vetted responses and reduce the risk of serving harmful or inappropriate content. 244 | 245 | One straightforward approach to caching is to use unique IDs for the items being processed, such as if we're summarizing new articles or [product reviews](https://www.cnbc.com/2023/06/12/amazon-is-using-generative-ai-to-summarize-product-reviews.html). When a request comes in, we can check to see if a summary already exists in the cache. If so, we can return it immediately; if not, we generate, guardrail, and serve it, and then store it in the cache for future requests. 246 | 247 | For more open-ended queries, we can borrow techniques from the field of search, which also leverages caching for open-ended inputs. Features like autocomplete, spelling correction, and suggested queries also help normalize user input and thus increase the cache hit rate. 248 | 249 | ### When to finetune 250 | 251 | We may have some tasks where even the most cleverly designed prompts fall short. For example, even after significant prompt engineering, our system may still be a ways from returning reliable, high-quality output. If so, then it may be necessary to finetune a model for your specific task. 252 | 253 | Successful examples include: 254 | 255 | - [Honeycomb's Natural Language Query Assistant](https://www.honeycomb.io/blog/introducing-query-assistant): Initially, the "programming manual" was provided in the prompt together with n-shot examples for in-context learning. While this worked decently, finetuning the model led to better output on the syntax and rules of the domain-specific language. 256 | - [Rechat's Lucy](https://www.youtube.com/watch?v=B_DMMlDuJB0): The LLM needed to generate responses in a very specific format that combined structured and unstructured data for the frontend to render correctly. Finetuning was essential to get it to work consistently. 257 | 258 | Nonetheless, while finetuning can be effective, it comes with significant costs. We have to annotate finetuning data, finetune and evaluate models, and eventually self-host them. Thus, consider if the higher upfront cost is worth it. If prompting gets you 90% of the way there, then finetuning may not be worth the investment. However, if we do decide to finetune, to reduce the cost of collecting human-annotated data, we can [generate and finetune on synthetic data](https://eugeneyan.com/writing/synthetic/), or [bootstrap on open-source data](https://eugeneyan.com/writing/finetuning/). 259 | 260 | ## Evaluation & Monitoring 261 | 262 | Evaluating LLMs is a [minefield](https://www.cs.princeton.edu/~arvindn/talks/evaluating_llms_minefield/) and even the biggest labs find it [challenging](https://www.anthropic.com/news/evaluating-ai-systems). LLMs return open-ended outputs, and the tasks we set them to are varied. Nonetheless, rigorous and thoughtful evals are critical—it’s no coincidence that technical leaders at OpenAI [work on evaluation and give feedback on individual evals](https://twitter.com/eugeneyan/status/1701692908074873036). 263 | 264 | Evaluating LLM applications invites a diversity of definitions and reductions: it’s simply unit testing, or it’s more like observability, or maybe it’s just data science. We have found all of these perspectives useful. In this section, we provide some lessons on what is important in building evals and monitoring pipelines. 
265 | 266 | ### Create a few assertion-based unit tests from real input/output samples 267 | 268 | Create [unit tests (i.e., assertions)](https://hamel.dev/blog/posts/evals/#level-1-unit-tests) consisting of samples of inputs and outputs from production, with expectations for outputs based on at least three criteria. While three criteria might seem arbitrary, it's a practical number to start with; fewer might indicate that your task isn't sufficiently defined or is too open-ended, like a general-purpose chatbot. These unit tests, or assertions, should be triggered by any changes to the pipeline, whether it's editing a prompt, adding new context via RAG, or other modifications. This [write-up has an example](https://hamel.dev/blog/posts/evals/#step-1-write-scoped-tests) of an assertion-based test for an actual use case. 269 | 270 | Consider beginning with assertions that specify phrases that let us include or exclude responses. Also try checks to ensure that word, item, or sentence counts lie within a range. For other kinds of generations, assertions can look different. [Execution-based evaluation](https://www.semanticscholar.org/paper/Execution-Based-Evaluation-for-Open-Domain-Code-Wang-Zhou/1bed34f2c23b97fd18de359cf62cd92b3ba612c3) is one way to evaluate code generation, wherein you run the generated code and check if the state of runtime is sufficient for the user request. 271 | 272 | As an example, if the user asks for a new function named foo; then after executing the agent’s generated code, foo should be callable! One challenge in execution-based evaluation is that the agent code frequently leaves the runtime in a slightly different form than the target code. It can be effective to “relax” assertions to the absolute most weak assumptions that any viable answer would satisfy. 273 | 274 | Finally, using your product as intended for customers (i.e., “dogfooding”) can provide insight into failure modes on real-world data. This approach not only helps identify potential weaknesses, but also provides a useful source of production samples that can be converted into evals. 275 | 276 | ### LLM-as-Judge can work (somewhat), but it's not a silver bullet 277 | 278 | LLM-as-Judge, where we use a strong LLM to evaluate the output of other LLMs, has been met with skepticism. (Some of us were initially huge skeptics.) Nonetheless, when implemented well, LLM-as-Judge achieves decent correlation with human judgments, and can at least help build priors about how a new prompt or technique may perform. Specifically, when doing pairwise comparisons (control vs. treatment), LLM-as-Judge typically gets the direction right though the magnitude of the win/loss may be noisy. 279 | 280 | Here are some suggestions to get the most out of LLM-as-Judge: 281 | 282 | - Use pairwise comparisons: Instead of asking the LLM to score a single output on a [Likert](https://en.wikipedia.org/wiki/Likert_scale) scale, present it with two options and ask it to select the better one. This tends to lead to more stable results. 283 | - Control for position bias: The order of options presented can bias the LLM's decision. To mitigate this, do each pairwise comparison twice, swapping the order of pairs each time. Just be sure to attribute wins to the right option after swapping! 284 | - Allow for ties: In some cases, both options may be equally good. Thus, allow the LLM to declare a tie so it doesn't have to arbitrarily pick a winner. 
285 | - Use Chain-of-Thought: Asking the LLM to explain its decision before giving a final answer can increase eval reliability. As a bonus, this lets you use a weaker but faster LLM and still achieve similar results. Because this part of the pipeline is typically run in batch, the extra latency from CoT isn’t a problem. 286 | - Control for response length: LLMs tend to bias toward longer responses. To mitigate this, ensure response pairs are similar in length. 287 | 288 | A useful application of LLM-as-Judge is checking a new prompting strategy against regression. If you have tracked a collection of production results, sometimes you can rerun those production examples with a new prompting strategy, and use LLM-as-Judge to quickly assess where the new strategy may suffer. 289 | 290 | Here’s an example of a [simple but effective approach](https://hamel.dev/blog/posts/evals/#automated-evaluation-w-llms) to iterate on LLM-as-Judge, where we log the LLM response, judge's critique (i.e., CoT), and final outcome. They are then reviewed with stakeholders to identify areas for improvement. Over three iterations, agreement with humans and LLM improved from 68% to 94%! 291 | 292 | ![](https://hamel.dev/blog/posts/evals/images/spreadsheet.png) 293 | 294 | LLM-as-Judge is not a silver bullet though. There are subtle aspects of language where even the strongest models fail to evaluate reliably. In addition, we've found that [conventional classifiers](https://eugeneyan.com/writing/finetuning/) and reward models can achieve higher accuracy than LLM-as-Judge, and with lower cost and latency. For code generation, LLM-as-Judge can be weaker than more direct evaluation strategies like execution-evaluation. 295 | 296 | > [Read more](https://eugeneyan.com/writing/llm-evaluators/) on techniques, alignment workflows, finetuning models, critiques, etc. for LLM-evaluators 297 | 298 | ### The “intern test” for evaluating generations 299 | 300 | We like to use the following “intern test” when evaluating generations: If you took the exact input to the language model, including the context, and gave it to an average college student in the relevant major as a task, could they succeed? How long would it take? 301 | 302 | - If the answer is no because the LLM lacks the required knowledge, consider ways to enrich the context. 303 | - If the answer is no and we simply can’t improve the context to fix it, then we may have hit a task that’s too hard for contemporary LLMs. 304 | - If the answer is yes, but it would take a while, we can try to reduce the complexity of the task. Is it decomposable? Are there aspects of the task that can be made more templatized? 305 | - If the answer is yes, they would get it quickly, then it’s time to dig into the data. What’s the model doing wrong? Can we find a pattern of failures? Try asking the model to explain itself before or after it responds, to help you build a theory of mind. 306 | 307 | ### Overemphasizing certain evals can hurt overall performance 308 | 309 | > "When a measure becomes a target, it ceases to be a good measure." — Goodhart's Law. 310 | 311 | An example of this is the Needle-in-a-Haystack (NIAH) eval. The original eval helped quantify model recall as context sizes grew, as well as how recall is affected by needle position. However, it’s been so overemphasized that it's featured as [Figure 1 for Gemini 1.5's report](https://arxiv.org/abs/2403.05530).
The eval involves inserting a specific phrase ("The special magic {city} number is: {number}") into a long document that repeats the essays of Paul Graham, and then prompting the model to recall the magic number. 312 | 313 | While some models achieve near-perfect recall, it's questionable whether NIAH truly measures the reasoning and recall abilities needed in real-world applications. Consider a more practical scenario: Given the transcript of an hour-long meeting, can the LLM summarize the key decisions and next steps, as well as correctly attribute each item to the relevant person? This task is more realistic, going beyond rote memorization, and considers the ability to parse complex discussions, identify relevant information, and synthesize summaries. 314 | 315 | Here’s an example of a [practical NIAH eval](https://observablehq.com/@shreyashankar/needle-in-the-real-world-experiments). Using [doctor-patient transcripts](https://github.com/wyim/aci-bench/tree/main/data/challenge_data), the LLM is queried about the patient's medication. It also includes a more challenging NIAH, inserting a phrase for random ingredients for pizza toppings, such as "_The secret ingredients needed to build the perfect pizza are: Espresso-soaked dates, Lemon, and Goat cheese._". Recall was around 80% on the medication task and 30% on the pizza task. 316 | 317 | ![](images/niah.png){width=80%} 318 | 319 | Tangentially, an overemphasis on NIAH evals can reduce performance on extraction and summarization tasks. Because these LLMs are so finetuned to attend to every sentence, they may start to treat irrelevant details and distractors as important, thus including them in the final output (when they shouldn't!) 320 | 321 | This could also apply to other evals and use cases. For example, summarization. An emphasis on factual consistency could lead to summaries that are less specific (and thus less likely to be factually inconsistent) and possibly less relevant. Conversely, an emphasis on writing style and eloquence could lead to more flowery, marketing-type language that could introduce factual inconsistencies. 322 | 323 | ### Simplify annotation to binary tasks or pairwise comparisons 324 | 325 | Providing open-ended feedback or ratings for model output on a [Likert scale](https://en.wikipedia.org/wiki/Likert_scale) is cognitively demanding. As a result, the data collected is more noisy—due to variability among human raters—and thus less useful. A more effective approach is to simplify the task and reduce the cognitive burden on annotators. Two tasks that work well are binary classifications and pairwise comparisons. 326 | 327 | In binary classifications, annotators are asked to make a simple yes-or-no judgment on the model's output. They might be asked whether the generated summary is factually consistent with the source document, or whether the proposed response is relevant, or if it contains toxicity. Compared to the Likert scale, binary decisions are more precise, have higher consistency among raters, and lead to higher throughput. This was how [Doordash set up their labeling queues](https://doordash.engineering/2020/08/28/overcome-the-cold-start-problem-in-menu-item-tagging/) for tagging menu items through a tree of yes-no questions. 328 | 329 | In pairwise comparisons, the annotator is presented with a pair of model responses and asked which is better. 
Because it's easier for humans to say "A is better than B" than to assign an individual score to A or B, this leads to faster and more reliable annotations (over Likert scales). At a [Llama2 meetup](https://www.youtube.com/watch?v=CzR3OrOkM9w), Thomas Scialom, an author on the Llama2 paper, confirmed that pairwise comparisons were faster and cheaper than collecting supervised finetuning data such as written responses. The former’s cost is $3.50 per unit while the latter’s is $25 per unit. 330 | 331 | If you’re writing labeling guidelines, here are some [example guidelines](https://eugeneyan.com/writing/labeling-guidelines/) from Google and Bing Search. 332 | 333 | ### (Reference-free) evals and guardrails can be used interchangeably 334 | 335 | Guardrails help to catch inappropriate or harmful content while evals help to measure the quality and accuracy of the model's output. And if your evals are reference-free, they can be used as guardrails too. Reference-free evals are evaluations that don't rely on a "golden" reference, such as a human-written answer, and can assess the quality of output based solely on the input prompt and the model's response. 336 | 337 | Some examples of these are [summarization evals](https://eugeneyan.com/writing/evals/#summarization-consistency-relevance-length), where we only have to consider the input document to evaluate the summary on factual consistency and relevance. If the summary scores poorly on these metrics, we can choose not to display it to the user, effectively using the eval as a guardrail. Similarly, reference-free [translation evals](https://eugeneyan.com/writing/evals/#translation-statistical--learned-evals-for-quality) can assess the quality of a translation without needing a human-translated reference, again allowing us to use it as a guardrail. 338 | 339 | ### LLMs will return output even when they shouldn't 340 | 341 | A key challenge when working with LLMs is that they'll often generate output even when they shouldn't. This can lead to harmless but nonsensical responses, or more egregious defects like toxicity or dangerous content. For example, when asked to extract specific attributes or metadata from a document, an LLM may confidently return values even when those values don't actually exist. Alternatively, the model may respond in a language other than English because we provided non-English documents in the context. 342 | 343 | While we can try to prompt the LLM to return a "not applicable" or "unknown" response, it's not foolproof. Even when the log probabilities are available, they're a poor indicator of output quality. While log probs indicate the likelihood of a token appearing in the output, they don’t necessarily reflect the correctness of the generated text. Moreover, for instruction-tuned models that are trained to answer queries and generate coherent responses, log probabilities may not be well-calibrated. Thus, while a high log probability may indicate that the output is fluent and coherent, it doesn’t mean it’s accurate or relevant. 344 | 345 | While careful prompt engineering can help to an extent, we should complement it with robust guardrails that detect and filter/regenerate undesired output. For example, OpenAI provides a [content moderation API](https://platform.openai.com/docs/guides/moderation) that can identify unsafe responses such as hate speech, self-harm, or sexual output. 
Similarly, there are numerous packages for [detecting personally identifiable information](https://github.com/topics/pii-detection). One benefit is that guardrails are largely agnostic of the use case and can thus be applied broadly to all output in a given language. In addition, with precise retrieval, our system can deterministically respond “I don’t know” if there are no relevant documents. 346 | 347 | A corollary here is that LLMs may fail to produce outputs when they are expected to. This can happen for various reasons, from straightforward issues like long-tail latencies from API providers to more complex ones such as outputs being blocked by content moderation filters. As such, it’s important to consistently log inputs and (potentially a lack of) outputs for debugging and monitoring. 348 | 349 | ### Hallucinations are a stubborn problem 350 | 351 | Unlike content safety or PII defects which have a lot of attention and thus seldom occur, factual inconsistencies are stubbornly persistent and more challenging to detect. They're more common and occur at a baseline rate of 5 - 10%, and from what we've learned from LLM providers, it can be challenging to get it below 2%, even on simple tasks such as summarization. 352 | 353 | To address this, we can combine prompt engineering (upstream of generation) and factual inconsistency guardrails (downstream of generation). For prompt engineering, techniques like CoT help reduce hallucination by getting the LLM to explain its reasoning before finally returning the output. Then, we can apply a [factual inconsistency guardrail](https://eugeneyan.com/writing/finetuning/) to assess the factuality of summaries and filter or regenerate hallucinations. In some cases, hallucinations can be deterministically detected. When using resources from RAG retrieval, if the output is structured and identifies what the resources are, you should be able to manually verify they’re sourced from the input context. 354 | 355 | # Operational: Day-to-day and Org concerns 356 | 357 | ## Data 358 | 359 | Just as the quality of ingredients determines the taste of a dish, the quality of input data constrains the performance of machine learning systems. In addition, output data is the only way to tell whether the product is working or not. All the authors focus on the data, looking at inputs and outputs for several hours a week to better understand the data distribution: its modes, its edge cases, and the limitations of models of it. 360 | 361 | ### Check for development-prod skew 362 | 363 | A common source of errors in traditional machine learning pipelines is _train-serve skew_. This happens when the data used in training differs from what the model encounters in production. Although we can use LLMs without training or finetuning, hence there’s no training set, a similar issue arises with development-prod data skew. Essentially, the data we test our systems on during development should mirror what the systems will face in production. If not, we might find our production accuracy suffering. 364 | 365 | LLM development-prod skew can be categorized into two types: structural and content-based. Structural skew includes issues like formatting discrepancies, such as differences between a JSON dictionary with a list-type value and a JSON list, inconsistent casing, and errors like typos or sentence fragments. These errors can lead to unpredictable model performance because different LLMs are trained on specific data formats, and prompts can be highly sensitive to minor changes. 
Content-based or "semantic" skew refers to differences in the meaning or context of the data.  366 | 367 | As in traditional ML, it's useful to periodically measure skew between the LLM input/output pairs. Simple metrics like the length of inputs and outputs or specific formatting requirements (e.g., JSON or XML) are straightforward ways to track changes. For more “advanced” drift detection, consider clustering embeddings of input/output pairs to detect semantic drift, such as shifts in the topics users are discussing, which could indicate they are exploring areas the model hasn't been exposed to before.  368 | 369 | When testing changes, such as prompt engineering, ensure that hold-out datasets are current and reflect the most recent types of user interactions. For example, if typos are common in production inputs, they should also be present in the hold-out data. Beyond just numerical skew measurements, it's beneficial to perform qualitative assessments on outputs. Regularly reviewing your model's outputs—a practice colloquially known as "vibe checks"—ensures that the results align with expectations and remain relevant to user needs. Finally, incorporating nondeterminism into skew checks is also useful—by running the pipeline multiple times for each input in our testing dataset and analyzing all outputs, we increase the likelihood of catching anomalies that might occur only occasionally. 370 | 371 | ### Look at samples of LLM inputs and outputs every day 372 | 373 | LLMs are dynamic and constantly evolving. Despite their impressive zero-shot capabilities and often delightful outputs, their failure modes can be highly unpredictable. For custom tasks, regularly reviewing data samples is essential to developing an intuitive understanding of how LLMs perform. 374 | 375 | Input-output pairs from production are the “real things, real places” (_genchi genbutsu_) of LLM applications, and they cannot be substituted. [Recent research](https://arxiv.org/abs/2404.12272) highlighted that developers' perceptions of what constitutes "good" and "bad" outputs shift as they interact with more data (i.e., _criteria drift_). While developers can come up with some criteria upfront for evaluating LLM outputs, these predefined criteria are often incomplete. For instance, during the course of development, we might update the prompt to increase the probability of good responses and decrease the probability of bad ones. This iterative process of evaluation, reevaluation, and criteria update is necessary, as it's difficult to predict either LLM behavior or human preference without directly observing the outputs. 376 | 377 | To manage this effectively, we should log LLM inputs and outputs. By examining a sample of these logs daily, we can quickly identify and adapt to new patterns or failure modes. When we spot a new issue, we can immediately write an assertion or eval around it. Similarly, any updates to failure mode definitions should be reflected in the evaluation criteria. These "vibe checks" are signals of bad outputs; code and assertions operationalize them. Finally, this attitude must be socialized, for example by adding review or annotation of inputs and outputs to your on-call rotation. 378 | 379 | ## Working with models 380 | 381 | With LLM APIs, we can rely on intelligence from a handful of providers. While this is a boon, these dependencies also involve trade-offs on performance, latency, throughput, and cost. 
Also, as newer, better models drop (almost every month in the past year), we should be prepared to update our products as we deprecate old models and migrate to newer models. In this section, we share our lessons from working with technologies we don’t have full control over, where the models can’t be self-hosted and managed. 382 | 383 | ### Generate structured output to ease downstream integration 384 | 385 | For most real-world use cases, the output of an LLM will be consumed by a downstream application via some machine-readable format. For example, [Rechat](https://www.youtube.com/watch?v=B_DMMlDuJB0), a real-estate CRM, required structured responses for the front end to render widgets. Similarly, [Boba](https://martinfowler.com/articles/building-boba.html), a tool for generating product strategy ideas, needed structured output with fields for title, summary, plausibility score, and time horizon. Finally, LinkedIn shared about [constraining the LLM to generate YAML](https://www.linkedin.com/blog/engineering/generative-ai/musings-on-building-a-generative-ai-product), which is then used to decide which skill to use, as well as provide the parameters to invoke the skill. 386 | 387 | This application pattern is an extreme version of Postel’s Law: be liberal in what you accept (arbitrary natural language) and conservative in what you send (typed, machine-readable objects). As such, we expect it to be extremely durable. 388 | 389 | Currently, [Instructor](https://github.com/jxnl/instructor) and [Outlines](https://github.com/outlines-dev/outlines) are the de facto standards for coaxing structured output from LLMs. If you're using an LLM API (e.g., Anthropic, OpenAI), use Instructor; if you're working with a self-hosted model (e.g., Huggingface), use Outlines. 390 | 391 | ### Migrating prompts across models is a pain in the ass 392 | 393 | Sometimes, our carefully crafted prompts work superbly with one model but fall flat with another. This can happen when we're switching between various model providers, as well as when we upgrade across versions of the same model.  394 | 395 | For example, Voiceflow found that [migrating from gpt-3.5-turbo-0301 to gpt-3.5-turbo-1106 led to a 10% drop](https://www.voiceflow.com/blog/how-much-do-chatgpt-versions-affect-real-world-performance) in their intent classification task. (Thankfully, they had evals!) Similarly, [GoDaddy observed a trend in the positive direction](https://www.godaddy.com/resources/news/llm-from-the-trenches-10-lessons-learned-operationalizing-models-at-godaddy#h-3-prompts-aren-t-portable-across-models), where upgrading to version 1106 narrowed the performance gap between gpt-3.5-turbo and gpt-4. (Or, if you’re a glass-half-full person, you might be disappointed that gpt-4’s lead was reduced with the new upgrade) 396 | 397 | Thus, if we have to migrate prompts across models, expect it to take more time than simply swapping the API endpoint. Don't assume that plugging in the same prompt will lead to similar or better results. Also, having reliable, automated evals helps with measuring task performance before and after migration, and reduces the effort needed for manual verification. 398 | 399 | ### Version and pin your models 400 | 401 | In any machine learning pipeline, "[changing anything changes everything](https://papers.nips.cc/paper_files/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba-Abstract.html)". 
This is particularly relevant as we rely on components like large language models (LLMs) that we don't train ourselves and that can change without our knowledge. 402 | 403 | Fortunately, many model providers offer the option to “pin” specific model versions (e.g., gpt-4-turbo-1106). This enables us to use a specific version of the model weights, ensuring they remain unchanged. Pinning model versions in production can help avoid unexpected changes in model behavior, which could lead to customer complaints about issues that may crop up when a model is swapped, such as overly verbose outputs or other unforeseen failure modes. 404 | 405 | Additionally, consider maintaining a shadow pipeline that mirrors your production setup but uses the latest model versions. This enables safe experimentation and testing with new releases. Once you've validated the stability and quality of the outputs from these newer models, you can confidently update the model versions in your production environment. 406 | 407 | ### Choose the smallest model that gets the job done 408 | 409 | When working on a new application, it’s tempting to use the biggest, most powerful model available. But once we’ve established that the task is technically feasible, it’s worth experimenting if a smaller model can achieve comparable results. 410 | 411 | The benefits of a smaller model are lower latency and cost. While it may be weaker, techniques like chain-of-thought, n-shot prompts, and in-context learning can help smaller models punch above their weight. Beyond LLM APIs, finetuning our specific tasks can also help increase performance. 412 | 413 | Taken together, a carefully crafted workflow using a smaller model can often match, or even surpass, the output quality of a single large model, while being faster and cheaper. For example, this [tweet](https://twitter.com/mattshumer_/status/1770823530394833242) shares anecdata of how Haiku + 10-shot prompt outperforms zero-shot Opus and GPT-4. In the long term, we expect to see more examples of [flow-engineering](https://twitter.com/karpathy/status/1748043513156272416) with smaller models as the optimal balance of output quality, latency, and cost. 414 | 415 | As another example, take the humble classification task. Lightweight models like DistilBERT (67M parameters) are a surprisingly strong baseline. The 400M parameter DistilBART is another great option—when finetuned on open-source data, it could [identify hallucinations with an ROC-AUC of 0.84](https://eugeneyan.com/writing/finetuning/), surpassing most LLMs at less than 5% of the latency and cost. 416 | 417 | The point is, don’t overlook smaller models. While it’s easy to throw a massive model at every problem, with some creativity and experimentation, we can often find a more efficient solution.  418 | 419 | ## Product 420 | 421 | While new technology offers new possibilities, the principles of building great products are timeless. Thus, even if we’re solving new problems for the first time, we don’t have to reinvent the wheel on product design. There’s a lot to gain from grounding our LLM application development in solid product fundamentals, allowing us to deliver real value to the people we serve. 422 | 423 | ### Involve design early and often 424 | 425 | Having a designer will push you to understand and think deeply about how your product can be built and presented to users. We sometimes stereotype designers as folks who take things and make them pretty. 
But beyond just the user interface, they also rethink how the user experience can be improved, even if it means breaking existing rules and paradigms. 426 | 427 | Designers are especially gifted at reframing the user's needs into various forms. Some of these forms are more tractable to solve than others, and thus, they may offer more or fewer opportunities for AI solutions. Like many other products, building AI products should be centered around the job to be done, not the technology that powers them. 428 | 429 | Focus on asking yourself: “What job is the user asking this product to do for them? Is that job something a chatbot would be good at? How about autocomplete? Maybe something different!” Consider the existing [design patterns](https://www.tidepool.so/blog/emerging-ux-patterns-for-generative-ai-apps-copilots) and how they relate to the job-to-be-done. These are the invaluable assets that designers add to your team’s capabilities. 430 | 431 | ### Design your UX for Human-In-The-Loop 432 | 433 | One way to get quality annotations is to integrate Human-in-the-Loop (HITL) into the user experience (UX). By allowing users to provide feedback and corrections easily, we can improve the immediate output and collect valuable data to improve our models. 434 | 435 | Imagine an e-commerce platform where users upload and categorize their products. There are several ways we could design the UX: 436 | 437 | - The user manually selects the right product category; an LLM periodically checks new products and corrects miscategorization on the backend. 438 | - The user doesn't select any category at all; an LLM periodically categorizes products on the backend (with potential errors). 439 | - An LLM suggests a product category in real-time, which the user can validate and update as needed. 440 | 441 | While all three approaches involve an LLM, they provide very different UXes. The first approach puts the initial burden on the user and has the LLM acting as a post-processing check. The second requires zero effort from the user but provides no transparency or control. The third strikes the right balance. By having the LLM suggest categories upfront, we reduce cognitive load on the user and they don't have to learn our taxonomy to categorize their product! At the same time, by allowing the user to review and edit the suggestion, they have the final say in how their product is classified, putting control firmly in their hands. As a bonus, the third approach creates a [natural feedback loop for model improvement](https://eugeneyan.com/writing/llm-patterns/#collect-user-feedback-to-build-our-data-flywheel). Suggestions that are good are accepted (positive labels) and those that are bad are updated (negative followed by positive labels). 442 | 443 | This pattern of suggestion, user validation, and data collection is commonly seen in several applications: 444 | 445 | - Coding assistants: Where users can accept a suggestion (strong positive), accept and tweak a suggestion (positive), or ignore a suggestion (negative) 446 | - Midjourney: Where users can choose to upscale and download the image (strong positive), vary an image (positive), or generate a new set of images (negative) 447 | - Chatbots: Where users can provide thumbs up (positive) or thumbs down (negative) on responses, or choose to regenerate a response if it was really bad (strong negative). 448 | 449 | Feedback can be explicit or implicit. 
Explicit feedback is information users provide in response to a request by our product; implicit feedback is information we learn from user interactions without needing users to deliberately provide feedback. Coding assistants and Midjourney are examples of implicit feedback while thumbs up and thumb downs are explicit feedback. If we design our UX well, like coding assistants and Midjourney, we can collect plenty of implicit feedback to improve our product and models. 450 | 451 | ### Prioritize your hierarchy of needs ruthlessly 452 | 453 | As we think about putting our demo into production, we'll have to think about the requirements for: 454 | 455 | - Reliability: 99.9% uptime, adherence to structured output 456 | - Harmlessness: Not generate offensive, NSFW, or otherwise harmful content 457 | - Factual consistency: Being faithful to the context provided, not making things up 458 | - Usefulness: Relevant to the users' needs and request 459 | - Scalability: Latency SLAs, supported throughput 460 | - Cost: Because we don't have unlimited budget 461 | - And more: Security, privacy, fairness, GDPR, DMA, etc, etc. 462 | 463 | If we try to tackle all these requirements at once, we're never going to ship anything. Thus, we need to prioritize. Ruthlessly. This means being clear what is non-negotiable (e.g., reliability, harmlessness) without which our product can't function or won't be viable. It's all about identifying the minimum lovable product. We have to accept that the first version won't be perfect, and just launch and iterate. 464 | 465 | ### Calibrate your risk tolerance based on the use case 466 | 467 | When deciding on the language model and level of scrutiny of an application, consider the use case and audience. For a customer-facing chatbot offering medical or financial advice, we’ll need a very high bar for safety and accuracy. Mistakes or bad output could cause real harm and erode trust. But for less critical applications, such as a recommender system, or internal-facing applications like content classification or summarization, excessively strict requirements only slow progress without adding much value. 468 | 469 | This aligns with a recent [a16z report](https://a16z.com/generative-ai-enterprise-2024/) showing that many companies are moving faster with internal LLM applications compared to external ones (image below). By experimenting with AI for internal productivity, organizations can start capturing value while learning how to manage risk in a more controlled environment. Then, as they gain confidence, they can expand to customer-facing use cases. 470 | 471 | ![](https://a16z.com/wp-content/uploads/2024/03/How-willing-are-enterprises-to-use-LLMs-for-different-use-cases_-1-scaled.jpg) 472 | Proportion of enterprise LLM use across internal and external-facing use cases ([source: a16z report](https://a16z.com/generative-ai-enterprise-2024/)) 473 | 474 | ## Team & Roles 475 | 476 | No job function is easy to define, but writing a job description for the work in this new space is more challenging than others. We’ll forgo Venn diagrams of intersecting job titles, or suggestions for job descriptions. We will, however, submit to the existence of a new role—the AI engineer—and discuss its place. Importantly, we’ll discuss the rest of the team and how responsibilities should be assigned. 477 | 478 | ### Focus on the process, not tools 479 | 480 | When faced with new paradigms, such as LLMs, software engineers tend to favor tools. 
As a result, we overlook the problem and process the tool was supposed to solve. In doing so, many engineers assume accidental complexity, which has negative consequences for the team's long-term productivity. 481 | 482 | For example, [this write-up](https://hamel.dev/blog/posts/prompt/) discusses how certain tools can automatically create prompts for large language models. It argues (rightfully IMHO) that engineers who use these tools without first understanding the problem-solving methodology or process end up taking on unnecessary technical debt. 483 | 484 | In addition to accidental complexity, tools are often underspecified. For example, there is a growing industry of LLM evaluation tools that offer “LLM Evaluation In A Box” with generic evaluators for toxicity, conciseness, tone, etc. We have seen many teams adopt these tools without thinking critically about the specific failure modes of their domains. Contrast this to EvalGen. It focuses on teaching users the process of creating domain-specific evals by deeply involving the user each step of the way, from specifying criteria, to labeling data, to checking evals. The software leads the user through a workflow that looks like this: 485 | 486 | ![](images/evalgen.png) 487 | Shankar, S., et al. (2024). Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences. Retrieved from [https://arxiv.org/abs/2404.12272](https://arxiv.org/abs/2404.12272) 488 | 489 | EvalGen guides the user through a best practice of crafting LLM evaluations, namely: 490 | 491 | - Defining domain-specific tests (bootstrapped automatically from the prompt). These are defined as either assertions with code or with LLM-as-a-Judge. 492 | - The importance of aligning the tests with human judgment, so that the user can check that the tests capture the specified criteria. 493 | - Iterating on your tests as the system (prompts, etc) changes.  494 | 495 | EvalGen provides developers with a mental model of the evaluation-building process without anchoring them to a specific tool. We have found that after providing AI Engineers with this context, they often decide to select leaner tools or build their own.   496 | 497 | There are too many components of LLMs beyond prompt writing and evaluations to list exhaustively here.  However, it is important that AI Engineers seek to understand the processes before adopting tools. 498 | 499 | ### Always be experimenting 500 | 501 | ML products are deeply intertwined with experimentation. Not only the A/B, Randomized Control Trials kind, but the frequent attempts at modifying the smallest possible components of your system, and doing offline evaluation. The reason why everyone is so hot for evals is not actually about trustworthiness and confidence—it’s about enabling experiments! The better your evals, the faster you can iterate on experiments, and thus the faster you can converge on the best version of your system.  502 | 503 | It’s common to try different approaches to solving the same problem because experimentation is so cheap now. The high cost of collecting data and training a model is minimized—prompt engineering costs little more than human time. Position your team so that everyone is taught [the basics of prompt engineering](https://eugeneyan.com/writing/prompting/). This encourages everyone to experiment and leads to diverse ideas from across the organization. 504 | 505 | Additionally, don’t only experiment to explore—also use them to exploit! Have a working version of a new task? 
Consider having someone else on the team approach it differently. Try doing it another way that’ll be faster. Investigate prompt techniques like Chain-of-Thought or Few-Shot to make it higher quality. Don’t let your tooling hold you back on experimentation; if it is, rebuild it, or buy something to make it better.  506 | 507 | Finally, during product/project planning, set aside time for building evals and running multiple experiments. Think of the product spec for engineering products, but add to it clear criteria for evals. And during roadmapping, don’t underestimate the time required for experimentation—expect to do multiple iterations of development and evals before getting the green light for production. 508 | 509 | ### Empower everyone to use new AI technology 510 | 511 | As generative AI increases in adoption, we want the entire team—not just the experts—to understand and feel empowered to use this new technology. There’s no better way to develop intuition for how LLMs work (e.g., latencies, failure modes, UX) than to, well, use them. LLMs are relatively accessible: You don’t need to know how to code to improve performance for a pipeline, and everyone can start contributing via prompt engineering and evals. 512 | 513 | A big part of this is education. It can start as simple as [the basics of prompt engineering](https://eugeneyan.com/writing/prompting/), where techniques like n-shot prompting and CoT help condition the model towards the desired output. Folks who have the knowledge can also educate about the more technical aspects, such as how LLMs are autoregressive when generating output. In other words, while input tokens are processed in parallel, output tokens are generated sequentially. As a result, latency is more a function of output length than input length—this is a key consideration when designing UXes and setting performance expectations. 514 | 515 | We can go further and provide opportunities for hands-on experimentation and exploration. A hackathon perhaps? While it may seem expensive to have a team spend a few days hacking on speculative projects, the outcomes may surprise you. We know of a team that, through a hackathon, accelerated and almost completed their three-year roadmap within a year. Another team had a hackathon that led to paradigm-shifting UXes that are now possible thanks to LLMs, which have been prioritized for the year and beyond. 516 | 517 | ### Don’t fall into the trap of “AI Engineering is all I need” 518 | 519 | As new job titles are coined, there is an initial tendency to overstate the capabilities associated with these roles. This often results in a painful correction as the actual scope of these jobs becomes clear. Newcomers to the field, as well as hiring managers, might make exaggerated claims or have inflated expectations. Notable examples over the last decade include: 520 | 521 | - Data Scientist: “[someone who is better at statistics than any software engineer and better at software engineering than any statistician](https://x.com/josh_wills/status/198093512149958656).”   522 | - Machine Learning Engineer (MLE): a software engineering-centric view of machine learning  523 | 524 | Initially, many assumed that data scientists alone were sufficient for data-driven projects. However, it became apparent that data scientists must collaborate with software and data engineers to develop and deploy data products effectively.  
525 | 526 | This misunderstanding has shown up again with the new role of AI Engineer, with some teams believing that AI Engineers are all you need. In reality, building machine learning or AI products requires a [broad array of specialized roles](https://papers.nips.cc/paper_files/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba-Abstract.html). We’ve consulted with more than a dozen companies on AI products and have consistently observed that they fall into the trap of believing that "AI Engineering is all you need." As a result, products often struggle to scale beyond a demo as companies overlook crucial aspects involved in building a product. 527 | 528 | For example, evaluation and measurement are crucial for scaling a product beyond vibe checks. The skills for effective evaluation align with some of the strengths traditionally seen in machine learning engineers—a team composed solely of AI Engineers will likely lack these skills. Co-author Hamel Husain illustrates the importance of these skills in his recent work around detecting [data drift](https://github.com/hamelsmu/ft-drift) and [designing domain-specific evals](https://hamel.dev/blog/posts/evals/). 529 | 530 | Here is a rough progression of the types of roles you need, and when you’ll need them, throughout the journey of building an AI product: 531 | 532 | - First, focus on building a product. This might include an AI engineer, but it doesn’t have to. AI Engineers are valuable for prototyping and iterating quickly on the product (UX, plumbing, etc).  533 | - Next, create the right foundations by instrumenting your system and collecting data. Depending on the type and scale of data, you might need platform and/or data engineers. You must also have systems for querying and analyzing this data to debug issues. 534 | - Next, you will eventually want to optimize your AI system. This doesn’t necessarily involve training models. The basics include steps like designing metrics, building evaluation systems, running experiments, optimizing RAG retrieval, debugging stochastic systems, and more. MLEs are really good at this (though AI engineers can pick them up too). It usually doesn’t make sense to hire an MLE unless you have completed the prerequisite steps. 535 | 536 | Aside from this, you need a domain expert at all times. At small companies, this would ideally be the founding team—and at bigger companies, product managers can play this role. Being aware of the progression and timing of roles is critical. Hiring folks at the wrong time (e.g., [hiring an MLE too early](https://jxnl.co/writing/2024/04/08/hiring-mle-at-early-stage-companies/)) or building in the wrong order is a waste of time and money, and causes churn.  Furthermore, regularly checking in with an MLE (but not hiring them full-time) during phases 1-2 will help the company build the right foundations.  537 | 538 | More on how to interview and hire ML/AI Engineers [here](https://eugeneyan.com/writing/how-to-interview), where we discuss: (i) what to interview for, (ii) how to conduct phone screens, interview loops, and debriefs, and (iii) tips for interviewers and hiring managers. 539 | 540 | # Strategy: Building with LLMs without Getting Out-Maneuvered 541 | 542 | Successful products require thoughtful planning and prioritization, not endless prototyping or following the latest model releases or trends. In this final section, we look around the corners and think about the strategic considerations for building great AI products. 
We also examine key trade-offs teams will face, like when to build and when to buy, and suggest a “playbook” for early LLM application development strategy. 543 | 544 | ## No GPUs before PMF 545 | 546 | To be great, your product needs to be more than just a thin wrapper around somebody else’s API. But mistakes in the opposite direction can be even more costly. The past year has also seen a mint of venture capital, including an eye-watering six billion dollar Series A, spent on training and customizing models without a clear product vision or target market. In this section, we’ll explain why jumping immediately to training your own models is a mistake and consider the role of self-hosting. 547 | 548 | ### Training from scratch (almost) never makes sense 549 | 550 | For most organizations, pretraining an LLM from scratch is an impractical distraction from building products. 551 | 552 | As exciting as it is and as much as it seems like everyone else is doing it, developing and maintaining machine learning infrastructure takes a lot of resources. This includes gathering data, training and evaluating models, and deploying them. If you’re still validating product-market fit, these efforts will divert resources from developing your core product. Even if you had the compute, data, and technical chops, the pretrained LLM may become obsolete in months. 553 | 554 | Consider [BloombergGPT](https://arxiv.org/abs/2303.17564), an LLM specifically trained for financial tasks. The model was pretrained on 363B tokens via a heroic effort by [nine full-time employees](https://twimlai.com/podcast/twimlai/bloomberggpt-an-llm-for-finance/), four from AI Engineering and five from ML Product and Research. Despite this, it was [outclassed by gpt-3.5-turbo and gpt-4](https://arxiv.org/abs/2305.05862) on those very tasks within a year. 555 | 556 | This story and others like it suggest that for most practical applications, pretraining an LLM from scratch, even on domain-specific data, is not the best use of resources. Instead, teams are better off finetuning the strongest open-source models available for their specific needs. 557 | 558 | There are of course exceptions. One shining example is [Replit’s code model](https://blog.replit.com/replit-code-v1_5), trained specifically for code generation and understanding. With pretraining, Replit was able to outperform other models of larger sizes such as CodeLlama7b. But as other, increasingly capable models have been released, maintaining utility has required continued investment. 559 | 560 | ### Don’t finetune until you’ve proven it’s necessary 561 | 562 | For most organizations, finetuning is driven more by FOMO than by clear strategic thinking. 563 | 564 | Organizations invest in finetuning too early, trying to beat the “just another wrapper” allegations. In reality, finetuning is heavy machinery, to be deployed only after you’ve collected plenty of examples that convince you other approaches won’t suffice. 565 | 566 | A year ago, many teams were telling us they were excited to finetune. Few have found product-market fit and most regret their decision. If you’re going to finetune, you’d better be *really* confident that you’re set up to do it again and again as base models improve—see the [“The model isn’t the product"](#the-model-isnt-the-product-the-system-around-it-is) and [“Build LLMOps”](#build-llmops-but-build-it-for-the-right-reason-faster-iteration) below. 567 | 568 | When might finetuning actually be the right call? 
If the use case requires data not available in the mostly-open web-scale datasets used to train existing models—and if you’ve already built an MVP that demonstrates the existing models are insufficient. But be careful: if great training data isn’t readily available to the model builders, where are _you_ getting it? 569 | 570 | LLM-powered applications aren’t a science fair project. Investment in them should be commensurate with their contribution to your business’ strategic objectives and competitive differentiation. 571 | 572 | ### Start with inference APIs, but don’t be afraid of self-hosting 573 | 574 | With LLM APIs, it’s easier than ever for startups to adopt and integrate language modeling capabilities without training their own models from scratch. Providers like Anthropic, and OpenAI offer general APIs that can sprinkle intelligence into your product with just a few lines of code. By using these services, you can reduce the effort spent and instead focus on creating value for your customers—this allows you to validate ideas and iterate towards product-market fit faster. 575 | 576 | But, as with databases, managed services aren’t the right fit for every use case, especially as scale and requirements increase. Indeed, self-hosting may be the only way to use models without sending confidential / private data out of your network, as required in regulated industries like healthcare and finance, or by contractual obligations or confidentiality requirements. 577 | 578 | Furthermore, self-hosting circumvents limitations imposed by inference providers, like rate limits, model deprecations, and usage restrictions. In addition, self-hosting gives you complete control over the model, making it easier to construct a differentiated, high-quality system around it. Finally, self-hosting, especially of finetunes, can reduce cost at large scale. For example, [Buzzfeed shared how they finetuned open-source LLMs to reduce costs by 80%](https://tech.buzzfeed.com/lessons-learned-building-products-powered-by-generative-ai-7f6c23bff376#9da5). 579 | 580 | ## Iterate to something great 581 | 582 | To sustain a competitive edge in the long run, you need to think beyond models and consider what will set your product apart. While speed of execution matters, it shouldn’t be your only advantage. 583 | 584 | ### The model isn’t the product, the system around it is 585 | 586 | For teams that aren’t building models, the rapid pace of innovation is a boon as they migrate from one SOTA model to the next, chasing gains in context size, reasoning capability, and price-to-value to build better and better products. This progress is as exciting as it is predictable. Taken together, this means models are likely to be the least durable component in the system. 587 | 588 | Instead, focus your efforts on what's going to provide lasting value, such as: 589 | 590 | - Evals: To reliably measure performance on your task across models 591 | - Guardrails: To prevent undesired outputs no matter the model 592 | - Caching: To reduce latency and cost by avoiding the model altogether 593 | - Data flywheel: To power the iterative improvement of everything above 594 | 595 | These components create a thicker moat of product quality than raw model capabilities. 596 | 597 | But that doesn’t mean building at the application layer is risk-free. Don’t point your shears at the same yaks that OpenAI or other model providers will need to shave if they want to provide viable enterprise software. 
598 | 599 | For example, some teams invested in building custom tooling to validate structured output from proprietary models; minimal investment here is important, but a deep one is not a good use of time. OpenAI needs to ensure that when you ask for a function call, you get a valid function call—because all of their customers want this. Employ some “strategic procrastination” here, build what you absolutely need, and await the obvious expansions to capabilities from providers. 600 | 601 | ### Build trust by starting small 602 | 603 | Building a product that tries to be everything to everyone is a recipe for mediocrity. To create compelling products, companies need to specialize in building sticky experiences that keep users coming back. 604 | 605 | Consider a generic RAG system that aims to answer any question a user might ask. The lack of specialization means that the system can’t prioritize recent information, parse domain-specific formats, or understand the nuances of specific tasks. As a result, users are left with a shallow, unreliable experience that doesn’t meet their needs, leading to churn. 606 | 607 | To address this, focus on specific domains and use cases. Narrow the scope by going deep rather than wide. This will create domain-specific tools that resonate with users. Specialization also allows you to be upfront about your system’s capabilities and limitations. Being transparent about what your system can and cannot do demonstrates self-awareness, helps users understand where it can add the most value, and thus builds trust and confidence in the output. 608 | 609 | ### Build LLMOps, but build it for the right reason: faster iteration 610 | 611 | DevOps is not fundamentally about reproducible workflows or shifting left or empowering two pizza teams—and it’s definitely not about writing YAML files. 612 | 613 | DevOps is about shortening the feedback cycles between work and its outcomes so that improvements accumulate instead of errors. Its roots go back, via the Lean Startup movement, to Lean Manufacturing and the Toyota Production System, with its emphasis on Single Minute Exchange of Die and Kaizen. 614 | 615 | MLOps has adapted the form of DevOps to ML. We have reproducible experiments and we have all-in-one suites that empower model builders to ship. And Lordy, do we have YAML files. 616 | 617 | But as an industry, MLOps didn’t adopt the function of DevOps. It didn’t shorten the feedback gap between models and their inferences and interactions in production. 618 | 619 | Hearteningly, the field of LLMOps has shifted away from thinking about hobgoblins of little minds like prompt management and towards the hard problems that block iteration: production monitoring and continual improvement, linked by evaluation. 620 | 621 | Already, we have interactive arenas for neutral, crowd-sourced evaluation of chat and coding models – an outer loop of collective, iterative improvement. Tools like LangSmith, Log10, LangFuse, W&B Weave, HoneyHive, and more promise to not only collect and collate data about system outcomes in production, but also to leverage them to improve those systems by integrating deeply with development. Embrace these tools or build your own. 622 | 623 | ### Don’t Build LLM Features You Can Buy 624 | 625 | Most successful businesses are not LLM businesses. Simultaneously, most businesses have opportunities to be improved by LLMs. 
626 | 627 | This pair of observations often misleads leaders into hastily retrofitting systems with LLMs at increased cost and decreased quality and releasing them as ersatz, vanity “AI” features, complete with the [now-dreaded sparkle icon](https://x.com/nearcyan/status/1783351706031718412). 628 | There’s a better way: focus on LLM applications that truly align with your product goals and enhance your core operations. 629 | 630 | Consider a few misguided ventures that waste your team’s time: 631 | 632 | - Building custom text-to-SQL capabilities for your business. 633 | - Building a chatbot to talk to your documentation. 634 | - Integrating your company’s knowledge base with your customer support chatbot. 635 | 636 | While the above are the hello-worlds of LLM applications, none of them make sense for a product company to build themselves. These are general problems for many businesses with a large gap between promising demo and dependable component—the customary domain of software companies. Investing valuable R&D resources on general problems being tackled en masse by the current Y Combinator batch is a waste. 637 | 638 | If this sounds like trite business advice, it’s because in the frothy excitement of the current hype wave, it’s easy to mistake anything “LLM” for cutting-edge, accretive differentiation, missing which applications are already old hat. 639 | 640 | ### AI in the loop; Humans at the center 641 | 642 | Right now, LLM-powered applications are brittle. They require an incredible amount of safeguarding and defensive engineering, yet they remain hard to predict. Nonetheless, when tightly scoped, these applications can be wildly useful. This means that LLMs make excellent tools to accelerate user workflows. 643 | 644 | While it may be tempting to imagine LLM-based applications fully replacing a workflow, or standing in for a job function, today the most effective paradigm is a human-computer centaur ([Centaur chess](https://en.wikipedia.org/wiki/Advanced_chess)). When capable humans are paired with LLM capabilities tuned for their rapid utilization, productivity and happiness doing tasks can be massively increased. One of the flagship applications of LLMs, GitHub Copilot, demonstrated the power of these workflows: 645 | 646 | > “Overall, developers told us they felt more confident because coding is easier, more error-free, more readable, more reusable, more concise, more maintainable, and more resilient with GitHub Copilot and GitHub Copilot Chat than when they’re coding without it.” - [Mario Rodriguez, GitHub](https://resources.github.com/learn/pathways/copilot/essentials/measuring-the-impact-of-github-copilot/) 647 | 648 | For those who have worked in ML for a long time, you may jump to the idea of “human-in-the-loop”, but not so fast: HITL machine learning is a paradigm built on human experts ensuring that ML models behave as predicted. While related, here we are proposing something more subtle. LLM-driven systems should not be the primary drivers of most workflows today; they should merely be a resource. 649 | 650 | By centering humans and asking how an LLM can support their workflow, we arrive at significantly different product and design decisions. Ultimately, it will drive you to build different products than competitors who try to rapidly offshore all responsibility to LLMs: better, more useful, and less risky products. 651 | 652 | ## Start with prompting, evals, and data collection 653 | 654 | The previous sections have delivered a firehose of techniques and advice. 
It’s a lot to take in. Let’s consider the minimum useful set of advice: if a team wants to build an LLM product, where should they begin? 655 | 656 | Over the past year, we’ve seen enough to be confident that successful LLM applications follow a consistent trajectory. We walk through this basic “getting started” playbook in this section. The core idea is to start simple and only add complexity as needed. A decent rule of thumb is that each level of sophistication typically requires at least an order of magnitude more effort than the one before it. With this in mind… 657 | 658 | ### Prompt engineering comes first 659 | 660 | Start with prompt engineering. Use all the techniques we discussed in the tactics section before. Chain-of-thought, n-shot examples, and structured input and output are almost always a good idea. Prototype with the most highly capable models before trying to squeeze performance out of weaker models. 661 | 662 | Only if prompt engineering cannot achieve the desired level of performance should you consider finetuning. This will come up more often if there are non-functional requirements (e.g., data privacy, complete control, cost) that block the use of proprietary models and thus require you to self-host. Just make sure those same privacy requirements don’t block you from using user data for finetuning! 663 | 664 | ### Build evals and kickstart a data flywheel 665 | 666 | Even teams that are just getting started need evals. Otherwise, you won’t know whether your prompt engineering is sufficient or when your finetuned model is ready to replace the base model. 667 | 668 | Effective evals are [specific to your tasks](https://twitter.com/thesephist/status/1707839140018974776) and mirror the intended use cases. The first level of evals that we [recommend](https://hamel.dev/blog/posts/evals/) is unit testing. These simple assertions detect known or hypothesized failure modes and help drive early design decisions. Also see other [task-specific evals](https://eugeneyan.com/writing/evals/) for classification, summarization, etc. 669 | 670 | While unit tests and model-based evaluations are useful, they don’t replace the need for human evaluation. Have people use your model/product and provide feedback. This serves the dual purpose of measuring real-world performance and defect rates while also collecting high-quality annotated data that can be used to finetune future models. This creates a positive feedback loop, or data flywheel, which compounds over time: 671 | 672 | - Human evaluation to assess model performance and/or find defects 673 | - Use the annotated data to finetune the model or update the prompt 674 | - Repeat 675 | 676 | For example, when auditing LLM-generated summaries for defects we might label each sentence with fine-grained feedback identifying factual inconsistency, irrelevance, or poor style. We can then use these factual inconsistency annotations to [train a hallucination classifier](https://eugeneyan.com/writing/finetuning/) or use the relevance annotations to train a [relevance-reward model](https://arxiv.org/abs/2009.01325). As another example, LinkedIn shared about their success with using [model-based evaluators](https://www.linkedin.com/blog/engineering/generative-ai/musings-on-building-a-generative-ai-product) to estimate hallucinations, responsible AI violations, coherence, etc. 
in their write-up. 677 | 678 | By creating assets that compound their value over time, we upgrade building evals from a purely operational expense to a strategic investment, and build our data flywheel in the process. 679 | 680 | ## The high-level trend of low-cost cognition 681 | 682 | In 1971, the researchers at Xerox PARC predicted the future: the world of networked personal computers that we are now living in. They helped birth that future by playing pivotal roles in the invention of the technologies that made it possible, from Ethernet and graphics rendering to the mouse and the window. 683 | 684 | But they also engaged in a simple exercise: they looked at applications that were very useful (e.g., video displays) but were not yet economical (i.e., enough RAM to drive a video display cost many thousands of dollars). Then they looked at historic price trends for that technology (a la Moore’s Law) and predicted when those technologies would become economical. 685 | 686 | We can do the same for LLM technologies, even though we don’t have something quite as clean as transistors per dollar to work with. Take a popular, long-standing benchmark, like the Massive Multitask Language Understanding (MMLU) dataset, and a consistent input approach (five-shot prompting). Then, compare the cost to run language models with various performance levels on this benchmark over time. 687 | 688 | ![](images/models-prices.png) 689 | Figure. For a fixed cost, capabilities are rapidly increasing. For a fixed capability level, costs are rapidly decreasing. Created by co-author Charles Frye using public data on May 13, 2024. 690 | 691 | In the four years since the launch of OpenAI’s davinci model as an API, the cost of running a model with equivalent performance on that task at the scale of one million tokens (about one hundred copies of this document) has dropped from $20 to less than 10¢ – a halving time of just six months. Similarly, the cost to run Meta’s LLaMA 3 8B, via an API provider or on your own, is just 20¢ per million tokens as of May of 2024, and it has similar performance to OpenAI’s text-davinci-003, the model that enabled ChatGPT. That model also cost about $20 per million tokens when it was released in late November of 2022. That’s two orders of magnitude in just 18 months – the same timeframe in which Moore’s Law predicts a mere doubling. 692 | 693 | Now, let’s consider an application of LLMs that is very useful (powering generative video game characters, a la [Park et al](https://arxiv.org/abs/2304.03442)) but is not yet economical (their cost was estimated at $625 per hour [here](https://arxiv.org/abs/2310.02172)). Since that paper was published in August of 2023, the cost has dropped roughly one order of magnitude, to $62.50 per hour. We might expect it to drop to $6.25 per hour in another nine months. 694 | 695 | Meanwhile, when Pac-Man was released in 1980, $1 of today’s money would buy you a credit, good to play for a few minutes or tens of minutes – call it six games per hour, or $6 per hour. This napkin math suggests that a compelling LLM-enhanced gaming experience will become economical sometime in 2025. 696 | 697 | These trends are new, only a few years old. But there is little reason to expect this process to slow down in the next few years. Even as we perhaps use up low-hanging fruit in algorithms and datasets, like scaling past the “Chinchilla ratio” of ~20 tokens per parameter, deeper innovations and investments inside the data center and at the silicon layer promise to pick up the slack. 
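To make the napkin math above concrete, here is a minimal sketch in Python. The $625/hour starting point, the ~$6/hour Pac-Man bar, and the roughly nine-month order-of-magnitude drop are the estimates quoted in this section; the projection loop itself is only an illustration under those assumptions, not a forecast.

```python
# Napkin math: project when LLM-powered game characters (estimated at
# $625/hour in August 2023) fall below the ~$6/hour Pac-Man benchmark,
# assuming cost drops one order of magnitude roughly every nine months.
START_COST_PER_HOUR = 625.0   # Park et al. cost estimate, August 2023
TARGET_COST_PER_HOUR = 6.0    # inflation-adjusted Pac-Man credits per hour
MONTHS_PER_10X_DROP = 9       # assumed rate, matching the drop seen by May 2024

cost = START_COST_PER_HOUR
months = 0
while cost > TARGET_COST_PER_HOUR:
    cost /= 10
    months += MONTHS_PER_10X_DROP
    print(f"+{months} months: ~${cost:,.2f}/hour")

# Prints:
# +9 months: ~$62.50/hour
# +18 months: ~$6.25/hour
# +27 months: ~$0.62/hour
```

Under these assumptions, the cost crosses the Pac-Man threshold between the 18- and 27-month marks, i.e., sometime in 2025, consistent with the estimate above.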
698 | 699 | And this is perhaps the most important strategic fact: what is a completely infeasible floor demo or research paper today will become a premium feature in a few years and then a commodity shortly after. We should build our systems, and our organizations, with this in mind. 700 | 701 | # Enough 0 to 1 demos, it’s time for 1 to N products 702 | 703 | We get it, building LLM demos is a ton of fun. With just a few lines of code, a vector database, and a carefully crafted prompt, we create ✨magic✨. And in the past year, this magic has been compared to the internet, the smartphone, and even the printing press. 704 | 705 | Unfortunately, as anyone who has worked on shipping real-world software knows, there's a world of difference between a demo that works in a controlled setting and a product that operates reliably at scale. 706 | 707 | > There's a large class of problems that are easy to imagine and build demos for, but extremely hard to make products out of. For example, self-driving: It's easy to demo a car self-driving around a block; making it into a product takes a decade. - [Andrej Karpathy](https://x.com/eugeneyan/status/1672692174704766976) 708 | 709 | Take, for example, self-driving cars. The first car was driven by a neural network in [1988](https://proceedings.neurips.cc/paper/1988/file/812b4ba287f5ee0bc9d43bbf5bbe87fb-Paper.pdf). Twenty-five years later, Andrej Karpathy [took his first demo ride in a Waymo](https://x.com/karpathy/status/1689819017610227712). A decade after that, the company received its [driverless permit](https://x.com/Waymo/status/1689809230293819392). That's thirty-five years of rigorous engineering, testing, refinement, and regulatory navigation to go from prototype to commercial product. 710 | 711 | Across industry and academia, we've observed the ups and downs over the past year: Year 1 of N for LLM applications. We hope that the lessons we've learned—from [tactics](#tactical-nuts-bolts-of-working-with-llms) like evals, prompt engineering, and guardrails, to [operational](#operational-day-to-day-and-org-concerns) techniques and building teams to [strategic](#strategy-building-with-llms-without-getting-out-maneuvered) perspectives like which capabilities to build internally—help you in year 2 and beyond, as we all build on this exciting new technology together. 712 | 713 | --- 714 | 715 | # Stay In Touch 716 | 717 | If you found this useful and want updates on write-ups, courses, and activities, subscribe below. 718 | 719 | 720 | 721 | You can also find our individual contact information on our [about page](about.qmd). 722 | 723 | # Acknowledgements 724 | 725 | This series started as a convo in a group chat, where Bryan quipped that he was inspired to write “A Year of AI Engineering”. Then, ✨magic✨ happened, and we all pitched in to share what we’ve learned so far. 726 | 727 | The authors would like to thank Eugene for leading the bulk of the document integration and overall structure, in addition to a large proportion of the lessons, as well as for primary editing responsibilities and document direction. The authors would like to thank Bryan for the spark that led to this write-up, for restructuring it into tactical, operational, and strategic sections and their intros, and for pushing us to think bigger on how we could reach and help the community.
The authors would like to thank Charles for his deep dives on cost and LLMOps, as well as for weaving the lessons together to make them more coherent and tighter—you have him to thank for this being 30 instead of 40 pages! The authors thank Hamel and Jason for their insights from advising clients and being on the front lines, for their broad, generalizable learnings from clients, and for their deep knowledge of tools. And finally, thank you Shreya for reminding us of the importance of evals and rigorous production practices, and for bringing her research and original results. 728 | 729 | Finally, we would like to thank all the teams who so generously shared their challenges and lessons in their own write-ups, which we’ve referenced throughout this series, along with the AI communities for their vibrant participation and engagement with this group. 730 | 731 | ## About the authors 732 | 733 | See the [about page](about.qmd) for more information on the authors. 734 | 735 | If you found this useful, please cite this write-up as: 736 | 737 | > Yan, Eugene, Bryan Bischof, Charles Frye, Hamel Husain, Jason Liu, and Shreya Shankar. 2024. ‘Applied LLMs - What We’ve Learned From A Year of Building with LLMs’. Applied LLMs. 8 June 2024. https://applied-llms.org/. 738 | 739 | or 740 | 741 | ```default 742 | @article{AppliedLLMs2024, 743 | title = {What We've Learned From A Year of Building with LLMs}, 744 | author = {Yan, Eugene and Bischof, Bryan and Frye, Charles and Husain, Hamel and Liu, Jason and Shankar, Shreya}, 745 | journal = {Applied LLMs}, 746 | year = {2024}, 747 | month = {Jun}, 748 | url = {https://applied-llms.org/} 749 | } 750 | ``` 751 | 752 | ## Related Posts 753 | 754 | - This work was published on O'Reilly Media in three parts: [Tactical](https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-i/){target="_blank"}, [Operational](https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-ii/){target="_blank"}, [Strategic](https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-iii-strategy/){target="_blank"} ([podcast](https://lu.ma/e8huz3s6){target="_blank"}). 755 | 756 | - This article was translated into [Japanese](https://zenn.dev/seya/articles/12c67b5d80670a){target="_blank"} and Chinese ([Parts 1](https://iangyan.github.io/2024/09/08/building-with-llms-part-1/){target="_blank"}, [2](https://iangyan.github.io/2024/10/05/building-with-llms-part-2/){target="_blank"}, [3](https://iangyan.github.io/2024/10/06/building-with-llms-part-3/){target="_blank"}). -------------------------------------------------------------------------------- /jobs.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | pagetitle: "Applied LLMs Job Board" 3 | description: "Quality opportunities for execs/senior ICs, with a 10-year post-termination expiration period, vetted by the team."
4 | image: /images/jobs_seo.png 5 | --- 6 | 7 | ::: {.column-screen} 8 | 9 | ::: -------------------------------------------------------------------------------- /services.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: Services 3 | description: "Paid consulting services we offer" 4 | metadata-files: 5 | - "_subscribe.yml" 6 | --- 7 | 8 | ## Problems We Solve 9 | 10 | You should consider hiring us if: 11 | 12 | - **You feel lost on what to prioritize and what experiments you should run.** 13 | - _Goal: get to an MVP quickly._ 14 | - **You need your team to get quickly upskilled on AI, and vet potential hires.** 15 | - _Goal: become self-sufficient in months._ 16 | - **You are hitting a plateau in making your AI better.** 17 | - _Goal: processes to consistently measure & improve your AI._ 18 | - **Your LLMs need to be cheaper or faster.** 19 | - _Goal: tools that help you make the best tradeoffs between performance, cost and latency._ 20 | - **You are overwhelmed by tools & frameworks.** 21 | - _Goal: curated tools and infrastructure specific to your use case._ 22 | 23 | ## Services 24 | 25 | We offer three types of services: 26 | 27 | ### 1. Strategic Consulting 28 | 29 | We will save your engineering team time and money by steering them away from common pitfalls, selecting the best tools, and walking them through an approach to building AI products. Furthermore, we will introduce you to talent, vendors and partners to accelerate your goals. 30 | 31 | ### 2. Comprehensive Consulting 32 | 33 | Everything in strategic consulting, plus we build and deploy components of AI products for you. We will work with your team to understand your needs, build a roadmap, and execute on it. 34 | 35 | ### 3. Workshops For Enterprise Teams 36 | 37 | We offer 3-day workshops to help your team get up to speed on the fundamentals of AI such as RAG, Evals, Fine-Tuning, and more. 38 | 39 | ## Pricing 40 | 41 | The cost of our services depends on the scope of the project. However, the following are typical prices for our services: 42 | 43 | - Strategic Consulting: $183,600 for a 3-month engagement. 44 | - 2-day Workshop: $125,800 45 | - Comprehensive Consulting: $975,000 total, for an engagement that will last a minimum of 3 months. 46 | 47 | All of our work has a money-back guarantee[^2]. To get started, please [fill out this quick form](https://llms.typeform.com/to/M8f89mlM). 48 | 49 | ## Why Us 50 | 51 | We have a [track record](about.qmd) of building and deploying AI products in a variety of industries. Our services are led by Hamel and Jason, with access to experts in the field including open-source maintainers and operators who are building with AI. 52 | 53 | To get in touch, please fill out [this form](https://llms.typeform.com/to/M8f89mlM). If you have questions, you can also reach out to us at `consulting@applied-llms.org` 54 | 55 | [^2]: The quality of our work is guaranteed. If you do not believe we have met mutually established objectives, we will continue to work toward those goals with you for no additional fee. If, after such an additional attempt, you still believe we have not met your objectives, we will refund your fees in total. 
-------------------------------------------------------------------------------- /styles.css: -------------------------------------------------------------------------------- 1 | /* css styles */ 2 | 3 | .margin-subscribe { 4 | border-style: solid; 5 | border-color: #8b8b91 !important; 6 | } 7 | 8 | div[data-style="clean"] { 9 | /* Your CSS styles here */ 10 | padding: 10px!important; 11 | } 12 | 13 | .img-fluid { 14 | margin-right: 5px; 15 | } -------------------------------------------------------------------------------- /talks/_evals/index.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: Evals 3 | search: false 4 | description: Measuring the performance of LLM products 5 | metadata-files: ["../_listing_meta.yml"] 6 | listing: 7 | contents: 8 | - "**index.qmd" 9 | - "/*.qmd" 10 | --- -------------------------------------------------------------------------------- /talks/_fine_tuning/index.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: Fine-Tuning 3 | search: false 4 | description: Fine-tuning LLMs for specific tasks 5 | metadata-files: ["../_listing_meta.yml"] 6 | listing: 7 | contents: 8 | - "**index.qmd" 9 | - "/*.qmd" 10 | --- -------------------------------------------------------------------------------- /talks/_inference/index.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: LLM Inference 3 | search: false 4 | description: Serving and deploying LLMs 5 | metadata-files: ["../_listing_meta.yml"] 6 | listing: 7 | contents: 8 | - "**index.qmd" 9 | - "/*.qmd" 10 | --- -------------------------------------------------------------------------------- /talks/_listing_meta.yml: -------------------------------------------------------------------------------- 1 | search: false 2 | image: "https://user-images.githubusercontent.com/1483922/208359222-2b7e938e-27c4-4556-aacb-f5a81ce77b2d.png" 3 | listing: 4 | fields: [title, Speaker, Venue, date, description] 5 | type: table 6 | sort-ui: [title, Speaker, Venue, date, description] 7 | filter-ui: [title, Speaker, Venue, date, description] 8 | max-description-length: 400 -------------------------------------------------------------------------------- /talks/_listing_root.yml: -------------------------------------------------------------------------------- 1 | search: false 2 | image: "https://user-images.githubusercontent.com/1483922/208359222-2b7e938e-27c4-4556-aacb-f5a81ce77b2d.png" 3 | listing: 4 | fields: [title, description] 5 | type: table 6 | sort-ui: false 7 | filter-ui: false 8 | max-description-length: 200 -------------------------------------------------------------------------------- /talks/_prompt_eng/index.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: Prompt Engineering 3 | search: false 4 | description: Effective prompt engineering techniques for LLMs 5 | metadata-files: ["../_listing_meta.yml"] 6 | listing: 7 | contents: 8 | - "**index.qmd" 9 | - "/*.qmd" 10 | --- -------------------------------------------------------------------------------- /talks/_rag/index.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: RAG 3 | search: false 4 | description: Retrieval Augmented Generation 5 | metadata-files: ["../_listing_meta.yml"] 6 | listing: 7 | contents: 8 | - "**index.qmd" 9 | - "/*.qmd" 10 | --- 
-------------------------------------------------------------------------------- /talks/applications/index.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: Building Applications 3 | search: false 4 | description: Examples and best practices of building applications with LLMs 5 | metadata-files: ["../_listing_meta.yml"] 6 | listing: 7 | contents: 8 | - "**index.qmd" 9 | - "/*.qmd" 10 | --- -------------------------------------------------------------------------------- /talks/applications/simon_llm_cli/index.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: LLMs on the command line 3 | date: 2024-06-11 4 | Speaker: Simon Willison 5 | Venue: Mastering LLMs Conf 6 | metadata-files: 7 | - "../../../_subscribe.yml" 8 | abstract: | 9 | The Unix command-line philosophy has always been about joining different tools together to solve larger problems. LLMs are a fantastic addition to this environment: they can help wire together different tools and can directly participate themselves, processing text as part of more complex pipelines. Simon shows how to bring these worlds together, using LLMs - both remote and local - to unlock powerful tools and solve all kinds of interesting productivity and automation problems. 10 | 11 | He'll cover LLM, his plugin-based command-line toolkit for accessing over a hundred different LLMs - and show how it can be used to automate prompts and compare them across different models. 12 | 13 | LLM logs prompts and responses to SQLite, which can then be explored and analyzed further using Datasette. Other CLI tools covered will include ollama, llamafile, shot-scraper, files-to-prompt, ttok and LLM's support for calculating, storing and querying embeddings. 14 | categories: ["applications", "llm-conf-2024"] 15 | --- 16 | 17 | 18 | {{< video https://www.youtube.com/watch?v=QUXQNi6jQ30 >}} 19 | 20 | ## Chapters 21 | 22 | **[00:00](https://www.youtube.com/watch?v=QUXQNi6jQ30&t=0s) Introduction** 23 | 24 | Simon Willison introduces LLM - a command line tool for interacting with large language models. 25 | 26 | **[01:40](https://www.youtube.com/watch?v=QUXQNi6jQ30&t=100s) Installing and Using LLM** 27 | 28 | Simon demonstrates how to install LLM using pip or Homebrew and run prompts against OpenAI's API. He showcases features like continuing conversations and changing default models. 29 | 30 | **[10:30](https://www.youtube.com/watch?v=QUXQNi6jQ30&t=630s) LLM Plugins** 31 | 32 | The LLM tool has a plugin system that allows access to various remote APIs and local models. Simon installs the Claude plugin and discusses why he considers Claude models his current favorites. 33 | 34 | **[13:14](https://www.youtube.com/watch?v=QUXQNi6jQ30&t=794s) Local Models with LLM** 35 | 36 | Simon explores running local language models using plugins for tools like GPT4All and llama.cpp. He demonstrates the llm chat command for efficient interaction with local models. 37 | 38 | **[26:16](https://www.youtube.com/watch?v=QUXQNi6jQ30&t=1576s) Writing Bash Scripts with LLM** 39 | 40 | A practical example of creating a script to summarize Hacker News threads.
41 | 42 | **[35:01](https://www.youtube.com/watch?v=QUXQNi6jQ30&t=2101s) Piping and Automating with LLM** 43 | 44 | By piping commands and outputs, Simon shows how to automate tasks like summarizing Hacker News threads or generating Bash commands using LLM and custom scripts. 45 | 46 | **[37:08](https://www.youtube.com/watch?v=QUXQNi6jQ30&t=2228s) Web Scraping and LLM** 47 | 48 | Simon introduces ShotScraper, a tool for browser automation and web scraping. He demonstrates how to pipe scraped data into LLM for retrieval augmented generation (RAG). 49 | 50 | **[41:13](https://www.youtube.com/watch?v=QUXQNi6jQ30&t=2473s) Embeddings with LLM** 51 | 52 | 53 | LLM has built-in support for embeddings through various plugins. Simon calculates embeddings for his blog content and performs semantic searches, showcasing how to build RAG workflows using LLM. 54 | 55 | 56 | ## Notes 57 | 58 | :::{.callout-note} 59 | These notes were originally published by Simon Willison [here](https://github.com/simonw/language-models-on-the-command-line/blob/main/README.md) 60 | ::: 61 | 62 | Notes for a talk I gave at [Mastering LLMs: A Conference For Developers & Data Scientists](https://maven.com/parlance-labs/fine-tuning). 63 | 64 | ### Links 65 | 66 | - [Datasette](https://datasette.io/) 67 | - [My blog](https://simonwillison.net/) 68 | - [LLM](https://llm.datasette.io/en/stable/) 69 | 70 | ### Getting started 71 | 72 | ```bash 73 | brew install llm # or pipx or pip 74 | llm keys set openai 75 | # paste key here 76 | llm "Say hello in Spanish" 77 | ``` 78 | ### Installing Claude 3 79 | ```bash 80 | llm install llm-claude-3 81 | llm keys set claude 82 | # Paste key here 83 | llm -m haiku 'Say hello from Claude Haiku' 84 | ``` 85 | 86 | ### Local model with llm-gpt4all 87 | 88 | ```bash 89 | llm install llm-gpt4all 90 | llm models 91 | llm chat -m mistral-7b-instruct-v0 92 | ``` 93 | ### Browsing logs with Datasette 94 | 95 | https://datasette.io/ 96 | 97 | ```bash 98 | pipx install datasette # or brew or pip 99 | datasette "$(llm logs path)" 100 | # Browse at http://127.0.0.1:8001/ 101 | ``` 102 | #### Templates 103 | ```bash 104 | llm --system 'You are a sentient cheesecake' -m gpt-4o --save cheesecake 105 | ``` 106 | Now you can chat with a cheesecake: 107 | ```bash 108 | llm chat -t cheesecake 109 | ``` 110 | 111 | More plugins: https://llm.datasette.io/en/stable/plugins/directory.html 112 | 113 | #### llm-cmd 114 | 115 | Help with shell commands. 
Blog entry is here: https://simonwillison.net/2024/Mar/26/llm-cmd/ 116 | 117 | #### files-to-prompt and shot-scraper 118 | 119 | `files-to-prompt` is described here: 120 | https://simonwillison.net/2024/Apr/8/files-to-prompt/ 121 | 122 | `shot-scraper javascript` documentation: https://shot-scraper.datasette.io/en/stable/javascript.html 123 | 124 | JSON output for Google search results: 125 | 126 | ```bash 127 | shot-scraper javascript 'https://www.google.com/search?q=nytimes+slop' ' 128 | Array.from( 129 | document.querySelectorAll("h3"), 130 | el => ({href: el.parentNode.href, title: el.innerText}) 131 | )' 132 | ``` 133 | This version gets the HTML that includes the snippet summaries, then pipes it to LLM to answer a question: 134 | ```bash 135 | shot-scraper javascript 'https://www.google.com/search?q=nytimes+slop' ' 136 | () => { 137 | function findParentWithHveid(element) { 138 | while (element && !element.hasAttribute("data-hveid")) { 139 | element = element.parentElement; 140 | } 141 | return element; 142 | } 143 | return Array.from( 144 | document.querySelectorAll("h3"), 145 | el => findParentWithHveid(el).innerText 146 | ); 147 | }' | llm -s 'describe slop' 148 | ``` 149 | ### Hacker news summary 150 | 151 | https://til.simonwillison.net/llms/claude-hacker-news-themes describes my Hacker News summary script in detail. 152 | 153 | ### Embeddings 154 | 155 | Full documentation: https://llm.datasette.io/en/stable/embeddings/index.html 156 | 157 | I ran this: 158 | 159 | ```bash 160 | curl -O https://datasette.simonwillison.net/simonwillisonblog.db 161 | llm embed-multi links \ 162 | -d simonwillisonblog.db \ 163 | --sql 'select id, link_url, link_title, commentary from blog_blogmark' \ 164 | -m 3-small --store 165 | ``` 166 | Then looked for items most similar to a string like this: 167 | ```bash 168 | llm similar links \ 169 | -d simonwillisonblog.db \ 170 | -c 'things that make me angry' 171 | ``` 172 | 173 | ### More links 174 | 175 | - [Coping strategies for the serial project hoarder](https://simonwillison.net/2022/Nov/26/productivity/) talk about personal productivity on different projects 176 | - [Figure out how to serve an AWS Lambda function with a Function URL from a custom subdomain](https://github.com/simonw/public-notes/issues/1) as an example of how I use GitHub Issues 177 | 178 | --- 179 | 180 | ## Full Transcript 181 | 182 | ::: {.callout-tip collapse="true"} 183 | ## Expand to see transcript 184 | 185 |
[0:00] Simon Willison: Hey, hey everyone, it's great to be here. So yeah, the talk today, it's about command line tools and large language models. And effectively the argument I want to make is that the Unix command line dating back probably 50 years now is it turns out the perfect environment to play around with this new cutting edge technology, because the Unix philosophy has always been about tools that output things that get piped into other tools as input. And that's really what a language model is, right? 186 |
[0:27] Simon Willison: An LLM is a, it's effectively a function that you pipe a prompt to, and then you get a response back out, or you pipe a big chunk of context to, and you get a response that you can do things with. So I realized this last year and also realized that nobody had grabbed the namespace on PyPI, the Python Packaging Index, for the term LLM. So I leapt at that. I was like, okay, this is an opportunity to grab a really cool name for something. And I built this… 187 |
[0:54] Simon Willison: little tool, which originally was just a command line tool for talking to OpenAI. So you could be in your terminal and you could type LLM, say hi in French, and that would fire it through the OpenAI API and get back a response and print it to your terminal. That was all it did. And then over time, as other model providers became interesting, and as local models emerged that you could run on your computer, I realized there was an opportunity to have this tool do way more than that. 188 |
[1:24] Simon Willison: So I started adding plugin support to it so you can install plugins that give you access to Claude and local Mistral and all sorts of other models. There's hundreds of models that you can access through this tool now. I'll dive into that in a little bit more detail in a moment. First thing you need to know is how to install it. If you are Python people, that's great. Pip install LLM works. I recommend using pipx to install it because then the dependencies end up packaged away somewhere nice. Or you can install it using Homebrew. 189 |
[1:54] Simon Willison: I think the one I've got here, yeah, this one I installed with brew install LLM. I made the mistake of running that command about half an hour ago and of course it's homebrew so it took half an hour to install everything. So Treat with caution. But it works. And once it's installed, you have the command. And so when you start using LLM, the default is for it to talk to OpenAI. And of course, you need an OpenAI API key for that. So I'm going to grab my API key. There's a command you can run. 190 |
[2:25] Simon Willison: LLM secrets. Is it LLM secrets? 191 |
[2:30] Hugo Bowne-Anderson: Yes. 192 |
[2:32] Simon Willison: Yes. No, it's not. It's. What is it? LLM keys. That's it. LLM. So I can type LLM keys, set OpenAI, and then I paste in the key, and I'm done. And having done this, I've actually got all sorts of keys in here, but my OpenAI API key has now been made available. And that means I can run prompts. Five great names for a pet pelican. This is my favorite test prompt. And it gives me five great names for a pet pelican. So that's fun. And that's running over the API. 193 |
[3:05] Simon Willison: And because it's a Unix command line thing, you can do stuff with the app. So you can do things like write that to pelicans.txt. The greater than sign means take the output and run it to a file. Now I've got a nice permanent pelicans.txt file with my five distinctive names for a pet pelican. Another thing you can do is you can continue. If you say dash C. which stands for continue, I can say now do walruses. And it will continue that same conversation, say here are five fitting names for a pet walrus. 194 |
[3:35] Simon Willison: I'm going to say justify those. Oops. And now it says why each of these names are justified for that walrus. That's super, super, super delightful. 195 |
[3:48] Hugo Bowne-Anderson: I like Gustav. 196 |
[3:50] Simon Willison: Gustav, what was, let's do LLM logs dash. Gustav, a touch of personality and grandeur, strong regal name that suits the impressive size and stature of a walrus, evoking a sense of dignity. It's good. The justifications are quite good. This is GPT-4o I'm running here. You can actually say LLM models default to see what the default model is, and then you can change that as well. So if I want to… be a bit cheaper, I can set it to ChatGPT. 197 |
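For reference, the continuation and default-model workflow being demonstrated above looks roughly like this. This is a sketch: the `gpt-4o` model ID is only an example, so run `llm models` to see what your own install offers.

```bash
# Save a response to a file, then continue the same conversation with -c
llm 'Five great names for a pet pelican' > pelicans.txt
llm -c 'Now do walruses'
llm -c 'Justify those'

# Show the current default model, then switch it
llm models default
llm models default gpt-4o
```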
[4:24] Simon Willison: And now when I do this, oops, these are the GPT 3.5 names, which are slightly less exciting. So that's all good and fun. But there are a whole bunch of other useful things you can do when you start working with these things in the terminal. So I'm going to… Let's grab another model. So LLM, as I mentioned, has plugins. If you go to the LLM website, look at list of plugins, there is a plugin directory. This is all of the plugins that are currently available for the tool. And most of these are remote APIs. 198 |
[5:06] Simon Willison: These are plugins for talking to Claude or Reka or Perplexity or Anyscale Endpoints, all of these different providers. And then there are also some local model plugins that we'll jump into in a moment. But let's grab Claude 3. Claude 3 is my current favorite, my favorite family of models. So I can say LLM install LLM Claude 3, and it will go ahead and install that plugin. If I type LLM plugins, it shows me the plugins that it has installed. Oh, I didn't mean to install LLM Claude, but never mind. And if I see LLM… 199 |
[5:39] Hugo Bowne-Anderson: Is Claude 3 currently your favorite? Just out of interest? 200 |
[5:41] Simon Willison: Two reasons. One, Claude 3 Haiku is incredible, right? Claude 3 Haiku is cheaper than GPT 3.5. I think it's the cheapest decent model out there. It's better than 3.5. It has the 100,000 token limit. So you can dump a huge amount of stuff on there. And it can do images. So we've got, I think it's the most exciting model that we have right now if you're actually building because for the price, you get an enormous array of capabilities. And then Opus, I think Opus was better than GPT-4 Turbo. 201 |
[6:15] Simon Willison: I think GPT-4o is just about caught up for the kind of stuff that I do. But Opus is still like a really interesting model. The other thing I'll say about Claude is the… There's this amazing article that just came up about Claude's personality, which talks about how they gave Claude a personality. And this is one of the most interesting essays I have read about large language models in months. Like, what they did to get Claude to behave the way it does and the way they thought about it is super fascinating. 202 |
[6:45] Simon Willison: But anyway, so I've installed Claude. I've now got Claude 3 Opus, Claude 3 Sonnet and Claude 3 Haiku. LLM gives everything long names, so you can say that, say hi in Spanish, and it'll say hola. If you say it with a flourish, it'll add an emoji. That's cute. Hola mi amigo. But you can also say llm -m and just the word haiku because that's set up as a shorter alias. So if I do that, I'll get the exact same response. Crucially, you can… 203 |
[7:22] Simon Willison: This is how I spend most of my time when I'm messing around with models. install plugins for them, or often I'll write a new plugin because it doesn't take much effort to write new plugins for this tool. And then I can start mucking around with them in the terminal, trying out different things against them. Crucially, one of the key features of this tool is that it logs absolutely everything that you do with it. It logs that to a SQLite database. 204 |
[7:47] Simon Willison: So if I type llm logs path, it will show me the path to the SQLite database that it's using. And I absolutely adore SQLite databases, partly because my main project, the thing I spend most of my time building, is this thing called Datasette, which is a tool for exploring data in SQLite databases. So I can actually do this. I can say datasette and then that path, and if you put it in double quotes, it copes with the space in the file name. This is taking, this is a good command line trick. 205 |
[8:18] Simon Willison: It's taking the path to the log database and passing it to Datasette. And now I've got a web interface where I can start browsing all of my conversations. So let's have a look at responses. We sort by ID. Here we go. Say hi in Spanish with a flourish. Hola mi amigo. There it is. It stores the options that we used. It stores the full JSON that came back. It also organizes these things into conversation threads. So earlier we started with, we built, we had that conversation that started five great names for a pet pelican. 206 |
[8:52] Simon Willison: We replied to it twice and each of those messages was logged under that same conversation ID. So we started five great names for a pet pelican. Now do walruses. Justify those. As a tool for evaluating language models, this is incredibly powerful because I've got every experiment I've ever run through this tool. I've got two and a half thousand responses that I've run all in one place, and I can use SQL to analyze those in different ways. 207 |
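A minimal sketch of that kind of SQL analysis over the log database. It assumes a `responses` table with a `model` column in LLM's logging schema; check the actual schema on your own machine (for example by opening the database in Datasette) before relying on it.

```bash
# Count logged responses per model, straight from LLM's SQLite log
sqlite3 "$(llm logs path)" \
  'select model, count(*) as n from responses group by model order by n desc'
```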
[9:18] Simon Willison: If I facet by them, I can see that I've spent the most time talking to GPT 3.5 Turbo, Claude 3 Opus. I've done 334 prompts through. I've got all of these other ones, Gemini, Gemini Pro. Orca Mini, all of these different things that I've been messing around with. And I can actually search these as well. If I search for pelican, these are the six times that I've talked to Claude 3 Opus about pelicans. And it's mostly… Oh, interesting. 208 |
[9:46] Simon Willison: It's mostly asking names for pelicans, but I did at one point ask it to describe an image that was a close-up view of a pelican wearing a colorful party hat. The image features aren't in the main shipped version of the software yet. That's a feature I need to release quite soon. Basically, it lets you say llm, like a -i, and then give it the path to an image and that will be sent as part of the prompt. So that's kind of fun. 209 |
[10:10] Simon Willison: But yeah, so if you want to be meticulous in tracking your experiments, I think this is a really great way to do that. Like having this database where I can run queries against everything I've ever done with them. I can try and compare the different models and the responses they gave to different prompts. That's super, super useful. Let's talk a little bit more about plugins. So I mentioned that we've got those plugins that add additional models. We also have plugins that add extra features. 210 |
[10:45] Simon Willison: My favorite of those is this plugin called LLM-CMD, which basically lets you do this. If I say llm install llm-cmd, that's installing the plugin. That gives me a new command, llm cmd. That command wasn't there earlier. I can now say llm cmd --help, and it will tell me that this will generate and execute commands in your shell. So as an example, let's do llm convert the file demos.md to uppercase and spit out the results. Oh, I forgot. So llm cmd. And what it does is it passes that up to, I think, GPT-4o is my default model right now. 211 |
[11:36] Simon Willison: Gets back the command that does this thing. Why is this taking so long? Hang on. Models. Default. Let's set that to GPT-4o. Maybe GPT-3.5 isn't very good at that. Anyway, when this works, it populates my shell with the command that I'm trying to run. And when I hit enter, it will run that command. But crucially, it doesn't just run the command, because that's a recipe for complete disaster. It lets you review that command before it goes. And I don't know why this isn't working. This is the one live demo I didn't test beforehand. 212 |
[12:21] Simon Willison: I will make notes available afterwards as well. But this is, right, here we go. Here's an animated GIF showing me, showing you exactly what happens. This is show the first three lines, show the first three lines of every file in this directory. And it spits out `head -n 3 *`. And that does exactly the job. 213 |
[12:41] Simon Willison: A fun thing about this is that because it's a command here, tab completion works as well, so you can give it file names by tab completing, and when the command is populating itself, that will do the right thing. But that's kind of fun. It's kind of neat to be able to build additional features into the tool, which use all of the other features of LLM, so it'll log things to SQLite and it'll give you access to all of those different models. 214 |
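The llm-cmd flow described above is roughly the following; the suggested command is only placed in your shell for review and runs when you hit enter.

```bash
llm install llm-cmd
# Describe what you want; llm-cmd populates the shell with a command to review
llm cmd show the first three lines of every file in this directory
# ...which might propose something like: head -n 3 *
```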
[13:07] Simon Willison: But it's a really fun playground for sort of expanding out the kind of command line features that we might want to use. Let's talk about local models. So local models that run on your laptop are getting shockingly effective these days. And there are a bunch of LLM plugins that let you run those. One of my favorites of those is called llm-gpt4all. It's a wrapper around the GPT4All library that Nomic put out. 215 |
[13:35] Simon Willison: So Nomic have a desktop application that can run models that is accompanied by a Python library that can run models, which is very neatly designed. And so what I can do is I can install that plugin, llm install llm-gpt4all. This will go and fetch that. And now when I run the llm models command, we've got a whole bunch of additional models. This is all of these GPT4All ones. And you'll note that some of these say installed. That's because I previously installed them and they're sat on my hard drive somewhere. 216 |
[14:08] Simon Willison: I'm not going to install any new models right now because I don't want to suck down a four gigabyte file while I'm on a Zoom call. But quite a few of these are installed, including Mistral 7B Instruct. So I can grab that and I can say, llm -m mistral, let's do: five great names for a pet seagull. Explanations. And I'm going to fire up Activity Monitor for this right now. Let's see if we can spot it doing its thing. Right now we've just asked a command line tool to load in a 4 gigabyte model file. 217 |
[14:42] Simon Willison: There we go. It's loading it in. It's at 64.3 megabytes. 235 megabytes. It's spitting out the answer. And then it goes away again. So this was actually a little bit wasteful, right? We just ran a command which loaded four gigabytes of file into memory, and I think onto the GPU, ran a prompt for it, and then threw all of that away again. And it works. You know, we got our responses. And this here ran entirely on my laptop. There was no internet connection needed for this to work. 218 |
[15:13] Simon Willison: But it's a little bit annoying to have to load the model each time. So I have another command I wrote, a command I added, llm chat. And llm chat, you can feed it the ID of a model, llm chat dash m mistral 7b. And now it's giving me a little… 219 |
[15:33] Simon Willison: chat interface, so I can say "say hello in Spanish" (the first time I run this it will load the model again) and "now in French". So if you're working with local models, this is a better way to do it, because you don't have to pay the cost of loading that model into memory every single time. There's also !multi: you can type !multi and now you can copy and paste a whole block of code into here and it translates. And then you type !end at the end. 220 |
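A sketch of that chat workflow with a local model, including the multi-line mode mentioned above. The model ID follows the llm-gpt4all naming and may differ on your machine.

```bash
llm chat -m mistral-7b-instruct-v0
# Inside the chat session:
#   !multi        start a multi-line prompt
#   ...paste a whole block of text...
#   !end          send it
```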
[16:08] Simon Willison: And if we're lucky, this will now give me… Huh. Okay. Well, that didn't. I may have hit the… I wonder if I've hit the context length. Yeah, something went a little bit wrong with that bit. But yeah, being able to hold things in memory is obviously really useful. There are better ways to do this. One of the best ways to do this is using the Ollama tool, which I imagine some people here have played with already. And Ollama is an absolutely fantastic tool. 221 |
[16:50] Simon Willison: It's a Mac, Linux, and Windows app that you can download that lets you start running local language models. And they do a much better job than I do of curating their collection of models. They have a whole team of people who are making sure that newly available models are available in that tool and work as effectively as possible. But you can use that with LLM as well. If I do llm install llm-ollama, actually… That will give me a new plugin called llm-ollama. And now I can type llm models. 222 |
[17:22] Simon Willison: And now, this time, it's giving me the Ollama models that I have available on my machine as well. So in this case, we can do Mixtral. Let's do llm chat -m mixtral:latest. I'll write it in Spanish. And this is now running against the Ollama server that's running on my machine, which I think might be loading Mixtral into memory at the moment, the first time I'm calling it. Once it's loaded into memory, it should work for following prompts without any additional overhead. Again, I spend a lot of time in Activity Monitor these days. 223 |
[18:00] Simon Willison: There we go. Ollama Runner has four gigabytes in residence, so you'd expect that to be doing the right thing. 224 |
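The Ollama route being demonstrated looks roughly like this, assuming Ollama itself is installed and running; the `mixtral` tag is just an example of a model you might have pulled.

```bash
ollama pull mixtral            # fetch a model with Ollama itself
llm install llm-ollama         # plugin that exposes Ollama's models to LLM
llm models                     # the Ollama models now show up in the list
llm -m mixtral:latest 'Say hi in Spanish'
llm chat -m mixtral:latest     # or keep the model resident in a chat session
```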
[18:07] Hugo Bowne-Anderson: So Simon, I just have a quick question. I love how you can use Ollama directly from LLM. I do think one of the other value props about Ollama is the ecosystem of tools built around it. Like you go to Ollama's GitHub and there are all types of front ends you can use. And so I love using your LLM client, for example, with like a quick and dirty Gradio app or something like that. But I'm wondering, are there any front ends you recommend or any plugins or anything in the ecosystem to work with? 225 |
[18:38] Simon Willison: I've not explored the Ollama ecosystem in much detail. I tend to do everything on the command line and I'm perfectly happy there. But one of the features I most want to add to LLM as a plugin is a web UI. So you can type llm web, hit enter, it starts a web server for you and gives you a slightly more… slightly more modern interface for messing around with things. And I've got a demo that relates to that that I'll show in a moment. 226 |
[19:04] Simon Willison: But yeah, so front-end, and actually the other front-end I spend a little bit of time with is LM Studio, which is very nice. That's a very polished GUI front-end for working with models. There's a lot of… 227 |
[19:19] Hugo Bowne-Anderson: It's quite fun to play around with getting two LLMs answering your same question. But there's a mode where you can… 228 |
[19:24] Simon Willison: Get two or N LLMs, if you have enough processing power, to answer the same questions and compare their responses in real time. Yeah, it's a new feature. Very cool. That is cool. I've been planning a plugin that will let you do that with LLM, like llm multi -m llama -m something, and then give it a prompt. But it's one of the many ideas on the backlog at the moment. It would be super, super useful. The other one, of course, that people should know about if they don't, 229 |
[19:53] Simon Willison: is Llamafile. I'm going to demonstrate that right now. 230 |
[19:57] Hugo Bowne-Anderson: Sorry, I'm going to shush now. 231 |
[20:00] Simon Willison: One of the most bizarrely brilliant ways of running language models is this project that's sponsored by Mozilla called Llamafile. And Llamafile effectively lets you download a single file, like a single binary that's like four gigabytes or whatever. And then that file will run, it comes with both the language model and the software that you need to run the language model. And it's bundled together in a single file, and one binary works on Windows, macOS, Linux and BSD, which is ridiculous. What this thing does is technically impossible. 232 |
[20:38] Simon Willison: You cannot have a single binary that works unmodified across multiple operating systems. But Llamafile does exactly that. It's using a technology called Cosmopolitan, which, here we go, I've got an article about Cosmopolitan from when I dug into it a while ago to try and figure out how this thing works. Astonishing project by Justine Tunney. But anyway, the great thing about Llamafile is, firstly, you can just download a file and you've got everything that you need. And because it's self-contained, I'm using this as my end-of-the-world backup of human knowledge. 233 |
[21:16] Simon Willison: I've got a hard drive here, which has a bunch of Llamafiles on. If the world ends, provided I can still run any laptop and plug this hard drive in, I will have a GPT-3.5-class language model that I can just start using. And so the one that I've got on there at the moment, I have, I've actually got a whole bunch of them, but this is the big one. I've got Llama 3 70B, which is by far, I mean, how big is that thing? That's a 37 gigabyte file. It's the four-bit quantized version. 234 |
[21:55] Simon Willison: That's genuinely a really, really good model. I actually started this running earlier because it takes about two minutes to load it over USB-C from the drive into memory. So this right here is that. And all I had to do was download that file, chmod 755 it, and then do ./Meta-Llama-3-70B… and hit enter. And that's it. And that then fires up this. In this case, it fires up a web server which loads the model and then starts running on a port. Which port is it? 235 |
[22:32] Simon Willison: So now if I go to localhost 8080, this right here is the default web interface for Llamafile. It's all based on llama.cpp but compiled in a special way. And so I can say, ten names for a pet pelican and hit go. And this will start firing back tokens. Oh, I spelt pelican wrong, which probably won't matter. 236 |
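Running a llamafile is roughly this; the filename below is a placeholder for whichever build you actually download.

```bash
# Placeholder filename: substitute the llamafile you downloaded
chmod 755 Meta-Llama-3-70B-Instruct.Q4_0.llamafile
./Meta-Llama-3-70B-Instruct.Q4_0.llamafile
# ...then open http://localhost:8080 for the built-in web UI
```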
[22:58] Hugo Bowne-Anderson: And while this is executing, maybe I'll also add for those who want to spin up Llamafile now, you can do so as Simon has just done. One of the first models in their README they suggest playing around with, which is 4 gigs or 4.7 gigs or something, is the LLaVA model, which is a really cool multimodal model that you can play around with locally from one file immediately, which is actually mind blowing if you think about it for a second. 237 |
[23:22] Simon Willison: It really is. Yeah. Do I have that one? Let's see. Let's grab. 238 |
[23:30] Hugo Bowne-Anderson: With your end of the world scenario, now you've got me thinking about like post-apocalyptic movies where people have like old LLMs that they use to navigate the new world. 239 |
[23:40] Simon Willison: Completely. Yeah, that's exactly how I. 240 |
[23:43] Hugo Bowne-Anderson: Preppers with LLMs. 241 |
[23:47] Simon Willison: I'm going to. Wow. Wow. Yeah, I very much broke that one, didn't I? 242 |
[23:58] Hugo Bowne-Anderson: We've got a couple of questions that may be relevant as we move on. One from Alex Lee is how does LLM compare with tools like Ollama for local models? So I just want to broaden this question. I think it's a great question. It's the Python challenge, right? The ecosystem challenge. When somebody wants to start with a tool like this, how do they choose between the plethora? How would you encourage people to make decisions? 243 |
[24:21] Simon Willison: I would say LLM's… LLM's unique selling point is the fact that it's scriptable and it's a command line tool. You can script it on the command line and it can access both the local models and the remote models. Like, that I feel is super useful. So you can try something out against a hosted model like Claude 3 Haiku and then you can run the exact same thing against a local Llama 3 or whatever. And that, I think, is the reason to pay attention to that. I just… 244 |
[24:52] Hugo Bowne-Anderson: I also, I do want to build on that. I'd say, like, I've been using Datasette for some time now, and I love local SQLite databases as well. So the integration of those three with all the Datasette plugins as well makes it really, really interesting also. So I think that's a huge selling point. 245 |
[25:07] Simon Willison: So what I've done here, I closed Llama 3 70B and I have switched over to that. I switched over to that LLaVA demo. And if we're lucky, look at this. The image features a person sitting in a chair with a rooster nearby. She's a chicken, not a rooster. A white bowl filled with eggs. This I think is astoundingly good for a four gigabyte file. This is a four gigabyte file. It can describe images. This is remarkably cool. And then from LLM's point of view, if I install llm-llamafile and then run llm models. 246 |
[25:50] Simon Willison: I've now got a model in here which is llamafile. So I can say llm -m llamafile, describe chickens. This doesn't yet. Ooh, what happened there? Error, file not found. Not sure what's going on there. I'll dig into that one a little bit later. 247 |
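What is being attempted there is roughly the following, assuming the llm-llamafile plugin and a llamafile server already running on localhost.

```bash
llm install llm-llamafile
llm models                     # should now include a "llamafile" entry
llm -m llamafile 'Describe chickens'
```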
[26:11] Simon Willison: The joys of live demos. But yeah, so we've got all of this stuff. What are some of the things that we can start doing with it? Well, the most exciting opportunity, I think, is that we can now start writing little bash scripts, writing little tools on top of this. And so if I do… This is a script that I've been running for quite a while called HN summary, which is a way of summarizing posts on Hacker News or entire conversations on Hacker News. Because Hacker News gets pretty noisy. 248 |
[26:47] Simon Willison: What's a good one of these to take a look at? Yeah, I do not have time to read 119 comments, but I'd like to know a rough overview of what's going on here. So I can say HN-summary. And then paste in that ID. And this is giving me a summary of the themes from the Hacker News post. So it's picked out themes: static versus dynamic linking, package management and dependencies, Swift as a cross-platform language. That totally worked. And if we look at the way this worked, it's just a bash script which does a curl command to get the full… 249 |
[27:25] Simon Willison: This is from one of the Hacker News APIs. If you hit this URL here, you will get back JSON of everything that's going on in that thread as this giant terrifying nested structure. I then pipe it through the jq command. I used ChatGPT to write this because I can never remember how to use jq. That takes that and turns it into plain text. And actually, I'll run that right now just to show you what that looks like. There we go. 250 |
[28:01] Simon Willison: So that essentially strips out all of the JSON stuff and just gives me back the names of the people and what they said. And then we pipe that into llm -m model. The model defaults to Haiku, but you can change it to other models if you like. And then we feed it this -s option to LLM, also known as --system, which is the way of feeding in a system prompt. 251 |
[28:26] Simon Willison: So here I'm feeding the output of that curl command, goes straight into LLM as the prompt, and then the system prompt is the thing that tells it what to do. So I'm saying: summarize the themes of the opinions expressed here; for each theme, output a markdown header. Let's try that. I'm going to try that one more time, but this time I'll use GPT-4o. So we're running the exact same prompt, but this time through a different model. And here we're actually getting back quotes. 252 |
[28:52] Simon Willison: So when it says, you man wizard said this, Jerry Puzzle said this about dynamic and static linking. I really like this as a mechanism of sort of summarizing conversations because the moment you ask for direct quotes, you're not completely safe from hallucination, but you do at least have a chance of fact checking what the thing said to you. And as a general rule, models are quite good at outputting text that they've just seen. So if you ask for direct quotes from the input, you'll rarely get bad results. But this is really good, right? 253 |
[29:20] Simon Willison: This is a pretty decent, quick way of digesting 128 comments in that giant thread. And it's all been logged to my SQLite database. I think if I go and look in SQLite, I've got hundreds of Hacker News threads that I've summarized in this way, which, if I wanted to do fancy things with later, is probably all sorts of fun I could have with them. And again, I will. Here we go. 254 |
[29:49] Simon Willison: I will share full notes later on, but there's a TIL that I wrote up with the full notes on how I built up this script. That's another reason I love Claude 3 Haiku, is that running this kind of thing through Claude 3 Haiku is incredibly inexpensive. It costs maybe a couple of cents for a long conversation. And the other model I use for this is Gemini Flash, which just came out from Google. It's also a really good long context model with a very low price per token. But yeah. Where did we go to? 255 |
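A sketch of the Hacker News summary script described above. The Algolia endpoint and jq expression follow the TIL linked in the notes, but treat this as an approximation; the `haiku` alias assumes the llm-claude-3 plugin is installed.

```bash
#!/bin/bash
# Usage: hn-summary.sh <hacker-news-item-id> [model]
id="$1"
model="${2:-haiku}"

curl -s "https://hn.algolia.com/api/v1/items/$id" \
  | jq -r 'recurse(.children[]) | .author + ": " + .text' \
  | llm -m "$model" -s 'Summarize the themes of the opinions expressed here.
For each theme, output a markdown header. Include direct "quotations"
(with author attribution) where appropriate.'
```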
[30:28] Hugo Bowne-Anderson: So we have a couple of questions that maybe you want to answer now, maybe we want to leave until later and maybe you cover. And the fact that we saw some sort of formatted output leads to one of these. Is LLM compatible with tools like Instructor, Kevin asks, to generate formatted output? And there are other questions around, like, piping different things together. So I wonder if you can kind of use that as a basis to talk about how you pipe, 256 |
[30:56] Simon Willison: you know? Yes, I will jump into some quite complex piping in just a moment. LLM does not yet have structured output function calling support. I am so excited about getting that in there. The thing I want to do, there are two features I care about. There's the feature where you can like get a bunch of unstructured text and feed in a JSON schema and get back JSON. That works incredibly well. A lot of the models are really good at that now. 257 |
[31:19] Simon Willison: I actually have a tool I built for Datasette that uses that feature, but that's not yet available as a command line tool. And the other thing I want to do is full-blown, like, tool execution where the tools themselves are plugins. Like imagine if you could install an LLM plugin that added Playwright functions, and now you can run prompts that can execute Playwright automations as part of those prompts, because it's one of the functions that gets made available to the model. 258 |
[31:46] Simon Willison: So that, I'm still gelling through exactly how that's going to work, but I think that's going to be enormously powerful. 259 |
[31:53] Hugo Bowne-Anderson: Amazing. And on that point as well, there are some questions around evals. And if you, when using something like this, you can do evals and how that would work. 260 |
[32:05] Simon Willison: So work in progress. Two months ago, I started hacking on a plugin for LLM for running evals. And it is very, very alpha right now. The idea is I want to be able to define my evals as YAML files and then say things like llm eval simple.yml with the GPT-4o model and the ChatGPT models. I'm running the same eval against two different models and then get those results back, log them to SQLite, all of that kind of thing. This is a very, very early prototype at the moment. 261 |
[32:37] Simon Willison: But it's partly, I just, I've got really frustrated with how difficult it is to run evals and how little I understand about them. And when I don't understand something, I tend to write code as my way of thinking through a problem. So I will not promise that this will turn into a generally useful thing for other people. I hope it will. At the moment, it's a sort of R&D prototype for me to experiment with some ideas. 262 |
[32:58] Hugo Bowne-Anderson: I also know the community has generated a bunch of plugins and that type of stuff for Datasette. I'm not certain about LLM, but I am wondering if people here are pretty, you know, pretty sophisticated audience here. So if people wanted to contribute or that type of thing. 263 |
[33:13] Simon Willison: OK, the number one way to contribute to LLM right now is by writing plugins for it. And I wrote a very detailed tutorial. The most exciting are the ones that enable new models. So I wrote a very detailed tutorial on exactly how to write a plugin that exposes new models. A bunch of people have written plugins for API-based models. Those are quite easy. The local models are a little bit harder, but the documentation is here. 264 |
[33:39] Simon Willison: And I mean, my dream is that someday this tool is widely enough used that when somebody releases a new model, they build a plugin for that model themselves. That would be the ideal. But in the absence of that, it's pretty straightforward building plugins for new models. I'm halfway through building a plugin for… the MLX Apple framework, which is getting really interesting right now. And I just this morning got to a point where I have a prototype of a plugin that can run MLX models locally, which is great. But yeah, let's do some commands. 265 |
[34:17] Simon Willison: Okay, I'll show you a really cute thing you can do first. LLM has support for templates. So you can say things like LLM dash dash system, you are a sentient cheesecake, tell it the model, and you can save that as a template called cheesecake. Now I can say LLM chat dash T cheesecake, tell me about yourself. And it'll say, I'm a sentient cheesecake, a delightful fusion of creamy textures. So this is, I have to admit, I built this feature. I haven't used it as much as I expected it I would. 266 |
[34:47] Simon Willison: It's effectively LLM's equivalent of GPTs, of ChatGPT's GPTs. I actually got this working before GPTs came along. And it's kind of fun, but it's, yeah, like I said, I don't use it on a daily basis. And I thought I would when I built it. Let's do some really fun stuff with piping. So I've got a couple of the, one of the most powerful features of LLM is that you can pipe things into it with a system prompt to have it, to then process those things further. 267 |
[35:17] Simon Willison: And so you can do that by just, like, catting files to it. So I can say cat demos.md, pipe llm -s summary short. And this will give me a short summary of that document that I just piped into it, which works really well. That's really nice. Cool. A little bit longer than I wanted it to be. Of course, the joy of this is that once this is done, I can then say llm -c, no, much, much, much shorter and in haikus. 268 |
[35:50] Simon Willison: And now it will write me some haikus that represent the demo that I'm giving you right now. These are sentient cheesecake, templates to save brilliant minds, cheesecake chats with us. That's lovely. So being able to pipe things in is really powerful. I built another command called files to prompt, where the idea of this one is if you've got a project with multiple files in it, running files to prompt will turn those into a single prompt. 269 |
[36:14] Simon Willison: And the way it does that is it outputs the name of the file and then the contents of the file and then the name of the next file, contents of the file, et cetera, et cetera, et cetera. But because of this, I can now do things like suggest tests to add to this project. Oh. I'm sorry. I'm sorry. I forgot the llm -s. Here we go. And this is now, here we go, reading all of that code and suggesting, okay, you should have tests that, oh, wow, it actually, it's writing me sample tests. 270 |
[36:46] Simon Willison: This is very, this is a very nice result. So I use this all the time. When I'm hacking on stuff on my machine, I will very frequently just cat a whole directory of files into a prompt in one go, and use the system prompt to say, what tests should I add? Or write me some tests, or explain what this is doing, or figure out this bug. Very, very powerful way of working with the tool. But way more fun than that is another tool I built called shot-scraper. 271 |
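The cat and files-to-prompt patterns described above look roughly like this.

```bash
cat demos.md | llm -s 'Summary, short'

pipx install files-to-prompt     # or pip install files-to-prompt
files-to-prompt . | llm -s 'Suggest some tests to add to this project'
```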
[37:14] Simon Willison: So shot-scraper is a browser automation tool which started out as a way of taking screenshots. So once you've got it installed, you can run shot-scraper and then the URL to a web page, and it will generate a PNG file with a screenshot of that web page. That's great. I use it to automate the screenshots in my documentation. But then I realized that you can do really fun things by running JavaScript from the command line.
[37:38] Simon Willison: So a very simple example: if I say shot-scraper javascript, give it the URL to a website and then give it the string document.title, it will load that website up in a hidden browser, execute that piece of JavaScript, and return the result of that JavaScript directly to my terminal. So I've now got the title of this webpage, and that's kind of fun. Where it gets super, super fun is when you start doing much more complicated and interesting things with it. So let's scrape Google. Google hate being scraped. We'll do it anyway.
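Roughly, the two shot-scraper invocations described here look like this (the URL is just a placeholder):

```bash
pipx install shot-scraper
shot-scraper install     # downloads the headless browser it drives

# Take a screenshot of a page
shot-scraper https://datasette.io/ -o datasette.png

# Evaluate a JavaScript expression on a page and print the result
shot-scraper javascript https://datasette.io/ 'document.title'
```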
[38:12] Simon Willison: Here is a Google search for NY Times plus slop. There's an article in the New York Times today with a quote from me about the concept of slop in AI, which I'm quite happy about. And if you open that up and start looking at the HTML, you'll see that there's an h3 for each result, and each h3 is wrapped by a link that links to that page.
[38:37] Simon Willison: So what I can do is write a little bit of JavaScript that finds all of the h3s on the page, and for each h3, it finds the parent link and its href, and it finds the title, and it outputs those in an array. And if I do this... this should fire up that browser. That just gave me a JSON array of links to search results about the New York Times. Now I could pipe that to LLM.
[39:06] Simon Willison: Actually, no, I'm gonna do a slightly more sophisticated version of this. This one goes a little bit further: it tries to get the snippet as well, because the snippet gives you that little bit of extra context. So if I take that, and I'm just going to say -s 'describe slop'... and what we have just done is RAG, right? This is retrieval augmented generation against Google search results, using their snippets to answer a question, done as a bash one-liner, effectively.
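The exact JavaScript from the demo is not reproduced in the transcript; a hypothetical reconstruction of the simpler version (it skips the snippet extraction Simon mentions) might look like this:

```bash
shot-scraper javascript 'https://www.google.com/search?q=nytimes+slop' "
Array.from(document.querySelectorAll('h3'), h3 => {
  const link = h3.closest('a');   // each result h3 sits inside its result link
  return {title: h3.innerText, href: link ? link.href : null};
})
" | llm -s 'describe slop'
```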
[39:42] Simon Willison: Like, we're using shot-scraper to load up that web page. We're scraping some stuff out with JavaScript. We're piping the results into LLM, which in this case is sending it up to GPT-4o, but I could equally tell it to send it to Claude, or to run it against a local model, or any of those things. And it's a full RAG pipeline. I think that's really fun. I do a lot of my experiments around the concept of RAG just as these little shell scripts.
[40:09] Simon Willison: You could consider the Hacker News example earlier to be almost an example of RAG, but this one, because we've got an actual search term and a question that we're answering, I feel like this is it. This is a very quick way to start prototyping different forms of retrieval augmented generation.
[40:27] Hugo Bowne-Anderson: Let me ask though, does it use? I may have missed, does it use embeddings? 279 |
[40:32] Simon Willison: Not yet, no, but I'll get into embeddings in just a second. 280 |
[40:35] Hugo Bowne-Anderson: And that's something we decided not to talk too much about today, but it'd be sad if people didn't find out about your embeddings. 281 |
[40:42] Simon Willison: I have a closing demo I can do with embeddings. Yeah, right here, effectively, we're just copying and pasting these search results from the browser into the model and answering a question, but we're doing it entirely on the command line, which means that we can hook up our own bash scripts that automate that and pull it all together. There's all sorts of fun bits and pieces we can do with that. But yeah, the shot-scraper JavaScript thing I'll share later. Let's jump into the last bit, the embedding stuff.
[41:16] Simon Willison: If you run llm --help, it'll show you all of the commands that are available in the LLM family. The default command is prompt, which is for running prompts. There are also these collections of commands for dealing with embeddings. I would hope everyone in this course is familiar enough with embeddings now that I don't need to dig into them in too much detail. But it's exactly the same pattern as the language models: embeddings are provided by plugins. There are API-based embeddings, there are local embedding models, and it all works exactly the same way.
[41:46] Simon Willison: So if I type llm embed-models, that will show me the embedding models that I have installed right now. These are the OpenAI ones: 3-small, 3-large, and so on. I could install additional plugins; here is the embeddings documentation, and there's a section in the plugin directory for embedding models. So you can install sentence-transformers, you can get CLIP running, and various other bits and pieces like that. But let's embed something. So if I say llm embed, let's use the OpenAI 3-small model and give it some text.
[42:28] Simon Willison: It will embed that text and it will return an array of, I think, 6,000... how many is that? Let's pipe it to jq length. An array of 1,536 floating point numbers. This is admittedly not very interesting or useful; there's not a lot that we can do with that JSON array of floating point numbers right here. You can get it back in different shapes and things. I think I can say -f hex and get back a hexadecimal blob of those. Again, not particularly useful.
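A sketch of those embedding commands, using the 3-small alias as listed by llm embed-models and the -f hex format flag as demoed:

```bash
# Embed a string with the OpenAI 3-small model and count the returned floats
llm embed -m 3-small -c 'hello there' | jq length    # 1536

# Same embedding, returned as a hexadecimal blob instead of a JSON array
llm embed -m 3-small -c 'hello there' -f hex
```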
[43:04] Simon Willison: Where embeddings get interesting is when you calculate embeddings across a larger amount of text and then start storing them for comparison. And we can do that in a whole bunch of different ways. There is a command called, where is it, embed-multi. Where's my embed-multi documentation gone? Here we go. The embed-multi command lets you embed multiple strings in one go, and it lets you store the results of those embeddings in a SQLite database, because I use SQLite databases for everything.
[43:40] Simon Willison: So I have here a SQLite database, which I'm going to open up using Datasette Desktop, my macOS Electron app version of Datasette. How big is that file? That's a 129 megabyte file. Wow. Does this have embeddings in already? It does not. Okay, so this right here is a database of all of the content on my blog. And one of the things I have on my blog is a link blog, this thing down the side, which has 7,000 links in it.
[44:16] Simon Willison: And each of those links is effectively a title, a description and a URL. So I've got those here in a SQLite database, and I'm going to create embeddings for every single one of those 7,168 bookmarks. And the way I can do that is, well, firstly, I need to figure out a SQL query that will get me back the data I want to embed. That's going to be select id, link_url, link_title, commentary from blog_blogmark.
[44:44] Simon Willison: The way LLM works is that when you give it a query like this, it treats the id there as the unique identifier for that document, and then everything else gets piped into the embedding model. So once I've got that in place, I can run this command. I can say llm embed-multi, I'm going to create a collection of embeddings called links, I'm going to do it against that Simon Willison blog SQLite database, I'm going to run this SQL query, and I'm using that 3-small model.
[45:11] Simon Willison: And then --store causes it to store the text in the SQLite database as well; without that, it'll just store the IDs. So I've set that running, and it's doing its thing. It's got 7,000 items. Each of those has to be sent to the OpenAI API in this case, or if it was a local model, it would run it locally. And while that's running, we can actually see what it's doing by taking a look at the embeddings table. Here we go.
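Assembled from the description above, the full command is roughly the following; the database filename and the table and column names are my reconstruction of what Simon reads out:

```bash
llm embed-multi links \
  -d simonwillisonblog.db \
  --sql 'select id, link_url, link_title, commentary from blog_blogmark' \
  -m 3-small \
  --store
```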
[45:38] Simon Willison: So this table right here is being populated by that script. We're at one thousand two hundred rows now. I hit refresh, and we're at two thousand rows. And you can see for each one, we've got the content which was glued together, and then we've got the embedding itself, which is a big binary blob: a binary-encoded version of that array of 1,536 floating point numbers. But now that we've got those stored, we can start doing fun things with them. I'm going to open up another window. There we go.
[46:20] Simon Willison: So I've opened up another window here so that I can say llm similar. I'm going to look for similar items in the links collection to the text, things that make me angry. Oh, why doesn't the... oh, because I've got to add the -d. Here we go. So this right here is taking the phrase, things that make me angry, it's embedding it, and it's finding the most similar items in my database to that. And there's an absolutely storming rant from somebody. There's death threats against bloggers. There's a bunch of things that might make me angry.
[46:56] Simon Willison: This is the classic sort of embedding semantic search right here. And this is kind of cool: I now have embedding search against my blog. Let's try something a little bit more useful. I'm going to search for Datasette plugins, so we'll get back everything that looks like it's a Datasette plugin. There we go. And I can now pipe that into LLM itself. So I pipe it to llm and give it the system prompt, most interesting plugins. And here we are. Again, this is effectively another version of command-line RAG.
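The two searches here, sketched as commands (same hypothetical database filename as above; the prompt wording is approximate):

```bash
# Semantic search over the stored link embeddings
llm similar links -d simonwillisonblog.db -c 'things that make me angry'

# Pipe the matches straight into a second llm call: command-line RAG
llm similar links -d simonwillisonblog.db -c 'datasette plugins' \
  | llm -s 'most interesting plugins'
```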
[47:39] Simon Willison: I have got an embeddings database in this case, and I can search it for things that are similar to other things. I like this example because we're running LLM twice: we're doing the llm similar command to get things out of that vector database, and then we're piping to the llm prompt command to summarize that data and turn it into something interesting. And so you can build a full RAG-based system, again, as a little command-line script. I think I've got one of those. blog-answer, yes, there we go.
[48:13] Simon Willison: This one I don't think is working at the moment, but this is an example of what it would take to take a question, run an embedding search against, in this case, every paragraph in my blog. I've got a little bit of jq to clean that up, and then I'm piping it into, in this case, a local Llamafile, but I've piped it into other models as well to answer questions.
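A minimal sketch of what such a script could look like, assuming a "paragraphs" collection was embedded with --store the same way as the links above; the script name, collection name, jq filter and prompt wording are all hypothetical, not Simon's actual script:

```bash
#!/bin/bash
# Usage: ./blog-answer.sh "your question"
# Requires a collection created with --store so each match carries its text
# in the "content" field of the JSON that llm similar emits.
question="$1"

llm similar paragraphs -d simonwillisonblog.db -c "$question" \
  | jq -r '.content' \
  | llm -s "Answer this question using only the context provided: $question"
```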
[48:38] Simon Willison: So you can build a full RAG Q&A workflow as a bash script that runs against this local SQLite database and does everything that way. It's worth noting that this is not a fancy vector database at all. This is a SQLite database with those embedding vectors stored as binary blobs. Anytime you run a search against this, it's effectively doing a brute-force scan: it calculates the vector similarity between your input and every one of those records, and then it sorts by that.
[49:13] Simon Willison: I find that for less than 100,000 records, it's so fast it doesn't matter. If you were using millions and millions of records, that's the point where the brute-force approach doesn't work anymore, and you'll want to use some kind of specialized vector index. There are SQLite vector indexing tools that I haven't integrated with yet, but they're looking really promising. You can use Pinecone and things like that as well. One of my future goals for LLM is to teach it how to work with external vector indexes.
[49:40] Simon Willison: Because I feel like once you've got those embeddings stored, having a command that synchronizes them up to your Pinecone to run searches, that feels like it would be a reasonable thing to do. I realize we're running a little bit short on time, so I'm gonna switch to questions for the rest of the section. I think I went through all of the demos that I wanted to provide.
[49:59] Hugo Bowne-Anderson: Awesome. Well, thank you so much, Simon. That was illuminating as always, and there are a lot more things I want to try now. And I hope that those who have played around with LLM and these client utilities have got a lot more ideas about how to do so. And for those who haven't, please jump in and let us know on Discord or wherever what type of fun you get to have.
[50:20] Hugo Bowne-Anderson: Question-wise, I haven't been able to rank all of them with respect to upvotes for some reason this time. There was one, and I don't know if you mentioned this, about Hugging Face Hub models. Are there plugins?
[50:37] Simon Willison: No, there is not. And that's because I am GPU poor; I'm running on a Mac. Most of the Hugging Face models appear to need an NVIDIA GPU. If you have an NVIDIA GPU and want to write the LLM Hugging Face plugin, I think it would be quite a straightforward plugin to write, and it would be enormously useful. So yeah, that's open right now for somebody to do. Same thing with, is it vLLM or something?
[51:05] Simon Willison: There's a few different serving technologies that I haven't dug into because I don't have an NVIDIA GPU to play with on a daily basis. But yeah, the Hugging Face Models thing would be fantastic. 302 |
[51:15] Hugo Bowne-Anderson: Awesome. And how about some of the serverless inference stuff? Is there a way we can use LLM to ping those?
[51:23] Simon Willison: Do you mean like the Cloudflare ones and so on? 304 |
[51:27] Hugo Bowne-Anderson: I'm thinking like, let me… If you go to any given model, there is some sort of serverless inference you can… 305 |
[51:36] Simon Willison: ...you can just ping the APIs that they've already got set up there. Oh, interesting. I mean, as you can see, we've got like 20 different plugins. Anyscale Endpoints is a very good one, Fireworks, OpenRouter. So if it's available via an API, you can build a plugin for it. The other thing is, if it's OpenAI-compatible as the API, you don't have to build anything at all. You can actually configure LLM, you can teach it about additional
[52:08] Simon Willison: OpenAI-compatible models by just dropping some lines into a YAML file. So if it already speaks OpenAI, without writing additional plugins, you can still talk to it.
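A hypothetical sketch of that configuration, written as a shell snippet. The file name extra-openai-models.yaml comes from LLM's documentation; the model IDs and endpoint URL below are invented for illustration:

```bash
# Register an OpenAI-compatible endpoint without writing a plugin.
# LLM reads extra models from a YAML file in its config directory
# (the same directory that `llm logs path` points into).
cat > "$(dirname "$(llm logs path)")/extra-openai-models.yaml" <<'EOF'
- model_id: my-local-model        # the name you will pass to llm -m
  model_name: gpt-3.5-turbo       # the model name sent to the API
  api_base: "http://localhost:8000/v1"
EOF

llm -m my-local-model 'Say hello'
```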
[52:20] Hugo Bowne-Anderson: Amazing. And just check out the link I shared with you. If you want to open that one, it should be in the chat. 308 |
[52:27] Simon Willison: Is it the… 309 |
[52:29] Hugo Bowne-Anderson: It's the Hugging Face API.
[52:34] Simon Willison: No, I've not built something against this yet. 311 |
[52:38] Hugo Bowne-Anderson: This could actually be really exciting. Yeah. Because they've got a lot of pretty heavy-duty models that you can just ping as part of their serverless offering.
[52:49] Simon Willison: I don't think anyone's built that yet, but that would be a really good one to get going, absolutely. 313 |
[52:55] Hugo Bowne-Anderson: So if anyone's interested in that, definitely jump in there. We do have questions around using your client utility for agentic workflows. 314 |
[53:07] Simon Willison: Yeah, not yet, because I haven't done the function calling piece. Once the function calling piece is in, I think that's going to get really interesting. And that's also the kind of thing where I feel like you could explore it really well by writing additional plugins, like an LLM agents plugin or something. The other side of LLM, which isn't as mature, is that there is a Python API. So you can pip install llm and use it from Python code.
[53:35] Simon Willison: I'm not completely happy with the interface of this yet, so I don't tend to push people towards it. Once I've got that stable, once I have a 1.0 release, I think this will be a very nice sort of abstraction layer over a hundred different models, because any model that's available through a plugin to the command-line tool will be available as a plugin that you can use from Python directly. So that's going to get super fun as well, especially in Jupyter notebooks and such like.
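A rough sketch of that Python API, wrapped in a shell one-liner to stay consistent with the rest of the demos; Simon notes the interface may still change before 1.0, and the model ID and prompt here are just examples:

```bash
# Assumes an OpenAI key is already configured, e.g. via `llm keys set openai`
python3 -c "
import llm

model = llm.get_model('gpt-4o')   # any model installed via an LLM plugin works
response = model.prompt('Three names for a pet pelican')
print(response.text())
"
```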
[53:58] Hugo Bowne-Anderson: Awesome. We actually have some questions around hardware. Would you mind sharing the system info of your Mac? Is it powerful? Does it cope with all the commands you demoed? Do people need a new Mac? I can tell people that I've got an M1 from 2021, so my MacBook's got a GPU, but it's like three, four years old or whatever, and it runs this stuff wonderfully.
[54:20] Simon Willison: So yeah, I'm on an M2 Mac, 64 gig. I wish I'd got more RAM. At the same time, the local models, the Mistral 7Bs and such like, run flawlessly. Phi-3, absolutely fantastic model, runs flawlessly. They don't even gobble up that much RAM. The largest model I've run so far is Llama 3 70B, which takes about 40 gigabytes of RAM, and it's definitely the most GPT-3.5-like local model that I've ever run.
[54:53] Simon Willison: I have a hunch that within a few months the Mac will be an incredible platform for running models, because Apple are finally going full speed ahead on local model stuff. Their MLX library is really, really good. So it might be that in six months' time, an M4 MacBook Pro with 192 gigabytes of RAM is the best machine out there. But I wouldn't spend any money now based on future potential.
[55:18] Hugo Bowne-Anderson: Right. And I'm actually very bullish on Apple and excited about what happens in this space as well. Also, we haven't talked about this at all, but the ability to run all this cool stuff on your cell phone: people are complaining about all types of stuff at the moment, and Apple hasn't done this yet, but this is wild. This is absolutely cosmic stuff.
[55:40] Simon Willison: There is an app that absolutely everyone with a modern iPhone should try out called MLC Chat. Yeah. It straight up runs Mistral on the phone. It just works. And it's worked for like six months. It's absolutely incredible. I can run Mistral 7B Instruct Quantized on my iPhone. Yeah. And it's good. I've used this on flights to look up Python commands and things. Yeah. That's incredible. And yeah, Apple stuff. 321 |
[56:09] Simon Willison: It's interesting that none of the stuff they announced the other day was actually a chatbot. You know, they're building language-model-powered features that summarize and that help with copy editing and stuff. They're not giving us a chat thing, which means that they're not responsible for hallucinations and all of the other weird stuff that can happen, which is a really interesting design choice. I feel like Apple did such a good job of avoiding most of the traps and pitfalls and weirdnesses in the products that they announced yesterday.
[56:40] Hugo Bowne-Anderson: Totally agree. And so, two more questions; we should wrap up in a second. People have said to me, hey, why do you run LLMs locally when there are so many ways to access bigger models? And one of my answers is just, like you mentioned, being on a plane, or the apocalypse, or that type of thing. But it's also just for exploration, to be able to do something when I'm at home using my local stuff.
[57:14] Simon Willison: For the vast majority of my real work that I do with LLMs, I use Claude 3 Opus, I use GPT-4o, I occasionally use Claude 3 Haiku. But the local models as a way of exploring the space are so fascinating. And I feel like if you want to learn how language models work, the best way to do that is to work with the really bad ones.
[57:35] Simon Willison: Spending time with a crap local model that hallucinates constantly is such a good way of building your sort of mental model of what these things are and how they work. Because when you do that, then you start saying, oh, okay, I get it: ChatGPT 3.5 is like Mistral 7B, but it's a bit better, so it makes fewer mistakes, and all of those kinds of things. But yeah, I mean, there are plenty of very valid privacy concerns around this as well. I've kind of skipped those.
[58:03] Simon Willison: Most of the stuff I say to models is me working on open source code, where if there's a leak, it doesn't affect me at all. But yeah, I feel like… If you're interested in understanding the world of language models, running local models is such a great way to explore them. 326 |
[58:20] Hugo Bowne-Anderson: Totally. I do have a question also around the eval tool that you mentioned you're slowly working on. Does it incorporate Datasette as well? Because when I want to do at least my first round of evaluations, I'll do it in a notebook or spin up a basic Streamlit app where I can tag things as right or wrong and then filter by those. These are the types of things that could make sense in Datasette.
[58:41] Simon Willison: That's where I want to go. So the idea with the evals tool, and it doesn't do this yet, is that it should be recording the results to SQLite so that you can have a custom Datasette interface to help you evaluate them. I want to do one of those LMSYS Arena-style interfaces where you can see two different responses from prompts that you've run evals against, click on the one that's better, and have that get recorded in the database as well.
[59:05] Simon Willison: There's so much that I could do with that, because fundamentally SQLite is such a great sort of substrate to build tools on top of. It's incredibly fast, it's free, everyone's got it, and you can use it as a file format for passing things around. Imagine running a bunch of evals and then schlepping a five-megabyte SQLite file to your co-worker to have a look at what you've done. That stuff all becomes possible as well. But yeah, that's the ambition there. I don't know when I'll get to invest the time in it.
[59:35] Hugo Bowne-Anderson: Well, once again, if people here are interested in helping out or chatting about this stuff, please do get involved. I am also interested, speaking about the SQLite database and Datasette: one thing that's also nice about LM Studio is that it does have some serious product stuff. When you run something, it'll give you, in your GUI, the latency and number of tokens and stuff like that. You could log that stuff to SQLite and have that in there, and then do serious benchmarking of different models.
[1:00:10] Simon Willison: Yep. I've been meaning to file this ticket for a while. Awesome. That needs to happen. Yep. I guess it's tokens per second and total duration. Yeah, exactly. It's going to be interesting figuring out how to best do that for models where I don't have a good token count from them, but I can fudge it. Just the duration on its own would be a useful thing to start recording.
[1:00:41] Hugo Bowne-Anderson: Absolutely. And so there's kind of a through line in some of these questions. Firstly, a lot of people are like, wow, this is amazing, thank you. Misha has said Simon is a hero. Someone else has said, this is brilliant, I can't believe you've done this. So that's all super cool. I want to build on this question: Eyal says, a little off topic, but how are you able to build so many amazing things? I just want to-
[1:01:07] Simon Willison: I have a blog post about that. 333 |
[1:01:09] Hugo Bowne-Anderson: Raise that as an issue on a GitHub repository? Well, 334 |
[1:01:12] Simon Willison: yeah. Here we go. I have a blog- It's not the building, 335 |
[1:01:15] Hugo Bowne-Anderson: it's the writing as well. So yeah, what structures do you put in your own life in order to- I have a great story. 336 |
[1:01:24] Simon Willison: Basically, this talk here is about my approach to personal projects, and really the argument I make is that you need to write unit tests and documentation, because then you can do more projects. Because if you haven't done that, you'll come back to a project like my LLM evals project, which I haven't touched in two months, but because I've got a decent set of issues and sort of notes tucked away in there, I'm going to be able to pick it up really easily.
[1:01:48] Simon Willison: And then the other trick is I only work on projects that I already know I can do quickly. I don't have time to take on a six-month mega project, but when I look at things like LLM, I already had the expertise of working with SQLite from the Datasette stuff. I knew how to write Python command-line applications. I knew how to build plugin infrastructures because I'd done that for Datasette.
[1:02:09] Simon Willison: So I was probably the person on earth most well-equipped to build a command line tool in Python that has plugins and does language model stuff and logs to SQLite. And so really that's my sort of main trick is I've got a bunch of things that I know how to do, and I'm really good at spotting opportunities to combine them in a way that lets me build something really cool, but quite quickly, because I've got so many other things going on. Amazing. That's the trick. It's being selective in your projects. 339 |
[1:02:38] Hugo Bowne-Anderson: And also there are small things you do, like your, well, it's not small anymore, right? But your Today I Learned blog. What a wonderful way to, you know, it doesn't necessarily need to be novel stuff, right? But because it's a Today I Learned, you just quickly write up something you learned.
[1:02:52] Simon Willison: I will tell you the trick for that is every single thing that I do, I do in GitHub issues. So if I'm working on anything at all, I will fire up a GitHub issue thread in a private repo or in a public repo, and I will write notes as I figure it out. And one of my favorite examples, this is when I wanted to serve an AWS Lambda function with a function URL from a custom subdomain, which took me 77 comments all from me to figure out because, oh my God, AWS is a nightmare. 341 |
[1:03:21] Simon Willison: And in those comments, I will drop in links to things I found and screenshots of the horrifying web interfaces I have to use and all of that kind of thing. And then when I go to write up a TIL, I just copy and paste the Markdown from the issue. So most of my TILs take like 10 minutes to put together, because they're basically just the sort of semi-structured notes I had already, copied and pasted and cleaned up a little bit. But this is in that productivity presentation I gave. This works so well.
[1:03:49] Simon Willison: It's almost like a scientist's notebook kind of approach where anything you're doing, you write very comprehensive notes on what do I need to do next? What did I just try? What worked? What didn't work? And you get them all in that sequence. And it means that I can. 343 |
[1:04:03] Simon Willison: I don't remember a single thing about AWS Lambda now, but next time I want to solve this problem, I can come back and read through this, and it'll sort of reboot my brain to the point that I can pick up the project from where I got to.
[1:04:16] Hugo Bowne-Anderson: Awesome. I know we're over time, but there are a couple more very interesting questions, so if you've got a couple more minutes...
[1:04:22] Simon Willison: Yes, absolutely. 346 |
[1:04:23] Hugo Bowne-Anderson: There's one around: have you thought about using Textual or Rich in order to make pretty output?
[1:04:32] Simon Willison: I think... does that work? Like, what is it... llm logs, piped to python -m rich... what's the thing? Is it rich.markdown you can do?
[1:04:47] Hugo Bowne-Anderson: I think so, but… 349 |
[1:04:50] Simon Willison: Maybe I do. Look at that! There we go, Rich just pretty-printed my Markdown. So yeah, I haven't added Rich as a dependency because I'm very, very protective of my dependencies; I try and keep them as minimal as possible. I should do a plugin. It would be really cool if there was a... I'd need a new plugin hook, but if there was a plugin where you could install llm-rich and now LLM outputs things like that, that would be super fun. So yeah, I should do that.
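What Simon types there is roughly the following; the trailing "-" should tell Rich's command-line renderer to read from standard input, though that detail is my assumption rather than something shown in the transcript:

```bash
# Pretty-print the Markdown that llm logs emits
llm logs | python -m rich.markdown -
```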
[1:05:18] Hugo Bowne-Anderson: That would be super cool. And just for full transparency, occasionally when I want to have fun playing around with the command line, I muck around with tools like cowsay. So piping LLM to cowsay has been fun as well, to get cows to... Final question: Misha has a few GPU rigs and is wondering if there's any idea how to run LLM with multiple models on different machines but on the same LAN.
[1:05:53] Simon Willison: I would solve that with more duct tape. I'd take advantage of existing tools that let you run the same command on multiple machines. Ansible, things like that. I think that would be the way to do that. And that's the joy of Unix is so many of these problems, if you've got a little Unix command, you can wire it together with extra bash script and Ansible and Kubernetes and Lord only knows what else. I run LLM occasionally inside of GitHub Actions, and that works because it's just a command line. 352 |
[1:06:22] Simon Willison: So yeah, for that, I'd look at existing tools that let you run commands in parallel on multiple machines. 353 |
[1:06:30] Hugo Bowne-Anderson: Amazing. So everyone, next step: pip install llm or pipx install llm, and let us know on Discord how you go. I just want to thank everyone for joining once again. And thank you, Simon, for all of your expertise and wisdom. It's always fun to chat.
[1:06:46] Simon Willison: Thanks for having me. And I will drop a Markdown document with all of the links and demos and things into Discord at some point in the next six hours. So I'll drop that into the Discord channel.
[1:06:58] Hugo Bowne-Anderson: Fantastic. All right. See you all on Discord. And thanks once again, Simon.

:::
--------------------------------------------------------------------------------
/talks/applications/simon_llm_cli/simon_bash.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/parlance-labs/applied-llms/790c4a8d533a74b41c2e700f9eb2074266f0cec7/talks/applications/simon_llm_cli/simon_bash.png
--------------------------------------------------------------------------------
/talks/index.qmd:
--------------------------------------------------------------------------------
1 | ---
2 | id: conf
3 | title: Talks
4 | metadata-files: ["_listing_root.yml"]
5 | listing:
6 |   contents:
7 |     - "/*.qmd"
8 |     - "/*/index.qmd"
9 | ---
10 | 
11 | A directory of relevant talks and presentations. The talks are categorized by the following topics:
--------------------------------------------------------------------------------