├── case-studies.md ├── .gitignore ├── ToC.pdf ├── .DS_Store ├── assets ├── rlhf.png ├── ai-judge.png ├── aie-cover.png ├── aie-architecture.png ├── aie-cover-back.png ├── prompt-anatomy.png ├── rag-architecture.png ├── rag-vs-finetune.png ├── evaluation-process.png ├── inference-service.png ├── model-perf-dataset.png └── aie-stack-evolution.png ├── scripts ├── chatgpt.png └── chatgpt-claude.png ├── misalignment.md ├── appendix.md ├── study-notes.md ├── prompt-examples.md ├── README.md ├── ToC.md ├── chapter-summaries.md └── resources.md /case-studies.md: -------------------------------------------------------------------------------- 1 | _Coming soon._ -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .ipynb_checkpoints 2 | scripts/.ipynb_checkpoints 3 | -------------------------------------------------------------------------------- /ToC.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kuilenren/aie-book/main/ToC.pdf -------------------------------------------------------------------------------- /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kuilenren/aie-book/main/.DS_Store -------------------------------------------------------------------------------- /assets/rlhf.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kuilenren/aie-book/main/assets/rlhf.png -------------------------------------------------------------------------------- /assets/ai-judge.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kuilenren/aie-book/main/assets/ai-judge.png -------------------------------------------------------------------------------- /scripts/chatgpt.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kuilenren/aie-book/main/scripts/chatgpt.png -------------------------------------------------------------------------------- /assets/aie-cover.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kuilenren/aie-book/main/assets/aie-cover.png -------------------------------------------------------------------------------- /assets/aie-architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kuilenren/aie-book/main/assets/aie-architecture.png -------------------------------------------------------------------------------- /assets/aie-cover-back.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kuilenren/aie-book/main/assets/aie-cover-back.png -------------------------------------------------------------------------------- /assets/prompt-anatomy.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kuilenren/aie-book/main/assets/prompt-anatomy.png -------------------------------------------------------------------------------- /assets/rag-architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kuilenren/aie-book/main/assets/rag-architecture.png 
-------------------------------------------------------------------------------- /assets/rag-vs-finetune.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kuilenren/aie-book/main/assets/rag-vs-finetune.png -------------------------------------------------------------------------------- /scripts/chatgpt-claude.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kuilenren/aie-book/main/scripts/chatgpt-claude.png -------------------------------------------------------------------------------- /assets/evaluation-process.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kuilenren/aie-book/main/assets/evaluation-process.png -------------------------------------------------------------------------------- /assets/inference-service.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kuilenren/aie-book/main/assets/inference-service.png -------------------------------------------------------------------------------- /assets/model-perf-dataset.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kuilenren/aie-book/main/assets/model-perf-dataset.png -------------------------------------------------------------------------------- /assets/aie-stack-evolution.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kuilenren/aie-book/main/assets/aie-stack-evolution.png -------------------------------------------------------------------------------- /misalignment.md: -------------------------------------------------------------------------------- 1 | # Misalignment AI 2 | A curated list of generative AI use cases that go wrong. 3 | 4 | _Coming soon_ -------------------------------------------------------------------------------- /appendix.md: -------------------------------------------------------------------------------- 1 | _Coming soon_ 2 | 3 | This section contains notes that are related to the concepts discussed in the book but too nitty gritty to be included. Most people will probably find this boring. 4 | 5 | -------------------------------------------------------------------------------- /study-notes.md: -------------------------------------------------------------------------------- 1 | # Study notes 2 | Notes from people reading AI Engineering. If you've made public notes about the book, feel free to create a PR to add them here. Thank you! 3 | 4 | - [Alex Strick van Linschoten](https://www.linkedin.com/in/strickvl/) made excellent chapter-by-chapter notes of the book. Here are some of them. 5 | - [Chapter 6: RAG and Agents](https://mlops.systems/posts/2025-01-24-notes-on-ai-engineering-chip-huyen-chapter-6.html) 6 | - [Chapter 7: Finetuning](https://mlops.systems/posts/2025-01-26-notes-on-ai-engineering-chip-huyen-chapter-7:-finetuning.html) 7 | - [Chapter 9: Inference optimization](https://mlops.systems/posts/2025-02-07-ai-engineering-chapter-9.html) 8 | - [Chapter 10: AI engineering architecture and user feedback](https://mlops.systems/posts/2025-02-09-ai-eg-chapter-10.html) 9 | - [Li Liu's bookclub](https://docs.google.com/document/d/1mKRI3cJVNTmNKgE85OUd3DOfnph7Z3QmqrZhBCX7yQ4/edit?tab=t.0#heading=h.ohq29zbftwgr): started Feb 8, 2025 with weekly meeting. 
10 | - [AI from Scratch reading group](https://x.com/santiviquez/status/1886469829583835214): started Feb 16, 2025. 11 | - [Pastor Soto's notes on AI engineering](https://substack.com/home/post/p-154650527) 12 | 13 | 14 | I'm also collecting discussion topics from reading groups and courses on AI engineering. Feel free to make PRs to add yours. Thank you! -------------------------------------------------------------------------------- /prompt-examples.md: -------------------------------------------------------------------------------- 1 | # Prompt examples 2 | Prompt examples from actual use cases (shared by their developers) and hypothetical prompts to demonstrate different prompt engineering techniques. More prompts will be added over time. 3 | 4 | - [Real-world use case prompts](#real-world-use-case-prompts) 5 | - [Brex: Financial assistant](#brex-financial-assistant) 6 | - [Cursor: Task-or-not classifier](#cursor-task-or-not-classifier) 7 | - [Grab: Data entity classification](#grab-data-entity-classification) 8 | - [Pinterest: Text-to-SQL](#pinterest-text-to-sql) 9 | - [Thoughtworks: Co-pilot for product ideation](#thoughtworks-co-pilot-for-product-ideation) 10 | - [Whatnot: Content moderation](#whatnot-content-moderation) 11 | - [Prompt attack examples](#prompt-attack-examples) 12 | 13 | Coming soon ... 14 | - [Defensive prompt examples](#defensive-prompt-examples) 15 | 16 | Coming soon ... 17 | 18 | ## Real world use case prompts 19 | ### [Brex: Financial assistant](https://github.com/brexhq/prompt-engineering) 20 | 21 | ``` 22 | You are a financial assistant. I am using Brex, a platform for managing expenses. I am located in [LOCATION]. My current time is [TIME]. Today is [DATE]. My current inbox looks like: 23 | 24 | [RETRIEVED INBOX] 25 | 26 | Any responses to the user should be concise — no more than two sentences. They should include pleasant greetings, such as "good morning," as is appropriate for my time. The responses should be pleasant and fun. 27 | ``` 28 | ### [Cursor: Task-or-not classifier](https://www.cursor.com/blog/prompt-design#priompt-v01-a-first-attempt-at-a-prompt-design-library) 29 | 30 | ``` 31 | export default function DoableAsTaskPrompt( 32 | props: DoableAsTaskProps 33 | ): PromptElement { 34 | return ( 35 | <> 36 | 37 | You are a task-or-not classifier. Specifically, you will be given a 38 | message generated by an engineering assistant. Your job is to determine 39 | whether or not it describes a task / set of instructions to perform 40 | changes in an editor. Give your answer with a single word "EDITOR TASK" 41 | or "NOT EDITOR TASK". Note that requests for information, though 42 | actionable, are not editor tasks. Furthermore, you should only count 43 | editor tasks that are specific, not general suggestions that require 44 | user discretion. 45 | 46 | 47 | 48 | 49 | 54 | 55 | 56 | {props.lastAIMessage} 57 | 58 | {props.lastAIMessage.slice(0, props.lastAIMessage.length / 2)} 59 | 60 | 61 | 62 | 63 | 64 | ); 65 | } 66 | ``` 67 | ### Grab: Data entity classification 68 | [LLM-powered data classification for data entities at scale](https://engineering.grab.com/llm-powered-data-classification) (Liu et al., 2024) 69 | ``` 70 | You are a database column tag classifier, your job is to assign the most appropriate tag based on table name and column name. The database columns are from a company that provides ride-hailing, delivery, and financial services. Assign one tag per column. However not all columns can be tagged and these columns should be assigned . 
You are precise, careful and do your best to make sure the tag assigned is the most appropriate. 71 | The following is the list of tags to be assigned to a column. For each line, left hand side of the : is the tag and right hand side is the tag definition 72 | … 73 | : refers to government-provided identification numbers that can be used to uniquely identify a person and should be assigned to columns containing "NRIC", "Passport", "FIN", "License Plate", "Social Security" or similar. This tag should absolutely not be assigned to columns named "id", "merchant id", "passenger id", “driver id" or similar since these are not government-provided identification numbers. This tag should be very rarely assigned. 74 | : should be used when none of the above can be assigned to a column. 75 | … 76 | 77 | Output Format is a valid json string, for example: 78 | [{ 79 | "column_name": "", 80 | "assigned_tag": "" 81 | }] 82 | 83 | Example question 84 | `These columns belong to the "deliveries" table 85 | 86 | 1. merchant_id 87 | 2. status 88 | 3. delivery_time` 89 | 90 | Example response 91 | 92 | [{ 93 | "column_name": "merchant_id", 94 | "assigned_tag": "" 95 | },{ 96 | "column_name": "status", 97 | "assigned_tag": "" 98 | },{ 99 | "column_name": "delivery_time", 100 | "assigned_tag": "" 101 | }] 102 | ``` 103 | ### Pinterest: Text-to-SQL 104 | [Text-to-SQL prompt template](https://github.com/pinterest/querybook/blob/master/querybook/server/lib/ai_assistant/prompts/text_to_sql_prompt.py) (2024) 105 | 106 | ``` 107 | You are a {dialect} expert. 108 | 109 | Please help to generate a {dialect} query to answer the question. Your response should ONLY be based on the given context and follow the response guidelines and format instructions. 110 | 111 | ===Tables 112 | {table_schemas} 113 | 114 | ===Original Query 115 | {original_query} 116 | 117 | ===Response Guidelines 118 | 1. If the provided context is sufficient, please generate a valid query without any explanations for the question. The query should start with a comment containing the question being asked. 119 | 2. If the provided context is insufficient, please explain why it can't be generated. 120 | 3. Please use the most relevant table(s). 121 | 5. Please format the query before responding. 122 | 6. Please always respond with a valid well-formed JSON object with the following format 123 | 124 | ===Response Format 125 | {{ 126 | "query": "A generated SQL query when context is sufficient.", 127 | "explanation": "An explanation of failing to generate the query." 128 | }} 129 | 130 | ===Question 131 | {question} 132 | ``` 133 | 134 | ### Thoughtworks: Co-pilot for product ideation 135 | [Building Boba AI](https://www.martinfowler.com/articles/building-boba.html) (Farooq Ali, 2023) 136 | 137 | ``` 138 | You are a visionary futurist. Given a strategic prompt, you will create 139 | {num_scenarios} futuristic, hypothetical scenarios that happen 140 | {time_horizon} from now. Each scenario must be a {optimism} version of the 141 | future. Each scenario must be {realism}. 142 | 143 | Strategic prompt: {strategic_prompt} 144 | ===== 145 | You will respond with only a valid JSON array of scenario objects. 
146 | Each scenario object will have the following schema: 147 | "title": , //Must be a complete sentence written in the past tense 148 | "summary": , //Scenario description 149 | "plausibility": , //Plausibility of scenario 150 | "horizon": 151 | ===== 152 | You will respond in JSON format containing two keys, "questions" and "strategies", with the respective schemas below: 153 | "questions": [] 154 | "question": , 155 | "answer": 156 | "strategies": [] 157 | "title": , 158 | "summary": , 159 | "problem_diagnosis": , 160 | "winning_aspiration": , 161 | "where_to_play": , 162 | "how_to_win": , 163 | "assumptions": 164 | ``` 165 | 166 | ### Whatnot: Content moderation 167 | [How Whatnot Utilizes Generative AI to Enhance Trust and Safety](https://medium.com/whatnot-engineering/how-whatnot-utilizes-generative-ai-to-enhance-trust-and-safety-c7968eb6315e) (2023) 168 | 169 | ``` 170 | Given are following delimited by a new line 171 | 1. User id for the user under investigation 172 | 2. A message sent by a user through direct messaging 173 | 3. Interaction between users 174 | The interaction data is delimited by triple backticks, has timestamp, sender id and message separated by a '>>'. 175 | The sender may be trying to scam receivers in many ways. Following patterns are definitive and are known to occur frequently on the platform. 176 | 177 | """ Known scam patterns """ 178 | 179 | Assess if the provided conversation indicates a scam attempt. 180 | Provide likelihoods (0-1) of scam, assessment notes in json format which can be consumed by a service with keys with no text output: 181 | scam_likelihood and explanation (reasoning for the likelihood)? 182 | 183 | ``` text ```` 184 | 185 | Expected output 186 | { 187 | "scam_likelihood": [0-1], 188 | "explanation": reasoning for the likelihood for scam 189 | } 190 | ``` 191 | 192 | ## Prompt attack examples 193 | 194 | ## Defensive prompt examples 195 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # AI Engineering book and other resources 2 | > _This repo will be updated with more resources in the next few weeks._ 3 | 4 | - [About the book AI Engineering](#about-the-book) 5 | - [Table of contents](ToC.md) 6 | - [Chapter summaries](chapter-summaries.md) 7 | - [Study notes](study-notes.md) 8 | - [AI engineering resources](resources.md) 9 | - [Prompt examples](prompt-examples.md) 10 | - [Case studies](case-studies.md) 11 | - [Misalignment AI](misalignment.md) 12 | - [Appendix](appendix.md) 13 | - Fun tools: 14 | 15 | - [ChatGPT and Claude conversation heatmap generator](scripts/ai-heatmap.ipynb) 16 | - And more ... 17 | 18 | ## About the book 19 | The availability of foundation models has transformed AI from a specialized discipline into a powerful development tool everyone can use. This book covers the end-to-end process of adapting foundation models to solve real-world problems, encompassing tried-and-true techniques from other engineering fields and techniques emerging with foundation models. 20 | 21 | [](https://amzn.to/49j1cGS)[](https://amzn.to/49j1cGS) 22 | 23 | The book is available on: 24 | - [Amazon](https://amzn.to/49j1cGS) 25 | - [O'Reilly](https://oreillymedia.pxf.io/c/5719111/2146021/15173) 26 | - [Kindle](https://amzn.to/3Vq2ryu) 27 | 28 | and most places where technical books are sold. 
29 | 30 | _This is NOT a tutorial book, so it doesn't have a lot of code snippets._ 31 | 32 | ## What this book is about 33 | This book provides a framework for adapting foundation models, which include both large language models (LLMs) and large multimodal models (LMMs), to specific applications. It not only outlines various solutions for building an AI application but also raises questions you can ask to evaluate the best solution for your needs. Here are just some of the many questions that this book can help you answer: 34 | 35 | 1. Should I build this AI application? 36 | 1. How do I evaluate my application? Can I use AI to evaluate AI outputs? 37 | 1. What causes hallucinations? How do I detect and mitigate hallucinations? 38 | 1. What are the best practices for prompt engineering? 39 | 1. Why does RAG work? What are the strategies for doing RAG? 40 | 1. What’s an agent? How do I build and evaluate an agent? 41 | 1. When to finetune a model? When not to finetune a model? 42 | 1. How much data do I need? How do I validate the quality of my data? 43 | 1. How do I make my model faster, cheaper, and secure? 44 | 1. How do I create a feedback loop to improve my application continually? 45 | 46 | The book will also help you navigate the overwhelming AI landscape: types of models, evaluation benchmarks, and a seemingly infinite number of use cases and application patterns. 47 | 48 | The content in this book is illustrated using actual case studies, many of which I’ve worked on, backed by ample references and extensively reviewed by experts from a wide range of backgrounds. Even though the book took two years to write, it draws from my experience working with language models and ML systems from the last decade. 49 | 50 | Like my previous book, _[Designing Machine Learning Systems (DMLS)](https://amzn.to/4fXVZH2)_, this book focuses on the fundamentals of AI engineering instead of any specific tool or API. Tools become outdated quickly, but fundamentals should last longer. 51 | 52 | ### Reading _AI Engineering_ (AIE) with _Designing Machine Learning Systems_ (DMLS) 53 | AIE can be a companion to DMLS. DMLS focuses on building applications on top of traditional ML models, which involves more tabular data annotations, feature engineering, and model training. AIE focuses on building applications on top of foundation models, which involves more prompt engineering, context construction, and parameter-efficient finetuning. Both books are self-contained and modular, so you can read either book independently. 54 | 55 | Since foundation models are ML models, some concepts are relevant to working with both. If a topic is relevant to AIE but has been discussed extensively in DMLS, it’ll still be covered in this book, but to a lesser extent, with pointers to relevant resources. 56 | 57 | Note that many topics are covered in DMLS but not in AIE, and vice versa. The first chapter of this book also covers the differences between traditional ML engineering and AI engineering. 58 | 59 | A real-world system often involves both traditional ML models and foundation models, so knowledge about working with both is often necessary. 60 | 61 | ## Who this book is for 62 | 63 | This book is for anyone who wants to leverage foundation models to solve real-world problems. This is a technical book, so the language of this book is geared towards technical roles, including AI engineers, ML engineers, data scientists, engineering managers, and technical product managers. 
This book is for you if you can relate to one of the following scenarios: 64 | * You’re building or optimizing an AI application, whether you’re starting from scratch or looking to move beyond the demo phase into a production-ready stage. You may also be facing issues like hallucinations, security, latency, or costs, and need targeted solutions. 65 | * You want to streamline your team’s AI development process, making it more systematic, faster, and reliable. 66 | * You want to understand how your organization can leverage foundation models to improve the business’s bottom line and how to build a team to do so. 67 | 68 | You can also benefit from the book if you belong to one of the following groups: 69 | * Tool developers who want to identify underserved areas in AI engineering to position your products in the ecosystem. 70 | * Researchers who want to understand better AI use cases. 71 | * Job candidates seeking clarity on the skills needed to pursue a career as an AI engineer. 72 | * Anyone wanting to better understand AI's capabilities and limitations, and how it might affect different roles. 73 | 74 | I love getting to the bottom of things, so some sections dive a bit deeper into the technical side. While many early readers like the detail, I know it might not be for everyone. I’ll give you a heads-up before things get too technical. Feel free to skip ahead if it feels a little too in the weeds! 75 | 76 | 77 | ## Reviews 78 | - _"This book offers a comprehensive, well-structured guide to the essential aspects of building generative AI systems. A must-read for any professional looking to scale AI across the enterprise."_ - Vittorio Cretella, former global CIO at P&G and Mars 79 | 80 | - _"Chip Huyen gets generative AI. She is a remarkable teacher and writer whose work has been instrumental in helping teams bring AI into production. Drawing on her deep expertise, AI Engineering is a comprehensive and holistic guide to building generative AI applications in production."_ - Luke Metz, co-creator of ChatGPT, ex-research manager @ OpenAI 81 | 82 | - _"Every AI engineer building real-world applications should read this book. It’s a vital guide to end-to-end AI system design, from model development and evaluation to large-scale deployment and operation."_ - Andrei Lopatenko, Director Search and AI, Neuron7 83 | 84 | - _"This book serves as an essential guide for building AI products that can scale. Unlike other books that focus on tools or current trends that are constantly changing, Chip delivers timeless foundational knowledge. Whether you're a product manager or an engineer, this book effectively bridges the collaboration gap between cross-functional teams, making it a must-read for anyone involved in AI development."_ - Aileen Bui, AI Product Operations Manager, Google 85 | 86 | - _"This is the definitive segue into AI Engineering from one of the greats of ML Engineering! Chip has seen through successful projects and careers at every stage of a company and for the first time ever condensed her expertise for new AI Engineers entering the field."_ - swyx, Curator, AI.Engineer 87 | 88 | - _"AI Engineering is a practical guide that provides the most up-to-date information on AI development, making it approachable for novice and expert leaders alike. 
This book is an essential resource for anyone looking to build robust and scalable AI systems."_ - Vicki Reyzelman, Chief AI Solutions Architect, Mave Sparks 89 | 90 | - _"AI Engineering is a comprehensive guide that serves as an essential reference for both understanding and implementing AI systems in practice."_ - Han Lee, Director - Data Science, Moody's. 91 | 92 | - _"AI Engineering is an essential guide for anyone building software with Generative AI! It demystifies the technology, highlights the importance of evaluation, and shares what should be done to achieve quality before starting with costly fine-tuning."_ - Rafal Kawala, Senior AI Engineering Director, 16 years of experience working in a Fortune 500 company 93 | 94 | See what people are talking about the book on Twitter [@aisysbooks](https://twitter.com/aisysbooks/likes)! 95 | 96 | ## Acknowledgments 97 | This book would've taken a lot longer to write and missed many important topics if it wasn't for so many wonderful people who helped me through the process. 98 | 99 | Because the timeline for the project was tight—two years for a 150,000-word book that covers so much ground—I'm grateful to the technical reviewers who put aside their precious time to review this book so quickly. 100 | 101 | [Luke Metz](https://x.com/luke_metz) is an amazing soundboard who checked my assumptions and prevented me from going down the wrong path. [Han-chung Lee](https://www.linkedin.com/in/hanchunglee/), always up to date with the latest AI news and community development, pointed me toward resources that I missed. Luke and Han were the first to review my drafts before I sent them to the next round of technical reviewers, and I'm forever indebted to them for tolerating my follies and mistakes. 102 | 103 | Having led AI innovation at Fortune 500 companies, [Vittorio Cretella](https://www.linkedin.com/in/vittorio-cretella/) and [Andrei Lopatenko](https://www.linkedin.com/in/lopatenko/) provided invaluable feedback that combined deep technical expertise with executive insights. [Vicki Reyzelman](https://www.linkedin.com/in/vickireyzelman/) helped me ground my content and keep it relevant for readers with a software engineering background. 104 | 105 | [Eugene Yan](https://eugeneyan.com/), a dear friend and amazing applied scientist, provided me with technical and emotional support. Shawn Wang ([swyx](https://x.com/swyx)), provided an important vibe check that helped me feel more confident about the book. [Sanyam Bhutani](https://x.com/bhutanisanyam1) is one of the best learners and most humble souls I know, who not only gave thoughtful written feedback but also recorded videos to explain his feedback. 106 | 107 | Kyle Krannen is a star deep learning lead who interviewed his colleagues and shared with me an amazing writeup about their finetuning process, which guided the finetuning chapter. [Mark Saroufim](https://x.com/marksaroufim), an inquisitive mind who always has his pulse on the most interesting problems, introduced me to great resources on efficiency. Both Kyle and Mark's feedback was critical in writing Chapters 7 and 9. 108 | 109 | [Kittipat "Bot" Kampa](https://www.linkedin.com/in/kittipat-bot-kampa-1b1965/), on top of answering my many questions, shared with me a detailed visualization of how he thinks about AI platform. I appreciate [Denys Linkov](https://www.linkedin.com/in/denyslinkov/)'s systematic approach to evaluation and platform development. 
[Chetan Tekur](https://www.linkedin.com/in/chetantekur/) gave great examples that helped me structure AI application patterns. I'd also like to thank [Alex (Shengzhi Li) Li](https://www.linkedin.com/in/findalexli/) and [Hien Luu](https://www.linkedin.com/in/hienluu/) for their thoughtful feedback on my draft on AI architecture. 110 | 111 | [Aileen Bui](https://www.linkedin.com/in/aileenbui/) is a treasure who shared unique feedback and examples from a product manager's perspective. Thanks [Todor Markov](https://www.linkedin.com/in/todor-markov-4aa38a67/) for the actionable advice on the RAG and Agents chapter. Thanks [Tal Kachman](https://www.linkedin.com/in/tal-kachman/) for jumping in at the last minute to push the finetuning chapter over the finish line. 112 | 113 | There are so many wonderful people whose company and conversations gave me ideas that guide the content of this book. I tried my best to include the names of everyone who has helped me here, but due to the inherent faultiness of human memory, I undoubtedly neglected to mention many. If I forgot to include your name, please know that it wasn't because I don't appreciate your contribution, and please kindly remind me so that I can rectify as soon as possible! 114 | 115 | Andrew Francis, Anish Nag, [Anthony Galczak](https://www.linkedin.com/in/wgalczak/), [Anton Bacaj](https://x.com/abacaj), Balázs Galambosi, Charles Frye, Charles Packer, Chris Brousseau, Eric Hartford, Goku Mohandas, Hamel Husain, Harpreet Sahota, Hassan El Mghari, Huu Nguyen, Jeremy Howard, Jesse Silver, John Cook, [Juan Pablo Bottaro](https://www.linkedin.com/in/juan-pablo-bottaro/), Kyle Gallatin, Lance Martin, Lucio Dery, Matt Ross, Maxime Labonne, Miles Brundage, Nathan Lambert, Omar Khattab, [Phong Nguyen](https://www.linkedin.com/in/xphongvn/), Purnendu Mukherjee, Sam Reiswig, Sebastian Raschka, Shahul ES, Sharif Shameem, Soumith Chintala, Teknium, Tim Dettmers, Undi5, Val Andrei Fajardo, Vern Liang, Victor Sanh, Wing Lian, Xiquan Cui, Ying Sheng, and Kristofer. 116 | 117 | I'd like to thank all early readers who have also reached out with feedback. Douglas Bailley is a super reader who shared so much thoughtful feedback. Nutan Sahoo for suggesting an elegant way to explain perplexity. 118 | 119 | I learned so much from the online discussions with so many. Thanks to everyone who's ever answered my questions, commented on my posts, or sent me an email with your thoughts. 120 | 121 | Of course, the book wouldn't have been possible without the team at O'Reilly, especially my development editors (Melissa Potter, Corbin Collins, Jill Leonard) and my production editors (Kristen Brown and Elizabeth Kelly). Liz Wheeler is the most discerning editor I've ever worked with. Nicole Butterfield is a force who oversaw this book from an idea to a final product. 122 | 123 | This book, after all, is an accumulation of invaluable lessons I learned throughout my career. I owe these lessons to my extremely competent and patient coworkers and former coworkers. Every person I've worked with has taught me something new about bringing ML into the world. 124 | 125 | --- 126 | 127 |
128 | **Citation**
129 | 130 | Chip Huyen, *AI Engineering*. O'Reilly Media, 2025. 131 | 132 | @book{aiebook2025, 133 | address = {USA}, 134 | author = {Chip Huyen}, 135 | isbn = {978-1801819312}, 136 | publisher = {O'Reilly Media}, 137 | title = {{AI Engineering}}, 138 | year = {2025} 139 | } 140 | -------------------------------------------------------------------------------- /ToC.md: -------------------------------------------------------------------------------- 1 | # Table of Contents 2 | 3 | _[PDF version](ToC.pdf)_ 4 | | | | 5 | |---------------------------------------------------------------------------------------------|-----:| 6 | | **Preface** | ix | 7 | | **1. Introduction to Building AI Applications with Foundation Models** | 1 | 8 | | The Rise of AI Engineering | 2 | 9 | | - From Language Models to Large Language Models | 2 | 10 | | - From Large Language Models to Foundation Models | 8 | 11 | | - From Foundation Models to AI Engineering | 12 | 12 | | Foundation Model Use Cases | 16 | 13 | | - Coding | 20 | 14 | | - Image and Video Production | 22 | 15 | | - Writing | 22 | 16 | | - Education | 24 | 17 | | - Conversational Bots | 26 | 18 | | - Information Aggregation | 26 | 19 | | - Data Organization | 27 | 20 | | - Workflow Automation | 28 | 21 | | Planning AI Applications | 28 | 22 | | - Use Case Evaluation | 29 | 23 | | - Setting Expectations | 32 | 24 | | - Milestone Planning | 33 | 25 | | - Maintenance | 34 | 26 | | The AI Engineering Stack | 35 | 27 | | - Three Layers of the AI Stack | 37 | 28 | | - AI Engineering Versus ML Engineering | 39 | 29 | | - AI Engineering Versus Full-Stack Engineering | 46 | 30 | | Summary | 47 | 31 | | **2. Understanding Foundation Models** | 49 | 32 | | Training Data | 50 | 33 | | - Multilingual Models | 51 | 34 | | - Domain-Specific Models | 56 | 35 | | Modeling | 58 | 36 | | - Model Architecture | 58 | 37 | | - Model Size | 67 | 38 | | Post-Training | 78 | 39 | | - Supervised Finetuning | 80 | 40 | | - Preference Finetuning | 83 | 41 | | Sampling | 88 | 42 | | - Sampling Fundamentals | 88 | 43 | | - Sampling Strategies | 90 | 44 | | - Test Time Compute | 96 | 45 | | - Structured Outputs | 99 | 46 | | - The Probabilistic Nature of AI | 105 | 47 | | Summary | 111 | 48 | | **3. Evaluation Methodology** | 113 | 49 | | Challenges of Evaluating Foundation Models | 114 | 50 | | Understanding Language Modeling Metrics | 118 | 51 | | - Entropy | 119 | 52 | | - Cross Entropy | 120 | 53 | | - Bits-per-Character and Bits-per-Byte | 121 | 54 | | - Perplexity | 121 | 55 | | - Perplexity Interpretation and Use Cases | 122 | 56 | | Exact Evaluation | 125 | 57 | | - Functional Correctness | 126 | 58 | | - Similarity Measurements Against Reference Data | 127 | 59 | | - Introduction to Embedding | 134 | 60 | | AI as a Judge | 136 | 61 | | - Why AI as a Judge? | 137 | 62 | | - How to Use AI as a Judge | 138 | 63 | | - Limitations of AI as a Judge | 141 | 64 | | - What Models Can Act as Judges? | 145 | 65 | | Ranking Models with Comparative Evaluation | 148 | 66 | | - Challenges of Comparative Evaluation | 152 | 67 | | - The Future of Comparative Evaluation | 155 | 68 | | Summary | 156 | 69 | | **4. 
Evaluate AI Systems** | 159 | 70 | | Evaluation Criteria | 160 | 71 | | - Domain-Specific Capability | 161 | 72 | | - Generation Capability | 163 | 73 | | - Instruction-Following Capability | 172 | 74 | | - Cost and Latency | 177 | 75 | | Model Selection | 179 | 76 | | - Model Selection Workflow | 179 | 77 | | - Model Build Versus Buy | 181 | 78 | | - Navigate Public Benchmarks | 191 | 79 | | Design Your Evaluation Pipeline | 200 | 80 | | - Step 1. Evaluate All Components in a System | 200 | 81 | | - Step 2. Create an Evaluation Guideline | 202 | 82 | | - Step 3. Define Evaluation Methods and Data | 204 | 83 | | Summary | 208 | 84 | | **5. Prompt Engineering** | 211 | 85 | | Introduction to Prompting | 212 | 86 | | - In-Context Learning: Zero-Shot and Few-Shot | 213 | 87 | | - System Prompt and User Prompt | 215 | 88 | | - Context Length and Context Efficiency | 218 | 89 | | Prompt Engineering Best Practices | 220 | 90 | | - Write Clear and Explicit Instructions | 220 | 91 | | - Provide Sufficient Context | 223 | 92 | | - Break Complex Tasks into Simpler Subtasks | 224 | 93 | | - Give the Model Time to Think | 227 | 94 | | - Iterate on Your Prompts | 229 | 95 | | - Evaluate Prompt Engineering Tools | 230 | 96 | | - Organize and Version Prompts | 233 | 97 | | Defensive Prompt Engineering | 235 | 98 | | - Proprietary Prompts and Reverse Prompt Engineering | 236 | 99 | | - Jailbreaking and Prompt Injection | 238 | 100 | | - Information Extraction | 243 | 101 | | - Defenses Against Prompt Attacks | 248 | 102 | | Summary | 251 | 103 | | **6. RAG and Agents** | 253 | 104 | | RAG | 253 | 105 | | - RAG Architecture | 256 | 106 | | - Retrieval Algorithms | 257 | 107 | | - Retrieval Optimization | 268 | 108 | | - RAG Beyond Texts | 273 | 109 | | Agents | 275 | 110 | | - Agent Overview | 276 | 111 | | - Tools | 278 | 112 | | - Planning | 281 | 113 | | - Agent Failure Modes and Evaluation | 298 | 114 | | Memory | 300 | 115 | | Summary | 305 | 116 | | **7. Finetuning** | 307 | 117 | | Finetuning Overview | 308 | 118 | | When to Finetune | 311 | 119 | | - Reasons to Finetune | 311 | 120 | | - Reasons Not to Finetune | 312 | 121 | | - Finetuning and RAG | 316 | 122 | | Memory Bottlenecks | 319 | 123 | | - Backpropagation and Trainable Parameters | 320 | 124 | | - Memory Math | 322 | 125 | | - Numerical Representations | 325 | 126 | | - Quantization | 328 | 127 | | Finetuning Techniques | 332 | 128 | | - Parameter-Efficient Finetuning | 333 | 129 | | - Model Merging and Multi-Task Finetuning | 347 | 130 | | - Finetuning Tactics | 357 | 131 | | Summary | 361 | 132 | | **8. Dataset Engineering** | 363 | 133 | | Data Curation | 365 | 134 | | - Data Quality | 368 | 135 | | - Data Coverage | 370 | 136 | | - Data Quantity | 372 | 137 | | - Data Acquisition and Annotation | 377 | 138 | | Data Augmentation and Synthesis | 380 | 139 | | - Why Data Synthesis | 381 | 140 | | - Traditional Data Synthesis Techniques | 383 | 141 | | - AI-Powered Data Synthesis | 386 | 142 | | - Model Distillation | 395 | 143 | | Data Processing | 396 | 144 | | - Inspect Data | 397 | 145 | | - Deduplicate Data | 399 | 146 | | - Clean and Filter Data | 401 | 147 | | - Format Data | 401 | 148 | | Summary | 403 | 149 | | **9. 
Inference Optimization** | 405 | 150 | | Understanding Inference Optimization | 406 | 151 | | - Inference Overview | 406 | 152 | | - Inference Performance Metrics | 412 | 153 | | - AI Accelerators | 419 | 154 | | Inference Optimization | 426 | 155 | | - Model Optimization | 426 | 156 | | - Inference Service Optimization | 440 | 157 | | Summary | 447 | 158 | | **10. AI Engineering Architecture and User Feedback** | 449 | 159 | | AI Engineering Architecture | 449 | 160 | | - Step 1. Enhance Context | 450 | 161 | | - Step 2. Put in Guardrails | 451 | 162 | | - Step 3. Add Model Router and Gateway | 456 | 163 | | - Step 4. Reduce Latency with Caches | 460 | 164 | | - Step 5. Add Agent Patterns | 463 | 165 | | - Monitoring and Observability | 465 | 166 | | - AI Pipeline Orchestration | 472 | 167 | | User Feedback | 474 | 168 | | - Extracting Conversational Feedback | 475 | 169 | | - Feedback Design | 480 | 170 | | - Feedback Limitations | 490 | 171 | | Summary | 492 | 172 | | **Epilogue** | 495 | 173 | | **Index** | 497 | 174 | -------------------------------------------------------------------------------- /chapter-summaries.md: -------------------------------------------------------------------------------- 1 | # Chapter Summaries 2 | 3 | These are the summaries of each chapter taken from the book. Some of the summaries might not make sense to readers without having first read the originating chapters, but I hope that they will give you a sense of what the book is about. 4 | 5 | * [Chapter 1. Introduction to Building AI Applications with Foundation Models](#chapter-1-introduction-to-building-ai-applications-with-foundation-models) 6 | * [Chapter 2. Understanding Foundation Models](#chapter-2-understanding-foundation-models) 7 | * [Chapter 3. Evaluation Methodology](#chapter-3-evaluation-methodology) 8 | * [Chapter 4. Evaluate AI Systems](#chapter-4-evaluate-ai-systems) 9 | * [Chapter 5. Prompt Engineering](#chapter-5-prompt-engineering) 10 | * [Chapter 6. RAG and Agents](#chapter-6-rag-and-agents) 11 | * [Chapter 7. Finetuning](#chapter-7-finetuning) 12 | * [Chapter 8. Dataset Engineering](#chapter-8-dataset-engineering) 13 | * [Chapter 9. Inference Optimization](#chapter-9-inference-optimization) 14 | * [Chapter 10. AI Engineering Architecture and User Feedback](#chapter-10-ai-engineering-architecture-and-user-feedback) 15 | 16 | ## Chapter 1. Introduction to Building AI Applications with Foundation Models 17 | Table 1-3. Common generative AI use cases across consumer and enterprise applications. 18 | 19 | 20 | | Category | Examples of Consumer Use Cases | Examples of Enterprise Use Cases | 21 | |------------------------|-----------------------------------------|-------------------------------------------| 22 | | Coding | Coding | Coding | 23 | | Image and Video Production | Photo and video editing
Design | Presentation<br>Ad generation | 24 | | Writing | Email<br>Social media and blog posts | Copywriting<br>SEO<br>Reports, memos, design docs | 25 | | Education | Tutoring<br>Essay grading | Employee onboarding<br>Employee upskill training | 26 | | Conversational Bots | General chatbot<br>AI companion | Customer support<br>Product copilots | 27 | | Information Aggregation | Summarization<br>Talk-to-your-docs | Summarization<br>Market research | 28 | | Data Organization | Image search<br>Memex | Knowledge management<br>Document processing | 29 | | Workflow Automation | Travel planning<br>Event planning | Data extraction, entry, and annotation<br>Lead generation | 30 |
31 | 32 | I meant this chapter to serve two purposes. One is to explain the emergence of AI engineering as a discipline, thanks to the availability of foundation models. Two is to give an overview of the process needed to build applications on top of these models. I hope that this chapter achieved this goal. As an overview chapter, it only lightly touched on many concepts. These concepts will be explored further in the rest of the book. 33 | 34 | The chapter discussed the rapid evolution of AI in recent years. It walked through some of the most notable transformations, starting with the transition from language models to large language models, thanks to a training approach called self-supervision. It then traced how language models incorporated other data modalities to become foundation models, and how foundation models gave rise to AI engineering. 35 | 36 | The rapid growth of AI engineering is motivated by the many applications that the emerging capabilities of foundation models enable. This chapter discussed some of the most successful application patterns, both for consumers and enterprises. Despite the incredible number of AI applications already in production, we're still in the early stages of AI engineering, with countless more innovations yet to be built. 37 | 38 | Before building an application, an important yet often overlooked question is whether you should build it. This chapter discussed this question together with major considerations for building AI applications. 39 | 40 | While AI engineering is a new term, it evolved out of ML engineering, which is the overarching discipline involved with building applications with all ML models. Many principles from ML engineering are still applicable to AI engineering. However, AI engineering also brings with it new challenges and solutions. The last section of the chapter discusses the AI engineering stack, including how it has changed from ML engineering. 41 | 42 | One aspect of AI engineering that is especially challenging to capture in writing is the incredible amount of collective energy, creativity, and engineering talent that the community brings. This collective enthusiasm can often be overwhelming, as it's impossible to keep up-to-date with new techniques, discoveries, and engineering feats that seem to happen constantly. 43 | 44 | One consolidation is that since AI is great at information aggregation, it can help us aggregate and summarize all these new updates. But tools can only help to a certain extent. The more overwhelming a space is, the more important it is to have a framework to help us navigate it. This book aims to provide such a framework. 45 | 46 | The rest of the book will explore this framework step-by-step, starting with the fundamental building block of AI engineering: the foundation models that make so many amazing applications possible. 47 | 48 | ## Chapter 2. Understanding Foundation Models 49 |
50 | _Figure: The overall training workflow with pre-training, SFT, and RLHF. Image originally from my RLHF blog post (May 2023)._
53 | 54 | This chapter discussed the core design decisions when building a foundation model. Since most people will be using ready-made foundation models instead of training one from scratch, I skipped the nitty-gritty training details in favor of modeling factors that help you determine what models to use and how to use them. 55 | 56 | A crucial factor affecting a model's performance is its training data. Large models require a large amount of training data, which can be expensive and time-consuming to acquire. Model providers, therefore, often leverage whatever data is available. This leads to models that can perform well on the many tasks present in the training data, which may not include the specific task you want. This chapter went over why it's often necessary to curate training data to develop models targeting specific languages, especially low-resource languages, and specific domains. 57 | 58 | After sourcing the data, model development can begin. While model training often dominates the headlines, an important step prior to that is architecting the model. The chapter looked into modeling choices, such as model architecture and model size. The dominating architecture for language-based foundation models is transformer. This chapter explored the problems that the transformer architecture was designed to address, as well as its limitations. 59 | 60 | The scale of a model can be measured by three key numbers: the number of parameters, the number of training tokens, and the number of FLOPs needed for training. Two aspects that influence the amount of compute needed to train a model are the model size and the data size. The scaling law helps determine the optimal number of parameters and number of tokens given a compute budget. This chapter also looked at the scaling bottlenecks. Up until now, scaling up a model generally makes it better. But how long will this continue to be true? 61 | 62 | Due to the low quality of training data and self-supervision during pre-training, the resulting model might produce outputs that don't align with what users want. This is addressed by post-training, which consists of two steps: supervised finetuning and preference finetuning. Human preference is diverse and impossible to capture in a single mathematical formula, so existing solutions are far from foolproof. 63 | 64 | This chapter also covered one of my favorite topics: sampling, the process by which a model generates output tokens. Sampling makes AI models probabilistic. This probabilistic nature is what makes models like ChatGPT and Gemini great for creative tasks and fun to talk to. However, this probabilistic nature also causes inconsistency and hallucinations. 65 | 66 | Working with AI models requires building your workflows around their probabilistic nature. The rest of this book will explore how to make AI engineering, if not deterministic, at least systematic. The first step towards systematic AI engineering is to establish a solid evaluation pipeline to help us detect failures and unexpected changes. Evaluation for foundation models is so crucial that I dedicated two chapters to it, starting with the next chapter. 67 | 68 | ## Chapter 3. Evaluation Methodology 69 |
70 | _Figure 3-8. An example of an AI judge that evaluates the quality of an answer given a question._
73 | 74 | The stronger AI models become, the higher the potential for catastrophic failures, which makes evaluation even more important. At the same time, evaluating open-ended, powerful models is challenging. These challenges make many teams turn towards human evaluation. Having humans in the loop for sanity checks is always helpful, and in many cases, human evaluation is essential. However, this chapter focused on different approaches to automatic evaluation. 75 | 76 | This chapter starts with a discussion on why foundation models are harder to evaluate than traditional ML models. While many new evaluation techniques are being developed, investments in evaluation still lag behind investments in model and application development. 77 | 78 | Since many foundation models have a language model component, we zoomed into language modeling metrics, including perplexity and cross entropy. Many people I've talked to find these metrics confusing, so I included a section on how to interpret these metrics and leverage them in evaluation and data processing. 79 | 80 | This chapter then shifted the focus to the different approaches to evaluate open-ended responses, including functional correctness, similarity scores, and AI-as-a-judge. The first two evaluation approaches are exact, while AI-as-a-judge evaluation is subjective. 81 | 82 | Unlike exact evaluation, subjective metrics are highly dependent on the judge. Their scores need to be interpreted in the context of what judges are being used. Scores aimed to measure the same quality by different AI judges might not be comparable. AI judges, like all AI applications, should be iterated upon, meaning their judgments change. This makes them unreliable as benchmarks to track an application's changes over time. While promising, AI judges should be supplemented with exact evaluation, human evaluation, or both. 83 | 84 | When evaluating models, you can evaluate each model independently, and then rank them by their scores. Alternatively, you can rank them using comparative signals: which of the two models is better? Comparative evaluation is common in sports, especially chess, and is gaining traction in AI evaluation. Both comparative evaluation and the post-training alignment process need preference signals, which are expensive to collect. This motivated the development of preference models: specialized AI judges that predict which response users prefer. 85 | 86 | While language modeling metrics and hand-designed similarity measurements have existed for some time, AI-as-a-judge and comparative evaluation have only gained adoption with the emergence of foundation models. Many teams are figuring out how to incorporate them into their evaluation pipelines. Figuring out how to build a reliable evaluation pipeline to evaluate open-ended applications is the topic of the next chapter. 87 | 88 | ## Chapter 4. Evaluate AI Systems 89 |
90 | _Figure 4-5. An overview of the evaluation workflow to evaluate models for your application._
93 | 94 | This is one of the hardest, but I believe one of the most important, topics that I've written about with regard to AI. Not having a reliable evaluation pipeline is one of the biggest blockers to AI adoption. While evaluation takes time, a reliable evaluation pipeline will enable you to reduce risks, discover opportunities to improve performance, and benchmark progresses, which will all save you time and headaches down the line. 95 | 96 | Given an increasing number of readily available foundation models, for most application developers, the challenge is no longer in developing models but in selecting the right models for your application. This chapter discussed a list of criteria that are often used to evaluate models for applications, and how they are evaluated. It discussed how to evaluate both domain-specific capabilities and generation capabilities, including factual consistency and safety. Many criteria to evaluate foundation models evolved from traditional NLP, including fluency, coherence, and faithfulness. 97 | 98 | To help answer the question of whether to host a model or to use a model API, this chapter outlined the pros and cons of each approach along seven axes, including data privacy, data lineage, performance, functionality, control, and cost. This decision, like all the build versus buy decisions, is unique to every team, depending on not only what the team needs but also what the team wants. 99 | 100 | This chapter also explored the thousands of available public benchmarks. Public benchmarks can help you weed out bad models, but won't help you find the best models for your applications. Public benchmarks are also likely contaminated, as their data is included in the training data of many models. There are public leaderboards that aggregate multiple benchmarks to rank models, but how benchmarks are selected and aggregated is not a clear process. The lessons learned from public leaderboards are helpful for model selection, as model selection is akin to creating a private leaderboard to rank models based on your needs. 101 | 102 | This chapter ends with how to use all the evaluation techniques and criteria discussed in the last chapter and how to create an evaluation pipeline for your application. No perfect evaluation method exists. It's impossible to capture the ability of a high-dimensional system using one- or few-dimensional scores. There are many limitations and biases associated with evaluating modern AI systems. However, this doesn't mean we shouldn't do it. Combining different methods and approaches can help mitigate many of these challenges. 103 | 104 | Even though dedicated discussions on evaluation end here, evaluation will come up again and again, not just throughout the book but also throughout your application development process. Chapter 6 explores on evaluating retrieval and agentic systems, while Chapters 7 and 9 focus on calculating a model's memory usage, latency, and costs. Data quality verification is addressed in Chapter 8, and using user feedback to evaluate production applications in Chapter 10. 105 | 106 | With that, let's move onto the actual model adaptation process, starting with a topic that many people associate with AI engineering: prompt engineering. 107 | 108 | ## Chapter 5. Prompt Engineering 109 |
110 | _Figure 5-1. A simple example to show the anatomy of a prompt._
113 | 114 | Foundation models can do many things, but you must tell them exactly what you want. The process of crafting an instruction to get a model to do what you want is called prompt engineering. How much crafting is needed depends on how sensitive the model is to prompts. If a small change can cause a big change in the model's response, more crafting will be necessary. 115 | 116 | You can think of prompt engineering as human–AI communication. Anyone can communicate, but not everyone can communicate well. Prompt engineering is easy to get started, which misleads many into thinking that it's easy to do it well. 117 | 118 | The first part of this chapter discusses the anatomy of a prompt, why in-context learning works, and best prompt engineering practices. Whether you're communicating with AI or other humans, clear instructions with examples and relevant information are essential. Simple tricks like asking the model to slow down and think step by step can yield surprising improvements. Just like humans, AI models have their quirks and biases, which need to be considered for a productive relationship with them. 119 | 120 | Foundation models are useful because they can follow instructions. However, this ability also opens them up to prompt attacks in which bad actors get models to follow malicious instructions. This chapter discusses different attack approaches and potential defenses against them. As security is an ever-evolving cat-and-mouse game, no security measurements will be foolproof. Security risks will remain a significant roadblock for AI adoption in high-stakes environments. 121 | 122 | This chapter also discusses techniques to write better instructions to get models to do what you want. However, to accomplish a task, a model needs not just instructions but also relevant context. How to provide a model with relevant information will be discussed in the next chapter. 123 | 124 | ## Chapter 6. RAG and Agents 125 |
126 | _Figure 6-3. A high-level view of how an embedding-based, or semantic, retriever works._
129 | 130 | Given the popularity of RAG and the potential of agents, early readers have mentioned that this is the chapter they're most excited about. 131 | 132 | This chapter started with RAG, the pattern that emerged first between the two. Many tasks require extensive background knowledge that often exceeds a model's context window. For example, code copilots might need access to entire codebases, and research assistants may need to analyze multiple books. Originally developed to overcome a model's context limitations, RAG also enables more efficient use of information, improving response quality while reducing costs. From the early days of foundation models, it was clear that the RAG pattern would be immensely valuable for a wide range of applications, and it has since been rapidly adopted across both consumer and enterprise use cases. 133 | 134 | RAG employs a two-step process. It first retrieves relevant information from external memory and then uses this information to generate more accurate responses. The success of a RAG system depends on the quality of its retriever. Term-based retrievers, such as Elasticsearch and BM25, are much lighter to implement and can provide strong baselines. Embedding-based retrievers are more computationally intensive but have the potential to outperform term-based algorithms. 135 | 136 | Embedding-based retrieval is powered by vector search, which is also the backbone of many core internet applications such as search and recommender systems. Many vector search algorithms developed for these applications can be used for RAG. 137 | 138 | The RAG pattern can be seen as a special case of agent where the retriever is a tool the model can use. Both patterns allow a model to circumvent its context limitation and stay more up-to-date, but the agentic pattern can do even more than that. An agent is defined by its environment and the tools it can access. In an AI-powered agent, AI is the planner that analyzes its given task, considers different solutions, and picks the most promising one. A complex task can require many steps to solve, which requires a powerful model to plan. A model's ability to plan can be augmented with reflection and a memory system to help it keep track of its progress. 139 | 140 | The more tools you give a model, the more capabilities the model has, enabling it to solve more challenging tasks. However, the more automated the agent becomes, the more catastrophic its failures. Tool use exposes agents to many security risks discussed in the Chapter 5. For agents to work in the real world, rigorous defensive mechanisms need to be put in place. 141 | 142 | Both RAG and agents work with a lot of information, which often exceeds the maximum context length of the underlying model. This necessitates the introduction of a memory system for managing and using all the information a model has. This chapter ended with a short discussion on what this component looks like. 143 | 144 | RAG and agents are both prompt-based methods, as they influence the model's quality solely through inputs without modifying the model itself. While they can enable many incredible applications, modifying the underlying model can open up even more possibilities. How to do so will be the topic of the next chapter. 145 | 146 | ## Chapter 7. Finetuning 147 |

148 | Figure 7-3. Example application development flows. After simple retrieval (such as term-based retrieval), whether to experiment with more complex retrieval (such as hybrid search) or finetuning depends on each application and its failure modes. 149 |
150 |
151 | 152 | Outside of the evaluation chapters, finetuning has been the most challenging chapter to write. It touched on a wide range of concepts, both old (transfer learning) and new (PEFT), fundamental (low-rank factorization) and experimental (model merging), mathematical (memory calculation) and tactical (hyperparameter tuning). Arranging all these different aspects into a coherent structure while keeping them accessible was difficult. 153 | 154 | The process of finetuning itself isn't hard. Many finetuning frameworks handle the training process for you. These frameworks can even suggest common finetuning methods with sensible default hyperparameters. 155 | 156 | However, the context surrounding finetuning is complex. It starts with whether you should even finetune a model. This chapter started with the reasons for finetuning and the reasons for not finetuning. It also discussed one question that I have been asked many times: when to finetune and when to do RAG. 157 | 158 | In its early days, finetuning was similar to pre-training—both involved updating all of the model's weights. However, as models increased in size, full finetuning became impractical for most practitioners. The more parameters to update during finetuning, the more memory finetuning needs. Most practitioners don't have access to sufficient resources (hardware, time, and data) to do full finetuning with foundation models. 159 | 160 | Many finetuning techniques have been developed with the same motivation: to achieve strong performance with a minimal memory footprint. For example, PEFT reduces finetuning's memory requirements by reducing the number of trainable parameters. Quantized training, on the other hand, mitigates this memory bottleneck by reducing the number of bits needed to represent each value. 161 | 162 | After giving an overview of PEFT, the chapter zoomed into LoRA—why it works and how it works. LoRA has many properties that make it popular among practitioners. On top of being parameter-efficient and data-efficient, it's also modular, making it much easier to serve and combine multiple LoRA models. 163 | 164 | The idea of combining finetuned models brought the chapter to model merging, whose goal is to combine multiple models into one model that works better than any of them separately. This chapter discussed the many use cases of model merging, from on-device deployment to model upscaling, as well as general approaches to model merging. 165 | 166 | A comment I often hear from practitioners is that finetuning is easy, but getting data for finetuning is hard. Obtaining high-quality annotated data, especially instruction data, is challenging. The next chapter will dive into these challenges. 167 | 168 | ## Chapter 8. Dataset Engineering 169 |

170 | Figure 8-3. The performance gain curve with different dataset sizes can help you estimate the impact of additional training examples on your model's performance. 171 |
172 |
173 | 174 | Even though the actual process of creating training data is incredibly intricate, the principles of creating a dataset are surprisingly straightforward. To build a dataset to train a model, you start by thinking through the behaviors you want your model to learn and then design a dataset to show these behaviors. Due to the importance of data, teams are introducing dedicated data roles responsible for acquiring appropriate datasets while ensuring privacy and compliance. 175 | 176 | What data you need depends not only on your use case but also on the training phase. Pre-training requires different data from instruction finetuning and preference finetuning. However, dataset design across training phases shares the same three core criteria: quality, coverage, and quantity. 177 | 178 | While the amount of data a model is trained on is what has been grabbing headlines, having high-quality data with sufficient coverage is just as important. A small amount of high-quality data can outperform a large amount of noisy data. Similarly, many teams have found that increasing the diversity of their datasets is key to improving their models' performance. 179 | 180 | Due to the challenge of acquiring high-quality data, many teams have turned to synthetic data. While generating data programmatically has long been a goal, it wasn't until AI could create realistic, complex data that synthetic data became a practical solution for many more use cases. This chapter discussed different techniques for data synthesis, with a deep dive into synthesizing instruction data for finetuning. 181 | 182 | Just like real data, synthetic data must be evaluated to ensure its quality before being used to train models. Evaluating AI-generated data is just as tricky as evaluating other AI outputs, and people are more likely to use generated data that they can reliably evaluate. 183 | 184 | Data is hard because many steps in dataset creation aren't easily automatable. It's hard to annotate data, but it's even harder to create annotation guidelines. It's hard to automate data generation, but it's even harder to automate verifying it. While data synthesis helps generate more data, you can't automate thinking through what data you want. You can't easily automate writing annotation guidelines. You can't automate paying attention to details. 185 | 186 | Challenging problems lead to creative solutions. One thing that stood out to me when doing research for this chapter is how much creativity is involved in dataset design. There are so many ways people construct and evaluate data. I hope that the range of data synthesis and verification techniques discussed in this chapter will give you inspiration for how to design your dataset. 187 | 188 | Let's say that you've curated a wonderful dataset that allows you to train an amazing model. How should you serve this model? The next chapter will discuss how to optimize inference for latency and cost. 189 | 190 | ## Chapter 9. Inference Optimization 191 |

192 | Figure 9-1. A simple inference service. 193 |
194 |
195 | 196 | A model's usability depends heavily on its inference cost and latency. Cheaper inference makes AI-powered decisions more affordable, while faster inference enables the integration of AI into more applications. Given the massive potential impact of inference optimization, it has attracted a remarkable number of talented individuals who continually come up with innovative approaches. 197 | 198 | Before we start making things more efficient, it's necessary to understand how efficiency is measured. This chapter started with common efficiency metrics for latency, throughput, and utilization. For language model-based inference, latency can be broken into **time to first token** (TTFT), which is influenced by the prefilling phase, and **time per output token** (TPOT), which is influenced by the decoding phase. Throughput metrics are directly related to cost. There's a tradeoff between latency and throughput. You can potentially reduce cost if you're okay with increased latency, and reducing latency often involves increasing cost. 199 | 200 | How efficiently a model can run depends on the hardware it runs on. For this reason, this chapter also provided a quick overview of AI hardware and what it takes to optimize models on different accelerators. 201 | 202 | This chapter then continued with different techniques for inference optimization. Given the availability of model APIs, most application developers will use these APIs with their built-in optimizations instead of implementing these techniques themselves. While these techniques might not be relevant to all application developers, I believe that understanding what techniques are possible can be helpful for evaluating the efficiency of model APIs. 203 | 204 | In this chapter, I focused on optimization at the model level and the inference service level. Model-level optimization often requires changing the model itself, which can change the model's behavior. Inference service-level optimization, on the other hand, typically keeps the model intact and only changes how it's served. 205 | 206 | Model-level techniques include model-agnostic techniques like quantization and distillation. Different model architectures require their own optimizations. For example, because a key bottleneck of transformer models is in the attention mechanism, many optimization techniques involve making attention more efficient, including KV cache management and writing attention kernels. A big bottleneck for an autoregressive language model is in its autoregressive decoding process, and consequently, many techniques have been developed to address it, too. 207 | 208 | Inference service-level techniques include various batching and parallelism strategies. There are also techniques developed especially for autoregressive language models, including prefilling/decoding decoupling and prompt caching. 209 | 210 | The choice of optimization techniques depends on your workloads. For example, KV caching is significantly more important for workloads with long contexts than those with short contexts. Prompt caching, on the other hand, is crucial for workloads involving long, overlapping prompt segments or multi-turn conversations. The choice also depends on your performance requirements. For instance, if low latency is a higher priority than cost, you might want to scale up replica parallelism. While more replicas require additional machines, each machine handles fewer requests, allowing it to allocate more resources per request and, thus, improve response time.
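To make the latency metrics above concrete, here is a minimal back-of-the-envelope sketch (not code or numbers from the book; the helper function and the TTFT/TPOT values are illustrative assumptions). It approximates total request latency as TTFT plus one TPOT for each subsequent output token.

```python
# A rough latency estimate from the two metrics discussed above.
# The TTFT/TPOT values below are made-up illustrative numbers.

def estimate_latency(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Approximate total latency: time to first token, then one TPOT per remaining token."""
    return ttft_s + tpot_s * max(output_tokens - 1, 0)

if __name__ == "__main__":
    # Example: 200 ms to the first token, 30 ms per subsequent token.
    for n in (50, 200, 1000):
        print(f"{n:>5} output tokens -> ~{estimate_latency(0.2, 0.03, n):.2f} s")
```

An estimate like this also makes the tradeoffs tangible: for long outputs, total latency is dominated by TPOT (the decoding phase), while for short outputs, TTFT (the prefilling phase) matters relatively more.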
211 | 212 | However, across various use cases, the most impactful techniques are typically quantization (which generally works well across models), tensor parallelism (which both reduces latency and enables serving larger models), replica parallelism (which is relatively straightforward to implement), and attention mechanism optimization (which can significantly accelerate transformer models). 213 | 214 | Inference optimization concludes the list of model adaptation techniques covered in this book. The next chapter will explore how to integrate these techniques into a cohesive system. 215 | 216 | ## Chapter 10. AI Engineering Architecture and User Feedback 217 |

218 | Figure 10-10. A common generative AI application architecture. 219 |
220 |
221 | If each previous chapter focused on a specific aspect of AI engineering, this chapter looked into the process of building applications on top of foundation models as a whole. 222 | 223 | The chapter consisted of two parts. The first part discussed a common architecture for AI applications. While the exact architecture for an application might vary, this high-level architecture provides a framework for understanding how different components fit together. I used a step-by-step approach to building this architecture to discuss the challenges at each step and what techniques you can use to address them. 224 | 225 | While it's necessary to separate components to keep your system modular and maintainable, this separation is fluid. There are many ways in which components can overlap in functionality. For example, guardrails can be implemented in the inference service, the model gateway, or as a standalone component. 226 | 227 | Each additional component can potentially make your system more capable, safer, or faster, but it will also increase the system's complexity, exposing it to new failure modes. One integral part of any complex system is monitoring and observability. Observability involves understanding how your system fails, designing metrics and alerts around failures, and ensuring that your system is designed in a way that makes these failures detectable and traceable. While many observability best practices and tools from software engineering and traditional machine learning are applicable to AI engineering applications, foundation models introduce new failure modes, which require additional metrics and design considerations. 228 | 229 | At the same time, the conversational interface enables new types of user feedback, which you can leverage for analytics, product improvement, and the data flywheel. The second part of the chapter discussed various forms of conversational feedback and how to design your application to effectively collect it. 230 | 231 | Traditionally, user feedback design has been seen as a product responsibility rather than an engineering one, and as a result, it is often overlooked by engineers. However, since user feedback is a crucial source of data for continuously improving AI models, more AI engineers are now becoming involved in the process to ensure they receive the data they need. This reinforces the idea from Chapter 1 that, compared to traditional ML engineering, AI engineering is moving closer to product. This is because of the increasing importance of both the data flywheel and product experience as competitive advantages. 232 | 233 | Many AI challenges are, at their core, system problems. To solve them, it's often necessary to step back and consider the system as a whole. A single problem might be addressed by different components working independently, or a solution could require the collaboration of multiple components. A thorough understanding of the system is essential to solving real problems, unlocking new possibilities, and ensuring safety. 234 | -------------------------------------------------------------------------------- /resources.md: -------------------------------------------------------------------------------- 1 | # Resources 2 | During the process of writing *AI Engineering*, I went through many papers, case studies, blog posts, repos, tools, etc. The book itself has 1200+ reference links and I've been tracking [1000+ generative AI GitHub repos](https://huyenchip.com/llama-police).
This document contains the resources I found the most helpful for understanding different areas. 3 | 4 | If there are resources that you've found helpful but not yet included, feel free to open a PR. 5 | 6 | - [ML Theory Fundamentals](#ml-theory-fundamentals) 7 | - [Chapter 1. Planning Applications with Foundation Models](#chapter-1-planning-applications-with-foundation-models) 8 | - [Chapter 2. Understanding Foundation Models](#chapter-2-understanding-foundation-models) 9 | - [Training large models](#training-large-models) 10 | - [Sampling](#sampling) 11 | - [Context length and context efficiency](#context-length-and-context-efficiency) 12 | - [Chapters 3 + 4. Evaluation Methodology](#chapters-3--4-evaluation-methodology) 13 | - [Chapter 5. Prompt Engineering](#chapter-5-prompt-engineering) 14 | - [Prompt engineering guides](#prompt-engineering-guides) 15 | - [Defensive prompt engineering](#defensive-prompt-engineering) 16 | - [Chapter 6. RAG and Agents](#chapter-6-rag-and-agents) 17 | - [RAG](#rag) 18 | - [Agents](#agents) 19 | - [Chapter 7. Finetuning](#chapter-7-finetuning) 20 | - [Chapter 8. Dataset Engineering](#chapter-8-dataset-engineering) 21 | - [Public datasets](#public-datasets) 22 | - [Chapter 9. Inference Optimization](#chapter-9-inference-optimization) 23 | - [Chapter 10. AI Engineering Architecture and User Feedback](#chapter-10-ai-engineering-architecture-and-user-feedback) 24 | - [Bonus: Organization engineering blogs](#bonus-organization-engineering-blogs) 25 | 26 | ## ML Theory Fundamentals 27 | While you don't need an ML background to start building with foundation models, a rough understanding of how AI works under the hood is useful to prevent misuse. Familiarity with ML theory will make you much more effective. 28 | 29 | 1. [Lecture notes] [Stanford CS 231N](https://cs231n.github.io/): a longtime favorite introductory course on neural networks. 30 | 31 | - [Videos] I'd recommend watching lectures 1 to 7 from the 2017 course [video recordings](https://www.youtube.com/watch?v=vT1JzLTH4G4&list=PL3FW7Lu3i5JvHM8ljYj-zLfQRF3EO8sYv). They cover the fundamentals that haven't changed. 32 | - [Videos] Andrej Karpathy's [Neural Networks: Zero to Hero](https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ) is more hands-on; he shows how to implement several models from scratch. 33 | 2. [Book] [Machine Learning: A Probabilistic Perspective](https://probml.github.io/pml-book/book1.html) (Kevin P. Murphy, 2012) 34 | 35 | Foundational, comprehensive, though a bit intense. This used to be many of my friends' go-to book when preparing for theory interviews for research positions. 36 | 3. [Aman's Math Primers](https://aman.ai/primers/math/) 37 | 38 | A good note that covers basic differential calculus and probability concepts. 39 | 4. I also made a list of resources for MLOps, which includes a section for [ML + engineering fundamentals](https://huyenchip.com/mlops/#ml_engineering_fundamentals). 40 | 5. I wrote a brief [1500-word note](https://github.com/chiphuyen/dmls-book/blob/main/basic-ml-review.md) on how an ML model learns and concepts like objective function and learning procedure. 41 | 6. *AI Engineering* also covers the important concepts immediately relevant to the discussion: 42 | 43 | - Transformer architecture (Chapter 2) 44 | - Embedding (Chapter 3) 45 | - Backpropagation and trainable parameters (Chapter 7) 46 | 47 | ## Chapter 1. Planning Applications with Foundation Models 48 | 49 | 1.
[GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models](https://arxiv.org/abs/2303.10130) (OpenAI, 2023) 50 | 51 | OpenAI (2023) has excellent research on how exposed different occupations are to AI. They defined a task as exposed if AI and AI-powered software can reduce the time needed to complete this task by at least 50%. An occupation with 80% exposure means that 80% of this occupation's tasks are considered exposed. According to the study, occupations with 100% or close to 100% exposure include interpreters and translators, tax preparers, web designers, and writers. Some of them are shown in Figure 1-5. Unsurprisingly, occupations with no exposure to AI include cooks, stonemasons, and athletes. This study gives a good idea of what use cases AI is good for. 52 | 1. [Applied LLMs](https://applied-llms.org/) (Yan et al., 2024) 53 | 54 | Eugene Yan and co. shared their learnings from one year of deploying LLM applications. Many helpful tips! 55 | 1. [Musings on Building a Generative AI Product](https://www.linkedin.com/blog/engineering/generative-ai/musings-on-building-a-generative-ai-product) (Juan Pablo Bottaro and Karthik Ramgopal, LinkedIn, 2024) 56 | 57 | One of the best reports I've read on deploying LLM applications: what worked and what didn't. They discussed structured outputs, latency vs. throughput tradeoffs, the challenges of evaluation (they spent most of their time on creating annotation guidelines), and the last-mile challenge of building gen AI applications. 58 | 1. [Apple's human interface guideline](https://developer.apple.com/design/human-interface-guidelines/machine-learning) for designing ML applications 59 | 60 | Outlines how to think about the roles of AI and humans in your application, which influences the interface decisions. 61 | 1. [LocalLlama subreddit](https://www.reddit.com/r/LocalLLaMA/): useful to check from time to time to see what people are up to. 62 | 1. [State of AI Report](https://www.stateof.ai/) (updated yearly): very comprehensive. It's useful to skim through to see what you've missed. 63 | 1. [16 Changes to the Way Enterprises Are Building and Buying Generative AI](https://a16z.com/generative-ai-enterprise-2024/) (Andreessen Horowitz, 2024) 64 | 1. ["Like Having a Really Bad PA": The Gulf between User Expectation and Experience of Conversational Agents](https://dl.acm.org/doi/abs/10.1145/2858036.2858288) (Luger and Sellen, 2016) 65 | 66 | A solid, ahead-of-its-time paper on user experience with conversational agents. It makes a great case for the value of dialogue interfaces and what's needed to make them useful, featuring in-depth interviews with 14 users. "*It has been argued that the true value of dialogue interface systems over direct manipulation (GUI) can be found where task complexity is greatest.*" 67 | 1. [Stanford Webinar - How AI is Changing Coding and Education, Andrew Ng & Mehran Sahami](https://www.youtube.com/watch?v=J91_npj0Nfw&ab_channel=StanfordOnline) (2024) 68 | 69 | A great discussion that shows how Stanford's CS department thinks about what CS education will look like in the future. My favorite quote: "CS is about systematic thinking, not writing code." 70 | 1. [Professional artists: how much has AI art affected your career? - 1 year later : r/ArtistLounge](https://www.reddit.com/r/ArtistLounge/comments/1ap0cm3/professional_artists_how_much_has_ai_art_affected/) 71 | 72 | Many people share their experiences of how AI has affected their work.
E.g.: 73 | 74 | *"From time to time, I am sitting in meetings where managers dream of replacing coders, writers and visual artists with AI. I hate those meetings and try to avoid them, but I still get involved from time to time. All my life, I loved coding & art. But nowadays, I often feel this weird sadness in my heart."* 75 | 76 | ## Chapter 2. Understanding Foundation Models 77 | 78 | ### Training large models 79 | 80 | Papers detailing the training process of important models are gold mines. I'd recommend reading all of them. But if you can only pick 3, I'd recommend Gopher, InstructGPT, and Llama 3. 81 | 82 | 1. [GPT-2] [Language Models are Unsupervised Multitask Learners](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) (OpenAI, 2019) 83 | 2. [GPT-3] [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165) (OpenAI, 2020) 84 | 3. [Gopher] [Scaling Language Models: Methods, Analysis & Insights from Training Gopher](https://arxiv.org/abs/2112.11446) (DeepMind, 2021) 85 | 4. [InstructGPT] [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155) (OpenAI, 2022) 86 | 5. [Chinchilla] [Training Compute-Optimal Large Language Models](https://arxiv.org/abs/2203.15556) (DeepMind, 2022) 87 | 6. [Qwen technical report](https://arxiv.org/abs/2309.16609) (Alibaba, 2023) 88 | 7. [Qwen2 Technical Report](https://arxiv.org/abs/2407.10671) (Alibaba, 2024) 89 | 8. [Constitutional AI: Harmlessness from AI Feedback](https://arxiv.org/abs/2212.08073) (Anthropic, 2022) 90 | 9. [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971) (Meta, 2023) 91 | 10. [Llama 2: Open Foundation and Fine-Tuned Chat Models](https://arxiv.org/abs/2307.09288) (Meta, 2023) 92 | 11. [The Llama 3 Herd of Models](https://arxiv.org/abs/2407.21783) (Meta, 2024) 93 | 94 | This paper is so good. The section on synthetic data generation and verification is especially important. 95 | 12. [Yi: Open Foundation Models by 01.AI](https://arxiv.org/abs/2403.04652) (01.AI, 2024) 96 | 97 | **Scaling laws** 98 | 99 | 1. [From bare metal to high performance training: Infrastructure scripts and best practices - imbue](https://imbue.com/research/70b-infrastructure/) 100 | 101 | Discusses how to scale compute to train large models. It uses 4,092 H100 GPUs spread across 511 computers, with 8 GPUs per computer. 102 | 2. [Scaling Laws for Neural Language Models](https://arxiv.org/abs/2001.08361) (OpenAI, 2020) 103 | 104 | An earlier scaling law, covering only up to 1B non-embedding params and 1B tokens. 105 | 3. [Training Compute-Optimal Large Language Models](https://arxiv.org/abs/2203.15556) (Hoffmann et al., 2022) 106 | 107 | Known as the Chinchilla scaling law, this might be the most well-known scaling law paper. 108 | 4. [Scaling Data-Constrained Language Models](https://proceedings.neurips.cc/paper_files/paper/2023/hash/9d89448b63ce1e2e8dc7af72c984c196-Abstract-Conference.html) (Muennighoff et al., 2023) 109 | 110 | *"We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data. However, with more repetition, the value of adding compute eventually decays to zero."* 111 | 5. [Scaling Instruction-Finetuned Language Models](https://arxiv.org/abs/2210.11416) (Chung et al., 2022) 112 | 113 | A very good paper that talks about the importance of diversity of instruction data. 114 | 6.
[Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws](https://arxiv.org/abs/2401.00448) (Sardana et al., 2023) 115 | 7. [AI models are devouring energy. Tools to reduce consumption are here, if data centers will adopt](https://www.ll.mit.edu/news/ai-models-are-devouring-energy-tools-reduce-consumption-are-here-if-data-centers-will-adopt) (MIT Lincoln Laboratory, 2023) 116 | 8. [Will we run out of data? Limits of LLM scaling based on human-generated data](https://arxiv.org/abs/2211.04325) (Villalobos et al., 2022) 117 | 118 | **Fun stuff** 119 | 120 | 1. [Evaluating feature steering: A case study in mitigating social biases](https://www.anthropic.com/research/evaluating-feature-steering) (Anthropic, 2024) 121 | 122 | This area of research is awesome. They focused on 29 [features related to social biases](https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html#safety-relevant-bias) and found that feature steering can influence specific social biases, but it may also produce unexpected 'off-target effects'. 123 | 2. [Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet](https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html) (Anthropic, 2024) 124 | 3. [GitHub - ianand/spreadsheets-are-all-you-need](https://github.com/ianand/spreadsheets-are-all-you-need) 125 | 126 | *"Implements the forward pass of GPT2 (an ancestor of ChatGPT) entirely in Excel using standard spreadsheet functions."* 127 | 4. [BertViz: Visualize Attention in NLP Models (BERT, GPT2, BART, etc.)](https://github.com/jessevig/bertviz) 128 | 129 | A helpful visualization of multi-head attention in action, developed to show how BERT works. 130 | 131 | ### Sampling 132 | 133 | 1. [A Guide to Structured Generation Using Constrained Decoding](https://www.aidancooper.co.uk/constrained-decoding/) (Aidan Cooper, 2024) 134 | 135 | An in-depth, detailed tutorial on generating structured outputs. 136 | 2. [Fast JSON Decoding for Local LLMs with Compressed Finite State Machine](https://lmsys.org/blog/2024-02-05-compressed-fsm/) (LMSYS, 2024) 137 | 3. [How fast can grammar-structured generation be?](https://blog.dottxt.co/how-fast-cfg.html) (Brandon T. Willard, 2024) 138 | 139 | I also wrote a post on [sampling for text generation](https://huyenchip.com/2024/01/16/sampling.html) (2024). 140 | 141 | ### Context length and context efficiency 142 | 143 | 1. [Everything About Long Context Fine-tuning](https://huggingface.co/blog/wenbopan/long-context-fine-tuning) (Wenbo Pan, 2024) 144 | 2. [Data Engineering for Scaling Language Models to 128K Context](https://arxiv.org/abs/2402.10171v1) (Yu et al., 2024) 145 | 3. [The Secret Sauce behind 100K context window in LLMs: all tricks in one place](https://blog.gopenai.com/how-to-speed-up-llms-and-use-100k-context-window-all-tricks-in-one-place-ffd40577b4c) (Galina Alperovich, 2023) 146 | 4. [Extending Context is Hard…but not Impossible](https://kaiokendev.github.io/context) (kaioken, 2023) 147 | 5. [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) (Su et al., 2021) 148 | 149 | Introduces RoPE, a technique for handling positional embeddings that enables transformer-based models to handle longer context lengths. 150 | 151 | ## Chapters 3 + 4. Evaluation Methodology 152 | 153 | 1.
[Challenges in evaluating AI systems](https://www.anthropic.com/news/evaluating-ai-systems) (Anthropic, 2023) 154 | 155 | Discusses the limitations of common AI benchmarks to show why evaluation is so hard. 156 | 2. [Holistic Evaluation of Language Models](https://arxiv.org/abs/2211.09110) (Liang et al., Stanford 2022) 157 | 3. [Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models](https://arxiv.org/abs/2206.04615) (Google, 2022) 158 | 4. [Open-LLM performances are plateauing, let's make the leaderboard steep again](https://huggingface.co/spaces/open-llm-leaderboard/blog) (Hugging Face, 2024) 159 | 160 | A helpful explanation of why Hugging Face chose certain benchmarks for their leaderboard, which is a useful reference for selecting benchmarks for your personal leaderboard. 161 | 5. [Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena](https://arxiv.org/abs/2306.05685) (Zheng et al., 2023) 162 | 6. [LLM Task-Specific Evals that Do & Don't Work](https://eugeneyan.com/writing/evals/) (Eugene Yan, 2024) 163 | 7. [Your AI Product Needs Evals](https://hamel.dev/blog/posts/evals/) (Hamel Husain, 2024) 164 | 8. [Stop Uploading Test Data in Plain Text: Practical Strategies for Mitigating Data Contamination by Evaluation Benchmarks](https://arxiv.org/abs/2305.10160) (Google & AI2, May 2023) 165 | 9. [alopatenko/LLMEvaluation](https://github.com/alopatenko/LLMEvaluation) (Andrei Lopatenko) 166 | 167 | A large collection of evaluation resources. The [slide deck](https://github.com/alopatenko/LLMEvaluation/blob/main/LLMEvaluation.pdf) on eval has a lot of pointers too. 168 | 10. [Discovering Language Model Behaviors with Model-Written Evaluations](https://arxiv.org/abs/2212.09251) (Perez et al., 2022) 169 | 170 | A fun paper that uses AI to discover novel AI behaviors. They use methods with various degrees of automation to generate evaluation sets for 154 diverse behaviors. 171 | 11. [Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models](https://arxiv.org/abs/2309.01219) (Zhang et al., 2023) 172 | 12. OpenRouter's [LLM Rankings](https://openrouter.ai/rankings) shows the top open source models on their platform, ranked by their usage (token volume). This can help you evaluate open source models by popularity. I wish more inference services would publish statistics like this. 173 | 174 | ## Chapter 5. Prompt Engineering 175 | 176 | ### Prompt engineering guides 177 | 178 | 1. [Anthropic's Prompt Engineering Interactive Tutorial](https://docs.google.com/spreadsheets/d/19jzLgRruG9kjUQNKtCg1ZjdD6l6weA6qRXG5zLIAhC8/edit#gid=1733615301) 179 | 180 | Practical, comprehensive, and fun. The Google Sheets-based interactive exercises make it easy to experiment with different prompts and see immediately what works and what doesn't. I'm surprised other model providers don't have similar interactive guides. 181 | 2. [Brex's prompt engineering guide](https://github.com/brexhq/prompt-engineering) 182 | 183 | Contains a list of example prompts that Brex uses internally. 184 | 3. [Meta's prompt engineering guide](https://llama.meta.com/docs/how-to-guides/prompting/) 185 | 4. [Google's Gemini prompt engineering guide](https://services.google.com/fh/files/misc/gemini-for-google-workspace-prompting-guide-101.pdf) 186 | 5. [dair-ai/Prompt-Engineering-Guide](https://github.com/dair-ai/Prompt-Engineering-Guide) 187 | 6.
Collections of prompt examples from [OpenAI](https://platform.openai.com/examples), [Anthropic](https://docs.anthropic.com/en/prompt-library/library), and [Google](https://console.cloud.google.com/vertex-ai/generative/prompt-gallery). 188 | 7. [Larger language models do in-context learning differently](https://arxiv.org/abs/2303.03846) (Wei et al., 2023) 189 | 190 | 8. [How I think about LLM prompt engineering](https://fchollet.substack.com/p/how-i-think-about-llm-prompt-engineering) (Francois Chollet, 2023) 191 | 192 | ### Defensive prompt engineering 193 | 194 | 1. [Offensive ML Playbook](https://wiki.offsecml.com/Welcome+to+the+Offensive+ML+Playbook) 195 | 196 | Has many resources on adversarial ML and how to defend your ML systems against attacks, including both text and image attacks. 197 | 2. [The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions](https://arxiv.org/abs/2404.13208) (OpenAI, 2024) 198 | 199 | A good paper on how OpenAI trained a model to follow a prompt hierarchy to protect it from jailbreaking. 200 | 3. [Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection](https://arxiv.org/abs/2302.12173) (Greshake et al., 2023) 201 | 202 | Has a great list of examples of indirect prompt injections in the appendix. 203 | 4. [Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks](https://arxiv.org/abs/2302.05733) (Kang et al., 2023) 204 | 5. [Scalable Extraction of Training Data from (Production) Language Models](https://arxiv.org/abs/2311.17035) (Nasr et al., 2023) 205 | 6. [How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs](https://arxiv.org/abs/2401.06373) (Zeng et al., 2024) 206 | 7. [LLM Security](https://llmsecurity.net/): A collection of LLM security papers. 207 | 8. Tools that help automate security probing include [PyRIT](https://github.com/Azure/PyRIT), [Garak](https://github.com/leondz/garak/), [persuasive_jailbreaker](https://github.com/CHATS-lab/persuasive_jailbreaker), [GPTFUZZER](https://arxiv.org/abs/2309.10253), and [MasterKey](https://arxiv.org/abs/2307.08715). 208 | 9. [Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations](https://arxiv.org/abs/2312.06674) (Meta, 2023) 209 | 10. [AI Security Overview](https://owaspai.org/docs/ai_security_overview/#threat-model) (AI Exchange) 210 | 211 | ## Chapter 6. RAG and Agents 212 | 213 | ### RAG 214 | 215 | 1. [Reading Wikipedia to Answer Open-Domain Questions](https://arxiv.org/abs/1704.00051) (Chen et al., 2017) 216 | 217 | Introduces the RAG pattern to help with knowledge-intensive tasks such as question answering. 218 | 2. [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401) (Lewis et al., 2020) 219 | 3. [Retrieval-Augmented Generation for Large Language Models: A Survey](https://arxiv.org/abs/2312.10997) (Gao et al., 2023) 220 | 4. [Introducing Contextual Retrieval](https://www.anthropic.com/news/contextual-retrieval) (Anthropic, 2024) 221 | 222 | An important topic not discussed nearly enough is how to prepare data for a RAG system. This post discusses several techniques for preparing data for RAG and offers very practical advice on when to use RAG and when to use long context. 223 | 5. Chunking tutorials from [Pinecone](https://www.pinecone.io/learn/chunking-strategies/) and [Langchain](https://js.langchain.com/v0.1/docs/modules/data_connection/document_transformers/) 224 | 6.
[The 5 Levels Of Text Splitting For Retrieval](https://www.youtube.com/watch?v=8OJC21T2SL4) (Greg Kamradt, 2024) 225 | 7. [GPT-4 + Streaming Data = Real-Time Generative AI](https://www.confluent.io/blog/chatgpt-and-streaming-data-for-real-time-generative-ai/) (Confluent, 2023) 226 | 227 | A great post detailing the pattern of retrieving real-time data in RAG applications. 228 | 8. [Everything You Need to Know about Vector Index Basics](https://zilliz.com/learn/vector-index) (Zilliz, 2023) 229 | 230 | An excellent series on vector search and vector databases. 231 | 9. [A deep dive into the world's smartest email AI](https://www.shortwave.com/blog/deep-dive-into-worlds-smartest-email-ai/) (Hiranya Jayathilaka, 2023) 232 | 233 | If you can ignore the title, the post is a detailed case study on using the RAG pattern to build an email assistant. 234 | 10. [Book] [Introduction to Information Retrieval](https://nlp.stanford.edu/IR-book/information-retrieval-book.html) (Manning, Raghavan, and Schütze, 2008) 235 | 236 | Information retrieval is the backbone of RAG. This book is for those who want to dive really, really deep into different techniques for organizing and querying text data. 237 | 238 | ### Agents 239 | 240 | 1. [Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models](https://arxiv.org/abs/2304.09842) (Lu et al., 2023) 241 | 242 | My favorite study on LLM planners, how they use tools, and their failure modes. An interesting finding is that different LLMs have different tool preferences. 243 | 2. [Generative Agents: Interactive Simulacra of Human Behavior](https://arxiv.org/abs/2304.03442) (Park et al., 2023) 244 | 3. [Toolformer: Language Models Can Teach Themselves to Use Tools](https://arxiv.org/abs/2302.04761) (Schick et al., 2023) 245 | 4. [Berkeley Function Calling Leaderboard](https://gorilla.cs.berkeley.edu/leaderboard.html) and the paper [Gorilla: Large Language Model Connected with Massive APIs](https://arxiv.org/abs/2305.15334) (Patil et al., 2023) 246 | 247 | The list of 4 common mistakes in function calling made by ChatGPT is interesting. 248 | 5. [THUDM/AgentBench: A Benchmark to Evaluate LLMs as Agents](https://github.com/THUDM/AgentBench) (ICLR'24) 249 | 6. [WebGPT: Browser-assisted question-answering with human feedback](https://arxiv.org/abs/2112.09332) (Nakano et al., 2021) 250 | 7. [ReAct: Synergizing Reasoning and Acting in Language Models](https://arxiv.org/abs/2210.03629) (Yao et al., 2022) 251 | 8. [Reflexion: Language Agents with Verbal Reinforcement Learning](https://arxiv.org/abs/2303.11366) (Shinn et al., 2023) 252 | 9. [Voyager: An Open-Ended Embodied Agent with Large Language Models](https://arxiv.org/abs/2305.16291) (Wang et al., 2023) 253 | 10. [Book] [Artificial Intelligence: A Modern Approach](https://www.amazon.com/Artificial-Intelligence-A-Modern-Approach/dp/0134610997) (Russell and Norvig, 4th edition, 2020) 254 | 255 | Planning is closely related to search, and this classic book has several in-depth chapters on search. 256 | 257 | ## Chapter 7. Finetuning 258 | 259 | 1. [Best practices for fine-tuning GPT-3 to classify text](https://docs.google.com/document/d/1rqj7dkuvl7Byd5KQPUJRxc19BJt8wo0yHNwK84KfU3Q/edit) (OpenAI) 260 | 261 | A draft from OpenAI. While this guide focuses on GPT-3, many techniques are applicable to full finetuning in general. It explains how GPT-3 finetuning works, how to prepare training data, how to evaluate your model, and common mistakes. 262 | 2.
[Easily Train a Specialized LLM: PEFT, LoRA, QLoRA, LLaMA-Adapter, and More](https://cameronrwolfe.substack.com/p/easily-train-a-specialized-llm-peft) (Cameron R. Wolfe, 2023) 263 | 264 | For more general parameter-efficient finetuning, Cameron R. Wolfe's 7000-word, well-researched article covers the evolution of adapter-based finetuning, why LoRA is so popular, and why it works. 265 | 3. [Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs](https://arxiv.org/abs/2312.05934) (Ovadia et al., 2024) 266 | 267 | Interesting results to help answer the question: finetune or RAG? 268 | 4. [Parameter-Efficient Transfer Learning for NLP](https://arxiv.org/abs/1902.00751) (Houlsby et al., 2019) 269 | 270 | The paper introducing the concept of PEFT. 271 | 5. [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) (Hu et al., 2021) 272 | 273 | A must-read. 274 | 6. [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314) (Dettmers et al., 2023) 275 | 7. [Direct Preference Optimization with Synthetic Data on Anyscale](https://www.anyscale.com/blog/direct-preference-optimization-with-synthetic-data) (2024) 276 | 8. [Transformer Inference Arithmetic](https://kipp.ly/transformer-inference-arithmetic/) (kipply, 2022) 277 | 9. [Transformer Math 101](https://blog.eleuther.ai/transformer-math/) (EleutherAI, 2023): Memory footprint calculation, focusing more on training. 278 | 10. [Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning](https://arxiv.org/abs/2303.15647) (Lialin et al., 2023) 279 | 280 | A comprehensive study of different finetuning methods. Not all techniques are relevant today, though. 281 | 11. [My experience on starting with fine tuning LLMs with custom data : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/14vnfh2/my_experience_on_starting_with_fine_tuning_llms/) (2023) 282 | 12. [Train With Mixed Precision](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html) (NVIDIA Docs) 283 | 284 | ## Chapter 8. Dataset Engineering 285 | 1. [Annotation Best Practices for Building High-Quality Datasets](https://www.grammarly.com/blog/engineering/annotation-best-practices/) (Grammarly, 2022) 286 | 2. [Scaling Instruction-Finetuned Language Models](https://arxiv.org/abs/2210.11416) (Chung et al., 2022) 287 | 3. [The Curse of Recursion: Training on Generated Data Makes Models Forget](https://arxiv.org/abs/2305.17493) (Shumailov et al., 2023) 288 | 4. [The Llama 3 Herd of Models](https://arxiv.org/abs/2407.21783) (Meta, 2024) 289 | 290 | The whole paper is good, but the section on synthetic data generation and verification is especially important. 291 | 5. [Instruction Tuning with GPT-4](https://arxiv.org/abs/2304.03277) (Peng et al., 2023) 292 | 293 | Uses GPT-4 to generate instruction-following data for LLM finetuning. 294 | 6. [Best Practices and Lessons Learned on Synthetic Data for Language Models](https://arxiv.org/abs/2404.07503) (Liu et al., DeepMind 2024) 295 | 7. [UltraChat] [Enhancing Chat Language Models by Scaling High-quality Instructional Conversations](https://arxiv.org/abs/2305.14233) (Ding et al., 2023) 296 | 8. [Deduplicating Training Data Makes Language Models Better](https://arxiv.org/abs/2107.06499) (Lee et al., 2021) 297 | 9. [Can LLMs learn from a single example?](https://www.fast.ai/posts/2023-09-04-learning-jumps/) (Jeremy Howard and Jonathan Whitaker, 2023) 298 | 299 | A fun experiment showing that it's possible to see model improvement with just one training example.
300 | 10. [LIMA: Less Is More for Alignment](https://arxiv.org/abs/2305.11206) (Zhou et al., 2023) 301 | 302 | ### Public datasets 303 | 304 | Here are a few resources where you can look for publicly available datasets. While you should take advantage of available data, you should never fully trust it. Data needs to be thoroughly inspected and validated. 305 | 306 | Always check a dataset's license before using it. Try your best to understand where the data comes from. Even if a dataset has a license that allows commercial use, it's possible that part of it comes from a source that doesn't. 307 | 308 | 1. [Hugging Face](https://huggingface.co/datasets) and [Kaggle](https://www.kaggle.com/datasets?fileType=csv) each host hundreds of thousands of datasets. 309 | 2. Google has a wonderful and underrated [Dataset Search](https://datasetsearch.research.google.com/). 310 | 3. Governments are often great providers of open data. [Data.gov](https://data.gov) hosts hundreds of thousands of datasets, and [data.gov.in](https://data.gov.in) hosts tens of thousands. 311 | 4. University of Michigan's [Institute for Social Research](https://www.icpsr.umich.edu/web/pages/ICPSR/index.html) (ICPSR) has data from tens of thousands of social studies. 312 | 5. [UC Irvine's Machine Learning Repository](https://archive.ics.uci.edu/datasets) and [OpenML](https://www.openml.org/search?type=data&sort=runs&status=active) are two older dataset repositories, each hosting several thousand datasets. 313 | 6. The [Open Data Network](https://www.opendatanetwork.com/) lets you search among tens of thousands of datasets. 314 | 7. Cloud service providers often host a small collection of open datasets; the most notable one is [AWS's Open Data](https://registry.opendata.aws/). 315 | 8. ML frameworks often have small pre-built datasets that you can load while using the framework, such as [TensorFlow datasets](https://www.tensorflow.org/datasets/catalog/overview#all_datasets). 316 | 9. Some evaluation harness tools host evaluation benchmark datasets that are sufficiently large for PEFT finetuning. For example, Eleuther AI's [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) hosts 400+ benchmark datasets, averaging 2,000+ examples per dataset. 317 | 10. The [Stanford Large Network Dataset Collection](https://snap.stanford.edu/data/) is a great repository for graph datasets. 318 | 319 | 320 | ## Chapter 9. Inference Optimization 321 | 322 | 1. [Mastering LLM Techniques: Inference Optimization](https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/) (NVIDIA Technical Blog, 2023) 323 | 324 | A very good overview of different optimization techniques. 325 | 2. [Accelerating Generative AI with PyTorch II: GPT, Fast](https://pytorch.org/blog/accelerating-generative-ai-2/) (PyTorch, 2023) 326 | 327 | A good case study of the performance improvements achieved with different techniques. 328 | 3. [Efficiently Scaling Transformer Inference](https://arxiv.org/pdf/2211.05102) (Pope et al., 2022) 329 | 330 | A highly technical but really good inference paper from Jeff Dean's team. My favorite is the section discussing what to focus on for different tradeoffs (e.g., latency vs. cost). 331 | 4. [Optimizing AI Inference at Character.AI](https://research.character.ai/optimizing-inference/) (Character.AI, 2024) 332 | 333 | This is less of a technical paper and more of a "Look, I can do this" paper.
It's pretty impressive what the Character.AI technical team was able to achieve. This post discusses attention design, cache optimization, and int8 training. 334 | 5. [Video] [GPU optimization workshop with OpenAI, NVIDIA, PyTorch, and Voltron Data](https://www.youtube.com/watch?v=v_q2JTIqE20&ab_channel=MLOpsLearners) 335 | 6. [Video] [Essence VC Q1 Virtual Conference: LLM Inference](https://www.youtube.com/watch?v=XPArX12gXVE) (with vLLM, TVM, and Modal Labs) 336 | 7. [Techniques for KV Cache Optimization in Large Language Models](https://www.omrimallis.com/posts/techniques-for-kv-cache-optimization/) (Omri Mallis, 2024) 337 | 338 | An excellent post explaining KV cache optimization, one of the most memory-heavy parts of transformer inference. 339 | 340 | [João Lages](https://medium.com/@joaolages/kv-caching-explained-276520203249) has an excellent visualization of the KV cache. 341 | 342 | 8. [Accelerating Large Language Model Decoding with Speculative Sampling](https://arxiv.org/abs/2302.01318) (DeepMind, 2023) 343 | 9. [DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving](https://arxiv.org/abs/2401.09670) (Zhong et al., 2024) 344 | 10. [The Best GPUs for Deep Learning in 2023 — An In-depth Analysis](https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/) (Tim Dettmers, 2023) 345 | 346 | Stas Bekman also has some great [notes](https://github.com/stas00/ml-engineering/tree/master/compute/accelerator) on evaluating accelerators. 347 | 11. [Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads](https://www.usenix.org/system/files/atc19-jeon.pdf) (Jeon et al., 2019) 348 | 349 | A detailed study of GPU clusters used for training deep neural networks (DNNs) in a multi-tenant environment. The authors analyze a two-month-long trace from a GPU cluster at Microsoft, focusing on three key issues affecting cluster utilization: gang scheduling and locality constraints, GPU utilization, and job failures. 350 | 12. [AI Datacenter Energy Dilemma - Race for AI Datacenter Space](https://www.semianalysis.com/p/ai-datacenter-energy-dilemma-race) (SemiAnalysis, 2024) 351 | 352 | A great analysis of the business of data centers and their bottlenecks. 353 | 354 | I also have an older post, [A friendly introduction to machine learning compilers and optimizers](https://huyenchip.com/2021/09/07/a-friendly-introduction-to-machine-learning-compilers-and-optimizers.html) (Chip Huyen, 2021). 355 | 356 | 357 | ## Chapter 10. AI Engineering Architecture and User Feedback 358 | 359 | 1. [Chapter 4: Monitoring](https://sre.google/workbook/monitoring/) from the Google SRE Workbook 360 | 1. [Guidelines for Human-AI Interaction](https://www.microsoft.com/en-us/research/publication/guidelines-for-human-ai-interaction/) (Microsoft Research) 361 | 362 | Microsoft proposed 18 design guidelines for human-AI interaction, covering decisions before development, during development, when something goes wrong, and over time. 363 | 1. [Peering Through Preferences: Unraveling Feedback Acquisition for Aligning Large Language Models](https://arxiv.org/abs/2308.15812v3) (Bansal et al., 2023) 364 | 365 | A study on how the feedback protocol influences a model's training performance. 366 | 1. [Feedback-Based Self-Learning in Large-Scale Conversational AI Agents](https://arxiv.org/abs/1911.02557) (Ponnusamy et al., Amazon 2019) 367 | 1.
[A scalable framework for learning from implicit user feedback to improve natural language understanding in large-scale conversational AI systems](https://arxiv.org/abs/2010.12251) (Park et al., Amazon 2020) 368 | 369 | User feedback design for conversational AI is an under-researched area, so there aren't many resources yet, but I hope that will soon change. 370 | 371 | 372 | ## Bonus: Organization engineering blogs 373 | 374 | I enjoy reading good technical blogs. Here are some of my frequent go-to engineering blogs. 375 | 376 | 1. [LinkedIn Engineering Blog](https://www.linkedin.com/blog/engineering) 377 | 2. [Engineering Blog - DoorDash](https://careersatdoordash.com/engineering-blog/) 378 | 3. [Engineering | Uber Blog](https://www.uber.com/en-US/blog/engineering/) 379 | 4. [The Unofficial Google Data Science Blog](https://www.unofficialgoogledatascience.com/) 380 | 5. [Pinterest Engineering Blog – Medium](https://medium.com/pinterest-engineering) 381 | 6. [Netflix TechBlog](https://netflixtechblog.com/) 382 | 7. [Blog | LMSYS Org](https://lmsys.org/blog/) 383 | 8. [Blog | Anyscale](https://www.anyscale.com/blog) 384 | 9. [Data Science and ML | Databricks Blog](https://www.databricks.com/blog/category/engineering/data-science-machine-learning) 385 | 10. [Together Blog](https://www.together.ai/blog) 386 | 11. [Duolingo Engineering](https://blog.duolingo.com/hub/engineering/) 387 | --------------------------------------------------------------------------------