├── README.md
├── data
    └── ai_surveillance_data.csv
└── images
    └── oreilly.png

/README.md:
--------------------------------------------------------------------------------
![oreilly-logo](images/oreilly.png)

# Evaluating Large Language Models (LLMs)

This repository contains code for the live session [O'Reilly Course on Evaluating LLMs](https://learning.oreilly.com/live-events/evaluating-large-language-models-llms/0642572013878), with a companion video course [here](https://learning.oreilly.com/course/evaluating-large-language/9780135451922/).

This course offers an in-depth look at evaluating large language models (LLMs), equipping participants with the tools and techniques to measure their performance, reliability, and task alignment. Topics range from foundational metrics to advanced methods such as probing and fine-tuning evaluation. Hands-on exercises and real-world case studies make this course engaging and practical, ensuring learners can directly apply their knowledge to real-world systems.

## Notebooks

In the activated environment, run

```bash
python3 -m jupyter notebook
```

- **Evaluating Generative Tasks**

  - **[Evaluating Generative Free Text with Rubrics](https://colab.research.google.com/drive/1DeVYrdNb3FlQQLeBqGPFkx6roZaPwVRy?usp=sharing)**

  - **[Perplexity, SelfCheckGPT + BERTScore](https://colab.research.google.com/drive/1rG8vCJz5He5JM5oPLYnH3TShSyCnyK9H?usp=sharing)**


- **Evaluating Understanding Tasks**

  - **[Classification Metrics with BERT and BART](https://colab.research.google.com/drive/1yALtgSK6ENEa5WkBGWm3DPviuVGwhrw9?usp=sharing)** - Comparing a fine-tuned BERT model vs 0-shot classification with BART on the [app_reviews dataset](https://huggingface.co/datasets/sealuzh/app_reviews)

  - [Fine-tuning BERT on app_reviews](https://github.com/sinanuozdemir/quick-start-guide-to-llms/blob/main/notebooks/05_bert_app_review.ipynb): Fine-tuning a BERT model for app review classification.

  - [Fine-tuning OpenAI on app_reviews](https://github.com/sinanuozdemir/quick-start-guide-to-llms/blob/main/notebooks/05_openai_app_review_fine_tuning.ipynb): Fine-tuning OpenAI models for app review classification.

  - **[RAG - Retrieval](https://github.com/sinanuozdemir/oreilly-retrieval-augmented-gen-ai/blob/main/notebooks/RAG_Retrieval.ipynb)**: An introduction to vector databases, embeddings, and retrieval

  - [Advanced Semantic Search](https://github.com/sinanuozdemir/quick-start-guide-to-llms/blob/main/notebooks/02_semantic_search.ipynb): A more advanced notebook on semantic search, cross-encoders, and fine-tuning from my [book](https://github.com/sinanuozdemir/quick-start-guide-to-llms)


- **Benchmarking**

  - **[Benchmarking Llama 3.2 Instruct on MMLU and Embedders on MTEB](https://colab.research.google.com/drive/1zDCqXc7vHoZilHVe3y2lYyTmSUSe6bh3?usp=sharingb)**

  - [Follow-up Evaluating Llama 3.2 non-instruct on MMLU](https://colab.research.google.com/drive/1aMy19Ikyody9CGyn42K3E_DQwLScL0Ek?usp=sharing)

  - [Evaluating Llama 3.1 vs Mistral on Truthful Q/A](https://github.com/sinanuozdemir/quick-start-guide-to-llms/blob/main/notebooks/12_llm_gen_eval.ipynb)


- **Probing**

  - **[Probing Chess Playing LLMs](https://colab.research.google.com/drive/114turFLNxLJXiIseDWl1BDJmont0VD8h?usp=sharing)**

  - There are over a dozen notebooks for the birth year/death year probing example, so I will only share a few key ones here:
    - [Llama-3 8B Instruct with prompt "Who is {NAME}"](https://colab.research.google.com/drive/1e1d9fATVjVun-_tPj4vS_DSTGaIfxs01?usp=sharing)
    - [BERT-large-cased no prompt](https://colab.research.google.com/drive/1cizgoh1J6Y-DHBrOkNTFo9Y1CypjwuQM?usp=sharing)
    - [Mistral-7B-Instruct-v0.3 with prompt "Who is {NAME}"](https://colab.research.google.com/drive/1VL3betxqVZ_H3_8XmLbjE0hEjaoy-HPV?usp=sharing)

- **Evaluating Fine-tuning**

  - **[Optimizing Fine-tuning](https://github.com/sinanuozdemir/quick-start-guide-to-llms/blob/main/notebooks/10_optimizing_fine_tuning.ipynb)** - Best practices for optimizing fine-tuning of transformer models.

- **Evaluating Fine-tuning Data**

  - **[AUM + Cosine Similarity to clean data](https://colab.research.google.com/drive/1hPnU9sLsV9W50q9rd_oxUU1Bv7SUCVU5?usp=sharing)**

- **Case Studies**

  - **[Evaluating AI Agents: Task Automation and Tool Integration](https://ai-office-hours.beehiiv.com/p/evaluating-ai-agent-tool-selection)**
    - [Positional Bias on Agent Response Evaluation](https://github.com/sinanuozdemir/oreilly-ai-agents/blob/main/notebooks/Evaluating_LLMs_with_Rubrics.ipynb)

  - **[Measuring RAG Re-Ranking](https://ai-office-hours.beehiiv.com/p/re-ranking-rag)**

  - **[Building and Evaluating a Recommendation Engine Using LLMs](https://github.com/sinanuozdemir/quick-start-guide-to-llms/blob/main/notebooks/07_recommendation_engine.ipynb)** - Fine-tuning embedding engines using custom preference data

  - **[Using Evaluation to combat AI drift](https://colab.research.google.com/drive/14E6DMP_RGctUPqjI6VMa8EFlggXR7fat?usp=sharing)**

  - **[Time Series Regression](https://colab.research.google.com/drive/1VRB1774lq5s0loxDpDXGTw5qAF9FUseH?usp=sharing)** - Predicting the price of Bitcoin



## Instructor

**Sinan Ozdemir** is the Founder and CTO of LoopGenius, where he uses state-of-the-art AI to help people run digital ads on Meta, Google, and more. Sinan is a former lecturer of Data Science at Johns Hopkins University and the author of multiple textbooks on data science and machine learning. Additionally, he is the founder of the recently acquired Kylie.ai, an enterprise-grade conversational AI platform with RPA capabilities. He holds a master’s degree in Pure Mathematics from Johns Hopkins University and is based in San Francisco, CA.

--------------------------------------------------------------------------------
/data/ai_surveillance_data.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sinanuozdemir/oreilly-evaluating-llms/9e886b212569ced9a827e5387c8c7408e3b1b7cf/data/ai_surveillance_data.csv

--------------------------------------------------------------------------------
/images/oreilly.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sinanuozdemir/oreilly-evaluating-llms/9e886b212569ced9a827e5387c8c7408e3b1b7cf/images/oreilly.png

--------------------------------------------------------------------------------