# ⚖️ **Awesome LLM Judges** ⚖️

_This repo curates recent research on LLM Judges for automated evaluation._

> [!TIP]
> ⚖️ Check out [Verdict](https://verdict.haizelabs.com) — our in-house library for hassle-free implementations of the papers below!

---

## 📚 Table of Contents

- [🌱 Starter](#-starter)
- [🎭 Multi-Judge](#-multi-judge)
  - [🤔 Debate](#-debate)
- [🎯 Finetuned Models](#-finetuned-models)
  - [🌀 Hallucination](#-hallucination)
  - [🏆 Generative Reward Models](#-generative-reward-models)
- [🛡️ Safety](#️-safety)
  - [🛑 Content Moderation](#-content-moderation)
  - [🔍 Scalable Oversight](#-scalable-oversight)
- [👨‍⚖️ Judging the Judges: Meta-Evaluation](#-judging-the-judges-meta-evaluation)
  - [⚖️ Biases](#-biases)
- [🤖 Agents](#-agents)
- [✨ Contributing](#-contributing)

---

## 🌱 Starter

- [Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena](https://arxiv.org/abs/2306.05685)
- [G-EVAL: NLG Evaluation using GPT-4 with Better Human Alignment](https://arxiv.org/abs/2303.16634)
- [Benchmarking Foundation Models with Language-Model-as-an-Examiner](https://arxiv.org/abs/2306.04181)

---

## 🎭 Multi-Judge

- [Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models](https://arxiv.org/abs/2404.18796)

### 🤔 Debate

- [ScaleEval: Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate](https://arxiv.org/abs/2401.16788)
- [ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate](https://arxiv.org/abs/2308.07201)
- [Can ChatGPT Defend its Belief in Truth? Evaluating LLM Reasoning via Debate](https://arxiv.org/abs/2305.13160)
- [Debating with More Persuasive LLMs Leads to More Truthful Answers](https://arxiv.org/abs/2402.06782)

---

## 🎯 Finetuned Models

- [Prometheus: Inducing Fine-grained Evaluation Capability in Language Models](https://arxiv.org/abs/2310.08491)
- [Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models](https://arxiv.org/abs/2405.01535)
- [JudgeLM: Fine-tuned Large Language Models are Scalable Judges](https://arxiv.org/abs/2310.17631)

### 🌀 Hallucination

- [HALU-J: Critique-Based Hallucination Judge](https://arxiv.org/abs/2407.12943)
- [MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents](https://aclanthology.org/2024.emnlp-main.499/)
- [Lynx: An Open Source Hallucination Evaluation Model](https://arxiv.org/abs/2407.08488)

### 🏆 Generative Reward Models

- [Generative Verifiers: Reward Modeling as Next-Token Prediction](https://arxiv.org/abs/2408.15240)
- [Critique-out-Loud Reward Models](https://arxiv.org/abs/2408.11791)

---

## 🛡️ Safety

### 🛑 Content Moderation

- [A STRONGREJECT for Empty Jailbreaks (Sections C.4 & C.5)](https://arxiv.org/pdf/2402.10260)
- [OR-Bench: An Over-Refusal Benchmark for Large Language Models (Sections A.3 & A.11)](https://arxiv.org/abs/2405.20947)
- [WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs](https://arxiv.org/abs/2406.18495)
- [Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations](https://arxiv.org/abs/2312.06674)

### 🔍 Scalable Oversight

- [On Scalable Oversight with Weak LLMs Judging Strong LLMs](https://arxiv.org/abs/2407.04622)
- [Debate Helps Supervise Unreliable Experts](https://arxiv.org/abs/2311.08702)
- [Great Models Think Alike and this Undermines AI Oversight](https://arxiv.org/abs/2502.04313)
- [LLM Critics Help Catch LLM Bugs](https://arxiv.org/abs/2407.00215)

---

## 👨‍⚖️ Judging the Judges: Meta-Evaluation

- [JudgeBench: A Benchmark for Evaluating LLM-based Judges](https://arxiv.org/abs/2410.12784)
- [RewardBench: Evaluating Reward Models for Language Modeling](https://arxiv.org/abs/2403.13787)
- [Evaluating Large Language Models at Evaluating Instruction Following](https://arxiv.org/abs/2310.07641)
- [Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge](https://arxiv.org/abs/2407.19594)
- [From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge](https://arxiv.org/abs/2411.16594)
- [Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences](https://arxiv.org/abs/2404.12272)
- [Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference](https://arxiv.org/abs/2501.00560)
- [The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs](https://arxiv.org/abs/2501.10970)
- [ReIFE: Re-evaluating Instruction-Following Evaluation](https://arxiv.org/abs/2410.07069)

### ⚖️ Biases

- [Large Language Models are not Fair Evaluators](https://arxiv.org/abs/2305.17926)
- [Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions](https://arxiv.org/abs/2308.11483)
- [Large Language Models are Inconsistent and Biased Evaluators](https://arxiv.org/abs/2405.01724)
- [Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges](https://arxiv.org/abs/2406.12624)

---

## 🤖 Agents

_🚧 Coming Soon -- Stay tuned!_

---

## ✨ Contributing

Have a paper to add? Found a mistake? 🧐

- Open a pull request or submit an issue! Contributions are welcome. 🙌
- Questions? Reach out to [leonard@haizelabs.com](mailto:leonard@haizelabs.com).