# ⚖️ **Awesome LLM Judges** ⚖️

_This repo curates recent research on LLM Judges for automated evaluation._

> [!TIP]
> ⚖️ Check out [Verdict](https://verdict.haizelabs.com) — our in-house library for hassle-free implementations of the papers below!

---

## 📚 Table of Contents

- [🌱 Starter](#-starter)
- [🎭 Multi-Judge](#-multi-judge)
  - [🤔 Debate](#-debate)
- [🎯 Finetuned Models](#-finetuned-models)
  - [🌀 Hallucination](#-hallucination)
  - [🏆 Generative Reward Models](#-generative-reward-models)
- [🛡️ Safety](#️-safety)
  - [🛑 Content Moderation](#-content-moderation)
  - [🔍 Scalable Oversight](#-scalable-oversight)
- [👨‍⚖️ Judging the Judges: Meta-Evaluation](#-judging-the-judges-meta-evaluation)
  - [⚖️ Biases](#-biases)
- [🤖 Agents](#-agents)
- [✨ Contributing](#-contributing)

---

## 🌱 Starter

- [Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena](https://arxiv.org/abs/2306.05685)
- [G-EVAL: NLG Evaluation using GPT-4 with Better Human Alignment](https://arxiv.org/abs/2303.16634)
- [Benchmarking Foundation Models with Language-Model-as-an-Examiner](https://arxiv.org/abs/2306.04181)

---

## 🎭 Multi-Judge

- [Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models](https://arxiv.org/abs/2404.18796)

### 🤔 Debate

- [ScaleEval: Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate](https://arxiv.org/abs/2401.16788)
- [ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate](https://arxiv.org/abs/2308.07201)
- [Can ChatGPT Defend its Belief in Truth? Evaluating LLM Reasoning via Debate](https://arxiv.org/abs/2305.13160)
- [Debating with More Persuasive LLMs Leads to More Truthful Answers](https://arxiv.org/abs/2402.06782)

---

## 🎯 Finetuned Models

- [Prometheus: Inducing Fine-grained Evaluation Capability in Language Models](https://arxiv.org/abs/2310.08491)
- [Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models](https://arxiv.org/abs/2405.01535)
- [JudgeLM: Fine-tuned Large Language Models are Scalable Judges](https://arxiv.org/abs/2310.17631)

### 🌀 Hallucination

- [HALU-J: Critique-Based Hallucination Judge](https://arxiv.org/abs/2407.12943)
- [MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents](https://aclanthology.org/2024.emnlp-main.499/)
- [Lynx: An Open Source Hallucination Evaluation Model](https://arxiv.org/abs/2407.08488)

### 🏆 Generative Reward Models

- [Generative Verifiers: Reward Modeling as Next-Token Prediction](https://arxiv.org/abs/2408.15240)
- [Critique-out-Loud Reward Models](https://arxiv.org/abs/2408.11791)

---

## 🛡️ Safety

### 🛑 Content Moderation

- [A STRONGREJECT for Empty Jailbreaks (Sections C.4 & C.5)](https://arxiv.org/pdf/2402.10260)
- [OR-Bench: An Over-Refusal Benchmark for Large Language Models (Sections A.3 & A.11)](https://arxiv.org/abs/2405.20947)
- [WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs](https://arxiv.org/abs/2406.18495)
- [Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations](https://arxiv.org/abs/2312.06674)

### 🔍 Scalable Oversight

- [On Scalable Oversight with Weak LLMs Judging Strong LLMs](https://arxiv.org/abs/2407.04622)
- [Debate Helps Supervise Unreliable Experts](https://arxiv.org/abs/2311.08702)
- [Great Models Think Alike and this Undermines AI Oversight](https://arxiv.org/abs/2502.04313)
- [LLM Critics Help Catch LLM Bugs](https://arxiv.org/abs/2407.00215)

---

## 👨‍⚖️ Judging the Judges: Meta-Evaluation

- [JudgeBench: A Benchmark for Evaluating LLM-based Judges](https://arxiv.org/abs/2410.12784)
- [RewardBench: Evaluating Reward Models for Language Modeling](https://arxiv.org/abs/2403.13787)
- [Evaluating Large Language Models at Evaluating Instruction Following](https://arxiv.org/abs/2310.07641)
- [Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge](https://arxiv.org/abs/2407.19594)
- [From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge](https://arxiv.org/abs/2411.16594)
- [Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences](https://arxiv.org/abs/2404.12272)
- [Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference](https://arxiv.org/abs/2501.00560)
- [The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs](https://arxiv.org/abs/2501.10970)
- [ReIFE: Re-evaluating Instruction-Following Evaluation](https://arxiv.org/abs/2410.07069)

### ⚖️ Biases

- [Large Language Models are not Fair Evaluators](https://arxiv.org/abs/2305.17926)
- [Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions](https://arxiv.org/abs/2308.11483)
- [Large Language Models are Inconsistent and Biased Evaluators](https://arxiv.org/abs/2405.01724)
- [Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges](https://arxiv.org/abs/2406.12624)

---

## 🤖 Agents

_🚧 Coming Soon -- Stay tuned!_

---

## ✨ Contributing

Have a paper to add? Found a mistake? 🧐

- Open a pull request or submit an issue! Contributions are welcome. 🙌
- Questions? Reach out to [leonard@haizelabs.com](mailto:leonard@haizelabs.com).