# Awesome-LLM-based-Evaluators

Automated LLM-based evaluation has emerged as a scalable and efficient alternative to human evaluation of large language models. This repo collects papers on LLM-based evaluators.
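As a quick orientation, the sketch below illustrates the basic LLM-as-a-judge pattern that many of the papers listed below build on: prompt a strong LLM to compare two candidate answers, parse its verdict, and optionally swap the answer order to reduce position bias. This is a minimal illustrative sketch, not code from any listed paper; `call_llm` is a hypothetical placeholder for whichever model API you use.

```python
# Minimal LLM-as-a-judge sketch (pairwise comparison).
# NOTE: `call_llm` is a hypothetical placeholder, not a real API -- wire it to your own model.

JUDGE_PROMPT = """You are an impartial judge. Given a user question and two candidate answers,
decide which answer is better.

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}

Reply with exactly one word: "A", "B", or "Tie"."""


def call_llm(prompt: str) -> str:
    """Placeholder: replace with a call to your preferred LLM API."""
    raise NotImplementedError("Plug in an actual model call here.")


def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge model for a verdict and normalize it to 'A', 'B', or 'Tie'."""
    prompt = JUDGE_PROMPT.format(question=question, answer_a=answer_a, answer_b=answer_b)
    verdict = call_llm(prompt).strip().strip('"')
    return verdict if verdict in {"A", "B", "Tie"} else "Tie"


def judge_pair_debiased(question: str, answer_a: str, answer_b: str) -> str:
    """Query in both answer orders to reduce position bias; disagreement counts as a tie."""
    first = judge_pair(question, answer_a, answer_b)
    second = judge_pair(question, answer_b, answer_a)
    second_flipped = {"A": "B", "B": "A", "Tie": "Tie"}[second]  # map back to original labels
    return first if first == second_flipped else "Tie"
```

The two-order check reflects the position bias of LLM judges analyzed in several of the papers below.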
## LLM-based evaluators

| Title & Authors | Venue | Year | Citation Count | Code |
| ------------------------------------------------------------ | ------------- | ---- | -------------- | ------------------------------------------------------------ |
| [**Chateval: Towards better llm-based evaluators through multi-agent debate**](https://arxiv.org/abs/2308.07201) by Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, Zhiyuan Liu | ICLR | 2024 | 107 | [GitHub](https://github.com/thunlp/ChatEval) |
| [**Dyval: Graph-informed dynamic evaluation of large language models**](https://arxiv.org/abs/2309.17167) by Kaijie Zhu, Jiaao Chen, Jindong Wang, Neil Zhenqiang Gong, Diyi Yang, Xing Xie | ICLR | 2024 | 6 | [GitHub](https://github.com/microsoft/promptbench) |
| [**PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization**](https://openreview.net/pdf?id=5Nn2BLV7SB) by Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Wenjin Yao, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, Wei Ye, Shikun Zhang, Yue Zhang | ICLR | 2024 | 52 | [GitHub](https://github.com/WeOpenML/PandaLM) |
| [**Can large language models be an alternative to human evaluations?**](https://arxiv.org/abs/2305.01937) by Cheng-Han Chiang, Hung-yi Lee | ACL | 2023 | 133 | - |
| [**LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models**](https://arxiv.org/abs/2305.13711) by Yen-Ting Lin, Yun-Nung Chen | NLP4ConvAI | 2023 | 42 | - |
| [**Are large language model-based evaluators the solution to scaling up multilingual evaluation?**](https://arxiv.org/abs/2309.07462) by Rishav Hada, Varun Gumma, Adrian de Wynter, Harshita Diddee, Mohamed Ahmed, Monojit Choudhury, Kalika Bali, Sunayana Sitaram | EACL Findings | 2024 | 12 | [GitHub](https://github.com/hadarishav/LLM-Eval) |
| [**Judging llm-as-a-judge with mt-bench and chatbot arena**](https://arxiv.org/abs/2306.05685) by Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica | NeurIPS | 2023 | 510 | [GitHub](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge) |
| [**Calibrating LLM-Based Evaluator**](https://arxiv.org/abs/2309.13308) by Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang | arXiv | 2023 | 7 | - |
| [**LLM-based NLG Evaluation: Current Status and Challenges**](https://arxiv.org/abs/2402.01383) by Mingqi Gao, Xinyu Hu, Jie Ruan, Xiao Pu, Xiaojun Wan | arXiv | 2024 | 1 | - |
| [**Are LLM-based Evaluators Confusing NLG Quality Criteria?**](https://arxiv.org/abs/2402.12055) by Xinyu Hu, Mingqi Gao, Sen Hu, Yang Zhang, Yicheng Chen, Teng Xu, Xiaojun Wan | arXiv | 2024 | - | - |
| [**PRE: A Peer Review Based Large Language Model Evaluator**](https://arxiv.org/abs/2401.15641) by Zhumin Chu, Qingyao Ai, Yiteng Tu, Haitao Li, Yiqun Liu | arXiv | 2024 | 1 | [GitHub](https://github.com/chuzhumin98/PRE) |
| [**Allure: A systematic protocol for auditing and improving llm-based evaluation of text using iterative in-context-learning**](https://arxiv.org/abs/2309.13701) by Hosein Hasanbeig, Hiteshi Sharma, Leo Betthauser, Felipe Vieira Frujeri, Ida Momennejad | arXiv | 2023 | 4 | - |
| [**Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate**](https://arxiv.org/abs/2401.16788) by Steffi Chern, Ethan Chern, Graham Neubig, Pengfei Liu | arXiv | 2024 | - | [GitHub](https://github.com/GAIR-NLP/scaleeval) |
| [**Split and merge: Aligning position biases in large language model based evaluators**](https://arxiv.org/abs/2310.01432) by Zongjie Li, Chaozheng Wang, Pingchuan Ma, Daoyuan Wu, Shuai Wang, Cuiyun Gao, Yang Liu | arXiv | 2023 | 8 | - |
| [**One Prompt To Rule Them All: LLMs for Opinion Summary Evaluation**](https://arxiv.org/abs/2402.11683) by Tejpalsingh Siledar, Swaroop Nath, Sankara Sri Raghava Ravindra Muddu, Rupasai Rangaraju, Swaprava Nath, Pushpak Bhattacharyya, Suman Banerjee, Amey Patil, Sudhanshu Shekhar Singh, Muthusamy Chelliah, Nikesh Garera | arXiv | 2024 | - | [GitHub](https://github.com/tjsiledar/SummEval-OP) |
| [**Critiquellm: Scaling llm-as-critic for effective and explainable evaluation of large language model generation**](https://arxiv.org/abs/2311.18702) by Pei Ke, Bosi Wen, Zhuoer Feng, Xiao Liu, Xuanyu Lei, Jiale Cheng, Shengyuan Wang, Aohan Zeng, Yuxiao Dong, Hongning Wang, Jie Tang, Minlie Huang | arXiv | 2023 | 9 | [GitHub](https://github.com/thu-coai/CritiqueLLM) |
| [**Is chatgpt a good nlg evaluator? A preliminary study**](https://arxiv.org/abs/2303.04048) by Jiaan Wang, Yunlong Liang, Fandong Meng, Zengkui Sun, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, Jie Zhou | NewSumm@EMNLP | 2023 | 179 | [GitHub](https://github.com/krystalan/chatgpt_as_nlg_evaluator) |
| [**G-eval: Nlg evaluation using gpt-4 with better human alignment**](https://arxiv.org/abs/2303.16634) by Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, Chenguang Zhu | EMNLP | 2023 | 381 | [GitHub](https://github.com/nlpyang/geval) |
| [**GPTScore: Evaluate as You Desire**](https://arxiv.org/abs/2302.04166) by Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, Pengfei Liu | EMNLP | 2023 | 228 | [GitHub](https://github.com/jinlanfu/GPTScore) |
| [**Exploring the Use of Large Language Models for Reference-Free Text Quality Evaluation: An Empirical Study**](https://arxiv.org/abs/2304.00723) by Yi Chen, Rui Wang, Haiyun Jiang, Shuming Shi, Ruifeng Xu | arXiv | 2023 | 39 | - |
| [**Evaluating General-Purpose AI with Psychometrics**](https://arxiv.org/abs/2310.16379) by Xiting Wang, Liming Jiang, Jose Hernandez-Orallo, David Stillwell, Luning Sun, Fang Luo, Xing Xie | arXiv | 2023 | 3 | - |

## Psychometrics in LLM evaluation

| Title & Authors | Venue | Year | Citation Count | Code |
| ------------------------------------------------------------ | ------- | ---- | -------------- | --------------------------------------------------- |
| [**InCharacter: Evaluating Personality Fidelity in Role-Playing Agents through Psychological Interviews**](https://jiangjiechen.github.io/publication/incharacter/) by Xintao Wang, Yunze Xiao, Jen-tse Huang, Siyu Yuan, Rui Xu, Haoran Guo, Quan Tu, Yaying Fei, Ziang Leng, Wei Wang, Jiangjie Chen, Cheng Li, Yanghua Xiao | arXiv | 2023 | 16 | [GitHub](https://github.com/Neph0s/InCharacter) |
| [**Who is ChatGPT? Benchmarking LLMs’ Psychological Portrayal Using PsychoBench**](https://arxiv.org/abs/2310.01386) by Jen-tse Huang, Wenxuan Wang, Eric John Li, Man Ho Lam, Shujie Ren, Youliang Yuan, Wenxiang Jiao, Zhaopeng Tu, Michael R. Lyu | ICLR | 2024 | 10 | [GitHub](https://github.com/CUHK-ARISE/PsychoBench) |
| [**On the Humanity of Conversational AI: Evaluating the Psychological Portrayal of LLMs**](https://openreview.net/forum?id=H3UayAQWoE) by Jen-tse Huang, Wenxuan Wang, Eric John Li, Man Ho Lam, Shujie Ren, Youliang Yuan, Wenxiang Jiao, Zhaopeng Tu, Michael Lyu | ICLR | 2024 | 2 | [GitHub](https://github.com/CUHK-ARISE/PsychoBench) |
| [**Evaluating and Inducing Personality in Pre-trained Language Models**](https://arxiv.org/abs/2206.07550) by Guangyuan Jiang, Manjie Xu, Song-Chun Zhu, Wenjuan Han, Chi Zhang, Yixin Zhu | NeurIPS | 2023 | 47 | [GitHub](https://github.com/jianggy/MPI) |
| [**Efficiently Measuring the Cognitive Ability of LLMs: An Adaptive Testing Perspective**](https://arxiv.org/abs/2306.10512v2) by Yan Zhuang, Qi Liu, Yuting Ning, Weizhe Huang, Rui Lv, Zhenya Huang, Guanhao Zhao, Zheng Zhang, Qingyang Mao, Shijin Wang, Enhong Chen | arXiv | 2023 | 19 | - |
| [**MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts**](https://arxiv.org/abs/2310.02255v1) by Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, Jianfeng Gao | arXiv | 2023 | 49 | [GitHub](https://github.com/lupantech/MathVista) |
| [**LLM Agents for Psychology: A Study on Gamified Assessments**](https://arxiv.org/abs/2402.12326) by Qisen Yang, Zekun Wang, Honghui Chen, Shenzhi Wang, Yifan Pu, Xin Gao, Wenhao Huang, Shiji Song, Gao Huang | arXiv | 2024 | - | - |
| [**MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback**](https://arxiv.org/abs/2309.10691) by Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, Heng Ji | ICLR | 2024 | 37 | [GitHub](https://github.com/xingyaoww/mint-bench) |
| [**GPT-4’s assessment of its performance in a USMLE-based case study**](https://arxiv.org/html/2402.09654v2) by Uttam Dhakal, Aniket Kumar Singh, Suman Devkota, Yogesh Sapkota, Bishal Lamichhane, Suprinsa Paudyal, Chandra Dhakal | arXiv | 2024 | 1 | - |