├── LICENSE.txt
├── README.md
├── experiments.png
└── logo.png

/LICENSE.txt:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2021 Dan Hendrycks

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

<div align="center">

<img src="logo.png" alt="CodeJudge-Eval logo">

<h1>CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?</h1>

🎉 Our paper has been accepted to the 31st International Conference on Computational Linguistics (COLING 2025).

If our project helps you, please give us a star ⭐ on GitHub to support us. 🙏🙏

[![hf_data](https://img.shields.io/badge/🤗-Datasets-9C276A.svg)](https://huggingface.co/datasets/CodeResearch/CodeJudge-Eval)
[![arXiv](https://img.shields.io/badge/Arxiv-2408.10718-AD1C18.svg?logo=arXiv)](https://arxiv.org/abs/2408.10718)
[![License](https://img.shields.io/badge/License-MIT-yellow)](https://github.com/CodeLLM-Research/CodeJudge-Eval/blob/main/LICENSE.txt)

</div>
## Introduction

Recent advancements in large language models (LLMs) have showcased impressive code generation capabilities, primarily evaluated through language-to-code benchmarks. However, these benchmarks may not fully capture a model's code understanding abilities. We introduce **CodeJudge-Eval (CJ-Eval)**, a novel benchmark designed to assess LLMs' code understanding abilities from the perspective of code judging rather than code generation. **CJ-Eval** challenges models to determine the correctness of provided code solutions, encompassing various error types and compilation issues. By leveraging a diverse set of problems and a fine-grained judging system, **CJ-Eval** addresses the limitations of traditional benchmarks, including the potential memorization of solutions. Evaluation of 12 well-known LLMs on **CJ-Eval** reveals that even state-of-the-art models struggle, highlighting the benchmark's ability to probe deeper into models' code understanding abilities.
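The benchmark is distributed via the Hugging Face dataset linked in the badge above. Below is a minimal sketch of loading it and framing one sample as a judging task; the split handling is generic, but the field names (`problem`, `solution`) and the verdict wording are illustrative assumptions rather than the dataset's confirmed schema, so check the dataset card before relying on them.

```python
# Minimal sketch: load CJ-Eval from Hugging Face and build a judging prompt.
# Assumptions (not confirmed by this repo): the default config loads without
# extra arguments, and samples expose "problem"/"solution" fields.
from datasets import load_dataset

dataset = load_dataset("CodeResearch/CodeJudge-Eval")
print(dataset)  # shows the actual split names and columns

split = next(iter(dataset.values()))  # first split, whatever it is called
sample = split[0]
print(sample.keys())  # inspect the real fields before relying on them

# Hypothetical judging prompt: the model judges a given solution instead of
# generating one, which is the core idea of CJ-Eval.
prompt = (
    "You are a code judge. Given a programming problem and a candidate "
    "solution, decide whether the solution is correct.\n\n"
    f"Problem:\n{sample.get('problem', '')}\n\n"
    f"Candidate solution:\n{sample.get('solution', '')}\n\n"
    "Answer with a verdict such as Accepted or Wrong Answer."
)
print(prompt[:500])
```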

## Experiment Results

![Experiment results of 12 LLMs on CJ-Eval](experiments.png)
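The figure above reports how well each model's predicted verdicts line up with the ground-truth judge verdicts. As a hedged sketch of the kind of metric such a judging benchmark implies, assuming a simplified verdict set (the paper defines the official fine-grained taxonomy), exact-match accuracy can be computed like this:

```python
# Hedged sketch: exact-match accuracy between predicted and gold verdicts.
# The verdict labels below are illustrative, not CJ-Eval's official taxonomy.
from collections import Counter

def judging_accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of samples where the model's verdict matches the judge's."""
    if len(predicted) != len(gold):
        raise ValueError("prediction/gold length mismatch")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

# Toy usage: two of three verdicts match, so accuracy is ~0.667.
pred = ["Accepted", "Wrong Answer", "Accepted"]
gold = ["Accepted", "Wrong Answer", "Compile Error"]
print(judging_accuracy(pred, gold))
print(Counter(gold))  # gold verdict distribution, handy for error analysis
```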

## More Details

More details can be found in our [paper](https://arxiv.org/abs/2408.10718).

## 📑 Citation

If you find **CodeJudge-Eval** useful for your research and applications, please cite using this BibTeX:

```bibtex
@misc{zhao2024codejudgeevallargelanguagemodels,
      title={CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?},
      author={Yuwei Zhao and Ziyang Luo and Yuchen Tian and Hongzhan Lin and Weixiang Yan and Annan Li and Jing Ma},
      year={2024},
      eprint={2408.10718},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2408.10718},
}
```
--------------------------------------------------------------------------------
/experiments.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CodeLLM-Research/CodeJudge-Eval/c168cd2702c3da3b7c8d08ec7a572e7d9f7f1fbf/experiments.png
--------------------------------------------------------------------------------
/logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CodeLLM-Research/CodeJudge-Eval/c168cd2702c3da3b7c8d08ec7a572e7d9f7f1fbf/logo.png
--------------------------------------------------------------------------------