├── LICENSE.txt
├── README.md
├── experiments.png
└── logo.png

/LICENSE.txt:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2021 Dan Hendrycks

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

<div align="center">

<img src="logo.png" alt="CodeJudge-Eval logo">

<h1>CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?</h1>

🎉 Our paper has been accepted to the 31st International Conference on Computational Linguistics (COLING 2025).

If our project helps you, please give us a star ⭐ on GitHub to support us. 🙏🙏

[![hf_data](https://img.shields.io/badge/🤗-Datasets-9C276A.svg)](https://huggingface.co/datasets/CodeResearch/CodeJudge-Eval)
[![arXiv](https://img.shields.io/badge/Arxiv-2408.10718-AD1C18.svg?logo=arXiv)](https://arxiv.org/abs/2408.10718)
[![License](https://img.shields.io/badge/License-MIT-yellow)](https://github.com/CodeLLM-Research/CodeJudge-Eval/blob/main/LICENSE.txt)

</div>
## Introduction

Recent advancements in large language models (LLMs) have showcased impressive code generation capabilities, primarily evaluated through language-to-code benchmarks. However, these benchmarks may not fully capture a model's code understanding abilities. We introduce **CodeJudge-Eval (CJ-Eval)**, a novel benchmark designed to assess LLMs' code understanding abilities from the perspective of code judging rather than code generation. **CJ-Eval** challenges models to determine the correctness of provided code solutions, encompassing various error types and compilation issues. By leveraging a diverse set of problems and a fine-grained judging system, **CJ-Eval** addresses the limitations of traditional benchmarks, including the potential memorization of solutions. Evaluation of 12 well-known LLMs on **CJ-Eval** reveals that even state-of-the-art models struggle, highlighting the benchmark's ability to probe deeper into models' code understanding abilities.
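The benchmark is distributed via the Hugging Face dataset linked in the badge above. Below is a minimal sketch of loading it and framing one sample as a judging task; the split handling is generic, but the field names (`problem`, `solution`) and the verdict wording are illustrative assumptions rather than the dataset's confirmed schema, so check the dataset card before relying on them.

```python
# Minimal sketch: load CJ-Eval from Hugging Face and build a judging prompt.
# Assumptions (not confirmed by this repo): the default config loads without
# extra arguments, and samples expose "problem"/"solution" fields.
from datasets import load_dataset

dataset = load_dataset("CodeResearch/CodeJudge-Eval")
print(dataset)  # shows the actual split names and columns

split = next(iter(dataset.values()))  # first split, whatever it is called
sample = split[0]
print(sample.keys())  # inspect the real fields before relying on them

# Hypothetical judging prompt: the model judges a given solution instead of
# generating one, which is the core idea of CJ-Eval.
prompt = (
    "You are a code judge. Given a programming problem and a candidate "
    "solution, decide whether the solution is correct.\n\n"
    f"Problem:\n{sample.get('problem', '')}\n\n"
    f"Candidate solution:\n{sample.get('solution', '')}\n\n"
    "Answer with a verdict such as Accepted or Wrong Answer."
)
print(prompt[:500])
```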

## Experiment Results

![Experiment results of 12 LLMs on CJ-Eval](experiments.png)
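The figure above reports how well each model's predicted verdicts line up with the ground-truth judge verdicts. As a hedged sketch of the kind of metric such a judging benchmark implies, assuming a simplified verdict set (the paper defines the official fine-grained taxonomy), exact-match accuracy can be computed like this:

```python
# Hedged sketch: exact-match accuracy between predicted and gold verdicts.
# The verdict labels below are illustrative, not CJ-Eval's official taxonomy.
from collections import Counter

def judging_accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of samples where the model's verdict matches the judge's."""
    if len(predicted) != len(gold):
        raise ValueError("prediction/gold length mismatch")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

# Toy usage: two of three verdicts match, so accuracy is ~0.667.
pred = ["Accepted", "Wrong Answer", "Accepted"]
gold = ["Accepted", "Wrong Answer", "Compile Error"]
print(judging_accuracy(pred, gold))
print(Counter(gold))  # gold verdict distribution, handy for error analysis
```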

## More Details

More details can be found in our [paper](https://arxiv.org/abs/2408.10718).

## 📑 Citation

If you find **CodeJudge-Eval** useful for your research and applications, please cite using this BibTeX:

```bibtex
@misc{zhao2024codejudgeevallargelanguagemodels,
      title={CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?},
      author={Yuwei Zhao and Ziyang Luo and Yuchen Tian and Hongzhan Lin and Weixiang Yan and Annan Li and Jing Ma},
      year={2024},
      eprint={2408.10718},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2408.10718},
}
```
--------------------------------------------------------------------------------
/experiments.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CodeLLM-Research/CodeJudge-Eval/c168cd2702c3da3b7c8d08ec7a572e7d9f7f1fbf/experiments.png
--------------------------------------------------------------------------------
/logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CodeLLM-Research/CodeJudge-Eval/c168cd2702c3da3b7c8d08ec7a572e7d9f7f1fbf/logo.png
--------------------------------------------------------------------------------