🏆 EvalPlus Leaderboard 🏆
EvalPlus evaluates AI Coders with rigorous tests.
📝 Notes

- Samples are generated from scratch and are post-processed by our sanitizer script. We also run syntax checkers to avoid trivial syntactical errors.
- Models are ranked by pass@1 using greedy decoding (a minimal sketch of this computation follows the list). Setup details can be found here.
- Models labelled with ✨ are evaluated in an instruction/chat setting, while others perform direct code generation given the prompt.
- Both MBPP and MBPP+ referred to in our leaderboard use a subset (399 tasks) of hand-verified problems from MBPP-sanitized (427 tasks), to make sure each programming task is well-formed and unambiguous.
- It is the model providers' responsibility to avoid data contamination and to clarify the training data; we cannot verify whether evaluated models are contaminated.
- 💚 means the model is fully open-sourced, with both weights and training data available. 💙 means the weights and fine-tuning data are known, but the base model is trained on unknown data. Why does this matter? Because we know how 💚 models are trained (and, partially, how 💙 models are trained), we can reason about and measure their potential contamination.
- The "size" of Mixtral 8x7B models is counted as 13B, i.e., the compute actually required per inference.
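To make the ranking metric concrete, below is a minimal Python sketch (not EvalPlus's actual evaluation code) of how pass@1 works: with greedy decoding there is exactly one sample per task, so pass@1 is simply the fraction of tasks whose single sample passes every base and extra test. The `results` dictionary is hypothetical example data; the general `pass_at_k` estimator is the standard one from Chen et al. (2021) and is only needed when sampling multiple completions per task.

```python
from math import comb

# Hypothetical per-task outcomes: True if the single greedy sample
# passed all base and extra tests for that task.
results = {
    "HumanEval/0": True,
    "HumanEval/1": False,
    "HumanEval/2": True,
}

def pass_at_1(outcomes: dict) -> float:
    """With one greedy sample per task, pass@1 is just the pass rate."""
    return sum(outcomes.values()) / len(outcomes)

def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimator (Chen et al., 2021) for a task
    with n sampled completions, c of which pass. Not needed for greedy
    decoding, where n = k = 1 and the per-task estimate is simply 0 or 1."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(f"pass@1 = {pass_at_1(results):.1%}")  # -> pass@1 = 66.7%
```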
🤗 More Leaderboards

In addition to the EvalPlus leaderboards, we recommend assessing LLM coding ability comprehensively through a diverse set of benchmarks and leaderboards, such as:

- Big Code Models Leaderboard
- Chatbot Arena Leaderboard
- ClassEval Leaderboard
- CRUXEval Leaderboard
- InfiCoder-Eval
- TabbyML Leaderboard
153 |
154 |
155 |
156 |
120 |
121 |
122 |
123 |
124 |
125 |
126 |
127 | 📝 Notes
130 |131 |
-
132 |
- Samples are generated from scratch and are post-processed by our sanitizer script. We also run syntax checkers to avoid trivial syntactical errors. 133 |
- Models are ranked according to pass@1 using greedy decoding. Setup details can be found here. 134 |
- Models labelled with ✨ are evaluated using an instruction/chat setting, while others perform direct code generation given the prompt. 135 |
- Both MBPP and MBPP+ referred in our leaderboard use a subset (399 tasks) of hand-verified problems from MBPP-sanitized (427 tasks), to make sure the programming task is well-formed and unambiguous. 136 |
- It is the model providers' responsibility to avoid data contamination and clarify the training data, and we cannot guarantee if evaluated models are contaminated. 137 |
- 💚 means the model is fully open-sourced with weights and training data. 💙 means, while the weights & fine-tuning data is known, the base model is trained on unknown data. What does this mean? Because we know how 💚 models are trained (or partially trained for 💙 models) so we can somehow reason and measure its contamination. 138 |
- The "size" of Mixtral 8x7B models is regarded to be 13B, aka the actual required compute per inference. 139 |
🤗 More Leaderboards
144 | In addition to EvalPlus leaderboards, it is recommended to comprehensively understand LLM coding ability through a diverse set of benchmarks and leaderboards, such as: 145 |146 |
-
147 |
- Big Code Models Leaderboard 148 |
- Chatbot Arena Leaderboard 149 |
- ClassEval Leaderboard 150 |
- CRUXEval Leaderboard 151 |
- InfiCoder-Eval 152 |
- TabbyML Leaderboard 153 |