├── LICENSE ├── README.md ├── assets ├── best_annotation_practices.png ├── llm_gen.png ├── llm_logprob.png ├── llm_tk_1.png ├── lm_eval_diff.png └── sympy_doc.png ├── contents ├── automated-benchmarks │ ├── basics.md │ ├── designing-your-automatic-evaluation.md │ ├── some-evaluation-datasets.md │ └── tips-and-tricks.md ├── examples │ └── comparing_task_formulations.ipynb ├── general-knowledge │ ├── model-inference-and-evaluation.md │ └── tokenization.md ├── human-evaluation │ ├── basics.md │ ├── tips-and-tricks.md │ └── using-human-annotators.md ├── model-as-a-judge │ ├── basics.md │ ├── designing-your-evaluation-prompt.md │ ├── evaluating-your-evaluator.md │ ├── getting-a-judge-llm.md │ ├── tips-and-tricks.md │ └── what-about-reward-models.md └── troubleshooting │ ├── troubleshooting-inference.md │ ├── troubleshooting-math-parsing.md │ └── troubleshooting-reproducibility.md ├── resources ├── About NLP.md └── About evaluation.md └── translations ├── CONTRIBUTING.md └── zh └── contents ├── automated-benchmarks ├── basics.md ├── designing-your-automatic-evaluation.md ├── some-evaluation-datasets.md └── tips-and-tricks.md ├── examples └── comparing_task_formulations.ipynb ├── general-knowledge ├── model-inference-and-evaluation.md └── tokenization.md ├── human-evaluation ├── basics.md ├── tips-and-tricks.md └── using-human-annotators.md ├── model-as-a-judge ├── basics.md ├── designing-your-evaluation-prompt.md ├── evaluating-your-evaluator.md ├── getting-a-judge-llm.md ├── tips-and-tricks.md └── what-about-reward-models.md └── troubleshooting ├── troubleshooting-inference.md ├── troubleshooting-math-parsing.md └── troubleshooting-reproducibility.md /LICENSE: -------------------------------------------------------------------------------- 1 | Attribution-NonCommercial-ShareAlike 4.0 International 2 | 3 | ======================================================================= 4 | 5 | Creative Commons Corporation ("Creative Commons") is not a law firm and 6 | does not provide legal services or legal advice. Distribution of 7 | Creative Commons public licenses does not create a lawyer-client or 8 | other relationship. Creative Commons makes its licenses and related 9 | information available on an "as-is" basis. Creative Commons gives no 10 | warranties regarding its licenses, any material licensed under their 11 | terms and conditions, or any related information. Creative Commons 12 | disclaims all liability for damages resulting from their use to the 13 | fullest extent possible. 14 | 15 | Using Creative Commons Public Licenses 16 | 17 | Creative Commons public licenses provide a standard set of terms and 18 | conditions that creators and other rights holders may use to share 19 | original works of authorship and other material subject to copyright 20 | and certain other rights specified in the public license below. The 21 | following considerations are for informational purposes only, are not 22 | exhaustive, and do not form part of our licenses. 23 | 24 | Considerations for licensors: Our public licenses are 25 | intended for use by those authorized to give the public 26 | permission to use material in ways otherwise restricted by 27 | copyright and certain other rights. Our licenses are 28 | irrevocable. Licensors should read and understand the terms 29 | and conditions of the license they choose before applying it. 30 | Licensors should also secure all rights necessary before 31 | applying our licenses so that the public can reuse the 32 | material as expected. 
Licensors should clearly mark any 33 | material not subject to the license. This includes other CC- 34 | licensed material, or material used under an exception or 35 | limitation to copyright. More considerations for licensors: 36 | wiki.creativecommons.org/Considerations_for_licensors 37 | 38 | Considerations for the public: By using one of our public 39 | licenses, a licensor grants the public permission to use the 40 | licensed material under specified terms and conditions. If 41 | the licensor's permission is not necessary for any reason--for 42 | example, because of any applicable exception or limitation to 43 | copyright--then that use is not regulated by the license. Our 44 | licenses grant only permissions under copyright and certain 45 | other rights that a licensor has authority to grant. Use of 46 | the licensed material may still be restricted for other 47 | reasons, including because others have copyright or other 48 | rights in the material. A licensor may make special requests, 49 | such as asking that all changes be marked or described. 50 | Although not required by our licenses, you are encouraged to 51 | respect those requests where reasonable. More considerations 52 | for the public: 53 | wiki.creativecommons.org/Considerations_for_licensees 54 | 55 | ======================================================================= 56 | 57 | Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International 58 | Public License 59 | 60 | By exercising the Licensed Rights (defined below), You accept and agree 61 | to be bound by the terms and conditions of this Creative Commons 62 | Attribution-NonCommercial-ShareAlike 4.0 International Public License 63 | ("Public License"). To the extent this Public License may be 64 | interpreted as a contract, You are granted the Licensed Rights in 65 | consideration of Your acceptance of these terms and conditions, and the 66 | Licensor grants You such rights in consideration of benefits the 67 | Licensor receives from making the Licensed Material available under 68 | these terms and conditions. 69 | 70 | 71 | Section 1 -- Definitions. 72 | 73 | a. Adapted Material means material subject to Copyright and Similar 74 | Rights that is derived from or based upon the Licensed Material 75 | and in which the Licensed Material is translated, altered, 76 | arranged, transformed, or otherwise modified in a manner requiring 77 | permission under the Copyright and Similar Rights held by the 78 | Licensor. For purposes of this Public License, where the Licensed 79 | Material is a musical work, performance, or sound recording, 80 | Adapted Material is always produced where the Licensed Material is 81 | synched in timed relation with a moving image. 82 | 83 | b. Adapter's License means the license You apply to Your Copyright 84 | and Similar Rights in Your contributions to Adapted Material in 85 | accordance with the terms and conditions of this Public License. 86 | 87 | c. BY-NC-SA Compatible License means a license listed at 88 | creativecommons.org/compatiblelicenses, approved by Creative 89 | Commons as essentially the equivalent of this Public License. 90 | 91 | d. Copyright and Similar Rights means copyright and/or similar rights 92 | closely related to copyright including, without limitation, 93 | performance, broadcast, sound recording, and Sui Generis Database 94 | Rights, without regard to how the rights are labeled or 95 | categorized. 
For purposes of this Public License, the rights 96 | specified in Section 2(b)(1)-(2) are not Copyright and Similar 97 | Rights. 98 | 99 | e. Effective Technological Measures means those measures that, in the 100 | absence of proper authority, may not be circumvented under laws 101 | fulfilling obligations under Article 11 of the WIPO Copyright 102 | Treaty adopted on December 20, 1996, and/or similar international 103 | agreements. 104 | 105 | f. Exceptions and Limitations means fair use, fair dealing, and/or 106 | any other exception or limitation to Copyright and Similar Rights 107 | that applies to Your use of the Licensed Material. 108 | 109 | g. License Elements means the license attributes listed in the name 110 | of a Creative Commons Public License. The License Elements of this 111 | Public License are Attribution, NonCommercial, and ShareAlike. 112 | 113 | h. Licensed Material means the artistic or literary work, database, 114 | or other material to which the Licensor applied this Public 115 | License. 116 | 117 | i. Licensed Rights means the rights granted to You subject to the 118 | terms and conditions of this Public License, which are limited to 119 | all Copyright and Similar Rights that apply to Your use of the 120 | Licensed Material and that the Licensor has authority to license. 121 | 122 | j. Licensor means the individual(s) or entity(ies) granting rights 123 | under this Public License. 124 | 125 | k. NonCommercial means not primarily intended for or directed towards 126 | commercial advantage or monetary compensation. For purposes of 127 | this Public License, the exchange of the Licensed Material for 128 | other material subject to Copyright and Similar Rights by digital 129 | file-sharing or similar means is NonCommercial provided there is 130 | no payment of monetary compensation in connection with the 131 | exchange. 132 | 133 | l. Share means to provide material to the public by any means or 134 | process that requires permission under the Licensed Rights, such 135 | as reproduction, public display, public performance, distribution, 136 | dissemination, communication, or importation, and to make material 137 | available to the public including in ways that members of the 138 | public may access the material from a place and at a time 139 | individually chosen by them. 140 | 141 | m. Sui Generis Database Rights means rights other than copyright 142 | resulting from Directive 96/9/EC of the European Parliament and of 143 | the Council of 11 March 1996 on the legal protection of databases, 144 | as amended and/or succeeded, as well as other essentially 145 | equivalent rights anywhere in the world. 146 | 147 | n. You means the individual or entity exercising the Licensed Rights 148 | under this Public License. Your has a corresponding meaning. 149 | 150 | 151 | Section 2 -- Scope. 152 | 153 | a. License grant. 154 | 155 | 1. Subject to the terms and conditions of this Public License, 156 | the Licensor hereby grants You a worldwide, royalty-free, 157 | non-sublicensable, non-exclusive, irrevocable license to 158 | exercise the Licensed Rights in the Licensed Material to: 159 | 160 | a. reproduce and Share the Licensed Material, in whole or 161 | in part, for NonCommercial purposes only; and 162 | 163 | b. produce, reproduce, and Share Adapted Material for 164 | NonCommercial purposes only. 165 | 166 | 2. Exceptions and Limitations. 
For the avoidance of doubt, where 167 | Exceptions and Limitations apply to Your use, this Public 168 | License does not apply, and You do not need to comply with 169 | its terms and conditions. 170 | 171 | 3. Term. The term of this Public License is specified in Section 172 | 6(a). 173 | 174 | 4. Media and formats; technical modifications allowed. The 175 | Licensor authorizes You to exercise the Licensed Rights in 176 | all media and formats whether now known or hereafter created, 177 | and to make technical modifications necessary to do so. The 178 | Licensor waives and/or agrees not to assert any right or 179 | authority to forbid You from making technical modifications 180 | necessary to exercise the Licensed Rights, including 181 | technical modifications necessary to circumvent Effective 182 | Technological Measures. For purposes of this Public License, 183 | simply making modifications authorized by this Section 2(a) 184 | (4) never produces Adapted Material. 185 | 186 | 5. Downstream recipients. 187 | 188 | a. Offer from the Licensor -- Licensed Material. Every 189 | recipient of the Licensed Material automatically 190 | receives an offer from the Licensor to exercise the 191 | Licensed Rights under the terms and conditions of this 192 | Public License. 193 | 194 | b. Additional offer from the Licensor -- Adapted Material. 195 | Every recipient of Adapted Material from You 196 | automatically receives an offer from the Licensor to 197 | exercise the Licensed Rights in the Adapted Material 198 | under the conditions of the Adapter's License You apply. 199 | 200 | c. No downstream restrictions. You may not offer or impose 201 | any additional or different terms or conditions on, or 202 | apply any Effective Technological Measures to, the 203 | Licensed Material if doing so restricts exercise of the 204 | Licensed Rights by any recipient of the Licensed 205 | Material. 206 | 207 | 6. No endorsement. Nothing in this Public License constitutes or 208 | may be construed as permission to assert or imply that You 209 | are, or that Your use of the Licensed Material is, connected 210 | with, or sponsored, endorsed, or granted official status by, 211 | the Licensor or others designated to receive attribution as 212 | provided in Section 3(a)(1)(A)(i). 213 | 214 | b. Other rights. 215 | 216 | 1. Moral rights, such as the right of integrity, are not 217 | licensed under this Public License, nor are publicity, 218 | privacy, and/or other similar personality rights; however, to 219 | the extent possible, the Licensor waives and/or agrees not to 220 | assert any such rights held by the Licensor to the limited 221 | extent necessary to allow You to exercise the Licensed 222 | Rights, but not otherwise. 223 | 224 | 2. Patent and trademark rights are not licensed under this 225 | Public License. 226 | 227 | 3. To the extent possible, the Licensor waives any right to 228 | collect royalties from You for the exercise of the Licensed 229 | Rights, whether directly or through a collecting society 230 | under any voluntary or waivable statutory or compulsory 231 | licensing scheme. In all other cases the Licensor expressly 232 | reserves any right to collect such royalties, including when 233 | the Licensed Material is used other than for NonCommercial 234 | purposes. 235 | 236 | 237 | Section 3 -- License Conditions. 238 | 239 | Your exercise of the Licensed Rights is expressly made subject to the 240 | following conditions. 241 | 242 | a. Attribution. 243 | 244 | 1. 
If You Share the Licensed Material (including in modified 245 | form), You must: 246 | 247 | a. retain the following if it is supplied by the Licensor 248 | with the Licensed Material: 249 | 250 | i. identification of the creator(s) of the Licensed 251 | Material and any others designated to receive 252 | attribution, in any reasonable manner requested by 253 | the Licensor (including by pseudonym if 254 | designated); 255 | 256 | ii. a copyright notice; 257 | 258 | iii. a notice that refers to this Public License; 259 | 260 | iv. a notice that refers to the disclaimer of 261 | warranties; 262 | 263 | v. a URI or hyperlink to the Licensed Material to the 264 | extent reasonably practicable; 265 | 266 | b. indicate if You modified the Licensed Material and 267 | retain an indication of any previous modifications; and 268 | 269 | c. indicate the Licensed Material is licensed under this 270 | Public License, and include the text of, or the URI or 271 | hyperlink to, this Public License. 272 | 273 | 2. You may satisfy the conditions in Section 3(a)(1) in any 274 | reasonable manner based on the medium, means, and context in 275 | which You Share the Licensed Material. For example, it may be 276 | reasonable to satisfy the conditions by providing a URI or 277 | hyperlink to a resource that includes the required 278 | information. 279 | 3. If requested by the Licensor, You must remove any of the 280 | information required by Section 3(a)(1)(A) to the extent 281 | reasonably practicable. 282 | 283 | b. ShareAlike. 284 | 285 | In addition to the conditions in Section 3(a), if You Share 286 | Adapted Material You produce, the following conditions also apply. 287 | 288 | 1. The Adapter's License You apply must be a Creative Commons 289 | license with the same License Elements, this version or 290 | later, or a BY-NC-SA Compatible License. 291 | 292 | 2. You must include the text of, or the URI or hyperlink to, the 293 | Adapter's License You apply. You may satisfy this condition 294 | in any reasonable manner based on the medium, means, and 295 | context in which You Share Adapted Material. 296 | 297 | 3. You may not offer or impose any additional or different terms 298 | or conditions on, or apply any Effective Technological 299 | Measures to, Adapted Material that restrict exercise of the 300 | rights granted under the Adapter's License You apply. 301 | 302 | 303 | Section 4 -- Sui Generis Database Rights. 304 | 305 | Where the Licensed Rights include Sui Generis Database Rights that 306 | apply to Your use of the Licensed Material: 307 | 308 | a. for the avoidance of doubt, Section 2(a)(1) grants You the right 309 | to extract, reuse, reproduce, and Share all or a substantial 310 | portion of the contents of the database for NonCommercial purposes 311 | only; 312 | 313 | b. if You include all or a substantial portion of the database 314 | contents in a database in which You have Sui Generis Database 315 | Rights, then the database in which You have Sui Generis Database 316 | Rights (but not its individual contents) is Adapted Material, 317 | including for purposes of Section 3(b); and 318 | 319 | c. You must comply with the conditions in Section 3(a) if You Share 320 | all or a substantial portion of the contents of the database. 321 | 322 | For the avoidance of doubt, this Section 4 supplements and does not 323 | replace Your obligations under this Public License where the Licensed 324 | Rights include other Copyright and Similar Rights. 
325 | 326 | 327 | Section 5 -- Disclaimer of Warranties and Limitation of Liability. 328 | 329 | a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE 330 | EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS 331 | AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF 332 | ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS, 333 | IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION, 334 | WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR 335 | PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS, 336 | ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT 337 | KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT 338 | ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU. 339 | 340 | b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE 341 | TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION, 342 | NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT, 343 | INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES, 344 | COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR 345 | USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN 346 | ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR 347 | DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR 348 | IN PART, THIS LIMITATION MAY NOT APPLY TO YOU. 349 | 350 | c. The disclaimer of warranties and limitation of liability provided 351 | above shall be interpreted in a manner that, to the extent 352 | possible, most closely approximates an absolute disclaimer and 353 | waiver of all liability. 354 | 355 | 356 | Section 6 -- Term and Termination. 357 | 358 | a. This Public License applies for the term of the Copyright and 359 | Similar Rights licensed here. However, if You fail to comply with 360 | this Public License, then Your rights under this Public License 361 | terminate automatically. 362 | 363 | b. Where Your right to use the Licensed Material has terminated under 364 | Section 6(a), it reinstates: 365 | 366 | 1. automatically as of the date the violation is cured, provided 367 | it is cured within 30 days of Your discovery of the 368 | violation; or 369 | 370 | 2. upon express reinstatement by the Licensor. 371 | 372 | For the avoidance of doubt, this Section 6(b) does not affect any 373 | right the Licensor may have to seek remedies for Your violations 374 | of this Public License. 375 | 376 | c. For the avoidance of doubt, the Licensor may also offer the 377 | Licensed Material under separate terms or conditions or stop 378 | distributing the Licensed Material at any time; however, doing so 379 | will not terminate this Public License. 380 | 381 | d. Sections 1, 5, 6, 7, and 8 survive termination of this Public 382 | License. 383 | 384 | 385 | Section 7 -- Other Terms and Conditions. 386 | 387 | a. The Licensor shall not be bound by any additional or different 388 | terms or conditions communicated by You unless expressly agreed. 389 | 390 | b. Any arrangements, understandings, or agreements regarding the 391 | Licensed Material not stated herein are separate from and 392 | independent of the terms and conditions of this Public License. 393 | 394 | 395 | Section 8 -- Interpretation. 396 | 397 | a. 
For the avoidance of doubt, this Public License does not, and 398 | shall not be interpreted to, reduce, limit, restrict, or impose 399 | conditions on any use of the Licensed Material that could lawfully 400 | be made without permission under this Public License. 401 | 402 | b. To the extent possible, if any provision of this Public License is 403 | deemed unenforceable, it shall be automatically reformed to the 404 | minimum extent necessary to make it enforceable. If the provision 405 | cannot be reformed, it shall be severed from this Public License 406 | without affecting the enforceability of the remaining terms and 407 | conditions. 408 | 409 | c. No term or condition of this Public License will be waived and no 410 | failure to comply consented to unless expressly agreed to by the 411 | Licensor. 412 | 413 | d. Nothing in this Public License constitutes or may be interpreted 414 | as a limitation upon, or waiver of, any privileges and immunities 415 | that apply to the Licensor or You, including from the legal 416 | processes of any jurisdiction or authority. 417 | 418 | ======================================================================= 419 | 420 | Creative Commons is not a party to its public 421 | licenses. Notwithstanding, Creative Commons may elect to apply one of 422 | its public licenses to material it publishes and in those instances 423 | will be considered the “Licensor.” The text of the Creative Commons 424 | public licenses is dedicated to the public domain under the CC0 Public 425 | Domain Dedication. Except for the limited purpose of indicating that 426 | material is shared under a Creative Commons public license or as 427 | otherwise permitted by the Creative Commons policies published at 428 | creativecommons.org/policies, Creative Commons does not authorize the 429 | use of the trademark "Creative Commons" or any other trademark or logo 430 | of Creative Commons without its prior written consent including, 431 | without limitation, in connection with any unauthorized modifications 432 | to any of its public licenses or any other arrangements, 433 | understandings, or agreements concerning use of licensed material. For 434 | the avoidance of doubt, this paragraph does not form part of the 435 | public licenses. 436 | 437 | Creative Commons may be contacted at creativecommons.org. 438 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # The LLM Evaluation guidebook ⚖️ 2 | 3 | If you've ever wondered how to make sure an LLM performs well on your specific task, this guide is for you! 4 | It covers the different ways you can evaluate a model, guides on designing your own evaluations, and tips and tricks from practical experience. 5 | 6 | Whether working with production models, a researcher or a hobbyist, I hope you'll find what you need; and if not, open an issue (to suggest ameliorations or missing resources) and I'll complete the guide! 7 | 8 | ## How to read this guide 9 | - **Beginner user**: 10 | If you don't know anything about evaluation, you should start by the `Basics` sections in each chapter before diving deeper. 11 | You'll also find explanations to support you about important LLM topics in `General knowledge`: for example, how model inference works and what tokenization is. 12 | - **Advanced user**: 13 | The more practical sections are the `Tips and Tricks` ones, and `Troubleshooting` chapter. 
You'll also find interesting things in the `Designing` sections. 14 | 15 | In text, links prefixed by ⭐ are links I really enjoyed and recommend reading. 16 | 17 | ## Table of contents 18 | If you want an intro on the topic, you can read this [blog](https://huggingface.co/blog/clefourrier/llm-evaluation) on how and why we do evaluation! 19 | 20 | ### Automatic benchmarks 21 | - [Basics](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/automated-benchmarks/basics.md) 22 | - [Designing your automatic evaluation](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/automated-benchmarks/designing-your-automatic-evaluation.md) 23 | - [Some evaluation datasets](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/automated-benchmarks/some-evaluation-datasets.md) 24 | - [Tips and tricks](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/automated-benchmarks/tips-and-tricks.md) 25 | 26 | ### Human evaluation 27 | - [Basics](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/human-evaluation/basics.md) 28 | - [Using human annotators](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/human-evaluation/using-human-annotators.md) 29 | - [Tips and tricks](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/human-evaluation/tips-and-tricks.md) 30 | 31 | ### LLM-as-a-judge 32 | - [Basics](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/model-as-a-judge/basics.md) 33 | - [Getting a Judge-LLM](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/model-as-a-judge/getting-a-judge-llm.md) 34 | - [Designing your evaluation prompt](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/model-as-a-judge/designing-your-evaluation-prompt.md) 35 | - [Evaluating your evaluator](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/model-as-a-judge/evaluating-your-evaluator.md) 36 | - [What about reward models](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/model-as-a-judge/what-about-reward-models.md) 37 | - [Tips and tricks](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/model-as-a-judge/tips-and-tricks.md) 38 | 39 | ### Troubleshooting 40 | The most densely practical part of this guide. 41 | - [Troubleshooting inference](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/troubleshooting/troubleshooting-inference.md) 42 | - [Troubleshooting reproducibility](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/troubleshooting/troubleshooting-reproducibility.md) 43 | 44 | ### General knowledge 45 | These are mostly beginner guides to LLM basics, but will still contain some tips and cool references! 46 | If you're an advanced user, I suggest skimming to the `Going further` sections. 47 | - [Model inference and evaluation](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/general-knowledge/model-inference-and-evaluation.md) 48 | - [Tokenization](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/general-knowledge/tokenization.md) 49 | 50 | ### Examples 51 | You'll also find examples as jupyter notebooks, to get a more hands on experience of evaluation if that's how you learn! 
52 | - [Comparing task formulations during evaluation](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/examples/comparing_task_formulations.ipynb): This notebook walks you through how to define prompt variations for a single task, run the evaluations, and analyse the results. 53 | 54 | ## Planned next articles 55 | - contents/automated-benchmarks/Metrics -> Description of automatic metrics 56 | - contents/Introduction: Why do we need to do evaluation? 57 | - contents/Thinking about evaluation: What are the high level things you always need to consider when building your task? 58 | - contents/Troubleshooting/Troubleshooting ranking: Why comparing models is hard 59 | 60 | ## Resources 61 | Links I like 62 | - [About evaluation](https://github.com/huggingface/evaluation-guidebook/blob/main/resources/About%20evaluation.md) 63 | - [About NLP](https://github.com/huggingface/evaluation-guidebook/blob/main/resources/About%20NLP.md) 64 | 65 | ## Thanks 66 | This guide has been heavily inspired by the [ML Engineering Guidebook](https://github.com/stas00/ml-engineering) by Stas Bekman! Thanks for this cool resource! 67 | 68 | Many thanks also to all the people who inspired this guide through discussions either at events or online, notably and not limited to: 69 | - 🤝 Luca Soldaini, Kyle Lo and Ian Magnusson (Allen AI), Max Bartolo (Cohere), Kai Wu (Meta), Swyx and Alessio Fanelli (Latent Space Podcast), Hailey Schoelkopf (EleutherAI), Martin Signoux (Open AI), Moritz Hardt (Max Planck Institute), Ludwig Schmidt (Anthropic) 70 | - 🔥 community users of the Open LLM Leaderboard and lighteval, who often raised very interesting points in discussions 71 | - 🤗 people at Hugging Face, like Lewis Tunstall, Omar Sanseviero, Arthur Zucker, Hynek Kydlíček, Guilherme Penedo and Thom Wolf, 72 | - of course my team ❤️ doing evaluation and leaderboards, Nathan Habib and Alina Lozovskaya. 
73 | 
74 | ## Citation
75 | [![CC BY-NC-SA 4.0][cc-by-nc-sa-image]][cc-by-nc-sa]
76 | 
77 | [cc-by-nc-sa]: http://creativecommons.org/licenses/by-nc-sa/4.0/
78 | [cc-by-nc-sa-image]: https://licensebuttons.net/l/by-nc-sa/4.0/88x31.png
79 | [cc-by-nc-sa-shield]: https://img.shields.io/badge/License-CC-BY--NC--SA-4.0-lightgrey.svg
80 | 
81 | ```
82 | @misc{fourrier2024evaluation,
83 | author = {Clémentine Fourrier and The Hugging Face Community},
84 | title = {LLM Evaluation Guidebook},
85 | year = {2024},
86 | journal = {GitHub repository},
87 | url = {https://github.com/huggingface/evaluation-guidebook}
88 | }
89 | ```
90 | 
--------------------------------------------------------------------------------
/assets/best_annotation_practices.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/evaluation-guidebook/ff928479014580780b61535fc95d3b7cf0d667c3/assets/best_annotation_practices.png
--------------------------------------------------------------------------------
/assets/llm_gen.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/evaluation-guidebook/ff928479014580780b61535fc95d3b7cf0d667c3/assets/llm_gen.png
--------------------------------------------------------------------------------
/assets/llm_logprob.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/evaluation-guidebook/ff928479014580780b61535fc95d3b7cf0d667c3/assets/llm_logprob.png
--------------------------------------------------------------------------------
/assets/llm_tk_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/evaluation-guidebook/ff928479014580780b61535fc95d3b7cf0d667c3/assets/llm_tk_1.png
--------------------------------------------------------------------------------
/assets/lm_eval_diff.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/evaluation-guidebook/ff928479014580780b61535fc95d3b7cf0d667c3/assets/lm_eval_diff.png
--------------------------------------------------------------------------------
/assets/sympy_doc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/evaluation-guidebook/ff928479014580780b61535fc95d3b7cf0d667c3/assets/sympy_doc.png
--------------------------------------------------------------------------------
/contents/automated-benchmarks/basics.md:
--------------------------------------------------------------------------------
1 | # Basics
2 | 
3 | *Note: Some of this overlaps with [my general blog on evals](https://huggingface.co/blog/clefourrier/llm-evaluation)*
4 | ## What are automated benchmarks?
5 | 
6 | Automated benchmarks usually work the following way: you'd like to know how well your model performs on something. This something can be a well-defined, concrete **task**, such as `How well can my model classify spam from non-spam emails?`, or a more abstract and general **capability**, such as `How good is my model at math?`.
7 | 
8 | From this, you construct an evaluation, using:
9 | - a **dataset**, made of **samples**.
10 | - These samples contain an input for the model, sometimes coupled with a reference (called gold) to compare the model's output with.
11 | - Samples are usually designed to try to emulate what you want to test the model on: for example, if you are looking at email classification, you create a dataset of spam and non-spam emails, and try to include some hard edge cases, etc.
12 | - a **metric**.
13 | - The metric is a way to score your model.
14 | Example: how accurately your model can classify spam (score of a well-classified sample = 1, of a badly classified one = 0).
15 | - Metrics use your model's outputs to do this scoring. In the case of LLMs, people mostly consider two kinds of outputs:
16 | - the text generated by the model following the input (*generative evaluation*)
17 | - the log-probability of one or several sequences provided to the model (*multiple-choice evaluations*, sometimes called MCQA, or *perplexity evaluations*)
18 | - For more info on this, you should check out the [Model inference and evaluation](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/general-knowledge/model-inference-and-evaluation.md) page.
19 | 
20 | This is more interesting to do on data that the model has never been exposed to before (data absent from the model's training set), because you want to test whether it **generalizes** well. For example, whether it can classify spam emails about 'health' products after having seen only spam emails about fake banks.
21 | 
22 | Note: *A model which can only predict well on its training data (and has not latently learnt more high-level general patterns) is said to be **overfitting**. Much like a student who learned test questions by heart without understanding the topic, evaluating LLMs on data that was already present in their training set scores them on capabilities they do not possess.*
23 | 
24 | ## Pros and cons of using automated benchmarks
25 | Automated benchmarks have the following advantages:
26 | - **Consistency and reproducibility**: You can run the same automated benchmark 10 times on the same model and you'll get the same results (barring variations in hardware or inherent model randomness). This means that you can easily create fair rankings of models for a given task.
27 | - **Scale at limited cost**: They are one of the cheapest ways to evaluate models at the moment.
28 | - **Understandability**: Most automated metrics are very understandable.
29 | *Eg: an exact match will tell you if the generated text matches the reference perfectly, and an accuracy score will tell you in how many cases the selected choice was the correct one (this is a bit less the case for metrics such as `BLEU` or `ROUGE`, for example).*
30 | - **Dataset quality**: A number of automated benchmarks use expert-generated datasets or pre-existing high-quality data (like MMLU or MATH). However, this does not mean these datasets are perfect: for MMLU, several errors were identified in samples after the fact, from parsing issues to actually nonsensical questions, leading to the creation of several follow-up datasets, like MMLU-Pro and MMLU-Redux.
31 | 
32 | However, they also present the following limitations:
33 | - **Reduced use on more complex tasks**: Automated benchmarks work well for tasks where performance is easy to define and assess (for example, classification). More complex capabilities, on the other hand, are harder to decompose into well-defined and precise tasks.
34 | *Eg: what does "good at math" mean? Is it being good at arithmetic? - at logic? - able to reason about new mathematical concepts?*
35 | This led to the use of more **generalist** evaluations, which no longer decompose capabilities into sub-tasks, but instead assume that general performance will be a **good proxy** for what we aim to measure.
36 | - **Contamination**: Once a dataset is published publicly in plain text, it will end up in model training datasets. This means that you have no guarantee, when scoring a model, that it has not already seen the evaluation data during training.
37 | 
--------------------------------------------------------------------------------
/contents/automated-benchmarks/designing-your-automatic-evaluation.md:
--------------------------------------------------------------------------------
1 | # Designing your automatic evaluation
2 | 
3 | ## Choosing a dataset
4 | For your evaluation, you can either select an existing dataset (see [Some evaluation datasets](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/automated-benchmarks/some-evaluation-datasets.md) for examples) or design your own. Through this process, it's very important to keep in mind that **your evaluation result will only be as good as your evaluation dataset**.
5 | 
6 | ### Selecting an existing dataset
7 | You absolutely must look at its components.
8 | #### Creation process
9 | - **Who created the actual samples?**
10 | Imo, expert-created dataset > paid-annotator dataset ~ crowdsourced dataset > MTurked dataset.
11 | You also want to look for a data card, where you'll find annotator demographics - this can be important to understand the dataset's language diversity.
12 | 
13 | - **Were they all examined by other annotators or by the authors?**
14 | You want to know:
15 | - if the inter-annotator score on samples is high (= are annotators in agreement?)
16 | - and/or if the full dataset has been examined by the authors.
17 | This is especially important for datasets built with the help of underpaid annotators who usually are not native speakers of your target language (think AWS Mechanical Turk), as you might otherwise find typos/grammatical errors/nonsensical answers.
18 | 
19 | - **Were the annotators provided with clear data creation guidelines?**
20 | In other words, is your dataset consistent?
21 | 
22 | #### Samples
23 | Take 50 random samples and manually inspect them (see the short sketch below for one way to do this):
24 | - *For quality*:
25 | - are the prompts clear and unambiguous?
26 | - are the answers correct? (*Eg: TriviaQA contains several gold answers (aliases field) per question, sometimes conflicting.*)
27 | - is information missing? (*Eg: MMLU misses reference schematics in a number of questions.*)
28 | - *For relevance to your task*:
29 | - are these questions the kind of questions you want to evaluate an LLM on?
30 | - are these examples relevant to your use case?
31 | 
32 | You also want to know how many samples are present (to make sure results are statistically significant - 100 samples is usually a minimum for automatic benchmarks).
33 | ### Designing your own
34 | You can go 3 ways when designing your own dataset.
35 | #### Aggregating existing data
36 | You can aggregate existing data from different sources, evaluating a relevant capability for your task. A number of evaluation datasets are for example constructed by aggregating datasets originally used to evaluate humans (such as MATH, LSAT, etc). In this case, follow the steps above.
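To make the `Samples` checklist above easier to run, here is a minimal sketch of pulling 50 random samples for manual review, assuming a Hugging Face `datasets`-style dataset; the dataset name and field names are illustrative placeholders:

```python
from datasets import load_dataset

# Dataset name, split and field names are illustrative placeholders -
# point this at the dataset you are actually vetting.
dataset = load_dataset("your_org/your_eval_dataset", split="test")

# Pull 50 random samples for manual review.
samples = dataset.shuffle(seed=42).select(range(50))

for sample in samples:
    print("QUESTION:", sample["question"])
    print("GOLD:    ", sample["answer"])
    print("-" * 40)
```

Reading raw samples like this is usually the fastest way to catch ambiguous prompts, wrong gold answers, or formatting surprises before committing to a dataset.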
37 | #### Using human annotators
38 | There's a whole section on using human annotators in the `Human evaluation` chapter; see [Using human annotators](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/human-evaluation/using-human-annotators.md).
39 | #### Using synthetic data
40 | - **Using LLMs**
41 | On this, you can check the very cool [Cosmopedia](https://huggingface.co/blog/cosmopedia) blog by my HF colleagues! It's mostly studying how to create a synthetic training dataset, but similar techniques can be used for evaluation.
42 | Make sure to manually check/filter/inspect your dataset afterwards (following the above steps).
43 | 
44 | - **Using rule-based techniques**
45 | If your task allows, this is a very good way to get a virtually infinite supply of samples and avoid contamination!
46 | For some examples, you can look at [NPHardEval](https://arxiv.org/abs/2312.14890), [DyVal](https://arxiv.org/abs/2309.17167), [MuSR](https://arxiv.org/abs/2310.16049), [BabiQA](https://arxiv.org/abs/1502.05698), etc.
47 | 
48 | ## Choosing an inference method
49 | You'll need to choose what kind of inference method to use.
50 | 
51 | Using log-probabilities (MCQA, multiple-choice question answering) is very good for multiple-choice questions (usually to test model knowledge, or the ability to disambiguate).
52 | - Pros:
53 | - Makes sure that all models have access to the correct answer
54 | - Provides a proxy for model "confidence" (and calibration)
55 | - Fast to evaluate, especially when we ask the model to predict only one token (A/B/C/D, the indices of the choices, or Yes/No, etc).
56 | - Allows you to get signal on small models' task performance
57 | - Cons:
58 | - Slightly over-scores small models which would have generated something outside of the range of available choices if given free rein.
59 | - Some models [favor specific choices based on the order in which they have been presented](https://arxiv.org/abs/2309.03882), which could lead to unrepresentative evaluations
60 | 
61 | Using generations (QA, question answering) is very good for any task where you want to test fluency, reasoning, or the ability of your model to actually answer questions.
62 | - Pros:
63 | - Should actually correlate with the LLM's ability to generate fluent text, and will most of the time be what people are actually interested in
64 | - Cons:
65 | - Can be harder to score (see the `metrics` section below)
66 | - Usually slightly more expensive than log-likelihood evaluations, especially if they include sampling
67 | 
68 | ## Choosing a prompt
69 | The prompt is going to define:
70 | - how much information is given to your model about the task
71 | - how this information is presented to your model.
72 | 
73 | A prompt for a general MCQA or QA is usually made of some of the following:
74 | - a task prompt (optional): introduces your task.
75 | - a context: provides additional context for your question.
76 | - *Eg: For a summarization or information extraction task, you could provide a content source*
77 | - a question: the actual core of your prompt.
78 | - in the case of a multiple-choice evaluation, you can add options
79 | - connector words (`Question`, `Context`, `Choice`, ...)
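To make this concrete, here is a minimal sketch of how these components could be assembled into an MCQA prompt; the task prompt wording, connector words and answer labels are illustrative choices, not a standard:

```python
# A minimal sketch of assembling an MCQA prompt from the components listed above.
# The task prompt, connector words and answer labels are illustrative choices.
def build_prompt(question: str, choices: list[str], context: str = "") -> str:
    parts = ["The following are multiple choice questions (with answers)."]  # task prompt
    if context:
        parts.append(f"Context: {context}")
    parts.append(f"Question: {question}")
    for label, choice in zip("ABCD", choices):
        parts.append(f"{label}. {choice}")
    parts.append("Answer:")
    return "\n".join(parts)

print(build_prompt(
    "Which planet is known as the Red Planet?",
    ["Venus", "Mars", "Jupiter", "Saturn"],
))
```

The same skeleton extends naturally to few-shot prompting: prepend a few fully worked examples (question, options and gold answer) built with the same function before the actual sample.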
80 | 
81 | When defining your prompt, you need to be aware that:
82 | - even small changes in semantically equivalent prompts can make the results vary by quite a lot (see Section `Different prompt` in [Troubleshooting reproducibility](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/troubleshooting/troubleshooting-reproducibility.md)), and prompt formats might favor or penalize specific models
83 | - How to mitigate this:
84 | - A costly way is to re-run the evaluation several times with prompt variations
85 | - A less costly way is to run your evaluation once using a range of prompt formats allocated to different samples of equivalent difficulty
86 | - you can provide examples to your model to help it follow the expected format (using few-shot examples), and adding connector words helps with this overall
87 | - but models now tend to overfit specific prompt formats.
88 | - [This paper](https://arxiv.org/abs/2407.07890) is great on the topic, showing notably how some models can be over-evaluated because they have overfitted the test set **format**
89 | - On the Open LLM Leaderboard 2, we've notably observed that Llama 3.2 and Qwen 2.5 are no longer following the format of the prompt provided in a few-shot setup for this reason.
90 | - for a number of metrics, you want a very constrained generation or output.
91 | *You can learn more about this in the `Constraining model outputs` section of the [Model inference and evaluation](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/general-knowledge/model-inference-and-evaluation.md) page.*
92 | 
93 | ## Choosing a metric
94 | If you are looking at **log-probabilities**, your metrics are going to be easy: you'll want to look at accuracy (how often the most likely choice is the correct one). It's important to normalize it by length (by character or token count, or using pointwise mutual information). You could also look at perplexity, recall, or F1 score.
95 | 
96 | For **generative** evaluations, your range of metrics is going to be wider.
97 | You'll need to
98 | 1. decide if you compare generations as they are, or first normalize both generations and references.
99 | - Normalizations can easily [be unfair if not designed well](https://huggingface.co/blog/open-llm-leaderboard-drop), but overall they still provide signal at the task level.
100 | - They are very important for specific tasks, such as math evaluations, where you might want to extract your result from formatted outputs.
101 | - They will also be important if you want to evaluate with added mechanisms for accuracy, such as Chain of Thought, as you'll need to remove the reasoning trace from the actual result
102 | 2. decide how you compare the generation with the reference.
103 | You could use anything ranging from match-based metrics (exact match, prefix match, etc) to summarization and translation metrics (ROUGE, BLEU, character n-gram comparisons). For a list of existing metrics, you can look [here](https://github.com/huggingface/lighteval/wiki/Metric-List); I'll add a section later on which metric to use when.
104 | 
105 | More generally, when picking your metric, you need to keep in mind what your task is really about. For some domains (ex: medical, chatbots with public interaction), you don't want to measure the average performance, but need a way to evaluate the **worst performance** you'll get (on medical quality of output, on toxicity, etc). (*To go further, take a look at this [blog](https://ehudreiter.com/2024/07/10/challenges-in-evaluating-llms/)*)
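As a small illustration of the normalize-then-compare idea (and of reporting worst-case performance alongside the average), here is a minimal sketch; the normalization steps are illustrative choices, not a standard:

```python
import re
import string

def normalize(text: str) -> str:
    """Illustrative normalization: lowercase, strip punctuation, collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def quasi_exact_match(prediction: str, gold: str) -> int:
    """1 if the normalized prediction matches the normalized gold answer, else 0."""
    return int(normalize(prediction) == normalize(gold))

predictions = ["Paris!", "The answer is 42."]
golds = ["paris", "42"]

scores = [quasi_exact_match(p, g) for p, g in zip(predictions, golds)]
print("average score:", sum(scores) / len(scores))  # the usual headline number
print("worst case:   ", min(scores))                # what matters in high-stakes domains
```

Note that the second example scores 0 even though the right answer is present in the generation - which is exactly why the extraction and parsing steps discussed above matter so much for generative evaluations.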
106 | 
107 | ## Smart new tasks: what about functional testing?
108 | In the field of code, you want to evaluate generated programs not only on their surface form, but on their actual behavior. A good way to do so is therefore to check whether the code generated for a prompt correctly passes a suite of unit tests designed for the task.
109 | 
110 | This functional approach is extremely promising, as it
111 | - allows test cases to be generated more easily (in many cases, you can generate rule-based test cases)
112 | - therefore reduces overfitting
113 | - tests models on specific capabilities in action
114 | 
115 | However, it's an approach which requires creativity to be translated to text tasks!
116 | 
117 | A good example of this is IFEval, an evaluation benchmark which tests if models can follow instructions. It works by creating a number of formatting instructions (*Add this number of bullet points. Capitalize only one sentence.* etc), and strictly testing if the format is followed. More work is clearly needed to extend this idea to other analyzable features of text!
118 | 
--------------------------------------------------------------------------------
/contents/automated-benchmarks/tips-and-tricks.md:
--------------------------------------------------------------------------------
1 | # Tips and tricks
2 | 
3 | ## Managing contamination
4 | In general, you should assume that a dataset publicly available on the internet is or will be contaminated.
5 | 
6 | Solutions to mitigate this include:
7 | - providing a **canary string** in the evaluation set (like in [BigBench](https://github.com/google/BIG-bench)): it is a specific character combination that model creators can look for in their training sets, which would indicate that the training set contains evaluation data
8 | - providing evaluation sets in **[encrypted](https://arxiv.org/abs/2309.16575) or [gated](https://huggingface.co/datasets/Idavidrein/gpqa)** forms so that they can't be parsed easily by web crawlers - therefore not ending up accidentally in training sets
9 | - running [dynamic benchmarks](https://arxiv.org/abs/2104.14337): benchmarks regularly updated through time so that models can't "learn the answers by heart" (but this makes datasets more costly)
10 | - if you are running a benchmark, trying to [detect contamination](https://arxiv.org/abs/2311.06233) post hoc (for example, by looking at the generation perplexity or by designing adversarial versions of the prompts) - however, no method is a foolproof contamination detector
11 | 
12 | However, the fact that a dataset is contaminated does not mean it can't still be interesting and provide signal during training.
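To illustrate the post-hoc detection idea mentioned above, here is a minimal sketch of a naive n-gram overlap check between evaluation samples and training documents; the n-gram size is an illustrative choice, and as noted above, no overlap heuristic is foolproof:

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """All lowercased word n-grams of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def possibly_contaminated(eval_sample: str, training_docs: list[str], n: int = 13) -> bool:
    """Flags an evaluation sample if any of its n-grams appears verbatim in a training document."""
    sample_ngrams = ngrams(eval_sample, n)
    return any(sample_ngrams & ngrams(doc, n) for doc in training_docs)

# Toy usage with a small n so the overlap is visible.
train_docs = ["What is the capital of France? Answer: Paris. This sample leaked into the corpus."]
eval_samples = ["What is the capital of France? Answer: Paris."]
print([s for s in eval_samples if possibly_contaminated(s, train_docs, n=5)])
```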
13 | 
14 | ## Practical issues you might encounter
15 | 
16 | ### Fine-tuned models, system prompts and chat templates
17 | A number of instruction-tuned models are going to perform terribly if you do not make sure to:
18 | - add their system prompt at the very beginning of inference
19 | - prompt them using a chat template (usually adding `Assistant` and `User` prefixes to the dialogue turns - learn more about this in [this cool guide](https://huggingface.co/docs/transformers/main/en/chat_templating))
20 | 
21 | It's also very important not to assume that different tokenizers will behave the same, especially with respect to chat templates, as you can see in this cool picture about tokenization spacing and chat templates, from [this tweet](https://x.com/danielhanchen/status/1796952220619157694).
22 | 
23 | ![Spacing, tokenization and template](https://pbs.twimg.com/media/GPANfpiasAA9b6F?format=png&name=medium)
24 | 
25 | ### Tokenization
26 | 
27 | 1. **Tokenizing the context and choices together or separately**
28 | 
29 | When looking at an MCQA evaluation, in general, you want to tokenize the context together with the choices, as it creates a succession of tokens which is likely/natural for the model.
30 | 
31 | However, some tokenizers (like the [Llama one](https://github.com/EleutherAI/lm-evaluation-harness/pull/531#issuecomment-1595586257)) do not satisfy `enc(context + choice) = enc(context) + enc(choice)` (and add or remove spacing). This means that comparing the log-probabilities of the choices is not easy, as the context tokens can "bleed out" into them, messing up the comparison.
32 | 
33 | So if this is the case for your model, you might want to compute the tokens of context and choice separately and then concatenate them after removing the special start/end of sentence tokens which might have been added.
34 | 
35 | 2. **Paying attention to start and end of sentence tokens**
36 | 
37 | Some models, like the `Gemma` ones, are extremely sensitive to the [inclusion of start of sentence tokens](https://github.com/EleutherAI/lm-evaluation-harness/pull/1465) at inference. You might need to do a couple of experiments to see if that happens for you, and add these tokens manually when evaluating.
38 | 
39 | You can also encounter some issues where your model won't stop on an end of sentence token like you would expect (for example, on `\n`), because your model will not predict this token alone but as part of a higher-level token (for example, `\n\n`, which can be a single token, especially for code models). In this case, you might need to add a specific check to "backtrack" on generated text to make sure you're cutting your generated sentence at the proper spot before computing metrics.
40 | 
41 | 3. **Multilinguality and tokenization**
42 | 
43 | When looking at multilingual evaluations, you'll also need to think about how to tokenize your text, depending on your evaluation task and metrics. As some languages do not always use spacing as a word separator (Korean, Thai, Japanese, Chinese, to name a few), they will require language-specific tokenizers to be split properly, otherwise it will affect their scores on metrics such as [BLEU](https://github.com/EleutherAI/lm-evaluation-harness/issues/212), F1 scores, etc.
44 | 
45 | 4. **Code evaluations and end of sentence tokens**
46 | 
47 | Code models have usually been trained with `\n\t` as a single token. This means that when generating text, they will often generate `\n\t` in one step. A task which defines `\n` as an end of sentence token (= to stop the generation) will let the model continue generating after a `\n\t` (if predicted as one token), since it's not the same as `\n` - but you would actually still want the model to stop. In these cases, you either need to update your end of sentence tokens, or define a mechanism to backtrack on the character representation of the latest tokens to stop (and cut) the generation a posteriori.
48 | 
49 | ### Easy speed up for MCQA evaluations
50 | You can speed up your MCQA predictions by a lot if you make sure your model needs to predict only one token for the task.
51 | 
52 | This way, instead of running your `number_of_choices` predictions (`context + choice 1`, `context + choice 2`, etc), you can simply run inference on `context` and compute the probability distribution over the full vocabulary (which will include all your one-token choices) to get your log-probabilities of interest, and do this step in one pass.
53 | 
54 | (That's how we do it in `lighteval`).
55 | 
56 | ## Unexpectedly bad results on generative evaluations
57 | 
58 | The first thing to do is always to inspect your model generations in detail. Some frequent things to look for when troubleshooting are:
59 | - overly strict model output parsing (before computing the metric), which leads to the answer being lost
60 | - Fixing: adapt your parsing
61 | - inability of the models to follow your output format in few-shot settings (frequent in recent models trained with instruction data, like Llama 3.2 or Qwen 2.5)
62 | - Fixing: either adapt your prompt format, or just assume that models should be able to follow it in a few-shot setup
63 | - exceedingly verbose models which never get to the correct answer (more frequent in long-context models, and something we observed with Qwen and CommandR models)
64 | - Fixing: either increase the allowed context length, add instructions to be concise in the task prompt, or just assume that models should be able to answer succinctly
65 | 
66 | 
--------------------------------------------------------------------------------
/contents/general-knowledge/model-inference-and-evaluation.md:
--------------------------------------------------------------------------------
1 | # Model inference and evaluation
2 | 
3 | ## Introduction
4 | Current large language models work in a simple way: given some text as input, they have learned to predict a plausible follow-up.
5 | 
6 | This is done in two steps.
7 | ### Tokenization
8 | The input text (called a *prompt* at inference) is first split into *tokens*, small units of text (which can be one or several characters, up to the word level), each associated with a number. The whole range of tokens a model can parse is called its *vocabulary*. *(To understand this more in depth, go read the [Tokenization](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/general-knowledge/tokenization.md) page)*.
9 | 
10 | ### Prediction
11 | 
12 | ![](https://github.com/huggingface/evaluation-guidebook/blob/main/assets/llm_tk_1.png?raw=true)
13 | 
14 | From this input text, the LLM generates a probability distribution for the most likely next token over the whole vocabulary. To get a continued generation, we can take the most probable token (give or take some added randomness to get more interesting outputs) as the next one, then repeat the operation, appending the new token to the end of the prompt, etc.
15 | 
16 | ## What do you want to predict?
17 | LLM evaluations mostly fall into 2 categories:
18 | - Given a prompt and one (or several) answers, what is the probability of said answer(s) according to my model?
19 | - Given a prompt, what text does my model generate?
20 | ### Log-likelihood evaluations
21 | For log-likelihood evaluations, we want the conditional probability of one or several choices given a prompt - in other words, how likely is a specific continuation given an input?
22 | So:
23 | - we concatenate each choice with the prompt, and pass them to our LLM, which outputs the logits of each token depending on the previous ones
24 | - we only keep the last logits (associated with the choice tokens), and apply a log softmax to get log-probabilities (where the range is `[-inf, 0]` instead of `[0, 1]`)
25 | - we then sum all the individual tokens' log-probabilities to get the overall choice log-probability
26 | - we can finally apply a normalization based on choice length
27 | 
28 | ![](https://github.com/huggingface/evaluation-guidebook/blob/main/assets/llm_logprob.png?raw=true)
29 | 
30 | This allows us to apply one of the following metrics:
31 | - get the preferred answer of a model among several choices, like in the above picture. (*However, this can inflate the scores of models which, if generating freely, would have produced something else, like `Zygote` in the picture.*)
32 | - test if a single choice has a probability above 0.5
33 | - study model calibration. A well-calibrated model is a model for which the correct answers have the highest probabilities.
34 | *(To learn more about calibration, you can check [this paper](https://arxiv.org/abs/2207.05221) from Anthropic, on what it is, how to detect it, and how to train models to be well calibrated, and [this paper](https://arxiv.org/abs/2311.14648) on some possible limits of calibration).*
35 | 
36 | ### Generative evaluations
37 | For a generative evaluation, we want the text generated by the model given an input prompt.
38 | 
39 | It is obtained in an auto-regressive way: we pass the prompt to the model, look at the most likely next token, select it as the first token of the model's answer, then repeat until we reach an end of generation condition (maximum length, special token to stop the generation, etc). All the tokens generated by the model are considered its answer to the prompt.
40 | 
41 | ![](https://github.com/huggingface/evaluation-guidebook/blob/main/assets/llm_gen.png?raw=true)
42 | 
43 | 
44 | 
45 | We can then compare this generation with references and score the distance between the two (using either simple metrics like exact match, more complex metrics like BLEU, or models as judges).
46 | 
47 | ### Going further
48 | - ⭐ [Blog on several ways to evaluate MMLU](https://huggingface.co/blog/open-llm-leaderboard-mmlu), by my team at Hugging Face. I recommend reading it if you want to delve deeper into the differences between multiple-choice log-likelihood evaluations and generative ones, including what it can mean with respect to score changes
49 | - The above illustrations come from the blog and were made by Thom Wolf
50 | - ⭐ [A beautiful mathematical formalization of the above inference methods](https://arxiv.org/abs/2405.14782v2), from EleutherAI. Go to the Appendix directly.
51 | ## Constraining model outputs
52 | In a number of cases, we want the model output to follow a specific format, for example to compare it to a reference.
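Before looking at ways to constrain generated outputs, here is a minimal sketch of the log-likelihood scoring loop described in the previous section, assuming a Hugging Face `transformers` causal LM; the model name, prompt and choices are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model name is illustrative; any causal LM works the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def choice_logprob(prompt: str, choice: str) -> float:
    """Sum of the log-probabilities of the choice tokens, conditioned on the prompt.

    Simplified sketch: it assumes enc(prompt + choice) splits cleanly into prompt
    tokens followed by choice tokens (see the tokenization caveats in the automated
    benchmarks Tips and tricks page for when this breaks).
    """
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits  # [1, seq_len, vocab_size]
    # Logits at position i predict token i + 1, hence the shift below.
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    per_token = logprobs[torch.arange(targets.shape[0]), targets]
    n_choice_tokens = full_ids.shape[1] - prompt_len
    # Divide by n_choice_tokens here if you want a length-normalized score.
    return per_token[-n_choice_tokens:].sum().item()

prompt = "Question: What is the capital of France?\nAnswer:"
choices = [" Paris", " London", " Berlin"]
scores = {choice: choice_logprob(prompt, choice) for choice in choices}
print(max(scores, key=scores.get))
```

The highest-scoring choice is taken as the model's answer, which is exactly the "preferred answer among several choices" metric described above.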
53 | ### Using a prompt 54 | The easiest way to do this is to add a task prompt which contains very specific instructions as to how the model should answer (`Provide numerical answers in digits.`,`Use no abbreviation.`, etc). 55 | 56 | It won't necessarily work all the time but should be good enough for high capability models. That's the approach we followed in the [GAIA](https://huggingface.co/papers/2311.12983) paper, and you can find our task prompt in the Submission tab of the [leaderboard](https://huggingface.co/spaces/gaia-benchmark/leaderboard) if you want some inspiration. 57 | ### Few shots and in context learning 58 | The next way to do so is to constrain the model through what is called "in context learning". By providing examples in the prompt (what is called `few-shot prompting`), the model is implicitly biased towards following the repeated prompt shape for the actual sample. 59 | 60 | It's a method which was overall working quite well until end of 2023! However, the widespread adoption of instruction-tuning methods and the addition of instruction data in later stages of model pre-training (continuous pre-training) seem to have biased more recent models towards specific output formats (what is being called [here](https://arxiv.org/abs/2407.07890) `Training on the test task`, and what I would call `overfitting the prompt format`). It's also a method which can be limited for older models with smaller context sizes, as some few-shot examples can not fit into the context window. 61 | ### Structured text generation 62 | Structured text generation constrains the outputs to follow a given path, defined by a grammar or by regular expressions, for example. The `outlines` library implements this using finite state machines, which is very neat. (Other approaches exist, such as using interleaved generation for json generation, but the FSM one is my favorite). 63 | 64 | To understand more about what happens when using structured generation, you can check the [blog](https://huggingface.co/blog/evaluation-structured-outputs) we wrote together: structured generation reduce prompt variance in evaluation, and make results and rankings more stable. You can also check the overall `outlines` [blog](https://blog.dottxt.co/) for interesting implementations and observations linked to structured generation. 65 | 66 | However, some recent [research](https://arxiv.org/abs/2408.02442) seems to show that structured generation can lower model performance on some tasks (like reasoning), by moving the prior too far away from the expected probability distribution. 67 | 68 | ### Going further 69 | - ⭐ [Understanding how Finite State Machine when using structured generation](https://blog.dottxt.co/coalescence.html), by Outlines. Super clear guide on how their method works! 70 | - [The outlines method paper](https://arxiv.org/abs/2307.09702), a more academic explanation of the above 71 | - [Interleaved generation](https://github.com/guidance-ai/guidance?tab=readme-ov-file#guidance-acceleration), another method to constrain generations for some specific output formats 72 | -------------------------------------------------------------------------------- /contents/general-knowledge/tokenization.md: -------------------------------------------------------------------------------- 1 | # Tokenization 2 | 3 | ## Why and how do we tokenize text? 4 | Since large language models are actually big mathematical functions, they eat numbers, not text. 5 | 6 | Say you want to transform a sentence to numbers. 
You first need to decide how to cut your sentence into small pieces, then map every small piece to a number; this is *tokenization*. 7 | 8 | In the past, people would try to map each character of a text with its index in a alphabet (`a` -> 1, `b` -> 2, etc) which is called *character based tokenization* (you split between characters). On the other end of the spectrum, people also tried to map each word with its index in a dictionary (`a` -> 1, `aardvark` -> 2, `ab` -> 3, etc) which is called *word based tokenization* (you split on spaces, if your language has spaces - if not, it's a bit harder). 9 | 10 | Both these methods share a strong limitation: they remove information from the input text. They erase semantic connections that you can see from word shape (ex: `dis similar`, `similar`, `similar ity`, `similar ly`), information we would like our model to retain, so it connects related words together. 11 | (Plus, what happens if you suddenly have a completely new word in input? It gets no number, and your model can't process it 😔 ) 12 | 13 | Some people therefore had the idea to cut words into sub-words, and assign index to these sub-words (`dis`, `similar`, `ity`, `ly`)! 14 | 15 | This was initially done using morpho-syntactic rules ("morpho-syntax" is like the grammar of word creation). Now most people use byte pair encoding (BPE), a smart statistical method to create the sub-words automatically depending on their frequency in a reference text. 16 | 17 | So as a summary: tokenization is a way to map small units of texts (which can be one or several characters, up to the word level) to numbers (similar to an index). When you want to process text, your input text (called a *prompt* at inference) is split into these *tokens* by a tokenizer. The whole range of tokens a model or tokenizer can parse is called its *vocabulary*. 18 | #### Going further: Understanding tokenization 19 | I advise reading one of the first 2 links in depth. 20 | - ⭐ [Explanation of different tokenization methods in the 🤗 NLP Course](https://huggingface.co/learn/nlp-course/en/chapter2/4) 21 | - ⭐ [Conceptual guide about tokenization in the 🤗 doc](https://huggingface.co/docs/transformers/en/tokenizer_summary) 22 | - [Course by Jurafsky on tokenization (and other things)](https://web.stanford.edu/~jurafsky/slp3/2.pdf) - more academical in its approach, skip to 2.5 and 2.6 (the rest is interesting too but too broad) 23 | 24 | #### Going further: Byte Pair Encoding 25 | - ⭐ [Explanation of BPE in the 🤗 NLP Course](https://huggingface.co/learn/nlp-course/en/chapter6/5) 26 | - [Paper introducing BPE to NLP](https://aclanthology.org/P16-1162/) 27 | 28 | 29 | ## Some of the many problems of tokenizations 30 | ### Choosing the correct vocabulary size 31 | The size of the vocabulary indicates how many individual tokens (for example, sub-words) the model will have to learn. 32 | 33 | A vocabulary which is **too big** might contain some very rare words as full tokens (for example: `aardvark`), which can lead to 2 problems. 34 | 35 | If such a rare word almost never appears in the training data, it can be hard to connect to other concepts, and the model might be unable to infer what it is about. 36 | 37 | On the other hand, if it appears rarely and only in specific contexts, it can be linked to some very specific other words: for example, if you train on forum data, and your tokenizer mapped a username as one single token in its vocabulary, your model might then associate this token to the specific user's content. 
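If you want to see these splits for yourself, here is a minimal sketch using `transformers` (the GPT-2 tokenizer is just an example; the exact splits depend on which tokenizer you load):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any pretrained tokenizer works

for word in ["similarly", "dissimilar", "aardvark"]:
    # tokenize() returns the sub-word pieces the model will actually see
    print(word, "->", tokenizer.tokenize(word))

# Frequent sub-words tend to survive as single tokens, while rarer words
# get broken into more (and less meaningful) pieces.
```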
38 | 39 | A vocabulary which is **too small** will present 2 other problems: worst representation capabilities, and increased cost at inference. 40 | 41 | Let's go back to our above example, where we tokenized words derived from `similar`. Using a pseudo BPE approach (large vocabulary) to tokenize `similarly` has split the word into 2 tokens (`similar`, `ly`). If we had used instead character level tokenization (therefore with a very small vocabulary, the size of an alphabet), the same word would be cut into 9 tokens (`s`, `i`, `m`, `i`, `l`, `a`, `r`, `l`, `y`). 42 | 43 | Where the first method splits `similarly` into tokens which have an individual semantic meaning, it's not the case in the second method: with too small a vocabulary, we lost some semantic representation. The difference in representations length also means that it's many times as costly to generate our word with a smaller vocabulary (takes 9 tokens instead of 2, so 5 times more costly!). 44 | 45 | At the moment, most people seem to use heuristics for vocabulary size, which seems correlated to number of languages covered and model size, so it's likely that using a number of tokens close to the reference models of a similar size could work for you. 46 | #### Going further: Rare tokens effect 47 | - [SolidGoldMagikarp post on Less Wrong](https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation) 48 | - Very interesting read on how some people identified very rare tokens in Open AI's vocabulary - this is quite cool because it's done without access to the model's internals (we don't know what the training data contains for example) 49 | - [Fishing for Magikarp, paper by Cohere](https://arxiv.org/abs/2405.05417) 50 | - Follow up work on to detect these tokens 51 | 52 | ### Managing several languages 53 | (Recommended: read an explanation of BPE before this section) 54 | When building or choosing your tokenizer, you construct your vocabulary from reference text. This means that your tokenizer will know vocabulary words and characters from this reference text. Usually, it means using data in English, with a Latin script. 55 | 56 | If you want to add new language, and your new language uses the same script and share some roots, you could theoretically hope that some of your original language semantics transfer to the new language. 57 | 58 | However, if you want to allow your tokenizer to correctly split text in other languages (especially languages written in other scripts) you'd better include data from these languages when building said tokenizer. Most of the time, though, this data will contain an unbalanced proportion of the initial language (ex: English) to the new language (ex: Thai, or Burmese), the initial language being much more present. Since most efficient tokenizer methods used nowadays (like BPE) create their complex vocabulary tokens based on the most frequent words seen, most of the long tokens will be English words - and most of the words from the less frequent languages will only be split at the character level. 59 | 60 | This effect leads to an unfairness in multilingual tokenization: some (less frequent, or *lower-resourced*) languages require orders of magnitude more tokens to generate a sentence of equivalent length as English. 
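You can measure this effect directly on the tokenizer you plan to use by counting tokens for rough translations of the same sentence; here is a minimal sketch with `transformers` (GPT-2 is used purely as an example of an English-centric tokenizer):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # swap in your model's tokenizer

sentences = {
    "English": "I would like a cup of tea, please.",
    "French": "Je voudrais une tasse de thé, s'il vous plaît.",
    "Thai": "ขอชาหนึ่งถ้วยหน่อยค่ะ",  # rough translation of the same request
}

for language, sentence in sentences.items():
    n_tokens = len(tokenizer(sentence)["input_ids"])
    print(f"{language}: {n_tokens} tokens")

# English-centric tokenizers typically need far more tokens for non-Latin scripts,
# which means higher inference cost and less usable context for those languages.
```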
61 | 62 | #### Going further: Language and tokenization 63 | - ⭐ [A beautiful breakdown and demo by Yennie Jun on tokenization issues across languages](https://www.artfish.ai/p/all-languages-are-not-created-tokenized) 64 | - The breakdown in itself is very clear, and it's worth playing around with the [demo space](https://huggingface.co/spaces/yenniejun/tokenizers-languages) 65 | - ⭐ [A demo by Aleksandar Petrov on unfairness of tokenization](https://aleksandarpetrov.github.io/tokenization-fairness/) 66 | - I recommend looking at `Compare tokenization of sentences` to get a feel for the differences in cost of inference depending on languages 67 | 68 | ### What about numbers? 69 | When building your tokenizer, you need to decide what to do about numbers. Do you only index 0 to 9, and assume all other numbers will be compositions of digits, or do you want to store numbers up to, say, one billion, individually? Current well known models display a range of approaches to this, but it's unclear what works better to allow mathematical reasoning. Maybe new approaches to tokenization, such as hierarchical tokenization, might be needed for this. 70 | #### Going further: Number tokenization 71 | - ⭐ [A nice visual demo by Yennie Jun of how tokenizers of Anthropic, Meta, OpenAI, and Mistral models split numbers](https://www.artfish.ai/p/how-would-you-tokenize-or-break-down) 72 | - [Small history by Beren Millidge of the evolution of number tokenization through the years](https://www.beren.io/2024-05-11-Integer-tokenization-is-now-much-less-insane/) 73 | -------------------------------------------------------------------------------- /contents/human-evaluation/basics.md: -------------------------------------------------------------------------------- 1 | # Basics 2 | 3 | ## What is human evaluation? 4 | Human evaluation is simply asking humans to evaluate models. 5 | In this document, we'll look at post-hoc evaluation: your model has been trained, you have a given task in mind, and humans are providing scores. 6 | 7 | ### Systematic evaluation 8 | There are 3 main ways to do this in a systematic manner. 9 | 10 | If **you don't have a dataset**, but want to explore a set of capabilities, you provide humans with a task and scoring guidelines (eg: `try to make both these model output toxic language; a model gets 0 if it was toxic, 1 if it was not`), and access to one (or several) model(s) that they can interact with, then ask to provide their scores and reasoning. 11 | 12 | If **you already have a dataset** (eg: `a set of prompts that you want to make sure your model will not answer`), you prompt your model with them, and provide the prompt, output and scoring guidelines to humans (`the model gets 0 if it answers with private information, 1 otherwise`). 13 | 14 | Lastly, if **you already have a dataset and scores**, you can ask humans to review your evaluation method by doing [error annotation](https://ehudreiter.com/2022/06/01/error-annotations-to-evaluate/) (*it can also be used as a scoring system in the above category*). It's a very important step of testing new evaluation system, but it technically falls under evaluating an evaluation, so it's slightly out of scope here. 15 | 16 | Notes: 17 | - *For evaluation of already deployed production models, you can also ask users for feedback, and do A/B testing then.* 18 | - *[AI audits](https://arxiv.org/abs/2401.14462) (external systematic evaluation of models) are usually human based, but out of scope for this document. 
19 | 20 | ### Casual evaluation 21 | Two other approaches exist to do human-based evaluation, in a more casual way. 22 | 23 | **Vibes-checks** are manual evaluations done by individuals, usually on undisclosed prompts, to get an overall feeling of how well models perform on many use cases (from coding to quality of smut written). Often shared on Twitter and Reddit, results mostly constitute anecdotal evidence, and tend to be highly sensitive to confirmation bias (in other words, people tend to find what they look for). However, they can be [good starting point for your own use cases](https://olshansky.substack.com/p/vibe-checks-are-all-you-need). 24 | 25 | **Arenas** are crowdsourced human evaluation to rank models. 26 | A well known example of this is the [LMSYS chatbot arena](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard), where community users are asked to chat with models until they find one is better than the other. Votes are then aggregated in an Elo ranking (a ranking of matches) to select which model is "the best". 27 | ## Pros and cons of human evaluation 28 | 29 | Human evaluation is very interesting for the following reasons: 30 | - **Flexibility**: If you define clearly enough what you are evaluating, you can get scores for about anything! 31 | - **Absence of contamination**: If you ask humans to write new questions to test your system, they should not be present in your training data (hopefully) 32 | - **Correlation with human preference**: That one is quite obvious, since that's what you're using to score. 33 | *Note: However, when doing evaluation with humans, you need to make sure your annotators are diverse enough that your results generalizes.* 34 | 35 | However, it also present a number of limitations: 36 | - **First impressions bias**: Human evaluators tend to estimate the quality of answers [based on first impressions](https://arxiv.org/abs/2309.16349), instead of actual factuality or faithfulness. 37 | - **Tone bias**: Crowdsourced annotators are notably very sensitive to tone, and underestimate the number of factual or logical errors in an assertive answer. In other terms, if a model says wrong things in a confident tone, human evaluators are much less likely to notice it, which could skew ratings towards the more assertive models. (Expert annotators are less likely to fall prey to these biases.) 38 | - **Self-preference bias**: Humans are [most likely to prefer answers which appeal to their views or align with their opinions or errors](https://arxiv.org/abs/2310.13548), rather than answers which are factually correct. 39 | - **Identity bias**: People with different identities tend to have different values, and rate model answers very differently (for example on [toxicity](https://arxiv.org/abs/2205.00501)) 40 | ### Systematic human evaluation 41 | Pros of systematic human evaluations, especially with paid annotators, are 42 | - **Getting high quality data** adapted to your use case, that you will be able to build on later (if you need to develop preference models for example) 43 | - **Data privacy**: If you rely on paid human annotators, especially if in-house, your datasets should be relatively safe, whereas using LLM-evalution with closed source API models presents less guarantee on what happens to your data, since you send it to an external service. 44 | - **Explainability**: Scores obtained by the models will be explainable by the humans who annotated them. 
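Coming back to the arena-style rankings mentioned earlier, here is a minimal sketch of how pairwise votes can be aggregated into an Elo-style rating; this is a simplification (real arenas use more robust estimators, such as Bradley-Terry fits with confidence intervals):

```python
# Simplified Elo update for arena-style pairwise votes (illustrative only).
def expected_score(rating_a: float, rating_b: float) -> float:
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    expected_a = expected_score(rating_a, rating_b)
    actual_a = 1.0 if a_wins else 0.0
    rating_a += k * (actual_a - expected_a)
    rating_b += k * ((1.0 - actual_a) - (1.0 - expected_a))
    return rating_a, rating_b

ratings = {"model_a": 1000.0, "model_b": 1000.0}
for a_wins in [True, True, False, True]:  # True = the voter preferred model_a
    ratings["model_a"], ratings["model_b"] = update(ratings["model_a"], ratings["model_b"], a_wins)

print(ratings)  # model_a ends up with the higher rating
```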
45 | 46 | Systematic human evaluations present some added issues: 47 | - **Cost**: If you pay your annotators correctly, this can get expensive fast. It's also likely you'll need rounds of iterative evaluation so that you can refine your guidelines, which adds to the cost. 48 | - **Un-scalability**: Unless you are evaluating a production like system with user feedback, human evaluations are not really scalable, as each new round requires mobilizing new evaluators (and paying them). 49 | - **Lack of reproducibility**: Unless you keep the exact same annotators continuously and your guidelines are perfectly unambiguous, it's likely some evaluations are going to be hard to reproduce precisely. 50 | 51 | ### Casual human evaluation 52 | Pros of casual human evaluations are: 53 | - **Lesser cost**: since you rely on your crowd's good will 54 | - **Edge case discovery**: since you leverage user's creativity in a mostly unbounded manner, you can discover interesting edge cases 55 | - **Better scalability**: as long as you have many interested and willing participants, casual human evaluation scales better and has a lower entry cost 56 | 57 | The obvious problems of casual approaches (without annotator selection) are: 58 | - **High subjectivity**: it's hard to enforce a consistent grading from many community members using broad guidelines, especially since annotators preferences tend to be [culturally bound](https://arxiv.org/abs/2404.16019v1). One can hope that these effect is smoothed over by the sheer scale of the votes, through a "wisdom of the crowd" effect (see Galton's wikipedia page). 59 | - **Unrepresentative preference ranking**: since young western men are over re-represented on tech-sides of the internet, it can lead to very skewed preferences, mismatched to those of the general population, both in terms of topics explored and overall rankings. 60 | - **Easy to game**: if you're using unfiltered crowdsourced annotators, it's quite easy for a 3rd party to game your evaluation, for example to raise the score of a given model (since a number of models have a distinctive writing style) 61 | -------------------------------------------------------------------------------- /contents/human-evaluation/tips-and-tricks.md: -------------------------------------------------------------------------------- 1 | # Tips and tricks 2 | Here are a few practical tips you might want consider when using human annotators to build an evaluation dataset. If you haven't done so yet, we recommend reading first the page on "Using human annotators" and then come back to this page. 3 | 4 | ## Designing the task 5 | 6 | - **Simple is better**: Annotation tasks can get unnecessarily complex, so keep it as simple as possible. Keeping the cognitive load of the annotators to a minimum will help you ensure that they stay focused and make annotations of a higher quality. 7 | 8 | - **Check what you show**: Only show the necessary information for annotators to complete the task and make sure you don't include anything that could introduce extra bias. 9 | 10 | - **Consider your annotators time**: Where and how things are displayed can introduce extra work or cognitive load and therefore negatively impact in the quality of results. For example, make sure that the texts and the task are visible together and avoid unnecessary scrolling. If you combine tasks and the result of one informs the other, you can display them sequentially. 
Think about how everything is displayed in your annotation tool and see if there's any way you can simplify even more. 11 | 12 | - **Test the setup**: Once you have your task designed and some guidelines in place, make sure you test it yourself on a few samples before involving the whole team, and iterate as needed. 13 | 14 | ## During the annotation 15 | 16 | - **Annotators should work independently**: It's better if annotators don't help each other or see each other's work during the task, as they can propagate their own biases and cause annotation drift. Alignment should always happen through comprehensive guidelines. You may want to train any new team members first on a separate dataset and/or use inter-annotator agreement metrics to make sure the team is aligned. 17 | 18 | - **Consistency is key**: If you make important changes to your guidelines (e.g., changed a definition or instruction, or have added/removed labels), consider if you need to iterate over the annotated data. At least, you should track the changes in your dataset through a metadata value like `guidelines-v1`. 19 | 20 | ## Hybrid human-machine annotation 21 | 22 | Sometimes teams face contraints on time and resources but don't want to sacrifice on the pros of human evaluation. In these cases, you may use the help of models to make the task more efficient. 23 | 24 | - **Model-aided annotation**: You may use the predictions or generations of a model as pre-annotations, so that the annotation team doesn't need to start from scratch. Just note that this could introduce the model's biases into human annotations, and that if the model's accuracy is poor it may increase work for annotators. 25 | 26 | - **Supervise model as a judge**: You can combine the power of the model as a judge methodology (see the section on "Model as a judge") and human supervisors who validate or discard the results. Note that the biases discussed in the "Pros and cons of human evaluation" will apply here. 27 | 28 | - **Idenfity edge cases**: For an even faster task, use a jury of models and then have your human supervisor(s) step in where models disagree or there's a tie to break. Again, be aware of the biases discussed in the "Pros and cons of human evaluation". 29 | 30 | ## End to end tutorial 31 | 32 | To build you own custom evaluation setup following these tips, you can follow this [practical tutorial](https://github.com/argilla-io/argilla-cookbook/tree/main/domain-eval) from Argilla. It guides you through building a custom evaluation task for your domain, using synthetic data and manual evaluation with [Argilla](https://github.com/argilla-io/argilla/) and [distilabel](https://github.com/argilla-io/distilabel). The guide starts from domain documents and results in a custom evaluation task that you can use to evaluate your model with [lighteval](https://github.com/huggingface/lighteval). -------------------------------------------------------------------------------- /contents/human-evaluation/using-human-annotators.md: -------------------------------------------------------------------------------- 1 | # Using human annotators 2 | 3 | I suggest reading Section 3 of this [review](https://aclanthology.org/2024.cl-3.1/) of good practices in data annotation quality. If you want production level quality and have the means to implement all of these methods, go ahead! 
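One of the checks mentioned above, inter-annotator agreement, is cheap to compute once two annotators have labeled the same samples; here is a minimal sketch assuming `scikit-learn` is installed and a binary labeling task:

```python
from sklearn.metrics import cohen_kappa_score

# Toy data: labels from two annotators on the same 10 samples.
annotator_1 = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
annotator_2 = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")

# Rough reading: around 0 means chance-level agreement, 1.0 means perfect agreement.
# For more than two annotators, look at Fleiss' kappa or Krippendorff's alpha instead.
```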
4 | 5 | ![Best_annotation_practices](https://github.com/huggingface/evaluation-guidebook/blob/main/assets/best_annotation_practices.png?raw=true) 6 | 7 | However, important guidelines (no matter your project size) are the following, once you defined your task and scoring guidelines. 8 | 9 | - **Workforce selection, and if you can monetary incentive** 10 | You likely want the people working on your task to: 11 | 1) obey some demographics. 12 | Some examples: be native speakers of the target language, have a higher education level, be experts in a specific domain, be diverse in their geographical origins, etc. 13 | Your needs will vary depending on your task. 14 | 1) produce high quality work. 15 | It's notably important now to add a way to check if answers are LLM-generated, and you'll need to filter some annotators out of your pool. 16 | *Imo, unless you're counting on highly motivated crowdsourced annotators, it's always better to pay your annotators correctly.* 17 | 18 | - **Guideline design** 19 | Make sure to spend a lot of time really brainstorming your guidelines! That's one of the points on which we spent the most time for the [GAIA](https://huggingface.co/gaia-benchmark) dataset. 20 | 21 | - **Iterative annotation** 22 | Be ready to try several rounds of annotations, as your annotators will misunderstand your guidelines (they are more ambiguous than you think)! Generating samples several times will allow your annotators to really converge on what you need. 23 | 24 | - **Quality estimation** and **Manual curation** 25 | You want to control answers (notably via inter-annotator agreement if you can get it) and do a final selection to keep only the highest quality/most relevant answers. 26 | 27 | Specialized tools to build annotated high quality datasets like [Argilla](https://argilla.io/) can also help you. 28 | ### Going further 29 | - ⭐ [How to set up your own annotator platform in a couple minutes](https://huggingface.co/learn/cookbook/enterprise_cookbook_argilla), by Moritz Laurer. A good read to get some hands on experience using open source tools (like Argilla and Hugging Face), and understanding better the dos and don'ts of human annotation at scale. 30 | - ⭐ [A guide on annotation good practices](https://aclanthology.org/2024.cl-3.1/). It's a review of all papers about human annotation dating from 2023, and it is very complete. Slightly dense, but very understandable. 31 | - [Another guide on annotation good practices](https://scale.com/guides/data-labeling-annotation-guide), by ScaleAI, specialised in human evaluations. Its a more lightweigth complement to the above document. 32 | - [Assumptions and Challenges of Capturing Human Labels](https://aclanthology.org/2024.naacl-long.126/) is a paper on how to look at source of annotator disagreement and mitigate them in practice 33 | -------------------------------------------------------------------------------- /contents/model-as-a-judge/basics.md: -------------------------------------------------------------------------------- 1 | # Basics 2 | 3 | ## What is a judge model evaluation? 4 | Judge models are simply **neural network used to evaluate the output of other neural networks**. In most cases, they evaluate text generations. 5 | 6 | Judge models range from small specialized classifiers (think "spam filter", but for toxicity for example) to LLMs, either large and generalist or small and specialized. 
In the latter case, when using an LLM as a judge, you give it a prompt to explain how to score models (ex: `Score the fluency from 0 to 5, 0 being completely un-understandable, ...`). 7 | 8 | Model as judges allow to score text on complex and nuanced properties. 9 | For example, an exact match between a prediction and reference can allow you to test if a model predicted the correct fact or number, but assessing more open-ended empirical capabilities (like fluency, poetry quality, or faithfulness to an input) requires more complex evaluators. 10 | 11 | That's where models as judges come into play. 12 | 13 | They are used on 3 main tasks: 14 | - *Scoring a model generation*, on a provided scale, to assess a property of the text (fluency, toxicity, coherence, persuasiveness, etc). 15 | - *Pairwise scoring*: comparing a pair model outputs to pick the best text with respect to a given property 16 | - *Computing the similarity* between a model output and a reference 17 | 18 | *Note: In this document, I'll focus on the LLMs + prompt approach for now, but you should definitely check out how classifier judges work, as I think it can be fairly robust and well adapted to a number of use cases, and the recently introduced and promising reward model as judge approach (introduced in [this tech report](https://research.nvidia.com/publication/2024-06_nemotron-4-340b), and on which we have a small page [here](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/model-as-a-judge/what-about-reward-models.md))* 19 | 20 | ## Pros and cons of using judge-LLMs 21 | Judge LLMs have been used for the following points: 22 | - **Objectivity** when compared to humans: They automate empirical judgments in an objective and reproducible manner 23 | - **Scale and reproducibility**: They are more scalable than human annotators, which allows to reproduce scoring on large amounts of data. 24 | - **Cost**: They are cheap to instantiate, as they don't require to train a new model, and can just rely on good prompting and an existing high quality LLM. They are also cheaper than paying actual human annotators. 25 | - **Alignment with human judgments**: They are somehow correlated with human judgments. 26 | 27 | There are also downside to all of these: 28 | - LLM as judges seem objective, but they have many **hidden biases** that can be harder to detect than the ones in humans, since we're not as actively looking for them (see [model-as-a-judge/Tips and tricks]). Besides, there are ways to reduce human bias by designing survey questions in specific and statistically robust ways (which has been studied in sociology for about a century), where LLM-prompting is not as robust yet. Using LLMs to evaluate LLMs has been compared to creating an echo-chamber effect, by reinforcing biases subtly. 29 | - They are indeed scalable, but contribute to creating massive amounts of data which themselves need to be examined to ensure their quality (for example, you can improve the quality of LLM-judges by asking them to generate a thinking trace, or reasoning around their data, which makes even more new artificial data to analyse) 30 | - They are indeed cheap to instantiate, but paying actual expert human annotators is likely to give you qualitatively better results for your specific use cases. 31 | 32 | ## How to start? 33 | - If you want to give it a go, I suggest first reading this [very good guide](https://huggingface.co/learn/cookbook/en/llm_judge) (⭐) by Aymeric Roucher on how to setup your first LLM as judge! 
34 | You can also try the [distilabel](https://distilabel.argilla.io/latest/) library, which allows you to generate synthetic data and update it using LLMs. They have a nice [tutorial](https://distilabel.argilla.io/latest/sections/pipeline_samples/papers/ultrafeedback/) applying the methodology of the [Ultrafeedback paper](https://arxiv.org/abs/2310.01377) as well as a [tutorial on benchmarking](https://distilabel.argilla.io/latest/sections/pipeline_samples/examples/benchmarking_with_distilabel/) implementing the Arena Hard benchmark. 35 | -------------------------------------------------------------------------------- /contents/model-as-a-judge/designing-your-evaluation-prompt.md: -------------------------------------------------------------------------------- 1 | # Designing your evaluation prompt 2 | 3 | ## General prompt design tips 4 | Some general guidelines I've come across online when designing the prompt itself are: 5 | - Provide a clear description of the task at hand: 6 | - `Your task is to do X`. 7 | - `You will be provided with Y`. 8 | - Provide clear instructions on the evaluation criteria, including a detailed scoring system if needed: 9 | - `You should evaluate property Z on a scale of 1 - 5, where 1 means ...` 10 | - `You should evaluate if property Z is present in the sample Y. Property Z is present if ...` 11 | - Provide some additional "reasoning" evaluation steps: 12 | - `To judge this task, you must first make sure to read sample Y carefully to identify ..., then ...` 13 | - Specify the desired output format (adding fields will help consistency) 14 | - `Your answer should be provided in JSON, with the following format {"Score": Your score, "Reasoning": The reasoning which led you to this score}` 15 | 16 | You can and should take inspiration from [MixEval](https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/extended/mix_eval/judge_prompts.pyy) or [MTBench](https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/extended/mt_bench/judge_prompt_templates.py) prompt templates. 17 | 18 | Other tidbits: 19 | - Pairwise comparison [correlates better with human preference](https://arxiv.org/abs/2403.16950) than scoring, and is more robust generally. 20 | - If you really want a score, use an integer scale make sure you provide a detailed explanation for what [each score represents](https://x.com/seungonekim/status/1749289437165769177), or an additive prompt (`provide 1 point for this characteristic of the answer, 1 additional point if ...` etc) 21 | - Using one prompt per capability to score tends to give better and more robust results 22 | 23 | ## Improving judgment accuracy 24 | You can also improve accuracy using the following, possibly more costly, techniques: 25 | - **Few shot examples**: like in many other tasks, if you provide examples it can help its reasoning. However, this adds to your context length. 26 | - **Reference**: you can also enhance your prompt with a reference if present, which increases accuracy 27 | - **CoT**: [improves accuracy](https://arxiv.org/abs/2212.08073), if you ask the model to output its chain of thought **before** the score (also observed [here](https://x.com/seungonekim/status/1749289437165769177)) 28 | - **Multiturn analysis**: can improve [factual error detection](https://arxiv.org/abs/2305.13281) 29 | - Using **a jury** (many judges, where you pick an aggregate of the answers): [gives better results](https://arxiv.org/abs/2404.18796) than using a single model. 
30 | - It can be made considerably less costly by leveraging many smaller models instead of one big expensive model. 31 | - You can also experiment with using one model with variations on temperature 32 | - Surprisingly, the community has found that adding stakes to the prompts (`answer correctly and you'll get a kitten`) can increase correctness. Your mileage may vary on this one, adapt to your needs. 33 | 34 | Note on prompting: Depending on the stakes of your use case, to remove as much bias as possible, you would want to look at work done in sociology on how to design good surveys. If you treat your evaluator as a replacement for a human annotator, then you need to look at similar metrics: computing inter-annotator agreement, using correct survey design methodology to mitigate bias, etc. 35 | 36 | However, most people don't really want a reproducible and high quality unbiased eval, and will be happy with quick and dirty evaluation through OK-ish prompts. (Which is an OK situation to be in! Just depends on the consequences attached). 37 | -------------------------------------------------------------------------------- /contents/model-as-a-judge/evaluating-your-evaluator.md: -------------------------------------------------------------------------------- 1 | # Evaluating your evaluator 2 | 3 | Before using a judge-LLM in production or at scale, you want to first evaluate its quality for your task, to make sure its scores are actually relevant and useful for you. 4 | 5 | Note: *This will be easier to do if it predicts binary outputs, because you'll be able to interpretable classification metrics (accuracy/recall/precision). If it predicts scores on a scale, it will be much harder to estimate the quality of the correlation with a reference.* 6 | 7 | So, once you have selected your model judge and its prompt, you'll need to do the following. 8 | 9 | ## 1. Pick your baseline 10 | You'll need to compare your evaluator judgments to a baseline: it can be human annotations, the output of another judge model that you know is qualitative on your task, a gold truth, itself with another prompt, etc. 11 | 12 | You don't necessarily need a lot of examples (50 can be enough), but you need them to be extremely representative of your task, discriminative (representative of edge cases notably), and of as high quality as you can manage. 13 | 14 | ## 2. Pick your metric 15 | Your metric will be used to compare your judge's evaluations with your reference. 16 | 17 | In general, this comparison is considerably easier to do if your model is predicting binary classes or doing pairwise comparison, as you'll be able to compute accuracy (for pairwise comparison), or precision and recall (for binary classes), which are all very easy to interpret metrics. 18 | 19 | Comparing the correlation of scores with human or model scoring will be harder to do. To understand why in more detail, I advise you to read this cool [blog section on the topic](https://eugeneyan.com/writing/llm-evaluators/#key-considerations-before-adopting-an-llm-evaluator). 20 | 21 | In general, if you're a bit lost about what metrics to pick when (in terms of models, metrics, ...), you can also look at [this interesting graph](https://eugeneyan.com/assets/llm-eval-tree.jpg) from [the same above blog](https://eugeneyan.com/writing/llm-evaluators/) ⭐. 22 | 23 | ## 3. Evaluate your evaluator 24 | For this step, you simply need to use your model and its prompt to evaluate your test samples! 
Then, once you get the evaluations, use your above metric and reference to compute a score for your evaluations. 25 | 26 | You need to decide what your threshold for acceptance is. Depending on how hard your task is, you can aim for 80% to 95% accuracy, if you're doing pairwise comparison. Regarding correlations (if you're using scores), people in the literature tend to seem happy with 0.8 Pearson correlation with a reference. However, I've seen some papers declare that 0.3 indicates a good correlation with human annotators (^^") so ymmv. 27 | 28 | 29 | 30 | -------------------------------------------------------------------------------- /contents/model-as-a-judge/getting-a-judge-llm.md: -------------------------------------------------------------------------------- 1 | # Getting a Judge-LLM 2 | 3 | When using an existing LLM, you can go for [generalist, high capability models](https://arxiv.org/abs/2306.05685v4), using [small specialist models](https://arxiv.org/abs/2405.01535) trained specifically to discriminate from preference data, or training your own. 4 | 5 | ## Using a generalist LLM 6 | 7 | With the introduction of more capable LLMs (such as ChatGPT), some researchers started exploring using big models as judges. The best current big model judges tend to be closed source models (like Claude or gpt-o models) though the gap with open source is closing very fast thanks to high quality models such as [Qwen 2.5](https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e), [Command R+](https://huggingface.co/CohereForAI/c4ai-command-r-plus-08-2024) or [Llama 3.1-405-Instruct](meta-llama/Llama-3.1-405B-Instruct). 8 | 9 | Closed source models, despite their performance, present the multiple disadvantages of being: 10 | - under APIs, which mean that models (therefore results) can change with no notice, hurting the reproducibility of evals 11 | - black boxes, which makes them un-interpretable 12 | - possible sources of data leakage/lack of data privacy, as you send your data to a third party through the internet (which tends to be less safe than locally managed data), and you don't know for certain what is done with it (you often need to opt out of it being used in training sets). 13 | 14 | However, they also allow anyone to have access to a high quality model without needing to setup things locally or requiring access to hardware. This pros are now also present for most high quality open models, which are accessible through model providers, and solve the first 2 problems above. 15 | 16 | You'll find a good cost analysis of model providers [here](https://huggingface.co/spaces/ArtificialAnalysis/LLM-Performance-Leaderboard) if you need help picking one. 17 | 18 | ## Using a tiny specialized LLM judge model 19 | 20 | You can also make the choice to use tiny specialized LLM judges. With often a couple billion parameters, they can run locally on most recent consumer hardware, while being trained from scratch or fine-tuned using instruction data. You often need to follow their specific prompt formats. 21 | 22 | Some existing models: 23 | - Flow-Judge-v0.1 ([weights](https://huggingface.co/collections/flowaicom/flow-judge-v01-66e6af5fc3b3a128bde07dec)), 3.8B parameters, a Phi-3.5-mini-instruct fine-tuned on a synthetic preference dataset 24 | - Prometheus ([weights](https://huggingface.co/prometheus-eval/prometheus-13b-v1.0), [paper](https://arxiv.org/abs/2310.08491)), 13B parameters, a model trained from scratch on synthetic preference dataset. 
A 7B parameter [v2](https://huggingface.co/prometheus-eval/prometheus-7b-v2.0) also exists, a Mistral-7B-Instruct-v0.2 fine-tune on a bigger synthetic preference dataset, with added weight merging 25 | - JudgeLM ([paper](https://arxiv.org/abs/2310.17631)), 7B to 33B parameters, models trained from scratch on synthetic preference datasets generated with a variety of models. 26 | 27 | ## Training your own 28 | You can also make the choice to train or fine-tune your own LLM-as-judge. 29 | 30 | You first need to gather preference data for your task of interest, which can come 31 | - From existing [human preference datasets](https://www.kaggle.com/competitions/lmsys-chatbot-arena) 32 | - From model generated preference data (which you can generate following the above tiny-model judges papers data sections, or get directly, for example from the Prometheus [preference](https://huggingface.co/datasets/prometheus-eval/Preference-Collection) and [feedback](https://huggingface.co/datasets/prometheus-eval/Feedback-Collection) collections). 33 | 34 | Then you need to decide whether to start from a small model to train from scratch, or from an existing model, that you can 35 | - distill into a new smaller model 36 | - quantize. 37 | - then fine-tune (using peft or adapter weights if the model is big and your training compute low) using the above data 38 | - apparently [starting from a reward model works better than from an instruct model](https://x.com/dk21/status/1826292289930674590) 39 | -------------------------------------------------------------------------------- /contents/model-as-a-judge/tips-and-tricks.md: -------------------------------------------------------------------------------- 1 | # Tips and tricks 2 | 3 | ## Mitigating well known biases of LLM as judges: 4 | 5 | - **Lack of internal consistency**: a judge might give you different judgments if you prompt it several times (if the temperature is not 0) 6 | - You can mitigate this by doing self-consistency prompting of your judge, prompting it multiple times and keeping the majority output 7 | - **Self-preference**: they tend to [favor their own outputs](https://arxiv.org/abs/2404.13076) when scoring answers 8 | - You can mitigate this by using a jury 9 | - **Blindness to input perturbation**: models are bad at identifying [perturbated input](https://arxiv.org/abs/2406.13439) and tangentially [bad at providing consistent score ranges](https://twitter.com/aparnadhinak/status/1748368364395721128) (extended experiments on this [here](https://github.com/LeonEricsson/llmjudge/blob/main/README.md)). For example, if asked to rank text quality on text where noise has been added on a consistent scale, the grades predicted do not reflect this scale. 10 | - You can mitigate this by 11 | - asking the model to explain its reasoning [before providing a score](https://twitter.com/seungonekim/status/1749289437165769177) 12 | - providing a coherent grading scale in the prompt. 13 | - **Position-bias**: they tend to [favor specific answer positions](https://arxiv.org/abs/2306.05685). 
For example, when presented with pairwise comparisons, Claude and GPT3.5 tend to quite systematically prefer the first choice, or the second choice 14 | - You can mitigate this by 15 | - switching answer positions randomly 16 | - computing the log-probabilities of all possible choices to get a normalized answer 17 | - **Verbosity-bias** (or length-bias): they tend to like more verbose answers 18 | - You can mitigate this by [accounting for the answer difference in length](https://arxiv.org/abs/2404.04475) 19 | - **Debatable consistency [with human answers](https://arxiv.org/abs/2308.15812):** 20 | - However, it's also [debatable if non-expert humans are a good baseline for absolutely all evaluations](https://arxiv.org/abs/2202.06935). For some specific domains (medical, legal, mathematics, etc), relying on non-expert human annotators is as bad a baseline as using an LLM directly. 21 | - **Format bias**: they tend to fail to evaluate accurately if the prompt format [is too far away](https://arxiv.org/abs/2310.17631) from what it's been trained with. For example, a model trained to do pairwise comparison with an added reference answer will fail if said answer is not provided, and failures will also occur the other way around. 22 | - You can mitigate this by paying attention to the training prompt format (if the model was instruction tuned) and ensuring you follow it. 23 | 24 | ## Picking correct tasks for an LLM judge 25 | 26 | LLM evaluators: 27 | - are **bad at identifying hallucinations** in general, particularly what are called partial hallucinations (which look close to the ground truth but are actually slightly different) (see [this](https://arxiv.org/abs/2305.11747) and [this](https://arxiv.org/abs/2303.08896)) 28 | - have a low to OK-ish correlation with human annotators on [summarization](https://arxiv.org/abs/2304.02554) ([here too](https://arxiv.org/abs/2303.16634)), [faithfulness](https://arxiv.org/abs/2307.16877), and are not consistently correlated with human judgement more broadly against [a scope of tasks](https://arxiv.org/abs/2406.18403) 29 | -------------------------------------------------------------------------------- /contents/model-as-a-judge/what-about-reward-models.md: -------------------------------------------------------------------------------- 1 | # What about Reward Models? 2 | 3 | ## What is a Reward Model? 4 | 5 | Reward models learn to predict a score from human annotations for given prompt/completion pairs. The end goal is for them to do predictions aligned with human preference. 6 | Once trained, these models can then be used to improve other models, by acting as a a reward function which is a proxy for human judgment. 7 | 8 | ### Pairwise score 9 | 10 | The most common type of reward model is the Bradley-Terry model, which outputs a single score, following: 11 | 12 | $$p(\text{completion b is better than completion a}) = \text{sigmoid}(\text{score}_b - \text{score}_a)$$ 13 | 14 | This model is trained using only pairwise comparisons of completions, which are easier to collect than scores, but can only compare several completions for one prompt, and not completions across prompts. 15 | 16 | Other models have expanded on this approach to predict a more nuanced probability that a completion is better than the other one ([example](https://huggingface.co/RLHFlow/pair-preference-model-LLaMA3-8B)). 
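As a small illustration of the Bradley-Terry formulation above, here is a sketch turning two completion scores into a preference probability (the scores are made up; in practice they come from a reward model's forward pass):

```python
import math

def preference_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry style probability that completion b is preferred over completion a."""
    return 1 / (1 + math.exp(-(score_b - score_a)))

# Toy scores a reward model might assign to two completions of the same prompt.
score_a, score_b = 1.3, 2.1
print(f"P(b preferred over a) = {preference_probability(score_a, score_b):.2f}")  # ~0.69
```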
17 | 18 | This allows them to (theoretically) judge subtle differences between completions, at the cost of not being able to easily save and compare many different scores across prompts for the same test set. In addition, context length and memory limits can become an issue when comparing too long completions. 19 | 20 | ### Absolute score 21 | 22 | Some reward models such as [SteerLM](https://arxiv.org/abs/2311.09528) output absolute scores, which can be used to evaluate completions directly without the need for pairwise comparisions. These models can be easier to use for evaluation, but are also harder to collect data for, as absolute scores tend to be less stable than pairwise scores in human preferences. 23 | 24 | More recently, models have been proposed that output both absolute and relative scores, such as [HelpSteer2-Preference](https://arxiv.org/abs/2410.01257) and [ArmoRM](https://arxiv.org/abs/2406.12845). 25 | 26 | 27 | ## How do I use a Reward Model for Evaluation? 28 | 29 | Given a dataset of prompts, we can generate completions from a language model and ask a reward model to score them. 30 | 31 | For models that give absolute scores, the resulting scores can be averaged to get a reasonable summary score. 32 | 33 | However, in the more common case of relative scores, the average reward can be biased by outliers (a few very good or very bad completions) as different prompts may have inherently different reward scales (some prompts are way harder or easier than others). 34 | 35 | Instead, we can use 36 | - win rates: take a reference set of completions and calculate the percentage of completions from the model that are ranked higher than the reference completions. It is slightly more granular. 37 | - win probabilities: the mean probability of the completions being better than the reference completions, which can give a more fine-grained and smoothly changing signal. 38 | 39 | ## Pros and Cons of Reward Models 40 | 41 | Reward models are typically: 42 | - **Very fast**: Getting a score is as simple as running a forward pass of a relatively small model once (since we only get a score, and not long text, contrary to judge-LLMs) 43 | - **Deterministic**: The same scores will be reproduced through the same forward pass 44 | - **Unlikely to suffer from positional bias**: As most models take only one completion, they can not be influenced by the order. For pairwise models, positional bias is often also minimal, as long as the training data was balanced with respect to containing both first and second answers as being the best. 45 | - **Require no prompt engineering**: since the model will simply output a score from one or two completions depending on preference data it's been trained on. 46 | 47 | On the other hand they: 48 | - **Require specific fine-tuning**: This can be a relatively costly step, and elthough they inherit many capabilities from a base model, they may still perform poorly on tasks that are out of the training distribution. 49 | - **Loose efficiency when used both in reinforcement learning and evaluation** (or when using direct alignment algorithms on datasets that are similar to the training data of the reward model), as the language model may overfit to the reward model's preferences. 50 | 51 | ## Tips and Tricks for using Reward Models for Evaluation 52 | 53 | - A good place to find high performing models is the [RewardBench Leaderboard](https://huggingface.co/spaces/allenai/reward-bench). 
 54 | - You can look at how reward models have been used in the [Nemotron](https://arxiv.org/abs/2406.11704) paper.
 55 | - For reward models that rate single prompts and completions, you can cache the scores of many reference models and easily see how a new model performs.
 56 | - Tracking win rates or probabilities over training, e.g. as in [this](https://arxiv.org/abs/2410.11677v1) recent paper, can allow you to detect model degradation and select optimal checkpoints.
 57 |
-------------------------------------------------------------------------------- /contents/troubleshooting/troubleshooting-inference.md: --------------------------------------------------------------------------------
 1 | # Troubleshooting inference
 2 |
 3 | ## My model is very slow!
 4 | ### Changing the batch size
 5 | If you want absolute reproducibility (given specific hardware and a specific evaluation prompt), you're probably using a batch size of one. However, moving to higher batch sizes will likely make your evaluation faster (provided it fits within the memory of your hardware).
 6 |
 7 | ### Data parallelism
 8 | You can also duplicate your model on several GPUs instead of loading it on one single GPU, and provide subsets of the data to each GPU copy, then aggregate the computation results.
 9 | This means that each data stream is handled in parallel, at the same time as the others, which divides your total execution time by the number of GPUs.
 10 | However, if you can, all GPUs should be on a single node to avoid inter-node bottlenecks.
 11 |
 12 | ### Changing the inference code
 13 | Not all inference libraries run at the same speed, and some code is more optimized than others. You'll need to experiment a bit to find which libraries have the fastest inference, and if you are using pytorch, I recommend looking at the model inference optimization checklist [here](https://pytorch.org/serve/performance_checklist.html).
 14 |
 15 | ### Changing the precision
 16 | If your model is very slow, you can reduce its size by reducing the precision of the computations. A model stored in float32 does very precise computations (using 32 bits per number stored!) that are also very memory and compute heavy - moving to `bfloat16` or `float16` (half the precision) should make the model twice as fast, at a precision loss which should almost not matter. If you want further bumps in speed, you can quantize it even more, to 8 or 4 bits (using `gptq` or `bitsandbytes` for example), as n-bit matrix computations should be faster and your model will take even less space in memory (however, some quantization libraries might be a bit slow, so test things out for your use cases!).
 17 |
 18 | ## My model is very big!
 19 | ### Estimating memory requirements
 20 | You can estimate the minimal theoretical memory required to load a given model (and therefore the hardware needed) with the **following formula**:
 21 |
 22 | `memory (in Bytes) = number of parameters * precision factor (in Bytes per parameter)`
 23 |
 24 | Since you can store 8 bits in a Byte, the memory required is the total number of parameters times the number of Bytes required to store one parameter. The precision factor is therefore 4 for `float32`, 2 for `float16` or `bfloat16`, 1 for `8bit`, and 0.5 for `4bit` models, etc.
 25 |
 26 | And that's it!
 27 |
 28 | I would actually recommend using `memory (in Bytes) = number of parameters * (precision factor * 110%)`, to be on the safer side, as inference will require a bit more memory than just loading the model (you'll also need to load the batches).
 29 |
 30 | ### What should you do if your model does not fit on a GPU?
 31 | #### Quantization
 32 | The first obvious thing is to play with the `precision factor` above: going from float32 to 4 bits reduces memory requirements by a factor of 8!
 33 | However, using too low a precision can give worse results, so for some models (especially mid-range ones), you might want to stay in float16 or 8bit. (Quantization seems to affect the performance of very big models less, possibly because of information redundancy).
 34 | #### Model parallelism
 35 | Model parallelism covers a range of techniques which cut your model into smaller sub-model pieces, to load and run each of these smaller pieces on a different GPU. This requires less memory since you never load the full model at once, but can be slower.
 36 |
 37 | The 2 main types of model parallelism are:
 38 | - Pipeline parallelism, where the model is split at the whole-layer level, and the layers are dispatched on different GPUs. Since layer 1's output is layer 2's input, this leads to slower execution, as GPUs will be idle while waiting, which is called a "bubble" (and data must be transferred from one GPU to the next). The bubble can be reduced by splitting the inputs into smaller batches. It's being natively added to PyTorch with the `PiPPy` [lib](https://github.com/pytorch/PiPPy), and this is what `accelerate` uses under the hood for parallelism.
 39 | - Tensor parallelism, where the model is split at the matrix computation level. This means that the matrices are split along rows or columns, and the partial results aggregated. This is incredibly efficient as long as all GPUs are on the same node (to avoid inter-node network bottlenecks), but can be hard to code. You'll find cool implementations of this in the `vllm` lib. It provides **insane speedups**.
 40 |
 41 | The best document on the different kinds of parallelism (including data parallelism, for speedups) is [here](https://huggingface.co/docs/transformers/v4.15.0/en/parallelism).
 42 |
 43 | #### CPU offloading
 44 | CPU offloading moves some of the computations and model parts to the CPU, in order to reduce GPU memory usage. It's **considerably slower** than any other method here, mostly because you need to move data from one device to another all the time.
 45 |
 46 | An example of this is [ZeRO-Offload](https://arxiv.org/abs/2101.06840) by Deepspeed, which distributes parameters between CPU and GPU (on top of using other optimizations described in the ZeRO-2 paper). Gradients, optimizer states and fp32 model parameter computations during optimization are kept on the CPU, whereas fp16 parameters and the forward/backward passes stay on the GPU, to leverage CPU memory and GPU compute while minimizing communication between both.
 47 |
 48 | ### My model fits on a GPU but I still get OOMs!
 49 | You likely have a problem with your context size, then.
 50 |
 51 | We recommend:
 52 | 1) testing if your model truly does fit on a GPU with some dummy inference data loaded. This dummy inference data should have a big enough context size (representative of your task)
 53 | 2) lowering the batch size, or removing the auto-batch size search which could lead to an accidental OOM error, if you have this enabled
 54 | 3) more generally, making sure that samples are presented to your model in decreasing order of context size, so that your model fails immediately if the context size is too big, and not after having run for X hours (a minimal sorting sketch is shown below).
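For point 3, a minimal sketch of this ordering is just a sort on tokenized length before batching (the sample list is hypothetical; use your evaluated model's own tokenizer):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # replace with your evaluated model's tokenizer

# Hypothetical evaluation prompts; in practice these come from your task.
samples = [
    "Short prompt.",
    "A much longer prompt. " * 100,
    "A medium length prompt with a few more words than the short one.",
]

# Longest context first: if anything OOMs, it does so in the very first batch,
# not after hours of compute.
samples = sorted(samples, key=lambda s: len(tokenizer(s)["input_ids"]), reverse=True)
print([len(tokenizer(s)["input_ids"]) for s in samples])
```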
55 | -------------------------------------------------------------------------------- /contents/troubleshooting/troubleshooting-math-parsing.md: -------------------------------------------------------------------------------- 1 | # Using LaTeX to evaluate MATH capabilities 2 | 3 | Parsing latex is hard. This is an issue when evaluating a model expecting $\LaTeX$ as output. This is the case for the [MATH benchmark](https://huggingface.co/datasets/lighteval/MATH). 4 | 5 | This benchmark uses $\LaTeX$ to represent mathematical calculations and symbols. Evaluating this task should be a matter of parsing and comparing the ground truth and the model's output. 6 | Turns out, there is no right way to parse $\LaTeX$: 7 | 8 | 9 | ![](../../assets/sympy_doc.png) 10 | *From the [`sympy`](https://github.com/sympy/sympy) documentation* 11 | 12 | The lm-evaluation harness uses [`sympy`](https://github.com/sympy/sympy) (a Python library for symbolic mathematics) to parse latex and compare expressions. 13 | When using `sympy` to try and parse the ground truths (using the ground truth against itself), we only get around 0.94 accuracy. 14 | How could that be? Well, it turns out `sympy` cannot parse certain (correct $\LaTeX$) expressions. 15 | 16 | For example: 17 | 18 | ``` 19 | couldn't parse one of [0,1) or [0,1), I expected one of these: ']' 20 | [0,1) 21 | ~~^ 22 | ``` 23 | 24 | ``` 25 | couldn't parse one of (-\iny,-5]\cup[5,\iny) or (-\iny,-5]\cup[5,\iny), I expected something else here 26 | (-\iny,-5]\cup[5,\iny) 27 | ~~~~~~^ 28 | ``` 29 | 30 | ``` 31 | couldn't parse one of -\frac{1}{{}2x} or -\frac{1}{{}2x}, I don't understand this 32 | -\frac{1}{{}2x} 33 | ~~~~~~~~~~~^ 34 | ``` 35 | 36 | ### How do I get around this? 37 | 38 | You could either re-write the $\LaTeX$ [grammar](https://github.com/sympy/sympy/blob/master/sympy/parsing/latex/lark/grammar/latex.lark), adding needed features to 39 | the code, or add manual checks to your code to improve model scores. After 40 | almost falling into a deep rabbit hole, we decided that adding string 41 | comparison checks to our code would be sufficient. 42 | 43 | ![Fix to the Lm Eval Harness](../../assets/lm_eval_diff.png) 44 | *Fix to the LM Evaluation Harness* 45 | 46 | ### Results 47 | 48 | Here is a table comparing old and new results of the first 25 models. 49 | 50 |
*Comparison of original and fixed parser on MATH benchmark*

| Model | Score (original) | Score (fixed parser) | Rank (original) | Rank (fixed parser) |
|---|---|---|---|---|
| rombodawg/Rombos-LLM-V2.5-Qwen-72b | 47.58 | 50.68 | 1 | 1 |
| MaziyarPanahi/calme-2.2-qwen2-72b | 41.16 | 43.43 | 2 | 2 |
| arcee-ai/Arcee-Nova | 40.48 | 42.90 | 3 | 3 |
| fblgit/TheBeagle-v2beta-32B-MGS | 39.43 | 42.52 | 4 | 4 |
| rombodawg/Rombos-LLM-V2.5-Qwen-32b | 39.12 | 41.99 | 5 | 5 |
| dnhkng/RYS-XLarge | 38.97 | 41.24 | 6 | 6 |
| dfurman/CalmeRys-78B-Orpo-v0.1 | 37.92 | 40.71 | 8 | 7 |
| MaziyarPanahi/calme-2.2-rys-78b | 37.92 | 39.95 | 8 | 9 |
| MaziyarPanahi/calme-2.4-rys-78b | 37.69 | 40.41 | 9 | 8 |
| MaziyarPanahi/calme-2.3-rys-78b | 36.56 | 38.97 | 10 | 10 |
| MaziyarPanahi/calme-2.1-rys-78b | 36.40 | 38.90 | 11 | 11 |
| Qwen/Qwen2.5-72B | 36.10 | 38.67 | 12 | 12 |
| MaziyarPanahi/calme-2.1-qwen2-72b | 36.03 | 38.07 | 13 | 15 |
| Qwen/Qwen2-Math-72B-Instruct | 35.95 | 38.14 | 14 | 14 |
| dfurman/Qwen2-72B-Orpo-v0.1 | 35.42 | 38.14 | 15 | 13 |
| abacusai/Smaug-Qwen2-72B-Instruct | 35.35 | 37.46 | 16 | 19 |
| anthracite-org/magnum-v1-72b | 35.27 | 37.69 | 18 | 16 |
| alpindale/magnum-72b-v1 | 35.27 | 37.69 | 18 | 16 |
| Qwen/Qwen2-72B-Instruct | 35.12 | 37.69 | 19 | 18 |
| dnhkng/RYS-XLarge-base | 34.67 | 37.16 | 20 | 20 |
| Undi95/MG-FinalMix-72B | 33.61 | 36.10 | 22 | 21 |
| abacusai/Dracarys-72B-Instruct | 33.61 | 35.65 | 22 | 22 |
| Qwen/Qwen2.5-32B | 32.85 | 35.50 | 23 | 23 |
| anthracite-org/magnum-v2-72b | 31.65 | 34.06 | 24 | 24 |
| dnhkng/RYS-Huge-bnb-4bit | 31.57 | 33.84 | 25 | 25 |
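For reference, here is a minimal sketch of the kind of check described above (not the actual harness patch shown in the diff): try a symbolic comparison with `sympy`, and fall back to a normalized string comparison when parsing fails.

```python
from sympy import simplify
from sympy.parsing.latex import parse_latex  # needs the antlr4 Python runtime installed

def is_equivalent(prediction: str, reference: str) -> bool:
    """True if the two LaTeX answers match symbolically or, failing that, as normalized strings."""
    try:
        # Symbolic check: the difference between both expressions should simplify to 0.
        return simplify(parse_latex(prediction) - parse_latex(reference)) == 0
    except Exception:
        # sympy could not parse one of the expressions (e.g. intervals like [0,1)):
        # fall back to a light string normalization and an exact comparison.
        normalize = lambda s: s.replace(" ", "").replace("\\left", "").replace("\\right", "")
        return normalize(prediction) == normalize(reference)
```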
--------------------------------------------------------------------------------
/contents/troubleshooting/troubleshooting-reproducibility.md:
--------------------------------------------------------------------------------
1 | # Troubleshooting reproducibility
2 | 
3 | Let's say you have read a recent tech report about a cool new model, and you want to reproduce their results on your machine... but you're not managing to?
4 | Let's explore why.
5 | 
6 | ## Different code base
7 | To reproduce evaluation scores to the decimal point, you first need to make sure you're using exactly the same code base as the paper you want to reproduce.
8 | 
9 | Usually, this means either using the default evaluation code provided by the authors, or a standard implementation in a reference library like EleutherAI's `lm_eval` or Hugging Face's `lighteval`. However, if the evaluation source code is not provided, then I'm sorry for you, but it's unlikely that you'll be able to reproduce the results precisely.
10 | 
11 | If you want to easily understand what kind of discrepancies happen when using different implementations, you can explore [this blog](https://huggingface.co/blog/open-llm-leaderboard-mmlu) (⭐) we wrote with the eval team at Hugging Face. It studies the differences we observed between 3 common implementations of the MMLU evaluation (in `lm_eval`, `helm`, and in the original author implementation), and how they change model scores.
12 | 
13 | *Note: It is precisely for this reason that a Hugging Face team decided to launch the [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard), to get unified and homogeneous comparisons of model scores and compare them to internal experiments.*
14 | 
15 | ### Other subtle ways in which the implementation can be different
16 | We've observed that the following were easy things to mess up, even when using the same code base:
17 | - **Different random seeds.**
18 | - Normally, inference is less affected by random seeds than training. However, they can still affect some CUDA operations (see the PyTorch page on [reproducibility](https://pytorch.org/docs/stable/notes/randomness.html)) and change predictions if you're using a non-greedy generation strategy. They can also affect the prompt if you're using few-shot examples, and some pre- or post-processing functions.
19 | -> A tiny change can result in a couple of points of difference.
20 | - **Actually different metrics**.
21 | Metrics can be different in practice even if they share the same name. Some examples:
22 | - If the original implementation is a *log likelihood* `exact match` (computing the log probabilities of different possible answers), and you're using a *generative* `exact match` (only comparing the main greedy generation with the reference), you won't get the same scores.
23 | - We also saw, in evaluation code bases, a number of tasks which were defined as `exact match`, but were actually `prefix exact match` (comparing only the beginning of the generation with the reference), or `suffix exact match` (the opposite), or `quasi exact match` (exact match with a normalization).
24 | -> You therefore can't rely only on the metric name to determine what is happening, and need to look at the code.
25 | - **Different normalization**.
26 | - To go back to our above `exact match` comparison example, in `lm_eval` v1, a number of tasks were simply named generative `exact match`: you would assume from this that the prediction is *compared as such* to a reference.
27 | Looking at the code, the prediction would instead go through a normalization step (removing punctuation, homogenizing numbers, etc.) before being compared to the reference. This will obviously change results quite a lot.
28 | (The `lm_eval` v2 now includes the normalization name in most metric names.)
29 | -> This is one of the easiest things to mess up, especially for tasks which require a lot of normalization/answer post-processing, like math evaluations (where you want to extract the answer from a generated explanation).
30 | 
31 | ## Different prompt
32 | Three main things can come into play for prompt variation.
33 | ### Prompt itself
34 | The format you are using for the prompt can and will change scores wildly.
35 | 
36 | For example, for multichoice question answering, common formats include very simple variations when presenting the choices, such as:
37 | ```
38 | Question: <question>
39 | Choices:
40 | ```
41 | ```markdown
42 | | A. <choice A> | (A) <choice A> | <choice A> |
43 | | B. <choice B> | (B) <choice B> | <choice B> |
44 | | C. <choice C> | (C) <choice C> | <choice C> |
45 | | D. <choice D> | (D) <choice D> | <choice D> |
46 | ```
47 | ```
48 | Answer:
49 | ```
50 | and predicting either `A`/`B`/`C`/`D` or `<the text of the choice>`.
51 | 
52 | These prompts are **semantically equivalent**, as they contain the exact same content - but they can still result in a difference of *several points for the same model*. We ran some experiments on this [here](https://x.com/clefourrier/status/1777319187913875893/photo/1) (you'll see up to a 7-point difference for the same model) and a [paper observed similar results](https://arxiv.org/abs/2310.11324).
53 | 
54 | Some tasks are also prefixed with a task prompt (e.g. `The following questions are about <topic>`) - its presence or absence will also affect the scores.
55 | 
56 | This [great paper](https://arxiv.org/abs/2407.07890)⭐ also highlights a side effect of this: a number of models are now trained to overfit benchmark prompts and answer formats, at the cost of adapting to other prompts at evaluation time.
57 | 
58 | This is something we observed on the Open LLM Leaderboard 2 for the Llama 3.1 models. They were predicting the correct answers to our MATH-Hard evaluations, but were getting low scores, because they were unable to follow the template provided in the few-shot examples, having overfit the GSM8K prompt and answer format (another math eval).
59 | ### System prompt and chat template
60 | Chat models have usually been through instruction/preference training or fine-tuning. During this stage, they have learned to follow specific templates at inference time. For example, templates can require starting rounds of dialogue with a general prompt (called the `system prompt`) prefixed by specific tokens (usually `System: `). This prompt provides high-level instructions for the model, such as the contents of a persona, or general answering-style instructions. Rounds of dialogue can also require adding prefix keywords to the text, such as `User` for queries and `Assistant` for answers.
61 | 
62 | When using few-shot examples, you also need to decide whether you want them provided multi-turn (mimicking user/assistant turns) or all at once (in a single user prompt).
63 | 
64 | Not following the chat template expected by the model at inference will kill its performance, as it will drive its output outside of the probability space it has been converging on.
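As an illustration, with the `transformers` library you can let the tokenizer apply the model's stored chat template rather than formatting the prompt by hand - a minimal sketch, where the model name and messages are placeholders:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")  # placeholder model

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Question: ..."},
]

# Renders the conversation with the model's own special tokens and role prefixes,
# and appends the tokens that open the assistant's turn, so generation starts in the right place.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```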
65 | 66 | ### Few-shots samples 67 | Two things are easy to mess up with few-shot samples (see `general-knowledge/Model inference` if you're unsure what it is). 68 | 69 | Obviously, you need to use the **same number of few-shot samples** as your task of reference. 70 | 71 | However, you also need to use the **exact same samples** as the model you are comparing to, as using different samples will change results (which is not too surprising, if we assume some samples are better at expressing the task than others). More surprising maybe: you not only need to use the exact same samples, but also present them in the **exact same order**. Varying the order on the same samples led us to observe up to 3 points of difference on some subsets of MMLU (you can see [some results here](https://huggingface.co/blog/evaluation-structured-outputs) , it's the third colorgrid). 72 | 73 | This is also a place where paying attention to the random seeds is important. 74 | 75 | ## Different generation parameters 76 | For generative evaluations, parameters to pay attention to are: 77 | - making sure you are using the **same end of sentence token** 78 | - making sure you are allowing your model to **generate the same number of tokens** for the evaluation 79 | - making sure, if using sampling, that you are using the **same seed/temperature parameters** 80 | 81 | ## Different model loading 82 | Some sources of differences that we have observed are: 83 | - using **different hardware**. 84 | Pytorch does not ensure reproducibility of non deterministic operations across hardware 85 | - using **different libraries**. 86 | For example, if you use `transformers` vs `vllm` as your backend for inference, matrix computations are not managed exactly in the same way) 87 | - using **different batch sizes**. 88 | It's been documented in several evaluation libraries and model backends that using different batch sizes will change inference results - if you want fully reproducible evaluations, you should fix the batch size, though it might not always be possible for memory issues 89 | - using **different loading precision** for your model weights. 90 | Using a lower precision can reduce memory and inference costs, but it will also change the numerical results, since you are using different versions of the weights. 91 | -------------------------------------------------------------------------------- /resources/About NLP.md: -------------------------------------------------------------------------------- 1 | ## General knowledge 2 | - [NLP for You](https://lena-voita.github.io/nlp_course.html), by Lena Voita. One of the best NLP courses online, really step by step and nice to read. 3 | - [The NLP Course](https://huggingface.co/learn/nlp-course/chapter1/1), by Hugging Face. Extremely complete, with lots of code snippets so you can get started fast! 4 | 5 | ## Understanding LLM architectures 6 | - [The Annotated Encoder Decoder](https://bastings.github.io/annotated_encoder_decoder/), by Jasmijn Bastings. Walks you through the 2015 Bahdanau paper introducing the encoder decoder with attention for recurrent networks, with code explanations at each step. 7 | - [The Annotated Transformers](https://nlp.seas.harvard.edu/2018/04/03/attention.html), by Sasha Rush. Walks you through the 2016 Vaswani paper introducing the transformers, with code explanations at each step. 8 | - [The Illustrated Transformers](https://jalammar.github.io/illustrated-transformer/), by Jay Alammar. 
Good complement to the previous link, with visualisations instead of code. 9 | - [The Annotated S4](https://srush.github.io/annotated-s4/), by Sasha Rush. Walks you through the Structured State Space for Sequence Modeling paper, with code at each step. If you've wonder what state spaces models are, take a look! 10 | - [A Visual Guide to MoE](https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-mixture-of-experts), by Maarten Grootendorst. First get a read of one of the transformers guide above, then have a look at this guide! A lot of nice visualisations. 11 | 12 | ## Prompting 13 | - [Show me the prompt](https://hamel.dev/blog/posts/prompt/) 14 | 15 | -------------------------------------------------------------------------------- /resources/About evaluation.md: -------------------------------------------------------------------------------- 1 | ## Knowledge 2 | ### General 3 | - [Foundational Model Development Cheatsheet](https://fmcheatsheet.org/), by AllenAI 4 | 5 | ### Automatic evaluation 6 | Two cool overviews on the challenges of automatic evaluation! 7 | - [Challenges in LM evaluation](https://github.com/lm-evaluation-challenges/lm-evaluation-challenges.github.io/blob/main/%5BMain%5D%20ICML%20Tutorial%202024%20-%20Challenges%20in%20LM%20Evaluation.pdf), a presentation by Hailey Schoelkopf and Lintang Sutawika. 8 | - [Lessons from the trenches on Reproducible Evaluation of LMs](https://arxiv.org/abs/2405.14782), a paper by EleutherAI 9 | - Two podcasts by Latent Space on evaluation 10 | - [Benchmarks 101](https://www.latent.space/p/benchmarks-101), on automatic benchmarks history and well-known associated issues 11 | - [Benchmarks 201](https://www.latent.space/p/benchmarks-201), on which evaluation method to use when, plus some tidbits about the Leaderboard with yours truly! 12 | 13 | 14 | ### LLM as a judge 15 | Cool summaries and experience feedbacks: 16 | - https://eugeneyan.com/writing/llm-evaluators/ 17 | - https://cameronrwolfe.substack.com/p/llm-as-a-judge 18 | - https://dylandigitalgarden.com/2024/July/July+31%2C+2024+LLM+%26+VLM-as-a-Judge 19 | 20 | ## Software 21 | ### Evaluation suites 22 | - [`lm_eval`](https://github.com/EleutherAI/lm-evaluation-harness/), by Eleuther (also known as "the Harness"). The powerhouse of LLM evaluations, allowing you to evaluate any LLMs from many providers on a range of benchmarks, in a stable and reproducible way. 23 | - [`lighteval`](https://github.com/huggingface/lighteval), by Hugging Face (disclaimer: I'm one of the authors). A light LLM evaluation suite, focused on customization and recent benchmarks. 24 | 25 | ### Leaderboards 26 | - [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard), by Hugging Face. 27 | Neutral 3rd party evaluation of Open LLMs on reference static benchmarks (open to submissions) 28 | - [HELM](https://crfm.stanford.edu/helm/lite/latest/#/leaderboard), by Stanford. 
29 | Also evaluates models on static benchmarks, but uses win-rates to rank models 30 | - [Chatbot Arena](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard), by LMSys 31 | Arena using crowdsourced human evaluation to score 150 LLMs 32 | - [LLM Performance Leaderboard](https://huggingface.co/spaces/ArtificialAnalysis/LLM-Performance-Leaderboard), by Artificial Analysis 33 | Performance benchmarks and pricing of the biggest LLM API providers, if you want to use an API instead of running things locally 34 | - [All our blogs about evaluations and leaderboards](https://huggingface.co/blog?tag=leaderboard) 35 | - [Leaderboard finder](https://huggingface.co/spaces/leaderboards/LeaderboardFinder): Find the most relevant leaderboard for your use case 36 | 37 | ### Tutorials 38 | - [End-to-end custom domain evaluation tutorial](https://github.com/argilla-io/argilla-cookbook/tree/main/domain-eval): This tutorial guides you through building a custom evaluation task for your domain. It uses with synthetic data and manual evaluation with [Argilla](https://github.com/argilla-io/argilla/) and [distilabel](https://github.com/argilla-io/distilabel). 39 | 40 | -------------------------------------------------------------------------------- /translations/CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | Hi! 2 | 3 | If you want to contribute a translation: 4 | - fork the repository 5 | - create a branch on your repo `adding_` 6 | - create a folder for your language using its [ISO 639-2 code](https://www.loc.gov/standards/iso639-2/php/code_list.php) if needed 7 | - translate the files you want, while respecting the folder architecture 8 | - commit and push to the branch 9 | - open a PR on the main repo! 10 | 11 | Then we'll review, maybe do some round of edits if needed, and merge! 12 | 13 | Once it's merged, you're done 🤗 -------------------------------------------------------------------------------- /translations/zh/contents/automated-benchmarks/basics.md: -------------------------------------------------------------------------------- 1 | # 基础概念 2 | 3 | *注:本文内容与我写的 [通用评估博客](https://huggingface.co/blog/clefourrier/llm-evaluation) 存在部分重叠* 4 | ## 什么是自动评估基准? 
5 | 6 | 自动化基准测试通常按照以下方式工作:你希望了解你的模型在某些方面的表现。这些“某些方面”可以是一个明确定义的具体任务,例如“我的模型在垃圾邮件分类中的表现如何?”,也可以是一个更抽象和通用的能力,例如“我的模型的数学能力有多强?”。 7 | 8 | 基于此,你可以通过以下方式构建评估: 9 | 10 | 数据集: 11 | 数据集由多个样本组成。这些样本包含模型的输入,有时还包括一个参考答案(称为“gold”),用于与模型的输出进行比较。 12 | 样本的设计通常是为了尽量模拟你想测试模型的场景。例如,如果你在研究电子邮件分类,你可以创建一个包含垃圾邮件和非垃圾邮件的样本数据集,并尝试加入一些具有挑战性的边界案例等。 13 | 14 | 评估指标: 15 | 评估指标用于对模型进行评分。例如:你的模型对垃圾邮件的分类准确度如何?正确分类的样本得分为1,错误分类的得分为0。 16 | 评估指标使用模型的输出来进行评分。在大型语言模型(LLMs)的情况下,人们主要关注两种输出: 17 | 18 | 模型根据输入生成的文本(生成式评估,generative evaluation) 19 | 提供给模型的一个或多个序列的对数概率(多项选择评估,有时称为 MCQA,或者困惑度评估 perplexity evaluations) 20 | 有关更多信息,请查看[模型推理与评估页面](https://huggingface.co/docs)。 21 | 22 | 在模型没有见过 (即未出现在训练集) 的数据上进行评估会更有意义,得出的模型 **泛化性** 结论才更准确。比如在只见过假冒银行垃圾邮件的模型上测试其能否正确分类与 “健康” 相关的垃圾邮件。 23 | 24 | 注:*模型只能在训练数据上预测效果良好 (没有隐式地学习到更高层次的通用范式) 的现象叫做 **过拟合**。这就类似于一个学生死记硬背了考试题目,却没有理解背后的知识点。所以只用训练集中的数据测试评估 LLM 得到的分数指标实际上是模型不具备的能力。* 25 | 26 | ## 自动评估基准的优劣势 27 | 优势: 28 | - **一致性和可重复性**:在同一个模型上运行相同的自动评估基准 10 次,测试结果也是相同的 (除非受到硬件或模型自身随机性的影响)。所以相同任务下,多个模型的测试排名结果是公正的。 29 | - **低成本规模效益**:目前自动评估基准是评估模型成本最低的方式之一。 30 | - **易于理解**:大部分自动化方式的评价指标理解起来都非常容易。 31 | *例如:精确匹配可以理解为生成文本跟参考文本是否完全一致;准确率可以理解为做出的选项有多大程度是正确的 (不过对于像 `BLEU` 或 `ROUGE` 这种评价方式,理解难度会稍微高一些)。* 32 | - **高质量测试集**:许多自动评估基准的测试集都来自专家级生成数据集或现有的高质量数据集 (如 MMLU 或 MATH)。当然也不是说这些测试集就完美无瑕,例如 MMLU 就被发现存在一些解析错误以及事实谬误,所以后来出现了一批改进的数据集,如 MMLU-Pro 和 MMLU-Redux。 33 | 34 | 劣势: 35 | - **复杂任务难以保证效果**:自动评估基准通常在测试效果容易定义和评估的任务上表现良好 (如分类任务)。一旦任务比较复杂而且难以拆分为目标明确的子任务时,表现可能不及预期。 36 | *例如:测试模型的 “数学能力” 任务。具体是算术、还是逻辑、亦或是推演新数学概念的能力?* 37 | 所以出现了一些无需拆分为子任务的 **通用性** 评估方式,由此评估出的模型整体表现就是评估目标的 **优良代理**。 38 | - **数据污染**:网络上的数据一旦以纯文本的形式公开,那么由于数据爬虫,这些数据总归会出现在模型训练集中。所以在评估时很难保证模型真的没有见过测试集。 39 | -------------------------------------------------------------------------------- /translations/zh/contents/automated-benchmarks/designing-your-automatic-evaluation.md: -------------------------------------------------------------------------------- 1 | # 设计你的自动评估任务 2 | 3 | ## 选择数据集 4 | 做评估时,你可以选择现有的数据集 (参考 [一些评估数据集](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/automated-benchmarks/some-evaluation-datasets.md) 页面) 作为测试集,也可以设计自己的数据集。有一点非常重要,请注意:**评估的结果与评估的数据集质量高度相关**。 5 | 6 | ### 使用现有的数据集 7 | 这部分强烈建议仔细阅读! 8 | #### 数据集需要注意的问题 9 | 样本是由谁创建的? 10 | 在我看来,按照样本的标注员素质高低,数据集质量大致排名如下:专家构建数据集 > 付费标注数据集 > 众包数据集 > MTurk 数据集。 11 | 你可以在数据集的说明文档 (data card) 找到标注员的统计信息,可以帮助理解数据集语言多样性。 12 | 13 | - **样本是否经过其他标注员或作者的审核?** 14 | 你需要先弄明白: 15 | - 不同标注员标注结果是否一致? 16 | - 完整数据集是否经过作者审核? 17 | 标注员通常不是目标语言的母语使用者(例如 AWS Mechanical Turk),否则可能会出现拼写错误、语法错误或无意义的答案。 18 | 19 | - **是否给标注员提供了明确的数据创建指导?** 20 | 换句话说,数据集样本间的标注标准是否一致? 21 | 22 | #### 检查样本 23 | 随机抽取 50 个样本进行人工检查: 24 | - *检查质量*: 25 | - 问题是否明确且不含歧义? 26 | - 对应的回答是否正确?(*例如:TriviaQA 的每个问题通常包含多个标准答案,有时这些答案会相互冲突。*) 27 | - 信息是否完整?(*例如: MMLU 有许多问题中缺少参考示意图。*) 28 | - *检查与任务相关性*: 29 | - 样本问题是否是 LLM 特定评估任务的问题类型? 30 | - 样本是否与测试用例相关? 
31 | 32 | 数据集样本数量同样重要 (以确保自动评估基准结果在统计上显著,一般至少需要 100 个测试样本)。 33 | ### 设计自己的数据集 34 | 有 3 种设计方法: 35 | #### 整合数据 36 | 要使用自己的测试集评估模型执行特定任务的能力,可以从不同的现成数据源整理和聚合。实际上有许多评估测试集都是以这种方式构建的,例如 MATH 和 LSAT 就聚合了人工评估数据集。当然在整理数据时,请遵循上文的质量与任务相关性检查步骤。 37 | #### 人工标注 38 | 关于 `人工标注` 的内容,本指南有一整个篇幅详细介绍,可以自行点击 [Using human annotators](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/human-evaluation/using-human-annotators.md) 阅读。 39 | #### 合成数据 40 | - **使用 LLM 合成** 41 | 这部分可以参考 HF 员工的 [Cosmopedia](https://huggingface.co/blog/cosmopedia) 博客!虽然此篇主要研究如何构建训练集,但想法和技术同样适用于构建测试集。 42 | 合成的测试集仍需手动检查 (遵循上文步骤)。 43 | 44 | - **基于规则合成** 45 | 如果任务允许,这个绝佳的方法几乎可以无限获取测试样本,并且避免数据污染。 46 | 参考 [NPHardEval](https://arxiv.org/abs/2312.14890),[DyVal](https://arxiv.org/abs/2309.17167),[MuSR](https://arxiv.org/abs/2310.16049), [BabiQA](https://arxiv.org/abs/1502.05698) 等。 47 | 48 | ## 选择推理方法 49 | 除了测试集,还需要选择合适的推理方法。 50 | 51 | 对于多项选择问答任务 (通常用于测试模型的知识储备或消除歧义的能力),使用对数概率 (MCQA) 非常有效。 52 | - 优势: 53 | - 可以保证所有模型都能获取正确答案。 54 | - 能够提供模型 “置信度” 代理 (以及校准)。 55 | - 评估速度快,尤其是单 token 预测任务时 (选择索引 A/B/C/D 或 Yes/No 等)。 56 | - 允许获取小模型在任务表现上的信号。 57 | - 劣势: 58 | - 可能高估小模型的表现。如果不做限制,会使得模型生成的内容超出可选范围。 59 | - 估结果可能不具代表性。一些模型 [倾向于按多项选择的顺序生成特定选择](https://arxiv.org/abs/2309.03882)。 60 | 61 | 对于测试模型流畅性、推理或回答问题能力的任务,使用 QA 生成非常有效。 62 | - 优势: 63 | - 与人类关心的点一致,即 LLM 生成文本是否流畅的能力。 64 | - 劣势: 65 | - 可能存在评分困难 (见下面的 `度量标准` 部分)。 66 | - 成本比对数似然评估稍高,尤其是需要采样的任务。 67 | 68 | ## 选择 prompt 69 | Prompt 设计关键问题: 70 | - 提供给模型的关于任务的信息量大小 71 | - 如何向模型提供信息 72 | 73 | MCQA 或 QA 任务的通用 prompt 设计范式一般包含以下几个部分: 74 | - 任务 prompt (可选):描述任务。 75 | - 上下文:为问题提供额外的背景信息。 76 | - *例如: 对于内容总结或信息提取任务,可以提供内容来源* 77 | - 问题:prompt 的核心内容。 78 | - 对于多项选择评估任务,可以增加选项。 79 | - 连接词 (`问题`、`上下文`、`选项`等)。 80 | 81 | 定义 prompt 时需要注意: 82 | - 在语义等价的 prompt 中,即使非常微小的变化也可能导致巨大差异的结果 (详见 [Troubleshooting reproducibility](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/troubleshooting/troubleshooting-reproducibility.md) 的 `Different prompt` 部分),并且 prompt 格式也可能对特定模型的输出造成影响。 83 | - 如何缓解这一问题: 84 | - 高成本方法:使用不同的 prompt 变体进行多次评估。 85 | - 低成本方法:使用多种 prompt 格式分别分配给多个等效难度的测试样本进行单次评估。 86 | - 在 prompt 中提供示例可以帮助模型输出遵循预期格式,示例可以通过连接词添加至 prompt。 87 | - 注意模型可能倾向于对特定的 prompt 格式过拟合。 88 | - [这篇论文](https://arxiv.org/abs/2407.07890) 对此有更详尽的探讨,文中展示了一些模型因在测试集 **格式** 上过拟合而导致的评估分数过高的情况。 89 | - 我们特别观察到,在 Open LLM Leaderboard 2 上, Llama 3.2 和 Qwen 2.5 出于这个原因已经不再提供 few-shot 示例的 prompt 格式。 90 | - 对于一些测试任务的指标,你可能希望模型的输出限制在一个小范围。 91 | *可以跳转 [Model inference and evaluation](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/general-knowledge/model-inference-and-evaluation.md) 页面的 `Constraining model outputs` 部分了解更多信息。* 92 | 93 | ## 选择评估指标 94 | 如果你关注 **对数概率** 评估,那么你期望的度量指标会很简单:准确率 (选择最佳选项的频率)。如果在这个基础上你还想要进行标准化 (通过长度、字符、token 或 PMI),那么度量指标就会变成困惑度 (perplexity)、召回率或 F1 分数。 95 | 96 | 对于 **生成式** 评估,你期望的度量指标范围会更广。 97 | 为此你需要: 98 | 1. 确定生成结果的度量顺序,是直接拿生成结果比较,还是先使用某种方式进行标准化。 99 | - 标准化如果设计不当,评估结果会有失偏颇 (参考这篇 [博客](https://huggingface.co/blog/open-llm-leaderboard-drop))。但总的来说,它们都能在任务层面提供信号。 100 | - 标准化对某些特定任务 (例如数学能力评估) 非常重要,因为你可能需要从格式化输出中提取有效的结果。 101 | - 如果你想要通过添加机制 (如思维链) 来评估准确率,那么标准化同样重要,因为你需要将推理轨迹从实际结果中去除。 102 | 2. 
确定生成结果与参考答案的比较方式。 103 | 你可以采用任意的比较方法。评估匹配程度的有:精确匹配、前缀匹配等;评估摘要和翻译能力的有:ROUGE、BLEU、n-gram 等。更多评价指标可以点击 [这个页面](https://github.com/huggingface/lighteval/wiki/Metric-List) 查看,我会在后续更新关于在何时使用哪种指标的章节。 104 | 105 | 总的来说,选择哪种评价指标取决于你的任务内容。对于某些领域 (如医疗、聊天机器人),你可能不想要评估平均性能,而是需要评估 **最差表现** (如医疗知识输出质量、如果输出不实的后果等)。(*可以查看 [这篇博客](https://ehudreiter.com/2024/07/10/challenges-in-evaluating-llms/) 深入了解*) 106 | 107 | ## 智能新任务:功能性测试是什么? 108 | 对于代码领域,显然仅评估生成代码的语义是不够的,必须测试代码实际运行情况。所以需要专门设计一个功能性测试:对于给定 prompt 生成的代码段,测试并评估其是否能正确通过单元测试。 109 | 110 | 这种功能性测试方法极具前景,因为: 111 | - 使得生成测试用例更容易 (大部分情况下都可以基于规则生成测试用例) 112 | - 减少过拟合 113 | - 可以评估模型的特定主动能力 114 | 115 | 不过很多新奇的想法需要一些创造性的工作才能实现! 116 | 117 | IFEval 是一个不错的例子,它是用来测试模型指令遵循能力的评估基准,通过创建多个格式化指令 (*例如:添加指定数量的特殊符号,仅将一句话字母大写,等等*) 并严格测试生成结果的遵循与否。功能性测试的想法仍需更多的工作来扩展到其他的特征测试上! 118 | -------------------------------------------------------------------------------- /translations/zh/contents/automated-benchmarks/tips-and-tricks.md: -------------------------------------------------------------------------------- 1 | # 技巧与提示 2 | 3 | ## 数据污染管理 4 | 通常我们会假设在互联网上公开可用的数据集是存在数据污染问题的。 5 | 6 | 缓解措施有: 7 | - 测试集中加入 **哨兵字符串 (canary string)** (如[BigBench](https://github.com/google/BIG-bench)),这是一种特殊的字符组合,使得模型创建者可以在训练集中查找,来表明该数据中是否包含评估。 8 | - 测试集采用 **[加密](https://arxiv.org/abs/2309.16575) 或 [门控](https://huggingface.co/datasets/Idavidrein/gpqa)** 形式,以防被网络爬虫解析并意外地被加入训练集。 9 | - 进行 [动态基准测试](https://arxiv.org/abs/2104.14337),定期更新基准测试来防止模型 “死记硬背答案” (但会增加数据集成本)。 10 | - 进行后验 [污染检测](https://arxiv.org/abs/2311.06233),例如检验生成结果的困惑度,或者设计对抗版的 prompt 等。然而不幸的是,目前没有任何一种污染检测方法是完美的。 11 | 12 | 不过也不用太担心,数据集被污染并不意味着训练过程就没有意义和信号收益。 13 | 14 | ## 可能遇到的实际问题 15 | 16 | ### 微调模型、系统提示和聊天模板 17 | 要避免指令微调模型的表现过于糟糕,需要做到: 18 | - 在推理 prompt 的最前面添加系统 prompt 19 | - 使用聊天模板 prompt (通常在对话轮次添加 `Assistant` 和 `User` 前缀,可以查看 [这篇指南](https://huggingface.co/docs/transformers/main/en/chat_templating) 了解详情) 20 | 21 | 另外,你无需假设不同的分词器实际表现会相同,特别是在处理聊天模板的时候。更具体的说明可以查看 [这条推文](https://x.com/danielhanchen/status/1796952220619157694) 中附带的图片。 22 | 23 | ![Spacing, tokenization and template](https://pbs.twimg.com/media/GPANfpiasAA9b6F?format=png&name=medium) 24 | 25 | ### 分词 26 | 27 | 1. **多选问答 (MCQA) 评估与分词** 28 | 29 | 一般来说,在 MCQA 评估中将上下文与选项放在一起做分词会更好,因为这样生成的 tokens 序列对于模型来说会自然一些。 30 | 31 | 不过,有些分词器 (如 [Llama one](https://github.com/EleutherAI/lm-evaluation-harness/pull/531#issuecomment-1595586257)) 不满足以下条件:`enc(context + choice) = enc(context) + enc(choice)` (并且有可能增加或删除空格)。因此把上下文和选项放在一起处理的话,上下文的分词结果会 “渗透” 到选项的分词结果中,进而导致选项的对数概率结果被混淆而不可信。 32 | 33 | 如果你的模型也有这个问题,你可以尝试分别对上下文和选项计算分词结果,然后去除额外添加的特殊字符,最后将两个结果拼接到一起。 34 | 35 | 2. **文本评估与句子的起始终止 token** 36 | 37 | 起始 token:有些模型 (如 `Gemma`) 在推理时会对 [句子起始位置的 token](https://github.com/EleutherAI/lm-evaluation-harness/pull/1465) 极为敏感。要想知道你的模型是否存在这个问题,你可以多做几次实验,在每次评估中手动添加不同的起始 token 来检查一下。 38 | 39 | 终止 token:有时候模型会出现无法终止生成的问题,也就是模型未能生成终止 token (比如 `\n`),这是因为模型不会单独预测终止 token,而只能包含在更高级的 token (如 `\n\n`,在代码模型中也可能是单个 token) 中生成。此时可能需要添加一个特定的检查来 “回溯” 生成的文本来确保在正确的位置截断句子,以确保计算的度量指标无误。 40 | 41 | 3. **多语言评估与分词** 42 | 43 | 在进行多语言评估时,需要根据评估任务和度量指标来确定文本分词方法。由于某些语言不使用空格作为单词间的分隔符 (例如韩语、泰语、日语和中文等),所以需要特定语言的分词器才能合理的切分,否则就会影响一些度量指标分数,如 [BLEU](https://github.com/EleutherAI/lm-evaluation-harness/issues/212)、F1 分数等。 44 | 45 | 4. 
**代码评估与终止 token** 46 | 47 | 代码模型通常会在训练时将 `\n\t` 单独作为一个 token。这意味着在生成文本时会将 `\n\t` 一步生成。如果某个任务将 `\n` 定义为终止 token (表示生成停止),但模型生成单个 token 却是 `\n\t`,那生成过程就会无限持续。为了让模型能够停止生成,需要更新文本的终止 token,或者定义一种机制来回溯最新 token 的字符表征,以便停止 (并截断) 生成。 48 | 49 | ### MCQA 评估的简单加速 50 | 如果你的 MCQA 评估任务只需要模型预测一个 token,那么预测速度就可以大大加快。 51 | 52 | 你可以单次运行 `上下文` 推理得到全词汇表 (其中就包括了所有的选项) 的概率分布,进而按对数概率获取你感兴趣的 token,这样就避免了对每个选项的多次推理 (`上下文 + 选项 1`, `上下文 + 选项 2`, 等等),从而实现加速。 53 | 54 | (我们在 `lighteval` 中就是这么做的)。 55 | 56 | ## 如果生成式评估中的结果不及预期怎么办? 57 | 58 | 首先要经常检查模型的生成结果。排查可能原因时,需要关注以下常见问题: 59 | - 由于模型输出的解析过于严格 (在计算指标之前) 而导致答案丢失 60 | - 解决方法:调整解析方式 61 | - 模型无法以 few-shot 的方式遵循输出格式 (这个问题在近期训练的指令模型经常出现,如 Llama 3.2 或 Qwen 2.5)。 62 | - 解决方法:调整 prompt 格式,或者做一个模型能够遵循 few-shot 示例的假设 63 | - 模型输出过于冗长而导致无法得到正确答案 (在长上下文模型中更为常见,如 Qwen 和 CommandR) 64 | - 解决方法:增加上下文长度,或者在任务 prompt 中添加指令,还可以做一个模型能够输出简练回答的假设。 65 | 66 | -------------------------------------------------------------------------------- /translations/zh/contents/general-knowledge/model-inference-and-evaluation.md: -------------------------------------------------------------------------------- 1 | # 模型推理与评估 2 | 3 | ## 引言 4 | 现阶段大型语言模型的工作原理很简单:给定输入文本,模型学习预测合理的后续内容。 5 | 6 | 这个过程分为两步。 7 | ### 分词 8 | 输入文本 (推理阶段也叫 *prompt (提示词)*) 首先被拆分为多个 *token*,一个 token 表示文本的一小段 (可以是一个或多个字符,也可以是一个单词),每个 token 都会被映射为一个数字。模型能够解析的全部 token 的范围叫做该模型的 *vocabulary (词汇表)*。*([Tokenization](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/general-knowledge/tokenization.md) 页面有对分词更深入的解释)*. 9 | 10 | ### 预测 11 | 12 | ![](https://github.com/huggingface/evaluation-guidebook/blob/main/assets/llm_tk_1.png?raw=true) 13 | 14 | 给定输入文本,LLM 会在所有词汇表中生成下一个 token 的概率分布。如要持续生成,我们可以选择概率最高的 token (此过程中可以引入一些随机性来获得更有趣的输出) 拼接至 prompt 的末尾作为下一个输入来生成后续 token,然后重复这个操作,依此类推。 15 | 16 | ## 评估的预测目标 17 | LLM 评估根据预测目标可以大致分为两类: 18 | - 给定一个 prompt 和对应的一个 (或多个) 回答,模型预测该回答的概率是多少? 19 | - 给定一个 prompt,模型生成回答的内容是什么? 20 | ### 对数似然评估 21 | 对数似然评估的预测目标为:给定 prompt 下的单选或多选回答的条件概率。换句话说就是,输入一句话,模型输出特定续写的可能性有多大? 
22 | 具体步骤为: 23 | - 将每个选项与 prompt 拼接并输入给 LLM,模型根据前面的内容输出每个 token 的 logit 值 24 | - 仅保留最后的 logit 值 (与选项的 token 相关),应用对数 softmax 来获取对数概率 (范围为 `[-inf, 0]`,而不是 `[0-1]`) 25 | - 然后将所有选项 token 的对数概率相加,获得多选的整体对数概率 26 | - 最后可以根据选项长度进行归一化 27 | 28 | ![](https://github.com/huggingface/evaluation-guidebook/blob/main/assets/llm_logprob.png?raw=true) 29 | 30 | 可以使用以下任一方式来计算评估指标: 31 | - 获取模型在多选回答中的首选偏好,如上图。(*不过这种评估方式会使得此类模型得分偏高:不做限制时会生成选项之外回答的模型。如图中的 `Zygote`*) 32 | - 测试模型单选概率是否高于 0.5 33 | - 研究模型校准。一个校准良好的模型会对正确回答预测最高的概率。 34 | *(要深入了解校准可以阅读 Anthropic 的 [这篇论文](https://arxiv.org/abs/2207.05221),里面详细介绍了校准的定义、检测方法、以及怎样训练一个校准良好模型。另外要了解校准的一些局限性可以阅读 [这篇论文](https://arxiv.org/abs/2311.14648))。* 35 | 36 | ### 生成式评估 37 | 对于生成式评估的预测目标,我们希望得到给定 prompt 下模型生成的文本。 38 | 39 | 我们可以通过自回归的方式获得生成文本:将 prompt 输入模型,将模型预测概率最高的下一个 token 选定为模型的 “首选 token”,然后重复这一过程,直到达到生成结束条件 (最大 token 长度、停止生成的特殊 token 等)。模型生成的所有 tokens 都视为对该 prompt 的回答文本。 40 | 41 | ![](https://github.com/huggingface/evaluation-guidebook/blob/main/assets/llm_gen.png?raw=true) 42 | 43 | 44 | 45 | 然后我们可以将生成文本与参考回答进行比较,并通过计算两者之间的差距进行评分 (可以使用简单指标评判,例如是否精确匹配,或者诸如 BLEU 的复杂指标,甚至可以使用模型进行评估)。 46 | 47 | ### 深入阅读推荐链接 48 | - ⭐ [多种评估 MMLU 的方法](https://huggingface.co/blog/open-llm-leaderboard-mmlu),我所在的 Hugging Face 团队撰写的博客。如果你想更深入了解多选对数似然评估和生成式评估之间的差异,包括这些差异对评分变化的意义,推荐阅读此文。 49 | - 上述插图来自 Thom Wolf 的博客 50 | - ⭐ [EleutherAI 论文中对上述推理方法背后绝美的数学公式推导](https://arxiv.org/abs/2405.14782v2),可以直接跳到附录部分查看。 51 | ## 约束模型输出 52 | 在许多情况下,我们希望模型的输出遵循特定格式,例如在需要与参考回答进行比较时。约束方法有: 53 | ### 使用 prompt 54 | 最简单的方法是添加一个任务 prompt,其中包括了非常具体的指示,来告诉模型如何回答 (例如:`使用数字形式回答`、`不要使用缩写` (`Provide numerical answers in digits.`,`Use no abbreviation`) 等)。 55 | 56 | 虽然这种方法并不能保证每次都有效,但对于高能力模型来说效果已经足够好。这也正是我们在 [GAIA](https://huggingface.co/papers/2311.12983) 论文中采用的方法,如果你想获取一些任务 prompt 的书写灵感,可以在 [leaderboard](https://huggingface.co/spaces/gaia-benchmark/leaderboard) 页面的 Submission 标签中找到我们设计的任务 prompt。 57 | ### Few-shot 和上下文学习 58 | 另一种约束模型输出的方法是 “上下文学习 (in context learning)”。通过在 prompt 中提供示例 (称为 `few-shot prompting`),使得模型隐性地遵循重复的 prompt 模式,进而输出回答文本。 59 | 60 | 这种方法直到 2023 年底整体效果都还不错!然而随着指令微调 (instruction-tuning) 方法的广泛使用,以及在模型预训练的后期阶段 (继续预训练) 加入了更多的指令数据,似乎使得新模型倾向于遵循特定的输出格式 (在 [这篇论文](https://arxiv.org/abs/2407.07890) 中称为 `在测试集上训练`,或者叫 `过拟合 prompt 格式`)。另外对于稍旧的模型,特别是上下文窗口较小的模型,这种方法也有点受限,因为某些 few-shot 示例可能无法被拟合到上下文窗口里。 61 | ### 结构化文本生成 62 | 结构化文本生成的方法通过预定义语法或正则表达式来约束模型输出。例如,`outlines` 库通过有限状态机 (FSM) 实现了这个方法,非常巧妙。(实际还有其他实现方法,比如使用交错生成来约束 JSON 格式输出。但 FSM 是我个人最喜欢的方法) 63 | 64 | 如想了解结构化生成方法的原理,可以阅读我们写的这篇 [博客](https://huggingface.co/blog/evaluation-structured-outputs):结构化生成降低了评估中的 prompt 方差,使得评估结果和排名更加稳定。你还可以查看 `outlines` 库的概览 [博客](https://blog.dottxt.co/),了解更多与结构化生成相关的有趣实现和观察结果。 65 | 66 | 不过,一些最新 [研究](https://arxiv.org/pdf/2408.02442) 表明,结构化生成在某些任务 (例如因果推断) 中可能会降低模型性能,因为它使得先验分布偏离了预期的概率分布。 67 | 68 | ### 深入阅读推荐链接 69 | - ⭐ [理解结构化生成的有限状态机工作原理](https://blog.dottxt.co/coalescence.html),由 Outlines 出品的博客,提供了非常清晰的可视化指南。 70 | - [outlines 方法的论文](https://arxiv.org/abs/2307.09702),对上述博客更加学术的解释 71 | - [Guidance-AI 的官方 repo 实现的交错生成方法](https://github.com/guidance-ai/guidance?tab=readme-ov-file#guidance-acceleration),是另一种约束模型遵循特定格式输出的方法。 72 | -------------------------------------------------------------------------------- /translations/zh/contents/general-knowledge/tokenization.md: -------------------------------------------------------------------------------- 1 | # 分词 2 | 3 | ## 为什么以及如何对文本进行分词? 
4 | LLM 本质上是庞大的数学函数,模型只能处理数字,无法直接处理文本。 5 | 6 | 所以我们需要把文本转化为数字。具体地,如给定一句文本,须先将它以某种方式切分成多个小段,再将每个小段映射为一个数字。这个过程就叫 *分词 (tokenization)*。 7 | 8 | 在英文中,以前大家的做法是将文本句子中的每个字符与字母表索引对应起来 (例如 `a` -> 1, `b` -> 2 等),这称为 *基于字符的分词* (按字符切分)。另一种做法是,将每个单词与词典索引对应起来 (例如 `a` -> 1, `aardvark` -> 2, `ab` -> 3 等),这称为 *基于单词的分词* (英文或其他带有空格的语言可以按空格切分,不含空格的语言就会稍微复杂一些,例如中文)。 9 | 10 | 然而这些方法都存在很强的局限性:它们丢失了输入文本中的部分信息,尤其是单词形态中反映出的语义联系 (例如,`dis similar`、`similar`、`similar ity`、`similar ly` 等)。实际上我们希望模型能够保留并学到这些信息,以便把有语义联系的单词关联起来。 11 | (此外,如果输入文本突然出现一个全新的单词,就没法将它映射到已有数字中,那么模型就处理不了这个词😔) 12 | 13 | 很自然地,有人提出了将单词切分为子词,并为这些子词分配数字索引 (例如 `dis`、`similar`、`ity`、`ly`)! 14 | 15 | 最开始这个过程是根据形态句法规则 ("morpho-syntax",即构词法的语法规则,如词缀、词形等) 实现的。现在使用更多的是字节对编码 (BPE, byte pair encoding),这是一种智能统计方法,可以根据参考文本中的词频自动计算和切分子词。 16 | 17 | 小结:分词是一种将切分的文本小段 (可以是一个或多个字符,也可以是一个单词) 映射为数字 (类似于索引) 的方法。对输入文本 (推理阶段也叫 *prompt*) 进行切分和映射的工具叫 *分词器 (tokenizer)*,文本被切分的小段叫 *token*,分词器或模型能够解析的全部 token 范围叫 *词汇表 (vocabulary)*。 18 | #### 深入理解分词: 19 | 建议深入阅读前两个链接。 20 | - ⭐ [HuggingFace 🤗 NLP 课程中对不同分词方法的解释](https://huggingface.co/learn/nlp-course/en/chapter2/4) 21 | - ⭐ [HuggingFace 🤗 transformers 库说明文档中的分词基本概念及代码用法](https://huggingface.co/docs/transformers/en/tokenizer_summary) 22 | - [Jurafsky 制作的分词课程](https://web.stanford.edu/~jurafsky/slp3/2.pdf)。这个课程更偏学术,可以直接跳到 2.5 和 2.6 章 (其余章节也很有趣,但范围太广有些超纲) 阅读。 23 | 24 | #### 深入理解字节对编码 25 | - ⭐ [HuggingFace 🤗 NLP 课程中对 BPE 的解释](https://huggingface.co/learn/nlp-course/en/chapter6/5) 26 | - [首次提出 BPE 的论文](https://aclanthology.org/P16-1162/) 27 | 28 | 29 | ## 分词中要注意的问题 30 | ### 选择合适的词汇表大小 31 | 词汇表的大小决定了模型需要学习的单位 token (例如子词) 的数量。 32 | 33 | 如果词汇表 **太大**,可能会包含一些非常罕见的词,不做切分直接映射为一个完整的词 token (例如:`aardvark`),这会导致两个问题: 34 | 35 | 一方面,如果这些几乎从未出现过的罕见词加入训练数据中,模型可能很难学习它与其他语义概念的联系,导致难以推断其含义。 36 | 37 | 另一方面,如果这些罕见词仅出现在特定的上下文中,模型可能会将它关联到某些非常具体的词。例如在论坛数据上训练时,分词器将某个用户名完整映射为一个 token 并加入词汇表,那么模型可能会将此 token 关联到该用户发布的特定内容上去。 38 | 39 | 如果词汇表 **太小**,同样会带来两个问题:表达能力太弱、推理成本增加。 40 | 41 | 回到之前对 `similar` 的派生词进行分词的例子。如果使用类 BPE 方法 (大词汇表) 将会把 `similarly` 切分为两个 tokens (`similar` 和 `ly`)。而如果使用基于字符的分词方法 (词汇表非常小,只包含字母表),那么这个单词就会被切分为 9 个 tokens (`s`, `i`, `m`, `i`, `l`, `a`, `r`, `l`, `y`)。 42 | 43 | 第一种方法将 `similarly` 切分为了具有独立语义的 token,而第二种方法由于词汇表太小,导致切分时丢失了一些语义表征。另外在推理生成一个单词时,token 表征长度的不同还导致了推理成本的差距,较小的词汇表生成的成本更高 (9 个 tokens 的生成成本比 2 个 tokens 增加了 4 倍以上!)。 44 | 45 | 目前大部分人采用启发式方法,根据模型大小和涵盖的语言数量来推断词汇表大小。在选择适合自己模型的词汇表大小时,也可以参考同等量级模型的词汇表大小来确定。 46 | #### 深度理解罕见 token 47 | - [SolidGoldMagikarp 在 Less Wrong 上的帖子](https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation) 48 | - 很有趣的文章,介绍了大家如何在 OpenAI 的词汇表中找到极其罕见的 token。这真的很酷!因为是在未访问模型内部信息的情况下完成的 (例如没人知道 OpenAI 的训练数据内容)。 49 | - [Cohere 的论文:Fishing for Magikarp](https://arxiv.org/abs/2405.05417) 50 | - 检测模型 token 的工作。 51 | 52 | ### 管理多语言环境 53 | (建议:在阅读本部分前先了解 BPE 的相关内容) 54 | 选择分词器时,词汇表是根据参考文本构建的,也就是说分词器是认识参考文本中的词汇和字符的,而一般情况下文本主要为英语和拉丁语。 55 | 56 | 如果你想要模型学习一门新语言,而且很幸运,该语言使用的字母表和词根等与原语言相同,那么理论上模型学到的原语言的语义表征可以迁移到新语言中。 57 | 58 | 然而,如果你希望分词器能够对其他语言的文本进行正确的分词 (尤其是字母表不同的语言),那么最好在构建分词器时就包含这些语言的数据。大部分情况下,新语言 (如泰语或缅甸语) 数据的加入会使得与原语言 (如英语) 的比例不平衡,原语言的占比通常高的多。由于当前常用的高效分词方法 (如 BPE) 是基于词频来创建复杂的词汇表,而英语使用频率最高,其单词会被切分为较长的 token。相比之下,不太常见语言的单词则会被切分为字符级别的 token。 59 | 60 | 这种现象导致了多语言分词中的不平衡现象:不常见 (低频、或 *低资源*) 的语言反而需要更多的 tokens 数才能生成与英语相同长度的文本。 61 | 62 | #### 深入理解多语言分词 63 | - ⭐ [Yennie Jun 对跨语言分词问题的精美分析与演示](https://www.artfish.ai/p/all-languages-are-not-created-tokenized) 64 | - 非常清晰明了!也建议亲自试用一下 HuggingFace 上的在线多语言分词 [space](https://huggingface.co/spaces/yenniejun/tokenizers-languages)。 65 | - ⭐ [Aleksandar Petrov 
制作的关于分词不平衡的 demo](https://aleksandarpetrov.github.io/tokenization-fairness/) 66 | - 推荐阅读 `Compare tokenization of sentences` 部分,来对不同语言推理的成本差异有一个大致的感觉。 67 | 68 | ### 数字分词问题 69 | 构建分词器时,还需要决定如何处理数字:是仅对 0 到 9 进行索引,并假设其他所有数字都可以由这些基础数字组合;还是要单独存储一定量级的 (例如一百万以内的) 所有数字?目前的知名模型采取了多种方法处理数字分词问题,但哪种方法更有利于数学推理仍然没有确定的答案。或许会出现诸如层级分词的新方法更适合处理这个问题。 70 | #### 深入理解数字分词 71 | - ⭐ [Yennie Jun 制作的可视化 demo,展示了 Anthropic、Meta、OpenAI 以及 Mistral 模型如何对数字分词](https://www.artfish.ai/p/how-would-you-tokenize-or-break-down) 72 | - [Beren Millidge 总结的多年来数字分词问题的演变](https://www.beren.io/2024-05-11-Integer-tokenization-is-now-much-less-insane/) 73 | -------------------------------------------------------------------------------- /translations/zh/contents/human-evaluation/basics.md: -------------------------------------------------------------------------------- 1 | # 基础概念 2 | 3 | ## 什么是人工评估? 4 | 人工评估是指让人类评价模型输出回答的好坏。 5 | 本文讨论的都是后验评估,即模型已经完成训练,给定一个任务让人类进行评估。 6 | 7 | ### 系统化评估 8 | 系统化的人工评估主要有 3 种方式: 9 | 10 | 如果你手头 **没有现成的数据集**,但还是想测试一些模型的能力,可以采用人工评估:提供一个任务说明和打分指南 (例如:`尝试与模型交互,迫使模型输出不当语言,即包含冒犯性、歧视性、暴力等。如果模型输出了不当语言,则得分为 0,反之为 1。`),以及可供交互的测试模型,然后就可以让标注员人工操作并评分,同时列出评分理由。 11 | 12 | 如果你手头 **已经有数据集** (例如 `收集了一组 prompt,并确保这些 prompt 不会迫使模型输出不当回答`),可以自行将 prompt 输入模型得到输出,然后将输入 prompt、输出回答、打分指南一起提供给标注员评估 (`如果模型意外输出不当,则得分为 0,反之为 1`)。 13 | 14 | 如果你手头 **既有数据集也有评分结果**,可以让人工标注员通过 [错误注释](https://ehudreiter.com/2022/06/01/error-annotations-to-evaluate/) 的方法 (*这种方法同样可以作为评估系统,适用于上面的情况*) 来对评估进行审查。在测试新评估系统时,这一步非常重要,但是技术测层面属于对评估系统的评估,因此略微超出本文的讨论范围。 15 | 16 | 注: 17 | - *如要对已部署的生产模型做评估,可以考虑进行人工 A/B测试及反馈。* 18 | - *[AI 审计 (AI audits)](https://arxiv.org/abs/2401.14462) (模型外部系统评估) 也是一种人工评估方式,但不在本文讨论范围。 19 | 20 | ### 非正式评估 21 | 基于人类的评估方法还有两种不那么正式的方法: 22 | 23 | **Vibes 检查** 是一种使用非公开数据进行人工评估的方法,用来在多个场景用例 (如代码编程和文学创作等) 上测试来把握整体效果。评估结果通常会被当作轶事证据而分享在 Twitter 和 Reddit 上,不过它们很容易受到主观认知偏差的影响 (换句话说,人们往往只相信自己相信的结果)。尽管如此,这些结果依然能作为 [你自己测试的一个不错起点](https://olshansky.substack.com/p/vibe-checks-are-all-you-need)。 24 | 25 | **Arenas** 是一种众包人工评估的方法,用来给多个模型表现排名。 26 | 一个知名的例子是 [LMSYS 聊天机器人 Arena 评估](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard), 社区用户通过与多个模型对话来分辨孰优孰劣并投票。总的投票结果将汇总为 Elo 排名 (这场多个模型比赛的排名),来评判出 “最优模型”。 27 | ## 人工评估的优劣势 28 | 29 | 优势: 30 | - **灵活性**:只要评估定义的足够明确,人工评估几乎适用于所有任务! 
31 | - **无数据污染**:人工书写的问题 prompt 不会跟训练集有交叉 (希望如此)。 32 | - **与人类偏好的相关性**:这条显而易见,毕竟是按人工标准来评分的。 33 | *注:进行人工评估时,尽量确保标注员的多样性,以保证评估结果的泛化性。* 34 | 35 | 劣势: 36 | - **第一印象偏差**:人工标注员往往根据 [第一印象](https://arxiv.org/abs/2309.16349) 来评估回答的质量,有时候会忽略对事实的考证。 37 | - **语气偏差**:众包标注员对语气特别敏感,容易低估一些表述比较坚定的句子而出现事实或逻辑错误。比如模型以自信的语气说出错误的内容,标注员可能很难发觉,进而导致对输出更为自信的模型的评分偏高。相比之下,专家标注员受语气偏差的影响更低。 38 | - **自我偏好偏差**:人们有时候会 [偏向于选择迎合自己观点的答案](https://arxiv.org/abs/2310.13548),而不是事实正确的答案。 39 | - **身份背景偏差**:不同身份背景的人具有不同的价值观,可能导致评估模型时表现出显出差异 (例如在模型输出的 [不当回答评估](https://arxiv.org/abs/2205.00501) 中,对何为不当表述的理解偏差)。 40 | ### 系统化人工评估 41 | 系统化人工评估 (尤其是付费的人工) 的优势: 42 | - **高质量数据**:可以根据评估任务量身定制测试集,为你开发 (例如需要开发偏好模型) 和评估模型提供进一步支持。 43 | - **数据隐私**:付费标注员 (尤其是内部人员) 通常很注重数据安全性,反而 LLM 评估的闭源 API 模型的数据隐私性较差,因为你需要将个人数据发送给外部服务。 44 | - **可解释性**:标注员在评分时会清晰的说明打分理由。 45 | 46 | 缺点: 47 | - **成本较高**:当然你需要支付给标注员费用。甚至为了优化评估指南,你还需要多轮迭代,这会使得费用更高。 48 | - **扩展性差**: :除非你的评估任务非常依赖用户反馈,否则人工评估方法的扩展性确实不太好,因为每次进行一轮新的评估都需要重新调动人员 (并支付费用)。 49 | - **重复性低**:除非你能保证每次评估都是同一批标注员并且评分标准完全明确,否则不同的标注员对评估结果的可能无法精确复现。 50 | 51 | ### 非正式人工评估 52 | 优势: 53 | - **成本较低**:社区成员自愿参与,费用支付较少。 54 | - **发现边缘用例**:由于限制较少,成员自发的创造性可能会发现一些有趣的边缘用例。 55 | - **扩展性高**:只要有足够多的社区成员自愿参与,评估的扩展性就会更好,且参与门槛较低。 56 | 57 | 劣势: 58 | - **高度主观性**:由于社区成员的自身的 [文化局限性](https://arxiv.org/abs/2404.16019v1),尽管标准一致,也很难在评分时保持一致性。不过 “群体智慧” 效应 (参考 Galton 的 Wiki 页面) 可能在大量的评分投票中平滑地缓解这一问题。 59 | - **评分偏好不具代表性**:由于年轻西方男性在互联网技术社区中的占比过高,可能导致评估的偏好严重失衡,这跟实际上普通大众的口味并不一致,因此会影响评估的准确性。 60 | - **容易被操控**:如果你请的众包标注员没经过筛选,第三方机构很容易通过操控他们来导致模型的评分异常 (如偏高),尤其是当模型的写作风格比较独特的时候。 61 | -------------------------------------------------------------------------------- /translations/zh/contents/human-evaluation/tips-and-tricks.md: -------------------------------------------------------------------------------- 1 | # 技巧与提示 2 | 建议阅读本文之前先阅读 "Using human annotators" 页面。本文将介绍使用人工标注构建评估数据集时的一些实用建议。 3 | 4 | ## 任务设计 5 | 6 | - **简单至上**:标注任务避免不必要的复杂。将标注员的认知负担降低到最低有助于确保他们保持专注,从而提高标注质量。 7 | 8 | - **检查信息**:标注任务避免引入不必要的信息。仅提供任务必需的信息即可,确保不对标注员产生额外偏见。 9 | 10 | - **内容简化**:事物的展示位置和方式差异都可能导致额外的工作量和认知负担,进而影响标注质量。例如文本和任务在同一个页面展示就能避免不必要的滚动操作,再或者多个串行任务结合时可以按顺序展示。请仔细思考你的标注工具中所有内容的展示方式,看看是否还有简化空间。 11 | 12 | - **测试设置**:任务设计以及标注指引完成之后,确保先在少量样本上自行测试通过,再邀请整个标注团队参与,并根据需要进行迭代。 13 | 14 | ## 标注过程 15 | 16 | - **独立标注**:为避免标注员的个人偏见在团队内传播而导致结果偏差,标注员在任务过程中应做到:不互相帮助、不借鉴答案。标注指引的对齐原则应贯穿任务始终,需使用独立数据集培训新标注员或者采用标注间一致性指标来保证整个标注团队的结果一致。 17 | 18 | - **版本一致**:如果标注文档需要重大更新 (例如,定义或指令更改、添加或删除标签),则要决定是否对已标注的数据进行迭代,最少也得对更改的数据集进行版本追踪,可以使用如 `guidelines-v1` 的元数据值。 19 | 20 | ## 混合人机标注 21 | 22 | 人工标注固然优势很大,但有时候标注团队会受到一些限制,如时间和资源。此时,可以部分利用模型来提高标注效率。 23 | 24 | - **模型辅助标注**:可以使用模型的预测或生成结果作为预标注,来避免标注团队从零开始。需要注意的是这可能会引入模型偏差,例如模型的准确率较低时反而会增加标注工作量。 25 | 26 | - **监督模型评估**:可以将模型评估 (参考 “Model as a judge” 页面) 和人工监督的方法论相结合来对结果进行验证或丢弃。需要注意引入的偏差 (参考 “人工评估的优劣势” 部分)。 27 | 28 | - **识别边缘案例**:为使任务更加高效,可以先用一组模型初步判断,待模型意见偏差过大或正反平局时再引入人工监督员。同样需要注意引入的偏差 (参考 “人工评估的优劣势” 部分)。 29 | 30 | ## 端到端教程 31 | 32 | 如果你想完整的构建自己的评估任务,可以参考 Argilla 出品的 [实用评估教程](https://github.com/argilla-io/argilla-cookbook/tree/main/domain-eval),文中详细介绍了使用 [Argilla](https://github.com/argilla-io/argilla/) 和 [distilabel](https://github.com/argilla-io/distilabel) 进行合成数据、人工评估等来构建特定领域的评估任务。构建完成后可以使用 [lighteval](https://github.com/huggingface/lighteval) 库进行评估。 -------------------------------------------------------------------------------- /translations/zh/contents/human-evaluation/using-human-annotators.md: -------------------------------------------------------------------------------- 1 | # 人工标注员 2 | 3 | 推荐阅读 [这篇综述](https://aclanthology.org/2024.cl-3.1/) 
的第三章,介绍了许多数据标注质量管理的实践经验。如果你追求的是生产级的质量,并且具备实施条件,那么请继续阅读吧! 4 | 5 | ![Best_annotation_practices](https://github.com/huggingface/evaluation-guidebook/blob/main/assets/best_annotation_practices.png?raw=true) 6 | 7 | 无论项目规模多大,一旦定义了具体的评估任务和打分细则,请注意: 8 | 9 | - **选择合适的标注员,如果可能的话提供经济激励** 10 | 你可能希望参与任务的标注员具有以下品质: 11 | 1) 符合特定的人口统计特征。 12 | 例如:母语是测试目标语言、较高的教育水平、特定领域的专业知识、多样化的地域背景等。 13 | 根据评估任务不同,对标注员统计特征需求也不一样。 14 | 1) 提供高质量标注。 15 | 有些任务中筛选合适的标注员很重要,比如近期有一种任务是检查回答是否是 LLM 生成的。 16 | *个人认为,除非你众包标注员有强烈的自我驱动意识,否则一般还是支付合理的费用更好。* 17 | 18 | - **设计标注准则** 19 | 请务必深入思考制定标注准则,非常值得花费大量时间去做!我们在制作 [GAIA](https://huggingface.co/gaia-benchmark) 数据集时的耗时最多的地方就是这里。 20 | 21 | - **迭代标注** 22 | 很多时候标注员会误解标注指南 (他们的想法可能比你想象的更模棱两可),所以要做好多轮迭代标注的准备,来不断改进直到达到你的需求。 23 | 24 | - **质量检查** 和 **手动筛选** 25 | 你需要仔细检查答案的质量 (检查标注员间的答案一致性),并筛选出质量最优、相关性最高的答案。 26 | 27 | 你也可以使用专用工具来构建高质量标注数据集,如 [Argilla](https://argilla.io/)。 28 | ### 深入阅读推荐链接: 29 | - ⭐ [五分钟构建自己的标注平台](https://huggingface.co/learn/cookbook/enterprise_cookbook_argilla),Moritz Laurer 出品的数据标注教程。这篇文章介绍了使用开源工具 (如 Argilla 和 Hugging Face) 的实际经验,可以帮助更好的理解大规模人工标注的注意事项。 30 | - ⭐ [标注实践指南](https://aclanthology.org/2024.cl-3.1/)。这是一篇 2023 年所有关于人工标注论文的综述,内容完整,干货满满,但很容易理解。 31 | - [ScaleAI 出品的另一篇标注实践指南](https://scale.com/guides/data-labeling-annotation-guide),专注于人工评估。它是对上述文档的更轻量级补充。 32 | - [关于减少人工标注分歧的假设与挑战](https://aclanthology.org/2024.naacl-long.126/),论文探讨了标注员间分歧来源的原因,以及在实践中的缓解方法。 33 | -------------------------------------------------------------------------------- /translations/zh/contents/model-as-a-judge/basics.md: -------------------------------------------------------------------------------- 1 | # 基础概念 2 | 3 | ## 什么是评估模型? 4 | 评估模型 (Judge models) 是一种 **用于评估其他神经网络的神经网络**。大多数情况下它们用来评估生成文本的质量。 5 | 6 | 评估模型涵盖的范围很广,从小型的特定分类器 (例如 “垃圾邮件分类器”) 到大型的 LLM,或大而广、或小而专。使用 LLM 作为评估模型时,需要提供一个 prompt 来解释对模型评分的细则 (例如:`请对语句流畅度从 0 到 5 评分,0 分表示完全不可理解,…`)。 7 | 8 | 使用模型作为评估工具可以对文本中复杂和细微的特性有效的评估。 9 | 例如精确匹配预测文本和参考文本的任务,只能评估模型预测正确事实或数字的能力。但要评估更开放性的经验能力 (如文本流畅水平、诗词文学质量或输入忠实程度) 则需要更复杂的评价工具。 10 | 11 | 这就是评估模型最初的切入点。 12 | 13 | 它们通常用于三大任务。 14 | - *为生成文本打分*:使用预先定义的评分标准与范围来评估文本的某些属性 (如流畅度、有害性、一致性、说服力等)。 15 | - *成对比较*:对比模型的两个输出,以选出在给定属性上表现更好的文本。 16 | - *计算文本相似度*:用于评估参考文本和模型输出的匹配程度。 17 | 18 | *注:本文目前主要关注 LLM + prompt 的评估方法。不过建议你还是了解一下简单分类器评估模型的工作原理,因为这种方法在许多测试用例中都具有稳定的表现。最近也出现了一些新的有前景的方法,例如奖励模型作为评估模型 (在 [这篇报告](https://research.nvidia.com/publication/2024-06_nemotron-4-340b) 中提出,本指南中也简单写了一篇 [文章](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/model-as-a-judge/what-about-reward-models.md) 介绍奖励模型)。* 19 | 20 | ## LLM 评估模型的优劣势: 21 | 优势: 22 | - **客观性**:与人类相比,LLM 评估模型在自动化地做出经验性判断时更加客观。 23 | - **规模化和可复现**:LLM 评估模型可以在非常大规模数据上做评估,并且评估结果可以复现。 24 | - **成本较低**:与支付人工标注员报酬相比,由于无需训练新模型,只要使用现有的高质量 LLM 和 prompt 就可以进行评价任务,因此评估模型成本较低。 25 | - **与人类判断对齐**:LLM 评估结果在一定程度上与人类的判断具有相关性。 26 | 27 | 劣势: 28 | - LLM 评估模型看似客观,实际上具有更难被检测到的 **隐藏偏差**,这是因为我们无法主动地发掘这些偏差 (参考 [model-as-a-judge/Tips and tricks] 章节)。此外,缓解人类偏差可以通过设计一些内容具体或统计稳健的调查问卷的方式 (这在社会学领域已有近百年的研究),而缓解 LLM 偏差的方式就没那么成熟了。另外,使用 LLM 评估 LLM 可能会产生 “回音室效应”,即潜移默化地加强了模型的固有偏差。 29 | - LLM 评估模型虽然具有规模化优势,但同时也会生成大量的数据需要仔细检查。例如模型可以生成思维路径或数据推理,但产生的结果需要更多的分析。 30 | - LLM 评估模型在通常情况下便宜,但在某些具体任务中如需获取质量更高的评估结果而聘请专家级人工标注员,那么成本会相应增加。 31 | 32 | ## 如何开始? 33 | - 如果你想尝试设置自己 LLM 评估模型,推荐阅读由 Aymeric Roucher 撰写的 [LLM 评估模型指南](https://huggingface.co/learn/cookbook/en/llm_judge) (⭐)! 
34 | 一些使用工具:[distilabel](https://distilabel.argilla.io/latest/) 代码库,它能够基于 LLM 生成和迭代数据集。[Ultrafeedback 论文](https://arxiv.org/abs/2310.01377) 中提到的方法以及相应的 [教程](https://distilabel.argilla.io/latest/sections/pipeline_samples/papers/ultrafeedback/)。[Arena Hard 基准实现教程](https://distilabel.argilla.io/latest/sections/pipeline_samples/examples/benchmarking_with_distilabel/)。 35 | -------------------------------------------------------------------------------- /translations/zh/contents/model-as-a-judge/designing-your-evaluation-prompt.md: -------------------------------------------------------------------------------- 1 | # 设计你自己的评估 prompt 2 | 3 | ## 通用 prompt 设计建议 4 | 我总结的互联网上通用 prompt 的通用设计原则如下: 5 | - 任务描述清晰: 6 | - `Your task is to do X (你的任务是 X)`. 7 | - `You will be provided with Y (你拿到的信息是 Y)`. 8 | - 评估标准精细,评分细则详尽 (如有必要): 9 | - `You should evaluate property Z on a scale of 1 - 5, where 1 means ... (根据属性 Z 的表现进行评分,评分范围为 1 - 5,其中 1 分表示 ...)` 10 | - `You should evaluate if property Z is present in the sample Y. Property Z is present if ... (请指出样本 Y 中是否具备属性 Z,如果具备,那么 ...)` 11 | - 加入一些 “推理” 评估步骤 12 | - `To judge this task, you must first make sure to read sample Y carefully to identify ..., then ... (评估此任务之前,请先仔细阅读样本 Y,识别出 ...,然后再 ...)` 13 | - 输出格式明确 (添加特定字段可以提升一致性) 14 | - `Your answer should be provided in JSON, with the following format {"Score": Your score, "Reasoning": The reasoning which led you to this score} (以 JSON 格式回答,格式为 {"Score": 评分, "Reasoning": 评分推理过程})` 15 | 16 | Prompt 书写灵感可以参考 [MixEval](https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/extended/mix_eval/judge_prompts.pyy) 或 [MTBench](https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/extended/mt_bench/judge_prompt_templates.py) 的 prompt 模板。 17 | 18 | 其他要点: 19 | - 成对比较比对输出评分[更能反映人类的偏好](https://arxiv.org/abs/2403.16950),且通常更稳健 20 | - 如果任务确实需要对输出评分为具体的值,建议使用整数,并详细解释 [每个分值的代表含义](https://x.com/seungonekim/status/1749289437165769177),或添加说明 prompt `如 provide 1 point for this characteristic of the answer, 1 additional point if ... (回答具备某项特性得 1 分,如果 ... 再加 1 分)` 等 21 | - 尽量每评估一项能力就使用专门评分 prompt,会得到更好而鲁棒的结果 22 | 23 | ## 提升评估准确性 24 | 可以通过以下方式或技术来提升评估准确性 (有可能会增加成本): 25 | - **Few-shot 示例**:提供少量示例可以帮助模型理解和推理,但也会增加上下文长度。 26 | - **引用参考**:提供参考内容可以提高模型输出的准确性。 27 | - ***思维链 (CoT)**:要求模型 **在评分之前** 给出推理过程,可以 [提高准确性](https://arxiv.org/abs/2212.08073) (参考这篇 [帖子](https://x.com/seungonekim/status/1749289437165769177))。 28 | - **多轮分析**:可以更好地 [检测事实性错误](https://arxiv.org/abs/2305.13281) 29 | - **陪审团机制**:汇总多个评价模型的结果 [比单一模型的结果更好](https://arxiv.org/abs/2404.18796)。 30 | - 使用多个小模型替代一个大模型可以大幅降低成本。 31 | - 也可以使用一个模型的多个温度参数来进行多次实验。 32 | - 社区意外发现,prompt 引入奖励机制 (`例如:回答正确将得到一只小猫`) 可以提高回答正确性。这个方法的效果视场景而异,你可以根据需求灵活调整。 33 | 34 | 注:如要减少模型偏见,可以参考社会学中的问卷设计,然后根据使用场景来书写 prompt。如想使用模型来替代人工评估,可以设计类似的评价指标:如计算标注员一致性,使用正确的问卷方法来减少偏见等。 35 | 36 | 不过在实际应用中,大多数人并不需要完全可复现且高质量无偏的评估,快速且略显粗糙的 prompt 就能满足需求。(只要知悉使用后果,这种情况也是能接受的)。 37 | -------------------------------------------------------------------------------- /translations/zh/contents/model-as-a-judge/evaluating-your-evaluator.md: -------------------------------------------------------------------------------- 1 | # 评估你的评估结果 2 | 3 | 在生产中或大规模使用 LLM 评估模型之前,你需要先评估它在目标任务的表现效果如何,确保它的评分跟期望的任务表现一致。 4 | 5 | 注:*如果评估模型的输出结果是二元分类,那么评估会相对简单,因为可使用的解释性分类指标有很多 (如准确率、召回率和精确率)。但如果输出是在某个范围内的分数,评估起来就会困难一些,因为模型输出和参考答案的相关性指标很难与分数映射的非常准确。* 6 | 7 | 在选定 LLM 评估模型以及设计 prompt 之后,还需要: 8 | 9 | ## 1. 
选择基线 10 | 你需要将选定模型的评估结果与基线对比。基线可以是很多种类型,如:人工标注结果、标准答案、其他表现良好评估模型的结果、其他 prompt 对应模型的输出,等等。 11 | 12 | 测试用例的数量不需要非常多 (50 个足以),但必须极具代表性 (例如边缘用例)、区分性、并且质量足够高。 13 | 14 | ## 2. 选择评估指标 15 | 评估指标是用来比较评估结果和参考标准之间的差距的。 16 | 17 | 通常来说,如果比较对象是模型的二元分类或成对比较属性,评估指标计算起来就非常容易,因为一般使用召回率 (二元分类)、准确率 (成对比较)、和精确率作为评估指标,这些指标容易理解、且具有可解释性。 18 | 19 | 如果比较对象是模型得分与人类评分,则计算指标就会困难一些。如要深入理解可以阅读 [这篇博客](https://eugeneyan.com/writing/llm-evaluators/#key-considerations-before-adopting-an-llm-evaluator)。 20 | 21 | 总的来说,如果你不清楚如何选择合适的评估指标或者评估模型,可以参考 [这篇博客](https://eugeneyan.com/writing/llm-evaluators/) 中的 [图表](https://eugeneyan.com/assets/llm-eval-tree.jpg) ⭐。 22 | 23 | ## 3. 评估你的评估结果 24 | 这一步你只需用评估模型和测试 prompt 来评估在样本上的表现,拿到评估结果之后使用上一步选定的评估指标计算分数即可。 25 | 26 | 你需要确定一个阈值来决定结果归属,阈值大小取决于你的任务难度。例如成对比较任务的准确率指标可以设为 80% 到 95%,再比如评分排名任务的相关性指标,文献中经常使用 0.8 的皮尔逊相关系数,不过也有一些论文认为 0.3 足以表明与人工评估的相关性良好。所以标准不是死的,根据任务灵活调整吧 (^^") ! 27 | 28 | 29 | 30 | -------------------------------------------------------------------------------- /translations/zh/contents/model-as-a-judge/getting-a-judge-llm.md: -------------------------------------------------------------------------------- 1 | # 选择 LLM 评估模型 2 | 3 | 使用现有的 LLM 评估模型时,你可以选择:[通用性强、能力高的大模型](https://arxiv.org/abs/2306.05685v4)、[专业性强、特定数据偏好的小模型](https://arxiv.org/abs/2405.01535)、或自行训练模型。 4 | 5 | ## 使用大型专家 LLM 6 | 7 | 随着更强大的 LLMs (如 ChatGPT) 的不断推出,研究者们开始探索使用 LLM 作为评估模型。目前在评估任务上表现最好的仍然是闭源模型 (如 Claude 或 gpt-o),不过得益于高质量开源模型 (如 [Qwen 2.5](https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e),[Command R+](https://huggingface.co/CohereForAI/c4ai-command-r-plus-08-2024),以及 [Llama 3.1-405-Instruct](meta-llama/Llama-3.1-405B-Instruct)) 快速发展,开源与闭源模型之间的差距正在迅速缩小。 8 | 9 | 尽管闭源模型效果很好,也存在一些局限性: 10 | - 只能使用 API 方式调用。这意味着 API 背后的模型以及输出结果可能随着版本更替 (有时候甚至不发公告说明) 而改变,进而影响评估结果的可重复性。 11 | - 黑箱性质。可解释性较差。 12 | - 数据隐私泄露风险。使用 API 时需要将数据发送给第三方 (比本地数据管理安全性差得多),你无法确定数据会被怎样处理 (一般需要筛选一下训练集中的数据)。 13 | 14 | 闭源模型的优点也很明显,无需本地配置或硬件支持,任何人都可以轻松地访问高质量模型。不过这些优点开源模型也具备,大多数高质量开源模型提供商本身就提供了 API 访问,同时还缓解了上述两个局限性。 15 | 16 | 如果你需要选择模型提供商,可以参考 [这篇博客](https://huggingface.co/spaces/ArtificialAnalysis/LLM-Performance-Leaderboard),了解一下成本分析的内容。 17 | 18 | ## 使用小型专家 LLM 19 | 20 | 你也可以选择使用小型专业 LLM 评估模型,它们的参数量通常只有几 B,可以在大多数现代消费级硬件上部署和运行,甚至可以自行从头训练或指令微调。直接使用也很简单,只需要遵循模型特定的 prompt 格式就行。 21 | 22 | 可选的模型有: 23 | - Flow-Judge-v0.1 ([权重链接](https://huggingface.co/collections/flowaicom/flow-judge-v01-66e6af5fc3b3a128bde07dec)):3.8B 参数,基于 Phi-3.5-mini-instruct,在合成偏好数据集上微调 24 | - Prometheus ([权重链接](https://huggingface.co/prometheus-eval/prometheus-13b-v1.0),[论文链接](https://arxiv.org/abs/2310.08491)):13B 参数,在合成偏好数据集上从头训练。还有一个 7B [v2](https://huggingface.co/prometheus-eval/prometheus-7b-v2.0) 版本:基于 Mistral-7B-Instruct-v0.2,在更大的合成偏好数据集上微调,且加入了权重融合 25 | - JudgeLM ([论文链接](https://arxiv.org/abs/2310.17631)):7B 到 33B 参数,在多个模型生成的合成偏好数据集上从头训练。 26 | 27 | ## 自行训练 LLM 28 | 你也可以选择训练或微调自己的 LLM 评估模型。 29 | 30 | 首先要为任务目标收集偏好数据。可选的数据有: 31 | - 现有的 [人类偏好数据集](https://www.kaggle.com/competitions/lmsys-chatbot-arena) 32 | - 模型生成的偏好数据 (可以参考上述小型评估模型论文中的数据部分,也可以直接点击链接获取,如 Prometheus [偏好数据集](https://huggingface.co/datasets/prometheus-eval/Preference-Collection) 和 [反馈数据集](https://huggingface.co/datasets/prometheus-eval/Feedback-Collection))。 33 | 34 | 然后需要决定模型训练方式,是从头训练还是微调现有模型。步骤包括: 35 | - 蒸馏大模型为小模型 36 | - 量化 37 | - 使用上一步的数据微调 (如果模型较大且计算资源有限,可以使用 peft 或 adapter 权重的方式) 38 | - 有帖子指出 [从奖励模型开始微调比指令模型的效果更好](https://x.com/dk21/status/1826292289930674590) 39 | -------------------------------------------------------------------------------- 
/translations/zh/contents/model-as-a-judge/tips-and-tricks.md: -------------------------------------------------------------------------------- 1 | # 技巧与提示 2 | 3 | ## LLM 评估模型已知偏差及缓解措施: 4 | 5 | - **缺乏内部一致性**:同一 prompt 输入评估模型执行多次得到的结果可能不一样 (如果温度参数不设为 0)。 6 | - 缓解措施:遵循 “自我一致性 (self-consistency)” 设置 prompt,输入模型执行多次并保留多数结果 7 | - **自我偏好**:LLM 评估模型更 [偏好自己的输出模式](https://arxiv.org/abs/2404.13076),因此会对模式相似的结果评分偏高。 8 | - 缓解措施:采用陪审团机制 9 | - **输入扰动不敏感**:评估模型对 [扰动输入](https://arxiv.org/abs/2406.13439) 的辨识效果较差,[难以提供一致的评分范围](https://twitter.com/aparnadhinak/status/1748368364395721128) (更多实验结果可以参考 [这个链接](https://github.com/LeonEricsson/llmjudge/blob/main/README.md))。例如对于施加了相同程度噪声的文本,使用评估模型评估文本质量的评分无法反映噪声的程度。 10 | - 缓解措施: 11 | - 要求模型先输出详细的推理过程 [再输出评分](https://twitter.com/seungonekim/status/1749289437165769177) 12 | - 在 prompt 中添加一致的评分标准 13 | - **位置偏差**:评估模型更 [偏好特定位置的答案](https://arxiv.org/abs/2306.05685)。例如在成对比较时,Claude 和 GPT3.5 在多次测试中通常会偏好某一个位置,例如第一个或第二个答案。 14 | - 缓解措施: 15 | - 随机调整答案位置 16 | - 计算所有选项的对数概率并归一化 17 | - **冗长偏好 (长度偏差)**:评估模型更偏好冗长的答案。 18 | - 缓解措施:[考虑答案中的长度差异](https://arxiv.org/abs/2404.04475) 19 | - **[难以对齐人类答案](https://arxiv.org/abs/2308.15812)**: 20 | - 在所有评估中,[人工评估是否可以作为一个不错的基线尚有争议](https://arxiv.org/abs/2202.06935)。例如在某些特定领域 (如医学、法律、数学等),如果标注员专业性不够,那么得到的结果可能跟直接采用 LLM 一样差。 21 | - **格式偏差**:如果输入模型的 prompt 格式与其训练数据的格式 [相差甚远](https://arxiv.org/pdf/2310.17631),可能导致模型的评估结果不准确。例如,成对比较模型的训练集数据格式中提供了参考答案,如果在评估时没有给定参考答案或者给定的参考答案格式有误,那么评估结果就不可信。 22 | - 缓解措施:仔细遵循评估模型训练集 prompt 格式 (比如指令微调模型的格式)。 23 | 24 | ## 选择合适的 LLM 评估任务 25 | 26 | LLM 评估特性: 27 | - **很难识别幻觉**:尤其是部分幻觉 (与事实非常相近,仅有微小的区别而导致错误)。(可以参考这两篇论文:[链接 1](https://arxiv.org/abs/2305.11747) 和 [链接 2](https://arxiv.org/abs/2303.08896))。 28 | - **许多任务上与人工评估一致性不高**:如 [总结任务](https://arxiv.org/abs/2304.02554) (也可以参考 [这篇](https://arxiv.org/abs/2303.16634))、[输入遵循忠实度](https://arxiv.org/abs/2307.16877),更多任务请参考 [这篇论文](https://arxiv.org/abs/2406.18403)。 29 | -------------------------------------------------------------------------------- /translations/zh/contents/model-as-a-judge/what-about-reward-models.md: -------------------------------------------------------------------------------- 1 | # 奖励模型相关内容 2 | 3 | ## 什么是奖励模型? 
4 | 5 | 奖励模型通过学习人工标注的成对 prompt 数据来预测分数,优化目标是对齐人类偏好。 6 | 训练完成后,奖励模型可以作为人工评估代理的奖励函数,用来改进其他模型。 7 | 8 | ### 成对比较评分 9 | 10 | 最常见的奖励模型类型是 Bradley-Terry 模型,它的输出是一个分值,遵循以下公式: 11 | 12 | $$p(\text{答案 b 优于答案 a}) = \text{sigmoid}(\text{score}_b - \text{score}_a)$$ 13 | 14 | 奖励模型的训练数据只需要成对比较的答案,这比收集分数数据更容易。因此训练好的模型只能比较同一个 prompt 下的多个答案孰优孰劣,无法跨 prompt 比较。 15 | 16 | 其他模型在此方法的基础上进行了扩展,可以预测一个回答优于另一个的概率值 (例如 [基于 LLaMA3 的奖励模型](https://huggingface.co/RLHFlow/pair-preference-model-LLaMA3-8B))。 17 | 18 | 这样模型就能 (理论上) 以数值来判断多个回答之间的细微差别,不过只能针对同一 prompt 对应的回答进行对比,跨 prompt 的回答概率值就没有对比意义了。另外当回答较长时,可能会受到上下文长度和内存限制的影响。 19 | 20 | ### 绝对分数 21 | 22 | 还有一些奖励模型 (如 [SteerLM](https://arxiv.org/abs/2311.09528)) 的输出是绝对分数。这类模型使用起来更加方便,可以直接对回答评估分数,而无需构造成对。但是数据收集就比较困难了,因为在衡量人类偏好时,绝对分数就显得相对不那么稳定。 23 | 24 | 最近有人提出了更强力的模型,可以同时输出绝对分数和相对分数。如 [HelpSteer2-Preference](https://arxiv.org/abs/2410.01257) 和 [ArmoRM](https://arxiv.org/abs/2406.12845)。 25 | 26 | 27 | ## 奖励模型用于评估的方法 28 | 29 | 给定一个 prompts 数据集,输入 LLM 生成回答,并请求奖励模型对回答评分。 30 | 31 | 如果使用的奖励模型输出是绝对分数,可以对所有回答的分数求平均来获取最终得分。 32 | 33 | 其实更常用的奖励模型输出是相对分数,对其求平均可能会受到异常值的影响 (某些非常好或非常差的回答),因为不同 prompt 的评估分数可能具有不同的尺度 (某些 prompt 会比其他的更简单或困难)。 34 | 35 | 总上,我们可以使用: 36 | - 胜率 (win rate):取一组参考回答,计算模型回答优于参考回答的百分比,这种结果会更加精细。 37 | - 胜算概率 (win probabilities):取一组参考回答,计算模型回答优于参考回答的平均概率,这种结果能够提供更细致和平滑的信号。 38 | 39 | ## 奖励模型的优劣势 40 | 41 | 优势: 42 | - **非常迅速**:奖励模型只需要得到一个分数 (与 LLM 评估模型需要得到长文本不同),因此其推理速度和小模型速度相当。 43 | - **具有确定性**:前向过程相同,最终得分也会保持一致。 44 | - **对位置偏差不敏感**:大多数奖励模型一次只处理一个回答,所以很少受到顺序的影响。即使对于需要处理成对回答的奖励模型,只要它在训练时使用的数据在顺序上是均衡的,受位置偏差的影响很非常小。 45 | - **无需 prompt 工程**:很明显,奖励模型的任务目标只有一个,就是通过训练偏好数据来对一个或两个回答输出分数。 46 | 47 | 劣势: 48 | - **需要特定微调**:即便微调后继承了基础模型的许多能力,不过在超出训练集范围的任务上可能还是表现不佳,另外这一步的成本会相对偏贵。 49 | - **在强化学习和任务评估领域 (或使用直接对齐算法处理与奖励模型训练集相似的数据集时) 效率较低**:语言模型会过拟合奖励模型的偏好数据。 50 | 51 | ## 使用奖励模型进行评估的技巧与提示 52 | 53 | - [RewardBench Leaderboard](https://huggingface.co/spaces/allenai/reward-bench):奖励模型排行榜,可以找到很多高性能模型。 54 | - [Nemotron](https://arxiv.org/abs/2406.11704) 论文中介绍了奖励模型的使用经验。 55 | - 对于那些仅评分单个 prompt 与回答的奖励模型,可以缓存多个模型结果,当测试新模型的表现时就能够很快得到结论。 56 | - [这篇论文](https://arxiv.org/abs/2410.11677v1) 对训练过程中的胜率或胜率概率进行了追踪整理,可以帮助检测模型退化以及选择最佳权重。 57 | -------------------------------------------------------------------------------- /translations/zh/contents/troubleshooting/troubleshooting-inference.md: -------------------------------------------------------------------------------- 1 | # Troubleshooting inference 2 | 3 | ## 模型运行非常慢怎么办? 4 | ### 调整 batch size 5 | 如果你想要评估结果完全可复现 (在特定的输入 prompt 和硬件条件下),你可以把 batch size 可以设为 1。但如果增大 batch size (硬件条件允许的话) 将会加快推理速度。 6 | 7 | ### 数据并行 8 | 你可以将模型加载到多个 GPU 上,然后将数据集分为多个子集并分配给每个 GPU,最后汇总全部计算结果。 9 | 这意味着每个数据流是被并行同时处理的,从而将总执行时间缩短至 GPU 数分之一。 10 | 尽量把 GPU 都放在一个节点上来避免跨节点传输瓶颈。 11 | 12 | ### 调整代码 13 | 不同的推理库由于代码优化的种种差异,推理速度不尽相同。你可能需要做一些对比试验来选出速度最快的库。如果模型层面你使用 pytorch 实现,建议可以参考这份 [推理优化清单](https://pytorch.org/serve/performance_checklist.html)。 14 | 15 | ### 调整精度 16 | 你可以通过调整计算精度来减小模型大小,进而加快推理速度。虽然 float32 精度 (每个数字使用 32 位存储) 的模型计算非常精确,但其消耗的内存和计算资源却非常大。降低精度为 `blfoat16` 或 `float16` (半精度) 可以加速一倍,而且不会影响计算结果。如果需要更进一步加速,可以尝试量化为更低精度,如 8 比特或 4 比特 (可以使用 `gptq` 或 `bitsandbytes` 库完成量化)。低位矩阵计算速度更快 (不过有些量化库反而有些慢,最好在你自己的模型上测试一下),模型占用内存也更少。 17 | 18 | ## 内存占用非常大怎么办? 
## Pros and cons of reward models

Pros:
- **Very fast**: a reward model only has to produce a score (unlike an LLM judge, which generates a long text), so inference is as fast as a small model's.
- **Deterministic**: the same forward pass always yields the same score.
- **Barely affected by position bias**: most reward models score one answer at a time, so ordering plays no role; even pairwise reward models are barely affected, as long as their training data was balanced with respect to answer order.
- **No prompt engineering needed**: the model has a single, well-defined job, namely scoring one or two answers based on the preference data it was trained on.

Cons:
- **Require task-specific fine-tuning**: even though they inherit many capabilities from their base model, they can underperform on tasks outside their training distribution, and the fine-tuning step is relatively expensive.
- **Less useful for evaluating models trained with reinforcement learning against them** (or with direct alignment algorithms on data similar to the reward model's training set): the language model tends to overfit the reward model's preference data.

## Tips and tricks for evaluating with a reward model

- The [RewardBench Leaderboard](https://huggingface.co/spaces/allenai/reward-bench) lists many high-performing reward models.
- The [Nemotron](https://arxiv.org/abs/2406.11704) paper shares practical experience with reward models.
- For reward models that score a single prompt/answer pair, you can cache the scores of existing models and get results for a new model very quickly.
- [This paper](https://arxiv.org/abs/2410.11677v1) tracks win rates or win probabilities during training, which helps detect model degradation and select the best checkpoint.
--------------------------------------------------------------------------------
/translations/zh/contents/troubleshooting/troubleshooting-inference.md:
--------------------------------------------------------------------------------
# Troubleshooting inference

## What can I do if my model is very slow?
### Adjust the batch size
If you want fully reproducible results (for a given prompt and hardware), you can set the batch size to 1, but increasing it (when your hardware allows) will speed up inference.

### Data parallelism
You can load a copy of the model on several GPUs, split the dataset into subsets, assign one subset to each GPU, and gather all the results at the end.
Each data stream is processed in parallel, which divides the total runtime by the number of GPUs.
Try to keep all GPUs on the same node to avoid inter-node transfer bottlenecks.

### Adjust the code
Different inference libraries come with different optimizations, so inference speed varies between them; you may need to benchmark a few to find the fastest one for your setup. If your model is implemented in PyTorch, this [inference optimization checklist](https://pytorch.org/serve/performance_checklist.html) is a good reference.

### Adjust the precision
You can shrink the model, and therefore speed up inference, by lowering the computation precision. float32 (32 bits per number) is very precise but costly in memory and compute. Dropping to `bfloat16` or `float16` (half precision) roughly doubles the speed with little effect on the results. If you need more, you can quantize to even lower precision, such as 8-bit or 4-bit (with libraries like `gptq` or `bitsandbytes`). Lower-precision matrix operations are faster (although some quantization libraries can actually be slower, so test on your own model first), and the model takes up less memory.

## What can I do if memory usage is too high?
### Estimate your memory needs
You can estimate the minimum theoretical memory needed to load a model (on given hardware) with the following **formula**:

`<memory (GB)> = <number of parameters (in billions)> * <precision factor>`

The total memory is the number of parameters multiplied by the number of bytes per parameter. One byte is 8 bits, and the precision factor is 4 for `float32`, 2 for `float16` or `bfloat16`, 1 for `8bit`, and 0.5 for `4bit`.

That is the basic estimate.

In practice, I recommend computing `<memory (GB)> = <number of parameters (in billions)> * <precision factor> * 110%`, since inference also needs memory for the data, not just the weights (for example, a 7B model in `bfloat16` needs roughly 7 * 2 * 1.1 ≈ 15 GB).

### What if the model does not fit on a single GPU?
#### Quantization
The first obvious option is to lower the `<precision factor>`: going from float32 to 4-bit cuts memory usage by a factor of 8!
Very low precision can degrade results, though. For some models (especially mid-sized ones), float16 or 8-bit is good enough (low precision tends to hurt very large models less, possibly because of information redundancy).
#### Model parallelism
Model parallelism covers a family of techniques that split a large model into smaller sub-models and place them on different GPUs. The whole model is never loaded at once, which reduces memory requirements, but inference can be slower.

There are two main approaches:
- Pipeline parallelism splits the model at the layer level: different layers are placed on different GPUs. Because the forward pass is sequential (the output of layer 1 is the input of layer 2), the GPU holding layer 2 has to wait for layer 1 to finish (this idle time is called a "bubble"), and data and intermediate results have to be transferred between GPUs, which slows execution down. Bubbles can be reduced by splitting the input into smaller micro-batches; PyTorch's native [`PiPPy`](https://github.com/pytorch/PiPPy) library supports this, and it is what `accelerate` uses under the hood for its parallelism.
- Tensor parallelism splits the model at the matrix-computation level: matrices are split by rows or columns, distributed across GPUs, and the partial results are combined. It is very efficient as long as all GPUs are on the same node (avoiding inter-node transfer bottlenecks), but it is harder to implement. Fortunately `vllm` already provides it, and the speedup is **very noticeable**.

For more parallelism methods (including data parallelism), see [this document](https://huggingface.co/docs/transformers/v4.15.0/en/parallelism).

#### CPU offloading
CPU offloading moves part of the model and of the computation from the GPU to the CPU to reduce GPU memory usage. It is **much slower** than the other approaches, mostly because data has to move back and forth between devices.

For example, DeepSpeed's [ZeRO-Offload](https://arxiv.org/abs/2101.06840) distributes parameters between CPU and GPU (with further optimizations detailed in the ZeRO-2 paper): gradients, optimizer states, and the fp32 model parameters used during optimization live on the CPU, while the fp16 parameters used in the forward and backward passes stay on the GPU, which leverages CPU memory, optimizes GPU compute, and limits communication between the two.

### The model fits on the GPU but I still get OOM errors?
You probably have a context-size problem.

We recommend:
1) Loading the model together with dummy data whose context length is large enough to be representative of your task, and checking whether the GPU runs out of memory.
2) Lowering the batch size, or disabling automatic batch-size search (when enabled, it sometimes triggers OOM errors).
3) More generally, feeding samples in decreasing order of context size, so that if the context is too large the run fails immediately, rather than running fine at first and crashing later.
--------------------------------------------------------------------------------
/translations/zh/contents/troubleshooting/troubleshooting-math-parsing.md:
--------------------------------------------------------------------------------
# Evaluating math capabilities with LaTeX

Parsing latex is hard. This comes up whenever you evaluate models whose output is $\LaTeX$, for example on Hugging Face's [MATH benchmark](https://huggingface.co/datasets/lighteval/MATH).

This benchmark uses $\LaTeX$ to represent mathematical operations and symbols, and the difficulty of the evaluation lies in parsing and comparing the model output with the gold answer.
It turns out there is no standard way to parse $\LaTeX$.


![](../../../../assets/sympy_doc.png)
*From the [`sympy`](https://github.com/sympy/sympy) documentation*

The lm-evaluation harness uses [`sympy`](https://github.com/sympy/sympy) (a Python library for symbolic mathematics) to parse and compare latex.
Parsing the ground truth with `sympy` (comparing the gold answers with themselves) only reaches about 0.94 accuracy.
How can that be? It turns out `sympy` cannot parse some (perfectly standard) $\LaTeX$ expressions.

For example:

```
couldn't parse one of [0,1) or [0,1), I expected one of these: ']'
[0,1)
~~^
```

```
couldn't parse one of (-\iny,-5]\cup[5,\iny) or (-\iny,-5]\cup[5,\iny), I expected something else here
(-\iny,-5]\cup[5,\iny)
~~~~~~^
```

```
couldn't parse one of -\frac{1}{{}2x} or -\frac{1}{{}2x}, I don't understand this
-\frac{1}{{}2x}
~~~~~~~~~~~^
```

### How can this be mitigated?

You can either rewrite the $\LaTeX$ [grammar](https://github.com/sympy/sympy/blob/master/sympy/parsing/latex/lark/grammar/latex.lark) to add the missing features, or add manual checks to the code, which raises model scores.
After falling into this trap ourselves, we decided to add a string-comparison check to the code, which was enough to mitigate the issue.

![Lm Eval harness fix](../../../../assets/lm_eval_diff.png)
*Fix in the LM evaluation harness*
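The actual patch is the diff shown above; purely as an illustration, here is a minimal sketch of the general idea (try a cheap normalized string comparison first, and only fall back to `sympy` when it fails). `parse_latex` is sympy's LaTeX parser (it needs the ANTLR runtime installed), and the normalization choices below are ours, not the harness's exact ones.

```python
import sympy
from sympy.parsing.latex import parse_latex  # requires antlr4-python3-runtime

def normalize(answer: str) -> str:
    """Light cleanup before string comparison."""
    return (
        answer.strip()
        .replace(" ", "")
        .replace("\\left", "")
        .replace("\\right", "")
        .rstrip(".")
    )

def is_equivalent(prediction: str, gold: str) -> bool:
    # 1) Cheap path: normalized string equality also catches answers,
    #    such as intervals, that sympy cannot parse at all.
    if normalize(prediction) == normalize(gold):
        return True
    # 2) Expensive path: symbolic comparison for algebraically equal forms.
    try:
        difference = parse_latex(prediction) - parse_latex(gold)
        return sympy.simplify(difference) == 0
    except Exception:  # parse errors and other parser failures
        return False
```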
### Results

The table below compares the top-25 models before and after the fix:

*MATH benchmark results per model, original vs. fixed parser*

| Model | Score (original) | Score (fixed parser) | Rank (original) | Rank (fixed parser) |
|---|---|---|---|---|
| rombodawg/Rombos-LLM-V2.5-Qwen-72b | 47.58 | 50.68 | 1 | 1 |
| MaziyarPanahi/calme-2.2-qwen2-72b | 41.16 | 43.43 | 2 | 2 |
| arcee-ai/Arcee-Nova | 40.48 | 42.90 | 3 | 3 |
| fblgit/TheBeagle-v2beta-32B-MGS | 39.43 | 42.52 | 4 | 4 |
| rombodawg/Rombos-LLM-V2.5-Qwen-32b | 39.12 | 41.99 | 5 | 5 |
| dnhkng/RYS-XLarge | 38.97 | 41.24 | 6 | 6 |
| dfurman/CalmeRys-78B-Orpo-v0.1 | 37.92 | 40.71 | 8 | 7 |
| MaziyarPanahi/calme-2.2-rys-78b | 37.92 | 39.95 | 8 | 9 |
| MaziyarPanahi/calme-2.4-rys-78b | 37.69 | 40.41 | 9 | 8 |
| MaziyarPanahi/calme-2.3-rys-78b | 36.56 | 38.97 | 10 | 10 |
| MaziyarPanahi/calme-2.1-rys-78b | 36.40 | 38.90 | 11 | 11 |
| Qwen/Qwen2.5-72B | 36.10 | 38.67 | 12 | 12 |
| MaziyarPanahi/calme-2.1-qwen2-72b | 36.03 | 38.07 | 13 | 15 |
| Qwen/Qwen2-Math-72B-Instruct | 35.95 | 38.14 | 14 | 14 |
| dfurman/Qwen2-72B-Orpo-v0.1 | 35.42 | 38.14 | 15 | 13 |
| abacusai/Smaug-Qwen2-72B-Instruct | 35.35 | 37.46 | 16 | 19 |
| anthracite-org/magnum-v1-72b | 35.27 | 37.69 | 18 | 16 |
| alpindale/magnum-72b-v1 | 35.27 | 37.69 | 18 | 16 |
| Qwen/Qwen2-72B-Instruct | 35.12 | 37.69 | 19 | 18 |
| dnhkng/RYS-XLarge-base | 34.67 | 37.16 | 20 | 20 |
| Undi95/MG-FinalMix-72B | 33.61 | 36.10 | 22 | 21 |
| abacusai/Dracarys-72B-Instruct | 33.61 | 35.65 | 22 | 22 |
| Qwen/Qwen2.5-32B | 32.85 | 35.50 | 23 | 23 |
| anthracite-org/magnum-v2-72b | 31.65 | 34.06 | 24 | 24 |
| dnhkng/RYS-Huge-bnb-4bit | 31.57 | 33.84 | 25 | 25 |
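A cheap way to catch parser gaps like these before re-running a whole leaderboard is the self-comparison test mentioned above: score every gold answer against itself and check that you get (close to) 100%. A minimal sketch, reusing the `is_equivalent` helper from the earlier sketch (both the helper and the `gold_answers` list are assumptions, standing in for your own checker and references):

```python
def parser_self_check(gold_answers, check) -> float:
    """Score every gold answer against itself; anything below 1.0 means
    the harness, not the model, is losing points."""
    passed = sum(check(gold, gold) for gold in gold_answers)
    return passed / len(gold_answers)

# e.g. parser_self_check(gold_answers, is_equivalent) should return ~1.0
# once a fallback like the one above is in place.
```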
--------------------------------------------------------------------------------
/translations/zh/contents/troubleshooting/troubleshooting-reproducibility.md:
--------------------------------------------------------------------------------
# Troubleshooting reproducibility

Say you read the technical report of a shiny new model and decide to reproduce its evaluation results on your own machine, only to find that you cannot. Why?
Let's go through the possible causes.

## Different codebases
To reproduce a paper's or report's evaluation scores down to the decimal point, you first need to make sure you are using the same evaluation code.

In general, that means either using the evaluation code released by the authors, or a standard implementation such as EleutherAI's `lm_eval` or Hugging Face's `lighteval`. If the authors do not say which evaluation code they used, you are unfortunately unlikely to reproduce their numbers exactly.

If you want to understand why different implementations give different results, have a look at [this blog post](https://huggingface.co/blog/open-llm-leaderboard-mmlu) (⭐) we wrote with the Hugging Face evaluation team. It studies three common MMLU implementations (`lm_eval`, `helm`, and the original authors' code) and explains the implementation differences and their impact on model scores.

*Note: this is precisely why the Hugging Face team launched the [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard): with one standardized setup, the scores of the models on the leaderboard can be compared fairly.*

### Other sources of subtle differences
Even with the same codebase, small implementation details can still change the results:
- **Different random seeds**
  - Inference is generally much less sensitive to seeds than training. Still, seeds affect some CUDA operations (see PyTorch's [reproducibility](https://pytorch.org/docs/stable/notes/randomness.html) page), which can change predictions, especially with non-greedy generation. With few-shot inference, seeds can also affect the prompt and the pre/post-processing functions.
  -> Sometimes these small differences add up to a gap of several points in the final score.
- **Different metrics.**
  Metrics can be different in practice even if they share the same name. Some examples:
  - For an `exact match` task, if the authors scored answers with *log-likelihoods* (comparing the log-probabilities of the candidate answers) while you scored them *generatively* (only comparing the greedy generation with the reference), you are not computing the same metric.
  - We have also seen evaluation codebases that call a metric `exact match` but actually compute a `prefix exact match` (only comparing the beginning of the generation with the reference), a `suffix exact match` (the opposite), or a `quasi exact match` (exact match after normalization).
  -> You therefore cannot rely on the metric name alone; you need to look at the actual implementation (see the sketch after this list).
- **Different normalizations**
  - Sticking with `exact match`: in v1 of the `lm_eval` harness, some tasks were simply labeled *generative* `exact match`, which, without reading the code, sounds like the direct comparison described above. However:
    the code actually normalized the prediction before comparing it (removing punctuation, homogenizing number formats, and so on), which obviously changes the score.
    (v2 of `lm_eval` now includes the normalization in most metric names.)
  -> This is where mistakes are easiest to make, especially for tasks that require heavy normalization or answer post-processing, such as math evaluations (where the answer has to be extracted from a generated explanation).
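To make this concrete, here is a minimal sketch of three metrics that could all legitimately be called "exact match"; the normalization shown is illustrative, not the exact code of any particular harness.

```python
import re
import string

def normalize_text(text: str) -> str:
    """The kind of normalization harnesses often apply silently:
    lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> bool:
    return pred == gold

def prefix_exact_match(pred: str, gold: str) -> bool:
    return pred.startswith(gold)

def quasi_exact_match(pred: str, gold: str) -> bool:
    return normalize_text(pred) == normalize_text(gold)

pred, gold = "The answer is 42.", "the answer is 42"
print(exact_match(pred, gold))         # False
print(prefix_exact_match(pred, gold))  # False
print(quasi_exact_match(pred, gold))   # True
```

Three metrics, one name, three different scores: this is why reading the scoring code matters as much as reading the reported numbers.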
## Different prompts
Prompt differences can change model outputs and scores dramatically. They fall into three categories:
### The prompt itself
The prompt format can significantly affect the final score.

For example, in multiple-choice evaluations, common ways to present the choices include:
```
Question: <question text>
Choices:
```
```markdown
| A. <choice A> | (A) <choice A> | <choice A> |
| B. <choice B> | (B) <choice B> | <choice B> |
| C. <choice C> | (C) <choice C> | <choice C> |
| D. <choice D> | (D) <choice D> | <choice D> |
```
```
Answer:
```
with the prediction expected as `A`/`B`/`C`/`D` or as `<the text of choice A/B/C/D>`.

These prompts are **semantically equivalent**: they contain exactly the same information, yet they still lead to different scores. We ran [some experiments](https://x.com/clefourrier/status/1777319187913875893/photo/1) (with differences of up to 7 points for the same model), and [this paper](https://arxiv.org/abs/2310.11324) reaches similar conclusions.

Some evaluations also prepend a task description (e.g. `The following questions are about <topic>`). Whether or not such a prefix is present also affects the score.

[This excellent paper](https://arxiv.org/abs/2407.07890) ⭐ explains the issue: many models are trained on evaluation benchmark datasets and overfit their prompt and gold-answer formats, which is why they adapt poorly to prompts in any other format.

We observed this on the Open LLM Leaderboard 2 with the Llama 3.1 series of models: they predict fully correct answers on the standard MATH-Hard set, yet score lower with our simple few-shot template, most likely because they overfit the prompt and answer format of GSM8K (another math benchmark).
### System prompts and chat templates
Chat models are usually trained with instruction or preference tuning, which teaches them to follow a template at inference time. In multi-turn conversations, for example, the template typically starts with a general prompt (also called the `system prompt`), introduced by a special prefix (often `System: `). The system prompt gives the model high-level instructions, such as a persona description or a general answer format. Each turn may also need a role prefix, such as `User` for questions and `Assistant` for answers.

For few-shot evaluation, you also have to decide whether the examples are provided in one go (in a single prompt) or across several turns (mimicking the user/assistant pattern above).

If you do not follow the template the model expects at inference time, you force its outputs away from the probability space it converged to during training, and performance drops.

### Few-shot prompts
When evaluating with few-shot examples (see `General knowledge/Model inference` if you are not familiar with few-shot), you need to pay attention to two things:

Obviously, **the number of few-shot examples must match the reference task**.

You also have to make sure the **few-shot examples are exactly the same** for every model you evaluate (yes, really: models respond differently to different examples, so this changes the results), and even that they are provided **in exactly the same order**. On MMLU subsets, we have observed score differences of up to 3 points for some models caused purely by a change in the order of the examples (see some results [here](https://huggingface.co/blog/evaluation-structured-outputs)).

The random seed matters here too.

## Different generation parameters
For generative evaluations, make sure you use:
- the same **end-of-sentence token**
- the same **number of generated tokens**
- the same **seed and temperature** for sampling

## Different model loading
Differences we have observed:
- **Different hardware**
  PyTorch does not guarantee the reproducibility of non-deterministic operations across hardware.
- **Different libraries**
  For example, inference backends such as `transformers` and `vllm` do not manage matrix computations in exactly the same way.
- **Different batch sizes**
  Many evaluation harnesses and model backends document this: changing the inference batch size changes the results. If you need exact reproducibility, you have to fix the batch size (which is not always possible because of memory constraints).
- **Different loading precision**
  Loading a model in lower precision (effectively a different version of the weights) reduces memory use and inference cost, but changes the computation results.
--------------------------------------------------------------------------------