├── .gitignore ├── docs └── source │ └── en │ ├── leaderboards │ ├── extras.md │ ├── finding_page.md │ ├── intro.md │ └── building_page.md │ ├── _toctree.yml │ ├── index.md │ └── open_llm_leaderboard │ ├── emissions.md │ ├── normalization.md │ ├── archive.md │ ├── faq.md │ └── about.md ├── README.md └── .github └── workflows ├── doc-pr-upload.yml ├── doc-pr-build.yml └── doc-build.yml /.gitignore: -------------------------------------------------------------------------------- 1 | # Directory for generated documentation build artifacts 2 | build_dir/ 3 | -------------------------------------------------------------------------------- /docs/source/en/leaderboards/extras.md: -------------------------------------------------------------------------------- 1 | # Building features around your leaderboard 2 | 3 | Several cool tools can be duplicated/extended for your leaderboard: 4 | - If you want your leaderboard to push model results to model cards, you can duplicate this [great space](https://huggingface.co/spaces/Weyaxi/leaderboard-results-to-modelcard) and update it for your own leaderboard. 5 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Setup 2 | 3 | ```bash 4 | pip install watchdog git+https://github.com/huggingface/doc-builder.git 5 | ``` 6 | 7 | # Build Documentation 8 | 9 | ```bash 10 | doc-builder build leaderboard docs/source/en --build_dir build_dir --not_python_module 11 | ``` 12 | 13 | # Preview Documentation 14 | 15 | ```bash 16 | doc-builder preview leaderboard docs/source/en/ --not_python_module 17 | ``` 18 | -------------------------------------------------------------------------------- /.github/workflows/doc-pr-upload.yml: -------------------------------------------------------------------------------- 1 | name: Upload PR Documentation 2 | 3 | on: 4 | workflow_run: 5 | workflows: ["Build PR Documentation"] 6 | types: 7 | - completed 8 | 9 | jobs: 10 | build: 11 | uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@main 12 | with: 13 | package_name: leaderboards 14 | secrets: 15 | hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }} 16 | comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }} 17 | -------------------------------------------------------------------------------- /.github/workflows/doc-pr-build.yml: -------------------------------------------------------------------------------- 1 | name: Build PR Documentation 2 | 3 | on: 4 | pull_request: 5 | paths: 6 | - 'docs/source/**' 7 | - 'assets/**' 8 | - '.github/workflows/doc-pr-build.yml' 9 | 10 | concurrency: 11 | group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }} 12 | cancel-in-progress: true 13 | 14 | jobs: 15 | build: 16 | uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main 17 | with: 18 | commit_sha: ${{ github.event.pull_request.head.sha }} 19 | pr_number: ${{ github.event.number }} 20 | package: leaderboards 21 | languages: en 22 | additional_args: --not_python_module -------------------------------------------------------------------------------- /.github/workflows/doc-build.yml: -------------------------------------------------------------------------------- 1 | name: Build documentation 2 | 3 | on: 4 | push: 5 | branches: 6 | - main 7 | - doc-builder* 8 | - v*-release 9 | - use_templates 10 | paths: 11 | - 'docs/source/**' 12 | - 'assets/**' 13 | - '.github/workflows/doc-build.yml' 14 | 15 | jobs: 16 | build: 17 | 
uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main 18 | with: 19 | commit_sha: ${{ github.sha }} 20 | package: leaderboards 21 | languages: en 22 | additional_args: --not_python_module 23 | secrets: 24 | token: ${{ secrets.HUGGINGFACE_PUSH }} 25 | hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }} 26 | -------------------------------------------------------------------------------- /docs/source/en/_toctree.yml: -------------------------------------------------------------------------------- 1 | - local: index 2 | title: 🤗 Leaderboards 3 | - title: "Leaderboards on the Hub" 4 | sections: 5 | - local: leaderboards/intro 6 | title: Introduction to leaderboards 7 | - local: leaderboards/finding_page 8 | title: Finding leaderboards 9 | - local: leaderboards/building_page 10 | title: Building your leaderboard 11 | - local: leaderboards/extras 12 | title: Extras 13 | - title: "Open LLM Leaderboard" 14 | sections: 15 | - local: open_llm_leaderboard/about 16 | title: About 17 | - local: open_llm_leaderboard/faq 18 | title: FAQ 19 | - local: open_llm_leaderboard/normalization 20 | title: Scores Normalization 21 | - local: open_llm_leaderboard/emissions 22 | title: CO2 calculation 23 | - local: open_llm_leaderboard/archive 24 | title: Archived versions 25 | -------------------------------------------------------------------------------- /docs/source/en/index.md: -------------------------------------------------------------------------------- 1 | # Leaderboards and Evaluations 2 | 3 | As the number of open and closed source machine learning models explodes, it can be very hard to find the correct model for your project. 4 | This is why we started our evaluations projects: 5 | - the `Open LLM Leaderboard` evaluates and ranks open source LLMs and chatbots, and provides reproducible scores separating marketing fluff from actual progress in the field. 6 | - `Leaderboards on the Hub` aims to gather machine learning leaderboards on the Hugging Face Hub and support evaluation creators. 7 | 8 | Explore machine learning rankings to find the best model for your use case, or build your own leaderboard, to test specific capabilities which interest you and the community! 9 | 10 |
- **Leaderboards on the Hub**: A small introduction to all things leaderboards on the Hub.
- **Open LLM Leaderboard**: Curious about the Open LLM Leaderboard? Start here!
22 | -------------------------------------------------------------------------------- /docs/source/en/open_llm_leaderboard/emissions.md: -------------------------------------------------------------------------------- 1 | # CO2 calculation 2 | 3 | ## Function for CO2 calculation 4 | 5 | To calculate `CO₂ Emissions for Evaluation (kg)` value, we use the following function. You can try to reproduce it yourself: 6 | 7 | ```python 8 | def calculate_co2_emissions(total_evaluation_time_seconds: float | None) -> float: 9 | if total_evaluation_time_seconds is None or total_evaluation_time_seconds <= 0: 10 | return -1 11 | 12 | # Power consumption for 8 H100 SXM GPUs in kilowatts (kW) 13 | power_consumption_kW = 5.6 14 | 15 | # Carbon intensity in grams CO₂ per kWh in Virginia 16 | carbon_intensity_g_per_kWh = 269.8 17 | 18 | # Convert evaluation time to hours 19 | total_evaluation_time_hours = total_evaluation_time_seconds / 3600 20 | 21 | # Calculate energy consumption in kWh 22 | energy_consumption_kWh = power_consumption_kW * total_evaluation_time_hours 23 | 24 | # Calculate CO₂ emissions in grams 25 | co2_emissions_g = energy_consumption_kWh * carbon_intensity_g_per_kWh 26 | 27 | # Convert grams to kilograms 28 | return co2_emissions_g / 1000 29 | ``` 30 | 31 | ## Explanation 32 | 33 | The `calculate_co2_emissions()` function estimates CO₂ emissions in kilograms for a given evaluation time in seconds, assuming the workload is running on 8 NVIDIA H100 SXM GPUs in Northern Virginia. 34 | 35 | Here’s how it works: 36 | 37 | 1. If `total_evaluation_time_seconds` is `None` or non-positive, the function returns `-1`, indicating invalid input. 38 | > Each result file have a `total_evaluation_time_seconds` field. 39 | 40 | 2. Assumes 8 NVIDIA H100 SXM GPUs with a combined power usage of 5.6 kilowatts (kW), based on each GPU’s maximum 0.7 kW consumption ([source](https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor-core-gpu-datasheet)). 41 | 42 | 3. Uses an average of 269.8 grams of CO₂ per kilowatt-hour (g CO₂/kWh) for electricity in Virginia, based on U.S. Energy Information Administration data ([source](https://www.eia.gov/electricity/state/virginia/)). 43 | 44 | 4. Converts the evaluation time from seconds to hours, then calculates total energy usage in kWh. 45 | 46 | 5. Calculates emissions in grams by multiplying energy use (kWh) by the carbon intensity. 47 | 48 | 6. Finally, divides the total grams by 1,000 to convert to kilograms. 49 | -------------------------------------------------------------------------------- /docs/source/en/leaderboards/finding_page.md: -------------------------------------------------------------------------------- 1 | # Finding the best leaderboard for your use case 2 | 3 | ## ✨ Featured leaderboards 4 | 5 | Since the end of 2023, we have worked with partners with strong evaluation knowledge, to highlight their work as a blog series, called [`Leaderboards on the Hub`](https://huggingface.co/blog?tag=leaderboard). 6 | 7 | Among these, here is a shortlist on some LLM-specific leaderboards you could take a look at! 
8 | - Code evaluation: 9 | - [BigCode's Models Leaderboard](https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard) 10 | - [BigCode's BigCodeBench](https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard) 11 | - [LiveCodeBench](https://huggingface.co/blog/leaderboard-livecodebench) 12 | - [Meta's CyberSecEval](https://huggingface.co/spaces/facebook/CyberSecEval) 13 | - Mathematics abiliites: 14 | - [NPHardEval](https://huggingface.co/spaces/NPHardEval/NPHardEval-leaderboard) 15 | - Safety: 16 | - [DecodingTrust's Leaderboard](https://huggingface.co/spaces/AI-Secure/llm-trustworthy-leaderboard) 17 | - [HaizeLab's Red Teaming Resistance Benchmark](https://huggingface.co/spaces/HaizeLabs/red-teaming-resistance-benchmark) 18 | - Performance: 19 | - [Optimum's LLM Performance Leaderboard](https://huggingface.co/spaces/optimum/llm-perf-leaderboard) 20 | - [Artificial Analysis LLM Performance Leaderboard](https://huggingface.co/spaces/ArtificialAnalysis/LLM-Performance-Leaderboard) 21 | 22 | 23 | 24 | 25 | This series is particularly interesting to understand the subtelties of evaluation across different modalities and topics, and we hope it will act as a knowledge base in the future. 26 | 27 | ## 🔍 Explore Spaces by yourself 28 | 29 | On the Hub, `leaderboards` and `arenas` are hosted as Spaces, like machine learning demos. 30 | 31 | You can either look for the keywords `leaderboard` or `arena` in the space title using the search bar [here](https://huggingface.co/spaces) (or [this link](https://huggingface.co/spaces?sort=trending&search=leaderboard)), in the full space using the "Full-text search", or look for spaces with correct metadata by looking for the `leaderboard` tags [here](https://huggingface.co/spaces?filter=leaderboard). 32 | 33 | We also try to maintain an [up-to-date collection](https://huggingface.co/collections/clefourrier/leaderboards-and-benchmarks-64f99d2e11e92ca5568a7cce) of leaderboards. If we missed your space, tag one of the members of the evaluation team in the space discussion! -------------------------------------------------------------------------------- /docs/source/en/leaderboards/intro.md: -------------------------------------------------------------------------------- 1 | # Introduction 2 | 3 | ## 🏅 What are leaderboards? 4 | 5 | `Leaderboards` are rankings of machine learning artefacts (most frequently generative models, but also embeddings, classifiers, ...) depending on their performance on given tasks across relevant modalities. 6 | 7 | They are commonly used to find the best model for a specific use case. 8 | 9 | For example, for Large Language Models, the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) allows you to find the best base pre-trained models in English, using a range of academic evaluations looking at language understanding, general knowledge, and math, and the [Chatbot Arena Leaderboard](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard) provides a ranking of the best chat models in English, thanks to user votes on chat capabilities. 10 | 11 | So far on the Hub, we have leaderboards for text, image, video and audio generations, including specialized leaderboard for at least 10 natural (human) languages, and a number of capabilities such as math or code. We also have leaderboards evaluating more general aspects like energy performance or model safety. 
12 | 13 | Some specific leaderboards reflect human performance obtained through a human-based voting system, where people compare models and vote for the better one on a given task. These spaces are called `arenas`. 14 | 15 | ## ⚖️ How to use leaderboards properly 16 | 17 | There are certain things to keep in mind when using a leaderboard. 18 | 19 | ### 1. Comparing apples to apples 20 | 21 | Much like in sports, where we have weight categories to keep rankings fair, when evaluating model artefacts, you want to compare similar items. 22 | 23 | For example, when comparing models, you want them to be 24 | - in the same weight class (number of parameters): bigger models often have better performance than smaller ones, but they usually cost more to run and train (in money, time, and energy) 25 | - at the same mathematical precision: the lower the precision of your model, the smaller and faster, but this can affect performance 26 | - in the same category: pre-trained models are good generalist bases, where fine-tuned models are more specialized and better performing on specific tasks, and merged models tend to have scores higher than their actual performance. 27 | 28 | ### 2. Comparing across a spectrum of tasks 29 | 30 | Though good generalist machine learning models are becoming increasingly common, it's not because an LLM is good at chess that it will output good poetry. If you want to select the correct model for your use case, you need to look at its scores and performance across a range of leaderboards and tasks, before testing it yourself to make sure it fits your needs. 31 | 32 | ### 3. Being careful about evaluation limitations, especially for models 33 | 34 | A number of evaluations are very easy to cheat, accidentally or not: if a model has already seen the data used for testing, its performance will be high "artificially", and reflect memorization rather than any actual capability on the task. This mechanism is called `contamination`. 35 | 36 | Evaluations of closed source models are not always still accurate some time later: as closed source models are behind APIs, it is not possible to know how the model changes and what is added or removed through time (contrary to open source models, where relevant information is available). As such, you should not assume that a static evaluation of a closed source model at time t will still be valid some time later. -------------------------------------------------------------------------------- /docs/source/en/leaderboards/building_page.md: -------------------------------------------------------------------------------- 1 | # Building a leaderboard using a template 2 | 3 | To build a leaderboard, the easiest is to look at our demo templates [here](https://huggingface.co/demo-leaderboard-backend) 4 | 5 | ## 📏 Contents 6 | 7 | Our demo leaderboard template contains 4 sections: two spaces and two datasets. 8 | 9 | - The `frontend space` displays the results to users, contains explanations about evaluations, and optionally can accept model submissions. 10 | - The `requests dataset` stores the submissions of users, and the status of model evaluations. It is updated by the frontend (at submission time) and the backend (at running time). 11 | - The `results dataset` stores the results of the evaluations. It is updated by the backend when evaluations are finished, and pulled by the frontend for display. 12 | - The `backend space` is optional, if you run evaluations manually or on your own cluster. 
It looks at currently pending submissions, and launches their evaluation using either the Eleuther AI Harness (`lm_eval`) or HuggingFace's `lighteval`, then updates the evaluation status and stores the results. It needs to be edited with your own evaluation suite to fit your own use cases if you use something more specific. 13 | 14 | ## 🪛 Getting started 15 | 16 | You should copy the two spaces and the two datasets to your org to get started with your own leaderboard! 17 | 18 | ### Setting up the frontend 19 | 20 | To get started on your own frontend leaderboard, you will need to edit 2 files: 21 | - src/envs.py to define your own environment variable (like the org name in which this has been copied) 22 | - src/about.py with the tasks and number of few-shots you want for your tasks 23 | 24 | ### Setting up fake results to initialize the leaderboard 25 | 26 | Once this is done, you need to edit the "fake results" file to fit the format of your tasks: in the sub dictionary `results`, replace task_name1 and metric_name by the correct values you defined in tasks above. 27 | ``` 28 | "results": { 29 | "task_name1": { 30 | "metric_name": 0 31 | } 32 | } 33 | ``` 34 | 35 | At this step, you should already have some results displayed in the frontend! 36 | 37 | Any more model you want to add will need to have a file in request and one in result, following the same template as already present files. 38 | 39 | ### Optional: Setting up the backend 40 | 41 | If you plan on running your evaluations on spaces, you then need to edit the backend to run the evaluations most relevant for you in the way you want. 42 | Depending on the suite you want to learn, this is the part which is likely to take the most time. 43 | 44 | However, this is optional if you only want to use the leaderboard to display results, or plan on running evaluations manually/on your own compute source. 45 | 46 | ## 🔧 Tips and tricks 47 | 48 | Leaderboards setup in the above fashion are adjustable, from providing fully automated evaluations (a user submits a model, it is evaluated, etc) to fully manual (every new evaluation is run with human control) to semi-automatic. 49 | 50 | When running the backend in Spaces, you can either : 51 | - upgrade your backend space to the compute power level you require, and run your evaluations locally (using `lm_eval`, `lighteval`, or your own evaluation suite); this is the most general solution across evaluation types, but it will limit you in terms of model size possible, as you might not be able to fit the biggest models in the backend 52 | - use a suite which does model inference using API calls, such as `lighteval` which uses `inference-endpoints` to automatically spin up models from the hub for evaluation, allowing you to scale the size of your compute to the current model. 53 | 54 | If you run evaluations on your own compute source, you can still grab some of the files from the backend to pull and push the `results` and `request` datasets. 55 | 56 | Once your leaderboard is setup, don't forget to set its metadata so it gets indexed by our Leaderboard Finder. You can find popular [Leaderboards on the Hub](https://huggingface.co/spaces/OpenEvals/find-a-leaderboard) as well as instructions on how to prepare your leaderboard for submission! 
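If you take the "own compute" route mentioned above, the sketch below shows one way to pull pending requests and push a result file with `huggingface_hub`. It is a minimal illustration, not the template's actual backend code: the repository ids, file paths, and result schema are placeholders to adapt to your own copies of the `requests` and `results` datasets.

```python
# Minimal sketch: pull pending requests and push a result file from your own compute.
# All repository ids, paths, and the result schema are placeholders -- adapt them to
# the datasets you copied from the demo template.
import json

from huggingface_hub import HfApi, snapshot_download

REQUESTS_REPO = "your-org/requests"  # placeholder: your copy of the requests dataset
RESULTS_REPO = "your-org/results"    # placeholder: your copy of the results dataset

api = HfApi()  # assumes you are logged in with a token that has write access

# 1. Pull the current request files to look for pending submissions
requests_path = snapshot_download(repo_id=REQUESTS_REPO, repo_type="dataset")

# 2. Run your evaluations here (lm_eval, lighteval, or your own suite)...

# 3. Push a result file following the same schema as the template's "fake results" file
result = {"results": {"task_name1": {"metric_name": 0.5}}}
with open("result.json", "w") as f:
    json.dump(result, f)

api.upload_file(
    path_or_fileobj="result.json",
    path_in_repo="your-org/your-model/result.json",  # placeholder path
    repo_id=RESULTS_REPO,
    repo_type="dataset",
)
```

The same pattern can be used in the other direction, for example to update the status stored in the `requests` dataset once an evaluation has finished.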
57 | -------------------------------------------------------------------------------- /docs/source/en/open_llm_leaderboard/normalization.md: -------------------------------------------------------------------------------- 1 | # Scores Normalization 2 | 3 | This page explains how scores are normalized on the Open LLM Leaderboard for the six presented benchmarks. We can categorize all tasks into those with subtasks, those without subtasks, and generative evaluation. 4 | 5 | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1-aPrFJjwdifhVLxzJcsYXeebqNi_5vaw?usp=sharing) 6 | 7 | **Note:** Click the button above to explore the scores normalization process in an interactive notebook (make a copy to edit). 8 | 9 | ## What is Normalization? 10 | Normalization is the process of adjusting values measured on different scales to a common scale, making it possible to compare scores across different tasks. For the Open LLM Leaderboard, we normalize scores to: 11 | 12 | 1. Account for the varying difficulty and random guess baselines of different tasks. 13 | 2. Provide a consistent scale (0-100) for all tasks, enabling fair comparisons. 14 | 3. Ensure that improvements over random guessing are appropriately reflected in the scores. 15 | 16 | 17 | ## General Normalization Process 18 | 19 | The basic normalization process involves two steps: 20 | 1. Subtracting the random baseline score (lower bound). 21 | 2. Scaling the result to a range of 0-100. 22 | 23 | We use the following normalization function: 24 | 25 | ```python 26 | def normalize_within_range(value, lower_bound, higher_bound): 27 | return (value - lower_bound) / (higher_bound - lower_bound) 28 | ``` 29 | 30 | ## Normalizing Tasks without Subtasks 31 | For tasks without subtasks (e.g., GPQA, MMLU-PRO), the normalization process is straightforward: 32 | - Determine the lower bound (random guess baseline). 33 | - Apply the normalization function. 34 | - Scale to a percentage. 35 | 36 | ### Example: Normalizing GPQA Scores 37 | GPQA has 4 `num_choices`, so the lower bound is 0.25 (1/`num_choices` = 1/4 = 0.25). 38 | 39 | ```python 40 | raw_score = 0.6 # Example raw score 41 | lower_bound = 0.25 42 | higher_bound = 1.0 43 | 44 | if raw_score < lower_bound: 45 | normalized_score = 0 46 | else: 47 | normalized_score = normalize_within_range(raw_score, lower_bound, higher_bound) * 100 48 | 49 | print(f"Normalized GPQA score: {normalized_score:.2f}") 50 | # Output: Normalized GPQA score: 46.67 51 | ``` 52 | 53 | ## Normalizing Tasks with Subtasks 54 | For tasks with subtasks (e.g., MUSR, BBH), we follow these steps: 55 | - Calculate the lower bound for each subtask. 56 | - Normalize each subtask score. 57 | - Average the normalized subtask scores. 
58 | 59 | ### Example: Normalizing MUSR Scores 60 | 61 | MUSR has three subtasks with different numbers of choices: 62 | - MUSR murder mysteries, num_choices = 2, lower_bound = 0.5 63 | - MUSR object placement, num_choices = 5, lower_bound = 0.2 64 | - MUSR team allocation, num_choices = 3, lower_bound = 0.33 65 | 66 | ```python 67 | subtasks = [ 68 | {"name": "murder_mysteries", "raw_score": 0.7, "lower_bound": 0.5}, 69 | {"name": "object_placement", "raw_score": 0.4, "lower_bound": 0.2}, 70 | {"name": "team_allocation", "raw_score": 0.6, "lower_bound": 0.333} 71 | ] 72 | 73 | normalized_scores = [] 74 | 75 | for subtask in subtasks: 76 | if subtask["raw_score"] < subtask["lower_bound"]: 77 | normalized_score = 0 78 | else: 79 | normalized_score = normalize_within_range( 80 | subtask["raw_score"], 81 | subtask["lower_bound"], 82 | 1.0 83 | ) * 100 84 | normalized_scores.append(normalized_score) 85 | print(f"{subtask['name']} normalized score: {normalized_score:.2f}") 86 | 87 | overall_normalized_score = sum(normalized_scores) / len(normalized_scores) 88 | print(f"Overall normalized MUSR score: {overall_normalized_score:.2f}") 89 | 90 | # Output: 91 | # murder_mysteries normalized score: 40.00 92 | # object_placement normalized score: 25.00 93 | # team_allocation normalized score: 40.00 94 | # Overall normalized MUSR score: 35.00 95 | ``` 96 | 97 | ## Generative Evaluations 98 | Generative evaluations like MATH and IFEval require a different approach: 99 | 1. **MATH:** Uses exact match accuracy. The lower bound is effectively 0, as random guessing is unlikely to produce a correct answer. 100 | 2. **IFEval:** 101 | - For instance-level evaluation (`ifeval_inst`), we use strict accuracy. 102 | - For prompt-level evaluation (`ifeval_prompt`), we also use strict accuracy. 103 | - The lower bound for both is 0, as random generation is unlikely to produce correct answers. 104 | 105 | This approach ensures that even for generative tasks, we can provide normalized scores that are comparable across different evaluations. 106 | 107 | ## Further Information 108 | For more detailed information and examples, please refer to our [blog post](https://huggingface.co/spaces/open-llm-leaderboard/blog) on scores normalization. 109 | 110 | If you have any questions or need clarification, please start a new discussion on [the Leaderboard page](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard/discussions). 111 | -------------------------------------------------------------------------------- /docs/source/en/open_llm_leaderboard/archive.md: -------------------------------------------------------------------------------- 1 | # Open LLM Leaderboard v1 2 | 3 | Evaluating and comparing LLMs is hard. Our RLHF team realized this a year ago when they wanted to reproduce and compare results from several published models. It was a nearly impossible task: scores in papers or marketing releases were given without any reproducible code, sometimes doubtful, but in most cases, just using optimized prompts or evaluation setup to give the best chances to the models. They therefore decided to create a place where reference models would be evaluated in the exact same setup (same questions, asked in the same order, etc.) to gather completely reproducible and comparable results; and that’s how the Open LLM Leaderboard was born! 
4 | 5 | Following a series of highly visible model releases, it became a widely used resource in the ML community and beyond, visited by more than 2 million unique people over the last 10 months. 6 | 7 | Around 300,000 community members used and collaborated on it monthly through submissions and discussions, usually to: 8 | 9 | - Find state-of-the-art open-source releases as the leaderboard provides reproducible scores separating marketing fluff from actual progress in the field. 10 | - Evaluate their work, be it pretraining or finetuning, comparing methods in the open and to the best existing models, and earning public recognition. 11 | 12 | In June 2024, we archived it, and it was replaced by a newer version, but below, you'll find all relevant information about it! 13 | 14 | ### Tasks 15 | 16 | 📈 We evaluated models on 6 key benchmarks using the Eleuther AI Language Model Evaluation Harness , a unified framework to test generative language models on a large number of different evaluation tasks. 17 | - AI2 Reasoning Challenge (25-shot) - a set of grade-school science questions. 18 | - HellaSwag (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models. 19 | - MMLU (5-shot) - a test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. 20 | - TruthfulQA (0-shot) - a test to measure a model's propensity to reproduce falsehoods commonly found online. Note: TruthfulQA is technically a 6-shot task in the Harness because each example is prepended with 6 Q/A pairs, even in the 0-shot setting. 21 | - Winogrande (5-shot) - an adversarial and difficult Winograd benchmark at scale, for commonsense reasoning. 22 | - GSM8k (5-shot) - diverse grade school math word problems to measure a model's ability to solve multi-step mathematical reasoning problems. 23 | 24 | For all these evaluations, a higher score is a better score. 25 | 26 | We chose these benchmarks as they test a variety of reasoning and general knowledge across a wide variety of fields in 0-shot and few-shot settings. 27 | 28 | ### Results 29 | 30 | You can find: 31 | - detailed numerical results in the `results` Hugging Face dataset: https://huggingface.co/datasets/open-llm-leaderboard-old/results 32 | - details on the input/outputs for the models in the `details` of each model, which you can access by clicking the 📄 emoji after the model name 33 | - community queries and running status in the `requests` Hugging Face dataset: https://huggingface.co/datasets/open-llm-leaderboard-old/requests 34 | If a model's name contains "Flagged", this indicates it has been flagged by the community, and should probably be ignored! Clicking the link will redirect you to the discussion about the model. 35 | 36 | ## Reproducibility 37 | To reproduce our results, you could run the following command, using [this version](https://github.com/EleutherAI/lm-evaluation-harness/tree/b281b0921b636bc36ad05c0b0b0763bd6dd43463) of the Eleuther AI Harness: 38 | 39 | ```bash 40 | python main.py --model=hf-causal-experimental \ 41 | --model_args="pretrained=,use_accelerate=True,revision=" \ 42 | --tasks= \ 43 | --num_fewshot= \ 44 | --batch_size=1 \ 45 | --output_path= 46 | ``` 47 | 48 | **Note:** We evaluated all models on a single node of 8 H100s, so the global batch size was 8 for each evaluation. If you don't use parallelism, adapt your batch size to fit. 
49 | *You can expect results to vary slightly for different batch sizes because of padding.* 50 | 51 | The tasks and few shots parameters are: 52 | - ARC: 25-shot, *arc-challenge* (`acc_norm`) 53 | - HellaSwag: 10-shot, *hellaswag* (`acc_norm`) 54 | - TruthfulQA: 0-shot, *truthfulqa-mc* (`mc2`) 55 | - MMLU: 5-shot, *hendrycksTest-abstract_algebra,hendrycksTest-anatomy,hendrycksTest-astronomy,hendrycksTest-business_ethics,hendrycksTest-clinical_knowledge,hendrycksTest-college_biology,hendrycksTest-college_chemistry,hendrycksTest-college_computer_science,hendrycksTest-college_mathematics,hendrycksTest-college_medicine,hendrycksTest-college_physics,hendrycksTest-computer_security,hendrycksTest-conceptual_physics,hendrycksTest-econometrics,hendrycksTest-electrical_engineering,hendrycksTest-elementary_mathematics,hendrycksTest-formal_logic,hendrycksTest-global_facts,hendrycksTest-high_school_biology,hendrycksTest-high_school_chemistry,hendrycksTest-high_school_computer_science,hendrycksTest-high_school_european_history,hendrycksTest-high_school_geography,hendrycksTest-high_school_government_and_politics,hendrycksTest-high_school_macroeconomics,hendrycksTest-high_school_mathematics,hendrycksTest-high_school_microeconomics,hendrycksTest-high_school_physics,hendrycksTest-high_school_psychology,hendrycksTest-high_school_statistics,hendrycksTest-high_school_us_history,hendrycksTest-high_school_world_history,hendrycksTest-human_aging,hendrycksTest-human_sexuality,hendrycksTest-international_law,hendrycksTest-jurisprudence,hendrycksTest-logical_fallacies,hendrycksTest-machine_learning,hendrycksTest-management,hendrycksTest-marketing,hendrycksTest-medical_genetics,hendrycksTest-miscellaneous,hendrycksTest-moral_disputes,hendrycksTest-moral_scenarios,hendrycksTest-nutrition,hendrycksTest-philosophy,hendrycksTest-prehistory,hendrycksTest-professional_accounting,hendrycksTest-professional_law,hendrycksTest-professional_medicine,hendrycksTest-professional_psychology,hendrycksTest-public_relations,hendrycksTest-security_studies,hendrycksTest-sociology,hendrycksTest-us_foreign_policy,hendrycksTest-virology,hendrycksTest-world_religions* (average of all the results `acc`) 56 | - Winogrande: 5-shot, *winogrande* (`acc`) 57 | - GSM8k: 5-shot, *gsm8k* (`acc`) 58 | Side note on the baseline scores: 59 | - for log-likelihood evaluation, we select the random baseline 60 | - for GSM8K, we select the score obtained in the paper after finetuning a 6B model on the full GSM8K training set for 50 epochs 61 | 62 | ## Blogs 63 | 64 | During the life of the leaderboard, we wrote 2 blogs that you can find [here](https://huggingface.co/blog/open-llm-leaderboard-mmlu) and [here](https://huggingface.co/blog/open-llm-leaderboard-drop) 65 | -------------------------------------------------------------------------------- /docs/source/en/open_llm_leaderboard/faq.md: -------------------------------------------------------------------------------- 1 | # FAQ 2 | 3 | ## Submissions 4 | 5 | **Q: Do you keep track of who submits models?** 6 | 7 | A: Yes, we store information about which user submitted each model in the requests files here. This helps us prevent spam and encourages responsible submissions. Users are accountable for their submissions, as the community can identify who submitted each model. 
8 | 9 | **Q: Can I submit a model that requires `trust_remote_code=True`?** 10 | 11 | A: We only accept models that have been integrated into a stable version of the `transformers` library to ensure the safety and stability of code executed on our cluster. 12 | 13 | **Q: Are models of type X supported?** 14 | 15 | A: For now, submission is limited to models that are included in a stable version of the transformers library. 16 | 17 | **Q: Can I evaluate my model with a chat template?** 18 | 19 | A: Sure! When submitting a model, you can choose whether to evaluate it using a chat template, which activates automatically for chat models. 20 | 21 | **Q: How can I track the status of my model submission?** 22 | 23 | A: You can monitor your model's status by checking the [Request file here](https://huggingface.co/datasets/open-llm-leaderboard/requests) or viewing the queues above the submit form. 24 | 25 | **Q: What happens if my model disappears from all queues?** 26 | 27 | A: A model’s disappearance typically indicates a failure. You can find your model in [Requests dataset here](https://huggingface.co/datasets/open-llm-leaderboard/requests) and check its status. 28 | 29 | **Q: What causes an evaluation failure?** 30 | 31 | A: Failures often stem from submission issues such as corrupted files or configuration errors. Please review the steps in About tab before submitting. Occasionally, failures are due to hardware or connectivity issues on our end. 32 | 33 | **Q: How do I report an evaluation failure?** 34 | 35 | A: Please create an issue in the [Community section]([https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard/discussions), linking your model’s request file for further investigation. If the error is on our side, we will relaunch your model promptly. 36 | 37 | *Do not re-upload your model under a different name as it will not resolve the issue.* 38 | 39 | 40 | ## Results 41 | 42 | **Q: What information is available about my model's evaluation results?** 43 | 44 | A: For each model, you can access: 45 | 46 | - **Request File**: Status of the evaluation. 47 | - **Contents Dataset:** A full dataset that contains information about all evaluated models. It's available [here](https://huggingface.co/datasets/open-llm-leaderboard/contents). 48 | - **Details Dataset**: Comprehensive breakdown of scores and task examples. You can see all the Details datasets [here](https://huggingface.co/open-llm-leaderboard). 49 | 50 | **Q: Why do some models appear multiple times in the leaderboard?** 51 | 52 | A: Models may appear multiple times due to submissions under different commits or precision settings, like `float16` and `4bit`. You can check this by clicking on the `Precision` button under “column visibility” section on the main page. For evaluation, precision helps to assess the impact of quantization. 53 | 54 | *Duplicates with identical precision and commit should be reported.* 55 | 56 | **Q: What is model flagging?** 57 | 58 | A: Flagging helps report models that have unfair performance on the leaderboard. For example, models that were trained on the evaluation data, models that are copies of other models not attributed properly, etc. 
59 | 60 | *If your model is flagged incorrectly, you can open a discussion [here](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard/discussions) and defend your case.* 61 | 62 | 63 | ## Searching for a model 64 | 65 | **Q: How do I search for models in the leaderboard?** 66 | 67 | A: The search bar provides powerful filtering capabilities with several advanced features: 68 | 69 | **Multiple Term Search** 70 | - Combine Searches: Use semicolons (;) to combine multiple independent search terms. 71 | - Stacked Results: Each term after the semicolon adds results to the previous search, creating a union of results rather than filtering by intersection. 72 | 73 | Example: `llama; 7b` will find models containing "llama" OR models containing "7b." 74 | 75 | **Special Field Search** 76 | 77 | Use the `@` prefix to target specific fields: 78 | - `@architecture:` - Search by model architecture. 79 | - `@license:` - Filter by license type. 80 | - `@precision:` - Filter by model precision. 81 | 82 | Example: `@architecture:llama @license:apache` will find Llama models with an Apache license. 83 | 84 | **Regex Support** 85 | - Advanced Pattern Matching: Supports regular expressions for flexible search criteria. 86 | - Automatic Detection: Regex mode is activated automatically when special regex characters are used. 87 | 88 | Example: `llama-2-(7|13|70)b` matches `llama-2-7b`, `llama-2-13b`, and `llama-2-70b`. 89 | 90 | **Combined Search** 91 | - Combine and stack all features for precise results: 92 | 93 | Example: `meta @architecture:llama; 7b @license:apache` will find: 94 | - Models containing "meta" AND having the Llama architecture, OR 95 | - Models containing "7b" AND having an Apache license. 96 | 97 | **Real-Time Results** 98 | - Dynamic Updates: The search is performed in real-time with debouncing for smooth performance. 99 | - Highlighting: Results are visually emphasized in the table for easy identification. 100 | 101 | ## Editing submissions 102 | 103 | **Q: How can I update or rename my submitted model?** 104 | 105 | A: To update, open an issue with your model's exact name for removal from the leaderboard before resubmitting with the new commit hash. For renaming, check [community resources](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard/discussions/174) page and use @Weyaxi's tool to request changes, then link the pull request in a discussion for approval. 106 | 107 | ## Additional information 108 | 109 | **Q: What does “Only Official Providers” button do?** 110 | 111 | A: This button filters and displays models from a curated list of trusted and high-quality model providers. We have introduced it to help users easily identify and choose top-tier models. The current set of trusted authors includes well-known names such as EleutherAI, CohereForAI, MistralAI and many others. 112 | The dataset is available [here](https://huggingface.co/datasets/open-llm-leaderboard/official-providers). 113 | 114 | **Q: How can I view raw scores for each evaluation?** 115 | 116 | A: The Leaderboard displays normalized scores by default to provide a fair comparison. Normalization adjusts scores so that the lower bound corresponds to the score of a random baseline, ensuring a fairer average. To view the non-normalized values, go to "table options", "Score Display", and click "Raw". 
117 | 118 | **Q: How are model categories differentiated?** 119 | 120 | A: Categories are defined to reflect the specific training stages and methodologies applied to each model, ensuring comparisons are both fair and meaningful. Here's a breakdown of each category: 121 | 122 | - **Pretrained Models:** These foundational models are initially trained on large datasets without task-specific tuning, serving as a versatile base for further development. 123 | - **Continuously Pretrained Models:** These undergo additional training beyond initial pretraining to enhance their capabilities, often using more specialized data. 124 | - **Fine-Tuned Models:** Specifically adjusted on targeted datasets, these models are optimized for particular tasks, improving performance in those areas. 125 | - **Chat Models:** Tailored for interactive applications like chatbots, these models are trained to handle conversational contexts effectively. 126 | - **Merge Models:** Combining multiple models or methods, these can show superior test results but do not always apply for real-world situations. 127 | 128 | **Q: What are the leaderboard's intended uses?** 129 | 130 | A: The leaderboard is ideal for: 131 | 132 | 1. Viewing rankings and scores of open pretrained models. 133 | 2. Experimenting with various fine-tuning and quantization techniques. 134 | 3. Comparing the performance of specific models within their categories. 135 | 136 | **Q: Why don't you have closed-source models?** 137 | 138 | A: The leaderboard focuses on open-source models to ensure transparency, reproducibility, and fairness. Closed-source models can change their APIs unpredictably, making it difficult to guarantee consistent and accurate scoring. Additionally, we rerun all evaluations on our cluster to maintain a uniform testing environment, which isn't possible with closed-source models. 139 | 140 | **Q: I have another problem, help!** 141 | 142 | A: Please, open an issue in the discussion tab, and we'll do our best to help you in a timely manner : -------------------------------------------------------------------------------- /docs/source/en/open_llm_leaderboard/about.md: -------------------------------------------------------------------------------- 1 | # About 2 | 3 | With the plethora of large language models (LLMs) and chatbots being released week upon week, often with grandiose claims of their performance, it can be hard to filter out the genuine progress that is being made by the open-source community and which model is the current state of the art. 4 | 5 | We wrote a release blog [here](https://huggingface.co/spaces/open-llm-leaderboard/blog) to explain why we introduced this leaderboard! 6 | 7 | ## Tasks 8 | 9 | 📈 We evaluate models on 6 key benchmarks using the [Eleuther AI Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) , a unified framework to test generative language models on a large number of different evaluation tasks. 10 | 11 | - **IFEval** ([https://arxiv.org/abs/2311.07911](https://arxiv.org/abs/2311.07911)) – IFEval is a dataset designed to test a model's ability to follow explicit instructions, such as "include keyword x" or "use format y." The focus is on the model's adherence to formatting instructions rather than the content generated, allowing for the use of strict and rigorous metrics. 12 | - **BBH (Big Bench Hard)** ([https://arxiv.org/abs/2210.09261](https://arxiv.org/abs/2210.09261)) – A subset of 23 challenging tasks from the BigBench dataset to evaluate language models. 
The tasks use objective metrics, are highly difficult, and have sufficient sample sizes for statistical significance. They include multistep arithmetic, algorithmic reasoning (e.g., boolean expressions, SVG shapes), language understanding (e.g., sarcasm detection, name disambiguation), and world knowledge. BBH performance correlates well with human preferences, providing valuable insights into model capabilities. 13 | - **MATH** ([https://arxiv.org/abs/2103.03874](https://arxiv.org/abs/2103.03874)) – MATH is a compilation of high-school level competition problems gathered from several sources, formatted consistently using Latex for equations and Asymptote for figures. Generations must fit a very specific output format. We keep only level 5 MATH questions and call it MATH Lvl 5. 14 | - **GPQA (Graduate-Level Google-Proof Q&A Benchmark)** ([https://arxiv.org/abs/2311.12022](https://arxiv.org/abs/2311.12022)) – GPQA is a highly challenging knowledge dataset with questions crafted by PhD-level domain experts in fields like biology, physics, and chemistry. These questions are designed to be difficult for laypersons but relatively easy for experts. The dataset has undergone multiple rounds of validation to ensure both difficulty and factual accuracy. Access to GPQA is restricted through gating mechanisms to minimize the risk of data contamination. Consequently, we do not provide plain text examples from this dataset, as requested by the authors. 15 | - **MuSR (Multistep Soft Reasoning)** ([https://arxiv.org/abs/2310.16049](https://arxiv.org/abs/2310.16049)) – MuSR is a new dataset consisting of algorithmically generated complex problems, each around 1,000 words in length. The problems include murder mysteries, object placement questions, and team allocation optimizations. Solving these problems requires models to integrate reasoning with long-range context parsing. Few models achieve better than random performance on this dataset. 16 | - **MMLU-PRO (Massive Multitask Language Understanding - Professional)** ([https://arxiv.org/abs/2406.01574](https://arxiv.org/abs/2406.01574)) – MMLU-Pro is a refined version of the MMLU dataset, which has been a standard for multiple-choice knowledge assessment. Recent research identified issues with the original MMLU, such as noisy data (some unanswerable questions) and decreasing difficulty due to advances in model capabilities and increased data contamination. MMLU-Pro addresses these issues by presenting models with 10 choices instead of 4, requiring reasoning on more questions, and undergoing expert review to reduce noise. As a result, MMLU-Pro is of higher quality and currently more challenging than the original. 17 | 18 | For all these evaluations, a higher score is a better score. We chose these benchmarks as they test a variety of reasoning and general knowledge across a wide variety of fields in 0-shot and few-shot settings. 19 | 20 | ## Model Types 21 | 22 | - 🟢 **Pretrained Model:** New, base models trained on a given text corpora using masked modeling. 23 | - 🟩 **Continuously Pretrained Model:** New, base models continuously trained on further corpora (which may include IFT/chat data) using masked modeling. 24 | - 🔶 **Fine-Tuned on Domain-Specific Datasets Model:** Pretrained models fine-tuned on more data. 25 | - 💬 **Chat Models (RLHF, DPO, IFT, ...):** Chat-like fine-tunes using IFT (datasets of task instruction), RLHF, DPO (changing the model loss with an added policy), etc. 
26 | - 🤝 **Base Merges and Moerges Model:** Merges or MoErges, models which have been merged or fused without additional fine-tuning. 27 | 28 | 29 | ## Results 30 | 31 | You can find: 32 | - Detailed numerical results in the [`results` Hugging Face dataset](https://huggingface.co/datasets/open-llm-leaderboard/results/). 33 | - Details on the input/outputs for the models in the `details` of each model, which you can access by clicking the 📄 emoji after the model name. 34 | - Community queries and running status in the [`requests` Hugging Face dataset](https://huggingface.co/datasets/open-llm-leaderboard/requests). 35 | 36 | If a model's name contains "Flagged", this indicates it has been flagged by the community, and should probably be ignored! Clicking the link will redirect you to the discussion about the model. 37 | 38 | ## Reproducibility 39 | 40 | To reproduce our results, you can use our fork of [lm_eval](https://github.com/huggingface/lm-evaluation-harness/tree/main), as our PRs are not all merged in it at the moment. 41 | ``` 42 | git clone git@github.com:huggingface/lm-evaluation-harness.git 43 | cd lm-evaluation-harness 44 | git checkout main 45 | pip install -e . 46 | lm-eval --model_args="pretrained=,revision=,dtype=" --tasks=leaderboard --batch_size=auto --output_path= 47 | ``` 48 | **Attention:** For instruction models add the `--apply_chat_template` and `fewshot_as_multiturn` option. 49 | 50 | **Note:** You can expect results to vary slightly for different batch sizes because of padding. 51 | 52 | ### **Task Evaluations and Parameters** 53 | 54 | **IFEval**: 55 | 56 | - Task: "IFEval" 57 | - Measure: Strict Accuracy at Instance and Prompt Levels (`inst_level_strict_acc,none` and `prompt_level_strict_acc,none`) 58 | - Shots: 0-shot for both Instance-Level Strict Accuracy and Prompt-Level Strict Accuracy 59 | - num_choices: 0 for both Strict Accuracy at Instance and Prompt Levels. 
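If you only want to rerun a single task rather than the full `leaderboard` group, a command along the following lines should work with the setup above. The `leaderboard_ifeval` task name is our assumption for the harness's per-task naming; check `lm-eval --tasks list` if it differs in your version.

```bash
# Hedged example: evaluate one leaderboard task (here IFEval) instead of the whole group.
# <model_name>, <model_revision>, <precision> and <output_path> are placeholders.
lm-eval --model_args="pretrained=<model_name>,revision=<model_revision>,dtype=<precision>" \
    --tasks=leaderboard_ifeval \
    --batch_size=auto \
    --output_path=<output_path>
# For chat/instruction models, also add --apply_chat_template and --fewshot_as_multiturn.
```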
60 | 61 | **Big Bench Hard (BBH)**: 62 | 63 | - Overview Task: "BBH" 64 | - Shots: 3-shot for each subtask 65 | - Measure: Normalized Accuracy across all subtasks (`acc_norm,none`) 66 | - List of subtasks with `num_choices`: 67 | - BBH Sports Understanding, num_choices=2 68 | - BBH Tracking Shuffled Objects (Three Objects), num_choices=3 69 | - BBH Navigate, num_choices=2 70 | - BBH Snarks, num_choices=2 71 | - BBH Date Understanding, num_choices=6 72 | - BBH Reasoning about Colored Objects, num_choices=18 73 | - BBH Object Counting, num_choices=19 (should be 18 but we added a “0” choice) 74 | - BBH Logical Deduction (Seven Objects), num_choices=7 75 | - BBH Geometric Shapes, num_choices=11 76 | - BBH Web of Lies, num_choices=2 77 | - BBH Movie Recommendation, num_choices=6 78 | - BBH Logical Deduction (Five Objects), num_choices=5 79 | - BBH Salient Translation Error Detection, num_choices=6 80 | - BBH Disambiguation QA, num_choices=3 81 | - BBH Temporal Sequences, num_choices=4 82 | - BBH Hyperbaton, num_choices=2 83 | - BBH Logical Deduction (Three Objects), num_choices=3 84 | - BBH Causal Judgement, num_choices=2 85 | - BBH Formal Fallacies, num_choices=2 86 | - BBH Tracking Shuffled Objects (Seven Objects), num_choices=7 87 | - BBH Ruin Names, num_choices=6 88 | - BBH Penguins in a Table, num_choices=5 89 | - BBH Boolean Expressions, num_choices=2 90 | - BBH Tracking Shuffled Objects (Five Objects), num_choices=5 91 | 92 | **Math Challenges**: 93 | 94 | - Task: "Math Level 5" 95 | - Measure: Exact Match (`exact_match,none`) 96 | - Shots: 4-shot 97 | - num_choices: 0 98 | 99 | **Generalized Purpose Question Answering (GPQA)**: 100 | 101 | - Task: "GPQA" 102 | - Measure: Normalized Accuracy (`acc_norm,none`) 103 | - Shots: 0-shot 104 | - num_choices: 4 105 | 106 | **MuSR**: 107 | 108 | - Overview Task: "MuSR" 109 | - Measure: Normalized Accuracy across all subtasks (`acc_norm,none`) 110 | - MuSR Murder Mysteries: 0-shot, num_choices: 2 111 | - MuSR Object Placement: 0-shot, num_choices: 5 112 | - MuSR Team Allocation: 0-shot, num_choices: 3 113 | 114 | **MMLU-PRO**: 115 | 116 | - Task: "MMLU-PRO" 117 | - Measure: Accuracy (`acc,none`) 118 | - Shots: 5-shot 119 | - num_choices: 10 120 | 121 | --------------------------------------------------------------------------------