├── .gitignore
├── LICENSE
├── README.md
├── api_handler.py
├── benchmark_suite.py
├── benchmarks
│   ├── arc_challenge_benchmark.py
│   ├── base_benchmark.py
│   ├── chatbot_arena_benchmark.py
│   ├── gpqa_diamond_benchmark.py
│   ├── hellaswag_benchmark.py
│   ├── livebench_benchmark.py
│   ├── math_hard_benchmark.py
│   ├── mgsm_benchmark.py
│   ├── mmlu_pro_benchmark.py
│   └── musr_benchmark.py
├── main.py
└── model.py
/.gitignore:
--------------------------------------------------------------------------------
1 | __pycache__/api_handler.cpython-310.pyc
2 | benchmarks/__pycache__/mmlu_pro_benchmark.cpython-310.pyc
3 | __pycache__/benchmark_suite.cpython-310.pyc
4 | benchmarks/__pycache__/base_benchmark.cpython-310.pyc
5 | benchmarks/__pycache__/gsm8k_benchmark.cpython-310.pyc
6 | benchmarks/__pycache__/gpqa_diamond_benchmark.cpython-310.pyc
7 | benchmarks/__pycache__/drop_benchmark.cpython-310.pyc
8 | .vscode/launch.json
9 | data.json
10 | benchmarks/__pycache__/chatbot_arena_benchmark.cpython-310.pyc
11 | __pycache__/model.cpython-310.pyc
12 | benchmarks/__pycache__/math_hard_benchmark.cpython-310.pyc
13 | benchmarks/__pycache__/musr_benchmark.cpython-310.pyc
14 | benchmarks/__pycache__/arc_challenge_benchmark.cpython-310.pyc
15 | benchmarks/__pycache__/hellaswag_benchmark.cpython-310.pyc
16 | benchmarks/__pycache__/livebench_benchmark.cpython-310.pyc
17 | benchmarks/__pycache__/mgsm_benchmark.cpython-310.pyc
18 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2024
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # BenchmarkAggregator 🚀
2 |
3 | [License: MIT](https://choosealicense.com/licenses/mit/)
4 | [Issues](https://github.com/yourusername/benchmarkaggregator/issues)
5 |
6 | Rigorous, unbiased, and scalable LLM evaluations across diverse AI benchmarks, from GPQA Diamond to Chatbot Arena, testing all major models equally.
7 |
8 |
9 |
10 | [View Leaderboard](https://benchmark-aggregator-lvss.vercel.app/) | [Features](#🌟-features) | [Benchmarks](#🏆-current-benchmarks) | [FAQ](#🤔-faq)
11 |
12 | ## 🎯 Introduction
13 |
14 | The BenchmarkAggregator framework serves as a **central** hub, addressing the need for consistent model evaluation in the AI community. By comparing Large Language Models (LLMs) across challenging, well-respected benchmarks in one unified location, it offers a holistic, fair, and scalable view of model performance, balancing depth of evaluation against resource constraints while remaining practical and accessible from a single, authoritative source.
15 |
16 | ## 📊 Model Performance Overview
17 |
18 | | Model | Average Score |
19 | |-------|---------------|
20 | | gpt-4o-2024-08-06 | 69.0 |
21 | | claude-3.5-sonnet | 66.2 |
22 | | gpt-4o-mini-2024-07-18 | 62.1 |
23 | | mistral-large | 61.4 |
24 | | llama-3.1-405b-instruct | 59.8 |
25 | | llama-3.1-70b-instruct | 58.4 |
26 | | claude-3-sonnet | 53.2 |
27 | | gpt-3.5-turbo-0125 | 34.8 |
28 |
29 | For detailed scores across all benchmarks, visit our [leaderboard](https://benchmark-aggregator-lvss.vercel.app/).
30 |
31 | ## 🌟 Features
32 |
33 | 1. 🏆 Incorporates the most respected benchmarks in the AI community
34 | 2. 📊 Balanced evaluation using 100 randomly drawn samples per benchmark (adjustable)
35 | 3. 🔌 Quick and easy integration of new benchmarks and models (uses [OpenRouter](https://openrouter.ai/), so adding a new model is trivial; see the sketch after this list)
36 | 4. 📈 Holistic performance view through score averaging across diverse tasks
37 | 5. ⚖️ Efficient approach balancing evaluation depth with resource constraints
38 |
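Because models are addressed purely by their OpenRouter IDs, registering a new one is a one-line change. The snippet below is a minimal sketch that mirrors the pattern in `main.py`; the second model ID and its release date are placeholders rather than real entries, and running it assumes `OPENROUTER_API_KEY` is set in your environment.

```python
import asyncio

from benchmark_suite import BenchmarkSuite
from model import Model

# Each entry pairs an OpenRouter model ID with its release date (YYYY-MM-DD).
models = [
    Model("openai/gpt-4o-2024-08-06", "2024-08-06"),
    Model("provider/new-model-id", "2024-09-01"),  # hypothetical ID copied from openrouter.ai
]

async def run_subset():
    suite = BenchmarkSuite()
    # Requires OPENROUTER_API_KEY to be set in the environment.
    results = await suite.run(models, ["MMLU-Pro", "GPQA-Diamond"], samples_per_benchmark=100)
    suite.print_results(results)

if __name__ == "__main__":
    asyncio.run(run_subset())
```
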
39 | ## 🏆 Current Benchmarks
40 | 1. [MMLU-Pro](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro)
41 | 2. [GPQA-Diamond](https://huggingface.co/datasets/Idavidrein/gpqa)
42 | 3. [ChatbotArena](https://chat.lmsys.org/)
43 | 4. [MATH-Hard](https://huggingface.co/datasets/lighteval/MATH-Hard)
44 | 5. [MuSR](https://huggingface.co/datasets/TAUR-Lab/MuSR)
45 | 6. [ARC-Challenge](https://huggingface.co/datasets/allenai/ai2_arc)
46 | 7. [HellaSwag](https://rowanzellers.com/hellaswag/)
47 | 8. [LiveBench](https://livebench.ai/)
48 | 9. [MGSM](https://huggingface.co/datasets/juletxara/mgsm)
49 |
50 | 📖 [Learn more about each benchmark on our website](https://benchmark-aggregator-lvss.vercel.app/)
51 |
52 | ## 🤔 FAQ
53 |
**Why not run all questions for each benchmark?**

Running all questions for each benchmark would be cost-prohibitive. Our approach balances comprehensive evaluation with practical resource constraints.

**How are benchmark samples chosen?**

The samples are randomly drawn from the larger benchmark dataset. The same sample set is used for each model to ensure consistency and fair comparison across all evaluations.
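
As a rough illustration of how a fixed sample set can be drawn, the sketch below follows the approach in `benchmark_suite.py`, where a hard-coded random seed keeps the draw identical for every model; the function name here is made up for the example.

```python
import pandas as pd

def draw_fixed_sample(df: pd.DataFrame, n: int = 100, seed: int = 1337) -> pd.DataFrame:
    """Draw the same n rows on every run so all models answer an identical question set."""
    if n >= len(df):
        return df  # small datasets are used in full
    return df.sample(n=n, random_state=seed)
```
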
**Why are certain models like Claude 3 Opus and GPT-4 Turbo absent?**

These models are significantly more expensive to query than many others, so their absence comes down to the cost of running the benchmarks.

**How easy is it to add new benchmarks or models?**

Adding new benchmarks or models is designed to be quick and efficient. Integrating an existing benchmark typically takes only a few minutes. For models, we use OpenRouter, which covers essentially all closed and open-source options: find the model's ID on the OpenRouter website, add it to the framework, and you're done.
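
To make that concrete, a new benchmark is just a subclass of `BaseBenchmark` that says where its data lives and how to turn a row into a question. The skeleton below follows the existing classes under `benchmarks/`; the dataset URL and the column names are invented placeholders, so adapt them to the dataset you actually use.

```python
import os
import pandas as pd
from benchmarks.base_benchmark import BaseBenchmark

class MyBenchmark(BaseBenchmark):
    def __init__(self):
        super().__init__()
        self.id = "MyBenchmark"  # referenced from the benchmark_ids list in main.py
        self.data_url = "https://example.com/my_benchmark.parquet"  # placeholder URL

    async def get_dataset(self) -> pd.DataFrame:
        # BaseBenchmark.setup() has already downloaded self.data_url into self.temp_dir.
        return pd.read_parquet(os.path.join(self.temp_dir, self.data_file))

    def get_question(self, row: pd.Series) -> str:
        return row["question"]  # assumed column name

    def get_correct_answer(self, row: pd.Series) -> str:
        return row["answer"]  # assumed column name

    def check_answer(self, model_answer: str, correct_answer: str) -> bool:
        return model_answer.strip().lower() == correct_answer.strip().lower()
```
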
**How are the scores from Chatbot Arena calculated?**

The scores for Chatbot Arena are fetched directly from their website. These scores are then normalized against the values of the other models in this benchmark.
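
That normalization is a plain min-max rescaling of the Arena ELO ratings, as implemented in `benchmarks/chatbot_arena_benchmark.py`. The sketch below shows the same calculation on a toy DataFrame with made-up ratings.

```python
import pandas as pd

def normalize_elo(df: pd.DataFrame) -> pd.DataFrame:
    """Rescale ELO ratings to [0, 1] so they can be averaged with other benchmark scores."""
    min_score = df["ELO"].min()
    max_score = df["ELO"].max()
    df["normalized_score"] = (df["ELO"] - min_score) / (max_score - min_score)
    return df

leaderboard = pd.DataFrame({"Model": ["model-a", "model-b", "model-c"], "ELO": [1300, 1250, 1100]})
print(normalize_elo(leaderboard))  # model-a -> 1.0, model-b -> 0.75, model-c -> 0.0
```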

Made with ❤️ by the AI community

[Website](https://benchmark-aggregator-lvss.vercel.app/)
--------------------------------------------------------------------------------
/api_handler.py:
--------------------------------------------------------------------------------
import os
import asyncio
from openai import AsyncOpenAI
from aiolimiter import AsyncLimiter

def get_openrouter_client():
    client = AsyncOpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key=os.getenv("OPENROUTER_API_KEY"),
    )
    return client

class RateLimitedClient:
    def __init__(self, client, rate_limit):
        self.client = client
        self.limiter = AsyncLimiter(rate_limit, 1)  # rate_limit requests per second

    async def query_model(self, model, prompt, max_retries=15):
        for attempt in range(max_retries):
            try:
                async with self.limiter:
                    completion = await self.client.chat.completions.create(
                        model=model,
                        messages=[
                            {
                                "role": "user",
                                "content": prompt,
                            },
                        ],
                    )
                    return completion.choices[0].message.content
            except Exception as e:
                if attempt < max_retries - 1:
                    wait_time = 2 ** attempt  # Exponential backoff
                    print(f"Attempt {attempt + 1} failed. Retrying in {wait_time} seconds...")
                    await asyncio.sleep(wait_time)
                else:
                    print(f"All {max_retries} attempts failed. Last error: {str(e)}")
                    raise
--------------------------------------------------------------------------------
/benchmark_suite.py:
--------------------------------------------------------------------------------
import os
import importlib.util
import asyncio
import json
from typing import List, Dict, Any
from benchmarks.base_benchmark import BaseBenchmark
from api_handler import get_openrouter_client, RateLimitedClient
from model import Model

class BenchmarkSuite:
    def __init__(self):
        self.all_benchmarks = self._discover_benchmarks()
        self.client = None
        self.full_benchmark_data = {}
        self.sampled_benchmark_data = {}
        self.samples_per_benchmark = None

    def _discover_benchmarks(self):
        discovered_benchmarks = {}
        benchmark_dir = os.path.join(os.path.dirname(__file__), 'benchmarks')

        for filename in os.listdir(benchmark_dir):
            if filename.endswith('.py') and filename != 'base_benchmark.py':
                module_name = filename[:-3]  # Remove .py extension
                module_path = os.path.join(benchmark_dir, filename)

                spec = importlib.util.spec_from_file_location(module_name, module_path)
                module = importlib.util.module_from_spec(spec)
                spec.loader.exec_module(module)

                for item_name in dir(module):
                    item = getattr(module, item_name)
                    if isinstance(item, type) and issubclass(item, BaseBenchmark) and item != BaseBenchmark:
                        discovered_benchmarks[item().id] = item

        return discovered_benchmarks

    async def run(self, models: List[Model], benchmark_ids: List[str] = None, samples_per_benchmark: int = None) -> Dict[str, Dict[str, Any]]:
        self.samples_per_benchmark = samples_per_benchmark
        benchmarks_to_run = {bid: self.all_benchmarks[bid] for bid in benchmark_ids if bid in self.all_benchmarks}
        if len(benchmarks_to_run) != len(benchmark_ids):
            missing = set(benchmark_ids) - set(benchmarks_to_run.keys())
            print(f"Warning: The following benchmarks were not found: {missing}")

        openai_client = get_openrouter_client()
        self.client = RateLimitedClient(openai_client, rate_limit=5)

        # Load benchmark data once
        await self._load_benchmark_data(benchmarks_to_run, samples_per_benchmark)

        results = {model.id: {"releaseDate": model.release_date, "benchmarks": {}} for model in models}
        tasks = []

        for model in models:
            for benchmark_id, benchmark_class in benchmarks_to_run.items():
                task = asyncio.create_task(self._run_benchmark(model, benchmark_id, benchmark_class))
                tasks.append(task)

        benchmark_results = await asyncio.gather(*tasks)

        for model, benchmark_id, score in benchmark_results:
            results[model.id]["benchmarks"][benchmark_id] = score

        return results

    async def _load_benchmark_data(self, benchmarks_to_run, samples_per_benchmark):
        for benchmark_id, benchmark_class in benchmarks_to_run.items():
            if benchmark_id not in self.full_benchmark_data:
                benchmark = benchmark_class()
                await benchmark.setup()
                df = await benchmark.get_dataset()
                self.full_benchmark_data[benchmark_id] = df

                if benchmark_id not in ["ChatbotArena", "LiveBench"] and samples_per_benchmark is not None and samples_per_benchmark < len(df):
                    sampled_df = df.sample(n=samples_per_benchmark, random_state=1337)
                else:
                    sampled_df = df

                self.sampled_benchmark_data[benchmark_id] = sampled_df
                await benchmark.cleanup()

    async def _run_benchmark(self, model: Model, benchmark_id: str, benchmark_class):
        benchmark = benchmark_class()
        benchmark.df = self.sampled_benchmark_data[benchmark_id]  # Use sampled data for running
        try:
            score = await benchmark.run(model.id, self.client, self.sampled_benchmark_data[benchmark_id])
            return model, benchmark_id, score
        finally:
            pass

    def print_results(self, results: Dict[str, Dict[str, Any]]):
        for model_id, model_data in results.items():
            print(f"Results for model: {model_id} (Release Date: {model_data['releaseDate']})")
            for benchmark_id, score in model_data['benchmarks'].items():
                print(f"  {benchmark_id}: {score:.2%}")

    def save_results_to_json(self, results: Dict[str, Dict[str, Any]], filename='data.json'):
        formatted_results = []
        for model_id, model_data in results.items():
            model_result = {
                "model": model_id,
                "releaseDate": model_data['releaseDate'],
                "benchmarks": []
            }
            for benchmark_id, score in model_data['benchmarks'].items():
                benchmark_info = {
                    "name": benchmark_id,
                    "score": round(score * 100, 2)
                }

                if benchmark_id not in ["ChatbotArena", "LiveBench"]:
                    total_samples = len(self.full_benchmark_data[benchmark_id])
                    drawn_samples = len(self.sampled_benchmark_data[benchmark_id])
                    benchmark_info["samplesDrawn"] = drawn_samples
                    benchmark_info["totalSamples"] = total_samples
                else:
                    benchmark_info["samplesDrawn"] = "N/A"
                    benchmark_info["totalSamples"] = "N/A"

                model_result["benchmarks"].append(benchmark_info)

            formatted_results.append(model_result)

        with open(filename, 'w') as f:
            json.dump(formatted_results, f, indent=2)

        print(f"Results saved to {filename}")
--------------------------------------------------------------------------------
/benchmarks/arc_challenge_benchmark.py:
--------------------------------------------------------------------------------
from benchmarks.base_benchmark import BaseBenchmark
import pandas as pd
import os

class ARCChallengeBenchmark(BaseBenchmark):
    def __init__(self):
        super().__init__()
        self.id = "ARC-Challenge"
        self.data_url = "https://huggingface.co/datasets/allenai/ai2_arc/resolve/main/ARC-Challenge/validation-00000-of-00001.parquet"

    async def setup(self):
        await super().setup()
        file_path = os.path.join(self.temp_dir, "arc_challenge_validation.parquet")
        await self.download_file(self.data_url, file_path)
        self.data = pd.read_parquet(file_path)

    async def get_dataset(self) -> pd.DataFrame:
        return self.data

    def get_question(self, row: pd.Series) -> str:
        question = row['question']
        choices = row['choices']

        formatted_question = f"{question}\n\nOptions:\n"
        for label, text in zip(choices['label'], choices['text']):
            formatted_question += f"{label}. {text}\n"
        formatted_question += f"\nFinal answer should be the letter of the correct option."
        return formatted_question

    def get_correct_answer(self, row: pd.Series) -> str:
        return row['answerKey']

    def check_answer(self, model_answer: str, correct_answer: str) -> bool:
        return model_answer.strip().upper() == correct_answer.strip().upper()

    def construct_prompt(self, question: str) -> str:
        prompt = f"{question}\n\n"
        return self.append_answer_instruction(prompt)

    @staticmethod
    def append_answer_instruction(prompt: str) -> str:
        return prompt + "Please reason through the question and options carefully. After your reasoning, provide your answer as a single letter corresponding to the chosen option. For example: [answer]A[/answer]"
--------------------------------------------------------------------------------
/benchmarks/base_benchmark.py:
--------------------------------------------------------------------------------
from abc import ABC, abstractmethod
import os
import tempfile
import shutil
import aiofiles
import aiohttp
import pandas as pd
from tqdm import tqdm
import asyncio
from typing import Any
from urllib.parse import urlparse

class BaseBenchmark(ABC):
    def __init__(self):
        self.id = None
        self.temp_dir = None
        self.data_url = None
        self.data_file = None

    async def setup(self):
        self.create_temp_dir()
        await self.download_data()

    @abstractmethod
    async def get_dataset(self) -> pd.DataFrame:
        pass

    @abstractmethod
    def get_question(self, row: pd.Series) -> str:
        pass

    @abstractmethod
    def get_correct_answer(self, row: pd.Series) -> Any:
        pass

    @abstractmethod
    def check_answer(self, model_answer: str, correct_answer: Any) -> bool:
        pass

    async def run(self, model: str, client, df: pd.DataFrame) -> float:
        total_questions = len(df)

        print(f"Starting {self.id} benchmark for model: {model}")
        print(f"Total questions: {total_questions}")

        progress_bar = tqdm(total=total_questions, desc="Progress", unit="question")

        tasks = []
        for _, row in df.iterrows():
            task = self.process_question(model, client, row, progress_bar)
            tasks.append(task)

        results = await asyncio.gather(*tasks)

        correct_answers = sum(results)
        final_score = correct_answers / total_questions

        progress_bar.close()
        print(f"\nFinal Score for {model} on {self.id}: {final_score:.2%}")

        return final_score

    async def process_question(self, model: str, client, row, progress_bar):
        question = self.get_question(row)
        prompt = self.construct_prompt(question)

        model_response = await client.query_model(model, prompt)
        model_answer = self.parse_model_answer(model_response)

        correct_answer = self.get_correct_answer(row)
        is_correct = self.check_answer(model_answer, correct_answer)

        progress_bar.update(1)
        return int(is_correct)

    def construct_prompt(self, question: str) -> str:
        prompt = f"{question}\n\n"
        return self.append_answer_instruction(prompt)

    def parse_model_answer(self, response: str) -> str:
        start_tag = "[answer]"
        end_tag = "[/answer]"
        start_index = response.find(start_tag)
        end_index = response.find(end_tag)
        if start_index != -1 and end_index != -1:
            return response[start_index + len(start_tag):end_index].strip()
        else:
            return response  # Return the full response if tags are not found

    @staticmethod
    def append_answer_instruction(prompt: str) -> str:
        return prompt + "Please reason through the question and options. After your reasoning, provide your answer enclosed in [answer] tags. For example: [answer]Your answer here[/answer]"

    async def cleanup(self):
        self.remove_temp_dir()

    def create_temp_dir(self):
        self.temp_dir = tempfile.mkdtemp()

    def remove_temp_dir(self):
        if self.temp_dir and os.path.exists(self.temp_dir):
            shutil.rmtree(self.temp_dir)

    async def download_data(self):
        if not self.data_url:
            return
        parsed_url = urlparse(self.data_url)
        self.data_file = os.path.basename(parsed_url.path)
        local_path = os.path.join(self.temp_dir, self.data_file)
        await self.download_file(self.data_url, local_path)

    @staticmethod
    async def download_file(url: str, local_path: str):
        try:
            async with aiohttp.ClientSession() as session:
                async with session.get(url) as response:
                    response.raise_for_status()
                    async with aiofiles.open(local_path, 'wb') as file:
                        while True:
                            chunk = await response.content.read(8192)
                            if not chunk:
                                break
                            await file.write(chunk)
            print(f"File downloaded successfully and saved to {local_path}")
        except Exception as e:
            print(f"An error occurred while downloading the file: {e}")
            raise
--------------------------------------------------------------------------------
/benchmarks/chatbot_arena_benchmark.py:
--------------------------------------------------------------------------------
import pandas as pd
import os
import git
import pickle
import tempfile
import shutil
import asyncio
from datetime import datetime
from benchmarks.base_benchmark import BaseBenchmark

class ChatbotArenaBenchmark(BaseBenchmark):
    def __init__(self):
        super().__init__()
        self.id = "ChatbotArena"
        self.repo_url = "https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard"
        self.temp_dir = tempfile.mkdtemp()
        self.repo_path = os.path.join(self.temp_dir, "chatbot-arena-leaderboard")
        self.model_mapping = {
            "openai/gpt-3.5-turbo-0125": "gpt-3.5-turbo-0125",
            "openai/gpt-4o-mini-2024-07-18": "gpt-4o-mini-2024-07-18",
            "openai/gpt-4o-2024-08-06": "chatgpt-4o-latest-2024-08-08",
            "anthropic/claude-3-sonnet": "claude-3-sonnet-20240229",
            "anthropic/claude-3-opus": "claude-3-opus-20240229",
            "anthropic/claude-3.5-sonnet": "claude-3-5-sonnet-20240620",
            "google/gemini-pro-1.5-exp": "gemini-1.5-pro-exp-0801",
            "meta-llama/llama-3.1-70b-instruct": "llama-3.1-70b-instruct",
            "meta-llama/llama-3.1-405b-instruct": "llama-3.1-405b-instruct",
            "mistralai/mistral-large": "mistral-large-2407"
        }
        self.repo = None

    async def setup(self):
        if not os.path.exists(self.repo_path):
            self.repo = git.Repo.clone_from(self.repo_url, self.repo_path)
        else:
            self.repo = git.Repo(self.repo_path)
            origin = self.repo.remotes.origin
            origin.pull()

    def get_latest_elo_file(self):
        elo_files = [f for f in os.listdir(self.repo_path) if f.startswith("elo_results_") and f.endswith(".pkl")]
        return max(elo_files, key=lambda x: datetime.strptime(x.split("_")[2].split(".")[0], "%Y%m%d"))

    async def get_dataset(self):
        latest_file = self.get_latest_elo_file()
        with open(os.path.join(self.repo_path, latest_file), 'rb') as f:
            data = pickle.load(f)

        full_elo_data = data['text']['full']['elo_rating_final']
        df = pd.DataFrame(list(full_elo_data.items()), columns=['Model', 'ELO'])
        return df

    def normalize_scores(self, df):
        min_score = df['ELO'].min()
        max_score = df['ELO'].max()
        df['normalized_score'] = (df['ELO'] - min_score) / (max_score - min_score)
        return df

    async def run(self, model: str, client, df: pd.DataFrame = None, samples: int = None) -> float:
        df = self.normalize_scores(df)

        df_sorted = df.sort_values('ELO', ascending=False)
        df_sorted['normalized_score'] = df_sorted['normalized_score'].round(4)

        # For debugging
        # print(tabulate(df_sorted, headers='keys', tablefmt='grid'))

        arena_model = self.model_mapping.get(model, model)
        if arena_model in df['Model'].values:
            normalized_score = df[df['Model'] == arena_model]['normalized_score'].values[0]
            rounded_normalized_score = round(normalized_score, 4)
            return rounded_normalized_score
        else:
            print(f"Warning: No data available for model {model}")
            return 0.0

    async def cleanup(self):
        if self.repo:
            self.repo.close()
        await asyncio.sleep(1)  # Give a moment for any pending operations to complete
        if os.path.exists(self.temp_dir):
            shutil.rmtree(self.temp_dir, ignore_errors=True)

    # These methods are not used but are required by the BaseBenchmark abstract class
    def get_question(self, row: pd.Series) -> str:
        pass

    def get_correct_answer(self, row: pd.Series) -> str:
        pass

    def check_answer(self, model_answer: str, correct_answer: str) -> bool:
        pass
--------------------------------------------------------------------------------
/benchmarks/gpqa_diamond_benchmark.py:
--------------------------------------------------------------------------------
import csv
import random
from typing import List, Tuple
from huggingface_hub import hf_hub_download
from benchmarks.base_benchmark import BaseBenchmark
import pandas as pd

class GPQADiamondBenchmark(BaseBenchmark):
    def __init__(self):
        super().__init__()
        self.id = "GPQA-Diamond"
        self.repo_id = "Idavidrein/gpqa"
        self.filename = "gpqa_diamond.csv"

    async def setup(self):
        await super().setup()
        self.data_file = hf_hub_download(repo_id=self.repo_id, filename=self.filename, repo_type="dataset")

    async def get_dataset(self) -> pd.DataFrame:
        with open(self.data_file, 'r', encoding='utf-8') as f:
            reader = csv.DictReader(f)
            return pd.DataFrame(list(reader))

    def get_question(self, row: pd.Series) -> Tuple[str, List[str]]:
        question = row['Question']
        correct_answer = row['Correct Answer']
        incorrect_answers = [
            row['Incorrect Answer 1'],
            row['Incorrect Answer 2'],
            row['Incorrect Answer 3']
        ]
        options = [correct_answer] + incorrect_answers
        shuffled_options = list(enumerate(options))
        random.shuffle(shuffled_options)

        formatted_question = f"{question}\n\nOptions:\n"
        for i, option in shuffled_options:
            formatted_question += f"- {option}\n"

        return formatted_question, [option for _, option in shuffled_options]

    def get_correct_answer(self, row: pd.Series) -> str:
        return row['Correct Answer']

    def check_answer(self, model_answer: str, correct_answer: str) -> bool:
        try:
            # Normalize both sides so capitalization does not affect the comparison
            return model_answer.strip().lower() == correct_answer.strip().lower()
        except (ValueError, IndexError):
            return False

    def process_question(self, model: str, client, row, progress_bar):
        question, self.shuffled_options = self.get_question(row)
        return super().process_question(model, client, row, progress_bar)
--------------------------------------------------------------------------------
/benchmarks/hellaswag_benchmark.py:
--------------------------------------------------------------------------------
from benchmarks.base_benchmark import BaseBenchmark
import pandas as pd
import os
import numpy as np

class HellaSwagBenchmark(BaseBenchmark):
    def __init__(self):
        super().__init__()
        self.id = "HellaSwag"
        self.data_url = "https://huggingface.co/api/datasets/Rowan/hellaswag/parquet/default/validation/0.parquet"

    async def setup(self):
        await super().setup()
        file_path = os.path.join(self.temp_dir, "hellaswag_validation.parquet")
        await self.download_file(self.data_url, file_path)
        self.data = pd.read_parquet(file_path)

    async def get_dataset(self) -> pd.DataFrame:
        return self.data

    def get_question(self, row: pd.Series) -> str:
        ctx = row['ctx']
        endings = row['endings']

        if isinstance(endings, np.ndarray):
            endings = endings.tolist()

        formatted_question = f"{ctx}\n\nOptions:\n"
        for i, ending in enumerate(endings):
            formatted_question += f"{i}. {ending}\n"
        formatted_question += f"\nFinal answer should be the number of the correct option."
        return formatted_question

    def get_correct_answer(self, row: pd.Series) -> str:
        return row['label']

    def check_answer(self, model_answer: str, correct_answer: str) -> bool:
        return model_answer.strip() == correct_answer.strip()

    def construct_prompt(self, question: str) -> str:
        prompt = f"{question}\n\n"
        return self.append_answer_instruction(prompt)

    @staticmethod
    def append_answer_instruction(prompt: str) -> str:
        return prompt + "Please reason through the context and options carefully. After your reasoning, provide your answer as a single number corresponding to the chosen option. For example: [answer]0[/answer]"
--------------------------------------------------------------------------------
/benchmarks/livebench_benchmark.py:
--------------------------------------------------------------------------------
import aiohttp
import pandas as pd
from io import StringIO
from benchmarks.base_benchmark import BaseBenchmark

class LiveBenchCSVBenchmark(BaseBenchmark):
    def __init__(self):
        super().__init__()
        self.id = "LiveBench"
        self.url = "https://livebench.ai/table_2024_07_26.csv"
        self.model_mapping = {
            "openai/gpt-3.5-turbo-0125": "gpt-3.5-turbo-0125",
            "openai/gpt-4o-mini-2024-07-18": "gpt-4o-mini-2024-07-18",
            "openai/gpt-4o-2024-08-06": "chatgpt-4o-latest",
            "anthropic/claude-3-sonnet": "claude-3-sonnet-20240229",
            "anthropic/claude-3-opus": "claude-3-opus-20240229",
            "anthropic/claude-3.5-sonnet": "claude-3-5-sonnet-20240620",
            "meta-llama/llama-3.1-70b-instruct": "meta-llama-3.1-70b-instruct-turbo",
            "meta-llama/llama-3.1-405b-instruct": "meta-llama-3.1-405b-instruct-turbo",
            "mistralai/mistral-large": "mistral-large-2407"
        }

    async def setup(self):
        await super().setup()
        async with aiohttp.ClientSession() as session:
            async with session.get(self.url) as response:
                csv_content = await response.text()

        df = pd.read_csv(StringIO(csv_content))
        df['average_score'] = df.iloc[:, 1:].mean(axis=1)
        self.df = df[['model', 'average_score']]

    async def get_dataset(self) -> pd.DataFrame:
        return self.df

    def get_question(self, row: pd.Series) -> str:
        return f"What is the LiveBench score for {row['model']}?"

    def get_correct_answer(self, row: pd.Series) -> str:
        return str(row['average_score'])

    def check_answer(self, model_answer: str, correct_answer: str) -> bool:
        try:
            return abs(float(model_answer) - float(correct_answer)) < 0.01
        except ValueError:
            return False

    async def run(self, model: str, client, df: pd.DataFrame = None, samples: int = None) -> float:
        livebench_model = self.model_mapping.get(model, model)
        if livebench_model in df['model'].values:
            score = df[df['model'] == livebench_model]['average_score'].values[0]
            return score / 100.0  # Convert percentage to decimal
        else:
            print(f"Warning: No data available for model {model}")
            return 0.0
--------------------------------------------------------------------------------
/benchmarks/math_hard_benchmark.py:
--------------------------------------------------------------------------------
import os
import json
import asyncio
import aiofiles
import pandas as pd
from benchmarks.base_benchmark import BaseBenchmark

class MathHardBenchmark(BaseBenchmark):
    def __init__(self):
        super().__init__()
        self.id = "MATH-Hard"
        self.base_url = "https://huggingface.co/datasets/lighteval/MATH-Hard/resolve/main/test/"
        self.subtests = [
            "algebra.jsonl",
            "counting_and_probability.jsonl",
            "geometry.jsonl",
            "intermediate_algebra.jsonl",
            "number_theory.jsonl",
            "prealgebra.jsonl",
            "precalculus.jsonl"
        ]
        self.BOXED_COMMAND = '\\boxed{'

    async def setup(self):
        await super().setup()
        await self.download_subtests()

    async def download_subtests(self):
        tasks = [self.download_file(self.base_url + subtest, os.path.join(self.temp_dir, subtest))
                 for subtest in self.subtests]
        await asyncio.gather(*tasks)

    async def get_dataset(self) -> pd.DataFrame:
        all_problems = []
        for subtest in self.subtests:
            file_path = os.path.join(self.temp_dir, subtest)
            async with aiofiles.open(file_path, mode='r') as file:
                content = await file.read()
                problems = json.loads(content)
                all_problems.extend(problems)
        return pd.DataFrame(all_problems)

    def get_question(self, row: pd.Series) -> str:
        return row['problem']

    def get_correct_answer(self, row: pd.Series) -> str:
        solution = row['solution']
        try:
            boxed_content = solution.split("\\boxed{")[1]
            if "}$" in boxed_content:
                boxed_content = boxed_content.split("}$")[0]
            elif "}." in boxed_content:
                boxed_content = boxed_content.split("}.")[0]
            return boxed_content
        except IndexError:
            return None

    def check_answer(self, model_answer: str, correct_answer: str) -> bool:
        return model_answer.strip().lower() == correct_answer.strip().lower()

    def construct_prompt(self, question: str) -> str:
        prompt = f"Solve the following mathematics problem:\n\n{question}\n\n"
        prompt += "Don't enclose the answer in parentheses and don't add units to the answer.\n"
        prompt += "But still ONLY WRITE the answer in LATEX script!\n"
        prompt += "Example: [answer]42[/answer]\n"
        prompt += "Don't add dollar signs before or after answer!\n"
        return self.append_answer_instruction(prompt)
--------------------------------------------------------------------------------
/benchmarks/mgsm_benchmark.py:
--------------------------------------------------------------------------------
import os
import pandas as pd
import git
import tempfile
import re
from benchmarks.base_benchmark import BaseBenchmark

class MGSMBenchmark(BaseBenchmark):
    def __init__(self):
        super().__init__()
        self.id = "MGSM"
        self.repo_url = "https://huggingface.co/datasets/juletxara/mgsm"
        self.temp_dir = tempfile.mkdtemp()
        self.repo_path = os.path.join(self.temp_dir, "mgsm")
        self.repo = None

    async def setup(self):
        await super().setup()
        if not os.path.exists(self.repo_path):
            self.repo = git.Repo.clone_from(self.repo_url, self.repo_path)
        else:
            self.repo = git.Repo(self.repo_path)
            origin = self.repo.remotes.origin
            origin.pull()

    async def get_dataset(self) -> pd.DataFrame:
        all_questions = []
        for file in os.listdir(self.repo_path):
            if file.endswith('.tsv'):
                file_path = os.path.join(self.repo_path, file)
                df = pd.read_csv(file_path, sep='\t', names=['question', 'answer_number'], quoting=3)
                all_questions.append(df)

        return pd.concat(all_questions, ignore_index=True)

    def get_question(self, row: pd.Series) -> str:
        return row['question']

    def get_correct_answer(self, row: pd.Series) -> str:
        return str(row['answer_number'])

    def extract_number(self, text: str) -> float:
        # Keep only digits and decimal points
        cleaned_text = ''.join(char for char in text if char in '0123456789.')
        if cleaned_text:
            return float(cleaned_text)
        else:
            raise ValueError("No numeric value found in the answer")

    def check_answer(self, model_answer: str, correct_answer: str) -> bool:
        try:
            model_value = self.extract_number(model_answer)
            correct_value = float(correct_answer)
            return abs(model_value - correct_value) < 1e-6  # Allow for small floating-point differences
        except ValueError:
            return False  # If we can't extract or convert to float, consider it incorrect
--------------------------------------------------------------------------------
/benchmarks/mmlu_pro_benchmark.py:
--------------------------------------------------------------------------------
from benchmarks.base_benchmark import BaseBenchmark
import pandas as pd
import os

class MMLUProBenchmark(BaseBenchmark):
    def __init__(self):
        super().__init__()
        self.id = "MMLU-Pro"
        self.data_url = "https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro/resolve/main/data/test-00000-of-00001.parquet"

    async def get_dataset(self) -> pd.DataFrame:
        return pd.read_parquet(os.path.join(self.temp_dir, self.data_file))

    def get_question(self, row: pd.Series) -> str:
        question = row['question']
        options = row['options']
        formatted_question = f"{question}\n\nOptions:\n"
        for i, option in enumerate(options):
            formatted_question += f"{chr(65 + i)}. {option}\n"
        formatted_question += f"Final answer should be the single letter you choose."
        return formatted_question

    def get_correct_answer(self, row: pd.Series) -> str:
        return row['answer']

    def check_answer(self, model_answer: str, correct_answer: str) -> bool:
        return model_answer.strip().upper() == correct_answer
--------------------------------------------------------------------------------
/benchmarks/musr_benchmark.py:
--------------------------------------------------------------------------------
from benchmarks.base_benchmark import BaseBenchmark
import pandas as pd
import os
import ast

class MuSRBenchmark(BaseBenchmark):
    def __init__(self):
        super().__init__()
        self.id = "MuSR"
        self.data_url = "https://huggingface.co/datasets/TAUR-Lab/MuSR/resolve/main/all.csv"

    async def get_dataset(self) -> pd.DataFrame:
        file_path = os.path.join(self.temp_dir, self.data_file)
        df = pd.read_csv(file_path)
        df['choices'] = df['choices'].apply(ast.literal_eval)
        return df

    def get_question(self, row: pd.Series) -> str:
        narrative = row['narrative']
        question = row['question']
        choices = row['choices']

        formatted_question = f"{narrative}\n\n{question}\n\nOptions:\n"
        for i, choice in enumerate(choices):
            formatted_question += f"{i+1}. {choice}\n"
        formatted_question += f"\nFinal answer should be the number of the correct option."
        return formatted_question

    def get_correct_answer(self, row: pd.Series) -> str:
        return str(row['answer_index'] + 1)  # Adding 1 because answer_index is 0-based

    def check_answer(self, model_answer: str, correct_answer: str) -> bool:
        return model_answer.strip() == correct_answer

    def construct_prompt(self, question: str) -> str:
        prompt = f"{question}\n\n"
        return self.append_answer_instruction(prompt)

    @staticmethod
    def append_answer_instruction(prompt: str) -> str:
        return prompt + "Please reason through the narrative and question carefully. After your reasoning, provide your answer as a single number corresponding to the chosen option. For example: [answer]1[/answer]"
--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
from benchmark_suite import BenchmarkSuite
from model import Model
import asyncio
import sys

async def main():
    suite = BenchmarkSuite()

    # Create Model instances using OpenRouter model ids and model release dates
    models = [
        Model("openai/gpt-3.5-turbo-0125", "2024-01-24"),
        Model("openai/gpt-4o-mini-2024-07-18", "2024-07-18"),
        Model("openai/gpt-4o-2024-08-06", "2024-08-06"),
        Model("anthropic/claude-3-sonnet", "2024-02-29"),
        Model("anthropic/claude-3.5-sonnet", "2024-06-20"),
        Model("meta-llama/llama-3.1-70b-instruct", "2024-07-23"),
        Model("meta-llama/llama-3.1-405b-instruct", "2024-07-23"),
        Model("mistralai/mistral-large", "2024-07-24")
    ]

    # Specify which benchmarks to run
    benchmark_ids = [
        "MMLU-Pro",
        "GPQA-Diamond",
        "ChatbotArena",
        "MATH-Hard",
        "MuSR",
        "ARC-Challenge",
        "HellaSwag",
        "LiveBench",
        "MGSM"
    ]

    # Specify the number of samples to draw from each benchmark
    samples_per_benchmark = 100

    # Run the benchmarks
    results = await suite.run(models, benchmark_ids, samples_per_benchmark)

    # Print the results
    suite.print_results(results)

    # Save results to JSON
    suite.save_results_to_json(results)

if __name__ == "__main__":
    # The selector event loop policy only exists on Windows; guard it so the
    # script also runs on Linux and macOS.
    if sys.platform == "win32":
        asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
    asyncio.run(main())
--------------------------------------------------------------------------------
/model.py:
--------------------------------------------------------------------------------
class Model:
    def __init__(self, id: str, release_date: str):
        self.id = id
        self.release_date = release_date
--------------------------------------------------------------------------------