├── .gitignore ├── LICENSE ├── README.md ├── api_handler.py ├── benchmark_suite.py ├── benchmarks ├── arc_challenge_benchmark.py ├── base_benchmark.py ├── chatbot_arena_benchmark.py ├── gpqa_diamond_benchmark.py ├── hellaswag_benchmark.py ├── livebench_benchmark.py ├── math_hard_benchmark.py ├── mgsm_benchmark.py ├── mmlu_pro_benchmark.py └── musr_benchmark.py ├── main.py └── model.py /.gitignore: -------------------------------------------------------------------------------- 1 | __pycache__/api_handler.cpython-310.pyc 2 | benchmarks/__pycache__/mmlu_pro_benchmark.cpython-310.pyc 3 | __pycache__/benchmark_suite.cpython-310.pyc 4 | benchmarks/__pycache__/base_benchmark.cpython-310.pyc 5 | benchmarks/__pycache__/gsm8k_benchmark.cpython-310.pyc 6 | benchmarks/__pycache__/gpqa_diamond_benchmark.cpython-310.pyc 7 | benchmarks/__pycache__/drop_benchmark.cpython-310.pyc 8 | .vscode/launch.json 9 | data.json 10 | benchmarks/__pycache__/chatbot_arena_benchmark.cpython-310.pyc 11 | __pycache__/model.cpython-310.pyc 12 | benchmarks/__pycache__/math_hard_benchmark.cpython-310.pyc 13 | benchmarks/__pycache__/musr_benchmark.cpython-310.pyc 14 | benchmarks/__pycache__/arc_challenge_benchmark.cpython-310.pyc 15 | benchmarks/__pycache__/hellaswag_benchmark.cpython-310.pyc 16 | benchmarks/__pycache__/livebench_benchmark.cpython-310.pyc 17 | benchmarks/__pycache__/mgsm_benchmark.cpython-310.pyc 18 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # BenchmarkAggregator 🚀 2 | 3 | [![MIT License](https://img.shields.io/badge/License-MIT-green.svg)](https://choosealicense.com/licenses/mit/) 4 | [![Contributions Welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat)](https://github.com/yourusername/benchmarkaggregator/issues) 5 | 6 | Rigorous, unbiased, and scalable LLM evaluations across diverse AI benchmarks, from GPQA Diamond to Chatbot Arena, testing all major models equally. 
7 | 8 | BenchmarkAggregator Dashboard 9 | 10 | [View Leaderboard](https://benchmark-aggregator-lvss.vercel.app/) | [Features](#🌟-features) | [Benchmarks](#🏆-current-benchmarks) | [FAQ](#🤔-faq) 11 | 12 | ## 🎯 Introduction 13 | 14 | The BenchmarkAggregator framework serves as a **central** hub, addressing the critical need for consistent model evaluation in the AI community. By providing comprehensive comparisons of Large Language Models (LLMs) across all challenging, well-respected benchmarks in one unified location, it offers a holistic, fair, and scalable view of model performance. Our approach balances depth of evaluation with resource constraints, ensuring fair comparisons while maintaining practicality and accessibility from a single, authoritative source. 15 | 16 | ## 📊 Model Performance Overview 17 | 18 | | Model | Average Score | 19 | |-------|---------------| 20 | | gpt-4o-2024-08-06 | 69.0 | 21 | | claude-3.5-sonnet | 66.2 | 22 | | gpt-4o-mini-2024-07-18 | 62.1 | 23 | | mistral-large | 61.4 | 24 | | llama-3.1-405b-instruct | 59.8 | 25 | | llama-3.1-70b-instruct | 58.4 | 26 | | claude-3-sonnet | 53.2 | 27 | | gpt-3.5-turbo-0125 | 34.8 | 28 | 29 | For detailed scores across all benchmarks, visit our [leaderboard](https://benchmark-aggregator-lvss.vercel.app/). 30 | 31 | ## 🌟 Features 32 | 33 | 1. 🏆 Incorporates top, most respected benchmarks in the AI community 34 | 2. 📊 Balanced evaluation using 100 randomly drawn samples per benchmark (adjustable) 35 | 3. 🔌 Quick and easy integration of new benchmarks and models (uses [OpenRouter](https://openrouter.ai/), making the addition of new models absolutely trivial) 36 | 4. 📈 Holistic performance view through score averaging across diverse tasks 37 | 5. ⚖️ Efficient approach balancing evaluation depth with resource constraints 38 | 39 | ## 🏆 Current Benchmarks 40 | 1. [MMLU-Pro](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro) 41 | 2. [GPQA-Diamond](https://huggingface.co/datasets/Idavidrein/gpqa) 42 | 3. [ChatbotArena](https://chat.lmsys.org/) 43 | 4. [MATH-Hard](https://huggingface.co/datasets/lighteval/MATH-Hard) 44 | 5. [MuSR](https://huggingface.co/datasets/TAUR-Lab/MuSR) 45 | 6. [ARC-Challenge](https://huggingface.co/datasets/allenai/ai2_arc) 46 | 7. [HellaSwag](https://rowanzellers.com/hellaswag/) 47 | 8. [LiveBench](https://livebench.ai/) 48 | 9. [MGSM](https://huggingface.co/datasets/juletxara/mgsm) 49 | 50 | 📖 [Learn more about each benchmark on our website](https://benchmark-aggregator-lvss.vercel.app/) 51 | 52 | ## 🤔 FAQ 53 | 54 |
55 | Why not run all questions for each benchmark? 56 | Running all questions for each benchmark would be cost-prohibitive. Our approach balances comprehensive evaluation with practical resource constraints. 57 |
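As a rough, order-of-magnitude illustration (exact totals depend on current OpenRouter pricing): MMLU-Pro's full test split alone contains on the order of 12,000 questions, so exhaustively querying the seven question-based benchmarks with all eight models would take hundreds of thousands of API calls per run, versus roughly 100 questions per benchmark per model with sampling.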
58 | 59 |
60 | How are benchmark samples chosen?
61 | Samples are drawn at random from the full benchmark dataset using a fixed random seed, so every model is evaluated on exactly the same question set (see the sketch below). The leaderboard-style benchmarks (ChatbotArena and LiveBench) are not sampled; their published scores are used as-is.
62 |
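The sketch below restates the sampling step from `benchmark_suite.py` (a simplified excerpt; the helper name is ours): the fixed `random_state` is what guarantees every model sees the identical subset.

```python
import pandas as pd

def sample_benchmark(df: pd.DataFrame, samples_per_benchmark: int | None) -> pd.DataFrame:
    # Fixed seed: the same rows are drawn for every model, keeping scores comparable.
    if samples_per_benchmark is not None and samples_per_benchmark < len(df):
        return df.sample(n=samples_per_benchmark, random_state=1337)
    return df  # small datasets, or no sample limit, are used in full
```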
63 | 64 |
65 | Why are certain models like Claude 3 Opus and GPT-4 Turbo absent?
66 | These models are significantly more expensive to query than the others listed, so they were left out purely to keep benchmark runs affordable.
67 |
68 | 69 |
70 | How easy is it to add new benchmarks or models?
71 | Both are designed to be quick. A new benchmark is another subclass of `BaseBenchmark` dropped into the `benchmarks/` folder, which the suite discovers automatically; integrating an existing dataset usually takes only a few minutes. Models are accessed through OpenRouter, which covers nearly every major closed and open-source option: find the model's ID on the OpenRouter website and add it to the `models` list, as shown in the sketch below.
72 |
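A minimal sketch of what that looks like in `main.py` (the Gemini ID comes from the repo's own model mappings; its release date here is illustrative):

```python
from model import Model

# main.py keeps a plain list of Model(openrouter_id, release_date) entries;
# adding a model is one more element in that list.
models = [
    Model("openai/gpt-4o-2024-08-06", "2024-08-06"),
    Model("google/gemini-pro-1.5-exp", "2024-08-01"),  # newly added model
]
```

For the two leaderboard-based benchmarks (ChatbotArena and LiveBench), the new OpenRouter ID also needs a matching entry in their `model_mapping` dictionaries so its published score can be looked up.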
73 | 74 |
75 | How are the scores from Chatbot Arena calculated?
76 | The ELO ratings are pulled from the official Chatbot Arena leaderboard data and then min-max normalized against all models in that table, mapping each rating to a score between 0 and 1 (see the sketch below).
77 |
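For reference, this mirrors `normalize_scores` in `benchmarks/chatbot_arena_benchmark.py`, which min-max rescales the fetched ELO column:

```python
import pandas as pd

def normalize_scores(df: pd.DataFrame) -> pd.DataFrame:
    # Rescale each model's Arena ELO to [0, 1] relative to the other models in the table.
    min_score, max_score = df['ELO'].min(), df['ELO'].max()
    df['normalized_score'] = (df['ELO'] - min_score) / (max_score - min_score)
    return df
```

A score of 1.0 therefore means the top-rated model in the fetched table, not a perfect benchmark score.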
78 | 79 | 👉 [View more FAQs on our website](https://benchmark-aggregator-lvss.vercel.app/) 80 | 81 | ## 🤝 Contributing 82 | 83 | We welcome contributions from the community! If you have any questions, suggestions, or requests, please don't hesitate to create an issue. Your input is valuable in helping us improve and expand the BenchmarkAggregator. 84 | 85 | ## 📄 License 86 | 87 | This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details. 88 | 89 | ## 🙏 Acknowledgments 90 | 91 | We're grateful to the creators and maintainers of the benchmark datasets used in this project, as well as to [OpenRouter](https://openrouter.ai/) for making model integration seamless. 92 | 93 | --- 94 | 95 |

96 | Made with ❤️ by the AI community 97 |

98 | 99 |

100 | [Website](https://benchmark-aggregator-lvss.vercel.app/)
101 |

102 | -------------------------------------------------------------------------------- /api_handler.py: -------------------------------------------------------------------------------- 1 | import os 2 | import asyncio 3 | from openai import AsyncOpenAI 4 | from aiolimiter import AsyncLimiter 5 | 6 | def get_openrouter_client(): 7 | client = AsyncOpenAI( 8 | base_url="https://openrouter.ai/api/v1", 9 | api_key=os.getenv("OPENROUTER_API_KEY"), 10 | ) 11 | return client 12 | 13 | class RateLimitedClient: 14 | def __init__(self, client, rate_limit): 15 | self.client = client 16 | self.limiter = AsyncLimiter(rate_limit, 1) # rate_limit requests per second 17 | 18 | async def query_model(self, model, prompt, max_retries=15): 19 | for attempt in range(max_retries): 20 | try: 21 | async with self.limiter: 22 | completion = await self.client.chat.completions.create( 23 | model=model, 24 | messages=[ 25 | { 26 | "role": "user", 27 | "content": prompt, 28 | }, 29 | ], 30 | ) 31 | return completion.choices[0].message.content 32 | except Exception as e: 33 | if attempt < max_retries - 1: 34 | wait_time = 2 ** attempt # Exponential backoff 35 | print(f"Attempt {attempt + 1} failed. Retrying in {wait_time} seconds...") 36 | await asyncio.sleep(wait_time) 37 | else: 38 | print(f"All {max_retries} attempts failed. Last error: {str(e)}") 39 | raise -------------------------------------------------------------------------------- /benchmark_suite.py: -------------------------------------------------------------------------------- 1 | import os 2 | import importlib.util 3 | import asyncio 4 | import json 5 | from typing import List, Dict, Any 6 | from benchmarks.base_benchmark import BaseBenchmark 7 | from api_handler import get_openrouter_client, RateLimitedClient 8 | from model import Model 9 | 10 | class BenchmarkSuite: 11 | def __init__(self): 12 | self.all_benchmarks = self._discover_benchmarks() 13 | self.client = None 14 | self.full_benchmark_data = {} 15 | self.sampled_benchmark_data = {} 16 | self.samples_per_benchmark = None 17 | 18 | def _discover_benchmarks(self): 19 | discovered_benchmarks = {} 20 | benchmark_dir = os.path.join(os.path.dirname(__file__), 'benchmarks') 21 | 22 | for filename in os.listdir(benchmark_dir): 23 | if filename.endswith('.py') and filename != 'base_benchmark.py': 24 | module_name = filename[:-3] # Remove .py extension 25 | module_path = os.path.join(benchmark_dir, filename) 26 | 27 | spec = importlib.util.spec_from_file_location(module_name, module_path) 28 | module = importlib.util.module_from_spec(spec) 29 | spec.loader.exec_module(module) 30 | 31 | for item_name in dir(module): 32 | item = getattr(module, item_name) 33 | if isinstance(item, type) and issubclass(item, BaseBenchmark) and item != BaseBenchmark: 34 | discovered_benchmarks[item().id] = item 35 | 36 | return discovered_benchmarks 37 | 38 | async def run(self, models: List[Model], benchmark_ids: List[str] = None, samples_per_benchmark: int = None) -> Dict[str, Dict[str, Any]]: 39 | self.samples_per_benchmark = samples_per_benchmark 40 | benchmarks_to_run = {bid: self.all_benchmarks[bid] for bid in benchmark_ids if bid in self.all_benchmarks} 41 | if len(benchmarks_to_run) != len(benchmark_ids): 42 | missing = set(benchmark_ids) - set(benchmarks_to_run.keys()) 43 | print(f"Warning: The following benchmarks were not found: {missing}") 44 | 45 | openai_client = get_openrouter_client() 46 | self.client = RateLimitedClient(openai_client, rate_limit=5) 47 | 48 | # Load benchmark data once 49 | await 
self._load_benchmark_data(benchmarks_to_run, samples_per_benchmark) 50 | 51 | results = {model.id: {"releaseDate": model.release_date, "benchmarks": {}} for model in models} 52 | tasks = [] 53 | 54 | for model in models: 55 | for benchmark_id, benchmark_class in benchmarks_to_run.items(): 56 | task = asyncio.create_task(self._run_benchmark(model, benchmark_id, benchmark_class)) 57 | tasks.append(task) 58 | 59 | benchmark_results = await asyncio.gather(*tasks) 60 | 61 | for model, benchmark_id, score in benchmark_results: 62 | results[model.id]["benchmarks"][benchmark_id] = score 63 | 64 | return results 65 | 66 | async def _load_benchmark_data(self, benchmarks_to_run, samples_per_benchmark): 67 | for benchmark_id, benchmark_class in benchmarks_to_run.items(): 68 | if benchmark_id not in self.full_benchmark_data: 69 | benchmark = benchmark_class() 70 | await benchmark.setup() 71 | df = await benchmark.get_dataset() 72 | self.full_benchmark_data[benchmark_id] = df 73 | 74 | if benchmark_id not in ["ChatbotArena", "LiveBench"] and samples_per_benchmark is not None and samples_per_benchmark < len(df): 75 | sampled_df = df.sample(n=samples_per_benchmark, random_state=1337) 76 | else: 77 | sampled_df = df 78 | 79 | self.sampled_benchmark_data[benchmark_id] = sampled_df 80 | await benchmark.cleanup() 81 | 82 | async def _run_benchmark(self, model: Model, benchmark_id: str, benchmark_class): 83 | benchmark = benchmark_class() 84 | benchmark.df = self.sampled_benchmark_data[benchmark_id] # Use sampled data for running 85 | try: 86 | score = await benchmark.run(model.id, self.client, self.sampled_benchmark_data[benchmark_id]) 87 | return model, benchmark_id, score 88 | finally: 89 | pass 90 | 91 | def print_results(self, results: Dict[str, Dict[str, Any]]): 92 | for model_id, model_data in results.items(): 93 | print(f"Results for model: {model_id} (Release Date: {model_data['releaseDate']})") 94 | for benchmark_id, score in model_data['benchmarks'].items(): 95 | print(f" {benchmark_id}: {score:.2%}") 96 | 97 | def save_results_to_json(self, results: Dict[str, Dict[str, Any]], filename='data.json'): 98 | formatted_results = [] 99 | for model_id, model_data in results.items(): 100 | model_result = { 101 | "model": model_id, 102 | "releaseDate": model_data['releaseDate'], 103 | "benchmarks": [] 104 | } 105 | for benchmark_id, score in model_data['benchmarks'].items(): 106 | benchmark_info = { 107 | "name": benchmark_id, 108 | "score": round(score * 100, 2) 109 | } 110 | 111 | if benchmark_id not in ["ChatbotArena", "LiveBench"]: 112 | total_samples = len(self.full_benchmark_data[benchmark_id]) 113 | drawn_samples = len(self.sampled_benchmark_data[benchmark_id]) 114 | benchmark_info["samplesDrawn"] = drawn_samples 115 | benchmark_info["totalSamples"] = total_samples 116 | else: 117 | benchmark_info["samplesDrawn"] ="N/A" 118 | benchmark_info["totalSamples"] ="N/A" 119 | 120 | model_result["benchmarks"].append(benchmark_info) 121 | 122 | formatted_results.append(model_result) 123 | 124 | with open(filename, 'w') as f: 125 | json.dump(formatted_results, f, indent=2) 126 | 127 | print(f"Results saved to {filename}") -------------------------------------------------------------------------------- /benchmarks/arc_challenge_benchmark.py: -------------------------------------------------------------------------------- 1 | from benchmarks.base_benchmark import BaseBenchmark 2 | import pandas as pd 3 | import os 4 | 5 | class ARCChallengeBenchmark(BaseBenchmark): 6 | def __init__(self): 7 | super().__init__() 8 
| self.id = "ARC-Challenge" 9 | self.data_url = "https://huggingface.co/datasets/allenai/ai2_arc/resolve/main/ARC-Challenge/validation-00000-of-00001.parquet" 10 | 11 | async def setup(self): 12 | await super().setup() 13 | file_path = os.path.join(self.temp_dir, "arc_challenge_validation.parquet") 14 | await self.download_file(self.data_url, file_path) 15 | self.data = pd.read_parquet(file_path) 16 | 17 | async def get_dataset(self) -> pd.DataFrame: 18 | return self.data 19 | 20 | def get_question(self, row: pd.Series) -> str: 21 | question = row['question'] 22 | choices = row['choices'] 23 | 24 | formatted_question = f"{question}\n\nOptions:\n" 25 | for label, text in zip(choices['label'], choices['text']): 26 | formatted_question += f"{label}. {text}\n" 27 | formatted_question += f"\nFinal answer should be the letter of the correct option." 28 | return formatted_question 29 | 30 | def get_correct_answer(self, row: pd.Series) -> str: 31 | return row['answerKey'] 32 | 33 | def check_answer(self, model_answer: str, correct_answer: str) -> bool: 34 | return model_answer.strip().upper() == correct_answer.strip().upper() 35 | 36 | def construct_prompt(self, question: str) -> str: 37 | prompt = f"{question}\n\n" 38 | return self.append_answer_instruction(prompt) 39 | 40 | @staticmethod 41 | def append_answer_instruction(prompt: str) -> str: 42 | return prompt + "Please reason through the question and options carefully. After your reasoning, provide your answer as a single letter corresponding to the chosen option. For example: [answer]A[/answer]" -------------------------------------------------------------------------------- /benchmarks/base_benchmark.py: -------------------------------------------------------------------------------- 1 | from abc import ABC, abstractmethod 2 | import os 3 | import tempfile 4 | import shutil 5 | import aiofiles 6 | import aiohttp 7 | import pandas as pd 8 | from tqdm import tqdm 9 | import asyncio 10 | from typing import Any 11 | from urllib.parse import urlparse 12 | 13 | class BaseBenchmark(ABC): 14 | def __init__(self): 15 | self.id = None 16 | self.temp_dir = None 17 | self.data_url = None 18 | self.data_file = None 19 | 20 | async def setup(self): 21 | self.create_temp_dir() 22 | await self.download_data() 23 | 24 | @abstractmethod 25 | async def get_dataset(self) -> pd.DataFrame: 26 | pass 27 | 28 | @abstractmethod 29 | def get_question(self, row: pd.Series) -> str: 30 | pass 31 | 32 | @abstractmethod 33 | def get_correct_answer(self, row: pd.Series) -> Any: 34 | pass 35 | 36 | @abstractmethod 37 | def check_answer(self, model_answer: str, correct_answer: Any) -> bool: 38 | pass 39 | 40 | async def run(self, model: str, client, df: pd.DataFrame) -> float: 41 | total_questions = len(df) 42 | 43 | print(f"Starting {self.id} benchmark for model: {model}") 44 | print(f"Total questions: {total_questions}") 45 | 46 | progress_bar = tqdm(total=total_questions, desc="Progress", unit="question") 47 | 48 | tasks = [] 49 | for _, row in df.iterrows(): 50 | task = self.process_question(model, client, row, progress_bar) 51 | tasks.append(task) 52 | 53 | results = await asyncio.gather(*tasks) 54 | 55 | correct_answers = sum(results) 56 | final_score = correct_answers / total_questions 57 | 58 | progress_bar.close() 59 | print(f"\nFinal Score for {model} on {self.id}: {final_score:.2%}") 60 | 61 | return final_score 62 | 63 | async def process_question(self, model: str, client, row, progress_bar): 64 | question = self.get_question(row) 65 | prompt = 
self.construct_prompt(question) 66 | 67 | model_response = await client.query_model(model, prompt) 68 | model_answer = self.parse_model_answer(model_response) 69 | 70 | correct_answer = self.get_correct_answer(row) 71 | is_correct = self.check_answer(model_answer, correct_answer) 72 | 73 | progress_bar.update(1) 74 | return int(is_correct) 75 | 76 | def construct_prompt(self, question: str) -> str: 77 | prompt = f"{question}\n\n" 78 | return self.append_answer_instruction(prompt) 79 | 80 | def parse_model_answer(self, response: str) -> str: 81 | start_tag = "[answer]" 82 | end_tag = "[/answer]" 83 | start_index = response.find(start_tag) 84 | end_index = response.find(end_tag) 85 | if start_index != -1 and end_index != -1: 86 | return response[start_index + len(start_tag):end_index].strip() 87 | else: 88 | return response # Return the full response if tags are not found 89 | 90 | @staticmethod 91 | def append_answer_instruction(prompt: str) -> str: 92 | return prompt + "Please reason through the question and options. After your reasoning, provide your answer enclosed in [answer] tags. For example: [answer]Your answer here[/answer]" 93 | 94 | async def cleanup(self): 95 | self.remove_temp_dir() 96 | 97 | def create_temp_dir(self): 98 | self.temp_dir = tempfile.mkdtemp() 99 | 100 | def remove_temp_dir(self): 101 | if self.temp_dir and os.path.exists(self.temp_dir): 102 | shutil.rmtree(self.temp_dir) 103 | 104 | async def download_data(self): 105 | if not self.data_url: 106 | return 107 | parsed_url = urlparse(self.data_url) 108 | self.data_file = os.path.basename(parsed_url.path) 109 | local_path = os.path.join(self.temp_dir, self.data_file) 110 | await self.download_file(self.data_url, local_path) 111 | 112 | @staticmethod 113 | async def download_file(url: str, local_path: str): 114 | try: 115 | async with aiohttp.ClientSession() as session: 116 | async with session.get(url) as response: 117 | response.raise_for_status() 118 | async with aiofiles.open(local_path, 'wb') as file: 119 | while True: 120 | chunk = await response.content.read(8192) 121 | if not chunk: 122 | break 123 | await file.write(chunk) 124 | print(f"File downloaded successfully and saved to {local_path}") 125 | except Exception as e: 126 | print(f"An error occurred while downloading the file: {e}") 127 | raise -------------------------------------------------------------------------------- /benchmarks/chatbot_arena_benchmark.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import os 3 | import git 4 | import pickle 5 | import tempfile 6 | import shutil 7 | import asyncio 8 | from datetime import datetime 9 | from benchmarks.base_benchmark import BaseBenchmark 10 | 11 | class ChatbotArenaBenchmark(BaseBenchmark): 12 | def __init__(self): 13 | super().__init__() 14 | self.id = "ChatbotArena" 15 | self.repo_url = "https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard" 16 | self.temp_dir = tempfile.mkdtemp() 17 | self.repo_path = os.path.join(self.temp_dir, "chatbot-arena-leaderboard") 18 | self.model_mapping = { 19 | "openai/gpt-3.5-turbo-0125": "gpt-3.5-turbo-0125", 20 | "openai/gpt-4o-mini-2024-07-18": "gpt-4o-mini-2024-07-18", 21 | "openai/gpt-4o-2024-08-06": "chatgpt-4o-latest-2024-08-08", 22 | "anthropic/claude-3-sonnet": "claude-3-sonnet-20240229", 23 | "anthropic/claude-3-opus": "claude-3-opus-20240229", 24 | "anthropic/claude-3.5-sonnet": "claude-3-5-sonnet-20240620", 25 | "google/gemini-pro-1.5-exp": "gemini-1.5-pro-exp-0801", 26 | 
"meta-llama/llama-3.1-70b-instruct": "llama-3.1-70b-instruct", 27 | "meta-llama/llama-3.1-405b-instruct": "llama-3.1-405b-instruct", 28 | "mistralai/mistral-large": "mistral-large-2407" 29 | } 30 | self.repo = None 31 | 32 | async def setup(self): 33 | if not os.path.exists(self.repo_path): 34 | self.repo = git.Repo.clone_from(self.repo_url, self.repo_path) 35 | else: 36 | self.repo = git.Repo(self.repo_path) 37 | origin = self.repo.remotes.origin 38 | origin.pull() 39 | 40 | def get_latest_elo_file(self): 41 | elo_files = [f for f in os.listdir(self.repo_path) if f.startswith("elo_results_") and f.endswith(".pkl")] 42 | return max(elo_files, key=lambda x: datetime.strptime(x.split("_")[2].split(".")[0], "%Y%m%d")) 43 | 44 | async def get_dataset(self): 45 | latest_file = self.get_latest_elo_file() 46 | with open(os.path.join(self.repo_path, latest_file), 'rb') as f: 47 | data = pickle.load(f) 48 | 49 | full_elo_data = data['text']['full']['elo_rating_final'] 50 | df = pd.DataFrame(list(full_elo_data.items()), columns=['Model', 'ELO']) 51 | return df 52 | 53 | def normalize_scores(self, df): 54 | min_score = df['ELO'].min() 55 | max_score = df['ELO'].max() 56 | df['normalized_score'] = (df['ELO'] - min_score) / (max_score - min_score) 57 | return df 58 | 59 | async def run(self, model: str, client, df: pd.DataFrame = None, samples: int = None) -> float: 60 | df = self.normalize_scores(df) 61 | 62 | df_sorted = df.sort_values('ELO', ascending=False) 63 | df_sorted['normalized_score'] = df_sorted['normalized_score'].round(4) 64 | 65 | # For debugging 66 | #print(tabulate(df_sorted, headers='keys', tablefmt='grid')) 67 | 68 | arena_model = self.model_mapping.get(model, model) 69 | if arena_model in df['Model'].values: 70 | normalized_score = df[df['Model'] == arena_model]['normalized_score'].values[0] 71 | rounded_normalized_score = round(normalized_score, 4) 72 | return rounded_normalized_score 73 | else: 74 | print(f"Warning: No data available for model {model}") 75 | return 0.0 76 | 77 | async def cleanup(self): 78 | if self.repo: 79 | self.repo.close() 80 | await asyncio.sleep(1) # Give a moment for any pending operations to complete 81 | if os.path.exists(self.temp_dir): 82 | shutil.rmtree(self.temp_dir, ignore_errors=True) 83 | 84 | # These methods are not used but are required by the BaseBenchmark abstract class 85 | def get_question(self, row: pd.Series) -> str: 86 | pass 87 | 88 | def get_correct_answer(self, row: pd.Series) -> str: 89 | pass 90 | 91 | def check_answer(self, model_answer: str, correct_answer: str) -> bool: 92 | pass -------------------------------------------------------------------------------- /benchmarks/gpqa_diamond_benchmark.py: -------------------------------------------------------------------------------- 1 | import csv 2 | import random 3 | from typing import List, Tuple 4 | from huggingface_hub import hf_hub_download 5 | from benchmarks.base_benchmark import BaseBenchmark 6 | import pandas as pd 7 | 8 | class GPQADiamondBenchmark(BaseBenchmark): 9 | def __init__(self): 10 | super().__init__() 11 | self.id = "GPQA-Diamond" 12 | self.repo_id = "Idavidrein/gpqa" 13 | self.filename = "gpqa_diamond.csv" 14 | 15 | async def setup(self): 16 | await super().setup() 17 | self.data_file = hf_hub_download(repo_id=self.repo_id, filename=self.filename, repo_type="dataset") 18 | 19 | async def get_dataset(self) -> pd.DataFrame: 20 | with open(self.data_file, 'r', encoding='utf-8') as f: 21 | reader = csv.DictReader(f) 22 | return pd.DataFrame(list(reader)) 23 | 24 | def 
get_question(self, row: pd.Series) -> Tuple[str, List[str]]: 25 | question = row['Question'] 26 | correct_answer = row['Correct Answer'] 27 | incorrect_answers = [ 28 | row['Incorrect Answer 1'], 29 | row['Incorrect Answer 2'], 30 | row['Incorrect Answer 3'] 31 | ] 32 | options = [correct_answer] + incorrect_answers 33 | shuffled_options = list(enumerate(options)) 34 | random.shuffle(shuffled_options) 35 | 36 | formatted_question = f"{question}\n\nOptions:\n" 37 | for i, option in shuffled_options: 38 | formatted_question += f"- {option}\n" 39 | 40 | return formatted_question, [option for _, option in shuffled_options] 41 | 42 | def get_correct_answer(self, row: pd.Series) -> str: 43 | return row['Correct Answer'] 44 | 45 | def check_answer(self, model_answer: str, correct_answer: str) -> bool: 46 | try: 47 | return model_answer.strip().lower() == correct_answer 48 | except (ValueError, IndexError): 49 | return False 50 | 51 | def process_question(self, model: str, client, row, progress_bar): 52 | question, self.shuffled_options = self.get_question(row) 53 | return super().process_question(model, client, row, progress_bar) -------------------------------------------------------------------------------- /benchmarks/hellaswag_benchmark.py: -------------------------------------------------------------------------------- 1 | from benchmarks.base_benchmark import BaseBenchmark 2 | import pandas as pd 3 | import os 4 | import numpy as np 5 | 6 | class HellaSwagBenchmark(BaseBenchmark): 7 | def __init__(self): 8 | super().__init__() 9 | self.id = "HellaSwag" 10 | self.data_url = "https://huggingface.co/api/datasets/Rowan/hellaswag/parquet/default/validation/0.parquet" 11 | 12 | async def setup(self): 13 | await super().setup() 14 | file_path = os.path.join(self.temp_dir, "hellaswag_validation.parquet") 15 | await self.download_file(self.data_url, file_path) 16 | self.data = pd.read_parquet(file_path) 17 | 18 | async def get_dataset(self) -> pd.DataFrame: 19 | return self.data 20 | 21 | def get_question(self, row: pd.Series) -> str: 22 | ctx = row['ctx'] 23 | endings = row['endings'] 24 | 25 | if isinstance(endings, np.ndarray): 26 | endings = endings.tolist() 27 | 28 | formatted_question = f"{ctx}\n\nOptions:\n" 29 | for i, ending in enumerate(endings): 30 | formatted_question += f"{i}. {ending}\n" 31 | formatted_question += f"\nFinal answer should be the number of the correct option." 32 | return formatted_question 33 | 34 | def get_correct_answer(self, row: pd.Series) -> str: 35 | return row['label'] 36 | 37 | def check_answer(self, model_answer: str, correct_answer: str) -> bool: 38 | return model_answer.strip() == correct_answer.strip() 39 | 40 | def construct_prompt(self, question: str) -> str: 41 | prompt = f"{question}\n\n" 42 | return self.append_answer_instruction(prompt) 43 | 44 | @staticmethod 45 | def append_answer_instruction(prompt: str) -> str: 46 | return prompt + "Please reason through the context and options carefully. After your reasoning, provide your answer as a single number corresponding to the chosen option. 
For example: [answer]0[/answer]" -------------------------------------------------------------------------------- /benchmarks/livebench_benchmark.py: -------------------------------------------------------------------------------- 1 | import aiohttp 2 | import pandas as pd 3 | from io import StringIO 4 | from benchmarks.base_benchmark import BaseBenchmark 5 | 6 | class LiveBenchCSVBenchmark(BaseBenchmark): 7 | def __init__(self): 8 | super().__init__() 9 | self.id = "LiveBench" 10 | self.url = "https://livebench.ai/table_2024_07_26.csv" 11 | self.model_mapping = { 12 | "openai/gpt-3.5-turbo-0125": "gpt-3.5-turbo-0125", 13 | "openai/gpt-4o-mini-2024-07-18": "gpt-4o-mini-2024-07-18", 14 | "openai/gpt-4o-2024-08-06": "chatgpt-4o-latest", 15 | "anthropic/claude-3-sonnet": "claude-3-sonnet-20240229", 16 | "anthropic/claude-3-opus": "claude-3-opus-20240229", 17 | "anthropic/claude-3.5-sonnet": "claude-3-5-sonnet-20240620", 18 | "meta-llama/llama-3.1-70b-instruct": "meta-llama-3.1-70b-instruct-turbo", 19 | "meta-llama/llama-3.1-405b-instruct": "meta-llama-3.1-405b-instruct-turbo", 20 | "mistralai/mistral-large": "mistral-large-2407" 21 | } 22 | 23 | async def setup(self): 24 | await super().setup() 25 | async with aiohttp.ClientSession() as session: 26 | async with session.get(self.url) as response: 27 | csv_content = await response.text() 28 | 29 | df = pd.read_csv(StringIO(csv_content)) 30 | df['average_score'] = df.iloc[:, 1:].mean(axis=1) 31 | self.df = df[['model', 'average_score']] 32 | 33 | async def get_dataset(self) -> pd.DataFrame: 34 | return self.df 35 | 36 | def get_question(self, row: pd.Series) -> str: 37 | return f"What is the LiveBench score for {row['model']}?" 38 | 39 | def get_correct_answer(self, row: pd.Series) -> str: 40 | return str(row['average_score']) 41 | 42 | def check_answer(self, model_answer: str, correct_answer: str) -> bool: 43 | try: 44 | return abs(float(model_answer) - float(correct_answer)) < 0.01 45 | except ValueError: 46 | return False 47 | 48 | async def run(self, model: str, client, df: pd.DataFrame = None, samples: int = None) -> float: 49 | livebench_model = self.model_mapping.get(model, model) 50 | if livebench_model in df['model'].values: 51 | score = df[df['model'] == livebench_model]['average_score'].values[0] 52 | return score / 100.0 # Convert percentage to decimal 53 | else: 54 | print(f"Warning: No data available for model {model}") 55 | return 0.0 -------------------------------------------------------------------------------- /benchmarks/math_hard_benchmark.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | import asyncio 4 | import aiofiles 5 | import pandas as pd 6 | from benchmarks.base_benchmark import BaseBenchmark 7 | 8 | class MathHardBenchmark(BaseBenchmark): 9 | def __init__(self): 10 | super().__init__() 11 | self.id = "MATH-Hard" 12 | self.base_url = "https://huggingface.co/datasets/lighteval/MATH-Hard/resolve/main/test/" 13 | self.subtests = [ 14 | "algebra.jsonl", 15 | "counting_and_probability.jsonl", 16 | "geometry.jsonl", 17 | "intermediate_algebra.jsonl", 18 | "number_theory.jsonl", 19 | "prealgebra.jsonl", 20 | "precalculus.jsonl" 21 | ] 22 | self.BOXED_COMMAND = '\\boxed{' 23 | 24 | async def setup(self): 25 | await super().setup() 26 | await self.download_subtests() 27 | 28 | async def download_subtests(self): 29 | tasks = [self.download_file(self.base_url + subtest, os.path.join(self.temp_dir, subtest)) 30 | for subtest in self.subtests] 31 | await 
asyncio.gather(*tasks) 32 | 33 | async def get_dataset(self) -> pd.DataFrame: 34 | all_problems = [] 35 | for subtest in self.subtests: 36 | file_path = os.path.join(self.temp_dir, subtest) 37 | async with aiofiles.open(file_path, mode='r') as file: 38 | content = await file.read() 39 | problems = json.loads(content) 40 | all_problems.extend(problems) 41 | return pd.DataFrame(all_problems) 42 | 43 | def get_question(self, row: pd.Series) -> str: 44 | return row['problem'] 45 | 46 | def get_correct_answer(self, row: pd.Series) -> str: 47 | solution = row['solution'] 48 | try: 49 | boxed_content = solution.split("\\boxed{")[1] 50 | if "}$" in boxed_content: 51 | boxed_content = boxed_content.split("}$")[0] 52 | elif "}." in boxed_content: 53 | boxed_content = boxed_content.split("}.")[0] 54 | return boxed_content 55 | except IndexError: 56 | return None 57 | 58 | def check_answer(self, model_answer: str, correct_answer: str) -> bool: 59 | return model_answer.strip().lower() == correct_answer.strip().lower() 60 | 61 | def construct_prompt(self, question: str) -> str: 62 | prompt = f"Solve the following mathematics problem:\n\n{question}\n\n" 63 | prompt += "Don't enclose the answer in parentheses and don't add units to the answer.\n" 64 | prompt += "But still ONLY WRITE the answer in LATEX script!\n" 65 | prompt += "Example: [answer]42[/answer]\n" 66 | prompt += "Don't add dollar signs before or after answer!\n" 67 | return self.append_answer_instruction(prompt) -------------------------------------------------------------------------------- /benchmarks/mgsm_benchmark.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pandas as pd 3 | import git 4 | import tempfile 5 | import re 6 | from benchmarks.base_benchmark import BaseBenchmark 7 | 8 | class MGSMBenchmark(BaseBenchmark): 9 | def __init__(self): 10 | super().__init__() 11 | self.id = "MGSM" 12 | self.repo_url = "https://huggingface.co/datasets/juletxara/mgsm" 13 | self.temp_dir = tempfile.mkdtemp() 14 | self.repo_path = os.path.join(self.temp_dir, "mgsm") 15 | self.repo = None 16 | 17 | async def setup(self): 18 | await super().setup() 19 | if not os.path.exists(self.repo_path): 20 | self.repo = git.Repo.clone_from(self.repo_url, self.repo_path) 21 | else: 22 | self.repo = git.Repo(self.repo_path) 23 | origin = self.repo.remotes.origin 24 | origin.pull() 25 | 26 | async def get_dataset(self) -> pd.DataFrame: 27 | all_questions = [] 28 | for file in os.listdir(self.repo_path): 29 | if file.endswith('.tsv'): 30 | file_path = os.path.join(self.repo_path, file) 31 | df = pd.read_csv(file_path, sep='\t', names=['question', 'answer_number'], quoting=3) 32 | all_questions.append(df) 33 | 34 | return pd.concat(all_questions, ignore_index=True) 35 | 36 | def get_question(self, row: pd.Series) -> str: 37 | return row['question'] 38 | 39 | def get_correct_answer(self, row: pd.Series) -> str: 40 | return str(row['answer_number']) 41 | 42 | def extract_number(self, text: str) -> float: 43 | # Keep only digits and decimal points 44 | cleaned_text = ''.join(char for char in text if char in '0123456789.') 45 | if cleaned_text: 46 | return float(cleaned_text) 47 | else: 48 | raise ValueError("No numeric value found in the answer") 49 | 50 | def check_answer(self, model_answer: str, correct_answer: str) -> bool: 51 | try: 52 | model_value = self.extract_number(model_answer) 53 | correct_value = float(correct_answer) 54 | return abs(model_value - correct_value) < 1e-6 # Allow for small 
floating-point differences 55 | except ValueError: 56 | return False # If we can't extract or convert to float, consider it incorrect -------------------------------------------------------------------------------- /benchmarks/mmlu_pro_benchmark.py: -------------------------------------------------------------------------------- 1 | from benchmarks.base_benchmark import BaseBenchmark 2 | import pandas as pd 3 | import os 4 | 5 | class MMULProBenchmark(BaseBenchmark): 6 | def __init__(self): 7 | super().__init__() 8 | self.id = "MMLU-Pro" 9 | self.data_url = "https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro/resolve/main/data/test-00000-of-00001.parquet" 10 | 11 | async def get_dataset(self) -> pd.DataFrame: 12 | return pd.read_parquet(os.path.join(self.temp_dir, self.data_file)) 13 | 14 | def get_question(self, row: pd.Series) -> str: 15 | question = row['question'] 16 | options = row['options'] 17 | formatted_question = f"{question}\n\nOptions:\n" 18 | for i, option in enumerate(options): 19 | formatted_question += f"{chr(65 + i)}. {option}\n" 20 | formatted_question += f"Final answer should be the single letter you choose." 21 | return formatted_question 22 | 23 | def get_correct_answer(self, row: pd.Series) -> str: 24 | return row['answer'] 25 | 26 | def check_answer(self, model_answer: str, correct_answer: str) -> bool: 27 | return model_answer.strip().upper() == correct_answer -------------------------------------------------------------------------------- /benchmarks/musr_benchmark.py: -------------------------------------------------------------------------------- 1 | from benchmarks.base_benchmark import BaseBenchmark 2 | import pandas as pd 3 | import os 4 | import ast 5 | 6 | class MuSRBenchmark(BaseBenchmark): 7 | def __init__(self): 8 | super().__init__() 9 | self.id = "MuSR" 10 | self.data_url = "https://huggingface.co/datasets/TAUR-Lab/MuSR/resolve/main/all.csv" 11 | 12 | async def get_dataset(self) -> pd.DataFrame: 13 | file_path = os.path.join(self.temp_dir, self.data_file) 14 | df = pd.read_csv(file_path) 15 | df['choices'] = df['choices'].apply(ast.literal_eval) 16 | return df 17 | 18 | def get_question(self, row: pd.Series) -> str: 19 | narrative = row['narrative'] 20 | question = row['question'] 21 | choices = row['choices'] 22 | 23 | formatted_question = f"{narrative}\n\n{question}\n\nOptions:\n" 24 | for i, choice in enumerate(choices): 25 | formatted_question += f"{i+1}. {choice}\n" 26 | formatted_question += f"\nFinal answer should be the number of the correct option." 27 | return formatted_question 28 | 29 | def get_correct_answer(self, row: pd.Series) -> str: 30 | return str(row['answer_index'] + 1) # Adding 1 because answer_index is 0-based 31 | 32 | def check_answer(self, model_answer: str, correct_answer: str) -> bool: 33 | return model_answer.strip() == correct_answer 34 | 35 | def construct_prompt(self, question: str) -> str: 36 | prompt = f"{question}\n\n" 37 | return self.append_answer_instruction(prompt) 38 | 39 | @staticmethod 40 | def append_answer_instruction(prompt: str) -> str: 41 | return prompt + "Please reason through the narrative and question carefully. After your reasoning, provide your answer as a single number corresponding to the chosen option. 
For example: [answer]1[/answer]" -------------------------------------------------------------------------------- /main.py: -------------------------------------------------------------------------------- 1 | from benchmark_suite import BenchmarkSuite 2 | from model import Model 3 | import asyncio 4 | 5 | async def main(): 6 | suite = BenchmarkSuite() 7 | 8 | # Create Model instances using OpenRouter model ids and model release dates 9 | models = [ 10 | Model("openai/gpt-3.5-turbo-0125", "2024-01-24"), 11 | Model("openai/gpt-4o-mini-2024-07-18", "2024-07-18"), 12 | Model("openai/gpt-4o-2024-08-06", "2024-08-06"), 13 | Model("anthropic/claude-3-sonnet", "2024-02-29"), 14 | Model("anthropic/claude-3.5-sonnet", "2024-06-20"), 15 | Model("meta-llama/llama-3.1-70b-instruct", "2024-07-23"), 16 | Model("meta-llama/llama-3.1-405b-instruct", "2024-07-23"), 17 | Model("mistralai/mistral-large", "2024-07-24") 18 | ] 19 | 20 | # Specify which benchmarks to run 21 | benchmark_ids = [ 22 | "MMLU-Pro", 23 | "GPQA-Diamond", 24 | "ChatbotArena", 25 | "MATH-Hard", 26 | "MuSR", 27 | "ARC-Challenge", 28 | "HellaSwag", 29 | "LiveBench", 30 | "MGSM" 31 | ] 32 | 33 | # Specify the number of samples to draw from each benchmark 34 | samples_per_benchmark = 100 35 | 36 | # Run the benchmarks 37 | results = await suite.run(models, benchmark_ids, samples_per_benchmark) 38 | 39 | # Print the results 40 | suite.print_results(results) 41 | 42 | # Save results to JSON 43 | suite.save_results_to_json(results) 44 | 45 | if __name__ == "__main__": 46 | asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy()) 47 | asyncio.run(main()) -------------------------------------------------------------------------------- /model.py: -------------------------------------------------------------------------------- 1 | class Model: 2 | def __init__(self, id: str, release_date: str): 3 | self.id = id 4 | self.release_date = release_date --------------------------------------------------------------------------------