├── .github
│   └── workflows
│       └── tests.yml
├── .gitignore
├── .pre-commit-config.yaml
├── LICENSE
├── README.md
├── bixbench-v1.5_results
│   ├── bixbench_results_comparison.png
│   ├── majority_vote_accuracy_image_comparison.png
│   ├── majority_vote_accuracy_refusal_option_comparison.png
│   ├── zero_shot_baselines.json
│   └── zero_shot_baselines
│       ├── claude-3-5-sonnet-latest-grader-mcq-refusal-False.csv
│       ├── claude-3-5-sonnet-latest-grader-mcq-refusal-True.csv
│       ├── claude-3-5-sonnet-latest-grader-openended.csv
│       ├── gpt-4o-grader-mcq-refusal-False.csv
│       ├── gpt-4o-grader-mcq-refusal-True.csv
│       └── gpt-4o-grader-openended.csv
├── bixbench
│   ├── __init__.py
│   ├── generate_trajectories.py
│   ├── graders.py
│   ├── models.py
│   ├── plot_style.py
│   ├── plotting_utils.py
│   ├── postprocessing.py
│   ├── postprocessing_utils.py
│   ├── prompts.py
│   ├── run_configuration
│   │   ├── 4o_image.yaml
│   │   ├── 4o_no_image.yaml
│   │   ├── bixbench_paper_results.yaml
│   │   ├── claude_image.yaml
│   │   ├── claude_no_image.yaml
│   │   ├── generate_trajectories.yaml
│   │   ├── postprocessing.yaml
│   │   └── v1.5_paper_results.yaml
│   ├── utils.py
│   └── zero_shot.py
├── bixbench_results
│   ├── baseline_eval_data
│   │   ├── bixbench_llm_baseline_refusal_False_mcq_claude-3-5-sonnet-latest_1.0.csv
│   │   ├── bixbench_llm_baseline_refusal_False_mcq_gpt-4o_1.0.csv
│   │   ├── bixbench_llm_baseline_refusal_True_mcq_claude-3-5-sonnet-latest_1.0.csv
│   │   ├── bixbench_llm_baseline_refusal_True_mcq_gpt-4o_1.0.csv
│   │   ├── bixbench_llm_baseline_refusal_True_openended_claude-3-5-sonnet-latest_1.0.csv
│   │   └── bixbench_llm_baseline_refusal_True_openended_gpt-4o_1.0.csv
│   ├── bixbench_results_comparison.png
│   ├── majority_vote_accuracy_image_comparison.png
│   ├── majority_vote_accuracy_refusal_option_comparison.png
│   └── zero_shot_baselines.json
├── generate_zeroshot_evals.py
├── grade_outputs.py
├── pyproject.toml
├── scripts
│   ├── run_agentic.sh
│   └── run_zeroshot.sh
├── tests
│   ├── test_utils.py
│   └── test_zeroshot.py
└── uv.lock

/.github/workflows/tests.yml:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/.github/workflows/tests.yml
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/.gitignore
--------------------------------------------------------------------------------
/.pre-commit-config.yaml:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/.pre-commit-config.yaml
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/LICENSE
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/README.md
--------------------------------------------------------------------------------
/bixbench-v1.5_results/bixbench_results_comparison.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench-v1.5_results/bixbench_results_comparison.png
--------------------------------------------------------------------------------
/bixbench-v1.5_results/majority_vote_accuracy_image_comparison.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench-v1.5_results/majority_vote_accuracy_image_comparison.png
--------------------------------------------------------------------------------
/bixbench-v1.5_results/majority_vote_accuracy_refusal_option_comparison.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench-v1.5_results/majority_vote_accuracy_refusal_option_comparison.png
--------------------------------------------------------------------------------
/bixbench-v1.5_results/zero_shot_baselines.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench-v1.5_results/zero_shot_baselines.json
--------------------------------------------------------------------------------
/bixbench-v1.5_results/zero_shot_baselines/claude-3-5-sonnet-latest-grader-mcq-refusal-False.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench-v1.5_results/zero_shot_baselines/claude-3-5-sonnet-latest-grader-mcq-refusal-False.csv
--------------------------------------------------------------------------------
/bixbench-v1.5_results/zero_shot_baselines/claude-3-5-sonnet-latest-grader-mcq-refusal-True.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench-v1.5_results/zero_shot_baselines/claude-3-5-sonnet-latest-grader-mcq-refusal-True.csv
--------------------------------------------------------------------------------
/bixbench-v1.5_results/zero_shot_baselines/claude-3-5-sonnet-latest-grader-openended.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench-v1.5_results/zero_shot_baselines/claude-3-5-sonnet-latest-grader-openended.csv
--------------------------------------------------------------------------------
/bixbench-v1.5_results/zero_shot_baselines/gpt-4o-grader-mcq-refusal-False.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench-v1.5_results/zero_shot_baselines/gpt-4o-grader-mcq-refusal-False.csv
--------------------------------------------------------------------------------
/bixbench-v1.5_results/zero_shot_baselines/gpt-4o-grader-mcq-refusal-True.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench-v1.5_results/zero_shot_baselines/gpt-4o-grader-mcq-refusal-True.csv
--------------------------------------------------------------------------------
/bixbench-v1.5_results/zero_shot_baselines/gpt-4o-grader-openended.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench-v1.5_results/zero_shot_baselines/gpt-4o-grader-openended.csv
--------------------------------------------------------------------------------
/bixbench/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench/__init__.py
--------------------------------------------------------------------------------
/bixbench/generate_trajectories.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench/generate_trajectories.py
--------------------------------------------------------------------------------
/bixbench/graders.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench/graders.py
--------------------------------------------------------------------------------
/bixbench/models.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench/models.py
--------------------------------------------------------------------------------
/bixbench/plot_style.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench/plot_style.py
--------------------------------------------------------------------------------
/bixbench/plotting_utils.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench/plotting_utils.py
--------------------------------------------------------------------------------
/bixbench/postprocessing.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench/postprocessing.py
--------------------------------------------------------------------------------
/bixbench/postprocessing_utils.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench/postprocessing_utils.py
--------------------------------------------------------------------------------
/bixbench/prompts.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench/prompts.py
--------------------------------------------------------------------------------
/bixbench/run_configuration/4o_image.yaml:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench/run_configuration/4o_image.yaml
--------------------------------------------------------------------------------
/bixbench/run_configuration/4o_no_image.yaml:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench/run_configuration/4o_no_image.yaml
--------------------------------------------------------------------------------
/bixbench/run_configuration/bixbench_paper_results.yaml:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench/run_configuration/bixbench_paper_results.yaml
--------------------------------------------------------------------------------
/bixbench/run_configuration/claude_image.yaml:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench/run_configuration/claude_image.yaml
--------------------------------------------------------------------------------
/bixbench/run_configuration/claude_no_image.yaml:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench/run_configuration/claude_no_image.yaml
--------------------------------------------------------------------------------
/bixbench/run_configuration/generate_trajectories.yaml:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench/run_configuration/generate_trajectories.yaml
--------------------------------------------------------------------------------
/bixbench/run_configuration/postprocessing.yaml:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench/run_configuration/postprocessing.yaml
--------------------------------------------------------------------------------
/bixbench/run_configuration/v1.5_paper_results.yaml:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench/run_configuration/v1.5_paper_results.yaml
--------------------------------------------------------------------------------
/bixbench/utils.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench/utils.py
--------------------------------------------------------------------------------
/bixbench/zero_shot.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench/zero_shot.py
--------------------------------------------------------------------------------
/bixbench_results/baseline_eval_data/bixbench_llm_baseline_refusal_False_mcq_claude-3-5-sonnet-latest_1.0.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench_results/baseline_eval_data/bixbench_llm_baseline_refusal_False_mcq_claude-3-5-sonnet-latest_1.0.csv
--------------------------------------------------------------------------------
/bixbench_results/baseline_eval_data/bixbench_llm_baseline_refusal_False_mcq_gpt-4o_1.0.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench_results/baseline_eval_data/bixbench_llm_baseline_refusal_False_mcq_gpt-4o_1.0.csv
--------------------------------------------------------------------------------
/bixbench_results/baseline_eval_data/bixbench_llm_baseline_refusal_True_mcq_claude-3-5-sonnet-latest_1.0.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench_results/baseline_eval_data/bixbench_llm_baseline_refusal_True_mcq_claude-3-5-sonnet-latest_1.0.csv
--------------------------------------------------------------------------------
/bixbench_results/baseline_eval_data/bixbench_llm_baseline_refusal_True_mcq_gpt-4o_1.0.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench_results/baseline_eval_data/bixbench_llm_baseline_refusal_True_mcq_gpt-4o_1.0.csv
--------------------------------------------------------------------------------
/bixbench_results/baseline_eval_data/bixbench_llm_baseline_refusal_True_openended_claude-3-5-sonnet-latest_1.0.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench_results/baseline_eval_data/bixbench_llm_baseline_refusal_True_openended_claude-3-5-sonnet-latest_1.0.csv
--------------------------------------------------------------------------------
/bixbench_results/baseline_eval_data/bixbench_llm_baseline_refusal_True_openended_gpt-4o_1.0.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench_results/baseline_eval_data/bixbench_llm_baseline_refusal_True_openended_gpt-4o_1.0.csv
--------------------------------------------------------------------------------
/bixbench_results/bixbench_results_comparison.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench_results/bixbench_results_comparison.png
--------------------------------------------------------------------------------
/bixbench_results/majority_vote_accuracy_image_comparison.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench_results/majority_vote_accuracy_image_comparison.png
--------------------------------------------------------------------------------
/bixbench_results/majority_vote_accuracy_refusal_option_comparison.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench_results/majority_vote_accuracy_refusal_option_comparison.png
--------------------------------------------------------------------------------
/bixbench_results/zero_shot_baselines.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/bixbench_results/zero_shot_baselines.json
--------------------------------------------------------------------------------
/generate_zeroshot_evals.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/generate_zeroshot_evals.py
--------------------------------------------------------------------------------
/grade_outputs.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/grade_outputs.py
--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/pyproject.toml
--------------------------------------------------------------------------------
/scripts/run_agentic.sh:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/scripts/run_agentic.sh
--------------------------------------------------------------------------------
/scripts/run_zeroshot.sh:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/scripts/run_zeroshot.sh
--------------------------------------------------------------------------------
/tests/test_utils.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/tests/test_utils.py
--------------------------------------------------------------------------------
/tests/test_zeroshot.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/tests/test_zeroshot.py
--------------------------------------------------------------------------------
/uv.lock:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Future-House/BixBench/HEAD/uv.lock
--------------------------------------------------------------------------------