├── LICENSE ├── README.md ├── Template.md └── benchmark-logs ├── CollectiveCognition-v1-Mistral-7B.md ├── CollectiveCognition-v1.1-Mistral-7B.md ├── DPOpenHermes.md ├── Deepseek-LLM-67b-base.md ├── Dolphin-2.6-Mixtral-7x8.md ├── Dolphin-2.6-mistral-7b-dpo-laser.md ├── Hermes-2-Pro-Llama-3-8B-DPO.md ├── Hermes-2-Pro-Llama-3-8B-SFT.md ├── Hermes-2-Pro-Llama-3-Instruct-8B-Merge-DPO.md ├── Hermes-2-Theta-Llama-3-70B.md ├── Llama-2-7B-Base.md ├── Llama-3.1-8B-Instruct.md ├── Meta-Llama-3-70B-Instruct.md ├── Mistral-7B-Base.md ├── Mistral-Instruct-v0.1-Mistral-7B.md ├── Mixtral-7x8-Base.md ├── Mixtral-7x8-Instruct-v0.1.md ├── Neural-Hermes-2.5-Mistral-7B.md ├── Nous-Hermes-1-Llama-2-7B.md ├── Nous-Hermes-2-Mistral-7B-DPO.md ├── Nous-Hermes-2-Mixtral-8x7B-DPO.md ├── Nous-Hermes-2-Mixtral-8x7B-SFT.md ├── Nous-Hermes-2-SOLAR-10.7B.md ├── Nous-Hermes-2-Yi-34B.md ├── OpenHermes-2-Mistral-7B.md ├── OpenHermes-2.5-Mistral-7B.md ├── OpenHermes-v1-Llama-2-7B.md ├── Qwen-14b-base.md ├── Qwen-72B-base.md ├── SOLAR-10.7B-Instruct.md ├── SOLAR-10.7b-v1-Base.md ├── Skunkworks-Mistralic-7B-Mistral.md ├── Synthia-Mistral-7B.md ├── Teknium-Airoboros-2.2-Mistral-7B.md ├── TinyLlama-1.1B-intermediate-step-1431k-3T.md ├── TinyLlama-1B--intermediate-step-1195k-token-2.5T.md ├── Yi-34B-Chat.md ├── deita-v1.0-Mistral-7B.md └── template.md /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2023 Teknium 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # LLM-Benchmark-Logs 2 | 3 | Welcome to the LLM-Benchmark-Logs repository. This repository is dedicated to documenting and organizing benchmarks performed on various Foundational Large Language Models and their Fine-tunes. 4 | 5 | The main content of this repository is plaintext files containing detailed benchmark results. These files provide a comprehensive record of the performance characteristics of different LLMs under various conditions and workloads. 6 | 7 | In the future, I may introduce a "leaderboard" feature, which will rank the LLMs based on their benchmark performance. This will provide a quick and easy reference for comparing the capabilities of different LLMs. However, I don't want to be in the business of leaderboards, so don't expect much. 
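For reference, the "Average" reported under each suite appears to be a plain mean of the per-task scores in that table, expressed as a percentage: for the GPT4All and AGIEval tables, acc_norm is used where a task reports one (plain acc otherwise), and for BigBench the multiple_choice_grade rows are used while exact_str_match rows are ignored. The snippet below is a minimal sketch of that calculation only; the helper name and the tuple format are illustrative and not part of this repository.

```
# Illustrative sketch (not part of this repository): how the per-suite
# "Average" lines appear to be derived from the tables in each log.
# Observed rule: prefer acc_norm over acc (GPT4All, AGIEval), use
# multiple_choice_grade for BigBench, skip exact_str_match rows, then
# report the mean of the per-task scores as a percentage.

def suite_average(rows):
    """rows: list of (task, metric, value) tuples copied from one table."""
    scores = {}
    for task, metric, value in rows:
        if metric in ("acc_norm", "multiple_choice_grade"):
            scores[task] = value            # preferred metric always wins
        elif metric == "acc" and task not in scores:
            scores[task] = value            # fall back to plain acc
        # exact_str_match rows are skipped, matching the logs
    return round(100 * sum(scores.values()) / len(scores), 2)

# Hypothetical two-task example using values from benchmark-logs/DPOpenHermes.md:
rows = [
    ("arc_challenge", "acc", 0.5776),
    ("arc_challenge", "acc_norm", 0.6177),
    ("boolq", "acc", 0.8709),
]
print(suite_average(rows))  # 74.43 for this two-task subset
```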
8 | -------------------------------------------------------------------------------- /Template.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | 4 | ``` 5 | Average: 6 | 7 | AGIEval: 8 | ``` 9 | 10 | ``` 11 | Average: 12 | 13 | BigBench: 14 | ``` 15 | 16 | ``` 17 | Average: 18 | 19 | TruthfulQA: 20 | ``` 21 | 22 | ``` -------------------------------------------------------------------------------- /benchmark-logs/CollectiveCognition-v1-Mistral-7B.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | 4 | ``` 5 | Average: 6 | 7 | AGIEval: 8 | ``` 9 | 10 | ``` 11 | Average: 12 | 13 | BigBench: 14 | ``` 15 | | Task |Version| Metric |Value | |Stderr| 16 | |------------------------------------------------|------:|---------------------|-----:|---|-----:| 17 | |bigbench_causal_judgement | 0|multiple_choice_grade|0.5684|± |0.0360| 18 | |bigbench_date_understanding | 0|multiple_choice_grade|0.6612|± |0.0247| 19 | |bigbench_disambiguation_qa | 0|multiple_choice_grade|0.3682|± |0.0301| 20 | |bigbench_geometric_shapes | 0|multiple_choice_grade|0.1365|± |0.0181| 21 | | | |exact_str_match |0.0000|± |0.0000| 22 | |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.2540|± |0.0195| 23 | |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.1886|± |0.0148| 24 | |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.3967|± |0.0283| 25 | |bigbench_movie_recommendation | 0|multiple_choice_grade|0.2640|± |0.0197| 26 | |bigbench_navigate | 0|multiple_choice_grade|0.5000|± |0.0158| 27 | |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.5080|± |0.0112| 28 | |bigbench_ruin_names | 0|multiple_choice_grade|0.3058|± |0.0218| 29 | |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.1743|± |0.0120| 30 | |bigbench_snarks | 0|multiple_choice_grade|0.5028|± |0.0373| 31 | |bigbench_sports_understanding | 0|multiple_choice_grade|0.4970|± |0.0159| 32 | |bigbench_temporal_sequences | 0|multiple_choice_grade|0.2700|± |0.0140| 33 | |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.2256|± |0.0118| 34 | |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1714|± |0.0090| 35 | |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.3967|± |0.0283| 36 | ``` 37 | Average: 35.50 38 | 39 | TruthfulQA: 40 | ``` 41 | 42 | ``` -------------------------------------------------------------------------------- /benchmark-logs/CollectiveCognition-v1.1-Mistral-7B.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | 4 | ``` 5 | Average: 6 | 7 | AGIEval: 8 | ``` 9 | 10 | ``` 11 | Average: 12 | 13 | BigBench: 14 | ``` 15 | | Task |Version| Metric |Value | |Stderr| 16 | |------------------------------------------------|------:|---------------------|-----:|---|-----:| 17 | |bigbench_causal_judgement | 0|multiple_choice_grade|0.5579|± |0.0361| 18 | |bigbench_date_understanding | 0|multiple_choice_grade|0.6748|± |0.0244| 19 | |bigbench_disambiguation_qa | 0|multiple_choice_grade|0.4380|± |0.0309| 20 | |bigbench_geometric_shapes | 0|multiple_choice_grade|0.1003|± |0.0159| 21 | | | |exact_str_match |0.0000|± |0.0000| 22 | |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.2700|± |0.0199| 23 | |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.2043|± |0.0152| 24 | 
|bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.4067|± |0.0284| 25 | |bigbench_movie_recommendation | 0|multiple_choice_grade|0.2580|± |0.0196| 26 | |bigbench_navigate | 0|multiple_choice_grade|0.5110|± |0.0158| 27 | |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.4990|± |0.0112| 28 | |bigbench_ruin_names | 0|multiple_choice_grade|0.3214|± |0.0221| 29 | |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.1784|± |0.0121| 30 | |bigbench_snarks | 0|multiple_choice_grade|0.6022|± |0.0365| 31 | |bigbench_sports_understanding | 0|multiple_choice_grade|0.5254|± |0.0159| 32 | |bigbench_temporal_sequences | 0|multiple_choice_grade|0.2700|± |0.0140| 33 | |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.2296|± |0.0119| 34 | |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1754|± |0.0091| 35 | |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.4067|± |0.0284| 36 | ``` 37 | Average: 36.83 38 | 39 | TruthfulQA: 40 | ``` 41 | 42 | ``` -------------------------------------------------------------------------------- /benchmark-logs/DPOpenHermes.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | | Task |Version| Metric |Value | |Stderr| 4 | |-------------|------:|--------|-----:|---|-----:| 5 | |arc_challenge| 0|acc |0.5776|± |0.0144| 6 | | | |acc_norm|0.6177|± |0.0142| 7 | |arc_easy | 0|acc |0.8396|± |0.0075| 8 | | | |acc_norm|0.8215|± |0.0079| 9 | |boolq | 1|acc |0.8709|± |0.0059| 10 | |hellaswag | 0|acc |0.6495|± |0.0048| 11 | | | |acc_norm|0.8297|± |0.0038| 12 | |openbookqa | 0|acc |0.3360|± |0.0211| 13 | | | |acc_norm|0.4540|± |0.0223| 14 | |piqa | 0|acc |0.8188|± |0.0090| 15 | | | |acc_norm|0.8270|± |0.0088| 16 | |winogrande | 0|acc |0.7403|± |0.0123| 17 | ``` 18 | Average: 73.73 19 | 20 | AGIEval: 21 | ``` 22 | | Task |Version| Metric |Value | |Stderr| 23 | |------------------------------|------:|--------|-----:|---|-----:| 24 | |agieval_aqua_rat | 0|acc |0.1929|± |0.0248| 25 | | | |acc_norm|0.2047|± |0.0254| 26 | |agieval_logiqa_en | 0|acc |0.3763|± |0.0190| 27 | | | |acc_norm|0.3717|± |0.0190| 28 | |agieval_lsat_ar | 0|acc |0.2826|± |0.0298| 29 | | | |acc_norm|0.2652|± |0.0292| 30 | |agieval_lsat_lr | 0|acc |0.5314|± |0.0221| 31 | | | |acc_norm|0.5353|± |0.0221| 32 | |agieval_lsat_rc | 0|acc |0.6134|± |0.0297| 33 | | | |acc_norm|0.5911|± |0.0300| 34 | |agieval_sat_en | 0|acc |0.7427|± |0.0305| 35 | | | |acc_norm|0.7233|± |0.0312| 36 | |agieval_sat_en_without_passage| 0|acc |0.4709|± |0.0349| 37 | | | |acc_norm|0.4660|± |0.0348| 38 | |agieval_sat_math | 0|acc |0.4045|± |0.0332| 39 | | | |acc_norm|0.3727|± |0.0327| 40 | ``` 41 | Average: 44.13 42 | 43 | BigBench: 44 | ``` 45 | | Task |Version| Metric |Value | |Stderr| 46 | |------------------------------------------------|------:|---------------------|-----:|---|-----:| 47 | |bigbench_causal_judgement | 0|multiple_choice_grade|0.5579|± |0.0361| 48 | |bigbench_date_understanding | 0|multiple_choice_grade|0.6558|± |0.0248| 49 | |bigbench_disambiguation_qa | 0|multiple_choice_grade|0.3411|± |0.0296| 50 | |bigbench_geometric_shapes | 0|multiple_choice_grade|0.2145|± |0.0217| 51 | | | |exact_str_match |0.0947|± |0.0155| 52 | |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.2980|± |0.0205| 53 | |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.2057|± |0.0153| 54 | |bigbench_logical_deduction_three_objects | 
0|multiple_choice_grade|0.4800|± |0.0289| 55 | |bigbench_movie_recommendation | 0|multiple_choice_grade|0.3880|± |0.0218| 56 | |bigbench_navigate | 0|multiple_choice_grade|0.5000|± |0.0158| 57 | |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.6750|± |0.0105| 58 | |bigbench_ruin_names | 0|multiple_choice_grade|0.4353|± |0.0235| 59 | |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.3387|± |0.0150| 60 | |bigbench_snarks | 0|multiple_choice_grade|0.7017|± |0.0341| 61 | |bigbench_sports_understanding | 0|multiple_choice_grade|0.6826|± |0.0148| 62 | |bigbench_temporal_sequences | 0|multiple_choice_grade|0.3200|± |0.0148| 63 | |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.2120|± |0.0116| 64 | |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1697|± |0.0090| 65 | |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.4800|± |0.0289| 66 | ``` 67 | Average: 42.53 68 | 69 | TruthfulQA: 70 | ``` 71 | | Task |Version|Metric|Value | |Stderr| 72 | |-------------|------:|------|-----:|---|-----:| 73 | |truthfulqa_mc| 1|mc1 |0.4137|± |0.0172| 74 | | | |mc2 |0.5869|± |0.0154| 75 | ``` -------------------------------------------------------------------------------- /benchmark-logs/Deepseek-LLM-67b-base.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ```hf-causal-experimental (pretrained=deepseek-ai/deepseek-llm-67b-base,use_accelerate=True), limit: None, provide_description: False, num_fewshot: 0, batch_size: 20 3 | | Task |Version| Metric |Value | |Stderr| 4 | |-------------|------:|--------|-----:|---|-----:| 5 | |arc_challenge| 0|acc |0.5589|± |0.0145| 6 | | | |acc_norm|0.5964|± |0.0143| 7 | |arc_easy | 0|acc |0.8405|± |0.0075| 8 | | | |acc_norm|0.8316|± |0.0077| 9 | |boolq | 1|acc |0.8486|± |0.0063| 10 | |hellaswag | 0|acc |0.6502|± |0.0048| 11 | | | |acc_norm|0.8437|± |0.0036| 12 | |openbookqa | 0|acc |0.3620|± |0.0215| 13 | | | |acc_norm|0.4800|± |0.0224| 14 | |piqa | 0|acc |0.8226|± |0.0089| 15 | | | |acc_norm|0.8335|± |0.0087| 16 | |winogrande | 0|acc |0.8074|± |0.0111| 17 | ``` 18 | Average: 74.87 19 | 20 | AGIEval: 21 | ```hf-causal-experimental (pretrained=deepseek-ai/deepseek-llm-67b-base,use_accelerate=True), limit: None, provide_description: False, num_fewshot: 0, batch_size: 20 22 | | Task |Version| Metric |Value | |Stderr| 23 | |------------------------------|------:|--------|-----:|---|-----:| 24 | |agieval_aqua_rat | 0|acc |0.2756|± |0.0281| 25 | | | |acc_norm|0.2559|± |0.0274| 26 | |agieval_logiqa_en | 0|acc |0.3763|± |0.0190| 27 | | | |acc_norm|0.3518|± |0.0187| 28 | |agieval_lsat_ar | 0|acc |0.2217|± |0.0275| 29 | | | |acc_norm|0.2000|± |0.0264| 30 | |agieval_lsat_lr | 0|acc |0.4294|± |0.0219| 31 | | | |acc_norm|0.3569|± |0.0212| 32 | |agieval_lsat_rc | 0|acc |0.5874|± |0.0301| 33 | | | |acc_norm|0.4461|± |0.0304| 34 | |agieval_sat_en | 0|acc |0.7816|± |0.0289| 35 | | | |acc_norm|0.6262|± |0.0338| 36 | |agieval_sat_en_without_passage| 0|acc |0.4951|± |0.0349| 37 | | | |acc_norm|0.3641|± |0.0336| 38 | |agieval_sat_math | 0|acc |0.4227|± |0.0334| 39 | | | |acc_norm|0.3455|± |0.0321| 40 | ``` 41 | Average: 36.83 42 | 43 | BigBench: 44 | ```hf-causal-experimental (pretrained=deepseek-ai/deepseek-llm-67b-base,use_accelerate=True), limit: None, provide_description: False, num_fewshot: 0, batch_size: 20 45 | | Task |Version| Metric |Value | |Stderr| 46 | 
|------------------------------------------------|------:|---------------------|-----:|---|-----:| 47 | |bigbench_causal_judgement | 0|multiple_choice_grade|0.5789|± |0.0359| 48 | |bigbench_date_understanding | 0|multiple_choice_grade|0.7317|± |0.0231| 49 | |bigbench_disambiguation_qa | 0|multiple_choice_grade|0.3140|± |0.0289| 50 | |bigbench_geometric_shapes | 0|multiple_choice_grade|0.1281|± |0.0177| 51 | | | |exact_str_match |0.0000|± |0.0000| 52 | |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.2820|± |0.0201| 53 | |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.2000|± |0.0151| 54 | |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.4333|± |0.0287| 55 | |bigbench_movie_recommendation | 0|multiple_choice_grade|0.3460|± |0.0213| 56 | |bigbench_navigate | 0|multiple_choice_grade|0.4720|± |0.0158| 57 | |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.5175|± |0.0112| 58 | |bigbench_ruin_names | 0|multiple_choice_grade|0.4688|± |0.0236| 59 | |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.3216|± |0.0148| 60 | |bigbench_snarks | 0|multiple_choice_grade|0.6298|± |0.0360| 61 | |bigbench_sports_understanding | 0|multiple_choice_grade|0.6298|± |0.0154| 62 | |bigbench_temporal_sequences | 0|multiple_choice_grade|0.4470|± |0.0157| 63 | |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.2096|± |0.0115| 64 | |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1480|± |0.0085| 65 | |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.4333|± |0.0287| 66 | ``` 67 | Average: 40.51 -------------------------------------------------------------------------------- /benchmark-logs/Dolphin-2.6-Mixtral-7x8.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | | Task |Version| Metric |Value | |Stderr| 4 | |-------------|------:|--------|-----:|---|-----:| 5 | |arc_challenge| 0|acc |0.5606|_ |0.0145| 6 | | | |acc_norm|0.5947|_ |0.0143| 7 | |arc_easy | 0|acc |0.8283|_ |0.0077| 8 | | | |acc_norm|0.8190|_ |0.0079| 9 | |boolq | 1|acc |0.8676|_ |0.0059| 10 | |hellaswag | 0|acc |0.6477|_ |0.0048| 11 | | | |acc_norm|0.8303|_ |0.0037| 12 | |openbookqa | 0|acc |0.3540|_ |0.0214| 13 | | | |acc_norm|0.4580|_ |0.0223| 14 | |piqa | 0|acc |0.8248|_ |0.0089| 15 | | | |acc_norm|0.8319|_ |0.0087| 16 | |winogrande | 0|acc |0.7585|_ |0.0120| 17 | ``` 18 | Average: 73.71 19 | 20 | AGIEval: 21 | ``` 22 | | Task |Version| Metric |Value | |Stderr| 23 | |------------------------------|------:|--------|-----:|---|-----:| 24 | |agieval_aqua_rat | 0|acc |0.2480|_ |0.0272| 25 | | | |acc_norm|0.2244|_ |0.0262| 26 | |agieval_logiqa_en | 0|acc |0.3748|_ |0.0190| 27 | | | |acc_norm|0.3794|_ |0.0190| 28 | |agieval_lsat_ar | 0|acc |0.2174|_ |0.0273| 29 | | | |acc_norm|0.1957|_ |0.0262| 30 | |agieval_lsat_lr | 0|acc |0.3980|_ |0.0217| 31 | | | |acc_norm|0.4020|_ |0.0217| 32 | |agieval_lsat_rc | 0|acc |0.5539|_ |0.0304| 33 | | | |acc_norm|0.5167|_ |0.0305| 34 | |agieval_sat_en | 0|acc |0.7524|_ |0.0301| 35 | | | |acc_norm|0.7282|_ |0.0311| 36 | |agieval_sat_en_without_passage| 0|acc |0.4806|_ |0.0349| 37 | | | |acc_norm|0.4126|_ |0.0344| 38 | |agieval_sat_math | 0|acc |0.3909|_ |0.0330| 39 | | | |acc_norm|0.3545|_ |0.0323| 40 | ``` 41 | Average: 40.17 42 | 43 | BigBench: 44 | ```hf-causal-experimental 
(pretrained=cognitivecomputations/dolphin-2.6-mixtral-8x7b,dtype=float16,trust_remote_code=True,use_accelerate=True), limit: None, provide_description: False, num_fewshot: 0, batch_size: 32 45 | | Task |Version| Metric |Value | |Stderr| 46 | |------------------------------------------------|------:|---------------------|-----:|---|-----:| 47 | |bigbench_causal_judgement | 0|multiple_choice_grade|0.5789|_ |0.0359| 48 | |bigbench_date_understanding | 0|multiple_choice_grade|0.7182|_ |0.0235| 49 | |bigbench_disambiguation_qa | 0|multiple_choice_grade|0.5194|_ |0.0312| 50 | |bigbench_geometric_shapes | 0|multiple_choice_grade|0.1448|_ |0.0186| 51 | | | |exact_str_match |0.0000|_ |0.0000| 52 | |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.2640|_ |0.0197| 53 | |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.2129|_ |0.0155| 54 | |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.4200|_ |0.0285| 55 | |bigbench_movie_recommendation | 0|multiple_choice_grade|0.3940|_ |0.0219| 56 | |bigbench_navigate | 0|multiple_choice_grade|0.5000|_ |0.0158| 57 | |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.6640|_ |0.0106| 58 | |bigbench_ruin_names | 0|multiple_choice_grade|0.5491|_ |0.0235| 59 | |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.2595|_ |0.0139| 60 | |bigbench_snarks | 0|multiple_choice_grade|0.7514|_ |0.0322| 61 | |bigbench_sports_understanding | 0|multiple_choice_grade|0.6004|_ |0.0156| 62 | |bigbench_temporal_sequences | 0|multiple_choice_grade|0.4010|_ |0.0155| 63 | |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.2256|_ |0.0118| 64 | |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1571|_ |0.0087| 65 | |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.4200|_ |0.0285| 66 | ``` 67 | Average: 43.22 68 | 69 | TruthfulQA: 70 | ``` 71 | | Task |Version|Metric|Value | |Stderr| 72 | |-------------|------:|------|-----:|---|-----:| 73 | |truthfulqa_mc| 1|mc1 |0.3892|_ |0.0171| 74 | | | |mc2 |0.5478|_ |0.0150| 75 | ``` 76 | -------------------------------------------------------------------------------- /benchmark-logs/Dolphin-2.6-mistral-7b-dpo-laser.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | | Task |Version| Metric |Value | |Stderr| 4 | |-------------|------:|--------|-----:|---|-----:| 5 | |arc_challenge| 0|acc |0.5640|± |0.0145| 6 | | | |acc_norm|0.5887|± |0.0144| 7 | |arc_easy | 0|acc |0.8392|± |0.0075| 8 | | | |acc_norm|0.8026|± |0.0082| 9 | |boolq | 1|acc |0.8731|± |0.0058| 10 | |hellaswag | 0|acc |0.6476|± |0.0048| 11 | | | |acc_norm|0.8364|± |0.0037| 12 | |openbookqa | 0|acc |0.3460|± |0.0213| 13 | | | |acc_norm|0.4700|± |0.0223| 14 | |piqa | 0|acc |0.8215|± |0.0089| 15 | | | |acc_norm|0.8324|± |0.0087| 16 | |winogrande | 0|acc |0.7624|± |0.0120| 17 | ``` 18 | Average: 73.79 19 | 20 | AGIEval: 21 | ``` 22 | | Task |Version| Metric |Value | |Stderr| 23 | |------------------------------|------:|--------|-----:|---|-----:| 24 | |agieval_aqua_rat | 0|acc |0.2047|± |0.0254| 25 | | | |acc_norm|0.1969|± |0.0250| 26 | |agieval_logiqa_en | 0|acc |0.3364|± |0.0185| 27 | | | |acc_norm|0.3410|± |0.0186| 28 | |agieval_lsat_ar | 0|acc |0.2217|± |0.0275| 29 | | | |acc_norm|0.2087|± |0.0269| 30 | |agieval_lsat_lr | 0|acc |0.3902|± |0.0216| 31 | | | |acc_norm|0.3961|± |0.0217| 32 | |agieval_lsat_rc | 0|acc |0.5353|± |0.0305| 33 | | | |acc_norm|0.5130|± 
|0.0305| 34 | |agieval_sat_en | 0|acc |0.6990|± |0.0320| 35 | | | |acc_norm|0.6893|± |0.0323| 36 | |agieval_sat_en_without_passage| 0|acc |0.4029|± |0.0343| 37 | | | |acc_norm|0.3981|± |0.0342| 38 | |agieval_sat_math | 0|acc |0.3864|± |0.0329| 39 | | | |acc_norm|0.3409|± |0.0320| 40 | ``` 41 | Average: 38.55 42 | 43 | BigBench: 44 | ``` 45 | | Task |Version| Metric |Value | |Stderr| 46 | |------------------------------------------------|------:|---------------------|-----:|---|-----:| 47 | |bigbench_causal_judgement | 0|multiple_choice_grade|0.5842|± |0.0359| 48 | |bigbench_date_understanding | 0|multiple_choice_grade|0.6829|± |0.0243| 49 | |bigbench_disambiguation_qa | 0|multiple_choice_grade|0.3488|± |0.0297| 50 | |bigbench_geometric_shapes | 0|multiple_choice_grade|0.2033|± |0.0213| 51 | | | |exact_str_match |0.0306|± |0.0091| 52 | |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.2680|± |0.0198| 53 | |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.2086|± |0.0154| 54 | |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.4967|± |0.0289| 55 | |bigbench_movie_recommendation | 0|multiple_choice_grade|0.5140|± |0.0224| 56 | |bigbench_navigate | 0|multiple_choice_grade|0.5330|± |0.0158| 57 | |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.6715|± |0.0105| 58 | |bigbench_ruin_names | 0|multiple_choice_grade|0.3996|± |0.0232| 59 | |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.1283|± |0.0106| 60 | |bigbench_snarks | 0|multiple_choice_grade|0.6575|± |0.0354| 61 | |bigbench_sports_understanding | 0|multiple_choice_grade|0.7160|± |0.0144| 62 | |bigbench_temporal_sequences | 0|multiple_choice_grade|0.3680|± |0.0153| 63 | |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.2296|± |0.0119| 64 | |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1663|± |0.0089| 65 | |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.4967|± |0.0289| 66 | ``` 67 | Average: 42.63 68 | 69 | TruthfulQA: 70 | ``` 71 | | Task |Version|Metric|Value | |Stderr| 72 | |-------------|------:|------|-----:|---|-----:| 73 | |truthfulqa_mc| 1|mc1 |0.4431|± |0.0174| 74 | | | |mc2 |0.6106|± |0.0151| 75 | ``` -------------------------------------------------------------------------------- /benchmark-logs/Hermes-2-Pro-Llama-3-8B-DPO.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | | Task |Version| Metric |Value | |Stderr| 4 | |-------------|------:|--------|-----:|---|-----:| 5 | |arc_challenge| 0|acc |0.5503|_ |0.0145| 6 | | | |acc_norm|0.5828|_ |0.0144| 7 | |arc_easy | 0|acc |0.8333|_ |0.0076| 8 | | | |acc_norm|0.8144|_ |0.0080| 9 | |boolq | 1|acc |0.8590|_ |0.0061| 10 | |hellaswag | 0|acc |0.6260|_ |0.0048| 11 | | | |acc_norm|0.8054|_ |0.0040| 12 | |openbookqa | 0|acc |0.3800|_ |0.0217| 13 | | | |acc_norm|0.4560|_ |0.0223| 14 | |piqa | 0|acc |0.7987|_ |0.0094| 15 | | | |acc_norm|0.8118|_ |0.0091| 16 | |winogrande | 0|acc |0.7514|_ |0.0121| 17 | ``` 18 | Average: 72.58 19 | 20 | AGIEval: 21 | ``` 22 | | Task |Version| Metric |Value | |Stderr| 23 | |------------------------------|------:|--------|-----:|---|-----:| 24 | |agieval_aqua_rat | 0|acc |0.2520|_ |0.0273| 25 | | | |acc_norm|0.2480|_ |0.0272| 26 | |agieval_logiqa_en | 0|acc |0.3564|_ |0.0188| 27 | | | |acc_norm|0.3656|_ |0.0189| 28 | |agieval_lsat_ar | 0|acc |0.1870|_ |0.0258| 29 | | | |acc_norm|0.1957|_ |0.0262| 30 | |agieval_lsat_lr | 
0|acc |0.5608|_ |0.0220| 31 | | | |acc_norm|0.5314|_ |0.0221| 32 | |agieval_lsat_rc | 0|acc |0.6320|_ |0.0295| 33 | | | |acc_norm|0.6171|_ |0.0297| 34 | |agieval_sat_en | 0|acc |0.7379|_ |0.0307| 35 | | | |acc_norm|0.7039|_ |0.0319| 36 | |agieval_sat_en_without_passage| 0|acc |0.4029|_ |0.0343| 37 | | | |acc_norm|0.3641|_ |0.0336| 38 | |agieval_sat_math | 0|acc |0.3818|_ |0.0328| 39 | | | |acc_norm|0.3727|_ |0.0327| 40 | ``` 41 | Average: 42.48 42 | 43 | BigBench: 44 | ``` 45 | | Task |Version| Metric |Value | |Stderr| 46 | |------------------------------------------------|------:|---------------------|-----:|---|-----:| 47 | |bigbench_causal_judgement | 0|multiple_choice_grade|0.5737|± |0.0360| 48 | |bigbench_date_understanding | 0|multiple_choice_grade|0.6667|± |0.0246| 49 | |bigbench_disambiguation_qa | 0|multiple_choice_grade|0.3178|± |0.0290| 50 | |bigbench_geometric_shapes | 0|multiple_choice_grade|0.1755|± |0.0201| 51 | | | |exact_str_match |0.0000|± |0.0000| 52 | |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.3120|± |0.0207| 53 | |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.2014|± |0.0152| 54 | |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.5500|± |0.0288| 55 | |bigbench_movie_recommendation | 0|multiple_choice_grade|0.4300|± |0.0222| 56 | |bigbench_navigate | 0|multiple_choice_grade|0.4980|± |0.0158| 57 | |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.7010|± |0.0102| 58 | |bigbench_ruin_names | 0|multiple_choice_grade|0.4688|± |0.0236| 59 | |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.1974|± |0.0126| 60 | |bigbench_snarks | 0|multiple_choice_grade|0.7403|± |0.0327| 61 | |bigbench_sports_understanding | 0|multiple_choice_grade|0.5426|± |0.0159| 62 | |bigbench_temporal_sequences | 0|multiple_choice_grade|0.5320|± |0.0158| 63 | |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.2280|± |0.0119| 64 | |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1531|± |0.0086| 65 | |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.5500|± |0.0288| 66 | ``` 67 | Average: 43.55 68 | 69 | TruthfulQA: 70 | ``` 71 | | Task |Version|Metric|Value | |Stderr| 72 | |-------------|------:|------|-----:|---|-----:| 73 | |truthfulqa_mc| 1|mc1 |0.4076|_ |0.0172| 74 | | | |mc2 |0.5788|_ |0.0157| 75 | ``` 76 | -------------------------------------------------------------------------------- /benchmark-logs/Hermes-2-Pro-Llama-3-8B-SFT.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | | Task |Version| Metric |Value | |Stderr| 4 | |-------------|------:|--------|-----:|---|-----:| |arc_challenge| 0|acc |0.6425|_ |0.0140| | | |acc_norm|0.6493|_ |0.0139| |arc_easy | 0|acc |0.8813|_ |0.0066| 5 | | | |acc_norm|0.8582|_ |0.0072| 6 | |boolq | 1|acc |0.8719|_ |0.0058| 7 | |hellaswag | 0|acc |0.6776|_ |0.0047| 8 | | | |acc_norm|0.8552|_ |0.0035| 9 | |openbookqa | 0|acc |0.3760|_ |0.0217| | | |acc_norm|0.4760|_ |0.0224| 10 | |piqa | 0|acc |0.8275|_ |0.0088| 11 | | | |acc_norm|0.8400|_ |0.0086| 12 | |winogrande | 0|acc |0.8027|_ |0.0112| 13 | 14 | ``` 15 | Average: 76.48 16 | 17 | AGIEval: 18 | ``` 19 | | Task |Version| Metric |Value | |Stderr| 20 | |------------------------------|------:|--------|-----:|---|-----:| 21 | |agieval_aqua_rat | 0|acc |0.3425|_ |0.0298| 22 | | | |acc_norm|0.3150|_ |0.0292| 23 | |agieval_logiqa_en | 0|acc |0.4900|_ |0.0196| 24 | | | 
|acc_norm|0.4700|_ |0.0196| 25 | |agieval_lsat_ar | 0|acc |0.2652|_ |0.0292| 26 | | | |acc_norm|0.2783|_ |0.0296| 27 | |agieval_lsat_lr | 0|acc |0.7510|_ |0.0192| 28 | | | |acc_norm|0.7137|_ |0.0200| 29 | |agieval_lsat_rc | 0|acc |0.7732|_ |0.0256| 30 | | | |acc_norm|0.7472|_ |0.0265| 31 | |agieval_sat_en | 0|acc |0.8350|_ |0.0259| 32 | | | |acc_norm|0.8544|_ |0.0246| 33 | |agieval_sat_en_without_passage| 0|acc |0.5340|_ |0.0348| 34 | | | |acc_norm|0.5049|_ |0.0349| 35 | |agieval_sat_math | 0|acc |0.5864|_ |0.0333| 36 | | | |acc_norm|0.4955|_ |0.0338| 37 | ``` 38 | Average: 54.74 39 | 40 | BigBench: 41 | ``` 42 | | Task |Version| Metric |Value | |Stderr| 43 | |------------------------------------------------|------:|---------------------|-----:|---|-----:| 44 | |bigbench_causal_judgement | 0|multiple_choice_grade|0.5842|_ |0.0359| 45 | |bigbench_date_understanding | 0|multiple_choice_grade|0.7724|_ |0.0219| 46 | |bigbench_disambiguation_qa | 0|multiple_choice_grade|0.3256|_ |0.0292| 47 | |bigbench_geometric_shapes | 0|multiple_choice_grade|0.5125|_ |0.0264| 48 | | | |exact_str_match |0.0000|_ |0.0000| 49 | |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.3640|_ |0.0215| 50 | |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.2643|_ |0.0167| 51 | |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.4633|_ |0.0288| 52 | |bigbench_movie_recommendation | 0|multiple_choice_grade|0.4980|_ |0.0224| 53 | |bigbench_navigate | 0|multiple_choice_grade|0.5000|_ |0.0158| 54 | |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.8325|_ |0.0084| 55 | |bigbench_ruin_names | 0|multiple_choice_grade|0.6272|_ |0.0229| 56 | |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.5361|_ |0.0158| 57 | |bigbench_snarks | 0|multiple_choice_grade|0.7680|_ |0.0315| 58 | |bigbench_sports_understanding | 0|multiple_choice_grade|0.5832|_ |0.0157| 59 | |bigbench_temporal_sequences | 0|multiple_choice_grade|0.7990|_ |0.0127| 60 | |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.2208|_ |0.0117| 61 | |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1577|_ |0.0087| 62 | |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.4633|_ |0.0288| 63 | ``` 64 | Average: 51.51 65 | 66 | TruthfulQA: 67 | Score: 56.81 68 | -------------------------------------------------------------------------------- /benchmark-logs/Hermes-2-Pro-Llama-3-Instruct-8B-Merge-DPO.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | | Task |Version| Metric |Value | |Stderr| 4 | |-------------|------:|--------|-----:|---|-----:| 5 | |arc_challenge| 0|acc |0.5529|± |0.0145| 6 | | | |acc_norm|0.5870|± |0.0144| 7 | |arc_easy | 0|acc |0.8371|± |0.0076| 8 | | | |acc_norm|0.8144|± |0.0080| 9 | |boolq | 1|acc |0.8599|± |0.0061| 10 | |hellaswag | 0|acc |0.6133|± |0.0049| 11 | | | |acc_norm|0.7989|± |0.0040| 12 | |openbookqa | 0|acc |0.3940|± |0.0219| 13 | | | |acc_norm|0.4680|± |0.0223| 14 | |piqa | 0|acc |0.8063|± |0.0092| 15 | | | |acc_norm|0.8156|± |0.0090| 16 | |winogrande | 0|acc |0.7372|± |0.0124| 17 | ``` 18 | Average: 72.59 19 | 20 | AGIEval: 21 | ``` 22 | | Task |Version| Metric |Value | |Stderr| 23 | |------------------------------|------:|--------|-----:|---|-----:| 24 | |agieval_aqua_rat | 0|acc |0.2441|± |0.0270| 25 | | | |acc_norm|0.2441|± |0.0270| 26 | |agieval_logiqa_en | 0|acc |0.3687|± |0.0189| 27 | | | |acc_norm|0.3840|± 
|0.0191| 28 | |agieval_lsat_ar | 0|acc |0.2304|± |0.0278| 29 | | | |acc_norm|0.2174|± |0.0273| 30 | |agieval_lsat_lr | 0|acc |0.5471|± |0.0221| 31 | | | |acc_norm|0.5373|± |0.0221| 32 | |agieval_lsat_rc | 0|acc |0.6617|± |0.0289| 33 | | | |acc_norm|0.6357|± |0.0294| 34 | |agieval_sat_en | 0|acc |0.7670|± |0.0295| 35 | | | |acc_norm|0.7379|± |0.0307| 36 | |agieval_sat_en_without_passage| 0|acc |0.4417|± |0.0347| 37 | | | |acc_norm|0.4223|± |0.0345| 38 | |agieval_sat_math | 0|acc |0.4000|± |0.0331| 39 | | | |acc_norm|0.3455|± |0.0321| 40 | ``` 41 | Average: 44.05 42 | 43 | BigBench: 44 | ``` 45 | | Task |Version| Metric |Value | |Stderr| 46 | |------------------------------------------------|------:|---------------------|-----:|---|-----:| 47 | |bigbench_causal_judgement | 0|multiple_choice_grade|0.6000|± |0.0356| 48 | |bigbench_date_understanding | 0|multiple_choice_grade|0.6585|± |0.0247| 49 | |bigbench_disambiguation_qa | 0|multiple_choice_grade|0.3178|± |0.0290| 50 | |bigbench_geometric_shapes | 0|multiple_choice_grade|0.2340|± |0.0224| 51 | | | |exact_str_match |0.0000|± |0.0000| 52 | |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.2980|± |0.0205| 53 | |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.2057|± |0.0153| 54 | |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.5367|± |0.0288| 55 | |bigbench_movie_recommendation | 0|multiple_choice_grade|0.4040|± |0.0220| 56 | |bigbench_navigate | 0|multiple_choice_grade|0.4970|± |0.0158| 57 | |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.7075|± |0.0102| 58 | |bigbench_ruin_names | 0|multiple_choice_grade|0.4821|± |0.0236| 59 | |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.2295|± |0.0133| 60 | |bigbench_snarks | 0|multiple_choice_grade|0.6906|± |0.0345| 61 | |bigbench_sports_understanding | 0|multiple_choice_grade|0.5375|± |0.0159| 62 | |bigbench_temporal_sequences | 0|multiple_choice_grade|0.6270|± |0.0153| 63 | |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.2216|± |0.0118| 64 | |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1594|± |0.0088| 65 | |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.5367|± |0.0288| 66 | ``` 67 | Average: 44.13 68 | 69 | TruthfulQA: 70 | ``` 71 | | Task |Version|Metric|Value | |Stderr| 72 | |-------------|------:|------|-----:|---|-----:| 73 | |truthfulqa_mc| 1|mc1 |0.3929|± |0.0171| 74 | | | |mc2 |0.5633|± |0.0154| 75 | ``` 76 | -------------------------------------------------------------------------------- /benchmark-logs/Hermes-2-Theta-Llama-3-70B.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | | Task |Version| Metric |Value | |Stderr| 4 | |-------------|------:|--------|-----:|---|-----:| 5 | |arc_challenge| 0|acc |0.6638|_ |0.0138| 6 | | | |acc_norm|0.6903|_ |0.0135| 7 | |arc_easy | 0|acc |0.8851|_ |0.0065| 8 | | | |acc_norm|0.8712|_ |0.0069| 9 | |boolq | 1|acc |0.8820|_ |0.0056| 10 | |hellaswag | 0|acc |0.6579|_ |0.0047| 11 | | | |acc_norm|0.8432|_ |0.0036| 12 | |openbookqa | 0|acc |0.3920|_ |0.0219| 13 | | | |acc_norm|0.4740|_ |0.0224| 14 | |piqa | 0|acc |0.8286|_ |0.0088| 15 | | | |acc_norm|0.8351|_ |0.0087| 16 | |winogrande | 0|acc |0.7893|_ |0.0115| 17 | ``` 18 | Average: 76.93 19 | 20 | AGIEval: 21 | ``` 22 | | Task |Version| Metric |Value | |Stderr| 23 | |------------------------------|------:|--------|-----:|---|-----:| 24 | |agieval_aqua_rat | 
0|acc |0.4055|_ |0.0309| 25 | | | |acc_norm|0.4094|_ |0.0309| 26 | |agieval_logiqa_en | 0|acc |0.5100|_ |0.0196| 27 | | | |acc_norm|0.5023|_ |0.0196| 28 | |agieval_lsat_ar | 0|acc |0.2783|_ |0.0296| 29 | | | |acc_norm|0.2957|_ |0.0302| 30 | |agieval_lsat_lr | 0|acc |0.7451|_ |0.0193| 31 | | | |acc_norm|0.7333|_ |0.0196| 32 | |agieval_lsat_rc | 0|acc |0.8290|_ |0.0230| 33 | | | |acc_norm|0.8104|_ |0.0239| 34 | |agieval_sat_en | 0|acc |0.9029|_ |0.0207| 35 | | | |acc_norm|0.9029|_ |0.0207| 36 | |agieval_sat_en_without_passage| 0|acc |0.5825|_ |0.0344| 37 | | | |acc_norm|0.5631|_ |0.0346| 38 | |agieval_sat_math | 0|acc |0.6318|_ |0.0326| 39 | | | |acc_norm|0.6227|_ |0.0328| 40 | ``` 41 | Average: 60.50 42 | 43 | BigBench: 44 | ``` 45 | | Task |Version| Metric |Value | |Stderr| 46 | |------------------------------------------------|------:|---------------------|-----:|---|-----:| 47 | |bigbench_causal_judgement | 0|multiple_choice_grade|0.6737|_ |0.0341| 48 | |bigbench_date_understanding | 0|multiple_choice_grade|0.7724|_ |0.0219| 49 | |bigbench_disambiguation_qa | 0|multiple_choice_grade|0.3256|_ |0.0292| 50 | |bigbench_geometric_shapes | 0|multiple_choice_grade|0.4763|_ |0.0264| 51 | | | |exact_str_match |0.0000|_ |0.0000| 52 | |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.4720|_ |0.0223| 53 | |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.3486|_ |0.0180| 54 | |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.6367|_ |0.0278| 55 | |bigbench_movie_recommendation | 0|multiple_choice_grade|0.5220|_ |0.0224| 56 | |bigbench_navigate | 0|multiple_choice_grade|0.5930|_ |0.0155| 57 | |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.8600|_ |0.0078| 58 | |bigbench_ruin_names | 0|multiple_choice_grade|0.7411|_ |0.0207| 59 | |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.5281|_ |0.0158| 60 | |bigbench_snarks | 0|multiple_choice_grade|0.6961|_ |0.0343| 61 | |bigbench_sports_understanding | 0|multiple_choice_grade|0.5751|_ |0.0158| 62 | |bigbench_temporal_sequences | 0|multiple_choice_grade|0.9880|_ |0.0034| 63 | |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.2296|_ |0.0119| 64 | |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1691|_ |0.0090| 65 | |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.6367|_ |0.0278| 66 | ``` 67 | Average: 56.91 68 | 69 | TruthfulQA: 70 | ```| Task |Version|Metric|Value | |Stderr| 71 | |-------------|------:|------|-----:|---|-----:| 72 | |truthfulqa_mc| 1|mc1 |0.4565|_ |0.0174| 73 | | | |mc2 |0.6288|_ |0.0151| 74 | ``` 75 | 62.88 76 | 77 | IFEval: 78 | 87.99 79 | 80 | MTBench: 81 | First Turn - 9.1625 82 | Second Turn - 8.925 83 | Average - 9.04375 -------------------------------------------------------------------------------- /benchmark-logs/Llama-2-7B-Base.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | | Task |Version| Metric |Value | |Stderr| 4 | |-------------|------:|--------|-----:|---|-----:| 5 | |arc_challenge| 0|acc |0.4343|± |0.0145| 6 | | | |acc_norm|0.4642|± |0.0146| 7 | |arc_easy | 0|acc |0.7626|± |0.0087| 8 | | | |acc_norm|0.7454|± |0.0089| 9 | |boolq | 1|acc |0.7771|± |0.0073| 10 | |hellaswag | 0|acc |0.5716|± |0.0049| 11 | | | |acc_norm|0.7601|± |0.0043| 12 | |openbookqa | 0|acc |0.3140|± |0.0208| 13 | | | |acc_norm|0.4420|± |0.0222| 14 | |piqa | 0|acc |0.7807|± |0.0097| 15 | | | |acc_norm|0.7905|± 
|0.0095| 16 | |winogrande | 0|acc |0.6914|± |0.0130| 17 | ``` 18 | Average: 66.72 19 | 20 | AGIEval: 21 | ``` 22 | | Task |Version| Metric |Value | |Stderr| 23 | |------------------------------|------:|--------|-----:|---|-----:| 24 | |agieval_aqua_rat | 0|acc |0.2598|± |0.0276| 25 | | | |acc_norm|0.2756|± |0.0281| 26 | |agieval_logiqa_en | 0|acc |0.2473|± |0.0169| 27 | | | |acc_norm|0.2965|± |0.0179| 28 | |agieval_lsat_ar | 0|acc |0.2391|± |0.0282| 29 | | | |acc_norm|0.2000|± |0.0264| 30 | |agieval_lsat_lr | 0|acc |0.2431|± |0.0190| 31 | | | |acc_norm|0.2255|± |0.0185| 32 | |agieval_lsat_rc | 0|acc |0.2528|± |0.0265| 33 | | | |acc_norm|0.2268|± |0.0256| 34 | |agieval_sat_en | 0|acc |0.3495|± |0.0333| 35 | | | |acc_norm|0.2524|± |0.0303| 36 | |agieval_sat_en_without_passage| 0|acc |0.3350|± |0.0330| 37 | | | |acc_norm|0.2136|± |0.0286| 38 | |agieval_sat_math | 0|acc |0.2455|± |0.0291| 39 | | | |acc_norm|0.2182|± |0.0279| 40 | ``` 41 | Average: 26.90 42 | 43 | BigBench: 44 | ``` 45 | | Task |Version| Metric |Value | |Stderr| 46 | |------------------------------------------------|------:|---------------------|-----:|---|-----:| 47 | |bigbench_causal_judgement | 0|multiple_choice_grade|0.4947|± |0.0364| 48 | |bigbench_date_understanding | 0|multiple_choice_grade|0.6314|± |0.0251| 49 | |bigbench_disambiguation_qa | 0|multiple_choice_grade|0.3101|± |0.0289| 50 | |bigbench_geometric_shapes | 0|multiple_choice_grade|0.0919|± |0.0153| 51 | | | |exact_str_match |0.0000|± |0.0000| 52 | |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.2180|± |0.0185| 53 | |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.1486|± |0.0135| 54 | |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.3633|± |0.0278| 55 | |bigbench_movie_recommendation | 0|multiple_choice_grade|0.2400|± |0.0191| 56 | |bigbench_navigate | 0|multiple_choice_grade|0.5110|± |0.0158| 57 | |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.3395|± |0.0106| 58 | |bigbench_ruin_names | 0|multiple_choice_grade|0.2500|± |0.0205| 59 | |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.1904|± |0.0124| 60 | |bigbench_snarks | 0|multiple_choice_grade|0.5304|± |0.0372| 61 | |bigbench_sports_understanding | 0|multiple_choice_grade|0.5112|± |0.0159| 62 | |bigbench_temporal_sequences | 0|multiple_choice_grade|0.2550|± |0.0138| 63 | |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.2000|± |0.0113| 64 | |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1554|± |0.0087| 65 | |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.3633|± |0.0278| 66 | ``` 67 | Average: 32.24 68 | 69 | TruthfulQA: 70 | ``` 71 | 72 | ``` -------------------------------------------------------------------------------- /benchmark-logs/Llama-3.1-8B-Instruct.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | | Task |Version| Metric |Value | |Stderr| 4 | |-------------|------:|--------|-----:|---|-----:| 5 | |arc_challenge| 0|acc |0.5179|± |0.0146| 6 | | | |acc_norm|0.5512|± |0.0145| 7 | |arc_easy | 0|acc |0.8178|± |0.0079| 8 | | | |acc_norm|0.7971|± |0.0083| 9 | |boolq | 1|acc |0.8407|± |0.0064| 10 | |hellaswag | 0|acc |0.5908|± |0.0049| 11 | | | |acc_norm|0.7924|± |0.0040| 12 | |openbookqa | 0|acc |0.3340|± |0.0211| 13 | | | |acc_norm|0.4320|± |0.0222| 14 | |piqa | 0|acc |0.8003|± |0.0093| 15 | | | |acc_norm|0.8101|± |0.0092| 16 | |winogrande | 0|acc 
|0.7395|± |0.0123| 17 | ``` 18 | Average: 70.90 19 | 20 | AGIEval: 21 | ``` 22 | | Task |Version| Metric |Value | |Stderr| 23 | |------------------------------|------:|--------|-----:|---|-----:| 24 | |agieval_aqua_rat | 0|acc |0.2953|± |0.0287| 25 | | | |acc_norm|0.2598|± |0.0276| 26 | |agieval_logiqa_en | 0|acc |0.3748|± |0.0190| 27 | | | |acc_norm|0.3825|± |0.0191| 28 | |agieval_lsat_ar | 0|acc |0.2087|± |0.0269| 29 | | | |acc_norm|0.2000|± |0.0264| 30 | |agieval_lsat_lr | 0|acc |0.4588|± |0.0221| 31 | | | |acc_norm|0.4176|± |0.0219| 32 | |agieval_lsat_rc | 0|acc |0.6283|± |0.0295| 33 | | | |acc_norm|0.5613|± |0.0303| 34 | |agieval_sat_en | 0|acc |0.7816|± |0.0289| 35 | | | |acc_norm|0.7039|± |0.0319| 36 | |agieval_sat_en_without_passage| 0|acc |0.3883|± |0.0340| 37 | | | |acc_norm|0.3641|± |0.0336| 38 | |agieval_sat_math | 0|acc |0.4182|± |0.0333| 39 | | | |acc_norm|0.3500|± |0.0322| 40 | ``` 41 | Average: 40.49 42 | 43 | BigBench: 44 | ``` 45 | | Task |Version| Metric |Value | |Stderr| 46 | |------------------------------------------------|------:|---------------------|-----:|---|-----:| 47 | |bigbench_causal_judgement | 0|multiple_choice_grade|0.5053|± |0.0364| 48 | |bigbench_date_understanding | 0|multiple_choice_grade|0.7154|± |0.0235| 49 | |bigbench_disambiguation_qa | 0|multiple_choice_grade|0.2984|± |0.0285| 50 | |bigbench_geometric_shapes | 0|multiple_choice_grade|0.3287|± |0.0248| 51 | | | |exact_str_match |0.0000|± |0.0000| 52 | |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.2980|± |0.0205| 53 | |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.2286|± |0.0159| 54 | |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.5700|± |0.0286| 55 | |bigbench_movie_recommendation | 0|multiple_choice_grade|0.3260|± |0.0210| 56 | |bigbench_navigate | 0|multiple_choice_grade|0.5760|± |0.0156| 57 | |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.6325|± |0.0108| 58 | |bigbench_ruin_names | 0|multiple_choice_grade|0.4754|± |0.0236| 59 | |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.3216|± |0.0148| 60 | |bigbench_snarks | 0|multiple_choice_grade|0.6685|± |0.0351| 61 | |bigbench_sports_understanding | 0|multiple_choice_grade|0.5030|± |0.0159| 62 | |bigbench_temporal_sequences | 0|multiple_choice_grade|0.6000|± |0.0155| 63 | |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.2168|± |0.0117| 64 | |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1697|± |0.0090| 65 | |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.5700|± |0.0286| 66 | ``` 67 | Average: 44.46 68 | 69 | TruthfulQA: 70 | ``` 71 | | Task |Version|Metric|Value | |Stderr| 72 | |-------------|------:|------|-----:|---|-----:| 73 | |truthfulqa_mc| 1|mc1 |0.3672|± |0.0169| 74 | | | |mc2 |0.5399|± |0.0150| 75 | ``` 76 | -------------------------------------------------------------------------------- /benchmark-logs/Meta-Llama-3-70B-Instruct.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | | Task |Version| Metric |Value | |Stderr| 4 | |-------------|------:|--------|-----:|---|-----:| 5 | |arc_challenge| 0|acc |0.6177|_ |0.0142| 6 | | | |acc_norm|0.6459|_ |0.0140| 7 | |arc_easy | 0|acc |0.8615|_ |0.0071| 8 | | | |acc_norm|0.8497|_ |0.0073| 9 | |boolq | 1|acc |0.8752|_ |0.0058| 10 | |hellaswag | 0|acc |0.6372|_ |0.0048| 11 | | | |acc_norm|0.8253|_ |0.0038| 12 | |openbookqa | 0|acc |0.3340|_ 
|0.0211| 13 | | | |acc_norm|0.4300|_ |0.0222| 14 | |piqa | 0|acc |0.8166|_ |0.0090| 15 | | | |acc_norm|0.8221|_ |0.0089| 16 | |winogrande | 0|acc |0.7601|_ |0.0120| 17 | 18 | ``` 19 | Average: 74.40 20 | 21 | AGIEval: 22 | ``` 23 | | Task |Version| Metric |Value | |Stderr| 24 | |------------------------------|------:|--------|-----:|---|-----:| 25 | |agieval_aqua_rat | 0|acc |0.4488|_ |0.0313| 26 | | | |acc_norm|0.4370|_ |0.0312| 27 | |agieval_logiqa_en | 0|acc |0.5346|_ |0.0196| 28 | | | |acc_norm|0.5300|_ |0.0196| 29 | |agieval_lsat_ar | 0|acc |0.3000|_ |0.0303| 30 | | | |acc_norm|0.2870|_ |0.0299| 31 | |agieval_lsat_lr | 0|acc |0.7412|_ |0.0194| 32 | | | |acc_norm|0.7196|_ |0.0199| 33 | |agieval_lsat_rc | 0|acc |0.8401|_ |0.0224| 34 | | | |acc_norm|0.8253|_ |0.0232| 35 | |agieval_sat_en | 0|acc |0.9126|_ |0.0197| 36 | | | |acc_norm|0.9078|_ |0.0202| 37 | |agieval_sat_en_without_passage| 0|acc |0.5825|_ |0.0344| 38 | | | |acc_norm|0.5534|_ |0.0347| 39 | |agieval_sat_math | 0|acc |0.6591|_ |0.0320| 40 | | | |acc_norm|0.6227|_ |0.0328| 41 | ``` 42 | Average: 61.04 43 | 44 | BigBench: 45 | ``` 46 | | Task |Version| Metric |Value | |Stderr| 47 | |------------------------------------------------|------:|---------------------|-----:|---|-----:| 48 | |bigbench_causal_judgement | 0|multiple_choice_grade|0.6421|_ |0.0349| 49 | |bigbench_date_understanding | 0|multiple_choice_grade|0.7778|_ |0.0217| 50 | |bigbench_disambiguation_qa | 0|multiple_choice_grade|0.3101|_ |0.0289| 51 | |bigbench_geometric_shapes | 0|multiple_choice_grade|0.3955|_ |0.0258| 52 | | | |exact_str_match |0.0000|_ |0.0000| 53 | |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.4820|_ |0.0224| 54 | |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.3671|_ |0.0182| 55 | |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.6800|_ |0.0270| 56 | |bigbench_movie_recommendation | 0|multiple_choice_grade|0.3460|_ |0.0213| 57 | |bigbench_navigate | 0|multiple_choice_grade|0.6220|_ |0.0153| 58 | |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.8335|_ |0.0083| 59 | |bigbench_ruin_names | 0|multiple_choice_grade|0.4040|_ |0.0232| 60 | |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.4659|_ |0.0158| 61 | |bigbench_snarks | 0|multiple_choice_grade|0.5525|_ |0.0371| 62 | |bigbench_sports_understanding | 0|multiple_choice_grade|0.5213|_ |0.0159| 63 | |bigbench_temporal_sequences | 0|multiple_choice_grade|0.9950|_ |0.0022| 64 | |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.2344|_ |0.0120| 65 | |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1571|_ |0.0087| 66 | |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.6800|_ |0.0270| 67 | ``` 68 | Average: 52.59 69 | 70 | TruthfulQA: 71 | ``` 72 | | Task |Version|Metric|Value | |Stderr| 73 | |-------------|------:|------|-----:|---|-----:| 74 | |truthfulqa_mc| 1|mc1 |0.4394|_ |0.0174| 75 | | | |mc2 |0.6184|_ |0.0154| 76 | ``` 77 | -------------------------------------------------------------------------------- /benchmark-logs/Mistral-7B-Base.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | | Tasks |Version|Filter| Metric |Value | |Stderr| 4 | |--------------|-------|------|----------|-----:|---|------| 5 | |arc_challenge |Yaml |none |acc |0.5026| | | 6 | | | |none |acc_norm |0.5384| | | 7 | |arc_easy |Yaml |none |acc |0.8085| | | 8 | | | |none |acc_norm 
|0.7946| | | 9 | |boolq |Yaml |none |acc |0.8358| | | 10 | |hellaswag |Yaml |none |acc |0.6123| | | 11 | | | |none |acc_norm |0.8102| | | 12 | |openbookqa |Yaml |none |acc |0.3220| | | 13 | | | |none |acc_norm |0.4420| | | 14 | |piqa |Yaml |none |acc |0.8074| | | 15 | | | |none |acc_norm |0.8210| | | 16 | |winogrande |Yaml |none |acc |0.7395| | | 17 | ``` 18 | Average: 71.16 19 | 20 | AGIEval: 21 | ``` 22 | | Task |Version| Metric |Value | |Stderr| 23 | |------------------------------|------:|--------|-----:|---|-----:| 24 | |agieval_aqua_rat | 0|acc |0.2244|± |0.0262| 25 | | | |acc_norm|0.2520|± |0.0273| 26 | |agieval_logiqa_en | 0|acc |0.2657|± |0.0173| 27 | | | |acc_norm|0.3164|± |0.0182| 28 | |agieval_lsat_ar | 0|acc |0.2304|± |0.0278| 29 | | | |acc_norm|0.2217|± |0.0275| 30 | |agieval_lsat_lr | 0|acc |0.3373|± |0.0210| 31 | | | |acc_norm|0.2922|± |0.0202| 32 | |agieval_lsat_rc | 0|acc |0.4238|± |0.0302| 33 | | | |acc_norm|0.3197|± |0.0285| 34 | |agieval_sat_en | 0|acc |0.5243|± |0.0349| 35 | | | |acc_norm|0.4854|± |0.0349| 36 | |agieval_sat_en_without_passage| 0|acc |0.3981|± |0.0342| 37 | | | |acc_norm|0.3107|± |0.0323| 38 | |agieval_sat_math | 0|acc |0.3000|± |0.0310| 39 | | | |acc_norm|0.2545|± |0.0294| 40 | ``` 41 | Average: 30.65 42 | 43 | BigBench: 44 | ``` 45 | | Task |Version| Metric |Value | |Stderr| 46 | |------------------------------------------------|------:|---------------------|-----:|---|-----:| 47 | |bigbench_causal_judgement | 0|multiple_choice_grade|0.5263|± |0.0363| 48 | |bigbench_date_understanding | 0|multiple_choice_grade|0.6640|± |0.0246| 49 | |bigbench_disambiguation_qa | 0|multiple_choice_grade|0.3566|± |0.0299| 50 | |bigbench_geometric_shapes | 0|multiple_choice_grade|0.1003|± |0.0159| 51 | | | |exact_str_match |0.0000|± |0.0000| 52 | |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.2400|± |0.0191| 53 | |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.1729|± |0.0143| 54 | |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.3633|± |0.0278| 55 | |bigbench_movie_recommendation | 0|multiple_choice_grade|0.3280|± |0.0210| 56 | |bigbench_navigate | 0|multiple_choice_grade|0.5000|± |0.0158| 57 | |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.4510|± |0.0111| 58 | |bigbench_ruin_names | 0|multiple_choice_grade|0.3281|± |0.0222| 59 | |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.2054|± |0.0128| 60 | |bigbench_snarks | 0|multiple_choice_grade|0.5414|± |0.0371| 61 | |bigbench_sports_understanding | 0|multiple_choice_grade|0.4970|± |0.0159| 62 | |bigbench_temporal_sequences | 0|multiple_choice_grade|0.2720|± |0.0141| 63 | |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.2280|± |0.0119| 64 | |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1486|± |0.0085| 65 | |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.3633|± |0.0278| 66 | ``` 67 | Average: 34.92 68 | 69 | TruthfulQA: 70 | ``` 71 | | Task |Version|Metric|Value | |Stderr| 72 | |-------------|------:|------|-----:|---|-----:| 73 | |truthfulqa_mc| 1|mc1 |0.2815|± |0.0157| 74 | | | |mc2 |0.4263|± |0.0142| 75 | ``` -------------------------------------------------------------------------------- /benchmark-logs/Mistral-Instruct-v0.1-Mistral-7B.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | | Task |Version| Metric |Value | |Stderr| 4 | 
|-------------|------:|--------|-----:|---|-----:| 5 | |arc_challenge| 0|acc |0.5017|± |0.0146| 6 | | | |acc_norm|0.5213|± |0.0146| 7 | |arc_easy | 0|acc |0.8005|± |0.0082| 8 | | | |acc_norm|0.7681|± |0.0087| 9 | |boolq | 1|acc |0.8028|± |0.0070| 10 | |hellaswag | 0|acc |0.5629|± |0.0050| 11 | | | |acc_norm|0.7464|± |0.0043| 12 | |openbookqa | 0|acc |0.3260|± |0.0210| 13 | | | |acc_norm|0.4320|± |0.0222| 14 | |piqa | 0|acc |0.7938|± |0.0094| 15 | | | |acc_norm|0.7905|± |0.0095| 16 | |winogrande | 0|acc |0.6953|± |0.0129| 17 | ``` 18 | Average: 67.95 19 | 20 | AGIEval: 21 | ``` 22 | | Task |Version| Metric |Value | |Stderr| 23 | |------------------------------|------:|--------|-----:|---|-----:| 24 | |agieval_aqua_rat | 0|acc |0.1969|± |0.0250| 25 | | | |acc_norm|0.1969|± |0.0250| 26 | |agieval_logiqa_en | 0|acc |0.3118|± |0.0182| 27 | | | |acc_norm|0.3364|± |0.0185| 28 | |agieval_lsat_ar | 0|acc |0.2217|± |0.0275| 29 | | | |acc_norm|0.2304|± |0.0278| 30 | |agieval_lsat_lr | 0|acc |0.3686|± |0.0214| 31 | | | |acc_norm|0.3706|± |0.0214| 32 | |agieval_lsat_rc | 0|acc |0.4758|± |0.0305| 33 | | | |acc_norm|0.4238|± |0.0302| 34 | |agieval_sat_en | 0|acc |0.5583|± |0.0347| 35 | | | |acc_norm|0.5437|± |0.0348| 36 | |agieval_sat_en_without_passage| 0|acc |0.3544|± |0.0334| 37 | | | |acc_norm|0.3155|± |0.0325| 38 | |agieval_sat_math | 0|acc |0.2773|± |0.0302| 39 | | | |acc_norm|0.2591|± |0.0296| 40 | ``` 41 | Average: 33.46 42 | 43 | BigBench: 44 | ``` 45 | | Task |Version| Metric |Value | |Stderr| 46 | |------------------------------------------------|------:|---------------------|-----:|---|-----:| 47 | |bigbench_causal_judgement | 0|multiple_choice_grade|0.5263|± |0.0363| 48 | |bigbench_date_understanding | 0|multiple_choice_grade|0.6640|± |0.0246| 49 | |bigbench_disambiguation_qa | 0|multiple_choice_grade|0.3566|± |0.0299| 50 | |bigbench_geometric_shapes | 0|multiple_choice_grade|0.1003|± |0.0159| 51 | | | |exact_str_match |0.0000|± |0.0000| 52 | |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.2400|± |0.0191| 53 | |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.1714|± |0.0143| 54 | |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.3633|± |0.0278| 55 | |bigbench_movie_recommendation | 0|multiple_choice_grade|0.3260|± |0.0210| 56 | |bigbench_navigate | 0|multiple_choice_grade|0.5000|± |0.0158| 57 | |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.4520|± |0.0111| 58 | |bigbench_ruin_names | 0|multiple_choice_grade|0.3281|± |0.0222| 59 | |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.2044|± |0.0128| 60 | |bigbench_snarks | 0|multiple_choice_grade|0.5414|± |0.0371| 61 | |bigbench_sports_understanding | 0|multiple_choice_grade|0.4970|± |0.0159| 62 | |bigbench_temporal_sequences | 0|multiple_choice_grade|0.2720|± |0.0141| 63 | |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.2280|± |0.0119| 64 | |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1514|± |0.0086| 65 | |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.3633|± |0.0278| 66 | ``` 67 | Average: 34.92 68 | 69 | TruthfulQA: 70 | ``` 71 | | Task |Version|Metric|Value | |Stderr| 72 | |-------------|------:|------|-----:|---|-----:| 73 | |truthfulqa_mc| 1|mc1 |0.3917|± |0.0171| 74 | | | |mc2 |0.5592|± |0.0153| 75 | ``` -------------------------------------------------------------------------------- /benchmark-logs/Mixtral-7x8-Base.md: 
-------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | | Task |Version| Metric |Value | |Stderr| 4 | |-------------|------:|--------|-----:|---|-----:| 5 | |arc_challenge| 0|acc |0.5674|_ |0.0145| 6 | | | |acc_norm|0.5964|_ |0.0143| 7 | |arc_easy | 0|acc |0.8418|_ |0.0075| 8 | | | |acc_norm|0.8346|_ |0.0076| 9 | |boolq | 1|acc |0.8508|_ |0.0062| 10 | |hellaswag | 0|acc |0.6489|_ |0.0048| 11 | | | |acc_norm|0.8401|_ |0.0037| 12 | |openbookqa | 0|acc |0.3520|_ |0.0214| 13 | | | |acc_norm|0.4700|_ |0.0223| 14 | |piqa | 0|acc |0.8232|_ |0.0089| 15 | | | |acc_norm|0.8373|_ |0.0086| 16 | |winogrande | 0|acc |0.7632|_ |0.0119| 17 | ``` 18 | Average: 74.18 19 | 20 | AGIEval: 21 | ``` 22 | | Task |Version| Metric |Value | |Stderr| 23 | |------------------------------|------:|--------|-----:|---|-----:| 24 | |agieval_aqua_rat | 0|acc |0.1969|_ |0.0250| 25 | | | |acc_norm|0.1969|_ |0.0250| 26 | |agieval_logiqa_en | 0|acc |0.3472|_ |0.0187| 27 | | | |acc_norm|0.3610|_ |0.0188| 28 | |agieval_lsat_ar | 0|acc |0.2261|_ |0.0276| 29 | | | |acc_norm|0.1957|_ |0.0262| 30 | |agieval_lsat_lr | 0|acc |0.3392|_ |0.0210| 31 | | | |acc_norm|0.3294|_ |0.0208| 32 | |agieval_lsat_rc | 0|acc |0.4907|_ |0.0305| 33 | | | |acc_norm|0.4201|_ |0.0301| 34 | |agieval_sat_en | 0|acc |0.7087|_ |0.0317| 35 | | | |acc_norm|0.6553|_ |0.0332| 36 | |agieval_sat_en_without_passage| 0|acc |0.4903|_ |0.0349| 37 | | | |acc_norm|0.4029|_ |0.0343| 38 | |agieval_sat_math | 0|acc |0.4455|_ |0.0336| 39 | | | |acc_norm|0.3682|_ |0.0326| 40 | ``` 41 | Average: 41.85 42 | 43 | BigBench: 44 | ``` 45 | | Task |Version| Metric |Value | |Stderr| 46 | |------------------------------------------------|------:|---------------------|-----:|---|-----:| 47 | |bigbench_causal_judgement | 0|multiple_choice_grade|0.5895|_ |0.0358| 48 | |bigbench_date_understanding | 0|multiple_choice_grade|0.7344|_ |0.0230| 49 | |bigbench_disambiguation_qa | 0|multiple_choice_grade|0.3411|_ |0.0296| 50 | |bigbench_geometric_shapes | 0|multiple_choice_grade|0.2061|_ |0.0214| 51 | | | |exact_str_match |0.0000|_ |0.0000| 52 | |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.2740|_ |0.0200| 53 | |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.2029|_ |0.0152| 54 | |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.4167|_ |0.0285| 55 | |bigbench_movie_recommendation | 0|multiple_choice_grade|0.3840|_ |0.0218| 56 | |bigbench_navigate | 0|multiple_choice_grade|0.5000|_ |0.0158| 57 | |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.5430|_ |0.0111| 58 | |bigbench_ruin_names | 0|multiple_choice_grade|0.5536|_ |0.0235| 59 | |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.2495|_ |0.0137| 60 | |bigbench_snarks | 0|multiple_choice_grade|0.5414|_ |0.0371| 61 | |bigbench_sports_understanding | 0|multiple_choice_grade|0.4970|_ |0.0159| 62 | |bigbench_temporal_sequences | 0|multiple_choice_grade|0.2820|_ |0.0142| 63 | |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.2184|_ |0.0117| 64 | |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1503|_ |0.0085| 65 | |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.4167|_ |0.0285| 66 | ``` 67 | Average: 39.45 68 | 69 | TruthfulQA: 70 | ``` 71 | | Task |Version|Metric|Value | |Stderr| 72 | |-------------|------:|------|-----:|---|-----:| 73 | |truthfulqa_mc| 1|mc1 |0.3403|_ |0.0166| 74 | | | |mc2 |0.4851|_ |0.0145| 
75 | ``` -------------------------------------------------------------------------------- /benchmark-logs/Mixtral-7x8-Instruct-v0.1.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | | Task |Version| Metric |Value | |Stderr| 4 | |-------------|------:|--------|-----:|---|-----:| 5 | |arc_challenge| 0|acc |0.6220|_ |0.0142| 6 | | | |acc_norm|0.6613|_ |0.0138| 7 | |arc_easy | 0|acc |0.8699|_ |0.0069| 8 | | | |acc_norm|0.8514|_ |0.0073| 9 | |boolq | 1|acc |0.8859|_ |0.0056| 10 | |hellaswag | 0|acc |0.6752|_ |0.0047| 11 | | | |acc_norm|0.8600|_ |0.0035| 12 | |openbookqa | 0|acc |0.3640|_ |0.0215| 13 | | | |acc_norm|0.4740|_ |0.0224| 14 | |piqa | 0|acc |0.8368|_ |0.0086| 15 | | | |acc_norm|0.8471|_ |0.0084| 16 | |winogrande | 0|acc |0.7687|_ |0.0119| 17 | ``` 18 | Average: 76.41 19 | 20 | AGIEval: 21 | ``` 22 | | Task |Version| Metric |Value | |Stderr| 23 | |------------------------------|------:|--------|-----:|---|-----:| 24 | |agieval_aqua_rat | 0|acc |0.2717|_ |0.0280| 25 | | | |acc_norm|0.2559|_ |0.0274| 26 | |agieval_logiqa_en | 0|acc |0.4117|_ |0.0193| 27 | | | |acc_norm|0.4071|_ |0.0193| 28 | |agieval_lsat_ar | 0|acc |0.1565|_ |0.0240| 29 | | | |acc_norm|0.1609|_ |0.0243| 30 | |agieval_lsat_lr | 0|acc |0.5137|_ |0.0222| 31 | | | |acc_norm|0.5039|_ |0.0222| 32 | |agieval_lsat_rc | 0|acc |0.6357|_ |0.0294| 33 | | | |acc_norm|0.6171|_ |0.0297| 34 | |agieval_sat_en | 0|acc |0.7864|_ |0.0286| 35 | | | |acc_norm|0.7670|_ |0.0295| 36 | |agieval_sat_en_without_passage| 0|acc |0.5243|_ |0.0349| 37 | | | |acc_norm|0.5049|_ |0.0349| 38 | |agieval_sat_math | 0|acc |0.4545|_ |0.0336| 39 | | | |acc_norm|0.4091|_ |0.0332| 40 | ``` 41 | Average: 45.32 42 | 43 | BigBench: 44 | ``` 45 | | Task |Version| Metric |Value | |Stderr| 46 | |------------------------------------------------|------:|---------------------|-----:|---|-----:| 47 | |bigbench_causal_judgement | 0|multiple_choice_grade|0.5579|_ |0.0361| 48 | |bigbench_date_understanding | 0|multiple_choice_grade|0.7263|_ |0.0232| 49 | |bigbench_disambiguation_qa | 0|multiple_choice_grade|0.4264|_ |0.0308| 50 | |bigbench_geometric_shapes | 0|multiple_choice_grade|0.3760|_ |0.0256| 51 | | | |exact_str_match |0.1253|_ |0.0175| 52 | |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.3040|_ |0.0206| 53 | |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.2343|_ |0.0160| 54 | |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.4967|_ |0.0289| 55 | |bigbench_movie_recommendation | 0|multiple_choice_grade|0.3560|_ |0.0214| 56 | |bigbench_navigate | 0|multiple_choice_grade|0.5000|_ |0.0158| 57 | |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.6980|_ |0.0103| 58 | |bigbench_ruin_names | 0|multiple_choice_grade|0.5759|_ |0.0234| 59 | |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.2886|_ |0.0143| 60 | |bigbench_snarks | 0|multiple_choice_grade|0.6796|_ |0.0348| 61 | |bigbench_sports_understanding | 0|multiple_choice_grade|0.6481|_ |0.0152| 62 | |bigbench_temporal_sequences | 0|multiple_choice_grade|0.6960|_ |0.0146| 63 | |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.2360|_ |0.0120| 64 | |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1617|_ |0.0088| 65 | |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.4967|_ |0.0289| 66 | ``` 67 | Average: 46.99 68 | 69 | TruthfulQA: 70 | ``` 71 | hf-causal-experimental 
(pretrained=mistralAI/Mixtral-8x7B-Instruct-v0.1,dtype=float16,trust_remote_code=True,use_accelerate=True), limit: None, provide_description: False, num_fewshot: 0, batch_size: 480 72 | | Task |Version|Metric|Value | |Stderr| 73 | |-------------|------:|------|-----:|---|-----:| 74 | |truthfulqa_mc| 1|mc1 |0.5018|_ |0.0175| 75 | | | |mc2 |0.6457|_ |0.0155| 76 | ``` -------------------------------------------------------------------------------- /benchmark-logs/Neural-Hermes-2.5-Mistral-7B.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | | Task |Version| Metric |Value | |Stderr| 4 | |-------------|------:|--------|-----:|---|-----:| 5 | |arc_challenge| 0|acc |0.5734|± |0.0145| 6 | | | |acc_norm|0.6067|± |0.0143| 7 | |arc_easy | 0|acc |0.8384|± |0.0076| 8 | | | |acc_norm|0.8131|± |0.0080| 9 | |boolq | 1|acc |0.8642|± |0.0060| 10 | |hellaswag | 0|acc |0.6379|± |0.0048| 11 | | | |acc_norm|0.8223|± |0.0038| 12 | |openbookqa | 0|acc |0.3400|± |0.0212| 13 | | | |acc_norm|0.4480|± |0.0223| 14 | |piqa | 0|acc |0.8161|± |0.0090| 15 | | | |acc_norm|0.8275|± |0.0088| 16 | |winogrande | 0|acc |0.7459|± |0.0122| 17 | ``` 18 | Average: 73.25 19 | 20 | AGIEval: 21 | ``` 22 | | Task |Version| Metric |Value | |Stderr| 23 | |------------------------------|------:|--------|-----:|---|-----:| 24 | |agieval_aqua_rat | 0|acc |0.2244|± |0.0262| 25 | | | |acc_norm|0.2283|± |0.0264| 26 | |agieval_logiqa_en | 0|acc |0.4071|± |0.0193| 27 | | | |acc_norm|0.4132|± |0.0193| 28 | |agieval_lsat_ar | 0|acc |0.2478|± |0.0285| 29 | | | |acc_norm|0.2391|± |0.0282| 30 | |agieval_lsat_lr | 0|acc |0.4922|± |0.0222| 31 | | | |acc_norm|0.5000|± |0.0222| 32 | |agieval_lsat_rc | 0|acc |0.6022|± |0.0299| 33 | | | |acc_norm|0.5911|± |0.0300| 34 | |agieval_sat_en | 0|acc |0.7427|± |0.0305| 35 | | | |acc_norm|0.7379|± |0.0307| 36 | |agieval_sat_en_without_passage| 0|acc |0.4563|± |0.0348| 37 | | | |acc_norm|0.4466|± |0.0347| 38 | |agieval_sat_math | 0|acc |0.3500|± |0.0322| 39 | | | |acc_norm|0.3455|± |0.0321| 40 | ``` 41 | Average: 43.77 42 | 43 | BigBench: 44 | ``` 45 | | Task |Version| Metric |Value | |Stderr| 46 | |------------------------------------------------|------:|---------------------|-----:|---|-----:| 47 | |bigbench_causal_judgement | 0|multiple_choice_grade|0.5368|± |0.0363| 48 | |bigbench_date_understanding | 0|multiple_choice_grade|0.6748|± |0.0244| 49 | |bigbench_disambiguation_qa | 0|multiple_choice_grade|0.3605|± |0.0300| 50 | |bigbench_geometric_shapes | 0|multiple_choice_grade|0.2507|± |0.0229| 51 | | | |exact_str_match |0.1727|± |0.0200| 52 | |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.2880|± |0.0203| 53 | |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.2000|± |0.0151| 54 | |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.4667|± |0.0289| 55 | |bigbench_movie_recommendation | 0|multiple_choice_grade|0.3680|± |0.0216| 56 | |bigbench_navigate | 0|multiple_choice_grade|0.4990|± |0.0158| 57 | |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.6600|± |0.0106| 58 | |bigbench_ruin_names | 0|multiple_choice_grade|0.4442|± |0.0235| 59 | |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.2485|± |0.0137| 60 | |bigbench_snarks | 0|multiple_choice_grade|0.7017|± |0.0341| 61 | |bigbench_sports_understanding | 0|multiple_choice_grade|0.6673|± |0.0150| 62 | |bigbench_temporal_sequences | 0|multiple_choice_grade|0.3030|± |0.0145| 63 | 
|bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.2112|± |0.0115| 64 | |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1703|± |0.0090| 65 | |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.4667|± |0.0289| 66 | ``` 67 | Average: 41.76 68 | 69 | TruthfulQA: 70 | ``` 71 | | Task |Version|Metric|Value | |Stderr| 72 | |-------------|------:|------|-----:|---|-----:| 73 | |truthfulqa_mc| 1|mc1 |0.3745|± |0.0169| 74 | | | |mc2 |0.5537|± |0.0154| 75 | ``` -------------------------------------------------------------------------------- /benchmark-logs/Nous-Hermes-1-Llama-2-7B.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | | Task |Version| Metric |Value | |Stderr| 4 | |-------------|------:|--------|-----:|---|-----:| 5 | |arc_challenge| 0|acc |0.4735|± |0.0146| 6 | | | |acc_norm|0.5017|± |0.0146| 7 | |arc_easy | 0|acc |0.7946|± |0.0083| 8 | | | |acc_norm|0.7605|± |0.0088| 9 | |boolq | 1|acc |0.8000|± |0.0070| 10 | |hellaswag | 0|acc |0.5924|± |0.0049| 11 | | | |acc_norm|0.7774|± |0.0042| 12 | |openbookqa | 0|acc |0.3600|± |0.0215| 13 | | | |acc_norm|0.4660|± |0.0223| 14 | |piqa | 0|acc |0.7889|± |0.0095| 15 | | | |acc_norm|0.7976|± |0.0094| 16 | |winogrande | 0|acc |0.6993|± |0.0129| 17 | ``` 18 | Average: 68.60 19 | 20 | AGIEval: 21 | ``` 22 | 23 | ``` 24 | Average: 25 | 26 | BigBench: 27 | ``` 28 | | Task |Version| Metric |Value | |Stderr| 29 | |------------------------------------------------|------:|---------------------|-----:|---|-----:| 30 | |bigbench_causal_judgement | 0|multiple_choice_grade|0.5579|± |0.0361| 31 | |bigbench_date_understanding | 0|multiple_choice_grade|0.6233|± |0.0253| 32 | |bigbench_disambiguation_qa | 0|multiple_choice_grade|0.3062|± |0.0288| 33 | |bigbench_geometric_shapes | 0|multiple_choice_grade|0.2006|± |0.0212| 34 | | | |exact_str_match |0.0000|± |0.0000| 35 | |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.2540|± |0.0195| 36 | |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.1657|± |0.0141| 37 | |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.4067|± |0.0284| 38 | |bigbench_movie_recommendation | 0|multiple_choice_grade|0.2780|± |0.0201| 39 | |bigbench_navigate | 0|multiple_choice_grade|0.5000|± |0.0158| 40 | |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.4405|± |0.0111| 41 | |bigbench_ruin_names | 0|multiple_choice_grade|0.2701|± |0.0210| 42 | |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.2034|± |0.0127| 43 | |bigbench_snarks | 0|multiple_choice_grade|0.5028|± |0.0373| 44 | |bigbench_sports_understanding | 0|multiple_choice_grade|0.6136|± |0.0155| 45 | |bigbench_temporal_sequences | 0|multiple_choice_grade|0.2720|± |0.0141| 46 | |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.1944|± |0.0112| 47 | |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1497|± |0.0085| 48 | |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.4067|± |0.0284| 49 | ``` 50 | Average: 35.25 51 | 52 | TruthfulQA: 53 | ``` 54 | | Task |Version|Metric|Value | |Stderr| 55 | |-------------|------:|------|-----:|---|-----:| 56 | |truthfulqa_mc| 1|mc1 |0.3341|± |0.0165| 57 | | | |mc2 |0.4910|± |0.0151| 58 | ``` -------------------------------------------------------------------------------- /benchmark-logs/Nous-Hermes-2-Mistral-7B-DPO.md: 
-------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | | Task |Version| Metric |Value | |Stderr| 4 | |-------------|------:|--------|-----:|---|-----:| 5 | |arc_challenge| 0|acc |0.5776|± |0.0144| 6 | | | |acc_norm|0.6220|± |0.0142| 7 | |arc_easy | 0|acc |0.8380|± |0.0076| 8 | | | |acc_norm|0.8245|± |0.0078| 9 | |boolq | 1|acc |0.8624|± |0.0060| 10 | |hellaswag | 0|acc |0.6418|± |0.0048| 11 | | | |acc_norm|0.8249|± |0.0038| 12 | |openbookqa | 0|acc |0.3420|± |0.0212| 13 | | | |acc_norm|0.4540|± |0.0223| 14 | |piqa | 0|acc |0.8177|± |0.0090| 15 | | | |acc_norm|0.8264|± |0.0088| 16 | |winogrande | 0|acc |0.7466|± |0.0122| 17 | ``` 18 | Average: 73.72 19 | 20 | AGIEval: 21 | ``` 22 | | Task |Version| Metric |Value | |Stderr| 23 | |------------------------------|------:|--------|-----:|---|-----:| 24 | |agieval_aqua_rat | 0|acc |0.2047|± |0.0254| 25 | | | |acc_norm|0.2283|± |0.0264| 26 | |agieval_logiqa_en | 0|acc |0.3779|± |0.0190| 27 | | | |acc_norm|0.3932|± |0.0192| 28 | |agieval_lsat_ar | 0|acc |0.2652|± |0.0292| 29 | | | |acc_norm|0.2522|± |0.0287| 30 | |agieval_lsat_lr | 0|acc |0.5216|± |0.0221| 31 | | | |acc_norm|0.5137|± |0.0222| 32 | |agieval_lsat_rc | 0|acc |0.5911|± |0.0300| 33 | | | |acc_norm|0.5836|± |0.0301| 34 | |agieval_sat_en | 0|acc |0.7427|± |0.0305| 35 | | | |acc_norm|0.7184|± |0.0314| 36 | |agieval_sat_en_without_passage| 0|acc |0.4612|± |0.0348| 37 | | | |acc_norm|0.4466|± |0.0347| 38 | |agieval_sat_math | 0|acc |0.3818|± |0.0328| 39 | | | |acc_norm|0.3545|± |0.0323| 40 | ``` 41 | Average: 43.63 42 | 43 | BigBench: 44 | ``` 45 | | Task |Version| Metric |Value | |Stderr| 46 | |------------------------------------------------|------:|---------------------|-----:|---|-----:| 47 | |bigbench_causal_judgement | 0|multiple_choice_grade|0.5579|± |0.0361| 48 | |bigbench_date_understanding | 0|multiple_choice_grade|0.6694|± |0.0245| 49 | |bigbench_disambiguation_qa | 0|multiple_choice_grade|0.3333|± |0.0294| 50 | |bigbench_geometric_shapes | 0|multiple_choice_grade|0.2061|± |0.0214| 51 | | | |exact_str_match |0.2256|± |0.0221| 52 | |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.3120|± |0.0207| 53 | |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.2114|± |0.0154| 54 | |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.4900|± |0.0289| 55 | |bigbench_movie_recommendation | 0|multiple_choice_grade|0.3600|± |0.0215| 56 | |bigbench_navigate | 0|multiple_choice_grade|0.5000|± |0.0158| 57 | |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.6660|± |0.0105| 58 | |bigbench_ruin_names | 0|multiple_choice_grade|0.4420|± |0.0235| 59 | |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.2766|± |0.0142| 60 | |bigbench_snarks | 0|multiple_choice_grade|0.6630|± |0.0352| 61 | |bigbench_sports_understanding | 0|multiple_choice_grade|0.6653|± |0.0150| 62 | |bigbench_temporal_sequences | 0|multiple_choice_grade|0.3190|± |0.0147| 63 | |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.2128|± |0.0116| 64 | |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1737|± |0.0091| 65 | |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.4900|± |0.0289| 66 | ``` 67 | Average: 41.94 68 | 69 | TruthfulQA: 70 | ``` 71 | | Task |Version|Metric|Value | |Stderr| 72 | |-------------|------:|------|-----:|---|-----:| 73 | |truthfulqa_mc| 1|mc1 |0.3892|± |0.0171| 74 | | | |mc2 |0.5642|± |0.0153| 
75 | ``` 76 | -------------------------------------------------------------------------------- /benchmark-logs/Nous-Hermes-2-Mixtral-8x7B-DPO.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | | Task |Version| Metric |Value | |Stderr| 4 | |-------------|------:|--------|-----:|---|-----:| 5 | |arc_challenge| 0|acc |0.5990|± |0.0143| 6 | | | |acc_norm|0.6425|± |0.0140| 7 | |arc_easy | 0|acc |0.8657|± |0.0070| 8 | | | |acc_norm|0.8636|± |0.0070| 9 | |boolq | 1|acc |0.8783|± |0.0057| 10 | |hellaswag | 0|acc |0.6661|± |0.0047| 11 | | | |acc_norm|0.8489|± |0.0036| 12 | |openbookqa | 0|acc |0.3440|± |0.0213| 13 | | | |acc_norm|0.4660|± |0.0223| 14 | |piqa | 0|acc |0.8324|± |0.0087| 15 | | | |acc_norm|0.8379|± |0.0086| 16 | |winogrande | 0|acc |0.7616|± |0.0120| 17 | ``` 18 | Average: 75.70 19 | 20 | AGIEval: 21 | ``` 22 | | Task |Version| Metric |Value | |Stderr| 23 | |------------------------------|------:|--------|-----:|---|-----:| 24 | |agieval_aqua_rat | 0|acc |0.2402|± |0.0269| 25 | | | |acc_norm|0.2520|± |0.0273| 26 | |agieval_logiqa_en | 0|acc |0.4117|± |0.0193| 27 | | | |acc_norm|0.4055|± |0.0193| 28 | |agieval_lsat_ar | 0|acc |0.2348|± |0.0280| 29 | | | |acc_norm|0.2087|± |0.0269| 30 | |agieval_lsat_lr | 0|acc |0.5549|± |0.0220| 31 | | | |acc_norm|0.5294|± |0.0221| 32 | |agieval_lsat_rc | 0|acc |0.6617|± |0.0289| 33 | | | |acc_norm|0.6357|± |0.0294| 34 | |agieval_sat_en | 0|acc |0.8010|± |0.0279| 35 | | | |acc_norm|0.7913|± |0.0284| 36 | |agieval_sat_en_without_passage| 0|acc |0.4806|± |0.0349| 37 | | | |acc_norm|0.4612|± |0.0348| 38 | |agieval_sat_math | 0|acc |0.4909|± |0.0338| 39 | | | |acc_norm|0.4000|± |0.0331| 40 | ``` 41 | Average: 46.05 42 | 43 | BigBench: 44 | ``` 45 | | Task |Version| Metric |Value | |Stderr| 46 | |------------------------------------------------|------:|---------------------|-----:|---|-----:| 47 | |bigbench_causal_judgement | 0|multiple_choice_grade|0.6105|± |0.0355| 48 | |bigbench_date_understanding | 0|multiple_choice_grade|0.7182|± |0.0235| 49 | |bigbench_disambiguation_qa | 0|multiple_choice_grade|0.5736|± |0.0308| 50 | |bigbench_geometric_shapes | 0|multiple_choice_grade|0.4596|± |0.0263| 51 | | | |exact_str_match |0.0000|± |0.0000| 52 | |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.3500|± |0.0214| 53 | |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.2500|± |0.0164| 54 | |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.5200|± |0.0289| 55 | |bigbench_movie_recommendation | 0|multiple_choice_grade|0.3540|± |0.0214| 56 | |bigbench_navigate | 0|multiple_choice_grade|0.5000|± |0.0158| 57 | |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.6900|± |0.0103| 58 | |bigbench_ruin_names | 0|multiple_choice_grade|0.6317|± |0.0228| 59 | |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.2535|± |0.0138| 60 | |bigbench_snarks | 0|multiple_choice_grade|0.7293|± |0.0331| 61 | |bigbench_sports_understanding | 0|multiple_choice_grade|0.6744|± |0.0149| 62 | |bigbench_temporal_sequences | 0|multiple_choice_grade|0.7400|± |0.0139| 63 | |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.2176|± |0.0117| 64 | |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1543|± |0.0086| 65 | |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.5200|± |0.0289| 66 | ``` 67 | Average: 49.70 68 | 69 | TruthfulQA: 70 | ``` 71 | | Task 
|Version|Metric|Value | |Stderr| 72 | |-------------|------:|------|-----:|---|-----:| 73 | |truthfulqa_mc| 1|mc1 |0.4162|± |0.0173| 74 | | | |mc2 |0.5783|± |0.0151| 75 | ``` -------------------------------------------------------------------------------- /benchmark-logs/Nous-Hermes-2-Mixtral-8x7B-SFT.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | | Task |Version| Metric |Value | |Stderr| 4 | |-------------|------:|--------|-----:|---|-----:| 5 | |arc_challenge| 0|acc |0.5904|± |0.0144| 6 | | | |acc_norm|0.6323|± |0.0141| 7 | |arc_easy | 0|acc |0.8594|± |0.0071| 8 | | | |acc_norm|0.8607|± |0.0071| 9 | |boolq | 1|acc |0.8783|± |0.0057| 10 | |hellaswag | 0|acc |0.6592|± |0.0047| 11 | | | |acc_norm|0.8434|± |0.0036| 12 | |openbookqa | 0|acc |0.3400|± |0.0212| 13 | | | |acc_norm|0.4660|± |0.0223| 14 | |piqa | 0|acc |0.8324|± |0.0087| 15 | | | |acc_norm|0.8379|± |0.0086| 16 | |winogrande | 0|acc |0.7569|± |0.0121| 17 | ``` 18 | Average: 75.36 19 | 20 | AGIEval: 21 | ``` 22 | | Task |Version| Metric |Value | |Stderr| 23 | |------------------------------|------:|--------|-----:|---|-----:| 24 | |agieval_aqua_rat | 0|acc |0.2441|± |0.0270| 25 | | | |acc_norm|0.2598|± |0.0276| 26 | |agieval_logiqa_en | 0|acc |0.4025|± |0.0192| 27 | | | |acc_norm|0.3978|± |0.0192| 28 | |agieval_lsat_ar | 0|acc |0.2391|± |0.0282| 29 | | | |acc_norm|0.2043|± |0.0266| 30 | |agieval_lsat_lr | 0|acc |0.5353|± |0.0221| 31 | | | |acc_norm|0.5098|± |0.0222| 32 | |agieval_lsat_rc | 0|acc |0.6617|± |0.0289| 33 | | | |acc_norm|0.5948|± |0.0300| 34 | |agieval_sat_en | 0|acc |0.7961|± |0.0281| 35 | | | |acc_norm|0.7816|± |0.0289| 36 | |agieval_sat_en_without_passage| 0|acc |0.4757|± |0.0349| 37 | | | |acc_norm|0.4515|± |0.0348| 38 | |agieval_sat_math | 0|acc |0.4818|± |0.0338| 39 | | | |acc_norm|0.3909|± |0.0330| 40 | ``` 41 | Average: 44.89 42 | 43 | BigBench: 44 | ``` 45 | | Task |Version| Metric |Value | |Stderr| 46 | |------------------------------------------------|------:|---------------------|-----:|---|-----:| 47 | |bigbench_causal_judgement | 0|multiple_choice_grade|0.5789|± |0.0359| 48 | |bigbench_date_understanding | 0|multiple_choice_grade|0.7154|± |0.0235| 49 | |bigbench_disambiguation_qa | 0|multiple_choice_grade|0.5388|± |0.0311| 50 | |bigbench_geometric_shapes | 0|multiple_choice_grade|0.4680|± |0.0264| 51 | | | |exact_str_match |0.0000|± |0.0000| 52 | |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.3260|± |0.0210| 53 | |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.2443|± |0.0163| 54 | |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.5233|± |0.0289| 55 | |bigbench_movie_recommendation | 0|multiple_choice_grade|0.3700|± |0.0216| 56 | |bigbench_navigate | 0|multiple_choice_grade|0.5000|± |0.0158| 57 | |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.6665|± |0.0105| 58 | |bigbench_ruin_names | 0|multiple_choice_grade|0.6317|± |0.0228| 59 | |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.2505|± |0.0137| 60 | |bigbench_snarks | 0|multiple_choice_grade|0.7127|± |0.0337| 61 | |bigbench_sports_understanding | 0|multiple_choice_grade|0.6592|± |0.0151| 62 | |bigbench_temporal_sequences | 0|multiple_choice_grade|0.6860|± |0.0147| 63 | |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.2200|± |0.0117| 64 | |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1503|± |0.0085| 65 | 
|bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.5233|± |0.0289| 66 | ``` 67 | Average: 48.69 68 | 69 | TruthfulQA: 70 | ``` 71 | | Task |Version|Metric|Value | |Stderr| 72 | |-------------|------:|------|-----:|---|-----:| 73 | |truthfulqa_mc| 1|mc1 |0.3856|± |0.0170| 74 | | | |mc2 |0.5388|± |0.0149| 75 | ``` -------------------------------------------------------------------------------- /benchmark-logs/Nous-Hermes-2-SOLAR-10.7B.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | | Task |Version| Metric |Value | |Stderr| 4 | |-------------|------:|--------|-----:|---|-----:| 5 | |arc_challenge| 0|acc |0.5768|_ |0.0144| 6 | | | |acc_norm|0.6067|_ |0.0143| 7 | |arc_easy | 0|acc |0.8375|_ |0.0076| 8 | | | |acc_norm|0.8316|_ |0.0077| 9 | |boolq | 1|acc |0.8875|_ |0.0055| 10 | |hellaswag | 0|acc |0.6467|_ |0.0048| 11 | | | |acc_norm|0.8321|_ |0.0037| 12 | |openbookqa | 0|acc |0.3420|_ |0.0212| 13 | | | |acc_norm|0.4580|_ |0.0223| 14 | |piqa | 0|acc |0.8161|_ |0.0090| 15 | | | |acc_norm|0.8313|_ |0.0087| 16 | |winogrande | 0|acc |0.7814|_ |0.0116| 17 | ``` 18 | Average: 74.69 19 | 20 | AGIEval: 21 | ``` 22 | | Task |Version| Metric |Value | |Stderr| 23 | |------------------------------|------:|--------|-----:|---|-----:| 24 | |agieval_aqua_rat | 0|acc |0.3110|_ |0.0291| 25 | | | |acc_norm|0.3189|_ |0.0293| 26 | |agieval_logiqa_en | 0|acc |0.4455|_ |0.0195| 27 | | | |acc_norm|0.4439|_ |0.0195| 28 | |agieval_lsat_ar | 0|acc |0.2391|_ |0.0282| 29 | | | |acc_norm|0.2261|_ |0.0276| 30 | |agieval_lsat_lr | 0|acc |0.5765|_ |0.0219| 31 | | | |acc_norm|0.5569|_ |0.0220| 32 | |agieval_lsat_rc | 0|acc |0.6914|_ |0.0282| 33 | | | |acc_norm|0.6803|_ |0.0285| 34 | |agieval_sat_en | 0|acc |0.8010|_ |0.0279| 35 | | | |acc_norm|0.7864|_ |0.0286| 36 | |agieval_sat_en_without_passage| 0|acc |0.4660|_ |0.0348| 37 | | | |acc_norm|0.4515|_ |0.0348| 38 | |agieval_sat_math | 0|acc |0.4364|_ |0.0335| 39 | | | |acc_norm|0.3591|_ |0.0324|``` 40 | Average: 47.79 41 | 42 | BigBench: 43 | ``` 44 | | Task |Version| Metric |Value | |Stderr| 45 | |------------------------------------------------|------:|---------------------|-----:|---|-----:| 46 | |bigbench_causal_judgement | 0|multiple_choice_grade|0.6263|_ |0.0352| 47 | |bigbench_date_understanding | 0|multiple_choice_grade|0.7019|_ |0.0238| 48 | |bigbench_disambiguation_qa | 0|multiple_choice_grade|0.3333|_ |0.0294| 49 | |bigbench_geometric_shapes | 0|multiple_choice_grade|0.3008|_ |0.0242| 50 | | | |exact_str_match |0.0000|_ |0.0000| 51 | |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.3360|_ |0.0211| 52 | |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.2514|_ |0.0164| 53 | |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.5000|_ |0.0289| 54 | |bigbench_movie_recommendation | 0|multiple_choice_grade|0.3760|_ |0.0217| 55 | |bigbench_navigate | 0|multiple_choice_grade|0.5000|_ |0.0158| 56 | |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.5920|_ |0.0110| 57 | |bigbench_ruin_names | 0|multiple_choice_grade|0.4442|_ |0.0235| 58 | |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.3557|_ |0.0152| 59 | |bigbench_snarks | 0|multiple_choice_grade|0.7182|_ |0.0335| 60 | |bigbench_sports_understanding | 0|multiple_choice_grade|0.6734|_ |0.0149| 61 | |bigbench_temporal_sequences | 0|multiple_choice_grade|0.4630|_ |0.0158| 62 | |bigbench_tracking_shuffled_objects_five_objects | 
0|multiple_choice_grade|0.2328|_ |0.0120| 63 | |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1657|_ |0.0089| 64 | |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.5000|_ |0.0289| 65 | ``` 66 | Average: 44.84 67 | 68 | TruthfulQA: 69 | ``` 70 | | Task |Version|Metric|Value | |Stderr| 71 | |-------------|------:|------|-----:|---|-----:| 72 | |truthfulqa_mc| 1|mc1 |0.3917|_ |0.0171| 73 | | | |mc2 |0.5592|_ |0.0154| 74 | ``` -------------------------------------------------------------------------------- /benchmark-logs/Nous-Hermes-2-Yi-34B.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | | Task |Version| Metric |Value | |Stderr| 4 | |-------------|------:|--------|-----:|---|-----:| 5 | |arc_challenge| 0|acc |0.6067|_ |0.0143| 6 | | | |acc_norm|0.6416|_ |0.0140| 7 | |arc_easy | 0|acc |0.8594|_ |0.0071| 8 | | | |acc_norm|0.8569|_ |0.0072| 9 | |boolq | 1|acc |0.8859|_ |0.0056| 10 | |hellaswag | 0|acc |0.6407|_ |0.0048| 11 | | | |acc_norm|0.8388|_ |0.0037| 12 | |openbookqa | 0|acc |0.3520|_ |0.0214| 13 | | | |acc_norm|0.4760|_ |0.0224| 14 | |piqa | 0|acc |0.8215|_ |0.0089| 15 | | | |acc_norm|0.8303|_ |0.0088| 16 | |winogrande | 0|acc |0.7908|_ |0.0114| 17 | ``` 18 | Average: 76.00% 19 | 20 | AGIEval: 21 | ``` 22 | | Task |Version| Metric |Value | |Stderr| 23 | |------------------------------|------:|--------|-----:|---|-----:| 24 | |agieval_aqua_rat | 0|acc |0.3189|_ |0.0293| 25 | | | |acc_norm|0.2953|_ |0.0287| 26 | |agieval_logiqa_en | 0|acc |0.5438|_ |0.0195| 27 | | | |acc_norm|0.4977|_ |0.0196| 28 | |agieval_lsat_ar | 0|acc |0.2696|_ |0.0293| 29 | | | |acc_norm|0.2087|_ |0.0269| 30 | |agieval_lsat_lr | 0|acc |0.7078|_ |0.0202| 31 | | | |acc_norm|0.6255|_ |0.0215| 32 | |agieval_lsat_rc | 0|acc |0.7807|_ |0.0253| 33 | | | |acc_norm|0.7063|_ |0.0278| 34 | |agieval_sat_en | 0|acc |0.8689|_ |0.0236| 35 | | | |acc_norm|0.8447|_ |0.0253| 36 | |agieval_sat_en_without_passage| 0|acc |0.5194|_ |0.0349| 37 | | | |acc_norm|0.4612|_ |0.0348| 38 | |agieval_sat_math | 0|acc |0.4409|_ |0.0336| 39 | | | |acc_norm|0.3818|_ |0.0328| 40 | ``` 41 | Average: 50.27% 42 | 43 | BigBench: 44 | ``` 45 | | Task |Version| Metric |Value | |Stderr| 46 | |------------------------------------------------|------:|---------------------|-----:|---|-----:| 47 | |bigbench_causal_judgement | 0|multiple_choice_grade|0.5737|_ |0.0360| 48 | |bigbench_date_understanding | 0|multiple_choice_grade|0.7263|_ |0.0232| 49 | |bigbench_disambiguation_qa | 0|multiple_choice_grade|0.3953|_ |0.0305| 50 | |bigbench_geometric_shapes | 0|multiple_choice_grade|0.4457|_ |0.0263| 51 | | | |exact_str_match |0.0000|_ |0.0000| 52 | |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.2820|_ |0.0201| 53 | |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.2186|_ |0.0156| 54 | |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.4733|_ |0.0289| 55 | |bigbench_movie_recommendation | 0|multiple_choice_grade|0.5200|_ |0.0224| 56 | |bigbench_navigate | 0|multiple_choice_grade|0.4910|_ |0.0158| 57 | |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.7495|_ |0.0097| 58 | |bigbench_ruin_names | 0|multiple_choice_grade|0.5938|_ |0.0232| 59 | |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.3808|_ |0.0154| 60 | |bigbench_snarks | 0|multiple_choice_grade|0.8066|_ |0.0294| 61 | |bigbench_sports_understanding | 0|multiple_choice_grade|0.5101|_ |0.0159| 
62 | |bigbench_temporal_sequences | 0|multiple_choice_grade|0.3850|_ |0.0154| 63 | |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.2160|_ |0.0116| 64 | |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1634|_ |0.0088| 65 | |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.4733|_ |0.0289| 66 | ``` 67 | Average: 46.69% 68 | 69 | TruthfulQA: 70 | ``` 71 | | Task |Version|Metric|Value | |Stderr| 72 | |-------------|------:|------|-----:|---|-----:| 73 | |truthfulqa_mc| 1|mc1 |0.4333|_ |0.0173| 74 | | | |mc2 |0.6034|_ |0.0149| 75 | ``` -------------------------------------------------------------------------------- /benchmark-logs/OpenHermes-2-Mistral-7B.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | | Task |Version| Metric |Value | |Stderr| 4 | |-------------|------:|--------|-----:|---|-----:| 5 | |arc_challenge| 0|acc |0.5452|± |0.0146| 6 | | | |acc_norm|0.5700|± |0.0145| 7 | |arc_easy | 0|acc |0.8363|± |0.0076| 8 | | | |acc_norm|0.8114|± |0.0080| 9 | |boolq | 1|acc |0.8673|± |0.0059| 10 | |hellaswag | 0|acc |0.6200|± |0.0048| 11 | | | |acc_norm|0.8107|± |0.0039| 12 | |openbookqa | 0|acc |0.3500|± |0.0214| 13 | | | |acc_norm|0.4580|± |0.0223| 14 | |piqa | 0|acc |0.8085|± |0.0092| 15 | | | |acc_norm|0.8243|± |0.0089| 16 | |winogrande | 0|acc |0.7459|± |0.0122| 17 | ``` 18 | Average: 72.68 19 | 20 | AGIEval: 21 | ``` 22 | | Task |Version| Metric |Value | |Stderr| 23 | |------------------------------|------:|--------|-----:|---|-----:| 24 | |agieval_aqua_rat | 0|acc |0.2283|± |0.0264| 25 | | | |acc_norm|0.2283|± |0.0264| 26 | |agieval_logiqa_en | 0|acc |0.3594|± |0.0188| 27 | | | |acc_norm|0.3671|± |0.0189| 28 | |agieval_lsat_ar | 0|acc |0.2217|± |0.0275| 29 | | | |acc_norm|0.2217|± |0.0275| 30 | |agieval_lsat_lr | 0|acc |0.4471|± |0.0220| 31 | | | |acc_norm|0.4451|± |0.0220| 32 | |agieval_lsat_rc | 0|acc |0.5762|± |0.0302| 33 | | | |acc_norm|0.4907|± |0.0305| 34 | |agieval_sat_en | 0|acc |0.7039|± |0.0319| 35 | | | |acc_norm|0.6942|± |0.0322| 36 | |agieval_sat_en_without_passage| 0|acc |0.4369|± |0.0346| 37 | | | |acc_norm|0.3932|± |0.0341| 38 | |agieval_sat_math | 0|acc |0.3364|± |0.0319| 39 | | | |acc_norm|0.3091|± |0.0312| 40 | ``` 41 | Average: 39.37 42 | 43 | BigBench: 44 | ``` 45 | | Task |Version| Metric |Value | |Stderr| 46 | |------------------------------------------------|------:|---------------------|-----:|---|-----:| 47 | |bigbench_causal_judgement | 0|multiple_choice_grade|0.5789|± |0.0359| 48 | |bigbench_date_understanding | 0|multiple_choice_grade|0.6694|± |0.0245| 49 | |bigbench_disambiguation_qa | 0|multiple_choice_grade|0.3953|± |0.0305| 50 | |bigbench_geometric_shapes | 0|multiple_choice_grade|0.3621|± |0.0254| 51 | | | |exact_str_match |0.1309|± |0.0178| 52 | |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.2920|± |0.0204| 53 | |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.2043|± |0.0152| 54 | |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.4300|± |0.0286| 55 | |bigbench_movie_recommendation | 0|multiple_choice_grade|0.3140|± |0.0208| 56 | |bigbench_navigate | 0|multiple_choice_grade|0.5010|± |0.0158| 57 | |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.6845|± |0.0104| 58 | |bigbench_ruin_names | 0|multiple_choice_grade|0.4241|± |0.0234| 59 | |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.1683|± |0.0118| 60 | 
|bigbench_snarks | 0|multiple_choice_grade|0.7238|± |0.0333| 61 | |bigbench_sports_understanding | 0|multiple_choice_grade|0.6623|± |0.0151| 62 | |bigbench_temporal_sequences | 0|multiple_choice_grade|0.3770|± |0.0153| 63 | |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.2208|± |0.0117| 64 | |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1514|± |0.0086| 65 | |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.4300|± |0.0286| 66 | ``` 67 | Average: 42.16 68 | 69 | TruthfulQA: 70 | ``` 71 | | Task |Version|Metric|Value | |Stderr| 72 | |-------------|------:|------|-----:|---|-----:| 73 | |truthfulqa_mc| 1|mc1 |0.3403|± |0.0166| 74 | | | |mc2 |0.5093|± |0.0151| 75 | ``` 76 | -------------------------------------------------------------------------------- /benchmark-logs/OpenHermes-2.5-Mistral-7B.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | | Task |Version| Metric |Value | |Stderr| 4 | |-------------|------:|--------|-----:|---|-----:| 5 | |arc_challenge| 0|acc |0.5623|± |0.0145| 6 | | | |acc_norm|0.6007|± |0.0143| 7 | |arc_easy | 0|acc |0.8346|± |0.0076| 8 | | | |acc_norm|0.8165|± |0.0079| 9 | |boolq | 1|acc |0.8657|± |0.0060| 10 | |hellaswag | 0|acc |0.6310|± |0.0048| 11 | | | |acc_norm|0.8173|± |0.0039| 12 | |openbookqa | 0|acc |0.3460|± |0.0213| 13 | | | |acc_norm|0.4480|± |0.0223| 14 | |piqa | 0|acc |0.8145|± |0.0091| 15 | | | |acc_norm|0.8270|± |0.0088| 16 | |winogrande | 0|acc |0.7435|± |0.0123| 17 | ``` 18 | Average: 73.12 19 | 20 | AGIEval: 21 | ``` 22 | | Task |Version| Metric |Value | |Stderr| 23 | |------------------------------|------:|--------|-----:|---|-----:| 24 | |agieval_aqua_rat | 0|acc |0.2323|± |0.0265| 25 | | | |acc_norm|0.2362|± |0.0267| 26 | |agieval_logiqa_en | 0|acc |0.3871|± |0.0191| 27 | | | |acc_norm|0.3948|± |0.0192| 28 | |agieval_lsat_ar | 0|acc |0.2522|± |0.0287| 29 | | | |acc_norm|0.2304|± |0.0278| 30 | |agieval_lsat_lr | 0|acc |0.5059|± |0.0222| 31 | | | |acc_norm|0.5157|± |0.0222| 32 | |agieval_lsat_rc | 0|acc |0.5911|± |0.0300| 33 | | | |acc_norm|0.5725|± |0.0302| 34 | |agieval_sat_en | 0|acc |0.7476|± |0.0303| 35 | | | |acc_norm|0.7330|± |0.0309| 36 | |agieval_sat_en_without_passage| 0|acc |0.4417|± |0.0347| 37 | | | |acc_norm|0.4126|± |0.0344| 38 | |agieval_sat_math | 0|acc |0.3773|± |0.0328| 39 | | | |acc_norm|0.3500|± |0.0322| 40 | ``` 41 | Average: 43.07 42 | 43 | BigBench: 44 | ``` 45 | | Task |Version| Metric |Value | |Stderr| 46 | |------------------------------------------------|------:|---------------------|-----:|---|-----:| 47 | |bigbench_causal_judgement | 0|multiple_choice_grade|0.5316|± |0.0363| 48 | |bigbench_date_understanding | 0|multiple_choice_grade|0.6667|± |0.0246| 49 | |bigbench_disambiguation_qa | 0|multiple_choice_grade|0.3411|± |0.0296| 50 | |bigbench_geometric_shapes | 0|multiple_choice_grade|0.2145|± |0.0217| 51 | |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.2860|± |0.0202| 52 | |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.2086|± |0.0154| 53 | |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.4800|± |0.0289| 54 | |bigbench_movie_recommendation | 0|multiple_choice_grade|0.3620|± |0.0215| 55 | |bigbench_navigate | 0|multiple_choice_grade|0.5000|± |0.0158| 56 | |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.6630|± |0.0106| 57 | |bigbench_ruin_names | 0|multiple_choice_grade|0.4241|± |0.0234| 58 | 
|bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.2285|± |0.0133| 59 | |bigbench_snarks | 0|multiple_choice_grade|0.6796|± |0.0348| 60 | |bigbench_sports_understanding | 0|multiple_choice_grade|0.6491|± |0.0152| 61 | |bigbench_temporal_sequences | 0|multiple_choice_grade|0.2800|± |0.0142| 62 | |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.2072|± |0.0115| 63 | |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1691|± |0.0090| 64 | |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.4800|± |0.0289| 65 | ``` 66 | Average: 40.96 67 | 68 | TruthfulQA: 69 | ``` 70 | | Task |Version|Metric|Value | |Stderr| 71 | |-------------|------:|------|-----:|---|-----:| 72 | |truthfulqa_mc| 1|mc1 |0.3599|± |0.0168| 73 | | | |mc2 |0.5304|± |0.0153| 74 | ``` -------------------------------------------------------------------------------- /benchmark-logs/OpenHermes-v1-Llama-2-7B.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | 4 | ``` 5 | Average: 6 | 7 | AGIEval: 8 | ``` 9 | | Task |Version| Metric |Value | |Stderr| 10 | |------------------------------|------:|--------|-----:|---|-----:| 11 | |agieval_aqua_rat | 0|acc |0.2441|± |0.0270| 12 | | | |acc_norm|0.2402|± |0.0269| 13 | |agieval_logiqa_en | 0|acc |0.2458|± |0.0169| 14 | | | |acc_norm|0.2965|± |0.0179| 15 | |agieval_lsat_ar | 0|acc |0.2522|± |0.0287| 16 | | | |acc_norm|0.2130|± |0.0271| 17 | |agieval_lsat_lr | 0|acc |0.2745|± |0.0198| 18 | | | |acc_norm|0.2686|± |0.0196| 19 | |agieval_lsat_rc | 0|acc |0.2900|± |0.0277| 20 | | | |acc_norm|0.2379|± |0.0260| 21 | |agieval_sat_en | 0|acc |0.4466|± |0.0347| 22 | | | |acc_norm|0.3738|± |0.0338| 23 | |agieval_sat_en_without_passage| 0|acc |0.3738|± |0.0338| 24 | | | |acc_norm|0.3301|± |0.0328| 25 | |agieval_sat_math | 0|acc |0.2318|± |0.0285| 26 | | | |acc_norm|0.1864|± |0.0263| 27 | ``` 28 | Average: 26.83 29 | 30 | BigBench: 31 | ``` 32 | | Task |Version| Metric |Value | |Stderr| 33 | |------------------------------------------------|------:|---------------------|-----:|---|-----:| 34 | |bigbench_causal_judgement | 0|multiple_choice_grade|0.5000|± |0.0364| 35 | |bigbench_date_understanding | 0|multiple_choice_grade|0.5908|± |0.0256| 36 | |bigbench_disambiguation_qa | 0|multiple_choice_grade|0.3023|± |0.0286| 37 | |bigbench_geometric_shapes | 0|multiple_choice_grade|0.1003|± |0.0159| 38 | | | |exact_str_match |0.0000|± |0.0000| 39 | |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.2520|± |0.0194| 40 | |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.1871|± |0.0148| 41 | |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.3833|± |0.0281| 42 | |bigbench_movie_recommendation | 0|multiple_choice_grade|0.2500|± |0.0194| 43 | |bigbench_navigate | 0|multiple_choice_grade|0.5000|± |0.0158| 44 | |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.4370|± |0.0111| 45 | |bigbench_ruin_names | 0|multiple_choice_grade|0.2679|± |0.0209| 46 | |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.2495|± |0.0137| 47 | |bigbench_snarks | 0|multiple_choice_grade|0.5249|± |0.0372| 48 | |bigbench_sports_understanding | 0|multiple_choice_grade|0.5406|± |0.0159| 49 | |bigbench_temporal_sequences | 0|multiple_choice_grade|0.2470|± |0.0136| 50 | |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.1944|± |0.0112| 51 | 
|bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1509|± |0.0086| 52 | |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.3833|± |0.0281| 53 | ``` 54 | Average: 33.67 55 | 56 | TruthfulQA: 57 | ``` 58 | | Task |Version|Metric|Value | |Stderr| 59 | |-------------|------:|------|-----:|---|-----:| 60 | |truthfulqa_mc| 1|mc1 |0.2999|± |0.0160| 61 | | | |mc2 |0.4542|± |0.0148| 62 | ``` -------------------------------------------------------------------------------- /benchmark-logs/Qwen-14b-base.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | | Tasks |Version|Filter| Metric |Value | |Stderr| 4 | |------------------------------------------------------------|-------|------|--------|-----:|---|------| 5 | |arc_challenge |Yaml |none |acc |0.4497| | | 6 | | | |none |acc_norm|0.4718| | | 7 | |arc_easy |Yaml |none |acc |0.7269| | | 8 | | | |none |acc_norm|0.7029| | | 9 | |boolq |Yaml |none |acc |0.8661| | | 10 | |hellaswag |Yaml |none |acc |0.6301| | | 11 | | | |none |acc_norm|0.8131| | | 12 | |openbookqa |Yaml |none |acc |0.3280| | | 13 | | | |none |acc_norm|0.4360| | | 14 | |piqa |Yaml |none |acc |0.7965| | | 15 | | | |none |acc_norm|0.8123| | | 16 | |winogrande |Yaml |none |acc |0.6740| | | 17 | ``` 18 | Average: 68.23 19 | 20 | AGIEval: 21 | ``` 22 | | Tasks |Version|Filter| Metric |Value | |Stderr| 23 | |------------------------------------------------------------|-------|------|--------|-----:|---|------| 24 | |agieval_aquarat |Yaml |none |acc |0.2638| | | 25 | | | |none |acc_norm|0.2677| | | 26 | |agieval_logiqa |Yaml |none |acc |0.4716| | | 27 | | | |none |acc_norm|0.4808| | | 28 | |agieval_lsatar |Yaml |none |acc |0.2652| | | 29 | | | |none |acc_norm|0.2130| | | 30 | |agieval_lsatlr |Yaml |none |acc |0.5255| | | 31 | | | |none |acc_norm|0.5039| | | 32 | |agieval_lsatrc |Yaml |none |acc |0.6506| | | 33 | | | |none |acc_norm|0.6283| | | 34 | |agieval_saten |Yaml |none |acc |0.7524| | | 35 | | | |none |acc_norm|0.6893| | | 36 | |agieval_saten_wop |Yaml |none |acc |0.4903| | | 37 | | | |none |acc_norm|0.4466| | | 38 | |agieval_satm |Yaml |none |acc |0.4409| | | 39 | | | |none |acc_norm|0.4227| | | 40 | ``` 41 | Average: 45.65 42 | 43 | TruthfulQA: 44 | ``` 45 | 46 | ``` -------------------------------------------------------------------------------- /benchmark-logs/Qwen-72B-base.md: -------------------------------------------------------------------------------- 1 | TruthfulQA: 2 | ```hf-causal-experimental (pretrained=Qwen/Qwen-72B,dtype=float16,trust_remote_code=True,use_accelerate=True), limit: None, provide_description: False, num_fewshot: 0, batch_size: 20 3 | | Task |Version|Metric|Value | |Stderr| 4 | |-------------|------:|------|-----:|---|-----:| 5 | |truthfulqa_mc| 1|mc1 |0.4259|± |0.0173| 6 | | | |mc2 |0.6016|± |0.0146| 7 | ``` 8 | 9 | GPT4All: 10 | ```hf-causal-experimental (pretrained=Qwen/Qwen-72B,dtype=float16,trust_remote_code=True,use_accelerate=True), limit: None, provide_description: False, num_fewshot: 0, batch_size: 20 11 | | Task |Version| Metric |Value | |Stderr| 12 | |-------------|------:|--------|-----:|---|-----:| 13 | |arc_challenge| 0|acc |0.5162|± |0.0146| 14 | | | |acc_norm|0.5435|± |0.0146| 15 | |arc_easy | 0|acc |0.8081|± |0.0081| 16 | | | |acc_norm|0.7542|± |0.0088| 17 | |boolq | 1|acc |0.8804|± |0.0057| 18 | |hellaswag | 0|acc |0.6560|± |0.0047| 19 | | | |acc_norm|0.8490|± |0.0036| 20 | |openbookqa | 0|acc |0.3480|± |0.0213| 21 | | | 
|acc_norm|0.4720|± |0.0223| 22 | |piqa | 0|acc |0.8194|± |0.0090| 23 | | | |acc_norm|0.8259|± |0.0088| 24 | |winogrande | 0|acc |0.7774|± |0.0117| 25 | ``` 26 | Average: 72.89 27 | 28 | AGIEval: 29 | ```hf-causal-experimental (pretrained=Qwen/Qwen-72B,dtype=float16,trust_remote_code=True,use_accelerate=True), limit: None, provide_description: False, num_fewshot: 0, batch_size: 20 30 | | Task |Version| Metric |Value | |Stderr| 31 | |------------------------------|------:|--------|-----:|---|-----:| 32 | |agieval_aqua_rat | 0|acc |0.3346|± |0.0297| 33 | | | |acc_norm|0.3307|± |0.0296| 34 | |agieval_logiqa_en | 0|acc |0.3978|± |0.0192| 35 | | | |acc_norm|0.4055|± |0.0193| 36 | |agieval_lsat_ar | 0|acc |0.2391|± |0.0282| 37 | | | |acc_norm|0.2739|± |0.0295| 38 | |agieval_lsat_lr | 0|acc |0.4941|± |0.0222| 39 | | | |acc_norm|0.4647|± |0.0221| 40 | |agieval_lsat_rc | 0|acc |0.5019|± |0.0305| 41 | | | |acc_norm|0.4758|± |0.0305| 42 | |agieval_sat_en | 0|acc |0.7524|± |0.0301| 43 | | | |acc_norm|0.7184|± |0.0314| 44 | |agieval_sat_en_without_passage| 0|acc |0.5000|± |0.0349| 45 | | | |acc_norm|0.4320|± |0.0346| 46 | |agieval_sat_math | 0|acc |0.4364|± |0.0335| 47 | | | |acc_norm|0.3182|± |0.0315| 48 | ``` 49 | Average: 42.74 50 | 51 | NOTE: QWEN has issues with BigBench in eval harness, so it was not done -------------------------------------------------------------------------------- /benchmark-logs/SOLAR-10.7B-Instruct.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | | Task |Version| Metric |Value | |Stderr| 4 | |-------------|------:|--------|-----:|---|-----:| 5 | |arc_challenge| 0|acc |0.6075|_ |0.0143| 6 | | | |acc_norm|0.6408|_ |0.0140| 7 | |arc_easy | 0|acc |0.8329|_ |0.0077| 8 | | | |acc_norm|0.8152|_ |0.0080| 9 | |boolq | 1|acc |0.8853|_ |0.0056| 10 | |hellaswag | 0|acc |0.6868|_ |0.0046| 11 | | | |acc_norm|0.8637|_ |0.0034| 12 | |openbookqa | 0|acc |0.3640|_ |0.0215| 13 | | | |acc_norm|0.4760|_ |0.0224| 14 | |piqa | 0|acc |0.8069|_ |0.0092| 15 | | | |acc_norm|0.8101|_ |0.0092| 16 | |winogrande | 0|acc |0.7664|_ |0.0119| 17 | 18 | ``` 19 | Average: 75.11 20 | 21 | AGIEval: 22 | ``` 23 | | Task |Version| Metric |Value | |Stderr| 24 | |------------------------------|------:|--------|-----:|---|-----:| 25 | |agieval_aqua_rat | 0|acc |0.2874|_ |0.0285| 26 | | | |acc_norm|0.2913|_ |0.0286| 27 | |agieval_logiqa_en | 0|acc |0.4270|_ |0.0194| 28 | | | |acc_norm|0.4270|_ |0.0194| 29 | |agieval_lsat_ar | 0|acc |0.2435|_ |0.0284| 30 | | | |acc_norm|0.2348|_ |0.0280| 31 | |agieval_lsat_lr | 0|acc |0.5333|_ |0.0221| 32 | | | |acc_norm|0.5353|_ |0.0221| 33 | |agieval_lsat_rc | 0|acc |0.6989|_ |0.0280| 34 | | | |acc_norm|0.6877|_ |0.0283| 35 | |agieval_sat_en | 0|acc |0.7961|_ |0.0281| 36 | | | |acc_norm|0.7961|_ |0.0281| 37 | |agieval_sat_en_without_passage| 0|acc |0.4612|_ |0.0348| 38 | | | |acc_norm|0.4515|_ |0.0348| 39 | |agieval_sat_math | 0|acc |0.3955|_ |0.0330| 40 | | | |acc_norm|0.3818|_ |0.0328| 41 | 42 | ``` 43 | Average: 47.57 44 | 45 | BigBench: 46 | ``` 47 | | Task |Version| Metric |Value | |Stderr| 48 | |------------------------------------------------|------:|---------------------|-----:|---|-----:| 49 | |bigbench_causal_judgement | 0|multiple_choice_grade|0.5895|_ |0.0358| 50 | |bigbench_date_understanding | 0|multiple_choice_grade|0.6531|_ |0.0248| 51 | |bigbench_disambiguation_qa | 0|multiple_choice_grade|0.3798|_ |0.0303| 52 | |bigbench_geometric_shapes | 0|multiple_choice_grade|0.2535|_ |0.0230| 53 | | | |exact_str_match 
|0.1031|_ |0.0161| 54 | |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.2820|_ |0.0201| 55 | |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.2100|_ |0.0154| 56 | |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.4733|_ |0.0289| 57 | |bigbench_movie_recommendation | 0|multiple_choice_grade|0.4160|_ |0.0221| 58 | |bigbench_navigate | 0|multiple_choice_grade|0.6480|_ |0.0151| 59 | |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.5625|_ |0.0111| 60 | |bigbench_ruin_names | 0|multiple_choice_grade|0.4062|_ |0.0232| 61 | |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.4108|_ |0.0156| 62 | |bigbench_snarks | 0|multiple_choice_grade|0.6630|_ |0.0352| 63 | |bigbench_sports_understanding | 0|multiple_choice_grade|0.7120|_ |0.0144| 64 | |bigbench_temporal_sequences | 0|multiple_choice_grade|0.6300|_ |0.0153| 65 | |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.2424|_ |0.0121| 66 | |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1754|_ |0.0091| 67 | |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.4733|_ |0.0289| 68 | 69 | ``` 70 | Average: 45.45 71 | 72 | TruthfulQA: 73 | ``` 74 | | Task |Version|Metric|Value | |Stderr| 75 | |-------------|------:|------|-----:|---|-----:| 76 | |truthfulqa_mc| 1|mc1 |0.5765|_ |0.0173| 77 | | | |mc2 |0.7172|_ |0.0150| 78 | ``` -------------------------------------------------------------------------------- /benchmark-logs/SOLAR-10.7b-v1-Base.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ```hf-causal-experimental (pretrained=upstage/SOLAR-10.7B-v1.0,dtype=float16,trust_remote_code=True,use_accelerate=True), limit: None, provide_description: False, num_fewshot: 0, batch_size: 14 3 | | Task |Version| Metric |Value | |Stderr| 4 | |-------------|------:|--------|-----:|---|-----:| 5 | |arc_challenge| 0|acc |0.5247|± |0.0146| 6 | | | |acc_norm|0.5708|± |0.0145| 7 | |arc_easy | 0|acc |0.8148|± |0.0080| 8 | | | |acc_norm|0.8072|± |0.0081| 9 | |boolq | 1|acc |0.8254|± |0.0066| 10 | |hellaswag | 0|acc |0.6394|± |0.0048| 11 | | | |acc_norm|0.8310|± |0.0037| 12 | |openbookqa | 0|acc |0.3240|± |0.0210| 13 | | | |acc_norm|0.4400|± |0.0222| 14 | |piqa | 0|acc |0.8058|± |0.0092| 15 | | | |acc_norm|0.8194|± |0.0090| 16 | |winogrande | 0|acc |0.7459|± |0.0122| 17 | ``` 18 | Average: 71.99 19 | 20 | AGIEval: 21 | ```hf-causal-experimental (pretrained=upstage/SOLAR-10.7B-v1.0,dtype=float16,trust_remote_code=True,use_accelerate=True), limit: None, provide_description: False, num_fewshot: 0, batch_size: 16 22 | | Task |Version| Metric |Value | |Stderr| 23 | |------------------------------|------:|--------|-----:|---|-----:| 24 | |agieval_aqua_rat | 0|acc |0.2520|± |0.0273| 25 | | | |acc_norm|0.2638|± |0.0277| 26 | |agieval_logiqa_en | 0|acc |0.3748|± |0.0190| 27 | | | |acc_norm|0.3840|± |0.0191| 28 | |agieval_lsat_ar | 0|acc |0.2217|± |0.0275| 29 | | | |acc_norm|0.2087|± |0.0269| 30 | |agieval_lsat_lr | 0|acc |0.4353|± |0.0220| 31 | | | |acc_norm|0.3902|± |0.0216| 32 | |agieval_lsat_rc | 0|acc |0.5911|± |0.0300| 33 | | | |acc_norm|0.4684|± |0.0305| 34 | |agieval_sat_en | 0|acc |0.7524|± |0.0301| 35 | | | |acc_norm|0.6942|± |0.0322| 36 | |agieval_sat_en_without_passage| 0|acc |0.4223|± |0.0345| 37 | | | |acc_norm|0.3883|± |0.0340| 38 | |agieval_sat_math | 0|acc |0.3727|± |0.0327| 39 | | | |acc_norm|0.3136|± |0.0314| 40 | ``` 41 | Average: 
38.89 42 | 43 | BigBench: 44 | ```hf-causal-experimental (pretrained=upstage/SOLAR-10.7B-v1.0,dtype=float16,trust_remote_code=True,use_accelerate=True), limit: None, provide_description: False, num_fewshot: 0, batch_size: 32 45 | | Task |Version| Metric |Value | |Stderr| 46 | |------------------------------------------------|------:|---------------------|-----:|---|-----:| 47 | |bigbench_causal_judgement | 0|multiple_choice_grade|0.5368|± |0.0363| 48 | |bigbench_date_understanding | 0|multiple_choice_grade|0.7046|± |0.0238| 49 | |bigbench_disambiguation_qa | 0|multiple_choice_grade|0.3760|± |0.0302| 50 | |bigbench_geometric_shapes | 0|multiple_choice_grade|0.1950|± |0.0209| 51 | | | |exact_str_match |0.1058|± |0.0163| 52 | |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.2520|± |0.0194| 53 | |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.1914|± |0.0149| 54 | |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.3967|± |0.0283| 55 | |bigbench_movie_recommendation | 0|multiple_choice_grade|0.3340|± |0.0211| 56 | |bigbench_navigate | 0|multiple_choice_grade|0.5000|± |0.0158| 57 | |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.5150|± |0.0112| 58 | |bigbench_ruin_names | 0|multiple_choice_grade|0.3817|± |0.0230| 59 | |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.2445|± |0.0136| 60 | |bigbench_snarks | 0|multiple_choice_grade|0.5746|± |0.0369| 61 | |bigbench_sports_understanding | 0|multiple_choice_grade|0.4970|± |0.0159| 62 | |bigbench_temporal_sequences | 0|multiple_choice_grade|0.4640|± |0.0158| 63 | |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.2344|± |0.0120| 64 | |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1640|± |0.0089| 65 | |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.3967|± |0.0283| 66 | ``` 67 | Average: 38.66 68 | 69 | TruthfulQA: 70 | ```hf-causal-experimental (pretrained=upstage/SOLAR-10.7B-v1.0,dtype=float16,trust_remote_code=True,use_accelerate=True), limit: None, provide_description: False, num_fewshot: 0, batch_size: 60 71 | | Task |Version|Metric|Value | |Stderr| 72 | |-------------|------:|------|-----:|---|-----:| 73 | |truthfulqa_mc| 1|mc1 |0.3182|± |0.0163| 74 | | | |mc2 |0.4565|± |0.0144| 75 | ``` -------------------------------------------------------------------------------- /benchmark-logs/Skunkworks-Mistralic-7B-Mistral.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | 4 | ``` 5 | Average: 6 | 7 | AGIEval: 8 | ``` 9 | | Task |Version| Metric |Value | |Stderr| 10 | |------------------------------|------:|--------|-----:|---|-----:| 11 | |agieval_aqua_rat | 0|acc |0.2323|± |0.0265| 12 | | | |acc_norm|0.2362|± |0.0267| 13 | |agieval_logiqa_en | 0|acc |0.3118|± |0.0182| 14 | | | |acc_norm|0.3195|± |0.0183| 15 | |agieval_lsat_ar | 0|acc |0.1783|± |0.0253| 16 | | | |acc_norm|0.1652|± |0.0245| 17 | |agieval_lsat_lr | 0|acc |0.3922|± |0.0216| 18 | | | |acc_norm|0.3863|± |0.0216| 19 | |agieval_lsat_rc | 0|acc |0.5019|± |0.0305| 20 | | | |acc_norm|0.4721|± |0.0305| 21 | |agieval_sat_en | 0|acc |0.6942|± |0.0322| 22 | | | |acc_norm|0.6602|± |0.0331| 23 | |agieval_sat_en_without_passage| 0|acc |0.3689|± |0.0337| 24 | | | |acc_norm|0.3641|± |0.0336| 25 | |agieval_sat_math | 0|acc |0.3545|± |0.0323| 26 | | | |acc_norm|0.2955|± |0.0308| 27 | ``` 28 | Average: 36.24 29 | 30 | BigBench: 31 | ``` 32 | 33 | ``` 34 | Average: 
35 | 36 | TruthfulQA: 37 | ``` 38 | 39 | ``` -------------------------------------------------------------------------------- /benchmark-logs/Synthia-Mistral-7B.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | | Task |Version| Metric |Value | |Stderr| 4 | |-------------|------:|--------|-----:|---|-----:| 5 | |arc_challenge| 0|acc |0.5367|± |0.0146| 6 | | | |acc_norm|0.5640|± |0.0145| 7 | |arc_easy | 0|acc |0.8245|± |0.0078| 8 | | | |acc_norm|0.8051|± |0.0081| 9 | |boolq | 1|acc |0.8697|± |0.0059| 10 | |hellaswag | 0|acc |0.6273|± |0.0048| 11 | | | |acc_norm|0.8123|± |0.0039| 12 | |openbookqa | 0|acc |0.3440|± |0.0213| 13 | | | |acc_norm|0.4460|± |0.0223| 14 | |piqa | 0|acc |0.8161|± |0.0090| 15 | | | |acc_norm|0.8275|± |0.0088| 16 | |winogrande | 0|acc |0.7569|± |0.0121| 17 | ``` 18 | Average: 72.59 19 | 20 | AGIEval: 21 | ``` 22 | | Task |Version| Metric |Value | |Stderr| 23 | |------------------------------|------:|--------|-----:|---|-----:| 24 | |agieval_aqua_rat | 0|acc |0.1969|± |0.0250| 25 | | | |acc_norm|0.1969|± |0.0250| 26 | |agieval_logiqa_en | 0|acc |0.3134|± |0.0182| 27 | | | |acc_norm|0.3518|± |0.0187| 28 | |agieval_lsat_ar | 0|acc |0.2043|± |0.0266| 29 | | | |acc_norm|0.1870|± |0.0258| 30 | |agieval_lsat_lr | 0|acc |0.3941|± |0.0217| 31 | | | |acc_norm|0.3882|± |0.0216| 32 | |agieval_lsat_rc | 0|acc |0.5093|± |0.0305| 33 | | | |acc_norm|0.4833|± |0.0305| 34 | |agieval_sat_en | 0|acc |0.6942|± |0.0322| 35 | | | |acc_norm|0.6748|± |0.0327| 36 | |agieval_sat_en_without_passage| 0|acc |0.3835|± |0.0340| 37 | | | |acc_norm|0.3835|± |0.0340| 38 | |agieval_sat_math | 0|acc |0.3955|± |0.0330| 39 | | | |acc_norm|0.3545|± |0.0323| 40 | ``` 41 | Average: 37.75 42 | 43 | BigBench: 44 | ``` 45 | 46 | ``` 47 | Average: 48 | 49 | TruthfulQA: 50 | ``` 51 | 52 | ``` -------------------------------------------------------------------------------- /benchmark-logs/Teknium-Airoboros-2.2-Mistral-7B.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | | Task |Version| Metric |Value | |Stderr| 4 | |-------------|------:|--------|-----:|---|-----:| 5 | |arc_challenge| 0|acc |0.5384|± |0.0146| 6 | | | |acc_norm|0.5623|± |0.0145| 7 | |arc_easy | 0|acc |0.8232|± |0.0078| 8 | | | |acc_norm|0.7942|± |0.0083| 9 | |boolq | 1|acc |0.8541|± |0.0062| 10 | |hellaswag | 0|acc |0.6235|± |0.0048| 11 | | | |acc_norm|0.8066|± |0.0039| 12 | |openbookqa | 0|acc |0.3660|± |0.0216| 13 | | | |acc_norm|0.4580|± |0.0223| 14 | |piqa | 0|acc |0.8003|± |0.0093| 15 | | | |acc_norm|0.8145|± |0.0091| 16 | |winogrande | 0|acc |0.7316|± |0.0125| 17 | ``` 18 | Average: 71.73 19 | 20 | AGIEval: 21 | ``` 22 | | Task |Version| Metric |Value | |Stderr| 23 | |------------------------------|------:|--------|-----:|---|-----:| 24 | |agieval_aqua_rat | 0|acc |0.2165|± |0.0259| 25 | | | |acc_norm|0.2126|± |0.0257| 26 | |agieval_logiqa_en | 0|acc |0.3041|± |0.0180| 27 | | | |acc_norm|0.3364|± |0.0185| 28 | |agieval_lsat_ar | 0|acc |0.1913|± |0.0260| 29 | | | |acc_norm|0.2000|± |0.0264| 30 | |agieval_lsat_lr | 0|acc |0.3686|± |0.0214| 31 | | | |acc_norm|0.3745|± |0.0215| 32 | |agieval_lsat_rc | 0|acc |0.4498|± |0.0304| 33 | | | |acc_norm|0.4201|± |0.0301| 34 | |agieval_sat_en | 0|acc |0.6845|± |0.0325| 35 | | | |acc_norm|0.6311|± |0.0337| 36 | |agieval_sat_en_without_passage| 0|acc |0.3738|± |0.0338| 37 | | | |acc_norm|0.3447|± |0.0332| 38 | |agieval_sat_math | 0|acc |0.3000|± |0.0310| 39 | | | |acc_norm|0.2591|± |0.0296| 40 
| ``` 41 | Average: 34.73 42 | 43 | BigBench: 44 | ``` 45 | 46 | ``` 47 | Average: 48 | 49 | TruthfulQA: 50 | ``` 51 | | Task |Version|Metric|Value | |Stderr| 52 | |-------------|------:|------|-----:|---|-----:| 53 | |truthfulqa_mc| 1|mc1 |0.3562|± |0.0168| 54 | | | |mc2 |0.5217|± |0.0156| 55 | ``` -------------------------------------------------------------------------------- /benchmark-logs/TinyLlama-1.1B-intermediate-step-1431k-3T.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | | Task |Version| Metric |Value | |Stderr| 4 | |-------------|------:|--------|-----:|---|-----:| 5 | |arc_challenge| 0|acc |0.2782|± |0.0131| 6 | | | |acc_norm|0.3012|± |0.0134| 7 | |arc_easy | 0|acc |0.6031|± |0.0100| 8 | | | |acc_norm|0.5535|± |0.0102| 9 | |boolq | 1|acc |0.5774|± |0.0086| 10 | |hellaswag | 0|acc |0.4499|± |0.0050| 11 | | | |acc_norm|0.5917|± |0.0049| 12 | |openbookqa | 0|acc |0.2180|± |0.0185| 13 | | | |acc_norm|0.3600|± |0.0215| 14 | |piqa | 0|acc |0.7334|± |0.0103| 15 | | | |acc_norm|0.7318|± |0.0103| 16 | |winogrande | 0|acc |0.5935|± |0.0138| 17 | ``` 18 | Average: 52.99 19 | 20 | AGIEval: 21 | ``` 22 | | Task |Version| Metric |Value | |Stderr| 23 | |------------------------------|------:|--------|-----:|---|-----:| 24 | |agieval_aqua_rat | 0|acc |0.1575|± |0.0229| 25 | | | |acc_norm|0.1693|± |0.0236| 26 | |agieval_logiqa_en | 0|acc |0.2488|± |0.0170| 27 | | | |acc_norm|0.2934|± |0.0179| 28 | |agieval_lsat_ar | 0|acc |0.2304|± |0.0278| 29 | | | |acc_norm|0.2043|± |0.0266| 30 | |agieval_lsat_lr | 0|acc |0.2059|± |0.0179| 31 | | | |acc_norm|0.2353|± |0.0188| 32 | |agieval_lsat_rc | 0|acc |0.1970|± |0.0243| 33 | | | |acc_norm|0.1710|± |0.0230| 34 | |agieval_sat_en | 0|acc |0.2427|± |0.0299| 35 | | | |acc_norm|0.1893|± |0.0274| 36 | |agieval_sat_en_without_passage| 0|acc |0.2136|± |0.0286| 37 | | | |acc_norm|0.1942|± |0.0276| 38 | |agieval_sat_math | 0|acc |0.3045|± |0.0311| 39 | | | |acc_norm|0.2273|± |0.0283| 40 | ``` 41 | Average: 21.05 42 | 43 | BigBench: 44 | ``` 45 | | Task |Version| Metric |Value | |Stderr| 46 | |------------------------------------------------|------:|---------------------|-----:|---|-----:| 47 | |bigbench_causal_judgement | 0|multiple_choice_grade|0.5053|± |0.0364| 48 | |bigbench_date_understanding | 0|multiple_choice_grade|0.4851|± |0.0261| 49 | |bigbench_disambiguation_qa | 0|multiple_choice_grade|0.2984|± |0.0285| 50 | |bigbench_geometric_shapes | 0|multiple_choice_grade|0.0474|± |0.0112| 51 | | | |exact_str_match |0.0000|± |0.0000| 52 | |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.2100|± |0.0182| 53 | |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.1471|± |0.0134| 54 | |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.3667|± |0.0279| 55 | |bigbench_movie_recommendation | 0|multiple_choice_grade|0.3440|± |0.0213| 56 | |bigbench_navigate | 0|multiple_choice_grade|0.5000|± |0.0158| 57 | |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.3105|± |0.0103| 58 | |bigbench_ruin_names | 0|multiple_choice_grade|0.3013|± |0.0217| 59 | |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.2295|± |0.0133| 60 | |bigbench_snarks | 0|multiple_choice_grade|0.5138|± |0.0373| 61 | |bigbench_sports_understanding | 0|multiple_choice_grade|0.4970|± |0.0159| 62 | |bigbench_temporal_sequences | 0|multiple_choice_grade|0.2940|± |0.0144| 63 | |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.1976|± 
|0.0113| 64 | |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1371|± |0.0082| 65 | |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.3667|± |0.0279| 66 | ``` 67 | Average: 31.95 68 | 69 | TruthfulQA: 70 | ``` 71 | | Task |Version|Metric|Value | |Stderr| 72 | |-------------|------:|------|-----:|---|-----:| 73 | |truthfulqa_mc| 1|mc1 |0.2203|± |0.0145| 74 | | | |mc2 |0.3759|± |0.0138| 75 | ``` -------------------------------------------------------------------------------- /benchmark-logs/TinyLlama-1B--intermediate-step-1195k-token-2.5T.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | | Task |Version| Metric |Value | |Stderr| 4 | |-------------|------:|--------|-----:|---|-----:| 5 | |arc_challenge| 0|acc |0.2705|± |0.0130| 6 | | | |acc_norm|0.3183|± |0.0136| 7 | |arc_easy | 0|acc |0.6124|± |0.0100| 8 | | | |acc_norm|0.5678|± |0.0102| 9 | |boolq | 1|acc |0.6324|± |0.0084| 10 | |hellaswag | 0|acc |0.4492|± |0.0050| 11 | | | |acc_norm|0.5896|± |0.0049| 12 | |openbookqa | 0|acc |0.2420|± |0.0192| 13 | | | |acc_norm|0.3440|± |0.0213| 14 | |piqa | 0|acc |0.7301|± |0.0104| 15 | | | |acc_norm|0.7301|± |0.0104| 16 | |winogrande | 0|acc |0.5864|± |0.0138| 17 | ``` 18 | Average: 53.84 19 | 20 | AGIEval: 21 | ``` 22 | | Task |Version| Metric |Value | |Stderr| 23 | |------------------------------|------:|--------|-----:|---|-----:| 24 | |agieval_aqua_rat | 0|acc |0.1457|± |0.0222| 25 | | | |acc_norm|0.1535|± |0.0227| 26 | |agieval_logiqa_en | 0|acc |0.2350|± |0.0166| 27 | | | |acc_norm|0.3088|± |0.0181| 28 | |agieval_lsat_ar | 0|acc |0.1913|± |0.0260| 29 | | | |acc_norm|0.1870|± |0.0258| 30 | |agieval_lsat_lr | 0|acc |0.1961|± |0.0176| 31 | | | |acc_norm|0.2196|± |0.0183| 32 | |agieval_lsat_rc | 0|acc |0.2007|± |0.0245| 33 | | | |acc_norm|0.1784|± |0.0234| 34 | |agieval_sat_en | 0|acc |0.2282|± |0.0293| 35 | | | |acc_norm|0.2233|± |0.0291| 36 | |agieval_sat_en_without_passage| 0|acc |0.1893|± |0.0274| 37 | | | |acc_norm|0.2136|± |0.0286| 38 | |agieval_sat_math | 0|acc |0.2591|± |0.0296| 39 | | | |acc_norm|0.2318|± |0.0285| 40 | ``` 41 | Average: 21.45 42 | 43 | BigBench: 44 | ``` 45 | | Task |Version| Metric |Value | |Stderr| 46 | |------------------------------------------------|------:|---------------------|-----:|---|-----:| 47 | |bigbench_causal_judgement | 0|multiple_choice_grade|0.4895|± |0.0364| 48 | |bigbench_date_understanding | 0|multiple_choice_grade|0.4878|± |0.0261| 49 | |bigbench_disambiguation_qa | 0|multiple_choice_grade|0.3527|± |0.0298| 50 | |bigbench_geometric_shapes | 0|multiple_choice_grade|0.0446|± |0.0109| 51 | | | |exact_str_match |0.0000|± |0.0000| 52 | |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.2180|± |0.0185| 53 | |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.1457|± |0.0133| 54 | |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.3633|± |0.0278| 55 | |bigbench_movie_recommendation | 0|multiple_choice_grade|0.3780|± |0.0217| 56 | |bigbench_navigate | 0|multiple_choice_grade|0.5000|± |0.0158| 57 | |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.2980|± |0.0102| 58 | |bigbench_ruin_names | 0|multiple_choice_grade|0.3036|± |0.0217| 59 | |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.1533|± |0.0114| 60 | |bigbench_snarks | 0|multiple_choice_grade|0.5249|± |0.0372| 61 | |bigbench_sports_understanding | 0|multiple_choice_grade|0.4980|± |0.0159| 62 | 
|bigbench_temporal_sequences | 0|multiple_choice_grade|0.2780|± |0.0142| 63 | |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.1848|± |0.0110| 64 | |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1274|± |0.0080| 65 | |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.3633|± |0.0278| 66 | ``` 67 | Average: 31.73 68 | 69 | TruthfulQA: 70 | ``` 71 | | Task |Version|Metric|Value | |Stderr| 72 | |-------------|------:|------|-----:|---|-----:| 73 | |truthfulqa_mc| 1|mc1 |0.2130|± |0.0143| 74 | | | |mc2 |0.3707|± |0.0138| 75 | ``` -------------------------------------------------------------------------------- /benchmark-logs/Yi-34B-Chat.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | hf-causal-experimental (pretrained=01-ai/Yi-34B-Chat,dtype=float16,trust_remote_code=True,use_accelerate=True), limit: None, provide_description: False, num_fewshot: 0, batch_size: 14 4 | | Task |Version| Metric |Value | |Stderr| 5 | |-------------|------:|--------|-----:|---|-----:| 6 | |arc_challenge| 0|acc |0.5427|_ |0.0146| 7 | | | |acc_norm|0.5478|_ |0.0145| 8 | |arc_easy | 0|acc |0.8089|_ |0.0081| 9 | | | |acc_norm|0.7412|_ |0.0090| 10 | |boolq | 1|acc |0.8997|_ |0.0053| 11 | |hellaswag | 0|acc |0.6143|_ |0.0049| 12 | | | |acc_norm|0.8068|_ |0.0039| 13 | |openbookqa | 0|acc |0.3720|_ |0.0216| 14 | | | |acc_norm|0.4800|_ |0.0224| 15 | |piqa | 0|acc |0.7971|_ |0.0094| 16 | | | |acc_norm|0.7982|_ |0.0094| 17 | |winogrande | 0|acc |0.7751|_ |0.0117| 18 | ``` 19 | Average: 72.13 20 | 21 | AGIEval: 22 | ``` 23 | | Task |Version| Metric |Value | |Stderr| 24 | |------------------------------|------:|--------|-----:|---|-----:| 25 | |agieval_aqua_rat | 0|acc |0.3110|_ |0.0291| 26 | | | |acc_norm|0.3228|_ |0.0294| 27 | |agieval_logiqa_en | 0|acc |0.4593|_ |0.0195| 28 | | | |acc_norm|0.4178|_ |0.0193| 29 | |agieval_lsat_ar | 0|acc |0.2391|_ |0.0282| 30 | | | |acc_norm|0.2261|_ |0.0276| 31 | |agieval_lsat_lr | 0|acc |0.6039|_ |0.0217| 32 | | | |acc_norm|0.5275|_ |0.0221| 33 | |agieval_lsat_rc | 0|acc |0.7361|_ |0.0269| 34 | | | |acc_norm|0.6766|_ |0.0286| 35 | |agieval_sat_en | 0|acc |0.8398|_ |0.0256| 36 | | | |acc_norm|0.8107|_ |0.0274| 37 | |agieval_sat_en_without_passage| 0|acc |0.5097|_ |0.0349| 38 | | | |acc_norm|0.4757|_ |0.0349| 39 | |agieval_sat_math | 0|acc |0.4682|_ |0.0337| 40 | | | |acc_norm|0.3909|_ |0.0330| 41 | ``` 42 | Average: 48.10 43 | 44 | 45 | TruthfulQA: 46 | ```hf-causal-experimental (pretrained=01-ai/Yi-34B-Chat,dtype=float16,trust_remote_code=True,use_accelerate=True), limit: None, provide_description: False, num_fewshot: 0, batch_size: 60 47 | | Task |Version|Metric|Value | |Stderr| 48 | |-------------|------:|------|-----:|---|-----:| 49 | |truthfulqa_mc| 1|mc1 |0.3905|_ |0.0171| 50 | | | |mc2 |0.5540|_ |0.0155| 51 | ``` 52 | 53 | Note: Some kind of error kept occurring with BigBench on this model, so none is recorded -------------------------------------------------------------------------------- /benchmark-logs/deita-v1.0-Mistral-7B.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | | Task |Version| Metric |Value | |Stderr| 4 | |-------------|------:|--------|-----:|---|-----:| 5 | |arc_challenge| 0|acc |0.5282|± |0.0146| 6 | | | |acc_norm|0.5538|± |0.0145| 7 | |arc_easy | 0|acc |0.7900|± |0.0084| 8 | | | |acc_norm|0.7273|± |0.0091| 9 | |boolq | 1|acc |0.8410|± |0.0064| 10 | |hellaswag | 
0|acc |0.6626|± |0.0047| 11 | | | |acc_norm|0.8332|± |0.0037| 12 | |openbookqa | 0|acc |0.3280|± |0.0210| 13 | | | |acc_norm|0.4580|± |0.0223| 14 | |piqa | 0|acc |0.7938|± |0.0094| 15 | | | |acc_norm|0.7922|± |0.0095| 16 | |winogrande | 0|acc |0.7490|± |0.0122| 17 | ``` 18 | Average: 70.78 19 | 20 | AGIEval: 21 | ``` 22 | | Task |Version| Metric |Value | |Stderr| 23 | |------------------------------|------:|--------|-----:|---|-----:| 24 | |agieval_aqua_rat | 0|acc |0.2244|± |0.0262| 25 | | | |acc_norm|0.2323|± |0.0265| 26 | |agieval_logiqa_en | 0|acc |0.3410|± |0.0186| 27 | | | |acc_norm|0.3364|± |0.0185| 28 | |agieval_lsat_ar | 0|acc |0.2391|± |0.0282| 29 | | | |acc_norm|0.2609|± |0.0290| 30 | |agieval_lsat_lr | 0|acc |0.3824|± |0.0215| 31 | | | |acc_norm|0.3882|± |0.0216| 32 | |agieval_lsat_rc | 0|acc |0.5576|± |0.0303| 33 | | | |acc_norm|0.5502|± |0.0304| 34 | |agieval_sat_en | 0|acc |0.6505|± |0.0333| 35 | | | |acc_norm|0.6408|± |0.0335| 36 | |agieval_sat_en_without_passage| 0|acc |0.4369|± |0.0346| 37 | | | |acc_norm|0.4320|± |0.0346| 38 | |agieval_sat_math | 0|acc |0.3409|± |0.0320| 39 | | | |acc_norm|0.3227|± |0.0316| 40 | ``` 41 | Average: 39.54 42 | 43 | BigBench: 44 | ``` 45 | | Task |Version| Metric |Value | |Stderr| 46 | |------------------------------------------------|------:|---------------------|-----:|---|-----:| 47 | |bigbench_causal_judgement | 0|multiple_choice_grade|0.5842|± |0.0359| 48 | |bigbench_date_understanding | 0|multiple_choice_grade|0.6396|± |0.0250| 49 | |bigbench_disambiguation_qa | 0|multiple_choice_grade|0.3333|± |0.0294| 50 | |bigbench_geometric_shapes | 0|multiple_choice_grade|0.1086|± |0.0164| 51 | | | |exact_str_match |0.0000|± |0.0000| 52 | |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.2880|± |0.0203| 53 | |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.1686|± |0.0142| 54 | |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.4300|± |0.0286| 55 | |bigbench_movie_recommendation | 0|multiple_choice_grade|0.3100|± |0.0207| 56 | |bigbench_navigate | 0|multiple_choice_grade|0.4980|± |0.0158| 57 | |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.4875|± |0.0112| 58 | |bigbench_ruin_names | 0|multiple_choice_grade|0.3237|± |0.0221| 59 | |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.2154|± |0.0130| 60 | |bigbench_snarks | 0|multiple_choice_grade|0.6575|± |0.0354| 61 | |bigbench_sports_understanding | 0|multiple_choice_grade|0.5639|± |0.0158| 62 | |bigbench_temporal_sequences | 0|multiple_choice_grade|0.2500|± |0.0137| 63 | |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.2256|± |0.0118| 64 | |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1794|± |0.0092| 65 | |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.4300|± |0.0286| 66 | ``` 67 | Average: 37.19 68 | 69 | TruthfulQA: 70 | ``` 71 | | Task |Version|Metric|Value | |Stderr| 72 | |-------------|------:|------|-----:|---|-----:| 73 | |truthfulqa_mc| 1|mc1 |0.5288|± |0.0175| 74 | | | |mc2 |0.6715|± |0.0158| 75 | ``` -------------------------------------------------------------------------------- /benchmark-logs/template.md: -------------------------------------------------------------------------------- 1 | GPT4All: 2 | ``` 3 | ``` 4 | Average: 5 | 6 | AGIEval: 7 | ``` 8 | ``` 9 | Average: 10 | 11 | BigBench: 12 | ``` 13 | ``` 14 | Average: 15 | 16 | TruthfulQA: 17 | ``` 18 | ``` 19 | 
--------------------------------------------------------------------------------
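
A note on the "Average:" lines (added as a reading aid; it is not part of the harness output or of any log file above): each per-suite average appears to be the plain mean of one score per task from the table directly above it. Inferred from the recorded numbers in these logs, the selection rule seems to be: prefer `acc_norm` when reported, fall back to `acc` otherwise (GPT4All and AGIEval), use `multiple_choice_grade` for BigBench, and ignore `exact_str_match` rows; the mean is then multiplied by 100 and rounded to two decimals. The sketch below is a minimal reconstruction of that assumed rule; `suite_average` is a hypothetical helper name, not something defined in this repository.

```python
def suite_average(table: str) -> float:
    """Recover a suite 'Average:' value (x100, 2 dp) from a pasted harness table.

    Assumed rule, inferred from these logs: prefer acc_norm, else acc,
    else multiple_choice_grade; skip exact_str_match rows entirely.
    """
    per_task: dict[str, dict[str, float]] = {}
    task = None  # carried across continuation rows whose Task cell is blank
    for raw in table.splitlines():
        if "|" not in raw:
            continue  # skip harness config lines and blank lines
        cells = [c.strip() for c in raw.split("|")]
        if len(cells) < 6:
            continue
        try:
            value = float(cells[4])  # the Value column
        except ValueError:
            continue  # header row and |---:| separator rows
        if cells[1]:
            task = cells[1]  # a named row starts a new task
        if task is None or cells[3] == "exact_str_match":
            continue  # exact_str_match never counts toward the BigBench average
        per_task.setdefault(task, {})[cells[3]] = value

    picked = []
    for metrics in per_task.values():
        for key in ("acc_norm", "acc", "multiple_choice_grade"):
            if key in metrics:
                picked.append(metrics[key])
                break
    return round(100 * sum(picked) / len(picked), 2)


# Example: paste any full GPT4All/AGIEval/BigBench block from these logs between
# the triple quotes to recover the Average line that follows it.
example = """
|arc_challenge| 0|acc     |0.2782|± |0.0131|
|             |  |acc_norm|0.3012|± |0.0134|
|boolq        | 1|acc     |0.5774|± |0.0086|
"""
print(suite_average(example))  # -> 43.93 for this truncated example
```

Checked against the tables above: it reproduces 52.99 / 21.05 / 31.95 for TinyLlama-1.1B-3T, 72.13 / 48.10 for Yi-34B-Chat, and 70.78 / 39.54 / 37.19 for deita-v1.0-Mistral-7B, so the inferred rule is at least consistent with this section.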