├── README.md
├── codes
    ├── examples.py
    ├── mr-gsm8k_eval.py
    └── mr-math_eval.py
├── dataset
    ├── mr-gsm8k.json
    ├── mr-math_invalid_errors.json
    ├── mr-math_redundant_errors.json
    └── perturbation.json
├── eval_results
    ├── mr-gsm8k
    │   ├── gpt3_5_turbo_eval_results.json
    │   ├── gpt4_eval_results.json
    │   ├── math-shepherd_mistral-7b_eval_results.json
    │   ├── reasoneval_abel-7b-002_eval_results.json
    │   ├── reasoneval_llama2-7b_eval_results.json
    │   ├── reasoneval_llemma-34b_eval_results.json
    │   ├── reasoneval_llemma-7b_eval_results.json
    │   ├── reasoneval_mistral-7b_eval_results.json
    │   ├── reasoneval_wizardmath-7b-v1.0_eval_results.json
    │   ├── reasoneval_wizardmath-7b-v1.1_eval_results.json
    │   ├── roscoe-sa_eval_results.json
    │   └── roscoe-ss_eval_results.json
    ├── mr-math_invalid_errors
    │   ├── gpt3_5_turbo_eval_results.json
    │   ├── gpt4_eval_results.json
    │   ├── math-shepherd_mistral-7b_eval_results.json
    │   ├── reasoneval_abel-7b-002_eval_results.json
    │   ├── reasoneval_llama2-7b_eval_results.json
    │   ├── reasoneval_llemma-34b_eval_results.json
    │   ├── reasoneval_llemma-7b_eval_results.json
    │   ├── reasoneval_mistral-7b_eval_results.json
    │   ├── reasoneval_wizardmath-7b-v1.0_eval_results.json
    │   ├── reasoneval_wizardmath-7b-v1.1_eval_results.json
    │   ├── roscoe-sa_eval_results.json
    │   └── roscoe-ss_eval_results.json
    ├── mr-math_redundant_errors
    │   ├── gpt3_5_turbo_eval_results.json
    │   ├── gpt4_eval_results.json
    │   ├── math-shepherd_mistral-7b_eval_results.json
    │   ├── reasoneval_abel-7b-002_eval_results.json
    │   ├── reasoneval_llama2-7b_eval_results.json
    │   ├── reasoneval_llemma-34b_eval_results.json
    │   ├── reasoneval_llemma-7b_eval_results.json
    │   ├── reasoneval_mistral-7b_eval_results.json
    │   ├── reasoneval_wizardmath-7b-v1.0_eval_results.json
    │   ├── reasoneval_wizardmath-7b-v1.1_eval_results.json
    │   ├── roscoe-sa_eval_results.json
    │   └── roscoe-ss_eval_results.json
    └── perturbation
    │   ├── math-shepherd_mistral-7b_perturbed_eval_results.json
    │   ├── math-shepherd_mistral_7b_original_eval_results.json
    │   ├── reasoneval_llemma-34b_perturbed_eval_results.json
    │   └── reasoneval_llemma_34b_original_eval_results.json
├── images
    └── introduction.jpg
└── requirements.txt

/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/README.md
--------------------------------------------------------------------------------
/codes/examples.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/codes/examples.py
--------------------------------------------------------------------------------
/codes/mr-gsm8k_eval.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/codes/mr-gsm8k_eval.py
--------------------------------------------------------------------------------
/codes/mr-math_eval.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/codes/mr-math_eval.py
--------------------------------------------------------------------------------
/dataset/mr-gsm8k.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/dataset/mr-gsm8k.json
--------------------------------------------------------------------------------
/dataset/mr-math_invalid_errors.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/dataset/mr-math_invalid_errors.json
--------------------------------------------------------------------------------
/dataset/mr-math_redundant_errors.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/dataset/mr-math_redundant_errors.json
--------------------------------------------------------------------------------
/dataset/perturbation.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/dataset/perturbation.json
--------------------------------------------------------------------------------
/eval_results/mr-gsm8k/gpt3_5_turbo_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/mr-gsm8k/gpt3_5_turbo_eval_results.json
--------------------------------------------------------------------------------
/eval_results/mr-gsm8k/gpt4_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/mr-gsm8k/gpt4_eval_results.json
--------------------------------------------------------------------------------
/eval_results/mr-gsm8k/math-shepherd_mistral-7b_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/mr-gsm8k/math-shepherd_mistral-7b_eval_results.json
--------------------------------------------------------------------------------
/eval_results/mr-gsm8k/reasoneval_abel-7b-002_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/mr-gsm8k/reasoneval_abel-7b-002_eval_results.json
--------------------------------------------------------------------------------
/eval_results/mr-gsm8k/reasoneval_llama2-7b_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/mr-gsm8k/reasoneval_llama2-7b_eval_results.json
--------------------------------------------------------------------------------
/eval_results/mr-gsm8k/reasoneval_llemma-34b_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/mr-gsm8k/reasoneval_llemma-34b_eval_results.json
--------------------------------------------------------------------------------
/eval_results/mr-gsm8k/reasoneval_llemma-7b_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/mr-gsm8k/reasoneval_llemma-7b_eval_results.json
--------------------------------------------------------------------------------
/eval_results/mr-gsm8k/reasoneval_mistral-7b_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/mr-gsm8k/reasoneval_mistral-7b_eval_results.json
--------------------------------------------------------------------------------
/eval_results/mr-gsm8k/reasoneval_wizardmath-7b-v1.0_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/mr-gsm8k/reasoneval_wizardmath-7b-v1.0_eval_results.json
--------------------------------------------------------------------------------
/eval_results/mr-gsm8k/reasoneval_wizardmath-7b-v1.1_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/mr-gsm8k/reasoneval_wizardmath-7b-v1.1_eval_results.json
--------------------------------------------------------------------------------
/eval_results/mr-gsm8k/roscoe-sa_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/mr-gsm8k/roscoe-sa_eval_results.json
--------------------------------------------------------------------------------
/eval_results/mr-gsm8k/roscoe-ss_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/mr-gsm8k/roscoe-ss_eval_results.json
--------------------------------------------------------------------------------
/eval_results/mr-math_invalid_errors/gpt3_5_turbo_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/mr-math_invalid_errors/gpt3_5_turbo_eval_results.json
--------------------------------------------------------------------------------
/eval_results/mr-math_invalid_errors/gpt4_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/mr-math_invalid_errors/gpt4_eval_results.json
--------------------------------------------------------------------------------
/eval_results/mr-math_invalid_errors/math-shepherd_mistral-7b_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/mr-math_invalid_errors/math-shepherd_mistral-7b_eval_results.json
--------------------------------------------------------------------------------
/eval_results/mr-math_invalid_errors/reasoneval_abel-7b-002_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/mr-math_invalid_errors/reasoneval_abel-7b-002_eval_results.json
--------------------------------------------------------------------------------
/eval_results/mr-math_invalid_errors/reasoneval_llama2-7b_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/mr-math_invalid_errors/reasoneval_llama2-7b_eval_results.json
--------------------------------------------------------------------------------
/eval_results/mr-math_invalid_errors/reasoneval_llemma-34b_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/mr-math_invalid_errors/reasoneval_llemma-34b_eval_results.json
--------------------------------------------------------------------------------
/eval_results/mr-math_invalid_errors/reasoneval_llemma-7b_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/mr-math_invalid_errors/reasoneval_llemma-7b_eval_results.json
--------------------------------------------------------------------------------
/eval_results/mr-math_invalid_errors/reasoneval_mistral-7b_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/mr-math_invalid_errors/reasoneval_mistral-7b_eval_results.json
--------------------------------------------------------------------------------
/eval_results/mr-math_invalid_errors/reasoneval_wizardmath-7b-v1.0_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/mr-math_invalid_errors/reasoneval_wizardmath-7b-v1.0_eval_results.json
--------------------------------------------------------------------------------
/eval_results/mr-math_invalid_errors/reasoneval_wizardmath-7b-v1.1_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/mr-math_invalid_errors/reasoneval_wizardmath-7b-v1.1_eval_results.json
--------------------------------------------------------------------------------
/eval_results/mr-math_invalid_errors/roscoe-sa_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/mr-math_invalid_errors/roscoe-sa_eval_results.json
--------------------------------------------------------------------------------
/eval_results/mr-math_invalid_errors/roscoe-ss_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/mr-math_invalid_errors/roscoe-ss_eval_results.json
--------------------------------------------------------------------------------
/eval_results/mr-math_redundant_errors/gpt3_5_turbo_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/mr-math_redundant_errors/gpt3_5_turbo_eval_results.json
--------------------------------------------------------------------------------
/eval_results/mr-math_redundant_errors/gpt4_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/mr-math_redundant_errors/gpt4_eval_results.json
--------------------------------------------------------------------------------
/eval_results/mr-math_redundant_errors/math-shepherd_mistral-7b_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/mr-math_redundant_errors/math-shepherd_mistral-7b_eval_results.json
--------------------------------------------------------------------------------
/eval_results/mr-math_redundant_errors/reasoneval_abel-7b-002_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/mr-math_redundant_errors/reasoneval_abel-7b-002_eval_results.json
--------------------------------------------------------------------------------
/eval_results/mr-math_redundant_errors/reasoneval_llama2-7b_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/mr-math_redundant_errors/reasoneval_llama2-7b_eval_results.json
--------------------------------------------------------------------------------
/eval_results/mr-math_redundant_errors/reasoneval_llemma-34b_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/mr-math_redundant_errors/reasoneval_llemma-34b_eval_results.json
--------------------------------------------------------------------------------
/eval_results/mr-math_redundant_errors/reasoneval_llemma-7b_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/mr-math_redundant_errors/reasoneval_llemma-7b_eval_results.json
--------------------------------------------------------------------------------
/eval_results/mr-math_redundant_errors/reasoneval_mistral-7b_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/mr-math_redundant_errors/reasoneval_mistral-7b_eval_results.json
--------------------------------------------------------------------------------
/eval_results/mr-math_redundant_errors/reasoneval_wizardmath-7b-v1.0_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/mr-math_redundant_errors/reasoneval_wizardmath-7b-v1.0_eval_results.json
--------------------------------------------------------------------------------
/eval_results/mr-math_redundant_errors/reasoneval_wizardmath-7b-v1.1_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/mr-math_redundant_errors/reasoneval_wizardmath-7b-v1.1_eval_results.json
--------------------------------------------------------------------------------
/eval_results/mr-math_redundant_errors/roscoe-sa_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/mr-math_redundant_errors/roscoe-sa_eval_results.json
--------------------------------------------------------------------------------
/eval_results/mr-math_redundant_errors/roscoe-ss_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/mr-math_redundant_errors/roscoe-ss_eval_results.json
--------------------------------------------------------------------------------
/eval_results/perturbation/math-shepherd_mistral-7b_perturbed_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/perturbation/math-shepherd_mistral-7b_perturbed_eval_results.json
--------------------------------------------------------------------------------
/eval_results/perturbation/math-shepherd_mistral_7b_original_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/perturbation/math-shepherd_mistral_7b_original_eval_results.json
--------------------------------------------------------------------------------
/eval_results/perturbation/reasoneval_llemma-34b_perturbed_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/perturbation/reasoneval_llemma-34b_perturbed_eval_results.json
--------------------------------------------------------------------------------
/eval_results/perturbation/reasoneval_llemma_34b_original_eval_results.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/eval_results/perturbation/reasoneval_llemma_34b_original_eval_results.json
--------------------------------------------------------------------------------
/images/introduction.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GAIR-NLP/ReasonEval/HEAD/images/introduction.jpg
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
torch>=2.0.1
transformers>=4.35.0
scikit-learn>=1.4.0
--------------------------------------------------------------------------------