├── .github └── ISSUE_TEMPLATE │ └── model_eval_request.yml ├── README.md ├── annotations └── HE_annotations.jsonl ├── data └── MHPP.jsonl └── fig ├── mhpp_leaderboard.png └── statistics.png /.github/ISSUE_TEMPLATE/model_eval_request.yml: -------------------------------------------------------------------------------- 1 | name: "🤗 Model Evaluation Request" 2 | description: Request the MHPP maintainers to evaluate your model independently and add the results to our leaderboard. 3 | title: "🤗 [REQUEST] - FILL_THE_MODEL_NAME_HERE" 4 | labels: ["model eval"] 5 | body: 6 | - type: textarea 7 | id: about 8 | attributes: 9 | label: "Model introduction" 10 | description: Provide a brief introduction to the model. 11 | placeholder: The model is created by ... and is used for ... 12 | validations: 13 | required: true 14 | - type: input 15 | id: url 16 | attributes: 17 | label: "Model URL (Optional)" 18 | description: Indicate the URL of the model (e.g., Hugging Face or other release pages). 19 | placeholder: https://huggingface.co/[???]/[???] 20 | validations: 21 | required: false 22 | - type: textarea 23 | id: other 24 | attributes: 25 | label: "Additional information (Optional)" 26 | description: Add more information about preferred settings. 27 | placeholder: Which specific instruction should we use? What data type precision should be used? What is the minimum hardware requirement? 28 | validations: 29 | required: false 30 | - type: textarea 31 | id: decontamination 32 | attributes: 33 | label: "Decontamination" 34 | description: How do the authors avoid contamination of their training data? 35 | placeholder: Please clarify the decontamination steps and quantify them, e.g., the N-gram match rate of ground-truth code in the training dataset. 36 | validations: 37 | required: true 38 | - type: dropdown 39 | id: author 40 | attributes: 41 | label: "Author" 42 | description: "Are you (one of) the author(s) of the model?" 43 | multiple: false 44 | options: 45 | - "Yes" 46 | - "No" 47 | validations: 48 | required: true 49 | - type: dropdown 50 | id: data 51 | attributes: 52 | label: "Data" 53 | description: "Is the training/fine-tuning data publicly available?" 54 | multiple: false 55 | options: 56 | - "Yes (If so, please specify in 'Additional information')" 57 | - "No" 58 | validations: 59 | required: true 60 | - type: checkboxes 61 | id: security 62 | attributes: 63 | label: "Security" 64 | options: 65 | - label: "I confirm that the model is safe to run and is not designed to produce malicious code or content." 66 | required: true 67 | - type: checkboxes 68 | id: integrity 69 | attributes: 70 | label: "Integrity" 71 | options: 72 | - label: "I confirm that the model is unique, original work and does not contain any plagiarism." 73 | required: true 74 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## The official repo of the paper [MHPP: Exploring the Capabilities and Limitations of Language Models Beyond Basic Code Generation](https://arxiv.org/abs/2405.11430) 📄✨ 2 | 3 | ## About 4 | 5 | MHPP is a code generation benchmark that: 6 | * ✨ Assesses LLMs' comprehension of specifications and restrictions, multi-step reasoning, and effective application of coding knowledge. 7 | * ✨ Contains 210 manually curated hard Python problems across 7 different challenge types. 8 | ![statistics](./fig/statistics.png) 9 | 10 | Why MHPP?
11 | * ✨ **Precise Evaluation & Ranking:** See the [MHPP Leaderboard](https://sparksofagi.github.io/MHPP/). The results align well with HumanEval while also surfacing new findings. 12 | ![MHPP Leaderboard](./fig/mhpp_leaderboard.png) 13 | * ✨ **Comprehensive Coverage:** Our benchmark includes 7 challenge types, enabling a more rigorous and nuanced evaluation of models' coding capabilities across various difficulties, granularities, and reasoning levels. 14 | * ✨ **Data Integrity:** We will never release the answers or the full test suites, so data contamination is avoided and 'cheating' by memorizing test data is impossible. 15 | 16 | The dataset can be found in the `data` directory and includes details of each question, such as its challenge type. The raw questions can be accessed via the 'question' field. We also provide a version of the prompt with a prefix for users, as described in the paper. Creating this dataset was labor-intensive, so if you find any errors, please submit an issue to let us know. Thank you very much! 🙏 17 | 18 | ## 🔥 Quick Start 19 | 20 | 1. Run your model on our dataset in the `/data/` folder. You can either take the "prompt" field as input (as we do in the paper) or create a new input by adding more user instructions before or after the "question" field. 💻 21 | 2. Save the results in JSONL format. Each line should contain at least 'function_name', 'prompt', 'difficulty_types', and 'response' (see the sketch after this list). 22 | 3. Run the command below to upload your JSONL file to our server; remember to replace "file_name" with your actual filename: 23 | ```shell 24 | curl -F "file=@file_name.jsonl" http://ai-universe.cn:7531/upload 25 | # Example 26 | # curl -F "file=@gpt4o_2024_05_13.jsonl" http://ai-universe.cn:7531/upload 27 | ``` 28 | 4. Wait. ⏳ 29 | 5. Get your model's results along with a detailed report. 📊 30 | 6. (Optional) If you wish to have your result featured on the leaderboard, or if you have any further questions, please click "[File a request](https://github.com/SparksofAGI/MHPP/issues/new?assignees=&labels=model+eval&projects=&template=model_eval_request.yml&title=💡+%5BREQUEST%5D+-+%3CMODEL_NAME%3E)" on the leaderboard page. This will allow you to create an issue, fill in the necessary information, and submit it. 📤 31 | 32 |
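A minimal sketch of steps 1 and 2, using the "prompt" field as the model input. `run_model` is a placeholder for your own inference call, and we assume here that each record in `data/MHPP.jsonl` also carries 'function_name' and 'difficulty_types' fields that can be copied into the result file; adjust the field access if your copy of the data differs.
```python
import json

def run_model(prompt: str) -> str:
    """Placeholder for your own model call; replace with real inference."""
    return "def solution():\n    pass  # model output goes here"

# Step 1: run the model on every problem, taking the "prompt" field as input.
results = []
with open("data/MHPP.jsonl", encoding="utf-8") as f:
    for line in f:
        problem = json.loads(line)
        results.append({
            "function_name": problem["function_name"],       # assumed to be present in MHPP.jsonl
            "prompt": problem["prompt"],                      # the input used in the paper
            "difficulty_types": problem["difficulty_types"],  # assumed to be present in MHPP.jsonl
            "response": run_model(problem["prompt"]),
        })

# Step 2: write one JSON object per line, ready for the curl upload in step 3.
with open("file_name.jsonl", "w", encoding="utf-8") as f:
    for record in results:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```
The resulting `file_name.jsonl` contains the four required fields on every line and can be uploaded directly with the `curl` command shown in step 3.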
33 | Thank you for your interest and participation! 😊 34 | 35 | 36 | ## 📝 Citation 37 | 38 | If you use MHPP in your research, please consider citing us: 39 | 40 | ```bibtex 41 | @article{dai2024mhpp, 42 | title={MHPP: Exploring the Capabilities and Limitations of Language Models Beyond Basic Code Generation}, 43 | author={Dai, Jianbo and Lu, Jianqiao and Feng, Yunlong and Ruan, Rongju and Cheng, Ming and Tan, Haochen and Guo, Zhijiang}, 44 | journal={arXiv preprint arXiv:2405.11430}, 45 | year={2024} 46 | } 47 | ``` 48 | -------------------------------------------------------------------------------- /annotations/HE_annotations.jsonl: -------------------------------------------------------------------------------- 1 | {"id": 0, "function_name": "has_close_elements", "challenge_category": ["Basic"]} 2 | {"id": 1, "function_name": "separate_paren_groups", "challenge_category": ["Complex"]} 3 | {"id": 2, "function_name": "truncate_number", "challenge_category": ["Codesense"]} 4 | {"id": 3, "function_name": "below_zero", "challenge_category": ["Commonsense"]} 5 | {"id": 4, "function_name": "mean_absolute_deviation", "challenge_category": ["Redefinition"]} 6 | {"id": 5, "function_name": "intersperse", "challenge_category": ["Basic", "Cornercase"]} 7 | {"id": 6, "function_name": "parse_nested_parens", "challenge_category": ["Complex"]} 8 | {"id": 7, "function_name": "filter_by_substring", "challenge_category": ["Basic", "Codesense"]} 9 | {"id": 8, "function_name": "sum_product", "challenge_category": ["Basic"]} 10 | {"id": 9, "function_name": "rolling_max", "challenge_category": ["Distraction", "Complex"]} 11 | {"id": 10, "function_name": "make_palindrome", "challenge_category": ["Complex"]} 12 | {"id": 11, "function_name": "string_xor", "challenge_category": ["Commonsense"]} 13 | {"id": 12, "function_name": "longest", "challenge_category": ["Cornercase", "Complex"]} 14 | {"id": 13, "function_name": "greatest_common_divisor", "challenge_category": ["Shortcut", "Codesense"]} 15 | {"id": 14, "function_name": "all_prefixes", "challenge_category": ["Basic", "Complex"]} 16 | {"id": 15, "function_name": "string_sequence", "challenge_category": ["Basic", "Codesense"]} 17 | {"id": 16, "function_name": "count_distinct_characters", "challenge_category": ["Codesense"]} 18 | {"id": 17, "function_name": "parse_music", "challenge_category": ["Redefinition"]} 19 | {"id": 18, "function_name": "how_many_times", "challenge_category": ["Basic"]} 20 | {"id": 19, "function_name": "sort_numbers", "challenge_category": ["Basic", "Commonsense"]} 21 | {"id": 20, "function_name": "find_closest_elements", "challenge_category": ["Complex"]} 22 | {"id": 21, "function_name": "rescale_to_unit", "challenge_category": ["Commonsense", "Codesense"]} 23 | {"id": 22, "function_name": "filter_integers", "challenge_category": ["Codesense"]} 24 | {"id": 23, "function_name": "strlen", "challenge_category": ["Basic", "Codesense"]} 25 | {"id": 24, "function_name": "largest_divisor", "challenge_category": ["Codesense"]} 26 | {"id": 25, "function_name": "factorize", "challenge_category": ["Shortcut", "Commonsense", "Codesense"]} 27 | {"id": 26, "function_name": "remove_duplicates", "challenge_category": ["Codesense"]} 28 | {"id": 27, "function_name": "flip_case", "challenge_category": ["Codesense"]} 29 | {"id": 28, "function_name": "concatenate", "challenge_category": ["Codesense"]} 30 | {"id": 29, "function_name": "filter_by_prefix", "challenge_category": ["Codesense"]} 31 | {"id": 30, "function_name": "get_positive", "challenge_category": ["Basic"]} 32 | {"id": 31, "function_name":
"is_prime", "challenge_category": ["Commonsense", "Codesense"]} 33 | {"id": 32, "function_name": "find_zero", "challenge_category": ["Commonsense", "Complex"]} 34 | {"id": 33, "function_name": "sort_third", "challenge_category": ["Shortcut", "Complex"]} 35 | {"id": 34, "function_name": "unique", "challenge_category": ["Codesense"]} 36 | {"id": 35, "function_name": "max_element", "challenge_category": ["Basic", "Codesense"]} 37 | {"id": 36, "function_name": "fizz_buzz", "challenge_category": ["Complex"]} 38 | {"id": 37, "function_name": "sort_even", "challenge_category": ["Complex"]} 39 | {"id": 38, "function_name": "decode_cyclic", "challenge_category": ["Redefinition", "Commonsense"]} 40 | {"id": 39, "function_name": "prime_fib", "challenge_category": ["Commonsense", "Complex"]} 41 | {"id": 40, "function_name": "triples_sum_to_zero", "challenge_category": ["Distraction", "Complex"]} 42 | {"id": 41, "function_name": "car_race_collision", "challenge_category": ["Commonsense"]} 43 | {"id": 42, "function_name": "incr_list", "challenge_category": ["Basic"]} 44 | {"id": 43, "function_name": "pairs_sum_to_zero", "challenge_category": ["Distraction", "Complex"]} 45 | {"id": 44, "function_name": "change_base", "challenge_category": ["Complex", "Codesense"]} 46 | {"id": 45, "function_name": "triangle_area", "challenge_category": ["Commonsense"]} 47 | {"id": 46, "function_name": "fib4", "challenge_category": ["Redefinition", "Cornercase"]} 48 | {"id": 47, "function_name": "median", "challenge_category": ["Codesense"]} 49 | {"id": 48, "function_name": "is_palindrome", "challenge_category": ["Commonsense", "Codesense"]} 50 | {"id": 49, "function_name": "modp", "challenge_category": ["Shortcut", "Commonsense"]} 51 | {"id": 50, "function_name": "decode_shift", "challenge_category": ["Redefinition", "Complex", "Codesense"]} 52 | {"id": 51, "function_name": "remove_vowels", "challenge_category": ["Commonsense"]} 53 | {"id": 52, "function_name": "below_threshold", "challenge_category": ["Basic"]} 54 | {"id": 53, "function_name": "add", "challenge_category": ["Basic"]} 55 | {"id": 54, "function_name": "same_chars", "challenge_category": ["Basic", "Codesense"]} 56 | {"id": 55, "function_name": "fib", "challenge_category": ["Commonsense", "Cornercase"]} 57 | {"id": 56, "function_name": "correct_bracketing", "challenge_category": ["Codesense"]} 58 | {"id": 57, "function_name": "monotonic", "challenge_category": ["Shortcut", "Codesense"]} 59 | {"id": 58, "function_name": "common", "challenge_category": ["Complex"]} 60 | {"id": 59, "function_name": "largest_prime_factor", "challenge_category": ["Commonsense", "Complex"]} 61 | {"id": 60, "function_name": "sum_to_n", "challenge_category": ["Shortcut", "Codesense"]} 62 | {"id": 61, "function_name": "correct_bracketing", "challenge_category": ["Codesense"]} 63 | {"id": 62, "function_name": "derivative", "challenge_category": ["Commonsense"]} 64 | {"id": 63, "function_name": "fibfib", "challenge_category": ["Redefinition", "Cornercase"]} 65 | {"id": 64, "function_name": "vowels_count", "challenge_category": ["Redefinition"]} 66 | {"id": 65, "function_name": "circular_shift", "challenge_category": ["Complex", "Codesense"]} 67 | {"id": 66, "function_name": "digitSum", "challenge_category": ["Commonsense", "Codesense"]} 68 | {"id": 67, "function_name": "fruit_distribution", "challenge_category": ["Redefinition"]} 69 | {"id": 68, "function_name": "pluck", "challenge_category": ["Redefinition", "Cornercase", "Complex"]} 70 | {"id": 69, "function_name": "search", 
"challenge_category": ["Complex"]} 71 | {"id": 70, "function_name": "strange_sort_list", "challenge_category": ["Redefinition"]} 72 | {"id": 71, "function_name": "triangle_area", "challenge_category": ["Shortcut", "Cornercase", "Complex"]} 73 | {"id": 72, "function_name": "will_it_fly", "challenge_category": ["Redefinition", "Complex"]} 74 | {"id": 73, "function_name": "smallest_change", "challenge_category": ["Basic", "Redefinition"]} 75 | {"id": 74, "function_name": "total_match", "challenge_category": ["Basic"]} 76 | {"id": 75, "function_name": "is_multiply_prime", "challenge_category": ["Complex"]} 77 | {"id": 76, "function_name": "is_simple_power", "challenge_category": ["Basic", "Redefinition"]} 78 | {"id": 77, "function_name": "iscube", "challenge_category": ["Shortcut", "Commonsense", "Codesense"]} 79 | {"id": 78, "function_name": "hex_key", "challenge_category": ["Redefinition"]} 80 | {"id": 79, "function_name": "decimal_to_binary", "challenge_category": ["Basic", "Codesense"]} 81 | {"id": 80, "function_name": "is_happy", "challenge_category": ["Redefinition", "Cornercase"]} 82 | {"id": 81, "function_name": "numerical_letter_grade", "challenge_category": ["Basic", "Redefinition"]} 83 | {"id": 82, "function_name": "prime_length", "challenge_category": ["Commonsense", "Cornercase"]} 84 | {"id": 83, "function_name": "starts_one_ends", "challenge_category": ["Shortcut"]} 85 | {"id": 84, "function_name": "solve", "challenge_category": ["Codesense"]} 86 | {"id": 85, "function_name": "add", "challenge_category": ["Complex", "Codesense"]} 87 | {"id": 86, "function_name": "anti_shuffle", "challenge_category": ["Redefinition", "Complex"]} 88 | {"id": 87, "function_name": "get_row", "challenge_category": ["Redefinition", "Complex"]} 89 | {"id": 88, "function_name": "sort_array", "challenge_category": ["Cornercase", "Codesense"]} 90 | {"id": 89, "function_name": "encrypt", "challenge_category": ["Redefinition", "Commonsense"]} 91 | {"id": 90, "function_name": "next_smallest", "challenge_category": ["Cornercase", "Codesense"]} 92 | {"id": 91, "function_name": "is_bored", "challenge_category": ["Redefinition", "Codesense"]} 93 | {"id": 92, "function_name": "any_int", "challenge_category": ["Basic", "Codesense"]} 94 | {"id": 93, "function_name": "encode", "challenge_category": ["Commonsense", "Complex", "Codesense"]} 95 | {"id": 94, "function_name": "skjkasdkd", "challenge_category": ["Complex"]} 96 | {"id": 95, "function_name": "check_dict_case", "challenge_category": ["Cornercase", "Codesense"]} 97 | {"id": 96, "function_name": "count_up_to", "challenge_category": ["Complex"]} 98 | {"id": 97, "function_name": "multiply", "challenge_category": ["Basic", "Codesense"]} 99 | {"id": 98, "function_name": "count_upper", "challenge_category": ["Commonsense", "Complex"]} 100 | {"id": 99, "function_name": "closest_integer", "challenge_category": ["Redefinition", "Codesense"]} 101 | {"id": 100, "function_name": "make_a_pile", "challenge_category": ["Redefinition"]} 102 | {"id": 101, "function_name": "words_string", "challenge_category": ["Cornercase", "Codesense"]} 103 | {"id": 102, "function_name": "choose_num", "challenge_category": ["Cornercase"]} 104 | {"id": 103, "function_name": "rounded_avg", "challenge_category": ["Cornercase", "Complex", "Codesense"]} 105 | {"id": 104, "function_name": "unique_digits", "challenge_category": ["Complex"]} 106 | {"id": 105, "function_name": "by_length", "challenge_category": ["Redefinition", "Cornercase", "Codesense"]} 107 | {"id": 106, "function_name": "f", 
"challenge_category": ["Complex"]} 108 | {"id": 107, "function_name": "even_odd_palindrome", "challenge_category": ["Complex"]} 109 | {"id": 108, "function_name": "count_nums", "challenge_category": ["Basic"]} 110 | {"id": 109, "function_name": "move_one_ball", "challenge_category": ["Redefinition"]} 111 | {"id": 110, "function_name": "exchange", "challenge_category": ["Shortcut", "Complex"]} 112 | {"id": 111, "function_name": "histogram", "challenge_category": ["Basic", "Codesense"]} 113 | {"id": 112, "function_name": "reverse_delete", "challenge_category": ["Commonsense", "Complex"]} 114 | {"id": 113, "function_name": "odd_count", "challenge_category": ["Complex"]} 115 | {"id": 114, "function_name": "minSubArraySum", "challenge_category": ["Shortcut"]} 116 | {"id": 115, "function_name": "max_fill", "challenge_category": ["Redefinition"]} 117 | {"id": 116, "function_name": "sort_array", "challenge_category": ["Commonsense", "Codesense"]} 118 | {"id": 117, "function_name": "select_words", "challenge_category": ["Commonsense", "Complex"]} 119 | {"id": 118, "function_name": "get_closest_vowel", "challenge_category": ["Commonsense", "Cornercase", "Complex"]} 120 | {"id": 119, "function_name": "match_parens", "challenge_category": ["Redefinition"]} 121 | {"id": 120, "function_name": "maximum", "challenge_category": ["Shortcut", "Codesense"]} 122 | {"id": 121, "function_name": "solution", "challenge_category": ["Complex", "Codesense"]} 123 | {"id": 122, "function_name": "add_elements", "challenge_category": ["Basic"]} 124 | {"id": 123, "function_name": "get_odd_collatz", "challenge_category": ["Redefinition", "Commonsense"]} 125 | {"id": 124, "function_name": "valid_date", "challenge_category": ["Complex"]} 126 | {"id": 125, "function_name": "split_words", "challenge_category": ["Complex"]} 127 | {"id": 126, "function_name": "is_sorted", "challenge_category": ["Complex"]} 128 | {"id": 127, "function_name": "intersection", "challenge_category": ["Commonsense", "Complex"]} 129 | {"id": 128, "function_name": "prod_signs", "challenge_category": ["Cornercase", "Complex"]} 130 | {"id": 129, "function_name": "minPath", "challenge_category": ["Redefinition", "Shortcut", "Complex"]} 131 | {"id": 130, "function_name": "tri", "challenge_category": ["Redefinition"]} 132 | {"id": 131, "function_name": "digits", "challenge_category": ["Complex"]} 133 | {"id": 132, "function_name": "is_nested", "challenge_category": ["Complex"]} 134 | {"id": 133, "function_name": "sum_squares", "challenge_category": ["Basic", "Codesense"]} 135 | {"id": 134, "function_name": "check_if_last_char_is_a_letter", "challenge_category": ["Codesense"]} 136 | {"id": 135, "function_name": "can_arrange", "challenge_category": ["Basic"]} 137 | {"id": 136, "function_name": "largest_smallest_integers", "challenge_category": ["Codesense"]} 138 | {"id": 137, "function_name": "compare_one", "challenge_category": ["Redefinition", "Commonsense", "Codesense"]} 139 | {"id": 138, "function_name": "is_equal_to_sum_even", "challenge_category": ["Shortcut"]} 140 | {"id": 139, "function_name": "special_factorial", "challenge_category": ["Redefinition"]} 141 | {"id": 140, "function_name": "fix_spaces", "challenge_category": ["Complex"]} 142 | {"id": 141, "function_name": "file_name_check", "challenge_category": ["Redefinition", "Commonsense", "Complex"]} 143 | {"id": 142, "function_name": "sum_squares", "challenge_category": ["Complex"]} 144 | {"id": 143, "function_name": "words_in_sentence", "challenge_category": ["Commonsense", "Complex"]} 145 | 
{"id": 144, "function_name": "simplify", "challenge_category": ["Commonsense"]} 146 | {"id": 145, "function_name": "order_by_points", "challenge_category": ["Complex"]} 147 | {"id": 146, "function_name": "specialFilter", "challenge_category": ["Complex"]} 148 | {"id": 147, "function_name": "get_max_triples", "challenge_category": ["Complex"]} 149 | {"id": 148, "function_name": "bf", "challenge_category": ["Redefinition", "Complex"]} 150 | {"id": 149, "function_name": "sorted_list_sum", "challenge_category": ["Complex", "Codesense"]} 151 | {"id": 150, "function_name": "x_or_y", "challenge_category": ["Commonsense", "Cornercase"]} 152 | {"id": 151, "function_name": "double_the_difference", "challenge_category": ["Complex"]} 153 | {"id": 152, "function_name": "compare", "challenge_category": ["Distraction"]} 154 | {"id": 153, "function_name": "Strongest_Extension", "challenge_category": ["Redefinition", "Complex"]} 155 | {"id": 154, "function_name": "cycpattern_check", "challenge_category": ["Commonsense", "Codesense"]} 156 | {"id": 155, "function_name": "even_odd_count", "challenge_category": ["Basic"]} 157 | {"id": 156, "function_name": "int_to_mini_roman", "challenge_category": ["Commonsense"]} 158 | {"id": 157, "function_name": "right_angle_triangle", "challenge_category": ["Commonsense"]} 159 | {"id": 158, "function_name": "find_max", "challenge_category": ["Commonsense", "Codesense"]} 160 | {"id": 159, "function_name": "eat", "challenge_category": ["Basic", "Redefinition"]} 161 | {"id": 160, "function_name": "do_algebra", "challenge_category": ["Redefinition", "Codesense"]} 162 | {"id": 161, "function_name": "solve", "challenge_category": ["Complex"]} 163 | {"id": 162, "function_name": "string_to_md5", "challenge_category": ["Codesense"]} 164 | {"id": 163, "function_name": "generate_integers", "challenge_category": ["Codesense"]} 165 | -------------------------------------------------------------------------------- /fig/mhpp_leaderboard.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SparksofAGI/MHPP/58a6a6408618681720bee4b74754d38d26b820e3/fig/mhpp_leaderboard.png -------------------------------------------------------------------------------- /fig/statistics.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SparksofAGI/MHPP/58a6a6408618681720bee4b74754d38d26b820e3/fig/statistics.png --------------------------------------------------------------------------------