├── .gitignore ├── .gitmodules ├── README.md ├── build_run_specs_full.py ├── helm.md ├── leaderboard.md ├── open_api_spec.json ├── run_specs.conf ├── run_specs_full_coarse_600_budget.conf └── sample-submissions ├── lit-gpt ├── Dockerfile ├── README.md ├── api.py ├── fast_api_requirements.txt ├── helper.py └── main.py └── llama_recipes ├── Dockerfile ├── Dockerfile.train ├── README.md ├── api.py ├── fast_api_requirements.txt ├── main.py └── train.py /.gitignore: -------------------------------------------------------------------------------- 1 | sample-submissions/lit-gpt/__pycache__/ -------------------------------------------------------------------------------- /.gitmodules: -------------------------------------------------------------------------------- 1 | [submodule "sample-submissions/lit-gpt/lit-gpt"] 2 | path = sample-submissions/lit-gpt/lit-gpt 3 | url = https://github.com/Lightning-AI/lit-gpt 4 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # NeurIPS 1 LLM 1 GPU Challenge 2 | 3 | This repository provides a starting point for those who are interested in the [NeurIPS 1 LLM 1 GPU Competition](https://llm-efficiency-challenge.github.io/). It clarifies exactly what a submission looks like and how it will be evaluated and submitted. 4 | 5 | At a high level, the key thing you will contribute is a `Dockerfile`, which will be a reproducible artifact that we can use to test your submission. The `Dockerfile` should contain all the code and dependencies needed to run your submission. We will use this `Dockerfile` to build a Docker image and then run it against a set of tasks which will be a subset of the [HELM](https://crfm.stanford.edu/helm/latest/) tasks. 6 | 7 | Your `Dockerfile` will expose a simple HTTP server, which needs to implement 2 endpoints: `/process` and `/tokenize`. We will build that `Dockerfile` and expect it to launch an HTTP server. Once that server is launched, we will make requests to it via HELM and record your results. 9 | At a high level, the flow you should follow to ensure a strong submission is: 10 | 1. Pick approved LLMs and datasets from [here](https://llm-efficiency-challenge.github.io/challenge) 11 | 2. Start with one of the [sample-submissions](sample-submissions) and make sure it runs 12 | 3. Evaluate it locally on your own 40GB A100 or 4090; if you don't have funding for either, please see the [GPU funding](#gpu-funding) section for some more options 13 | 4. Once you have something working, you can make a submission on our [Discord Leaderboard](https://discord.com/channels/1124130156336922665/1124134272631054447/1151718598818156645) to see how you fare against other competitors 14 | 5. On the competition deadline, make sure you have the final eval Dockerfile you'd like us to run in your GitHub repo, and refer to the [timeline](https://llm-efficiency-challenge.github.io/dates) 15 | 6. 
If your entry makes the shortlist, we will work with you to reproduce all of your artifacts with another finetuning Dockerfile 16 | 17 | ## Contents 18 | 19 | - [Approved LLM & Dataset](#approved-llm-and-dataset) 20 | - [Submission](#submission) 21 | - [Evaluate Your Model Locally Using HELM](#evaluate-your-model-locally-using-helm) 22 | - [Finetune](#finetune) 23 | - [Create your own submission template](#create-your-own-submission-template) 24 | - [Discord Leaderboard](#discord-leaderboard) 25 | - [Final Leaderboard Submission](#final-leaderboard-submission) 26 | - [Evaluating the Final Submission](#evaluating-the-final-submission) 27 | - [GPU funding](#gpu-funding) 28 | 29 | ## Approved LLM and dataset 30 | 31 | The LLM space has complex licenses, which can make it difficult to figure out what's permissible to use in a competition. To streamline this process, we've shortlisted a few models and datasets we know are safe to use [here](https://llm-efficiency-challenge.github.io/challenge) 32 | 33 | That said, the LLM space is fast moving, so if you'd like to use a dataset or model that isn't on our list, make sure to ask us about it on [https://discord.gg/XJwQ5ddMK7](https://discord.gg/XJwQ5ddMK7) 34 | 35 | ## Submission 36 | 37 | The submission in this repository is a basic implementation of an HTTP server set up in accordance with the `open_api` spec. It includes a sample solution built off of [Lit-GPT](https://github.com/Lightning-AI/lit-gpt) and open-llama weights that participants can reference or modify as they see fit. 38 | 39 | You can use the provided code as a reference or starting point for your own implementation. The `main.py` file contains the simple FastAPI server, and you can modify it to suit your needs. 40 | 41 | You can find the Lit-GPT submission [here](sample-submissions/lit-gpt/) and the llama-recipes submission [here](sample-submissions/llama_recipes/) with instructions on how to run each locally. 42 | 43 | Make sure that your final submission has only a single `Dockerfile` and that your weights are not directly included in the repo; they need to be downloaded during the Docker build or at runtime. 44 | 45 | ## Evaluate Your Model Locally Using HELM 46 | 47 | Every submission will be tested against [HELM](https://crfm.stanford.edu/helm/latest/), which is a standard suite to evaluate LLMs on a broad set of datasets. This competition will leverage HELM for its evaluation infrastructure. The organizers will use standard STEM tasks from HELM, although we will keep the exact set a secret, and in addition we'll be including some held-out tasks that are presently not in HELM. 48 | 49 | As you're working on your submission `Dockerfile`, you'll want to test it out locally to make sure your contribution works as expected before you submit it. 50 | 51 | HELM makes it easy to add new evaluation datasets by just adding another line in a config file, so make sure to experiment with the different datasets it has available, and feel free to contribute your own. 52 | 53 | To learn more about how to test your submission with HELM, please follow the instructions [here](helm.md). 54 | 55 | ## Finetune 56 | 57 | It's likely that an untuned base model won't give you satisfactory results; in that case, you might find it helpful to do some additional finetuning. There are many frameworks to do this, but we've created 2 sample submissions that show how: 58 | 1. [lit-gpt](/sample-submissions/lit-gpt/) 59 | 2. 
[llama-recipes](/sample-submissions/llama_recipes/) 60 | 61 | 62 | ### Create Your Own Submission Template 63 | 64 | Note that while we've offered 2 sample submissions, our evaluation infrastructure is generic and only assumes an HTTP server, so you can use a Python finetuning framework like the ones we've suggested, or any non-Python-based framework you like. 65 | 66 | The `open_api_spec.json` file in this repository contains the OpenAPI specification for the Competition API. Competitors can use this specification to understand the API endpoints, request and response structures, and overall requirements for interacting with the competition platform. 67 | 68 | The OpenAPI specification provides a standardized way to describe the API, making it easier for competitors to develop their own solutions and integrate them seamlessly with the competition infrastructure. 69 | 70 | 71 | ## Discord Leaderboard 72 | 73 | The [Lightning AI](https://lightning.ai/) team has built a Discord-based leaderboard for us. You can find the bot on Discord by its name `evalbot#4372`. 74 | 75 | You can interact with it by DM'ing it a zipped file of your sample submission with a message of either `eval A100` or `eval 4090`. More details on the bot are [here](https://discord.com/channels/1124130156336922665/1124134272631054447/1151718598818156645) 76 | 77 | Once you make a submission, the bot will inform you whether your submission failed or succeeded, and after a few hours it will publicly post your results. If you're at the top of the queue you can expect the eval to take 1-2h, but depending on the size of the queue this could be longer. So please be mindful of other competitors trying to use the limited amount of hardware, and ensure that your submissions work locally first. 78 | 79 | Your submission will remain private and will not be visible to other competitors. 80 | 81 | The end-to-end flow is described [here](leaderboard.md) 82 | 83 | ## Final Leaderboard Submission 84 | 85 | When you registered for the competition you would have needed to create a GitHub repo. When the submission deadline is reached, make sure your GitHub repo has a `Dockerfile`; in case the location is ambiguous, please be sure to let us know in your `README.md`. The organizers will take your `Dockerfile`, run it as is, and compute a baseline eval score. The purpose of this step is primarily to filter out broken submissions or submissions that can't outperform the unfinetuned sample submissions. 86 | 87 | The deadline is Oct 25, 2023, with important dates listed [here](https://llm-efficiency-challenge.github.io/dates) 88 | 89 | ## Evaluating the Final Submission 90 | 91 | Once the organizers have identified a shortlist of strong submissions, we will message you directly for another `Dockerfile` that reproduces all of your artifacts. The best submission among this shortlist will win the competition and be invited to present their work at our workshop at NeurIPS. 92 | 93 | ## GPU funding 94 | 95 | [AWS](https://aws.amazon.com/) has graciously agreed to provide $500 in AWS credits to 25 participating teams in the LLM efficiency competition. You will be able to pick and choose from available hardware to experiment before you make your final submission. To be eligible, please make sure to sign up at https://llm-efficiency-challenge.github.io/submission, write a short proposal in your `README.md`, and add [@jisaacso](https://github.com/jisaacso), who will review your proposals, to your repos. 96 | 97 | We'll be prioritizing the first teams with serious proposals. Good luck! 
98 | 99 | There are some other free ways of getting GPUs that people have posted on discord [here](https://discord.com/channels/1124130156336922665/1149283885524463637/1149283885524463637) and you can shop around for both 4090 and A100 on cloud on [https://cloud-gpus.com/](https://cloud-gpus.com/) 100 | -------------------------------------------------------------------------------- /build_run_specs_full.py: -------------------------------------------------------------------------------- 1 | entries = [ 2 | #bigbench 3 | # 1. auto_debugging: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/auto_debugging 4 | {'scenario':'auto_debugging','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=auto_debugging,subtask=", 'priority': 1}, 5 | 6 | # 3. code_line_description: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/code_line_description 7 | {'scenario':'code_line_description','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=code_line_description,subtask=", 'priority': 1}, 8 | 9 | # 4. conceptual_combinations: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/conceptual_combinations 10 | {'scenario':'conceptual_combinations','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=conceptual_combinations,subtask=contradictions", 'priority': 1}, 11 | {'scenario':'conceptual_combinations','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=conceptual_combinations,subtask=emergent_properties", 'priority': 1}, 12 | {'scenario':'conceptual_combinations','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=conceptual_combinations,subtask=fanciful_fictional_combinations", 'priority': 1}, 13 | {'scenario':'conceptual_combinations','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=conceptual_combinations,subtask=homonyms", 'priority': 1}, 14 | {'scenario':'conceptual_combinations','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=conceptual_combinations,subtask=invented_words", 'priority': 1}, 15 | 16 | # 6. emoji_movie: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/emoji_movie 17 | {'scenario':'emoji_movie','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=emoji_movie,subtask=", 'priority': 1}, 18 | 19 | # 7. formal_fallacies_syllogisms_negation: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/formal_fallacies_syllogisms_negation 20 | {'scenario':'formal_fallacies_syllogisms_negation','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=formal_fallacies_syllogisms_negation,subtask=", 'priority': 1}, 21 | 22 | # 8. hindu_knowledge: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/hindu_knowledge 23 | # {'scenario':'hindu_knowledge','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=hindu_knowledge,subtask=", 'priority': 1}, 24 | 25 | # 9. 
known_unknowns: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/known_unknowns 26 | {'scenario':'known_unknowns','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=known_unknowns,subtask=", 'priority': 1}, 27 | 28 | # 11. linguistics_puzzles: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/linguistics_puzzles 29 | {'scenario':'linguistics_puzzles','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=linguistics_puzzles,subtask=", 'priority': 1}, 30 | 31 | # 12. logic_grid_puzzle: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/logic_grid_puzzle 32 | {'scenario':'logic_grid_puzzle','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=logic_grid_puzzle,subtask=", 'priority': 1}, 33 | 34 | # 13. logical_deduction: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/logical_deduction 35 | {'scenario':'logical_deduction','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=logical_deduction,subtask=three_objects", 'priority': 1}, 36 | {'scenario':'logical_deduction','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=logical_deduction,subtask=five_objects", 'priority': 1}, 37 | {'scenario':'logical_deduction','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=logical_deduction,subtask=seven_objects", 'priority': 1}, 38 | 39 | # 14. misconceptions_russian: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/misconceptions_russian 40 | # {'scenario':'misconceptions_russian','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=misconceptions_russian,subtask=", 'priority': 1}, 41 | 42 | # 15. novel_concepts: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/novel_concepts 43 | {'scenario':'novel_concepts','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=novel_concepts,subtask=", 'priority': 1}, 44 | 45 | # 16. operators: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/operators 46 | {'scenario':'operator','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=operators,subtask=", 'priority': 1}, 47 | 48 | # 17. parsinlu_reading_comprehension: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/parsinlu_reading_comprehension 49 | # {'scenario':'parsinlu_reading_comprehension','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=parsinlu_reading_comprehension,subtask=", 'priority': 1}, 50 | 51 | # 18. play_dialog_same_or_different: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/play_dialog_same_or_different 52 | {'scenario':'play_dialog_same_or_different','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=play_dialog_same_or_different,subtask=", 'priority': 1}, 53 | 54 | # 19. repeat_copy_logic: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/repeat_copy_logic 55 | {'scenario':'repeat_copy_logic','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=repeat_copy_logic,subtask=", 'priority': 1}, 56 | 57 | # 20. 
strange_stories: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/strange_stories 58 | {'scenario':'strange_stories','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=strange_stories,subtask=boolean", 'priority': 1}, 59 | {'scenario':'strange_stories','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=strange_stories,subtask=multiple_choice", 'priority': 1}, 60 | 61 | # 21. strategyqa: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/strategyqa 62 | {'scenario':'strategyqa','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=strategyqa,subtask=", 'priority': 1}, 63 | 64 | # 22. symbol_interpretation: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/symbol_interpretation 65 | {'scenario':'symbol_interpretation','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=symbol_interpretation,subtask=adversarial", 'priority': 1}, 66 | {'scenario':'symbol_interpretation','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=symbol_interpretation,subtask=emoji_agnostic", 'priority': 1}, 67 | {'scenario':'symbol_interpretation','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=symbol_interpretation,subtask=name_agnostic", 'priority': 1}, 68 | {'scenario':'symbol_interpretation','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=symbol_interpretation,subtask=plain", 'priority': 1}, 69 | {'scenario':'symbol_interpretation','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=symbol_interpretation,subtask=tricky", 'priority': 1}, 70 | 71 | # 23. vitaminc_fact_verification: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/vitaminc_fact_verification 72 | {'scenario':'vitaminc_fact_verification','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=vitaminc_fact_verification,subtask=", 'priority': 1}, 73 | 74 | # 24. 
winowhy: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/winowhy 75 | {'scenario':'winowhy','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=winowhy,subtask=", 'priority': 1}, 76 | 77 | # MMLU STEM: Medicine/Biology 78 | {'scenario':'medicine_biology','description': "mmlu:model=neurips/local,subject=anatomy,data_augmentation=canonical", 'priority': 2}, 79 | {'scenario':'medicine_biology','description': "mmlu:model=neurips/local,subject=college_medicine,data_augmentation=canonical", 'priority': 2}, 80 | {'scenario':'medicine_biology','description': "mmlu:model=neurips/local,subject=college_biology,data_augmentation=canonical", 'priority': 2}, 81 | {'scenario':'medicine_biology','description': "mmlu:model=neurips/local,subject=high_school_biology,data_augmentation=canonical", 'priority': 2}, 82 | 83 | # MMLU STEM: CS 84 | {'scenario':'computer_science','description': "mmlu:model=neurips/local,subject=college_computer_science,data_augmentation=canonical", 'priority': 2}, 85 | {'scenario':'computer_science','description': "mmlu:model=neurips/local,subject=high_school_computer_science,data_augmentation=canonical", 'priority': 2}, 86 | {'scenario':'computer_science','description': "mmlu:model=neurips/local,subject=computer_security,data_augmentation=canonical", 'priority': 2}, 87 | {'scenario':'computer_science','description': "mmlu:model=neurips/local,subject=electrical_engineering,data_augmentation=canonical", 'priority': 2}, 88 | {'scenario':'computer_science','description': "mmlu:model=neurips/local,subject=machine_learning,data_augmentation=canonical", 'priority': 2}, 89 | 90 | # MMLU STEM: Math 91 | {'scenario':'math','description': "mmlu:model=neurips/local,subject=high_school_mathematics,data_augmentation=canonical", 'priority': 2}, 92 | {'scenario':'math','description': "mmlu:model=neurips/local,subject=college_mathematics,data_augmentation=canonical", 'priority': 2}, 93 | {'scenario':'math','description': "mmlu:model=neurips/local,subject=abstract_algebra,data_augmentation=canonical", 'priority': 2}, 94 | {'scenario':'math','description': "mmlu:model=neurips/local,subject=high_school_statistics,data_augmentation=canonical", 'priority': 2}, 95 | 96 | # MMLU STEM: Chemistry/Physics 97 | {'scenario':'physics_chemistry','description': "mmlu:model=neurips/local,subject=college_chemistry,data_augmentation=canonical", 'priority': 2}, 98 | {'scenario':'physics_chemistry','description': "mmlu:model=neurips/local,subject=high_school_chemistry,data_augmentation=canonical", 'priority': 2}, 99 | {'scenario':'physics_chemistry','description': "mmlu:model=neurips/local,subject=high_school_physics,data_augmentation=canonical", 'priority': 2}, 100 | {'scenario':'physics_chemistry','description': "mmlu:model=neurips/local,subject=college_physics,data_augmentation=canonical", 'priority': 2}, 101 | {'scenario':'physics_chemistry','description': "mmlu:model=neurips/local,subject=astronomy,data_augmentation=canonical", 'priority': 2}, 102 | 103 | # MMLU Humanities: Formal reasoning 104 | {'scenario':'formal_reasoning','description': "mmlu:model=neurips/local,subject=formal_logic,data_augmentation=canonical", 'priority': 2}, 105 | {'scenario':'formal_reasoning','description': "mmlu:model=neurips/local,subject=logical_fallacies,data_augmentation=canonical", 'priority': 2}, 106 | {'scenario':'formal_reasoning','description': "mmlu:model=neurips/local,subject=philosophy,data_augmentation=canonical", 'priority': 2}, 107 | 
{'scenario':'formal_reasoning','description': "mmlu:model=neurips/local,subject=moral_disputes,data_augmentation=canonical", 'priority': 2}, 108 | {'scenario':'formal_reasoning','description': "mmlu:model=neurips/local,subject=moral_scenarios,data_augmentation=canonical", 'priority': 2}, 109 | 110 | # MMLU Humanities: Law 111 | {'scenario':'law','description': "mmlu:model=neurips/local,subject=professional_law,data_augmentation=canonical", 'priority': 2}, 112 | {'scenario':'law','description': "mmlu:model=neurips/local,subject=international_law,data_augmentation=canonical", 'priority': 2}, 113 | {'scenario':'law','description': "mmlu:model=neurips/local,subject=jurisprudence,data_augmentation=canonical", 'priority': 2}, 114 | 115 | # MMLU Humanities: Histroy 116 | {'scenario':'history','description': "mmlu:model=neurips/local,subject=high_school_european_history,data_augmentation=canonical", 'priority': 2}, 117 | {'scenario':'history','description': "mmlu:model=neurips/local,subject=high_school_us_history,data_augmentation=canonical", 'priority': 2}, 118 | {'scenario':'history','description': "mmlu:model=neurips/local,subject=high_school_world_history,data_augmentation=canonical", 'priority': 2}, 119 | {'scenario':'history','description': "mmlu:model=neurips/local,subject=prehistory,data_augmentation=canonical", 'priority': 2}, 120 | {'scenario':'history','description': "mmlu:model=neurips/local,subject=world_religions,data_augmentation=canonical", 'priority': 2}, 121 | 122 | # MMLU Other: Business 123 | {'scenario':'business','description': "mmlu:model=neurips/local,subject=business_ethics,data_augmentation=canonical", 'priority': 2}, 124 | {'scenario':'business','description': "mmlu:model=neurips/local,subject=global_facts,data_augmentation=canonical", 'priority': 2}, 125 | {'scenario':'business','description': "mmlu:model=neurips/local,subject=management,data_augmentation=canonical", 'priority': 2}, 126 | {'scenario':'business','description': "mmlu:model=neurips/local,subject=marketing,data_augmentation=canonical", 'priority': 2}, 127 | {'scenario':'business','description': "mmlu:model=neurips/local,subject=miscellaneous,data_augmentation=canonical", 'priority': 2}, 128 | {'scenario':'business','description': "mmlu:model=neurips/local,subject=professional_accounting,data_augmentation=canonical", 'priority': 2}, 129 | 130 | # MMLU Other: Health 131 | {'scenario':'health','description': "mmlu:model=neurips/local,subject=nutrition,data_augmentation=canonical", 'priority': 2}, 132 | {'scenario':'health','description': "mmlu:model=neurips/local,subject=human_aging,data_augmentation=canonical", 'priority': 2}, 133 | {'scenario':'health','description': "mmlu:model=neurips/local,subject=clinical_knowledge,data_augmentation=canonical", 'priority': 2}, 134 | {'scenario':'health','description': "mmlu:model=neurips/local,subject=medical_genetics,data_augmentation=canonical", 'priority': 2}, 135 | {'scenario':'health','description': "mmlu:model=neurips/local,subject=professional_medicine,data_augmentation=canonical", 'priority': 2}, 136 | {'scenario':'health','description': "mmlu:model=neurips/local,subject=virology,data_augmentation=canonical", 'priority': 2}, 137 | 138 | # MMLU Social Sciences: Social studies 139 | {'scenario':'social_studies','description': "mmlu:model=neurips/local,subject=high_school_government_and_politics,data_augmentation=canonical", 'priority': 2}, 140 | {'scenario':'social_studies','description': 
"mmlu:model=neurips/local,subject=high_school_geography,data_augmentation=canonical", 'priority': 2}, 141 | {'scenario':'social_studies','description': "mmlu:model=neurips/local,subject=us_foreign_policy,data_augmentation=canonical", 'priority': 2}, 142 | {'scenario':'social_studies','description': "mmlu:model=neurips/local,subject=public_relations,data_augmentation=canonical", 'priority': 2}, 143 | {'scenario':'social_studies','description': "mmlu:model=neurips/local,subject=security_studies,data_augmentation=canonical", 'priority': 2}, 144 | 145 | # MMLU Social Sciences: Human behavior 146 | {'scenario':'human_behavior','description': "mmlu:model=neurips/local,subject=high_school_psychology,data_augmentation=canonical", 'priority': 2}, 147 | {'scenario':'human_behavior','description': "mmlu:model=neurips/local,subject=human_sexuality,data_augmentation=canonical", 'priority': 2}, 148 | {'scenario':'human_behavior','description': "mmlu:model=neurips/local,subject=professional_psychology,data_augmentation=canonical", 'priority': 2}, 149 | {'scenario':'human_behavior','description': "mmlu:model=neurips/local,subject=sociology,data_augmentation=canonical", 'priority': 2}, 150 | 151 | # MMLU Social Sciences: Economics 152 | {'scenario':'economics','description': "mmlu:model=neurips/local,subject=high_school_microeconomics,data_augmentation=canonical", 'priority': 2}, 153 | {'scenario':'economics','description': "mmlu:model=neurips/local,subject=econometrics,data_augmentation=canonical", 'priority': 2}, 154 | {'scenario':'economics','description': "mmlu:model=neurips/local,subject=high_school_macroeconomics,data_augmentation=canonical", 'priority': 2}, 155 | 156 | # Truthful QA 157 | {'scenario':'truthful_qa','description': "truthful_qa:task=mc_single,model=neurips/local", 'priority': 1}, 158 | 159 | # CNN/daily mail 160 | {'scenario':'truthful_qa','description': "summarization_cnndm:model=neurips/local", 'priority': 1}, 161 | # GSM 162 | {'scenario':'gsm','description': "gsm:model=neurips/local", 'priority': 1}, 163 | # BBQ 164 | {'scenario':'bbq','description': "bbq:subject=all,model=neurips/local", 'priority': 1}, 165 | 166 | ] 167 | 168 | def generate_equal_sum_list(V, N): 169 | # Calculate the base value that will be repeated. 170 | base_value = V // N 171 | # Calculate the remainder for distribution. 172 | remainder = V % N 173 | 174 | # Create the list with base_value repeated N times. 175 | result = [base_value] * N 176 | 177 | # Distribute the remainder evenly among the elements. 178 | for i in range(remainder): 179 | result[i] += 1 180 | 181 | return result 182 | 183 | import pandas as pd 184 | import argparse 185 | 186 | if __name__ == "__main__": 187 | 188 | import argparse 189 | parser = argparse.ArgumentParser( 190 | description=''' 191 | This method automatically generates a configuration file for the neurips_llm_efficiency_challenge 192 | 193 | Calling it with: `python build_run_specs_full.py --example_budget=600` will produce a conf file 194 | with a total of 600 examples distributed evenly across scenarios as also defined here. 
195 | ''', 196 | ) 197 | parser.add_argument("--example_budget", required=True, type=int, help='# examples to use') 198 | args = parser.parse_args() 199 | 200 | # get a list of scenarios and n_examples 201 | df = pd.DataFrame(entries) 202 | scenario_count_dict = df.value_counts('scenario').to_dict() 203 | n_scenarios = len(df.scenario.unique()) 204 | max_eval_instances_per_scenario = generate_equal_sum_list(args.example_budget, n_scenarios) 205 | 206 | # get a dict of the number of examples per scenario 207 | scenario_n_examples_dict = {} 208 | for scenario, n_subscenarios in scenario_count_dict.items(): 209 | cur_max_eval_instances_per_scenario = max_eval_instances_per_scenario.pop() 210 | scenario_n_examples_dict[scenario] = generate_equal_sum_list(cur_max_eval_instances_per_scenario,n_subscenarios) 211 | 212 | for i in range(len(entries)): 213 | cur_scenario = entries[i]['scenario'] 214 | # print(f"added {v} to {entries[i]['max_eval_instances']}") 215 | v = scenario_n_examples_dict[cur_scenario].pop() 216 | entries[i]['max_eval_instances'] = v 217 | 218 | with open(f'./run_specs_full_coarse_{args.example_budget}_budget.conf','w') as f: 219 | f.write('entries: [\n') 220 | last_scenario = '' 221 | for entry in entries: 222 | cur_scenario = entry['scenario'] 223 | if cur_scenario != last_scenario: 224 | f.write(f'\n# {cur_scenario}\n') 225 | print(entry) 226 | last_scenario = cur_scenario 227 | f.write('{') 228 | f.write(f'description: """{entry["description"]}'.replace('"""','"')) 229 | f.write(f',max_eval_instances={entry["max_eval_instances"]}""",priority: 1'.replace('"""','"')) 230 | f.write('}\n') 231 | f.write(']') 232 | 233 | print(f'Saved ./run_specs_full_coarse_{args.example_budget}_budget.conf') 234 | -------------------------------------------------------------------------------- /helm.md: -------------------------------------------------------------------------------- 1 | # How to test HELM locally 2 | 3 | ## Install the NeurIPS client 4 | 5 | Install HELM: `pip install git+https://github.com/stanford-crfm/helm.git` 6 | 7 | 8 | ## Set up an HTTP server 9 | 10 | Follow the instructions in the [toy-submission](/sample-submissions/lit-gpt/) to set up a simple HTTP server that you can use for local tests 11 | 12 | ## Configure HELM 13 | 14 | You can configure which datasets HELM runs your model on by editing a `run_specs.conf` file. For the preliminary evaluation the organizers will use https://github.com/llm-efficiency-challenge/neurips_llm_efficiency_challenge/blob/master/run_specs_full_coarse_600_budget.conf 15 | 16 | ```bash 17 | helm-run --conf-paths run_specs_full_coarse_600_budget.conf --suite v1 --max-eval-instances 10 18 | helm-summarize --suite v1 19 | ``` 20 | 21 | ## Analyze your results 22 | 23 | You can launch a web server to visually inspect the results of your run. `helm-summarize` can also print the results textually in your terminal, but we've found the web server to be more useful. 24 | 25 | ``` 26 | helm-server 27 | ``` 28 | 29 | This will launch a server on your localhost; if you're working on a remote machine you might need to set up port forwarding.
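For example, if you're evaluating on a remote machine, a local SSH tunnel is usually enough to view the results in your browser. The sketch below assumes `helm-server` serves on port 8000 (check the port it prints on startup) and uses `user@remote-host` as a placeholder for your machine:

```bash
# Forward local port 8000 to the helm-server port on the remote machine,
# then open http://localhost:8000 in your local browser.
ssh -L 8000:localhost:8000 user@remote-host
```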
If everything worked correctly, you should see a page that looks like [this](https://user-images.githubusercontent.com/3282513/249620854-080f4d77-c5fd-4ea4-afa4-cf6a9dceb8c9.png) 30 | -------------------------------------------------------------------------------- /leaderboard.md: -------------------------------------------------------------------------------- 1 | # Leaderboard usage 2 | 3 | 4 | ## How to use the leaderboard 5 | The [Lightning AI](https://lightning.ai/) team has built us a leaderboard on Discord. This is the single best way to make sure your submission actually works before the final submission; as a starting point, try to beat the unfinetuned toy submission. 6 | 7 | You might have noticed a friendly new bot called @evalbot has joined the server. To use it: 8 | 1. DM the bot with `eval 4090` or `eval A100` and attach a zipped file of your submission to the message (You can also just openly message the bot, but DM'ing will protect your secret sauce) 9 | 2. If successful, the bot will give you a job ID and a running status; the eval will take roughly 1-2h, so be patient even if you're at the top of the queue 10 | 3. Once the bot completes your run it will update either the leaderboard_4090 or leaderboard_a100 channel; we will not be monitoring these 2 text channels, they are purely for the bot to post the new updated leaderboard 11 | 12 | ## How to create a zip submission 13 | 14 | We will showcase an example using our actual repo https://github.com/llm-efficiency-challenge/neurips_llm_efficiency_challenge 15 | 1. `git clone --recurse-submodules https://github.com/llm-efficiency-challenge/neurips_llm_efficiency_challenge` to ensure the `lit-gpt` folder is actually in the repo 16 | 2. `rm -rf sample-submissions/llama_recipes`, because the leaderboard will recursively traverse your repo, find the first `Dockerfile`, and assume that's the submission 17 | 3. `zip -r neurips_llm_efficiency_challenge.zip neurips_llm_efficiency_challenge/` 18 | 19 | Once you have that zip, DM the `evalbot` with either `eval 4090` or `eval A100` and the zip file attached to your message. Discord does impose size limits on messages, so make sure your artifacts aren't stored directly in the repo but are instead fetched with `wget` from somewhere else. 20 | 21 | 22 | **Note**: 23 | 1. The way the bot works is it will recursively scan your repo for the first Dockerfile and use only that to eval against. Providing free GPUs is expensive, so if you're up to funny business like opening multiple Discord accounts and/or spamming our bot, we will disqualify you from the competition 24 | 2. You will be allowed a maximum of 3 submissions a day 25 | 3. 
Depending on volume of submissions eval might take a long time while you wait in the queue, the 2 techniques we have of resolving this are either adding more GPUs in our pool or reducing the number of eval instances, we will communicate whenever we make either of 2 decisions on Discord directly 27 | -------------------------------------------------------------------------------- /open_api_spec.json: -------------------------------------------------------------------------------- 1 | {"openapi": "3.0.2", "info": {"title": "FastAPI", "version": "0.1.0"}, "paths": {"/process": {"post": {"summary": "Process Request", "operationId": "process_request_process_post", "requestBody": {"content": {"application/json": {"schema": {"$ref": "#/components/schemas/ProcessRequest"}}}, "required": true}, "responses": {"200": {"description": "Successful Response", "content": {"application/json": {"schema": {}}}}, "422": {"description": "Validation Error", "content": {"application/json": {"schema": {"$ref": "#/components/schemas/HTTPValidationError"}}}}}}}, "/tokenize": {"post": {"summary": "Tokenize", "operationId": "tokenize_tokenize_post", "requestBody": {"content": {"application/json": {"schema": {"$ref": "#/components/schemas/TokenizeRequest"}}}, "required": true}, "responses": {"200": {"description": "Successful Response", "content": {"application/json": {"schema": {}}}}, "422": {"description": "Validation Error", "content": {"application/json": {"schema": {"$ref": "#/components/schemas/HTTPValidationError"}}}}}}}}, "components": {"schemas": {"HTTPValidationError": {"title": "HTTPValidationError", "type": "object", "properties": {"detail": {"title": "Detail", "type": "array", "items": {"$ref": "#/components/schemas/ValidationError"}}}}, "ProcessRequest": {"title": "ProcessRequest", "required": ["prompt"], "type": "object", "properties": {"prompt": {"title": "Prompt", "type": "string"}, "num_samples": {"title": "Num Samples", "type": "integer", "default": 1}, "max_new_tokens": {"title": "Max New Tokens", "type": "integer", "default": 50}, "top_k": {"title": "Top K", "type": "integer", "default": 200}, "temperature": {"title": "Temperature", "type": "number", "default": 0.8}, "seed": {"title": "Seed", "type": "integer"}}}, "TokenizeRequest": {"title": "TokenizeRequest", "required": ["text"], "type": "object", "properties": {"text": {"title": "Text", "type": "string"}, "truncation": {"title": "Truncation", "type": "boolean", "default": true}, "max_length": {"title": "Max Length", "type": "integer", "default": 2048}}}, "ValidationError": {"title": "ValidationError", "required": ["loc", "msg", "type"], "type": "object", "properties": {"loc": {"title": "Location", "type": "array", "items": {"type": "string"}}, "msg": {"title": "Message", "type": "string"}, "type": {"title": "Error Type", "type": "string"}}}}}} -------------------------------------------------------------------------------- /run_specs.conf: -------------------------------------------------------------------------------- 1 | entries: [ 2 | #bigbench 3 | 4 | #analytic_entailment: https://github.com/google/BIG-bench/blob/main/bigbench/benchmark_tasks/analytic_entailment 5 | {description: "big_bench:model=neurips/local,max_train_instances=3,task=analytic_entailment,subtask=", priority: 1} 6 | 7 | #causal_judgment: https://github.com/google/BIG-bench/blob/main/bigbench/benchmark_tasks/causal_judgment 8 | {description: "big_bench:model=neurips/local,max_train_instances=3,task=causal_judgment,subtask=", priority: 1} 9 | 10 | #emoji_movie: 
https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/emoji_movie 11 | {description: "big_bench:model=neurips/local,max_train_instances=3,task=emoji_movie,subtask=", priority: 1} 12 | 13 | #empirical_judgments: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/empirical_judgments 14 | {description: "big_bench:model=neurips/local,max_train_instances=3,task=empirical_judgments,subtask=", priority: 1} 15 | 16 | #known_unknowns: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/known_unknowns 17 | {description: "big_bench:model=neurips/local,max_train_instances=3,task=known_unknowns,subtask=", priority: 1} 18 | 19 | # logical_deduction: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/logical_deduction 20 | {description: "big_bench:model=neurips/local,max_train_instances=3,task=logical_deduction,subtask=three_objects", priority: 1} 21 | 22 | #strange_stories: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/strange_stories 23 | {description: "big_bench:model=neurips/local,max_train_instances=3,task=strange_stories,subtask=multiple_choice", priority: 1} 24 | 25 | #snarks: https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/snarks 26 | {description: "big_bench:model=neurips/local,max_train_instances=3,task=snarks,subtask=", priority: 1} 27 | 28 | #dark_humor_detection: https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/dark_humor_detection 29 | {description: "big_bench:model=neurips/local,max_train_instances=3,task=dark_humor_detection,subtask=", priority: 1} 30 | 31 | 32 | #mmlu 33 | {description: "mmlu:model=neurips/local,subject=philosophy,data_augmentation=canonical", priority: 1} 34 | {description: "mmlu:model=neurips/local,subject=high_school_biology,data_augmentation=canonical", priority: 1} 35 | {description: "mmlu:model=neurips/local,subject=high_school_chemistry,data_augmentation=canonical", priority: 1} 36 | {description: "mmlu:model=neurips/local,subject=high_school_computer_science,data_augmentation=canonical", priority: 1} 37 | {description: "mmlu:model=neurips/local,subject=high_school_european_history,data_augmentation=canonical", priority: 1} 38 | {description: "mmlu:model=neurips/local,subject=high_school_geography,data_augmentation=canonical", priority: 1} 39 | {description: "mmlu:model=neurips/local,subject=high_school_government_and_politics,data_augmentation=canonical", priority: 1} 40 | {description: "mmlu:model=neurips/local,subject=high_school_macroeconomics,data_augmentation=canonical", priority: 1} 41 | {description: "mmlu:model=neurips/local,subject=high_school_mathematics,data_augmentation=canonical", priority: 1} 42 | {description: "mmlu:model=neurips/local,subject=high_school_microeconomics,data_augmentation=canonical", priority: 1} 43 | {description: "mmlu:model=neurips/local,subject=high_school_physics,data_augmentation=canonical", priority: 1} 44 | {description: "mmlu:model=neurips/local,subject=high_school_psychology,data_augmentation=canonical", priority: 1} 45 | {description: "mmlu:model=neurips/local,subject=high_school_statistics,data_augmentation=canonical", priority: 1} 46 | {description: "mmlu:model=neurips/local,subject=high_school_us_history,data_augmentation=canonical", priority: 1} 47 | {description: "mmlu:model=neurips/local,subject=high_school_world_history,data_augmentation=canonical", priority: 1} 48 | {description: "mmlu:model=neurips/local,subject=moral_disputes,data_augmentation=canonical", priority: 1} 
49 | {description: "mmlu:model=neurips/local,subject=moral_scenarios,data_augmentation=canonical", priority: 1} 50 | 51 | 52 | #truthful QA 53 | {description: "truthful_qa:task=mc_single,model=neurips/local", priority: 1}, 54 | 55 | #CNN/daily mail 56 | {description: "summarization_cnndm:model=neurips/local", priority: 1}, 57 | #GSM 58 | {description: "gsm:model=neurips/local", priority: 1} 59 | #BBQ 60 | {description: "bbq:subject=all,model=neurips/local", priority: 1}, 61 | 62 | ] 63 | -------------------------------------------------------------------------------- /run_specs_full_coarse_600_budget.conf: -------------------------------------------------------------------------------- 1 | entries: [ 2 | 3 | # auto_debugging 4 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=auto_debugging,subtask=,max_eval_instances=18",priority: 1} 5 | 6 | # code_line_description 7 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=code_line_description,subtask=,max_eval_instances=19",priority: 1} 8 | 9 | # conceptual_combinations 10 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=conceptual_combinations,subtask=contradictions,max_eval_instances=3",priority: 1} 11 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=conceptual_combinations,subtask=emergent_properties,max_eval_instances=3",priority: 1} 12 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=conceptual_combinations,subtask=fanciful_fictional_combinations,max_eval_instances=4",priority: 1} 13 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=conceptual_combinations,subtask=homonyms,max_eval_instances=4",priority: 1} 14 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=conceptual_combinations,subtask=invented_words,max_eval_instances=4",priority: 1} 15 | 16 | # emoji_movie 17 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=emoji_movie,subtask=,max_eval_instances=19",priority: 1} 18 | 19 | # formal_fallacies_syllogisms_negation 20 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=formal_fallacies_syllogisms_negation,subtask=,max_eval_instances=19",priority: 1} 21 | 22 | # known_unknowns 23 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=known_unknowns,subtask=,max_eval_instances=19",priority: 1} 24 | 25 | # linguistics_puzzles 26 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=linguistics_puzzles,subtask=,max_eval_instances=18",priority: 1} 27 | 28 | # logic_grid_puzzle 29 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=logic_grid_puzzle,subtask=,max_eval_instances=18",priority: 1} 30 | 31 | # logical_deduction 32 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=logical_deduction,subtask=three_objects,max_eval_instances=6",priority: 1} 33 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=logical_deduction,subtask=five_objects,max_eval_instances=6",priority: 1} 34 | {description: 
"big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=logical_deduction,subtask=seven_objects,max_eval_instances=6",priority: 1} 35 | 36 | # novel_concepts 37 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=novel_concepts,subtask=,max_eval_instances=18",priority: 1} 38 | 39 | # operator 40 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=operators,subtask=,max_eval_instances=18",priority: 1} 41 | 42 | # play_dialog_same_or_different 43 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=play_dialog_same_or_different,subtask=,max_eval_instances=18",priority: 1} 44 | 45 | # repeat_copy_logic 46 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=repeat_copy_logic,subtask=,max_eval_instances=18",priority: 1} 47 | 48 | # strange_stories 49 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=strange_stories,subtask=boolean,max_eval_instances=9",priority: 1} 50 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=strange_stories,subtask=multiple_choice,max_eval_instances=9",priority: 1} 51 | 52 | # strategyqa 53 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=strategyqa,subtask=,max_eval_instances=18",priority: 1} 54 | 55 | # symbol_interpretation 56 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=symbol_interpretation,subtask=adversarial,max_eval_instances=3",priority: 1} 57 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=symbol_interpretation,subtask=emoji_agnostic,max_eval_instances=3",priority: 1} 58 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=symbol_interpretation,subtask=name_agnostic,max_eval_instances=4",priority: 1} 59 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=symbol_interpretation,subtask=plain,max_eval_instances=4",priority: 1} 60 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=symbol_interpretation,subtask=tricky,max_eval_instances=4",priority: 1} 61 | 62 | # vitaminc_fact_verification 63 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=vitaminc_fact_verification,subtask=,max_eval_instances=18",priority: 1} 64 | 65 | # winowhy 66 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=winowhy,subtask=,max_eval_instances=19",priority: 1} 67 | 68 | # medicine_biology 69 | {description: "mmlu:model=neurips/local,subject=anatomy,data_augmentation=canonical,max_eval_instances=4",priority: 1} 70 | {description: "mmlu:model=neurips/local,subject=college_medicine,data_augmentation=canonical,max_eval_instances=4",priority: 1} 71 | {description: "mmlu:model=neurips/local,subject=college_biology,data_augmentation=canonical,max_eval_instances=5",priority: 1} 72 | {description: "mmlu:model=neurips/local,subject=high_school_biology,data_augmentation=canonical,max_eval_instances=5",priority: 1} 73 | 74 | # computer_science 75 | {description: "mmlu:model=neurips/local,subject=college_computer_science,data_augmentation=canonical,max_eval_instances=3",priority: 1} 76 | {description: 
"mmlu:model=neurips/local,subject=high_school_computer_science,data_augmentation=canonical,max_eval_instances=3",priority: 1} 77 | {description: "mmlu:model=neurips/local,subject=computer_security,data_augmentation=canonical,max_eval_instances=4",priority: 1} 78 | {description: "mmlu:model=neurips/local,subject=electrical_engineering,data_augmentation=canonical,max_eval_instances=4",priority: 1} 79 | {description: "mmlu:model=neurips/local,subject=machine_learning,data_augmentation=canonical,max_eval_instances=4",priority: 1} 80 | 81 | # math 82 | {description: "mmlu:model=neurips/local,subject=high_school_mathematics,data_augmentation=canonical,max_eval_instances=4",priority: 1} 83 | {description: "mmlu:model=neurips/local,subject=college_mathematics,data_augmentation=canonical,max_eval_instances=4",priority: 1} 84 | {description: "mmlu:model=neurips/local,subject=abstract_algebra,data_augmentation=canonical,max_eval_instances=5",priority: 1} 85 | {description: "mmlu:model=neurips/local,subject=high_school_statistics,data_augmentation=canonical,max_eval_instances=5",priority: 1} 86 | 87 | # physics_chemistry 88 | {description: "mmlu:model=neurips/local,subject=college_chemistry,data_augmentation=canonical,max_eval_instances=3",priority: 1} 89 | {description: "mmlu:model=neurips/local,subject=high_school_chemistry,data_augmentation=canonical,max_eval_instances=3",priority: 1} 90 | {description: "mmlu:model=neurips/local,subject=high_school_physics,data_augmentation=canonical,max_eval_instances=4",priority: 1} 91 | {description: "mmlu:model=neurips/local,subject=college_physics,data_augmentation=canonical,max_eval_instances=4",priority: 1} 92 | {description: "mmlu:model=neurips/local,subject=astronomy,data_augmentation=canonical,max_eval_instances=4",priority: 1} 93 | 94 | # formal_reasoning 95 | {description: "mmlu:model=neurips/local,subject=formal_logic,data_augmentation=canonical,max_eval_instances=3",priority: 1} 96 | {description: "mmlu:model=neurips/local,subject=logical_fallacies,data_augmentation=canonical,max_eval_instances=3",priority: 1} 97 | {description: "mmlu:model=neurips/local,subject=philosophy,data_augmentation=canonical,max_eval_instances=4",priority: 1} 98 | {description: "mmlu:model=neurips/local,subject=moral_disputes,data_augmentation=canonical,max_eval_instances=4",priority: 1} 99 | {description: "mmlu:model=neurips/local,subject=moral_scenarios,data_augmentation=canonical,max_eval_instances=4",priority: 1} 100 | 101 | # law 102 | {description: "mmlu:model=neurips/local,subject=professional_law,data_augmentation=canonical,max_eval_instances=6",priority: 1} 103 | {description: "mmlu:model=neurips/local,subject=international_law,data_augmentation=canonical,max_eval_instances=6",priority: 1} 104 | {description: "mmlu:model=neurips/local,subject=jurisprudence,data_augmentation=canonical,max_eval_instances=6",priority: 1} 105 | 106 | # history 107 | {description: "mmlu:model=neurips/local,subject=high_school_european_history,data_augmentation=canonical,max_eval_instances=3",priority: 1} 108 | {description: "mmlu:model=neurips/local,subject=high_school_us_history,data_augmentation=canonical,max_eval_instances=3",priority: 1} 109 | {description: "mmlu:model=neurips/local,subject=high_school_world_history,data_augmentation=canonical,max_eval_instances=4",priority: 1} 110 | {description: "mmlu:model=neurips/local,subject=prehistory,data_augmentation=canonical,max_eval_instances=4",priority: 1} 111 | {description: 
"mmlu:model=neurips/local,subject=world_religions,data_augmentation=canonical,max_eval_instances=4",priority: 1} 112 | 113 | # business 114 | {description: "mmlu:model=neurips/local,subject=business_ethics,data_augmentation=canonical,max_eval_instances=3",priority: 1} 115 | {description: "mmlu:model=neurips/local,subject=global_facts,data_augmentation=canonical,max_eval_instances=3",priority: 1} 116 | {description: "mmlu:model=neurips/local,subject=management,data_augmentation=canonical,max_eval_instances=3",priority: 1} 117 | {description: "mmlu:model=neurips/local,subject=marketing,data_augmentation=canonical,max_eval_instances=3",priority: 1} 118 | {description: "mmlu:model=neurips/local,subject=miscellaneous,data_augmentation=canonical,max_eval_instances=3",priority: 1} 119 | {description: "mmlu:model=neurips/local,subject=professional_accounting,data_augmentation=canonical,max_eval_instances=3",priority: 1} 120 | 121 | # health 122 | {description: "mmlu:model=neurips/local,subject=nutrition,data_augmentation=canonical,max_eval_instances=3",priority: 1} 123 | {description: "mmlu:model=neurips/local,subject=human_aging,data_augmentation=canonical,max_eval_instances=3",priority: 1} 124 | {description: "mmlu:model=neurips/local,subject=clinical_knowledge,data_augmentation=canonical,max_eval_instances=3",priority: 1} 125 | {description: "mmlu:model=neurips/local,subject=medical_genetics,data_augmentation=canonical,max_eval_instances=3",priority: 1} 126 | {description: "mmlu:model=neurips/local,subject=professional_medicine,data_augmentation=canonical,max_eval_instances=3",priority: 1} 127 | {description: "mmlu:model=neurips/local,subject=virology,data_augmentation=canonical,max_eval_instances=3",priority: 1} 128 | 129 | # social_studies 130 | {description: "mmlu:model=neurips/local,subject=high_school_government_and_politics,data_augmentation=canonical,max_eval_instances=3",priority: 1} 131 | {description: "mmlu:model=neurips/local,subject=high_school_geography,data_augmentation=canonical,max_eval_instances=3",priority: 1} 132 | {description: "mmlu:model=neurips/local,subject=us_foreign_policy,data_augmentation=canonical,max_eval_instances=4",priority: 1} 133 | {description: "mmlu:model=neurips/local,subject=public_relations,data_augmentation=canonical,max_eval_instances=4",priority: 1} 134 | {description: "mmlu:model=neurips/local,subject=security_studies,data_augmentation=canonical,max_eval_instances=4",priority: 1} 135 | 136 | # human_behavior 137 | {description: "mmlu:model=neurips/local,subject=high_school_psychology,data_augmentation=canonical,max_eval_instances=4",priority: 1} 138 | {description: "mmlu:model=neurips/local,subject=human_sexuality,data_augmentation=canonical,max_eval_instances=4",priority: 1} 139 | {description: "mmlu:model=neurips/local,subject=professional_psychology,data_augmentation=canonical,max_eval_instances=5",priority: 1} 140 | {description: "mmlu:model=neurips/local,subject=sociology,data_augmentation=canonical,max_eval_instances=5",priority: 1} 141 | 142 | # economics 143 | {description: "mmlu:model=neurips/local,subject=high_school_microeconomics,data_augmentation=canonical,max_eval_instances=6",priority: 1} 144 | {description: "mmlu:model=neurips/local,subject=econometrics,data_augmentation=canonical,max_eval_instances=6",priority: 1} 145 | {description: "mmlu:model=neurips/local,subject=high_school_macroeconomics,data_augmentation=canonical,max_eval_instances=6",priority: 1} 146 | 147 | # truthful_qa 148 | {description: 
"truthful_qa:task=mc_single,model=neurips/local,max_eval_instances=9",priority: 1} 149 | {description: "summarization_cnndm:model=neurips/local,max_eval_instances=9",priority: 1} 150 | 151 | # gsm 152 | {description: "gsm:model=neurips/local,max_eval_instances=19",priority: 1} 153 | 154 | # bbq 155 | {description: "bbq:subject=all,model=neurips/local,max_eval_instances=18",priority: 1} 156 | ] -------------------------------------------------------------------------------- /sample-submissions/lit-gpt/Dockerfile: -------------------------------------------------------------------------------- 1 | # Use latest official release with CUDA support https://hub.docker.com/r/pytorch/pytorch/tags 2 | FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel 3 | 4 | # Set the working directory in the container to /submission 5 | WORKDIR /submission 6 | 7 | # Copy the specific file into the container at /submission 8 | COPY /lit-gpt/ /submission/ 9 | 10 | # Setup server requriements 11 | COPY ./fast_api_requirements.txt fast_api_requirements.txt 12 | RUN pip install --no-cache-dir --upgrade -r fast_api_requirements.txt 13 | 14 | RUN apt-get update && apt-get install -y git 15 | # Install any needed packages specified in requirements.txt that come from lit-gpt plus some optionals 16 | RUN pip install -r requirements.txt huggingface_hub sentencepiece tokenizers bitsandbytes scipy 17 | 18 | # some huggingface_hub versions require that the target dir exists 19 | RUN mkdir -p checkpoints/openlm-research/open_llama_3b 20 | # get open-llama weights: https://github.com/Lightning-AI/lit-gpt/blob/main/tutorials/download_openllama.md 21 | RUN python scripts/download.py --repo_id openlm-research/open_llama_3b 22 | RUN python scripts/convert_hf_checkpoint.py --checkpoint_dir checkpoints/openlm-research/open_llama_3b 23 | 24 | # Copy over single file server 25 | COPY ./main.py /submission/main.py 26 | COPY ./helper.py /submission/helper.py 27 | COPY ./api.py /submission/api.py 28 | # Run the server 29 | CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "80"] 30 | -------------------------------------------------------------------------------- /sample-submissions/lit-gpt/README.md: -------------------------------------------------------------------------------- 1 | # Sample Submission 2 | This sample-submission contains a dockerfile that exposes a HTTP server. Requests will be made against this server during the evaluation phase of the competition 3 | 4 | ### Getting Started 5 | Make sure you have recursively cloned the top this repository in order to get lit-gpt. 6 | 7 | ❗ Make sure the repo is cloned with git submodule support either: 8 | 9 | ```sh 10 | git clone --recurse-submodules ... 11 | ``` 12 | 13 | or if you cloned the repo but are missing the `lit-gpt` folder 14 | 15 | ```sh 16 | git submodule update --init --recursive 17 | ``` 18 | 19 | ### Structure 20 | * lit-gpt/ 21 | * unmodified submodule that contains a hackable `torch.nn.Module` GPT definition as well as optional fine-tuning 22 | and inference code. 23 | * main.py 24 | * The process/ and tokenize/ endpoints are defined here 25 | * helper.py 26 | * Applies logic on top of lit-gpt's generate in order to produce responses in accordance with the spec. 
27 | * api.py 28 | * Defines the pydantic classes for the FastAPI server 29 | * Dockerfile 30 | * Definition of the image that will set up the server used for submissions 31 | 32 | ### Make your GPUs visible to Docker 33 | Follow this guide to install [nvidia-ctk](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html). 34 | ```sh 35 | nvidia-ctk runtime configure 36 | systemctl restart docker 37 | ``` 38 | 39 | ### Build and run 40 | ```sh 41 | docker build -t sample_submission . 42 | docker run --gpus all -p 8080:80 sample_submission 43 | ``` 44 | ### Send requests 45 | ```sh 46 | curl -X POST -H "Content-Type: application/json" -d '{"prompt": "The capital of France is "}' http://localhost:8080/process 47 | ``` 48 | -------------------------------------------------------------------------------- /sample-submissions/lit-gpt/api.py: -------------------------------------------------------------------------------- 1 | from pydantic import BaseModel 2 | 3 | from typing import List, Dict, Optional 4 | 5 | 6 | class ProcessRequest(BaseModel): 7 | prompt: str 8 | num_samples: int = 1 9 | max_new_tokens: int = 50 10 | top_k: int = 200 11 | temperature: float = 0.8 12 | seed: Optional[int] = None 13 | echo_prompt: Optional[bool] 14 | 15 | 16 | class Token(BaseModel): 17 | text: str 18 | logprob: float 19 | top_logprob: Dict[str, float] 20 | 21 | 22 | class ProcessResponse(BaseModel): 23 | text: str 24 | tokens: List[Token] 25 | logprob: float 26 | request_time: float 27 | 28 | 29 | class TokenizeRequest(BaseModel): 30 | text: str 31 | truncation: bool = True 32 | max_length: int = 2048 33 | 34 | 35 | class TokenizeResponse(BaseModel): 36 | tokens: List[int] 37 | request_time: float 38 | 39 | 40 | class DecodeRequest(BaseModel): 41 | tokens: List[int] 42 | 43 | 44 | class DecodeResponse(BaseModel): 45 | text: str 46 | request_time: float 47 | -------------------------------------------------------------------------------- /sample-submissions/lit-gpt/fast_api_requirements.txt: -------------------------------------------------------------------------------- 1 | # FAST API 2 | fastapi>=0.68.0,<0.69.0 3 | pydantic>=1.8.0,<2.0.0 4 | uvicorn>=0.15.0,<0.16.0 5 | -------------------------------------------------------------------------------- /sample-submissions/lit-gpt/helper.py: -------------------------------------------------------------------------------- 1 | from typing import List, Optional, Tuple 2 | 3 | import torch 4 | 5 | 6 | @torch.no_grad() 7 | def toysubmission_generate( 8 | model: torch.nn.Module, 9 | idx: torch.Tensor, 10 | max_returned_tokens: int, 11 | *, 12 | temperature: float = 1.0, 13 | top_k: Optional[int] = None, 14 | eos_id: Optional[int] = None, 15 | ) -> Tuple[List[int], List[float], List[Tuple[int, float]]]: 16 | """Takes a conditioning sequence (prompt) as input and continues to generate as many tokens as requested. 17 | 18 | The implementation of this function is modified from A. Karpathy's nanoGPT. 19 | 20 | Args: 21 | model: The model to use. 22 | idx: Tensor of shape (T) with indices of the prompt sequence. 23 | max_returned_tokens: The maximum number of tokens to return (given plus generated). 24 | temperature: Scales the predicted logits by 1 / temperature. 25 | top_k: If specified, only sample among the tokens with the k highest probabilities. 26 | eos_id: If specified, stop generating further tokens once this token is produced.
27 | 28 | Returns: 29 | Tuple containing the list of token indexes, the log probabilities of the selected tokens, and the 30 | (index, log probability) of the top token at each position. 31 | """ 32 | T = idx.size(0) 33 | assert max_returned_tokens > T 34 | if model.max_seq_length < max_returned_tokens - 1: 35 | # rolling the kv cache based on the `input_pos` value would be necessary. However, doing so would introduce a 36 | # data dependency on the `input_pos` tensor and impact model compilation. Since this setting is uncommon, we do 37 | # not support it to avoid negatively impacting the overall speed 38 | raise NotImplementedError( 39 | f"max_seq_length {model.max_seq_length} needs to be >= {max_returned_tokens - 1}" 40 | ) 41 | 42 | device, dtype = idx.device, idx.dtype 43 | # create an empty tensor of the expected final shape and fill in the current tokens 44 | empty = torch.empty(max_returned_tokens, dtype=dtype, device=device) 45 | # prefill empty with the prompt token indexes 46 | empty[:T] = idx 47 | idx = empty 48 | input_pos = torch.arange(0, T, device=device) 49 | 50 | top_logprob = [] 51 | logprob = [] 52 | 53 | # Generate log_prob and top_log_prob for the prompt 54 | logits = model(idx[:T].view(1, -1), input_pos) 55 | probs = torch.nn.functional.softmax(logits, dim=-1)[0] 56 | prompt_log_probs = torch.log(probs) 57 | prompt_max_probs, prompt_argmax_probs = torch.max(probs, dim=-1) 58 | # Grab the logprob for all the tokens in the prompt 59 | logprob.extend( 60 | prompt_log_probs.gather(-1, idx[:T, None].to(torch.int64)).squeeze(-1).tolist() 61 | ) 62 | top_logprob.extend( 63 | [ 64 | (argmax.item(), max_prob.item()) 65 | for argmax, max_prob in zip(prompt_argmax_probs, prompt_max_probs) 66 | ] 67 | ) 68 | 69 | # generate up to a fixed number of tokens 70 | for _ in range(max_returned_tokens - T): 71 | x = idx.index_select(0, input_pos).view(1, -1) 72 | 73 | # forward 74 | logits = model(x, input_pos) 75 | logits = logits[0, -1] / temperature 76 | 77 | # optionally crop the logits to only the top k options 78 | if top_k is not None: 79 | v, _ = torch.topk(logits, min(top_k, logits.size(-1))) 80 | logits = torch.where(logits < v[[-1]], -float("Inf"), logits) 81 | 82 | probs = torch.nn.functional.softmax(logits, dim=-1) 83 | 84 | idx_next = torch.multinomial(probs, num_samples=1).to(dtype=dtype) 85 | 86 | # append the logprob of the selected token 87 | logprob.append(torch.log(probs[idx_next]).item()) 88 | 89 | # append the idx and logprob of the top token 90 | top_logprob.append((torch.argmax(probs).item(), torch.log(probs).max().item())) 91 | 92 | # advance 93 | input_pos = input_pos[-1:] + 1 94 | 95 | # concatenate the new generation 96 | idx = idx.index_copy(0, input_pos, idx_next) 97 | 98 | # if the eos token is generated, return the output (stop generation) 99 | if idx_next == eos_id: 100 | return idx[:input_pos], logprob, top_logprob # include the EOS token 101 | 102 | return idx, logprob, top_logprob 103 | -------------------------------------------------------------------------------- /sample-submissions/lit-gpt/main.py: -------------------------------------------------------------------------------- 1 | from fastapi import FastAPI 2 | 3 | import logging 4 | 5 | # Lit-GPT imports 6 | import sys 7 | import time 8 | from pathlib import Path 9 | import json 10 | 11 | # support running without installing as a package 12 | wd = Path(__file__).parent.parent.resolve() 13 | sys.path.append(str(wd)) 14 | 15 | import lightning as L 16 | import torch 17 | 18 | torch.set_float32_matmul_precision("high") 19 | 20 | from lit_gpt
import GPT, Tokenizer, Config 21 | from lit_gpt.utils import lazy_load, quantization 22 | 23 | # Toy submission imports 24 | from helper import toysubmission_generate 25 | from api import ( 26 | ProcessRequest, 27 | ProcessResponse, 28 | TokenizeRequest, 29 | TokenizeResponse, 30 | Token, 31 | DecodeRequest, 32 | DecodeResponse 33 | ) 34 | 35 | app = FastAPI() 36 | 37 | logger = logging.getLogger(__name__) 38 | # Configure the logging module 39 | logging.basicConfig(level=logging.INFO) 40 | 41 | quantize = "bnb.nf4-dq" # 4-bit NormalFloat with Double-Quantization (see QLoRA paper) 42 | checkpoint_dir = Path("checkpoints/openlm-research/open_llama_3b") 43 | precision = "bf16-true" # weights and data in bfloat16 precision 44 | 45 | fabric = L.Fabric(devices=1, accelerator="cuda", precision=precision) 46 | 47 | with open(checkpoint_dir / "lit_config.json") as fp: 48 | config = Config(**json.load(fp)) 49 | 50 | checkpoint_path = checkpoint_dir / "lit_model.pth" 51 | logger.info(f"Loading model {str(checkpoint_path)!r} with {config.__dict__}") 52 | with fabric.init_module(empty_init=True), quantization(quantize): 53 | model = GPT(config) 54 | 55 | with lazy_load(checkpoint_path) as checkpoint: 56 | model.load_state_dict(checkpoint, strict=quantize is None) 57 | 58 | model.eval() 59 | model = fabric.setup(model) 60 | 61 | tokenizer = Tokenizer(checkpoint_dir) 62 | 63 | 64 | @app.post("/process") 65 | async def process_request(input_data: ProcessRequest) -> ProcessResponse: 66 | if input_data.seed is not None: 67 | L.seed_everything(input_data.seed) 68 | logger.info("Using device: {}".format(fabric.device)) 69 | encoded = tokenizer.encode( 70 | input_data.prompt, bos=True, eos=False, device=fabric.device 71 | ) 72 | prompt_length = encoded.size(0) 73 | max_returned_tokens = prompt_length + input_data.max_new_tokens 74 | 75 | with fabric.init_tensor(): 76 | # set the max_seq_length to limit the memory usage to what we need 77 | model.max_seq_length = max_returned_tokens 78 | # enable the kv cache 79 | model.set_kv_cache(batch_size=1) 80 | 81 | 82 | t0 = time.perf_counter() 83 | tokens, logprobs, top_logprobs = toysubmission_generate( 84 | model, 85 | encoded, 86 | max_returned_tokens, 87 | temperature=input_data.temperature, 88 | top_k=input_data.top_k, 89 | ) 90 | 91 | t = time.perf_counter() - t0 92 | 93 | if input_data.echo_prompt is False: 94 | output = tokenizer.decode(tokens[prompt_length:]) 95 | tokens = tokens[prompt_length:] 96 | logprobs = logprobs[prompt_length:] 97 | top_logprobs = top_logprobs[prompt_length:] 98 | else: 99 | output = tokenizer.decode(tokens) 100 | tokens_generated = tokens.size(0) - prompt_length 101 | logger.info( 102 | f"Time for inference: {t:.02f} sec total, {tokens_generated / t:.02f} tokens/sec" 103 | ) 104 | 105 | logger.info(f"Memory used: {torch.cuda.max_memory_reserved() / 1e9:.02f} GB") 106 | generated_tokens = [] 107 | for t, lp, tlp in zip(tokens, logprobs, top_logprobs): 108 | idx, val = tlp 109 | tok_str = tokenizer.processor.decode([idx]) 110 | token_tlp = {tok_str: val} 111 | generated_tokens.append( 112 | Token(text=tokenizer.decode(t), logprob=lp, top_logprob=token_tlp) 113 | ) 114 | logprobs_sum = sum(logprobs) 115 | # Process the input data here 116 | return ProcessResponse( 117 | text=output, tokens=generated_tokens, logprob=logprobs_sum, request_time=t 118 | ) 119 | 120 | 121 | @app.post("/tokenize") 122 | async def tokenize(input_data: TokenizeRequest) -> TokenizeResponse: 123 | logger.info("Using device: {}".format(fabric.device)) 124 | t0 = 
time.perf_counter() 125 | encoded = tokenizer.encode( 126 | input_data.text, bos=True, eos=False, device=fabric.device 127 | ) 128 | t = time.perf_counter() - t0 129 | tokens = encoded.tolist() 130 | return TokenizeResponse(tokens=tokens, request_time=t) 131 | 132 | 133 | @app.post("/decode") 134 | async def decode(input_data: DecodeRequest) -> DecodeResponse: 135 | logger.info("Using device: {}".format(fabric.device)) 136 | t0 = time.perf_counter() 137 | # decoded = tokenizer.decode(torch.Tensor(input_data.tokens)) 138 | decoded = tokenizer.processor.decode(input_data.tokens) 139 | t = time.perf_counter() - t0 140 | return DecodeResponse(text=decoded, request_time=t) -------------------------------------------------------------------------------- /sample-submissions/llama_recipes/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel 2 | 3 | RUN apt-get update && apt-get install -y git python3-virtualenv wget 4 | 5 | RUN pip install -U --no-cache-dir git+https://github.com/facebookresearch/llama-recipes.git@eafea7b366bde9dc3f0b66a4cb0a8788f560c793 6 | 7 | WORKDIR /workspace 8 | # Setup server requirements 9 | COPY ./fast_api_requirements.txt fast_api_requirements.txt 10 | RUN pip install --no-cache-dir --upgrade -r fast_api_requirements.txt 11 | 12 | ENV HUGGINGFACE_TOKEN="YOUR_TOKEN" 13 | ENV HUGGINGFACE_REPO="YOUR_USERNAME/YOUR_REPO" 14 | 15 | # Copy over single file server 16 | COPY ./main.py main.py 17 | COPY ./api.py api.py 18 | # Run the server 19 | CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "80"] 20 | -------------------------------------------------------------------------------- /sample-submissions/llama_recipes/Dockerfile.train: -------------------------------------------------------------------------------- 1 | FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel 2 | 3 | RUN apt-get update && apt-get install -y git python3-virtualenv wget 4 | 5 | RUN pip install -U --no-cache-dir git+https://github.com/facebookresearch/llama-recipes.git@eafea7b366bde9dc3f0b66a4cb0a8788f560c793 6 | 7 | WORKDIR /workspace 8 | 9 | RUN wget https://gist.githubusercontent.com/mreso/ec65015cbfbd395f0c2adc17147adf1f/raw/41070f1058820b9e89bae885968cc666a7d6aa59/custom_dataset.py 10 | 11 | ENV HUGGINGFACE_TOKEN="YOUR_TOKEN" 12 | ENV HUGGINGFACE_REPO="YOUR_USERNAME/YOUR_REPO" 13 | 14 | COPY train.py ./ 15 | 16 | CMD [ "python", "train.py"] 17 | -------------------------------------------------------------------------------- /sample-submissions/llama_recipes/README.md: -------------------------------------------------------------------------------- 1 | # Llama-recipes Example 2 | This example demonstrates how to fine-tune and serve a Llama 2 model with llama-recipes for submission in the LLM efficiency challenge, using the [lit-gpt](../lit-gpt/) example as a template. 3 | Llama-recipes provides an easy way to fine-tune a Llama 2 model with custom datasets using efficient techniques like LoRA or Llama-adapters. 4 | 5 | # Getting Started 6 | In order to use llama-recipes we need to install the following pip package: 7 | 8 | ``` 9 | pip install llama-recipes 10 | ``` 11 | 12 | To obtain access to the model weights you need to fill out this [form](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) to accept the license terms and acceptable use policy. 13 | 14 | After access has been granted, you need to acknowledge this in your HuggingFace account for the model you want to fine-tune.
In this example we will continue with the 7B parameter version available under this identifier: meta-llama/Llama-2-7b-hf 15 | 16 | **NOTE** In this example the training result will be uploaded and downloaded through huggingface_hub. The authentication will be done through a token created in the settings of your HuggingFace account. 17 | Make sure to give write access to the token and set the env variables in the Dockerfiles to your token and repo: 18 | 19 | ```bash 20 | ENV HUGGINGFACE_TOKEN="YOUR_TOKEN" 21 | ENV HUGGINGFACE_REPO="YOUR_USERNAME/YOUR_REPO" 22 | ``` 23 | 24 | # Fine-tune The Model 25 | With llama-recipes it's possible to fine-tune Llama on custom data with a single command. To fine-tune on a custom dataset we need to implement a function (get_custom_dataset) that provides the custom dataset, following this example: [custom_dataset.py](https://github.com/facebookresearch/llama-recipes/blob/main/examples/custom_dataset.py). 26 | We can then train on this dataset using this command line: 27 | 28 | ```bash 29 | python3 -m llama_recipes.finetuning --use_peft --peft_method lora --quantization --model_name meta-llama/Llama-2-7b --dataset custom_dataset --custom_dataset.file /workspace/custom_dataset.py --output_dir /volume/output_dir 30 | ``` 31 | 32 | **Note** The custom dataset in this example is dialog-based. This is only due to the nature of the example, not a necessity of the custom dataset functionality. To see other examples of get_custom_dataset functions (by the way, the name of the function get_custom_dataset can be changed on the command line using this syntax: /workspace/custom_dataset.py:get_foo_dataset) have a look at the [built-in datasets in llama-recipes](https://github.com/facebookresearch/llama-recipes/blob/main/src/llama_recipes/datasets/__init__.py). 33 | 34 | # Create Submission 35 | *Note* For a submission to the competition only the inference part (Dockerfile) will be necessary. A training Docker (Dockerfile.train) will only be necessary to replicate the submission in case you're within the top 3 contestants. 36 | 37 | ## Prepare Leaderboard Submission 38 | The inference Docker will download base and LoRA weights from huggingface_hub. For the submission it is assumed that the trained weights are uploaded to a repo on huggingface_hub and the env variables HUGGINGFACE_TOKEN and HUGGINGFACE_REPO have been updated accordingly in the [Dockerfile](./Dockerfile). 39 | 40 | To create the zip file for submission to the eval bot use the following commands: 41 | ```bash 42 | cd neurips_llm_efficiency_challenge/sample-submissions 43 | rm llama_recipes/Dockerfile.train 44 | zip -r llama_recipes.zip llama_recipes 45 | ``` 46 | *Note* 1. Make sure to only zip the folder llama_recipes and do not include any other sample submission in the zipfile. 2. We delete llama_recipes/Dockerfile.train as a precaution to avoid errors if the submission logic changes. 47 | 48 | ## Run Training And Inference Docker Locally 49 | To locally build and run the training Docker we need to execute: 50 | 51 | ```bash 52 | docker build -f ./Dockerfile.train -t llama_recipes_train . 53 | 54 | docker run --gpus "device=0" --rm -ti llama_recipes_train 55 | ``` 56 | 57 | The inference Docker can be created and started locally with: 58 | 59 | ```bash 60 | docker build -f ./Dockerfile -t llama_recipes_inference .
61 | 62 | docker run --gpus "device=0" -p 8080:80 --rm -ti llama_recipes_inference 63 | ``` 64 | 65 | To test the inference docker we can run this query: 66 | 67 | ```bash 68 | curl -X POST -H "Content-Type: application/json" -d '{"text": "What is the capital of france? "}' http://localhost:8080/tokenize 69 | OR 70 | curl -X POST -H "Content-Type: application/json" -d '{"prompt": "What is the capital of france? "}' http://localhost:8080/process 71 | ``` 72 | -------------------------------------------------------------------------------- /sample-submissions/llama_recipes/api.py: -------------------------------------------------------------------------------- 1 | from pydantic import BaseModel 2 | 3 | from typing import List, Dict, Optional 4 | 5 | 6 | class ProcessRequest(BaseModel): 7 | prompt: str 8 | num_samples: int = 1 9 | max_new_tokens: int = 50 10 | top_k: int = 200 11 | temperature: float = 0.8 12 | seed: Optional[int] = None 13 | echo_prompt: Optional[bool] 14 | 15 | 16 | class Token(BaseModel): 17 | text: str 18 | logprob: float 19 | top_logprob: Dict[str, float] 20 | 21 | 22 | class ProcessResponse(BaseModel): 23 | text: str 24 | tokens: List[Token] 25 | logprob: float 26 | request_time: float 27 | 28 | 29 | class TokenizeRequest(BaseModel): 30 | text: str 31 | truncation: bool = True 32 | max_length: int = 2048 33 | 34 | 35 | class TokenizeResponse(BaseModel): 36 | tokens: List[int] 37 | request_time: float 38 | -------------------------------------------------------------------------------- /sample-submissions/llama_recipes/fast_api_requirements.txt: -------------------------------------------------------------------------------- 1 | # FAST API 2 | fastapi>=0.68.0,<0.69.0 3 | pydantic>=1.8.0,<2.0.0 4 | uvicorn>=0.15.0,<0.16.0 5 | -------------------------------------------------------------------------------- /sample-submissions/llama_recipes/main.py: -------------------------------------------------------------------------------- 1 | from fastapi import FastAPI 2 | 3 | import logging 4 | import os 5 | import time 6 | 7 | import torch 8 | from huggingface_hub import login 9 | from transformers import LlamaTokenizer, LlamaForCausalLM 10 | from llama_recipes.inference.model_utils import load_peft_model 11 | 12 | torch.set_float32_matmul_precision("high") 13 | 14 | from api import ( 15 | ProcessRequest, 16 | ProcessResponse, 17 | TokenizeRequest, 18 | TokenizeResponse, 19 | Token, 20 | ) 21 | 22 | app = FastAPI() 23 | 24 | logger = logging.getLogger(__name__) 25 | # Configure the logging module 26 | logging.basicConfig(level=logging.INFO) 27 | 28 | login(token=os.environ["HUGGINGFACE_TOKEN"]) 29 | 30 | model = LlamaForCausalLM.from_pretrained( 31 | 'meta-llama/Llama-2-7b-hf', 32 | return_dict=True, 33 | torch_dtype=torch.float16, 34 | device_map="cuda" 35 | ) 36 | model = load_peft_model(model, os.environ["HUGGINGFACE_REPO"]) 37 | 38 | model.eval() 39 | 40 | tokenizer = LlamaTokenizer.from_pretrained('meta-llama/Llama-2-7b') 41 | 42 | LLAMA2_CONTEXT_LENGTH = 4096 43 | 44 | 45 | @app.post("/process") 46 | async def process_request(input_data: ProcessRequest) -> ProcessResponse: 47 | if input_data.seed is not None: 48 | torch.manual_seed(input_data.seed) 49 | 50 | encoded = tokenizer(input_data.prompt, return_tensors="pt") 51 | 52 | prompt_length = encoded["input_ids"][0].size(0) 53 | max_returned_tokens = prompt_length + input_data.max_new_tokens 54 | assert max_returned_tokens <= LLAMA2_CONTEXT_LENGTH, ( 55 | max_returned_tokens, 56 | LLAMA2_CONTEXT_LENGTH, 57 | ) 58 | 59 | t0 = 
time.perf_counter() 60 | encoded = {k: v.to("cuda") for k, v in encoded.items()} 61 | with torch.no_grad(): 62 | outputs = model.generate( 63 | **encoded, 64 | max_new_tokens=input_data.max_new_tokens, 65 | do_sample=True, 66 | temperature=input_data.temperature, 67 | top_k=input_data.top_k, 68 | return_dict_in_generate=True, 69 | output_scores=True, 70 | ) 71 | 72 | t = time.perf_counter() - t0 73 | if not input_data.echo_prompt: 74 | output = tokenizer.decode(outputs.sequences[0][prompt_length:], skip_special_tokens=True) 75 | else: 76 | output = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True) 77 | 78 | tokens_generated = outputs.sequences[0].size(0) - prompt_length 79 | logger.info( 80 | f"Time for inference: {t:.02f} sec total, {tokens_generated / t:.02f} tokens/sec" 81 | ) 82 | 83 | logger.info(f"Memory used: {torch.cuda.max_memory_reserved() / 1e9:.02f} GB") 84 | generated_tokens = [] 85 | 86 | log_probs = torch.log(torch.stack(outputs.scores, dim=1).softmax(-1)) 87 | 88 | gen_sequences = outputs.sequences[:, encoded["input_ids"].shape[-1]:] 89 | gen_logprobs = torch.gather(log_probs, 2, gen_sequences[:, :, None]).squeeze(-1) 90 | 91 | top_indices = torch.argmax(log_probs, dim=-1) 92 | top_logprobs = torch.gather(log_probs, 2, top_indices[:,:,None]).squeeze(-1) 93 | top_indices = top_indices.tolist()[0] 94 | top_logprobs = top_logprobs.tolist()[0] 95 | 96 | for t, lp, tlp in zip(gen_sequences.tolist()[0], gen_logprobs.tolist()[0], zip(top_indices, top_logprobs)): 97 | idx, val = tlp 98 | tok_str = tokenizer.decode(idx) 99 | token_tlp = {tok_str: val} 100 | generated_tokens.append( 101 | Token(text=tokenizer.decode(t), logprob=lp, top_logprob=token_tlp) 102 | ) 103 | logprob_sum = gen_logprobs.sum().item() 104 | 105 | return ProcessResponse( 106 | text=output, tokens=generated_tokens, logprob=logprob_sum, request_time=t 107 | ) 108 | 109 | 110 | @app.post("/tokenize") 111 | async def tokenize(input_data: TokenizeRequest) -> TokenizeResponse: 112 | t0 = time.perf_counter() 113 | encoded = tokenizer( 114 | input_data.text 115 | ) 116 | t = time.perf_counter() - t0 117 | tokens = encoded["input_ids"] 118 | return TokenizeResponse(tokens=tokens, request_time=t) 119 | -------------------------------------------------------------------------------- /sample-submissions/llama_recipes/train.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from huggingface_hub import login, HfApi 4 | from llama_recipes.finetuning import main as finetuning 5 | 6 | def main(): 7 | login(token=os.environ["HUGGINGFACE_TOKEN"]) 8 | 9 | kwargs = { 10 | "model_name": "meta-llama/Llama-2-7b-hf", 11 | "use_peft": True, 12 | "peft_method": "lora", 13 | "quantization": True, 14 | "batch_size_training": 2, 15 | "dataset": "custom_dataset", 16 | "custom_dataset.file": "./custom_dataset.py", 17 | "output_dir": "./output_dir", 18 | } 19 | 20 | finetuning(**kwargs) 21 | 22 | api = HfApi() 23 | 24 | api.upload_folder( 25 | folder_path='./output_dir/', 26 | repo_id=os.environ["HUGGINGFACE_REPO"], 27 | repo_type='model', 28 | ) 29 | 30 | if __name__ == "__main__": 31 | main() --------------------------------------------------------------------------------
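As a closing illustration (not a file in this repository), below is a minimal smoke-test client one might run against either sample submission once its container is up and mapped to localhost:8080, as in the `docker run` examples above. It only assumes what the sample `api.py` files define: a `/process` endpoint taking a JSON body with `prompt` and optional sampling parameters, a `/tokenize` endpoint taking `text`, and responses carrying `text`, `tokens`, `logprob`, and `request_time`. The script name, the choice of the `requests` library, and the example prompt are illustrative assumptions, not part of the competition spec.

```python
# smoke_test.py -- illustrative sketch, not part of the sample submissions.
# Assumes a submission container is already running and mapped to port 8080,
# e.g. `docker run --gpus all -p 8080:80 sample_submission`.
import requests

BASE_URL = "http://localhost:8080"


def main() -> None:
    # /tokenize accepts {"text": ...} and returns token ids plus timing,
    # matching TokenizeRequest / TokenizeResponse in the sample api.py.
    tok = requests.post(
        f"{BASE_URL}/tokenize",
        json={"text": "The capital of France is "},
        timeout=60,
    )
    tok.raise_for_status()
    tok_body = tok.json()
    assert isinstance(tok_body["tokens"], list) and "request_time" in tok_body

    # /process accepts {"prompt": ...} plus optional sampling parameters,
    # matching ProcessRequest in the sample api.py.
    proc = requests.post(
        f"{BASE_URL}/process",
        json={"prompt": "The capital of France is ", "max_new_tokens": 20, "temperature": 0.8, "seed": 0},
        timeout=300,
    )
    proc.raise_for_status()
    body = proc.json()
    # ProcessResponse carries the generated text, per-token logprobs, a summed logprob, and the request time.
    assert all(key in body for key in ("text", "tokens", "logprob", "request_time"))
    print(body["text"])


if __name__ == "__main__":
    main()
```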