├── .gitignore ├── .gitmodules ├── README.md ├── build_run_specs_full.py ├── helm.md ├── leaderboard.md ├── open_api_spec.json ├── run_specs.conf ├── run_specs_full_coarse_600_budget.conf └── sample-submissions ├── lit-gpt ├── Dockerfile ├── README.md ├── api.py ├── fast_api_requirements.txt ├── helper.py └── main.py └── llama_recipes ├── Dockerfile ├── Dockerfile.train ├── README.md ├── api.py ├── fast_api_requirements.txt ├── main.py └── train.py /.gitignore: -------------------------------------------------------------------------------- 1 | sample-submissions/lit-gpt/__pycache__/ -------------------------------------------------------------------------------- /.gitmodules: -------------------------------------------------------------------------------- 1 | [submodule "sample-submissions/lit-gpt/lit-gpt"] 2 | path = sample-submissions/lit-gpt/lit-gpt 3 | url = https://github.com/Lightning-AI/lit-gpt 4 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # NeurIPS 1 LLM 1 GPU Challenge 2 | 3 | This repository provides a starting point for those who are interested in the [NeurIPS 1 LLM 1 GPU Competition](https://llm-efficiency-challenge.github.io/). It clarifies exactly what a submission looks like and how it will be evaluated and submitted. 4 | 5 | At a high level, the key thing you will contribute is a `Dockerfile`, which will be a reproducible artifact that we can use to test your submission. The `Dockerfile` should contain all the code and dependencies needed to run your submission. We will use this `Dockerfile` to build a Docker image and then run it against a set of tasks which will be a subset of the [HELM](https://crfm.stanford.edu/helm/latest/) tasks. 6 | 7 | Your `Dockerfile` will expose a simple HTTP server, which needs to implement 2 endpoints: `/process` and `/tokenize`. We will build that `Dockerfile` and expect it to launch an HTTP server. Once that server is launched, we will make requests to it via HELM and record your results. 9 | At a high level, the flow you should follow to ensure a strong submission is: 10 | 1. Pick approved LLMs and datasets from [here](https://llm-efficiency-challenge.github.io/challenge) 11 | 2. Start with one of the [sample-submissions](sample-submissions) and make sure it runs 12 | 3. Evaluate it locally on your own 40GB A100 or 4090; if you don't have funding for either, please see the [GPU funding](#gpu-funding) section for some more options 13 | 4. Once you have something working, you can make a submission on our [Discord Leaderboard](https://discord.com/channels/1124130156336922665/1124134272631054447/1151718598818156645) to see how you fare against other competitors 14 | 5. On the competition deadline, make sure you have the final eval Dockerfile you'd like us to run in your GitHub repo, and refer to the [timeline](https://llm-efficiency-challenge.github.io/dates) 15 | 6. 
If your entry makes the shortlist, we will work with you to reproduce all of your artifacts with another finetuning Dockerfile 16 | 17 | ## Contents 18 | 19 | - [Approved LLM & Dataset](#approved-llm-and-dataset) 20 | - [Submission](#submission) 21 | - [Evaluate Your Model Locally Using HELM](#evaluate-your-model-locally-using-helm) 22 | - [Finetune](#finetune) 23 | - [Create your own submission template](#create-your-own-submission-template) 24 | - [Discord Leaderboard](#discord-leaderboard) 25 | - [Final Leaderboard Submission](#final-leaderboard-submission) 26 | - [Evaluating the Final Submission](#evaluating-the-final-submission) 27 | - [GPU funding](#gpu-funding) 28 | 29 | ## Approved LLM and dataset 30 | 31 | The LLM space has complex licenses, which can make it difficult to figure out what's permissible to use in a competition. To streamline this process, we've shortlisted a few models and datasets we know are safe to use [here](https://llm-efficiency-challenge.github.io/challenge) 32 | 33 | That said, the LLM space is fast moving, so if you'd like to use a dataset or model that isn't on our list, make sure to ask us about it on [https://discord.gg/XJwQ5ddMK7](https://discord.gg/XJwQ5ddMK7) 34 | 35 | ## Submission 36 | 37 | The submission in this repository is a basic implementation of an HTTP server set up in accordance with the `open_api` spec. It includes a sample solution built off of [Lit-GPT](https://github.com/Lightning-AI/lit-gpt) and open-llama weights that participants can reference or modify as they see fit. 38 | 39 | You can use the provided code as a reference or starting point for your own implementation. The `main.py` file contains the simple FastAPI server, and you can modify it to suit your needs. 40 | 41 | You can find the Lit-GPT submission [here](sample-submissions/lit-gpt/) and the llama-recipes submission [here](sample-submissions/llama_recipes/) with instructions on how to run each locally. 42 | 43 | Make sure that your final submission has only a single `Dockerfile` and that your weights are not directly included in the repo; they need to be downloaded during the Docker build or at runtime. 44 | 45 | ## Evaluate Your Model Locally Using HELM 46 | 47 | Every submission will be tested against [HELM](https://crfm.stanford.edu/helm/latest/), which is a standard suite to evaluate LLMs on a broad set of datasets. This competition will leverage HELM for its evaluation infrastructure. The organizers will use standard STEM tasks from HELM, although we will keep the exact set a secret, and in addition we'll be including some held-out tasks that are presently not in HELM. 48 | 49 | As you're working on your submission `Dockerfile`, you'll want to test it out locally to make sure your contribution works as expected before you submit it. 50 | 51 | HELM makes it easy to add new evaluation datasets by just adding another line in a config file, so make sure to experiment with the different datasets it has available, and feel free to contribute your own. 52 | 53 | To learn more about how to test your submission with HELM, please follow the instructions [here](helm.md). 54 | 55 | ## Finetune 56 | 57 | It's likely that an untuned base model won't give you satisfactory results; in that case, you might find it helpful to do some additional finetuning. There are many frameworks to do this, but we've created 2 sample submissions that show how: 58 | 1. [lit-gpt](/sample-submissions/lit-gpt/) 59 | 2. 
[llama-recipes](/sample-submissions/llama_recipes/) 60 | 61 | 62 | ### Create Your Own Submission Template 63 | 64 | Note that while we've offered 2 sample submissions, our evaluation infrastructure is generic and only assumes an HTTP server, so you can use a Python finetuning framework like the ones we've suggested, or any non-Python-based framework you like. 65 | 66 | The `open_api_spec.json` file in this repository contains the OpenAPI specification for the Competition API. Competitors can use this specification to understand the API endpoints, request and response structures, and overall requirements for interacting with the competition platform. 67 | 68 | The OpenAPI specification provides a standardized way to describe the API, making it easier for competitors to develop their own solutions and integrate them seamlessly with the competition infrastructure. 69 | 70 | 71 | ## Discord Leaderboard 72 | 73 | The [Lightning AI](https://lightning.ai/) team has built a Discord-based leaderboard for us. You can find the bot on Discord by its name `evalbot#4372`. 74 | 75 | You can interact with it by DM'ing it a zipped file of your sample submission with a message of either `eval A100` or `eval 4090`. More details on the bot are [here](https://discord.com/channels/1124130156336922665/1124134272631054447/1151718598818156645) 76 | 77 | Once you make a submission, the bot will inform you whether your submission failed or succeeded, and after a few hours it will publicly post your results. If you're at the top of the queue you can expect the eval to take 1-2h, but depending on the size of the queue this could be longer. So please be mindful of other competitors trying to use the limited amount of hardware, and ensure that your submissions work locally first. 78 | 79 | Your submission will remain private and will not be visible to other competitors. 80 | 81 | The end-to-end flow is described [here](leaderboard.md) 82 | 83 | ## Final Leaderboard Submission 84 | 85 | When you registered for the competition you would have needed to create a GitHub repo. When the submission deadline is reached, make sure your GitHub repo has a `Dockerfile`; in case the location is ambiguous, please be sure to let us know in your `README.md`. The organizers will take your `Dockerfile`, run it as is, and compute a baseline eval score. The purpose of this step is primarily to filter out broken submissions or submissions that can't outperform the unfinetuned sample submissions. 86 | 87 | The deadline is Oct 25, 2023, with important dates listed [here](https://llm-efficiency-challenge.github.io/dates) 88 | 89 | ## Evaluating the Final Submission 90 | 91 | Once the organizers have identified a shortlist of strong submissions, we will message you directly for another `Dockerfile` that reproduces all of your artifacts. The best submission among this shortlist will win the competition and be invited to present their work at our workshop at NeurIPS. 92 | 93 | ## GPU funding 94 | 95 | [AWS](https://aws.amazon.com/) has graciously agreed to provide $500 in AWS credits to 25 participating teams in the LLM efficiency competition. You will be able to pick and choose from available hardware to experiment before you make your final submission. To be eligible, please make sure to sign up at https://llm-efficiency-challenge.github.io/submission, write a short proposal in your `README.md`, and add [@jisaacso](https://github.com/jisaacso), who will review your proposals, to your repos. 96 | 97 | We'll be prioritizing the first teams with serious proposals. Good luck! 
98 | 99 | There are some other free ways of getting GPUs that people have posted on discord [here](https://discord.com/channels/1124130156336922665/1149283885524463637/1149283885524463637) and you can shop around for both 4090 and A100 on cloud on [https://cloud-gpus.com/](https://cloud-gpus.com/) 100 | -------------------------------------------------------------------------------- /build_run_specs_full.py: -------------------------------------------------------------------------------- 1 | entries = [ 2 | #bigbench 3 | # 1. auto_debugging: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/auto_debugging 4 | {'scenario':'auto_debugging','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=auto_debugging,subtask=", 'priority': 1}, 5 | 6 | # 3. code_line_description: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/code_line_description 7 | {'scenario':'code_line_description','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=code_line_description,subtask=", 'priority': 1}, 8 | 9 | # 4. conceptual_combinations: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/conceptual_combinations 10 | {'scenario':'conceptual_combinations','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=conceptual_combinations,subtask=contradictions", 'priority': 1}, 11 | {'scenario':'conceptual_combinations','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=conceptual_combinations,subtask=emergent_properties", 'priority': 1}, 12 | {'scenario':'conceptual_combinations','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=conceptual_combinations,subtask=fanciful_fictional_combinations", 'priority': 1}, 13 | {'scenario':'conceptual_combinations','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=conceptual_combinations,subtask=homonyms", 'priority': 1}, 14 | {'scenario':'conceptual_combinations','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=conceptual_combinations,subtask=invented_words", 'priority': 1}, 15 | 16 | # 6. emoji_movie: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/emoji_movie 17 | {'scenario':'emoji_movie','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=emoji_movie,subtask=", 'priority': 1}, 18 | 19 | # 7. formal_fallacies_syllogisms_negation: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/formal_fallacies_syllogisms_negation 20 | {'scenario':'formal_fallacies_syllogisms_negation','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=formal_fallacies_syllogisms_negation,subtask=", 'priority': 1}, 21 | 22 | # 8. hindu_knowledge: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/hindu_knowledge 23 | # {'scenario':'hindu_knowledge','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=hindu_knowledge,subtask=", 'priority': 1}, 24 | 25 | # 9. 
known_unknowns: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/known_unknowns 26 | {'scenario':'known_unknowns','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=known_unknowns,subtask=", 'priority': 1}, 27 | 28 | # 11. linguistics_puzzles: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/linguistics_puzzles 29 | {'scenario':'linguistics_puzzles','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=linguistics_puzzles,subtask=", 'priority': 1}, 30 | 31 | # 12. logic_grid_puzzle: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/logic_grid_puzzle 32 | {'scenario':'logic_grid_puzzle','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=logic_grid_puzzle,subtask=", 'priority': 1}, 33 | 34 | # 13. logical_deduction: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/logical_deduction 35 | {'scenario':'logical_deduction','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=logical_deduction,subtask=three_objects", 'priority': 1}, 36 | {'scenario':'logical_deduction','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=logical_deduction,subtask=five_objects", 'priority': 1}, 37 | {'scenario':'logical_deduction','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=logical_deduction,subtask=seven_objects", 'priority': 1}, 38 | 39 | # 14. misconceptions_russian: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/misconceptions_russian 40 | # {'scenario':'misconceptions_russian','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=misconceptions_russian,subtask=", 'priority': 1}, 41 | 42 | # 15. novel_concepts: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/novel_concepts 43 | {'scenario':'novel_concepts','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=novel_concepts,subtask=", 'priority': 1}, 44 | 45 | # 16. operators: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/operators 46 | {'scenario':'operator','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=operators,subtask=", 'priority': 1}, 47 | 48 | # 17. parsinlu_reading_comprehension: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/parsinlu_reading_comprehension 49 | # {'scenario':'parsinlu_reading_comprehension','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=parsinlu_reading_comprehension,subtask=", 'priority': 1}, 50 | 51 | # 18. play_dialog_same_or_different: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/play_dialog_same_or_different 52 | {'scenario':'play_dialog_same_or_different','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=play_dialog_same_or_different,subtask=", 'priority': 1}, 53 | 54 | # 19. repeat_copy_logic: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/repeat_copy_logic 55 | {'scenario':'repeat_copy_logic','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=repeat_copy_logic,subtask=", 'priority': 1}, 56 | 57 | # 20. 
strange_stories: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/strange_stories 58 | {'scenario':'strange_stories','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=strange_stories,subtask=boolean", 'priority': 1}, 59 | {'scenario':'strange_stories','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=strange_stories,subtask=multiple_choice", 'priority': 1}, 60 | 61 | # 21. strategyqa: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/strategyqa 62 | {'scenario':'strategyqa','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=strategyqa,subtask=", 'priority': 1}, 63 | 64 | # 22. symbol_interpretation: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/symbol_interpretation 65 | {'scenario':'symbol_interpretation','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=symbol_interpretation,subtask=adversarial", 'priority': 1}, 66 | {'scenario':'symbol_interpretation','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=symbol_interpretation,subtask=emoji_agnostic", 'priority': 1}, 67 | {'scenario':'symbol_interpretation','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=symbol_interpretation,subtask=name_agnostic", 'priority': 1}, 68 | {'scenario':'symbol_interpretation','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=symbol_interpretation,subtask=plain", 'priority': 1}, 69 | {'scenario':'symbol_interpretation','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=symbol_interpretation,subtask=tricky", 'priority': 1}, 70 | 71 | # 23. vitaminc_fact_verification: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/vitaminc_fact_verification 72 | {'scenario':'vitaminc_fact_verification','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=vitaminc_fact_verification,subtask=", 'priority': 1}, 73 | 74 | # 24. 
winowhy: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/winowhy 75 | {'scenario':'winowhy','description': "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=winowhy,subtask=", 'priority': 1}, 76 | 77 | # MMLU STEM: Medicine/Biology 78 | {'scenario':'medicine_biology','description': "mmlu:model=neurips/local,subject=anatomy,data_augmentation=canonical", 'priority': 2}, 79 | {'scenario':'medicine_biology','description': "mmlu:model=neurips/local,subject=college_medicine,data_augmentation=canonical", 'priority': 2}, 80 | {'scenario':'medicine_biology','description': "mmlu:model=neurips/local,subject=college_biology,data_augmentation=canonical", 'priority': 2}, 81 | {'scenario':'medicine_biology','description': "mmlu:model=neurips/local,subject=high_school_biology,data_augmentation=canonical", 'priority': 2}, 82 | 83 | # MMLU STEM: CS 84 | {'scenario':'computer_science','description': "mmlu:model=neurips/local,subject=college_computer_science,data_augmentation=canonical", 'priority': 2}, 85 | {'scenario':'computer_science','description': "mmlu:model=neurips/local,subject=high_school_computer_science,data_augmentation=canonical", 'priority': 2}, 86 | {'scenario':'computer_science','description': "mmlu:model=neurips/local,subject=computer_security,data_augmentation=canonical", 'priority': 2}, 87 | {'scenario':'computer_science','description': "mmlu:model=neurips/local,subject=electrical_engineering,data_augmentation=canonical", 'priority': 2}, 88 | {'scenario':'computer_science','description': "mmlu:model=neurips/local,subject=machine_learning,data_augmentation=canonical", 'priority': 2}, 89 | 90 | # MMLU STEM: Math 91 | {'scenario':'math','description': "mmlu:model=neurips/local,subject=high_school_mathematics,data_augmentation=canonical", 'priority': 2}, 92 | {'scenario':'math','description': "mmlu:model=neurips/local,subject=college_mathematics,data_augmentation=canonical", 'priority': 2}, 93 | {'scenario':'math','description': "mmlu:model=neurips/local,subject=abstract_algebra,data_augmentation=canonical", 'priority': 2}, 94 | {'scenario':'math','description': "mmlu:model=neurips/local,subject=high_school_statistics,data_augmentation=canonical", 'priority': 2}, 95 | 96 | # MMLU STEM: Chemistry/Physics 97 | {'scenario':'physics_chemistry','description': "mmlu:model=neurips/local,subject=college_chemistry,data_augmentation=canonical", 'priority': 2}, 98 | {'scenario':'physics_chemistry','description': "mmlu:model=neurips/local,subject=high_school_chemistry,data_augmentation=canonical", 'priority': 2}, 99 | {'scenario':'physics_chemistry','description': "mmlu:model=neurips/local,subject=high_school_physics,data_augmentation=canonical", 'priority': 2}, 100 | {'scenario':'physics_chemistry','description': "mmlu:model=neurips/local,subject=college_physics,data_augmentation=canonical", 'priority': 2}, 101 | {'scenario':'physics_chemistry','description': "mmlu:model=neurips/local,subject=astronomy,data_augmentation=canonical", 'priority': 2}, 102 | 103 | # MMLU Humanities: Formal reasoning 104 | {'scenario':'formal_reasoning','description': "mmlu:model=neurips/local,subject=formal_logic,data_augmentation=canonical", 'priority': 2}, 105 | {'scenario':'formal_reasoning','description': "mmlu:model=neurips/local,subject=logical_fallacies,data_augmentation=canonical", 'priority': 2}, 106 | {'scenario':'formal_reasoning','description': "mmlu:model=neurips/local,subject=philosophy,data_augmentation=canonical", 'priority': 2}, 107 | 
{'scenario':'formal_reasoning','description': "mmlu:model=neurips/local,subject=moral_disputes,data_augmentation=canonical", 'priority': 2}, 108 | {'scenario':'formal_reasoning','description': "mmlu:model=neurips/local,subject=moral_scenarios,data_augmentation=canonical", 'priority': 2}, 109 | 110 | # MMLU Humanities: Law 111 | {'scenario':'law','description': "mmlu:model=neurips/local,subject=professional_law,data_augmentation=canonical", 'priority': 2}, 112 | {'scenario':'law','description': "mmlu:model=neurips/local,subject=international_law,data_augmentation=canonical", 'priority': 2}, 113 | {'scenario':'law','description': "mmlu:model=neurips/local,subject=jurisprudence,data_augmentation=canonical", 'priority': 2}, 114 | 115 | # MMLU Humanities: Histroy 116 | {'scenario':'history','description': "mmlu:model=neurips/local,subject=high_school_european_history,data_augmentation=canonical", 'priority': 2}, 117 | {'scenario':'history','description': "mmlu:model=neurips/local,subject=high_school_us_history,data_augmentation=canonical", 'priority': 2}, 118 | {'scenario':'history','description': "mmlu:model=neurips/local,subject=high_school_world_history,data_augmentation=canonical", 'priority': 2}, 119 | {'scenario':'history','description': "mmlu:model=neurips/local,subject=prehistory,data_augmentation=canonical", 'priority': 2}, 120 | {'scenario':'history','description': "mmlu:model=neurips/local,subject=world_religions,data_augmentation=canonical", 'priority': 2}, 121 | 122 | # MMLU Other: Business 123 | {'scenario':'business','description': "mmlu:model=neurips/local,subject=business_ethics,data_augmentation=canonical", 'priority': 2}, 124 | {'scenario':'business','description': "mmlu:model=neurips/local,subject=global_facts,data_augmentation=canonical", 'priority': 2}, 125 | {'scenario':'business','description': "mmlu:model=neurips/local,subject=management,data_augmentation=canonical", 'priority': 2}, 126 | {'scenario':'business','description': "mmlu:model=neurips/local,subject=marketing,data_augmentation=canonical", 'priority': 2}, 127 | {'scenario':'business','description': "mmlu:model=neurips/local,subject=miscellaneous,data_augmentation=canonical", 'priority': 2}, 128 | {'scenario':'business','description': "mmlu:model=neurips/local,subject=professional_accounting,data_augmentation=canonical", 'priority': 2}, 129 | 130 | # MMLU Other: Health 131 | {'scenario':'health','description': "mmlu:model=neurips/local,subject=nutrition,data_augmentation=canonical", 'priority': 2}, 132 | {'scenario':'health','description': "mmlu:model=neurips/local,subject=human_aging,data_augmentation=canonical", 'priority': 2}, 133 | {'scenario':'health','description': "mmlu:model=neurips/local,subject=clinical_knowledge,data_augmentation=canonical", 'priority': 2}, 134 | {'scenario':'health','description': "mmlu:model=neurips/local,subject=medical_genetics,data_augmentation=canonical", 'priority': 2}, 135 | {'scenario':'health','description': "mmlu:model=neurips/local,subject=professional_medicine,data_augmentation=canonical", 'priority': 2}, 136 | {'scenario':'health','description': "mmlu:model=neurips/local,subject=virology,data_augmentation=canonical", 'priority': 2}, 137 | 138 | # MMLU Social Sciences: Social studies 139 | {'scenario':'social_studies','description': "mmlu:model=neurips/local,subject=high_school_government_and_politics,data_augmentation=canonical", 'priority': 2}, 140 | {'scenario':'social_studies','description': 
"mmlu:model=neurips/local,subject=high_school_geography,data_augmentation=canonical", 'priority': 2}, 141 | {'scenario':'social_studies','description': "mmlu:model=neurips/local,subject=us_foreign_policy,data_augmentation=canonical", 'priority': 2}, 142 | {'scenario':'social_studies','description': "mmlu:model=neurips/local,subject=public_relations,data_augmentation=canonical", 'priority': 2}, 143 | {'scenario':'social_studies','description': "mmlu:model=neurips/local,subject=security_studies,data_augmentation=canonical", 'priority': 2}, 144 | 145 | # MMLU Social Sciences: Human behavior 146 | {'scenario':'human_behavior','description': "mmlu:model=neurips/local,subject=high_school_psychology,data_augmentation=canonical", 'priority': 2}, 147 | {'scenario':'human_behavior','description': "mmlu:model=neurips/local,subject=human_sexuality,data_augmentation=canonical", 'priority': 2}, 148 | {'scenario':'human_behavior','description': "mmlu:model=neurips/local,subject=professional_psychology,data_augmentation=canonical", 'priority': 2}, 149 | {'scenario':'human_behavior','description': "mmlu:model=neurips/local,subject=sociology,data_augmentation=canonical", 'priority': 2}, 150 | 151 | # MMLU Social Sciences: Economics 152 | {'scenario':'economics','description': "mmlu:model=neurips/local,subject=high_school_microeconomics,data_augmentation=canonical", 'priority': 2}, 153 | {'scenario':'economics','description': "mmlu:model=neurips/local,subject=econometrics,data_augmentation=canonical", 'priority': 2}, 154 | {'scenario':'economics','description': "mmlu:model=neurips/local,subject=high_school_macroeconomics,data_augmentation=canonical", 'priority': 2}, 155 | 156 | # Truthful QA 157 | {'scenario':'truthful_qa','description': "truthful_qa:task=mc_single,model=neurips/local", 'priority': 1}, 158 | 159 | # CNN/daily mail 160 | {'scenario':'truthful_qa','description': "summarization_cnndm:model=neurips/local", 'priority': 1}, 161 | # GSM 162 | {'scenario':'gsm','description': "gsm:model=neurips/local", 'priority': 1}, 163 | # BBQ 164 | {'scenario':'bbq','description': "bbq:subject=all,model=neurips/local", 'priority': 1}, 165 | 166 | ] 167 | 168 | def generate_equal_sum_list(V, N): 169 | # Calculate the base value that will be repeated. 170 | base_value = V // N 171 | # Calculate the remainder for distribution. 172 | remainder = V % N 173 | 174 | # Create the list with base_value repeated N times. 175 | result = [base_value] * N 176 | 177 | # Distribute the remainder evenly among the elements. 178 | for i in range(remainder): 179 | result[i] += 1 180 | 181 | return result 182 | 183 | import pandas as pd 184 | import argparse 185 | 186 | if __name__ == "__main__": 187 | 188 | import argparse 189 | parser = argparse.ArgumentParser( 190 | description=''' 191 | This method automatically generates a configuration file for the neurips_llm_efficiency_challenge 192 | 193 | Calling it with: `python build_run_specs_full.py --example_budget=600` will produce a conf file 194 | with a total of 600 examples distributed evenly across scenarios as also defined here. 
195 | ''', 196 | ) 197 | parser.add_argument("--example_budget", required=True, type=int, help='# examples to use') 198 | args = parser.parse_args() 199 | 200 | # get a list of scenarios and n_examples 201 | df = pd.DataFrame(entries) 202 | scenario_count_dict = df.value_counts('scenario').to_dict() 203 | n_scenarios = len(df.scenario.unique()) 204 | max_eval_instances_per_scenario = generate_equal_sum_list(args.example_budget, n_scenarios) 205 | 206 | # get a dict of the number of examples per scenario 207 | scenario_n_examples_dict = {} 208 | for scenario, n_subscenarios in scenario_count_dict.items(): 209 | cur_max_eval_instances_per_scenario = max_eval_instances_per_scenario.pop() 210 | scenario_n_examples_dict[scenario] = generate_equal_sum_list(cur_max_eval_instances_per_scenario,n_subscenarios) 211 | 212 | for i in range(len(entries)): 213 | cur_scenario = entries[i]['scenario'] 214 | # print(f"added {v} to {entries[i]['max_eval_instances']}") 215 | v = scenario_n_examples_dict[cur_scenario].pop() 216 | entries[i]['max_eval_instances'] = v 217 | 218 | with open(f'./run_specs_full_coarse_{args.example_budget}_budget.conf','w') as f: 219 | f.write('entries: [\n') 220 | last_scenario = '' 221 | for entry in entries: 222 | cur_scenario = entry['scenario'] 223 | if cur_scenario != last_scenario: 224 | f.write(f'\n# {cur_scenario}\n') 225 | print(entry) 226 | last_scenario = cur_scenario 227 | f.write('{') 228 | f.write(f'description: """{entry["description"]}'.replace('"""','"')) 229 | f.write(f',max_eval_instances={entry["max_eval_instances"]}""",priority: 1'.replace('"""','"')) 230 | f.write('}\n') 231 | f.write(']') 232 | 233 | print(f'Saved ./run_specs_full_coarse_{args.example_budget}_budget.conf') 234 | -------------------------------------------------------------------------------- /helm.md: -------------------------------------------------------------------------------- 1 | # How to test HELM locally 2 | 3 | ## Install the NeurIPS client 4 | 5 | Install HELM: `pip install git+https://github.com/stanford-crfm/helm.git` 6 | 7 | 8 | ## Set up an HTTP server 9 | 10 | Follow the instructions in the [toy-submission](/sample-submissions/lit-gpt/) to set up a simple HTTP server that you can use for local tests 11 | 12 | ## Configure HELM 13 | 14 | You can configure which datasets HELM runs your model on by editing a `run_specs.conf` file. For the preliminary evaluation the organizers will use https://github.com/llm-efficiency-challenge/neurips_llm_efficiency_challenge/blob/master/run_specs_full_coarse_600_budget.conf 15 | 16 | ```bash 17 | helm-run --conf-paths run_specs_full_coarse_600_budget.conf --suite v1 --max-eval-instances 10 18 | helm-summarize --suite v1 19 | ``` 20 | 21 | ## Analyze your results 22 | 23 | You can launch a web server to visually inspect the results of your run. `helm-summarize` can also print the results textually in your terminal, but we've found the web server to be more useful. 24 | 25 | ``` 26 | helm-server 27 | ``` 28 | 29 | This will launch a server on your localhost; if you're working on a remote machine you might need to set up port forwarding.
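For example, if you're evaluating on a remote machine, a local SSH tunnel is usually enough to view the results in your browser. The sketch below assumes `helm-server` serves on port 8000 (check the port it prints on startup) and uses `user@remote-host` as a placeholder for your machine:

```bash
# Forward local port 8000 to the helm-server port on the remote machine,
# then open http://localhost:8000 in your local browser.
ssh -L 8000:localhost:8000 user@remote-host
```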
If everything worked correctly, you should see a page that looks like [this](https://user-images.githubusercontent.com/3282513/249620854-080f4d77-c5fd-4ea4-afa4-cf6a9dceb8c9.png) 30 | -------------------------------------------------------------------------------- /leaderboard.md: -------------------------------------------------------------------------------- 1 | # Leaderboard usage 2 | 3 | 4 | ## How to use the leaderboard 5 | The [Lightning AI](https://lightning.ai/) team has built us a leaderboard on Discord. This is the single best way to make sure your submission actually works before the final submission; as a starting point, try to beat the unfinetuned toy submission. 6 | 7 | You might have noticed a friendly new bot called @evalbot has joined the server. To use it: 8 | 1. DM the bot with `eval 4090` or `eval A100` and attach a zipped file of your submission to the message (You can also just openly message the bot, but DM'ing will protect your secret sauce) 9 | 2. If successful, the bot will give you a job ID and a running status; the eval will take roughly 1-2h, so be patient even if you're at the top of the queue 10 | 3. Once the bot completes your run it will update either the leaderboard_4090 or leaderboard_a100 channel; we will not be monitoring these 2 text channels, they are purely for the bot to post the new updated leaderboard 11 | 12 | ## How to create a zip submission 13 | 14 | We will showcase an example using our actual repo https://github.com/llm-efficiency-challenge/neurips_llm_efficiency_challenge 15 | 1. `git clone --recurse-submodules https://github.com/llm-efficiency-challenge/neurips_llm_efficiency_challenge` to ensure the `lit-gpt` folder is actually in the repo 16 | 2. `rm -rf sample-submissions/llama_recipes`, because the leaderboard will recursively traverse your repo, find the first `Dockerfile`, and assume that's the submission 17 | 3. `zip -r neurips_llm_efficiency_challenge.zip neurips_llm_efficiency_challenge/` 18 | 19 | Once you have that zip, DM the `evalbot` with either `eval 4090` or `eval A100` and the zip file attached to your message. Discord does impose size limits on messages, so make sure your artifacts aren't stored directly in the repo but are instead fetched with `wget` from somewhere else. 20 | 21 | 22 | **Note**: 23 | 1. The way the bot works is it will recursively scan your repo for the first Dockerfile and use only that to eval against. Providing free GPUs is expensive, so if you're up to funny business like opening multiple Discord accounts and/or spamming our bot, we will disqualify you from the competition 24 | 2. You will be allowed a maximum of 3 submissions a day 25 | 3. 
Depending on volume of submissions eval might take a long time while you wait in the queue, the 2 techniques we have of resolving this are either adding more GPUs in our pool or reducing the number of eval instances, we will communicate whenever we make either of 2 decisions on Discord directly 27 | -------------------------------------------------------------------------------- /open_api_spec.json: -------------------------------------------------------------------------------- 1 | {"openapi": "3.0.2", "info": {"title": "FastAPI", "version": "0.1.0"}, "paths": {"/process": {"post": {"summary": "Process Request", "operationId": "process_request_process_post", "requestBody": {"content": {"application/json": {"schema": {"$ref": "#/components/schemas/ProcessRequest"}}}, "required": true}, "responses": {"200": {"description": "Successful Response", "content": {"application/json": {"schema": {}}}}, "422": {"description": "Validation Error", "content": {"application/json": {"schema": {"$ref": "#/components/schemas/HTTPValidationError"}}}}}}}, "/tokenize": {"post": {"summary": "Tokenize", "operationId": "tokenize_tokenize_post", "requestBody": {"content": {"application/json": {"schema": {"$ref": "#/components/schemas/TokenizeRequest"}}}, "required": true}, "responses": {"200": {"description": "Successful Response", "content": {"application/json": {"schema": {}}}}, "422": {"description": "Validation Error", "content": {"application/json": {"schema": {"$ref": "#/components/schemas/HTTPValidationError"}}}}}}}}, "components": {"schemas": {"HTTPValidationError": {"title": "HTTPValidationError", "type": "object", "properties": {"detail": {"title": "Detail", "type": "array", "items": {"$ref": "#/components/schemas/ValidationError"}}}}, "ProcessRequest": {"title": "ProcessRequest", "required": ["prompt"], "type": "object", "properties": {"prompt": {"title": "Prompt", "type": "string"}, "num_samples": {"title": "Num Samples", "type": "integer", "default": 1}, "max_new_tokens": {"title": "Max New Tokens", "type": "integer", "default": 50}, "top_k": {"title": "Top K", "type": "integer", "default": 200}, "temperature": {"title": "Temperature", "type": "number", "default": 0.8}, "seed": {"title": "Seed", "type": "integer"}}}, "TokenizeRequest": {"title": "TokenizeRequest", "required": ["text"], "type": "object", "properties": {"text": {"title": "Text", "type": "string"}, "truncation": {"title": "Truncation", "type": "boolean", "default": true}, "max_length": {"title": "Max Length", "type": "integer", "default": 2048}}}, "ValidationError": {"title": "ValidationError", "required": ["loc", "msg", "type"], "type": "object", "properties": {"loc": {"title": "Location", "type": "array", "items": {"type": "string"}}, "msg": {"title": "Message", "type": "string"}, "type": {"title": "Error Type", "type": "string"}}}}}} -------------------------------------------------------------------------------- /run_specs.conf: -------------------------------------------------------------------------------- 1 | entries: [ 2 | #bigbench 3 | 4 | #analytic_entailment: https://github.com/google/BIG-bench/blob/main/bigbench/benchmark_tasks/analytic_entailment 5 | {description: "big_bench:model=neurips/local,max_train_instances=3,task=analytic_entailment,subtask=", priority: 1} 6 | 7 | #causal_judgment: https://github.com/google/BIG-bench/blob/main/bigbench/benchmark_tasks/causal_judgment 8 | {description: "big_bench:model=neurips/local,max_train_instances=3,task=causal_judgment,subtask=", priority: 1} 9 | 10 | #emoji_movie: 
https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/emoji_movie 11 | {description: "big_bench:model=neurips/local,max_train_instances=3,task=emoji_movie,subtask=", priority: 1} 12 | 13 | #empirical_judgments: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/empirical_judgments 14 | {description: "big_bench:model=neurips/local,max_train_instances=3,task=empirical_judgments,subtask=", priority: 1} 15 | 16 | #known_unknowns: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/known_unknowns 17 | {description: "big_bench:model=neurips/local,max_train_instances=3,task=known_unknowns,subtask=", priority: 1} 18 | 19 | # logical_deduction: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/logical_deduction 20 | {description: "big_bench:model=neurips/local,max_train_instances=3,task=logical_deduction,subtask=three_objects", priority: 1} 21 | 22 | #strange_stories: https://github.com/google/big-bench/tree/main/bigbench/benchmark_tasks/strange_stories 23 | {description: "big_bench:model=neurips/local,max_train_instances=3,task=strange_stories,subtask=multiple_choice", priority: 1} 24 | 25 | #snarks: https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/snarks 26 | {description: "big_bench:model=neurips/local,max_train_instances=3,task=snarks,subtask=", priority: 1} 27 | 28 | #dark_humor_detection: https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/dark_humor_detection 29 | {description: "big_bench:model=neurips/local,max_train_instances=3,task=dark_humor_detection,subtask=", priority: 1} 30 | 31 | 32 | #mmlu 33 | {description: "mmlu:model=neurips/local,subject=philosophy,data_augmentation=canonical", priority: 1} 34 | {description: "mmlu:model=neurips/local,subject=high_school_biology,data_augmentation=canonical", priority: 1} 35 | {description: "mmlu:model=neurips/local,subject=high_school_chemistry,data_augmentation=canonical", priority: 1} 36 | {description: "mmlu:model=neurips/local,subject=high_school_computer_science,data_augmentation=canonical", priority: 1} 37 | {description: "mmlu:model=neurips/local,subject=high_school_european_history,data_augmentation=canonical", priority: 1} 38 | {description: "mmlu:model=neurips/local,subject=high_school_geography,data_augmentation=canonical", priority: 1} 39 | {description: "mmlu:model=neurips/local,subject=high_school_government_and_politics,data_augmentation=canonical", priority: 1} 40 | {description: "mmlu:model=neurips/local,subject=high_school_macroeconomics,data_augmentation=canonical", priority: 1} 41 | {description: "mmlu:model=neurips/local,subject=high_school_mathematics,data_augmentation=canonical", priority: 1} 42 | {description: "mmlu:model=neurips/local,subject=high_school_microeconomics,data_augmentation=canonical", priority: 1} 43 | {description: "mmlu:model=neurips/local,subject=high_school_physics,data_augmentation=canonical", priority: 1} 44 | {description: "mmlu:model=neurips/local,subject=high_school_psychology,data_augmentation=canonical", priority: 1} 45 | {description: "mmlu:model=neurips/local,subject=high_school_statistics,data_augmentation=canonical", priority: 1} 46 | {description: "mmlu:model=neurips/local,subject=high_school_us_history,data_augmentation=canonical", priority: 1} 47 | {description: "mmlu:model=neurips/local,subject=high_school_world_history,data_augmentation=canonical", priority: 1} 48 | {description: "mmlu:model=neurips/local,subject=moral_disputes,data_augmentation=canonical", priority: 1} 
49 | {description: "mmlu:model=neurips/local,subject=moral_scenarios,data_augmentation=canonical", priority: 1} 50 | 51 | 52 | #truthful QA 53 | {description: "truthful_qa:task=mc_single,model=neurips/local", priority: 1}, 54 | 55 | #CNN/daily mail 56 | {description: "summarization_cnndm:model=neurips/local", priority: 1}, 57 | #GSM 58 | {description: "gsm:model=neurips/local", priority: 1} 59 | #BBQ 60 | {description: "bbq:subject=all,model=neurips/local", priority: 1}, 61 | 62 | ] 63 | -------------------------------------------------------------------------------- /run_specs_full_coarse_600_budget.conf: -------------------------------------------------------------------------------- 1 | entries: [ 2 | 3 | # auto_debugging 4 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=auto_debugging,subtask=,max_eval_instances=18",priority: 1} 5 | 6 | # code_line_description 7 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=code_line_description,subtask=,max_eval_instances=19",priority: 1} 8 | 9 | # conceptual_combinations 10 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=conceptual_combinations,subtask=contradictions,max_eval_instances=3",priority: 1} 11 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=conceptual_combinations,subtask=emergent_properties,max_eval_instances=3",priority: 1} 12 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=conceptual_combinations,subtask=fanciful_fictional_combinations,max_eval_instances=4",priority: 1} 13 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=conceptual_combinations,subtask=homonyms,max_eval_instances=4",priority: 1} 14 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=conceptual_combinations,subtask=invented_words,max_eval_instances=4",priority: 1} 15 | 16 | # emoji_movie 17 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=emoji_movie,subtask=,max_eval_instances=19",priority: 1} 18 | 19 | # formal_fallacies_syllogisms_negation 20 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=formal_fallacies_syllogisms_negation,subtask=,max_eval_instances=19",priority: 1} 21 | 22 | # known_unknowns 23 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=known_unknowns,subtask=,max_eval_instances=19",priority: 1} 24 | 25 | # linguistics_puzzles 26 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=linguistics_puzzles,subtask=,max_eval_instances=18",priority: 1} 27 | 28 | # logic_grid_puzzle 29 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=logic_grid_puzzle,subtask=,max_eval_instances=18",priority: 1} 30 | 31 | # logical_deduction 32 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=logical_deduction,subtask=three_objects,max_eval_instances=6",priority: 1} 33 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=logical_deduction,subtask=five_objects,max_eval_instances=6",priority: 1} 34 | {description: 
"big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=logical_deduction,subtask=seven_objects,max_eval_instances=6",priority: 1} 35 | 36 | # novel_concepts 37 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=novel_concepts,subtask=,max_eval_instances=18",priority: 1} 38 | 39 | # operator 40 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=operators,subtask=,max_eval_instances=18",priority: 1} 41 | 42 | # play_dialog_same_or_different 43 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=play_dialog_same_or_different,subtask=,max_eval_instances=18",priority: 1} 44 | 45 | # repeat_copy_logic 46 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=repeat_copy_logic,subtask=,max_eval_instances=18",priority: 1} 47 | 48 | # strange_stories 49 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=strange_stories,subtask=boolean,max_eval_instances=9",priority: 1} 50 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=strange_stories,subtask=multiple_choice,max_eval_instances=9",priority: 1} 51 | 52 | # strategyqa 53 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=strategyqa,subtask=,max_eval_instances=18",priority: 1} 54 | 55 | # symbol_interpretation 56 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=symbol_interpretation,subtask=adversarial,max_eval_instances=3",priority: 1} 57 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=symbol_interpretation,subtask=emoji_agnostic,max_eval_instances=3",priority: 1} 58 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=symbol_interpretation,subtask=name_agnostic,max_eval_instances=4",priority: 1} 59 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=symbol_interpretation,subtask=plain,max_eval_instances=4",priority: 1} 60 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=symbol_interpretation,subtask=tricky,max_eval_instances=4",priority: 1} 61 | 62 | # vitaminc_fact_verification 63 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=vitaminc_fact_verification,subtask=,max_eval_instances=18",priority: 1} 64 | 65 | # winowhy 66 | {description: "big_bench:model=neurips/local,max_train_instances=big_bench_few_shot_setting,task=winowhy,subtask=,max_eval_instances=19",priority: 1} 67 | 68 | # medicine_biology 69 | {description: "mmlu:model=neurips/local,subject=anatomy,data_augmentation=canonical,max_eval_instances=4",priority: 1} 70 | {description: "mmlu:model=neurips/local,subject=college_medicine,data_augmentation=canonical,max_eval_instances=4",priority: 1} 71 | {description: "mmlu:model=neurips/local,subject=college_biology,data_augmentation=canonical,max_eval_instances=5",priority: 1} 72 | {description: "mmlu:model=neurips/local,subject=high_school_biology,data_augmentation=canonical,max_eval_instances=5",priority: 1} 73 | 74 | # computer_science 75 | {description: "mmlu:model=neurips/local,subject=college_computer_science,data_augmentation=canonical,max_eval_instances=3",priority: 1} 76 | {description: 
"mmlu:model=neurips/local,subject=high_school_computer_science,data_augmentation=canonical,max_eval_instances=3",priority: 1} 77 | {description: "mmlu:model=neurips/local,subject=computer_security,data_augmentation=canonical,max_eval_instances=4",priority: 1} 78 | {description: "mmlu:model=neurips/local,subject=electrical_engineering,data_augmentation=canonical,max_eval_instances=4",priority: 1} 79 | {description: "mmlu:model=neurips/local,subject=machine_learning,data_augmentation=canonical,max_eval_instances=4",priority: 1} 80 | 81 | # math 82 | {description: "mmlu:model=neurips/local,subject=high_school_mathematics,data_augmentation=canonical,max_eval_instances=4",priority: 1} 83 | {description: "mmlu:model=neurips/local,subject=college_mathematics,data_augmentation=canonical,max_eval_instances=4",priority: 1} 84 | {description: "mmlu:model=neurips/local,subject=abstract_algebra,data_augmentation=canonical,max_eval_instances=5",priority: 1} 85 | {description: "mmlu:model=neurips/local,subject=high_school_statistics,data_augmentation=canonical,max_eval_instances=5",priority: 1} 86 | 87 | # physics_chemistry 88 | {description: "mmlu:model=neurips/local,subject=college_chemistry,data_augmentation=canonical,max_eval_instances=3",priority: 1} 89 | {description: "mmlu:model=neurips/local,subject=high_school_chemistry,data_augmentation=canonical,max_eval_instances=3",priority: 1} 90 | {description: "mmlu:model=neurips/local,subject=high_school_physics,data_augmentation=canonical,max_eval_instances=4",priority: 1} 91 | {description: "mmlu:model=neurips/local,subject=college_physics,data_augmentation=canonical,max_eval_instances=4",priority: 1} 92 | {description: "mmlu:model=neurips/local,subject=astronomy,data_augmentation=canonical,max_eval_instances=4",priority: 1} 93 | 94 | # formal_reasoning 95 | {description: "mmlu:model=neurips/local,subject=formal_logic,data_augmentation=canonical,max_eval_instances=3",priority: 1} 96 | {description: "mmlu:model=neurips/local,subject=logical_fallacies,data_augmentation=canonical,max_eval_instances=3",priority: 1} 97 | {description: "mmlu:model=neurips/local,subject=philosophy,data_augmentation=canonical,max_eval_instances=4",priority: 1} 98 | {description: "mmlu:model=neurips/local,subject=moral_disputes,data_augmentation=canonical,max_eval_instances=4",priority: 1} 99 | {description: "mmlu:model=neurips/local,subject=moral_scenarios,data_augmentation=canonical,max_eval_instances=4",priority: 1} 100 | 101 | # law 102 | {description: "mmlu:model=neurips/local,subject=professional_law,data_augmentation=canonical,max_eval_instances=6",priority: 1} 103 | {description: "mmlu:model=neurips/local,subject=international_law,data_augmentation=canonical,max_eval_instances=6",priority: 1} 104 | {description: "mmlu:model=neurips/local,subject=jurisprudence,data_augmentation=canonical,max_eval_instances=6",priority: 1} 105 | 106 | # history 107 | {description: "mmlu:model=neurips/local,subject=high_school_european_history,data_augmentation=canonical,max_eval_instances=3",priority: 1} 108 | {description: "mmlu:model=neurips/local,subject=high_school_us_history,data_augmentation=canonical,max_eval_instances=3",priority: 1} 109 | {description: "mmlu:model=neurips/local,subject=high_school_world_history,data_augmentation=canonical,max_eval_instances=4",priority: 1} 110 | {description: "mmlu:model=neurips/local,subject=prehistory,data_augmentation=canonical,max_eval_instances=4",priority: 1} 111 | {description: 
"mmlu:model=neurips/local,subject=world_religions,data_augmentation=canonical,max_eval_instances=4",priority: 1} 112 | 113 | # business 114 | {description: "mmlu:model=neurips/local,subject=business_ethics,data_augmentation=canonical,max_eval_instances=3",priority: 1} 115 | {description: "mmlu:model=neurips/local,subject=global_facts,data_augmentation=canonical,max_eval_instances=3",priority: 1} 116 | {description: "mmlu:model=neurips/local,subject=management,data_augmentation=canonical,max_eval_instances=3",priority: 1} 117 | {description: "mmlu:model=neurips/local,subject=marketing,data_augmentation=canonical,max_eval_instances=3",priority: 1} 118 | {description: "mmlu:model=neurips/local,subject=miscellaneous,data_augmentation=canonical,max_eval_instances=3",priority: 1} 119 | {description: "mmlu:model=neurips/local,subject=professional_accounting,data_augmentation=canonical,max_eval_instances=3",priority: 1} 120 | 121 | # health 122 | {description: "mmlu:model=neurips/local,subject=nutrition,data_augmentation=canonical,max_eval_instances=3",priority: 1} 123 | {description: "mmlu:model=neurips/local,subject=human_aging,data_augmentation=canonical,max_eval_instances=3",priority: 1} 124 | {description: "mmlu:model=neurips/local,subject=clinical_knowledge,data_augmentation=canonical,max_eval_instances=3",priority: 1} 125 | {description: "mmlu:model=neurips/local,subject=medical_genetics,data_augmentation=canonical,max_eval_instances=3",priority: 1} 126 | {description: "mmlu:model=neurips/local,subject=professional_medicine,data_augmentation=canonical,max_eval_instances=3",priority: 1} 127 | {description: "mmlu:model=neurips/local,subject=virology,data_augmentation=canonical,max_eval_instances=3",priority: 1} 128 | 129 | # social_studies 130 | {description: "mmlu:model=neurips/local,subject=high_school_government_and_politics,data_augmentation=canonical,max_eval_instances=3",priority: 1} 131 | {description: "mmlu:model=neurips/local,subject=high_school_geography,data_augmentation=canonical,max_eval_instances=3",priority: 1} 132 | {description: "mmlu:model=neurips/local,subject=us_foreign_policy,data_augmentation=canonical,max_eval_instances=4",priority: 1} 133 | {description: "mmlu:model=neurips/local,subject=public_relations,data_augmentation=canonical,max_eval_instances=4",priority: 1} 134 | {description: "mmlu:model=neurips/local,subject=security_studies,data_augmentation=canonical,max_eval_instances=4",priority: 1} 135 | 136 | # human_behavior 137 | {description: "mmlu:model=neurips/local,subject=high_school_psychology,data_augmentation=canonical,max_eval_instances=4",priority: 1} 138 | {description: "mmlu:model=neurips/local,subject=human_sexuality,data_augmentation=canonical,max_eval_instances=4",priority: 1} 139 | {description: "mmlu:model=neurips/local,subject=professional_psychology,data_augmentation=canonical,max_eval_instances=5",priority: 1} 140 | {description: "mmlu:model=neurips/local,subject=sociology,data_augmentation=canonical,max_eval_instances=5",priority: 1} 141 | 142 | # economics 143 | {description: "mmlu:model=neurips/local,subject=high_school_microeconomics,data_augmentation=canonical,max_eval_instances=6",priority: 1} 144 | {description: "mmlu:model=neurips/local,subject=econometrics,data_augmentation=canonical,max_eval_instances=6",priority: 1} 145 | {description: "mmlu:model=neurips/local,subject=high_school_macroeconomics,data_augmentation=canonical,max_eval_instances=6",priority: 1} 146 | 147 | # truthful_qa 148 | {description: 
"truthful_qa:task=mc_single,model=neurips/local,max_eval_instances=9",priority: 1} 149 | {description: "summarization_cnndm:model=neurips/local,max_eval_instances=9",priority: 1} 150 | 151 | # gsm 152 | {description: "gsm:model=neurips/local,max_eval_instances=19",priority: 1} 153 | 154 | # bbq 155 | {description: "bbq:subject=all,model=neurips/local,max_eval_instances=18",priority: 1} 156 | ] -------------------------------------------------------------------------------- /sample-submissions/lit-gpt/Dockerfile: -------------------------------------------------------------------------------- 1 | # Use latest official release with CUDA support https://hub.docker.com/r/pytorch/pytorch/tags 2 | FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel 3 | 4 | # Set the working directory in the container to /submission 5 | WORKDIR /submission 6 | 7 | # Copy the specific file into the container at /submission 8 | COPY /lit-gpt/ /submission/ 9 | 10 | # Setup server requriements 11 | COPY ./fast_api_requirements.txt fast_api_requirements.txt 12 | RUN pip install --no-cache-dir --upgrade -r fast_api_requirements.txt 13 | 14 | RUN apt-get update && apt-get install -y git 15 | # Install any needed packages specified in requirements.txt that come from lit-gpt plus some optionals 16 | RUN pip install -r requirements.txt huggingface_hub sentencepiece tokenizers bitsandbytes scipy 17 | 18 | # some huggingface_hub versions require that the target dir exists 19 | RUN mkdir -p checkpoints/openlm-research/open_llama_3b 20 | # get open-llama weights: https://github.com/Lightning-AI/lit-gpt/blob/main/tutorials/download_openllama.md 21 | RUN python scripts/download.py --repo_id openlm-research/open_llama_3b 22 | RUN python scripts/convert_hf_checkpoint.py --checkpoint_dir checkpoints/openlm-research/open_llama_3b 23 | 24 | # Copy over single file server 25 | COPY ./main.py /submission/main.py 26 | COPY ./helper.py /submission/helper.py 27 | COPY ./api.py /submission/api.py 28 | # Run the server 29 | CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "80"] 30 | -------------------------------------------------------------------------------- /sample-submissions/lit-gpt/README.md: -------------------------------------------------------------------------------- 1 | # Sample Submission 2 | This sample-submission contains a dockerfile that exposes a HTTP server. Requests will be made against this server during the evaluation phase of the competition 3 | 4 | ### Getting Started 5 | Make sure you have recursively cloned the top this repository in order to get lit-gpt. 6 | 7 | ❗ Make sure the repo is cloned with git submodule support either: 8 | 9 | ```sh 10 | git clone --recurse-submodules ... 11 | ``` 12 | 13 | or if you cloned the repo but are missing the `lit-gpt` folder 14 | 15 | ```sh 16 | git submodule update --init --recursive 17 | ``` 18 | 19 | ### Structure 20 | * lit-gpt/ 21 | * unmodified submodule that contains a hackable `torch.nn.Module` GPT definition as well as optional fine-tuning 22 | and inference code. 23 | * main.py 24 | * The process/ and tokenize/ endpoints are defined here 25 | * helper.py 26 | * Applies logic on top of lit-gpt's generate in order to produce responses in accordance with the spec. 
27 | * api.py 28 | * Defines the pydantic classes for the FastAPI server 29 | * Dockerfile 30 | * Definition of the image that will set up the server used for submissions 31 | 32 | ### Make your GPUs visible to Docker 33 | Follow this guide to install [nvidia-ctk](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html). 34 | ```sh 35 | nvidia-ctk runtime configure 36 | systemctl restart docker 37 | ``` 38 | 39 | ### Build and run 40 | ```sh 41 | docker build -t sample_submission . 42 | docker run --gpus all -p 8080:80 sample_submission 43 | ``` 44 | ### Send requests 45 | ```sh 46 | curl -X POST -H "Content-Type: application/json" -d '{"prompt": "The capital of France is "}' http://localhost:8080/process 47 | ``` 48 | -------------------------------------------------------------------------------- /sample-submissions/lit-gpt/api.py: -------------------------------------------------------------------------------- 1 | from pydantic import BaseModel 2 | 3 | from typing import List, Dict, Optional 4 | 5 | 6 | class ProcessRequest(BaseModel): 7 | prompt: str 8 | num_samples: int = 1 9 | max_new_tokens: int = 50 10 | top_k: int = 200 11 | temperature: float = 0.8 12 | seed: Optional[int] = None 13 | echo_prompt: Optional[bool] 14 | 15 | 16 | class Token(BaseModel): 17 | text: str 18 | logprob: float 19 | top_logprob: Dict[str, float] 20 | 21 | 22 | class ProcessResponse(BaseModel): 23 | text: str 24 | tokens: List[Token] 25 | logprob: float 26 | request_time: float 27 | 28 | 29 | class TokenizeRequest(BaseModel): 30 | text: str 31 | truncation: bool = True 32 | max_length: int = 2048 33 | 34 | 35 | class TokenizeResponse(BaseModel): 36 | tokens: List[int] 37 | request_time: float 38 | 39 | 40 | class DecodeRequest(BaseModel): 41 | tokens: List[int] 42 | 43 | 44 | class DecodeResponse(BaseModel): 45 | text: str 46 | request_time: float 47 | -------------------------------------------------------------------------------- /sample-submissions/lit-gpt/fast_api_requirements.txt: -------------------------------------------------------------------------------- 1 | # FAST API 2 | fastapi>=0.68.0,<0.69.0 3 | pydantic>=1.8.0,<2.0.0 4 | uvicorn>=0.15.0,<0.16.0 5 | -------------------------------------------------------------------------------- /sample-submissions/lit-gpt/helper.py: -------------------------------------------------------------------------------- 1 | from typing import List, Optional, Tuple 2 | 3 | import torch 4 | 5 | 6 | @torch.no_grad() 7 | def toysubmission_generate( 8 | model: torch.nn.Module, 9 | idx: torch.Tensor, 10 | max_returned_tokens: int, 11 | *, 12 | temperature: float = 1.0, 13 | top_k: Optional[int] = None, 14 | eos_id: Optional[int] = None, 15 | ) -> Tuple[List[int], List[float], List[Tuple[int, float]]]: 16 | """Takes a conditioning sequence (prompt) as input and continues to generate as many tokens as requested. 17 | 18 | The implementation of this function is modified from A. Karpathy's nanoGPT. 19 | 20 | Args: 21 | model: The model to use. 22 | idx: Tensor of shape (T) with indices of the prompt sequence. 23 | max_returned_tokens: The maximum number of tokens to return (given plus generated). 24 | temperature: Scales the predicted logits by 1 / temperature. 25 | top_k: If specified, only sample among the tokens with the k highest probabilities. 26 | eos_id: If specified, stop generating further tokens once this token is produced.
27 | 28 | Returns: 29 | Tuple containing the list of token indexes, the log probabilities of the selected tokens, and the 30 | (index, log probability) of the top token at each position. 31 | """ 32 | T = idx.size(0) 33 | assert max_returned_tokens > T 34 | if model.max_seq_length < max_returned_tokens - 1: 35 | # rolling the kv cache based on the `input_pos` value would be necessary. However, doing so would introduce a 36 | # data dependency on the `input_pos` tensor and impact model compilation. Since this setting is uncommon, we do 37 | # not support it to avoid negatively impacting the overall speed 38 | raise NotImplementedError( 39 | f"max_seq_length {model.max_seq_length} needs to be >= {max_returned_tokens - 1}" 40 | ) 41 | 42 | device, dtype = idx.device, idx.dtype 43 | # create an empty tensor of the expected final shape and fill in the current tokens 44 | empty = torch.empty(max_returned_tokens, dtype=dtype, device=device) 45 | # prefill empty with the prompt token indexes 46 | empty[:T] = idx 47 | idx = empty 48 | input_pos = torch.arange(0, T, device=device) 49 | 50 | top_logprob = [] 51 | logprob = [] 52 | 53 | # Generate log_prob and top_log_prob for the prompt 54 | logits = model(idx[:T].view(1, -1), input_pos) 55 | probs = torch.nn.functional.softmax(logits, dim=-1)[0] 56 | prompt_log_probs = torch.log(probs) 57 | prompt_max_probs, prompt_argmax_probs = torch.max(probs, dim=-1) 58 | # Grab the logprob for all the tokens in the prompt 59 | logprob.extend( 60 | prompt_log_probs.gather(-1, idx[:T, None].to(torch.int64)).squeeze(-1).tolist() 61 | ) 62 | top_logprob.extend( 63 | [ 64 | (argmax.item(), max_prob.item()) 65 | for argmax, max_prob in zip(prompt_argmax_probs, prompt_max_probs) 66 | ] 67 | ) 68 | 69 | # generate up to a fixed number of tokens 70 | for _ in range(max_returned_tokens - T): 71 | x = idx.index_select(0, input_pos).view(1, -1) 72 | 73 | # forward 74 | logits = model(x, input_pos) 75 | logits = logits[0, -1] / temperature 76 | 77 | # optionally crop the logits to only the top k options 78 | if top_k is not None: 79 | v, _ = torch.topk(logits, min(top_k, logits.size(-1))) 80 | logits = torch.where(logits < v[[-1]], -float("Inf"), logits) 81 | 82 | probs = torch.nn.functional.softmax(logits, dim=-1) 83 | 84 | idx_next = torch.multinomial(probs, num_samples=1).to(dtype=dtype) 85 | 86 | # append the logprob of the selected token 87 | logprob.append(torch.log(probs[idx_next]).item()) 88 | 89 | # append the idx and logprob of the top token 90 | top_logprob.append((torch.argmax(probs).item(), torch.log(probs).max().item())) 91 | 92 | # advance 93 | input_pos = input_pos[-1:] + 1 94 | 95 | # concatenate the new generation 96 | idx = idx.index_copy(0, input_pos, idx_next) 97 | 98 | # if the eos token is generated, return the output (stop generation) 99 | if idx_next == eos_id: 100 | return idx[:input_pos], logprob, top_logprob # include the EOS token 101 | 102 | return idx, logprob, top_logprob 103 | -------------------------------------------------------------------------------- /sample-submissions/lit-gpt/main.py: -------------------------------------------------------------------------------- 1 | from fastapi import FastAPI 2 | 3 | import logging 4 | 5 | # Lit-GPT imports 6 | import sys 7 | import time 8 | from pathlib import Path 9 | import json 10 | 11 | # support running without installing as a package 12 | wd = Path(__file__).parent.parent.resolve() 13 | sys.path.append(str(wd)) 14 | 15 | import lightning as L 16 | import torch 17 | 18 | torch.set_float32_matmul_precision("high") 19 | 20 | from lit_gpt
import GPT, Tokenizer, Config 21 | from lit_gpt.utils import lazy_load, quantization 22 | 23 | # Toy submission imports 24 | from helper import toysubmission_generate 25 | from api import ( 26 | ProcessRequest, 27 | ProcessResponse, 28 | TokenizeRequest, 29 | TokenizeResponse, 30 | Token, 31 | DecodeRequest, 32 | DecodeResponse 33 | ) 34 | 35 | app = FastAPI() 36 | 37 | logger = logging.getLogger(__name__) 38 | # Configure the logging module 39 | logging.basicConfig(level=logging.INFO) 40 | 41 | quantize = "bnb.nf4-dq" # 4-bit NormalFloat with Double-Quantization (see QLoRA paper) 42 | checkpoint_dir = Path("checkpoints/openlm-research/open_llama_3b") 43 | precision = "bf16-true" # weights and data in bfloat16 precision 44 | 45 | fabric = L.Fabric(devices=1, accelerator="cuda", precision=precision) 46 | 47 | with open(checkpoint_dir / "lit_config.json") as fp: 48 | config = Config(**json.load(fp)) 49 | 50 | checkpoint_path = checkpoint_dir / "lit_model.pth" 51 | logger.info(f"Loading model {str(checkpoint_path)!r} with {config.__dict__}") 52 | with fabric.init_module(empty_init=True), quantization(quantize): 53 | model = GPT(config) 54 | 55 | with lazy_load(checkpoint_path) as checkpoint: 56 | model.load_state_dict(checkpoint, strict=quantize is None) 57 | 58 | model.eval() 59 | model = fabric.setup(model) 60 | 61 | tokenizer = Tokenizer(checkpoint_dir) 62 | 63 | 64 | @app.post("/process") 65 | async def process_request(input_data: ProcessRequest) -> ProcessResponse: 66 | if input_data.seed is not None: 67 | L.seed_everything(input_data.seed) 68 | logger.info("Using device: {}".format(fabric.device)) 69 | encoded = tokenizer.encode( 70 | input_data.prompt, bos=True, eos=False, device=fabric.device 71 | ) 72 | prompt_length = encoded.size(0) 73 | max_returned_tokens = prompt_length + input_data.max_new_tokens 74 | 75 | with fabric.init_tensor(): 76 | # set the max_seq_length to limit the memory usage to what we need 77 | model.max_seq_length = max_returned_tokens 78 | # enable the kv cache 79 | model.set_kv_cache(batch_size=1) 80 | 81 | 82 | t0 = time.perf_counter() 83 | tokens, logprobs, top_logprobs = toysubmission_generate( 84 | model, 85 | encoded, 86 | max_returned_tokens, 87 | temperature=input_data.temperature, 88 | top_k=input_data.top_k, 89 | ) 90 | 91 | t = time.perf_counter() - t0 92 | 93 | if input_data.echo_prompt is False: 94 | output = tokenizer.decode(tokens[prompt_length:]) 95 | tokens = tokens[prompt_length:] 96 | logprobs = logprobs[prompt_length:] 97 | top_logprobs = top_logprobs[prompt_length:] 98 | else: 99 | output = tokenizer.decode(tokens) 100 | tokens_generated = tokens.size(0) - prompt_length 101 | logger.info( 102 | f"Time for inference: {t:.02f} sec total, {tokens_generated / t:.02f} tokens/sec" 103 | ) 104 | 105 | logger.info(f"Memory used: {torch.cuda.max_memory_reserved() / 1e9:.02f} GB") 106 | generated_tokens = [] 107 | for t, lp, tlp in zip(tokens, logprobs, top_logprobs): 108 | idx, val = tlp 109 | tok_str = tokenizer.processor.decode([idx]) 110 | token_tlp = {tok_str: val} 111 | generated_tokens.append( 112 | Token(text=tokenizer.decode(t), logprob=lp, top_logprob=token_tlp) 113 | ) 114 | logprobs_sum = sum(logprobs) 115 | # Process the input data here 116 | return ProcessResponse( 117 | text=output, tokens=generated_tokens, logprob=logprobs_sum, request_time=t 118 | ) 119 | 120 | 121 | @app.post("/tokenize") 122 | async def tokenize(input_data: TokenizeRequest) -> TokenizeResponse: 123 | logger.info("Using device: {}".format(fabric.device)) 124 | t0 = 
time.perf_counter() 125 | encoded = tokenizer.encode( 126 | input_data.text, bos=True, eos=False, device=fabric.device 127 | ) 128 | t = time.perf_counter() - t0 129 | tokens = encoded.tolist() 130 | return TokenizeResponse(tokens=tokens, request_time=t) 131 | 132 | 133 | @app.post("/decode") 134 | async def decode(input_data: DecodeRequest) -> DecodeResponse: 135 | logger.info("Using device: {}".format(fabric.device)) 136 | t0 = time.perf_counter() 137 | # decoded = tokenizer.decode(torch.Tensor(input_data.tokens)) 138 | decoded = tokenizer.processor.decode(input_data.tokens) 139 | t = time.perf_counter() - t0 140 | return DecodeResponse(text=decoded, request_time=t) -------------------------------------------------------------------------------- /sample-submissions/llama_recipes/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel 2 | 3 | RUN apt-get update && apt-get install -y git python3-virtualenv wget 4 | 5 | RUN pip install -U --no-cache-dir git+https://github.com/facebookresearch/llama-recipes.git@eafea7b366bde9dc3f0b66a4cb0a8788f560c793 6 | 7 | WORKDIR /workspace 8 | # Setup server requirements 9 | COPY ./fast_api_requirements.txt fast_api_requirements.txt 10 | RUN pip install --no-cache-dir --upgrade -r fast_api_requirements.txt 11 | 12 | ENV HUGGINGFACE_TOKEN="YOUR_TOKEN" 13 | ENV HUGGINGFACE_REPO="YOUR_USERNAME/YOUR_REPO" 14 | 15 | # Copy over single file server 16 | COPY ./main.py main.py 17 | COPY ./api.py api.py 18 | # Run the server 19 | CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "80"] 20 | -------------------------------------------------------------------------------- /sample-submissions/llama_recipes/Dockerfile.train: -------------------------------------------------------------------------------- 1 | FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel 2 | 3 | RUN apt-get update && apt-get install -y git python3-virtualenv wget 4 | 5 | RUN pip install -U --no-cache-dir git+https://github.com/facebookresearch/llama-recipes.git@eafea7b366bde9dc3f0b66a4cb0a8788f560c793 6 | 7 | WORKDIR /workspace 8 | 9 | RUN wget https://gist.githubusercontent.com/mreso/ec65015cbfbd395f0c2adc17147adf1f/raw/41070f1058820b9e89bae885968cc666a7d6aa59/custom_dataset.py 10 | 11 | ENV HUGGINGFACE_TOKEN="YOUR_TOKEN" 12 | ENV HUGGINGFACE_REPO="YOUR_USERNAME/YOUR_REPO" 13 | 14 | COPY train.py ./ 15 | 16 | CMD [ "python", "train.py"] 17 | -------------------------------------------------------------------------------- /sample-submissions/llama_recipes/README.md: -------------------------------------------------------------------------------- 1 | # Llama-recipes Example 2 | This example demonstrates how to fine-tune and serve a Llama 2 model with llama-recipes for submission in the LLM efficiency challenge, using the [lit-gpt](../lit-gpt/) example as a template. 3 | Llama-recipes provides an easy way to fine-tune a Llama 2 model with custom datasets using efficient techniques like LoRA or Llama-adapters. 4 | 5 | # Getting Started 6 | In order to use llama-recipes we need to install the following pip package: 7 | 8 | ``` 9 | pip install llama-recipes 10 | ``` 11 | 12 | To obtain access to the model weights you need to fill out this [form](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) to accept the license terms and acceptable use policy. 13 | 14 | After access has been granted, you need to acknowledge this in your HuggingFace account for the model you want to fine-tune.
In this example we will continue with the 7B parameter version available under this identifier: meta-llama/Llama-2-7b-hf 15 | 16 | **NOTE** In this example the training result will be uploaded and downloaded through huggingface_hub. The authentication will be done through a token created in the settings of your HuggingFace account. 17 | Make sure to give write access to the token and set the env variables in the Dockerfiles to your token and repo: 18 | 19 | ```bash 20 | ENV HUGGINGFACE_TOKEN="YOUR_TOKEN" 21 | ENV HUGGINGFACE_REPO="YOUR_USERNAME/YOUR_REPO" 22 | ``` 23 | 24 | # Fine-tune The Model 25 | With llama-recipes it's possible to fine-tune Llama on custom data with a single command. To fine-tune on a custom dataset we need to implement a function (get_custom_dataset) that provides the custom dataset, following this example: [custom_dataset.py](https://github.com/facebookresearch/llama-recipes/blob/main/examples/custom_dataset.py). 26 | We can then train on this dataset using this command line: 27 | 28 | ```bash 29 | python3 -m llama_recipes.finetuning --use_peft --peft_method lora --quantization --model_name meta-llama/Llama-2-7b --dataset custom_dataset --custom_dataset.file /workspace/custom_dataset.py --output_dir /volume/output_dir 30 | ``` 31 | 32 | **Note** The custom dataset in this example is dialog-based. This is only due to the nature of the example, not a necessity of the custom dataset functionality. To see other examples of get_custom_dataset functions (by the way, the name of the function get_custom_dataset can be changed on the command line using this syntax: /workspace/custom_dataset.py:get_foo_dataset) have a look at the [built-in datasets in llama-recipes](https://github.com/facebookresearch/llama-recipes/blob/main/src/llama_recipes/datasets/__init__.py). 33 | 34 | # Create Submission 35 | *Note* For a submission to the competition only the inference part (Dockerfile) will be necessary. A training Docker (Dockerfile.train) will only be necessary to replicate the submission in case you're within the top 3 contestants. 36 | 37 | ## Prepare Leaderboard Submission 38 | The inference Docker will download base and LoRA weights from huggingface_hub. For the submission it is assumed that the trained weights are uploaded to a repo on huggingface_hub and the env variables HUGGINGFACE_TOKEN and HUGGINGFACE_REPO have been updated accordingly in the [Dockerfile](./Dockerfile). 39 | 40 | To create the zip file for submission to the eval bot use the following commands: 41 | ```bash 42 | cd neurips_llm_efficiency_challenge/sample-submissions 43 | rm llama_recipes/Dockerfile.train 44 | zip -r llama_recipes.zip llama_recipes 45 | ``` 46 | *Note* 1. Make sure to only zip the folder llama_recipes and do not include any other sample submission in the zipfile. 2. We delete llama_recipes/Dockerfile.train as a precaution to avoid errors if the submission logic changes. 47 | 48 | ## Run Training And Inference Docker Locally 49 | To locally build and run the training Docker we need to execute: 50 | 51 | ```bash 52 | docker build -f ./Dockerfile.train -t llama_recipes_train . 53 | 54 | docker run --gpus "device=0" --rm -ti llama_recipes_train 55 | ``` 56 | 57 | The inference Docker can be created and started locally with: 58 | 59 | ```bash 60 | docker build -f ./Dockerfile -t llama_recipes_inference .
61 | 62 | docker run --gpus "device=0" -p 8080:80 --rm -ti llama_recipes_inference 63 | ``` 64 | 65 | To test the inference docker we can run this query: 66 | 67 | ```bash 68 | curl -X POST -H "Content-Type: application/json" -d '{"text": "What is the capital of france? "}' http://localhost:8080/tokenize 69 | OR 70 | curl -X POST -H "Content-Type: application/json" -d '{"prompt": "What is the capital of france? "}' http://localhost:8080/process 71 | ``` 72 | -------------------------------------------------------------------------------- /sample-submissions/llama_recipes/api.py: -------------------------------------------------------------------------------- 1 | from pydantic import BaseModel 2 | 3 | from typing import List, Dict, Optional 4 | 5 | 6 | class ProcessRequest(BaseModel): 7 | prompt: str 8 | num_samples: int = 1 9 | max_new_tokens: int = 50 10 | top_k: int = 200 11 | temperature: float = 0.8 12 | seed: Optional[int] = None 13 | echo_prompt: Optional[bool] 14 | 15 | 16 | class Token(BaseModel): 17 | text: str 18 | logprob: float 19 | top_logprob: Dict[str, float] 20 | 21 | 22 | class ProcessResponse(BaseModel): 23 | text: str 24 | tokens: List[Token] 25 | logprob: float 26 | request_time: float 27 | 28 | 29 | class TokenizeRequest(BaseModel): 30 | text: str 31 | truncation: bool = True 32 | max_length: int = 2048 33 | 34 | 35 | class TokenizeResponse(BaseModel): 36 | tokens: List[int] 37 | request_time: float 38 | -------------------------------------------------------------------------------- /sample-submissions/llama_recipes/fast_api_requirements.txt: -------------------------------------------------------------------------------- 1 | # FAST API 2 | fastapi>=0.68.0,<0.69.0 3 | pydantic>=1.8.0,<2.0.0 4 | uvicorn>=0.15.0,<0.16.0 5 | -------------------------------------------------------------------------------- /sample-submissions/llama_recipes/main.py: -------------------------------------------------------------------------------- 1 | from fastapi import FastAPI 2 | 3 | import logging 4 | import os 5 | import time 6 | 7 | import torch 8 | from huggingface_hub import login 9 | from transformers import LlamaTokenizer, LlamaForCausalLM 10 | from llama_recipes.inference.model_utils import load_peft_model 11 | 12 | torch.set_float32_matmul_precision("high") 13 | 14 | from api import ( 15 | ProcessRequest, 16 | ProcessResponse, 17 | TokenizeRequest, 18 | TokenizeResponse, 19 | Token, 20 | ) 21 | 22 | app = FastAPI() 23 | 24 | logger = logging.getLogger(__name__) 25 | # Configure the logging module 26 | logging.basicConfig(level=logging.INFO) 27 | 28 | login(token=os.environ["HUGGINGFACE_TOKEN"]) 29 | 30 | model = LlamaForCausalLM.from_pretrained( 31 | 'meta-llama/Llama-2-7b-hf', 32 | return_dict=True, 33 | torch_dtype=torch.float16, 34 | device_map="cuda" 35 | ) 36 | model = load_peft_model(model, os.environ["HUGGINGFACE_REPO"]) 37 | 38 | model.eval() 39 | 40 | tokenizer = LlamaTokenizer.from_pretrained('meta-llama/Llama-2-7b') 41 | 42 | LLAMA2_CONTEXT_LENGTH = 4096 43 | 44 | 45 | @app.post("/process") 46 | async def process_request(input_data: ProcessRequest) -> ProcessResponse: 47 | if input_data.seed is not None: 48 | torch.manual_seed(input_data.seed) 49 | 50 | encoded = tokenizer(input_data.prompt, return_tensors="pt") 51 | 52 | prompt_length = encoded["input_ids"][0].size(0) 53 | max_returned_tokens = prompt_length + input_data.max_new_tokens 54 | assert max_returned_tokens <= LLAMA2_CONTEXT_LENGTH, ( 55 | max_returned_tokens, 56 | LLAMA2_CONTEXT_LENGTH, 57 | ) 58 | 59 | t0 = 
time.perf_counter() 60 | encoded = {k: v.to("cuda") for k, v in encoded.items()} 61 | with torch.no_grad(): 62 | outputs = model.generate( 63 | **encoded, 64 | max_new_tokens=input_data.max_new_tokens, 65 | do_sample=True, 66 | temperature=input_data.temperature, 67 | top_k=input_data.top_k, 68 | return_dict_in_generate=True, 69 | output_scores=True, 70 | ) 71 | 72 | t = time.perf_counter() - t0 73 | if not input_data.echo_prompt: 74 | output = tokenizer.decode(outputs.sequences[0][prompt_length:], skip_special_tokens=True) 75 | else: 76 | output = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True) 77 | 78 | tokens_generated = outputs.sequences[0].size(0) - prompt_length 79 | logger.info( 80 | f"Time for inference: {t:.02f} sec total, {tokens_generated / t:.02f} tokens/sec" 81 | ) 82 | 83 | logger.info(f"Memory used: {torch.cuda.max_memory_reserved() / 1e9:.02f} GB") 84 | generated_tokens = [] 85 | 86 | log_probs = torch.log(torch.stack(outputs.scores, dim=1).softmax(-1)) 87 | 88 | gen_sequences = outputs.sequences[:, encoded["input_ids"].shape[-1]:] 89 | gen_logprobs = torch.gather(log_probs, 2, gen_sequences[:, :, None]).squeeze(-1) 90 | 91 | top_indices = torch.argmax(log_probs, dim=-1) 92 | top_logprobs = torch.gather(log_probs, 2, top_indices[:,:,None]).squeeze(-1) 93 | top_indices = top_indices.tolist()[0] 94 | top_logprobs = top_logprobs.tolist()[0] 95 | 96 | for t, lp, tlp in zip(gen_sequences.tolist()[0], gen_logprobs.tolist()[0], zip(top_indices, top_logprobs)): 97 | idx, val = tlp 98 | tok_str = tokenizer.decode(idx) 99 | token_tlp = {tok_str: val} 100 | generated_tokens.append( 101 | Token(text=tokenizer.decode(t), logprob=lp, top_logprob=token_tlp) 102 | ) 103 | logprob_sum = gen_logprobs.sum().item() 104 | 105 | return ProcessResponse( 106 | text=output, tokens=generated_tokens, logprob=logprob_sum, request_time=t 107 | ) 108 | 109 | 110 | @app.post("/tokenize") 111 | async def tokenize(input_data: TokenizeRequest) -> TokenizeResponse: 112 | t0 = time.perf_counter() 113 | encoded = tokenizer( 114 | input_data.text 115 | ) 116 | t = time.perf_counter() - t0 117 | tokens = encoded["input_ids"] 118 | return TokenizeResponse(tokens=tokens, request_time=t) 119 | -------------------------------------------------------------------------------- /sample-submissions/llama_recipes/train.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from huggingface_hub import login, HfApi 4 | from llama_recipes.finetuning import main as finetuning 5 | 6 | def main(): 7 | login(token=os.environ["HUGGINGFACE_TOKEN"]) 8 | 9 | kwargs = { 10 | "model_name": "meta-llama/Llama-2-7b-hf", 11 | "use_peft": True, 12 | "peft_method": "lora", 13 | "quantization": True, 14 | "batch_size_training": 2, 15 | "dataset": "custom_dataset", 16 | "custom_dataset.file": "./custom_dataset.py", 17 | "output_dir": "./output_dir", 18 | } 19 | 20 | finetuning(**kwargs) 21 | 22 | api = HfApi() 23 | 24 | api.upload_folder( 25 | folder_path='./output_dir/', 26 | repo_id=os.environ["HUGGINGFACE_REPO"], 27 | repo_type='model', 28 | ) 29 | 30 | if __name__ == "__main__": 31 | main() --------------------------------------------------------------------------------
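As a closing illustration (not a file in this repository), below is a minimal smoke-test client one might run against either sample submission once its container is up and mapped to localhost:8080, as in the `docker run` examples above. It only assumes what the sample `api.py` files define: a `/process` endpoint taking a JSON body with `prompt` and optional sampling parameters, a `/tokenize` endpoint taking `text`, and responses carrying `text`, `tokens`, `logprob`, and `request_time`. The script name, the choice of the `requests` library, and the example prompt are illustrative assumptions, not part of the competition spec.

```python
# smoke_test.py -- illustrative sketch, not part of the sample submissions.
# Assumes a submission container is already running and mapped to port 8080,
# e.g. `docker run --gpus all -p 8080:80 sample_submission`.
import requests

BASE_URL = "http://localhost:8080"


def main() -> None:
    # /tokenize accepts {"text": ...} and returns token ids plus timing,
    # matching TokenizeRequest / TokenizeResponse in the sample api.py.
    tok = requests.post(
        f"{BASE_URL}/tokenize",
        json={"text": "The capital of France is "},
        timeout=60,
    )
    tok.raise_for_status()
    tok_body = tok.json()
    assert isinstance(tok_body["tokens"], list) and "request_time" in tok_body

    # /process accepts {"prompt": ...} plus optional sampling parameters,
    # matching ProcessRequest in the sample api.py.
    proc = requests.post(
        f"{BASE_URL}/process",
        json={"prompt": "The capital of France is ", "max_new_tokens": 20, "temperature": 0.8, "seed": 0},
        timeout=300,
    )
    proc.raise_for_status()
    body = proc.json()
    # ProcessResponse carries the generated text, per-token logprobs, a summed logprob, and the request time.
    assert all(key in body for key in ("text", "tokens", "logprob", "request_time"))
    print(body["text"])


if __name__ == "__main__":
    main()
```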