├── README.md
├── baseline.py
└── scrambled_sampled_dataset.json

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# ASC23-LLM inference optimization

The dataset and baseline code for the ASC23 LLM inference optimization challenge.

## Challenge description

This competition focuses on LLM inference optimization: participating teams are required to build an inference engine based on LLaMA2-70B and achieve high throughput on the 10,000-sample dataset provided by the ASC24 Committees.

## Dataset

The dataset used in the preliminary round has 10k samples, each consisting of a prompt and its prompt length. The dataset has the following characteristics:

+ Multi-domain Coverage: The dataset contains text from various domains, including news, encyclopedias, novels, forums, and more, covering different topics, styles, and viewpoints, which helps the model generalize across tasks from different domains.

+ Multilingual Support: The dataset includes text in multiple languages such as English, Chinese, Korean, and Spanish, which allows the model to understand and generate text across different languages.

+ Large-scale Data: The dataset is sampled from a massive amount of text data, which helps improve the language understanding and generation capabilities of the model.

+ Length Diversity: Sequences that are too long or too short are filtered out. The dataset contains 10k samples with prompt lengths ranging from 4 to 1024, covering the vast majority of lengths encountered in everyday use.

In summary, the dataset provides high-quality language samples with multi-domain coverage, multilingual support, large scale, and length diversity.

## Baseline code

The ASC24 committees supply baseline code to benchmark throughput and total tokens. Participants can start from it and modify it for higher inference performance.

Usage: after downloading the dataset, the following example runs the script. More information can be obtained with `-h`.

The parameter `--num-samples` is intended only for testing. Participants should run the whole 10k-sample dataset for this challenge.

```bash
CUDA_VISIBLE_DEVICES=0 python baseline.py --dataset /your_data_path/scrambled_sampled_dataset.json --model /your_model_path/hf_model_weights --num-samples=10
```

Besides, the model weights of LLaMA2-70B can be downloaded from: https://huggingface.co/meta-llama/Llama-2-70b or https://huggingface.co/meta-llama/Llama-2-70b-hf.
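
For a quick sanity check before launching the full benchmark, the dataset file can be inspected directly. The schema is not spelled out in this README; the sketch below assumes each entry is a `[prompt, prompt_len, output_len]` triple, as unpacked by `run_hf` in `baseline.py`:

```python
import json

# Assumed layout, inferred from baseline.py: a JSON array whose entries
# unpack as (prompt, prompt_len, output_len).
with open("scrambled_sampled_dataset.json") as f:
    requests = json.load(f)

print(f"Number of samples: {len(requests)}")

prompt, prompt_len, output_len = requests[0]
print(f"First sample: prompt_len={prompt_len}, output_len={output_len}")
print(f"Prompt preview: {prompt[:80]!r}")

prompt_lens = [entry[1] for entry in requests]
print(f"Prompt length range: {min(prompt_lens)}-{max(prompt_lens)}")
```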

## Contact

Email: info@asc-events.org

--------------------------------------------------------------------------------
/baseline.py:
--------------------------------------------------------------------------------
"""Benchmark offline inference throughput."""
import argparse
import json
import random
import time
from typing import List, Tuple

import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          PreTrainedTokenizerBase)
from tqdm import tqdm


def run_hf(
    requests: List[Tuple[str, int, int]],
    model: str,
    tokenizer: PreTrainedTokenizerBase,
    trust_remote_code: bool,
) -> Tuple[float, List[int], List[int]]:
    """Run all requests sequentially with the HF backend and return the
    elapsed time along with the per-request input/output token counts."""
    llm = AutoModelForCausalLM.from_pretrained(
        model, torch_dtype=torch.float16, trust_remote_code=trust_remote_code)
    if llm.config.model_type == "llama":
        # To enable padding in the HF backend.
        tokenizer.pad_token = tokenizer.eos_token
    llm = llm.cuda()

    input_num_tokens = []
    output_num_tokens = []
    start = time.perf_counter()
    for i in tqdm(range(len(requests))):
        prompt, prompt_len, output_len = requests[i]
        # Tokenize the prompt and generate the sequence.
        input_ids = tokenizer(prompt, return_tensors="pt",
                              padding=True).input_ids
        llm_outputs = llm.generate(
            input_ids=input_ids.cuda(),
            do_sample=False,
            num_return_sequences=1,
            num_beams=1,
            temperature=1.0,
            top_p=1.0,
            use_cache=True,
            max_new_tokens=output_len,
        )
        # Include the decoding time in the measurement.
        tokenizer.decode(llm_outputs[0], skip_special_tokens=True)
        input_num_tokens.append(len(input_ids[0]))
        # Note: llm_outputs[0] contains the prompt tokens followed by the
        # newly generated tokens.
        output_num_tokens.append(len(llm_outputs[0]))

    end = time.perf_counter()
    return end - start, input_num_tokens, output_num_tokens


def main(args: argparse.Namespace):
    print(args)
    random.seed(args.seed)

    # Load the tokenizer and build the request list.
    tokenizer = AutoTokenizer.from_pretrained(
        args.tokenizer, trust_remote_code=args.trust_remote_code)
    if args.dataset is None:
        # Synthesize a prompt with the given input length.
        prompt = "hi" * (args.input_len - 1)
        requests = [(prompt, args.input_len, args.output_len)
                    for _ in range(args.num_samples)]
    else:
        with open(args.dataset) as f:
            requests = json.load(f)

    if args.num_samples is not None:
        requests = requests[0:args.num_samples]

    elapsed_time, input_num_tokens, output_num_tokens = run_hf(
        requests, args.model, tokenizer, args.trust_remote_code)
    prompt_num_tokens = sum(input_num_tokens)
    total_num_tokens = sum(output_num_tokens)
    print(f"Throughput: {len(requests) / elapsed_time:.2f} requests/s \n"
          f"Tokens/s: {total_num_tokens / elapsed_time:.2f} tokens/s \n"
          f"Prompt_num_tokens: {prompt_num_tokens} tokens \n"
          f"Total_num_tokens: {total_num_tokens} tokens \n")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Benchmark the throughput.")
    parser.add_argument("--dataset", type=str, default=None,
                        help="Path to the dataset.")
    parser.add_argument("--model", type=str, default="meta/llama2-70b")
    parser.add_argument("--tokenizer", type=str, default=None)
    parser.add_argument("--input-len", type=int, default=None,
                        help="Input prompt length for each request")
    parser.add_argument("--output-len", type=int, default=None,
                        help="Output length for each request")
    parser.add_argument("--num-samples", type=int, default=None,
                        help="Number of first few samples used for inference test")
    parser.add_argument("--seed", type=int, default=0)
    parser.add_argument('--trust-remote-code',
                        action='store_true',
                        help='trust remote code from huggingface')
    parser.add_argument(
        '--dtype',
        type=str,
        default='auto',
        choices=['auto', 'half', 'float16', 'bfloat16', 'float', 'float32'],
        help='data type for model weights and activations. '
             'The "auto" option will use FP16 precision '
             'for FP32 and FP16 models, and BF16 precision '
             'for BF16 models.')
    args = parser.parse_args()
    if args.tokenizer is None:
        args.tokenizer = args.model
    if args.dataset is None:
        # Synthetic mode: both lengths must be given explicitly.
        assert args.input_len is not None
        assert args.output_len is not None
    else:
        # Dataset mode: per-request lengths come from the dataset itself.
        assert args.input_len is None
        assert args.output_len is None

    main(args)
--------------------------------------------------------------------------------