# HAE-RAE BENCH

## About HAE-RAE BENCH
HAE-RAE Bench is a benchmark dataset built to evaluate the Korean-language abilities of language models (vocabulary, history, general knowledge, and reading comprehension).
Dataset available at: [huggingface](https://huggingface.co/datasets/HAERAE-HUB/HAE_RAE_BENCH)

## Update Logs
### 2023.09.26
The paper [HAE-RAE Bench: Evaluation of Korean Knowledge in Language Models](https://arxiv.org/abs/2309.02706) is now on arXiv.

### 2023.06.02
Added benchmark results for mT5, KULLM, and KoAlpaca.

### 2023.05.23
Paper written and submitted.

### 2023.05.11
Released a preview of HAE-RAE Bench, a benchmark that evaluates language models across four areas of Korean: vocabulary, reading comprehension, grammar, and knowledge.

## HAE-RAE BENCH (v1)
The dataset is composed as follows.

| Category                   | Sample Size | Source |
|----------------------------|-------------|--------|
| Loan Words (LW)            | 169         | NIKL   |
| Rare Words (RW)            | 405         | -      |
| Standard Nomenclature (SN) | 153         | NIKL   |
| Reading Comprehension (RC) | 447         | KLAT   |
| General Knowledge (GK)     | 176         | -      |
| History (HI)               | 188         | -      |

### Benchmark Results (Ongoing)

#### Evaluation Methods

- 3-shot: For LLMs (Text-Davinci-003 & HyperClova LK-D2), we provide three instructive examples followed by an incomplete one; the model is expected to respond with a number from one to five indicating its most probable answer.

- log-likelihood: For the remaining models, we score the instruction-answer pair formed with each available option in every question. The option with the highest log-likelihood is taken as the model's answer (see the sketch after this list).

- **Disclaimer: The 3-shot setting is more challenging than the log-likelihood method, so direct comparisons between models evaluated with these different methods can be misleading.**
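The log-likelihood protocol is simple to reproduce. Below is a minimal sketch, assuming a Hugging Face causal LM (Polyglot-Ko 1.3B as a stand-in for any open model in the table) and an inline hypothetical item; real questions would be loaded from the Hugging Face dataset linked above, whose config and column names are documented on the dataset card. This is an illustration, not the authors' actual evaluation harness.

```python
# Minimal sketch of log-likelihood option scoring with a Hugging Face causal LM.
# The model choice, prompt template, and option strings are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/polyglot-ko-1.3b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def option_log_likelihood(prompt: str, option: str) -> float:
    """Sum of token log-probabilities of `option` conditioned on `prompt`."""
    # Assumes the tokenizer splits cleanly at the prompt/option boundary.
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Position i of the logits predicts token i + 1, hence the one-step shift.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    start = prompt_len - 1  # first target token belonging to the option
    return log_probs[start:].gather(1, targets[start:, None]).sum().item()

# Hypothetical five-option item; real items come from the dataset linked above.
prompt = "질문: 다음 중 올바른 표기는 무엇입니까?\n정답:"
options = [" 보기1", " 보기2", " 보기3", " 보기4", " 보기5"]
scores = [option_log_likelihood(prompt, opt) for opt in options]
print("predicted option:", scores.index(max(scores)) + 1)  # 1-indexed answer
```

Note that summing unnormalized token log-probabilities favors shorter options; length-normalizing the score is a common variant.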
| Models            | LW    | RW    | SN    | RC    | HI    | GK    | Avg.  |
|-------------------|-------|-------|-------|-------|-------|-------|-------|
| Text-Davinci-003  | 62.2  | 62.2  | 58.7  | 60.1  | 30.3  | 21.4  | 49.2  |
| HyperClova LK-D2  | 83.1  | 82.3  | 78.0  | 54.5  | 81.6  | 45.1  | 70.8  |
| Polyglot-Ko 1.3B  | 76.33 | 48.15 | 58.82 | 34.45 | 60.64 | 26.14 | 50.76 |
| Polyglot-Ko 3.8B  | 78.7  | 47.41 | 64.71 | 40.72 | 69.68 | 28.41 | 54.94 |
| Polyglot-Ko 5.8B  | 82.84 | 57.04 | 67.32 | 40.72 | 79.79 | 29.55 | 59.54 |
| Polyglot-Ko 12.8B | 87.57 | 53.33 | 61.44 | 41.61 | 80.32 | 33.2  | 59.58 |
| UMT5-Small        | 43.79 | 23.95 | 26.80 | 23.27 | 20.74 | 15.91 | 25.74 |
| UMT5-Base         | 50.30 | 23.21 | 25.49 | 22.82 | 17.55 | 17.61 | 26.16 |
| UMT5-XL           | 58.58 | 25.68 | 41.83 | 24.83 | 14.36 | 22.16 | 31.24 |
| UMT5-XXL          | 58.58 | 33.09 | 41.83 | 29.75 | 21.81 | 21.59 | 34.44 |

## Contact
For any inquiries regarding the evaluation, the dataset, or anything else, please contact us at [spthsrbwls123@yonsei.ac.kr](mailto:spthsrbwls123@yonsei.ac.kr).

## Acknowledgement
This project is sponsored by [OnelineAI](https://www.onelineai.com/).

## Contributors (in Korean alphabetical order)
- 김송성 (Benchmark Team member)
- 김수완 (Benchmark Team member)
- 김정우 (Benchmark Team member)
- 김휘서 (Baseline Team member)
- 손규진 (Benchmark Team lead)
- 염제원 (Baseline Team member)
- 이재철 (Benchmark Team member)
- 이한울 (Baseline Team lead)
- 정지휴 (Benchmark Team member)

## References

```bibtex
@misc{polyglot-ko,
  title = {{Polyglot-Ko: Open-Source Korean Autoregressive Language Model}},
  author = {Ko, Hyunwoong and Yang, Kichang and Ryu, Minho and Choi, Taekyoon and Yang, Seungmu and Hyun, Jiwung and Park, Sungho},
  url = {https://www.github.com/eleutherai/polyglot},
  month = {9},
  year = {2022},
}
```

```bibtex
@misc{alpaca,
  author = {Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto},
  title = {Stanford Alpaca: An Instruction-following LLaMA Model},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/tatsu-lab/stanford_alpaca}},
}
```

```bibtex
@misc{kullm,
  author = {NLP \& AI Lab and Human-Inspired AI Research},
  title = {KULLM: Korea University Large Language Model Project},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/nlpai-lab/kullm}},
}
```