├── HAE-RAE Bench
│   └── README.md
├── README.md
└── blog
    ├── CSAT-QA.md
    └── assets
        ├── csat_histogram.png
        ├── csat_spyder.png
        ├── csat_token.png
        ├── csatqa.xlsx
        └── test_results.csv

--------------------------------------------------------------------------------
/HAE-RAE Bench/README.md:
--------------------------------------------------------------------------------

# HAE-RAE BENCH

## About HAE-RAE BENCH
HAE-RAE Bench is a specialized benchmark developed to assess the proficiency of language models in the Korean context. It comprises six tasks spanning vocabulary, history, and general knowledge, and is designed to test a language model's ability to comprehend and recall information found exclusively in Korean corpora.

## HAE-RAE BENCH (v1)
The following table gives an overview of the HAE-RAE benchmark. A "-" in the Source column denotes questions crafted by the contributors of the project.

| Category | Sample Size | Source |
|---------------------------|-------------|--------|
| Loan Words (LW) | 169 | NIKL |
| Rare Words (RW) | 405 | - |
| Standard Nomenclature (SN)| 153 | NIKL |
| Reading Comprehension (RC)| 447 | KLAT |
| General Knowledge (GK) | 176 | - |
| History (HI) | 188 | - |

#### Benchmark Results (Ongoing)

##### Evaluation Methods

- 3-shot: For LLMs (Text-Davinci-003 & HyperClova LK-D2), we provide three instructive examples and an incomplete fourth; the model is expected to respond with a number from one to five, indicating its most probable answer.

- log-likelihood: For the remaining models, we evaluate the probability of the instruction-answer pair for each available option in every question. The option with the highest log-likelihood is taken as the model's answer. (A minimal sketch of this scoring appears after the results table below.)

- **Disclaimer: The 3-shot setting is more challenging than the log-likelihood method, so direct comparisons between models evaluated with these different methods can be misleading.**

| Models | LW | RW | SN | RC | HI | GK | Av. (w/o GK) |
|-------------------|------|------|------|------|------|------|---------------|
| Text-Davinci-003 | 62.2 | 62.2 | 58.7 | 60.1 | 30.3 | 21.4 | 49.2 (54.7) |
| HyperClova LK-D2 | 83.1 | 82.3 | 78.0 | 54.5 | 81.6 | 45.1 | 70.8 (75.9) |
| Polyglot-Ko 1.3B | 42.0 | 41.0 | 38.6 | 33.1 | 48.4 | 25.6 | 38.1 (40.6) |
| Polyglot-Ko 3.8B | 48.5 | 42.5 | 41.8 | 36.0 | 58.0 | 22.7 | 41.6 (45.4) |
| Polyglot-Ko 5.8B | 52.1 | 51.6 | 46.4 | 43.2 | 67.0 | 27.8 | 48.0 (52.1) |
| Polyglot-Ko 12.8B | 68.6 | 48.6 | 41.2 | 40.3 | 68.6 | 28.4 | 49.3 (53.5) |
| KoGPT-ryan-6B | 49.1 | 46.7 | 45.8 | 37.4 | 62.2 | 25.6 | 44.4 (48.2) |
| XGLM-1.7B | 21.3 | 22.2 | 29.4 | 28.9 | 17.6 | 26.1 | 24.2 (23.9) |
| XGLM-2.9B | 26.6 | 24.4 | 32.7 | 32.0 | 23.4 | 26.7 | 27.6 (27.8) |
| XGLM-7.5B | 37.9 | 25.7 | 41.2 | 33.6 | 27.7 | 26.1 | 32.0 (33.2) |
| mT5-Base | 36.1 | 18.8 | 22.9 | 23.7 | 19.1 | 25.6 | 24.4 (24.7) |
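As a reference, here is a minimal sketch of the log-likelihood scoring described above, assuming a Hugging Face causal LM. The model name is an illustrative choice, and `instruction` stands in for the formatted question; the exact prompt construction used for HAE-RAE Bench is not shown here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model; any Hugging Face causal LM is scored the same way.
MODEL = "EleutherAI/polyglot-ko-1.3b"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def option_loglikelihood(instruction: str, option: str) -> float:
    """Sum of log-probabilities of the option tokens, conditioned on the instruction.

    Simplification: assumes tokenizing the concatenation aligns with tokenizing
    the instruction alone at the boundary, which holds for typical BPE tokenizers.
    """
    prompt_len = tokenizer(instruction, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(instruction + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # row t scores token t+1
    return sum(
        log_probs[pos, full_ids[0, pos + 1]].item()
        for pos in range(prompt_len - 1, full_ids.shape[1] - 1)
    )

def predict(instruction: str, options: list[str]) -> int:
    scores = [option_loglikelihood(instruction, opt) for opt in options]
    return scores.index(max(scores))  # 0-based index of the most probable option
```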
<details>
<summary>Polyglot-12.8B Variants (KoAlpaca, KuLLM)</summary>

| Models | LW | RW | SN | RC | HI | GK | Av. (w/o GK) |
|-------------------|------|------|------|------|------|------|---------------|
| Polyglot-Ko 12.8B | 68.6 | 48.6 | 41.2 | 40.3 | 68.6 | 28.4 | 49.3 (53.5) |
| kullm-v2 (w/o template) | 57.4 | 40.0 | 48.4 | 38.3 | 71.8 | 29.0 | 47.5 (51.2) |
| kullm-v2 (w/ template) | 85.2 | 40.0 | 44.9 | 38.7 | 60.6 | 27.8 | 49.5 (53.9) |
| KoAlpaca-Polyglot-12.8B (w/o template) | 67.5 | 63.2 | 61.4 | 44.3 | 80.3 | 30.0 | **57.8 (63.3)** |

- We used the template from [prompt_no_input](https://github.com/nlpai-lab/KULLM/blob/master/templates/kullm.json) for kullm-v2 (w/ template); a sketch of how such a template is applied follows this block.

</details>
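For clarity, here is a minimal sketch of how an instruction template like KULLM's might be applied before scoring. It assumes the template file follows the Alpaca-style JSON layout with a `prompt_no_input` field containing an `{instruction}` placeholder (the alpaca-lora convention); the file path and example instruction are illustrative.

```python
import json

# Local copy of templates/kullm.json from the KULLM repository (path is illustrative).
with open("templates/kullm.json", encoding="utf-8") as f:
    template = json.load(f)

def apply_template(instruction: str) -> str:
    # Alpaca-style template files carry a fixed preamble with an {instruction} slot.
    return template["prompt_no_input"].format(instruction=instruction)

# The wrapped prompt is then scored with the same log-likelihood procedure as above.
print(apply_template("다음 단어의 뜻으로 알맞은 것을 고르시오."))
```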

## How to Use

If you are interested in accessing this dataset, using it for your research, or being included in the HAE-RAE BENCH leaderboard, please reach out to us via email at [spthsrbwls123@yonsei.ac.kr](mailto:spthsrbwls123@yonsei.ac.kr).

## Acknowledgement
This project was made possible thanks to [OnelineAI](https://www.onelineai.com/).

## Citation and Related Information
### BibTeX entry

If you find our work useful, please consider citing:

```bibtex
@misc{haeraebench,
  author       = {Son, Guijin and Lee, Hanwool and Kim, Suwan and Kim, Huiseo and Lee, Jae Cheol and Yeom, Je Won and Jung, Jihyu and Kim, Jung Woo and Kim, Songseong},
  title        = {HAE-RAE Bench: Evaluation of Korean Knowledge in Language Models},
  year         = {2023},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/EleutherAI/hae-rae/tree/main/HAE-RAE%20Bench}},
}
```

<details>
<summary>References</summary>

```bibtex
@misc{polyglot-ko,
  title  = {{Polyglot-Ko: Open-Source Korean Autoregressive Language Model}},
  author = {Ko, Hyunwoong and Yang, Kichang and Ryu, Minho and Choi, Taekyoon and Yang, Seungmu and Hyun, Jiwung and Park, Sungho},
  url    = {https://www.github.com/eleutherai/polyglot},
  month  = {9},
  year   = {2022},
}
```

```bibtex
@misc{alpaca,
  author       = {Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto},
  title        = {Stanford Alpaca: An Instruction-following LLaMA model},
  year         = {2023},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/tatsu-lab/stanford_alpaca}},
}
```

```bibtex
@misc{kullm,
  author       = {NLP \& AI Lab and Human-Inspired AI research},
  title        = {KULLM: Korea University Large Language Model Project},
  year         = {2023},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/nlpai-lab/kullm}},
}
```

```bibtex
@article{lin2021few,
  title   = {Few-shot learning with multilingual language models},
  author  = {Lin, Xi Victoria and Mihaylov, Todor and Artetxe, Mikel and Wang, Tianlu and Chen, Shuohui and Simig, Daniel and Ott, Myle and Goyal, Naman and Bhosale, Shruti and Du, Jingfei and others},
  journal = {arXiv preprint arXiv:2112.10668},
  year    = {2021}
}
```

```bibtex
@inproceedings{kim-etal-2021-changes,
  title     = "What Changes Can Large-scale Language Models Bring? Intensive Study on {H}yper{CLOVA}: Billions-scale {K}orean Generative Pretrained Transformers",
  author    = "Kim, Boseop and Kim, HyoungSeok and Lee, Sang-Woo and Lee, Gichang and Kwak, Donghyun and Jeon, Dong Hyeon and Park, Sunghyun and Kim, Sungju and Kim, Seonhoon and Seo, Dongpil and Lee, Heungsub and Jeong, Minyoung and Lee, Sungjae and Kim, Minsub and Ko, Suk Hyun and Kim, Seokhun and Park, Taeyong and Kim, Jinuk and Kang, Soyoung and Ryu, Na-Hyeon and Yoo, Kang Min and Chang, Minsuk and Suh, Soobin and In, Sookyo and Park, Jinseong and Kim, Kyungduk and Kim, Hiun and Jeong, Jisu and Yeo, Yong Goo and Ham, Donghoon and Park, Dongju and Lee, Min Young and Kang, Jaewook and Kang, Inho and Ha, Jung-Woo and Park, Woomyoung and Sung, Nako",
  booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
  month     = nov,
  year      = "2021",
  address   = "Online and Punta Cana, Dominican Republic",
  publisher = "Association for Computational Linguistics",
  url       = "https://aclanthology.org/2021.emnlp-main.274",
  doi       = "10.18653/v1/2021.emnlp-main.274",
  pages     = "3405--3424",
}
```

```bibtex
@inproceedings{xue-etal-2021-mt5,
  title     = "m{T}5: A Massively Multilingual Pre-trained Text-to-Text Transformer",
  author    = "Xue, Linting and Constant, Noah and Roberts, Adam and Kale, Mihir and Al-Rfou, Rami and Siddhant, Aditya and Barua, Aditya and Raffel, Colin",
  booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
  month     = jun,
  year      = "2021",
  address   = "Online",
  publisher = "Association for Computational Linguistics",
  url       = "https://aclanthology.org/2021.naacl-main.41",
  doi       = "10.18653/v1/2021.naacl-main.41",
  pages     = "483--498",
}
```

```bibtex
@article{ouyang2022training,
  title   = {Training language models to follow instructions with human feedback},
  author  = {Ouyang, Long and Wu, Jeffrey and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and others},
  journal = {Advances in Neural Information Processing Systems},
  volume  = {35},
  pages   = {27730--27744},
  year    = {2022}
}
```
</details>

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

# HAE-RAE
Repository for the HAE-RAE project, an effort to improve the reasoning and instruction-following abilities of Polyglot-Ko. This repository hosts datasets, blog posts, and code from our progress.

--------------------------------------------------------------------------------
/blog/CSAT-QA.md:
--------------------------------------------------------------------------------

# CSAT-QA: How Far Can LLMs Reach in Korean Language Understanding?

### Introduction

In this blog post, we release CSAT-QA, a multiple-choice question-answering dataset for the Korean language. The questions are collected from the College Scholastic Ability Test (CSAT), known as the 대학수학능력시험 in South Korea, a standardized test required for university admissions in the country. We have gathered 936 question-and-answer pairs from CSAT exams held between 2007 and 2022, and these resources are now open-source for public use.

### Dataset Collection

The CSAT-QA dataset encompasses four distinct curriculums: the 7th National Curriculum, the 2007 Revised Curriculum, the 2009 Revised Curriculum, and the 2015 Revised Curriculum. We applied the following preprocessing steps to the collected data:

First, because publicly accessible Korean OCR systems proved unreliable, we manually transcribed the CSAT test questions to ensure the quality of our dataset.

Second, we excluded questions on "Middle Korean," a historical form of the language, since most language models cannot encode its vocabulary.

Third, we manually converted all tables and graphs into LaTeX and translated images into descriptive alternative texts, so that complex questions are fully represented in a language-model-friendly manner.

Finally, we introduced four unique token pairs (opening and closing markers) to help language models identify the parts of the context that a question refers to, which are normally conveyed through italic or bold fonts. For evaluation, we provide versions both with and without the introduced unique tokens.

### Dataset Analysis

The CSAT serves as a comprehensive assessment of Korean proficiency, covering vocabulary, reading comprehension, literature, conversational situations, and more. As a result, the length of each sample in the dataset varies considerably.

To quantify this, we applied the tokenizers of Polyglot-Ko and GPT-4 to measure the length of the samples and counted the number of tokens each one contains. The averages for each measure are as follows (a sketch of the measurement follows this list):

- total_length, computed with Python's built-in len() function (i.e., a character count), averages approximately 1895.31.
- total_length_polyglot, the number of tokens produced by the Polyglot-Ko tokenizer, averages approximately 1084.72.
- total_length_gpt4, the number of tokens produced by the GPT-4 tokenizer, averages approximately 1855.63.
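Here is a minimal sketch of this measurement, assuming the Hugging Face tokenizer for Polyglot-Ko and tiktoken's cl100k_base encoding as a stand-in for the GPT-4 tokenizer. The split name and the `question` field are assumptions about the dataset layout, not confirmed column names.

```python
import tiktoken
from datasets import load_dataset
from transformers import AutoTokenizer

polyglot_tok = AutoTokenizer.from_pretrained("EleutherAI/polyglot-ko-12.8b")
gpt4_tok = tiktoken.get_encoding("cl100k_base")  # the encoding used by GPT-4

# Split name and text field below are illustrative assumptions.
dataset = load_dataset("EleutherAI/CSAT-QA", "full")["train"]

def lengths(sample: dict) -> dict:
    text = sample["question"]  # hypothetical field holding one sample's full text
    return {
        "total_length": len(text),  # plain character count via len()
        "total_length_polyglot": len(polyglot_tok(text).input_ids),
        "total_length_gpt4": len(gpt4_tok.encode(text)),
    }

stats = dataset.map(lengths)
for col in ("total_length", "total_length_polyglot", "total_length_gpt4"):
    print(col, sum(stats[col]) / len(stats))
```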

Interestingly, total_length_gpt4 is on average approximately 1.71 times larger than total_length_polyglot. This suggests that the GPT-4 tokenizer is considerably less efficient at processing Korean text than Polyglot-Ko's, highlighting the need for native LLMs for optimized inference.

![Untitled](https://github.com/guijinSON/hae-rae/blob/main/blog/assets/csat_token.png)

In addition, we narrowed our focus to a subset of 188 questions selected based on the availability of students' response accuracy. The subset comprises six distinct categories: Writing (WR), Grammar (GR), Reading Comprehension: Science (RCS), Reading Comprehension: Social Science (RCSS), Reading Comprehension: Humanities (RCH), and Literature (LI).

Rather than taking the conventional approach of balanced sampling, we filtered on the availability of response accuracy. As a result, the distribution of questions within our subset is imbalanced, as shown in the following figure.

![Untitled](https://github.com/guijinSON/hae-rae/blob/main/blog/assets/csat_histogram.png)

### Evaluation

For evaluation, we compared two proprietary language models, GPT-4 and GPT-3.5-Turbo-16K, and one open-source language model, [Polyglot-Ko-12.8B](https://huggingface.co/EleutherAI/polyglot-ko-12.8b).

For GPT-4 and GPT-3.5-Turbo-16K, we used the following instruction with the model's temperature fixed at 0.01 to prompt the language model to generate the most probable answer.
```
instruction = f"""다음을 읽고 정답으로 알맞은 것을 고르시요. [Please read the following passage and choose the correct answer.]
### Context:
### Question:
### Options:
(1)
(2)
(3)
(4)
(5)
### Answer: 주어진 문제의 정답은 [The correct answer to the given question is]"""
```
Note that the texts within the square brackets are not included in the actual prompt; they are translations for international researchers reading this post. A sketch of the querying procedure appears after the results below.

For Polyglot-Ko-12.8B, we employed the [LM-Eval-Harness](https://github.com/EleutherAI/lm-evaluation-harness) framework. Although all evaluations were conducted in a zero-shot setting, the generation-based methodology used for GPT-4 and GPT-3.5-Turbo-16K is more challenging than the log-likelihood method used for Polyglot-Ko-12.8B, so direct comparisons between models evaluated with these distinct methods may lead to misleading interpretations.

The evaluation results are as follows.

| **Models** | **GR** | **LI** | **RCH** | **RCS** | **RCSS** | **WR** | **Average** |
|:-----------------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:-----------:|
| polyglot-ko-12.8B | 16.0 | 16.22 | 2.86 | 10.81 | 7.14 | 9.09 | 13.68 |
| gpt-3.5-wo-token | 16.0 | 32.43 | 42.86 | 18.92 | 35.71 | 0.00 | 24.32 |
| gpt-3.5-w-token | 16.0 | 35.14 | 42.86 | 18.92 | 35.71 | 9.09 | 26.29 |
| gpt-4-wo-token | 40.0 | 54.05 | **68.57** | **59.46** | **69.05** | 36.36 | **54.58** |
| gpt-4-w-token | 36.0 | **56.76** | **68.57** | **59.46** | **69.05** | 36.36 | 54.37 |
| Human Performance | **45.41** | 54.38 | 48.7 | 39.93 | 44.54 | **54.0** | 47.83 |

Our analysis shows an intriguing disparity between language models and human proficiency across tasks. Notably, while humans excel at Writing (WR) questions, they comparatively underperform on Reading Comprehension: Science (RCS) tasks. GPT-4, the latest model in this comparison, exhibits the opposite pattern: despite struggling with writing, it demonstrates considerable strength on RCS tasks.

GPT-4 also significantly outperformed GPT-3.5-Turbo-16K and Polyglot-Ko-12.8B. This substantial leap in performance is in line with expectations, given GPT-4's much larger size, indicating the potential benefits of scaling up these models.

Moreover, although adding the unique tokens changes model performance, we do not observe any consistent tendency in our experiments.

Lastly, the underperformance of Polyglot-Ko-12.8B is noteworthy: its scores fall below random guessing (20%), indicating limitations in the model's capabilities.
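For reference, the querying and answer-extraction procedure described above can be sketched as follows. This is a minimal example assuming the openai Python SDK (v1 interface); the answer-extraction heuristic and the function signature are our own illustrative choices, not the exact code behind the reported numbers.

```python
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def build_prompt(context: str, question: str, options: list[str]) -> str:
    opts = "\n".join(f"({i + 1}) {o}" for i, o in enumerate(options))
    return (
        "다음을 읽고 정답으로 알맞은 것을 고르시요.\n"
        f"### Context: {context}\n"
        f"### Question: {question}\n"
        f"### Options:\n{opts}\n"
        "### Answer: 주어진 문제의 정답은"
    )

def query(context: str, question: str, options: list[str], model: str = "gpt-4") -> int | None:
    completion = client.chat.completions.create(
        model=model,
        temperature=0.01,  # near-deterministic decoding, as in the blog post
        messages=[{"role": "user", "content": build_prompt(context, question, options)}],
    )
    # Take the first digit 1-5 in the reply as the predicted option (illustrative heuristic).
    match = re.search(r"[1-5]", completion.choices[0].message.content)
    return int(match.group()) if match else None
```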

80 | 81 |


### Release Notes

We are happy to release two versions of the dataset: CSAT-QA (FULL) and CSAT-QA (EVAL). CSAT-QA (EVAL) contains the 188 questions used in our evaluation, while CSAT-QA (FULL) contains all 936 questions in the complete dataset.

The full version with 936 questions can be downloaded using the following code:

```
from datasets import load_dataset

# Download all 936 questions.
dataset = load_dataset("EleutherAI/CSAT-QA", "full")
```

The more condensed version, which includes human accuracy data, can be downloaded per category:

```
from datasets import load_dataset

# Choose one of: WR, GR, LI, RCH, RCS, RCSS
dataset = load_dataset("EleutherAI/CSAT-QA", "GR")
```

For the reproducibility of our research, we also release our instructions along with the responses generated by the GPT-3.5-Turbo-16K and GPT-4 models: [Download File](https://github.com/guijinSON/hae-rae/blob/main/blog/assets/csatqa.xlsx)

### Evaluate using LM-Eval-Harness
To evaluate your model with the LM-Eval-Harness by EleutherAI, follow the steps below.

1. Install lm-eval from the main branch of the GitHub repository:
```
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
```

2. Install the additional multilingual tokenization and text segmentation packages via the multilingual extra:
```
pip install -e ".[multilingual]"
```

3. Run the evaluation:
```
python main.py \
    --model hf-causal \
    --model_args pretrained=EleutherAI/polyglot-ko-1.3b \
    --tasks csatqa_wr,csatqa_gr,csatqa_rcs,csatqa_rcss,csatqa_rch,csatqa_li \
    --device cuda:0
```

### License
The copyright of this material belongs to the Korea Institute for Curriculum and Evaluation (한국교육과정평가원), and it may be used for research purposes only.

### Contributors

- [Na Keonju](https://www.linkedin.com/in/%EA%B1%B4%EC%A3%BC-%EB%82%98-1b7930218)
- [Park EunWoo](https://www.linkedin.com/in/eunwoo-park-468387224/)
- Subin Park
- [Guijin Son](https://github.com/guijinSON)
- [Yeom Je Won](https://www.linkedin.com/in/jewon-yeom-902185230/)
- [Yoo Soobin](https://www.linkedin.com/in/yoosoobin123)
- [Cho Haneul](https://www.linkedin.com/in/haneul-cho-a30036166)
- [Jin Hyewon](https://www.linkedin.com/in/hyewon-jin04)

--------------------------------------------------------------------------------
/blog/assets/csat_histogram.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EleutherAI/hae-rae/26c4588910ffadae7ccd6015a378b3806f5868c4/blog/assets/csat_histogram.png

--------------------------------------------------------------------------------
/blog/assets/csat_spyder.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EleutherAI/hae-rae/26c4588910ffadae7ccd6015a378b3806f5868c4/blog/assets/csat_spyder.png

--------------------------------------------------------------------------------
/blog/assets/csat_token.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EleutherAI/hae-rae/26c4588910ffadae7ccd6015a378b3806f5868c4/blog/assets/csat_token.png

--------------------------------------------------------------------------------
/blog/assets/csatqa.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EleutherAI/hae-rae/26c4588910ffadae7ccd6015a378b3806f5868c4/blog/assets/csatqa.xlsx

--------------------------------------------------------------------------------