
For example, llm-compressive tests open-source LLMs on Wikipedia across 83 months from 2017 to 2023.

**Mistral** and **Baichuan2** show steady performance across all time periods, indicating promising generalization over time. In contrast, other models show curves that worsen roughly linearly over time.

More results on code, arXiv, news, image, and audio data are presented in the paper: [Evaluating Large Language Models for Generalization and Robustness via Data Compression](https://arxiv.org/pdf/2402.00861.pdf).
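
The idea in brief: a language model's compression rate on a text can be estimated from its token-level cross-entropy, since an arithmetic coder driven by the model's next-token predictions needs roughly the model's total negative log-likelihood in bits. The sketch below illustrates this with `transformers`; it is a simplified, single-sequence illustration (no chunking or sliding window), not the repository's exact pipeline.

```
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; any causal LM works the same way.
model_name = 'mistralai/Mistral-7B-v0.1'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map='auto')

def compression_rate(text: str) -> float:
    """Approximate compressed-size / raw-size, treating the LM as an arithmetic coder's model."""
    ids = tokenizer(text, return_tensors='pt').input_ids.to(model.device)
    with torch.no_grad():
        # labels=ids gives the mean cross-entropy (in nats) over the shifted targets
        loss = model(ids, labels=ids).loss.item()
    n_predictions = ids.numel() - 1           # the first token is not predicted
    compressed_bits = loss / math.log(2) * n_predictions
    raw_bits = 8 * len(text.encode('utf-8'))  # raw size of the UTF-8 text in bits
    return compressed_bits / raw_bits
```

A lower rate means the model predicts, and therefore compresses, the text better.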

**Updates**:
- 27 Feb 2024: try the interactive leaderboard at [LLM-Compressive](https://liyucheng09.github.io/llm-compressive/).

# Getting Started

0. Clone and install requirements.

```
git clone https://github.com/liyucheng09/llm-compressive.git
cd llm-compressive
pip install -r requirements.txt
```

1. Run the main test script.

```
python main.py
```

See the explanation of the figure in the [paper](https://arxiv.org/pdf/2402.00861.pdf).

# Models

We have tested the following models:
- codellama/CodeLlama-7b-hf
- baichuan-inc/Baichuan2-7B-Base
- mistralai/Mistral-7B-v0.1
- huggyllama/llama-7b
- huggyllama/llama-13b
- huggyllama/llama-65b
- meta-llama/Llama-2-7b-hf
- meta-llama/Llama-2-13b-hf
- meta-llama/Llama-2-70b-hf
- Qwen/Qwen-7B
- internlm/internlm-7b
- THUDM/chatglm3-6b-base
- 01-ai/Yi-6B-200K
- 01-ai/Yi-34B-200K
- google/gemma-7b
- Qwen/Qwen1.5-7B

And any GPTQ version of the above models (see the loading sketch after the list), such as:

- TheBloke/CodeLlama-70B-hf-GPTQ
- TheBloke/Llama-2-70B-GPTQ
- TheBloke/Yi-34B-200K-GPTQ
- ...
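
A minimal sketch of loading such a GPTQ checkpoint through the standard `transformers` API (this assumes the `optimum` and `auto-gptq` packages are installed; it is an illustration, not the repository's loading code):

```
from transformers import AutoModelForCausalLM, AutoTokenizer

# The quantization config stored in a GPTQ checkpoint is picked up
# automatically, so the standard from_pretrained call is sufficient.
name = 'TheBloke/Llama-2-70B-GPTQ'
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, device_map='auto')
```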

# Issues

Send me an email or open an issue if you have any questions.

# Citation

If you find this repo helpful, please consider citing our paper:

```
@article{Li2024EvaluatingLL,
  title={Evaluating Large Language Models for Generalization and Robustness via Data Compression},
  author={Yucheng Li and Yunhao Guo and Frank Guerin and Chenghua Lin},
  year={2024},
  journal={arXiv preprint arXiv:2402.00861}
}
```

--------------------------------------------------------------------------------
/visualise/timeline_vis.py:
--------------------------------------------------------------------------------
import matplotlib.pyplot as plt
import json
import numpy as np
import matplotlib.ticker as ticker

task = 'wikitext'
with open(f'results/{task}_results.json') as f:
    data = json.load(f)
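
# Assumed layout of the results file, inferred from how it is indexed below:
# { "<model name>": { "<YYYY-MM>": { "ratio": <compression ratio>, ... }, ... }, ... }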

# we have two fig_types now: 'llama' and '7B'
fig_type = '7B'

# fine-grained control over which models to include
code_llama = True
large_llama = False
internlm = False

x_ticks = True
no_ylabel = False

# Create the plot, and set the font sizes
if no_ylabel:
    plt.figure(figsize=(8, 5), dpi=180)
else:
    plt.figure(figsize=(8.2, 5), dpi=180)
markersize = 3
legend_fontsize = 10
tick_fontsize = 12
label_fontsize = 14

# Colorblind-friendly color palette
# Source: https://jfly.uni-koeln.de/color/
colors = ['#E69F00', '#56B4E9', '#009E73', '#F0E442', '#0072B2', '#D55E00', '#CC79A7']

# Different line styles
line_styles = ['-', '--', '-.', ':']

# Different markers
markers = ['o', 's', 'D', '^', 'v', '<', '>']

model_name_to_label = {
    'Baichuan2-7B-Base': 'Baichuan2-7B',
    'internlm-7B': 'Internlm-7B',
    'Qwen-7B': 'Qwen-7B',
    'Yi-6B': 'Yi-6B',
    'chatglm3-6b-base': 'Chatglm3-6B',
    'Mistral-7B': 'Mistral-7B',
    'LLaMA-7B-HF': 'LLaMA-7B',
    'LLaMA-13B': 'LLaMA-13B',
    'Llama-2-13B': 'Llama-2-13B',
    'Llama-2-7B-HF': 'Llama-2-7B',
    'CodeLlama-7B': 'CodeLlama-7B',
    'Llama-2-70B': 'Llama-2-70B',
    'LLaMA-30B': 'LLaMA-30B',
    'LLaMA-65B': 'LLaMA-65B',
    'Yi-34B-200K': 'Yi-34B',
}

# Loop through each model's data and plot it
counter = 0
for model_name, model_data in data.items():
    if model_name not in model_name_to_label:
        continue
    if fig_type == 'llama':
        if not code_llama and model_name == 'CodeLlama-7B':
            continue
        if not large_llama and model_name in ['Llama-2-70B', 'LLaMA-30B', 'LLaMA-65B']:
            continue
        if model_name not in ['Llama-2-7B-HF', 'LLaMA-7B-HF', 'LLaMA-13B', 'Llama-2-13B', 'CodeLlama-7B', 'Llama-2-70B', 'LLaMA-30B', 'LLaMA-65B']:
            continue
    elif fig_type == '7B':
        if not ('7B' in model_name or '6B' in model_name):
            continue
        if task not in ['code', 'bbc_image', 'arxiv']:
            if 'internlm' in model_name.lower() and not internlm:
                continue
            if 'code' in model_name.lower() and not code_llama:
                continue
    else:
        raise ValueError(f'Unknown fig_type: {fig_type}')

    labels = list(model_data.keys())
    values = np.array([metrics['ratio'] for metrics in model_data.values()]) * 100

    # 2021-03 is an outlier, so smooth it with the mean of its two neighbours
    if '2021-03' in labels:
        remove_index = labels.index('2021-03')
        values[remove_index] = (values[remove_index - 1] + values[remove_index + 1]) / 2

    # Use color, line style and marker from our predefined lists
    color = colors[counter % len(colors)]
    line_style = line_styles[counter % len(line_styles)]
    marker = markers[counter % len(markers)]

    plt.plot(labels, values, color=color, linestyle=line_style, marker=marker, label=model_name_to_label[model_name], markersize=markersize)
    counter += 1

# Adding title and labels
plt.title(task)
if not no_ylabel:
    plt.ylabel('Compression Rates (%)', fontsize=label_fontsize)

# Adding a legend to differentiate the lines from each model
plt.legend(fontsize=legend_fontsize)

# Show grid lines
plt.grid(True)

# x ticks are too dense, so we only show every 6th tick
plt.xticks(list(range(0, len(labels), 6)), rotation=45)

if not x_ticks:
    plt.xticks([])

plt.gca().yaxis.set_major_locator(ticker.MultipleLocator(0.4))
plt.tight_layout()

plt.tick_params(axis='x', labelsize=tick_fontsize)
plt.tick_params(axis='y', labelsize=tick_fontsize)

plt.savefig(f'figs/{fig_type}-{task}.png')
--------------------------------------------------------------------------------
/page/index.html:
--------------------------------------------------------------------------------
<!-- Page markup truncated; the recoverable section content is: -->
<!-- 1. Intro: LLM-Compressive evaluates LLMs via data compression on data collected every month from 2017 to 2024. -->
<!-- 2. Issues: If you have problems or want to request results of a new model, please head to our project page and open an issue. -->
<!-- 3. Benchmark Performance -->
<!-- 4. Context Length Performance -->