├── .github
│   └── workflows
│       └── ci.yml
├── .gitignore
├── LICENSE
├── README.md
├── models
│   └── .gitkeep
├── patch
│   ├── .gitkeep
│   ├── fixed_kv_patch.diff
│   ├── kvsplit.patch
│   ├── kvsplit_fixed.patch
│   └── split_kv_quant.diff
├── perplexity_test_data.txt
├── plots
│   ├── .gitkeep
│   ├── configuration_summary.png
│   ├── inference_speed.png
│   ├── key_value_sensitivity.png
│   ├── kv_cache_memory_usage.png
│   ├── memory_vs_quality.png
│   └── perplexity_change.png
├── results
│   └── .gitkeep
└── scripts
    ├── benchmark_kvsplit.py
    ├── capture_memory.sh
    ├── install_kvsplit.sh
    ├── quick_compare.py
    ├── run_benchmark.sh
    ├── test_kvsplit.sh
    ├── validate_kvsplit.sh
    └── visualize_results.py
/.github/workflows/ci.yml:
--------------------------------------------------------------------------------
1 | name: Build & Validate
2 | 
3 | on:
4 |   push:
5 |     branches: [ main ]
6 |   pull_request:
7 |     branches: [ main ]
8 | 
9 | jobs:
10 |   build:
11 |     runs-on: macos-14
12 | 
13 |     steps:
14 |       - uses: actions/checkout@v3
15 | 
16 |       - name: Install dependencies
17 |         run: brew install cmake
18 | 
19 |       - name: Cache llama.cpp
20 |         uses: actions/cache@v3
21 |         with:
22 |           path: llama.cpp
23 |           key: llama-cpp-${{ hashFiles('patch/fixed_kv_patch.diff') }}
24 | 
25 |       - name: Clone llama.cpp
26 |         run: |
27 |           git clone https://github.com/ggerganov/llama.cpp || echo "llama.cpp already exists"
28 | 
29 |       - name: Apply patch
30 |         run: |
31 |           cd llama.cpp
32 |           git apply ../patch/fixed_kv_patch.diff || echo "Could not apply patch, checking file exists"
33 |           cat ../patch/fixed_kv_patch.diff
34 |           ls -la common/common.cpp common/common.h
35 | 
36 |       - name: Build llama.cpp
37 |         run: |
38 |           cd llama.cpp
39 |           mkdir -p build
40 |           cd build
41 |           cmake .. -DLLAMA_METAL=OFF  # macos-14 runners are arm64; AVX flags are x86-only
42 |           cmake --build . --config Release -j
43 | 
44 |       - name: Verify build
45 |         run: |
46 |           cd llama.cpp/build
47 |           ls -la bin
48 |           if [ -f "bin/llama-cli" ]; then
49 |             echo "✅ llama-cli built successfully"
50 |             ./bin/llama-cli -h
51 |           else
52 |             echo "❌ Failed to build llama-cli"
53 |             exit 1
54 |           fi
55 | 
56 |       - name: Python syntax check
57 |         run: |
58 |           python3 -m py_compile scripts/benchmark_kvsplit.py
59 |           python3 -m py_compile scripts/quick_compare.py
60 |           python3 -m py_compile scripts/visualize_results.py
61 |           echo "✅ Python scripts pass syntax check"
62 | 
63 |       - name: Shell script check
64 |         run: |
65 |           bash -n scripts/install_kvsplit.sh
66 |           bash -n scripts/capture_memory.sh
67 |           echo "✅ Shell scripts pass syntax check"
68 | 
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Ignore llama.cpp directory (users will clone it during installation)
2 | /llama.cpp/
3 |
4 | # Ignore models directory (large files)
5 | /models/*
6 | !models/.gitkeep
7 |
8 | # Python virtual environment
9 | /venv/
10 | __pycache__/
11 | *.py[cod]
12 | *$py.class
13 |
14 | # Build artifacts
15 | *.o
16 | *.so
17 | *.dylib
18 | *.a
19 | *.exe
20 | *.out
21 |
22 | # Logs and results (except examples)
23 | /results/*
24 | !results/.gitkeep
25 |
26 | # OS specific files
27 | .DS_Store
28 | Thumbs.db
29 |
30 | # Editor files
31 | .vscode/
32 | .idea/
33 | *.swp
34 | *.swo
35 |
36 | # Temporary files
37 | /capture_frames/
38 | *.tmp
39 | *~
40 |
41 | # Keep directory structure with .gitkeep files
42 | !*/
43 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2025 dipampaul17
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
23 | ---
24 |
25 | Implementation based on concepts from:
26 | - "Unifying KV Cache Compression for Large Language Models with LeanKV" (Zhang et al., 2025)
27 | - "More for Keys, Less for Values: Adaptive KV Cache Quantization" (2024)
28 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 | # 🚀 KVSplit
4 |
5 | **Differentiated KV Cache Quantization for Apple Silicon**
6 |
12 |
13 |
14 |
15 | ## 📌 Overview
16 |
17 | Run **larger context windows** and **heavier LLMs** on your Mac by applying different quantization precision to keys vs values in the attention mechanism's KV cache. KVSplit enables you to:
18 |
19 | - **Reduce memory usage by up to 72%** with minimal quality loss
20 | - **Run 2-3x longer contexts** in the same memory budget
21 | - **Maintain or improve inference speed** compared to FP16
22 | - **Optimize for Apple Silicon** with full Metal support
23 |
24 | ## Key Findings
25 |
26 | | Configuration | VRAM @ 8K tokens | Tokens/sec | Perplexity Change |
27 | |---------------|-----------------|------------|-------------------|
28 | | FP16 (base) | 176.00 MB (100%)| 54,360 | -- |
29 | | K8V8 (8-bit) | 93.50 MB (53%) | 51,503 | +0.03% |
30 | | **K8V4** | **71.50 MB (41%)** | **57,438** | **+0.86%** |
31 | | K4V8 | 71.50 MB (41%) | 58,690 | +6.06% |
32 | | K4V4 (4-bit) | 49.50 MB (28%) | 55,193 | +6.15% |
33 |
34 | ### Memory Savings by Sequence Length
35 |
36 | | Configuration | 128 tokens | 2048 tokens | 4096 tokens | 8192 tokens |
37 | |---------------|------------|-------------|-------------|-------------|
38 | | FP16 (baseline) | 5.50 MB | 44.00 MB | 88.00 MB | 176.00 MB |
39 | | K8V8 (8-bit) | 2.92 MB | 23.38 MB | 46.75 MB | 93.50 MB |
40 | | K8V4 (mixed) | 2.23 MB | 17.88 MB | 35.75 MB | 71.50 MB |
41 | | K4V8 (mixed) | 2.23 MB | 17.88 MB | 35.75 MB | 71.50 MB |
42 | | K4V4 (4-bit) | 1.55 MB | 12.38 MB | 24.75 MB | 49.50 MB |
43 |
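Both tables follow directly from the cache geometry. Below is a quick sanity check you can run; the model dimensions are an assumption (TinyLlama-1.1B-like: 22 layers, 4 grouped-query KV heads, head dimension 64; swap in your model's values), and the per-element costs include llama.cpp's quantization block scales (Q8_0 ≈ 8.5 bits, Q4_0 ≈ 4.5 bits).

```python
# Theoretical KV cache size; model dimensions here are assumptions (see above)
EFF_BITS = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}  # bits/element incl. block scales

def kv_cache_mb(seq_len, k_type, v_type, n_layers=22, n_kv_heads=4, head_dim=64):
    elems = seq_len * n_layers * n_kv_heads * head_dim  # elements per K (and per V)
    return elems * (EFF_BITS[k_type] + EFF_BITS[v_type]) / 8 / 1024**2

for name, k, v in [("FP16", "f16", "f16"), ("K8V8", "q8_0", "q8_0"),
                   ("K8V4", "q8_0", "q4_0"), ("K4V4", "q4_0", "q4_0")]:
    print(f"{name}: {kv_cache_mb(8192, k, v):6.2f} MB")  # 176.00 / 93.50 / 71.50 / 49.50
```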
44 | ## Features
45 |
46 | - Independent quantization of keys and values in the KV cache
47 | - Optimized for Apple Silicon with Metal support
48 | - Comprehensive benchmarking suite with perplexity measurement
49 | - Memory usage and performance analysis tools
50 | - Publication-quality visualization tools
51 | - Easy setup and usage
52 |
53 | ## Prerequisites
54 |
55 | - macOS (tested on Apple Silicon)
56 | - Homebrew package manager
57 | - Xcode Command Line Tools
58 |
59 | ## ⚡ Flexible Installation
60 |
61 | ```bash
62 | # Clone the repository
63 | git clone https://github.com/dipampaul17/KVSplit.git
64 | cd KVSplit
65 |
66 | # Run the installer script
67 | chmod +x scripts/install_kvsplit.sh
68 | ./scripts/install_kvsplit.sh
69 | ```
70 |
71 | The installer provides flexible options:
72 |
73 | ### 🐍 Python Setup Options
74 | - **Virtual Environment** (default): Creates a standalone Python environment in the project folder
75 | - **System Python**: Uses your existing Python installation instead of creating a virtual environment
76 | - **Skip Python Setup**: For users who prefer to manage their Python environment manually
77 |
78 | ### 🔄 llama.cpp Integration Options
79 | - **Standard Method** (default): Clones llama.cpp and applies the KV split patch
80 | - **Git Submodule Method**: Adds llama.cpp as a git submodule (ideal for advanced users or development)
81 |
82 | The installer will:
83 | - Set up the project structure with your preferred configuration
84 | - Configure llama.cpp with Metal support optimized for Apple Silicon
85 | - Enable differentiated KV cache quantization
86 | - Offer to download a small test model (optional)
87 | - Set up visualization tools based on your Python preferences
88 |
89 | ## 🏎️ Quick Comparison
90 |
91 | Want to see the benefits immediately? Run a quick comparison with your model:
92 |
93 | ```bash
94 | # Run quick comparison with different configurations
95 | python scripts/quick_compare.py --model models/your-model.gguf
96 | ```
97 |
98 | This will show you a side-by-side comparison of FP16, K8V8, K8V4, K4V8, and K4V4 with memory usage, speed, and quality metrics.
99 |
100 | ## 📊 Impressive Results
101 |
102 |
103 |
104 |
105 |
106 | ### 📉 Memory Reduction
107 |
108 | | Configuration | VRAM @ 8K tokens | Memory Savings | Quality Impact |
109 | |---------------|-----------------|----------------|----------------|
110 | | FP16 (base) | 176.00 MB | — | — |
111 | | K8V8 (8-bit) | 93.50 MB | 47% | +0.03% |
112 | | **K8V4** | **71.50 MB** | **59%** | **+0.86%** |
113 | | K4V8 | 71.50 MB | 59% | +6.06% |
114 | | K4V4 (4-bit) | 49.50 MB | 72% | +6.15% |
115 |
116 | ### 📈 Performance Impact
117 |
118 | Using KVSplit doesn't just save memory: it often **improves inference speed** as well, by 5-8% in our benchmarks!
119 |
120 | | Configuration | Tokens/sec (8K ctx) | Speedup vs FP16 |
121 | |---------------|---------------------|----------------|
122 | | FP16 | 54,360 | — |
123 | | K8V8 | 51,503 | -5.3% |
124 | | **K8V4** | **57,438** | **+5.7%** |
125 | | K4V8 | 58,690 | +8.0% |
126 | | K4V4 | 55,193 | +1.5% |
127 |
128 | ## 🧠 Project Structure
129 |
130 | ```
131 | kvsplit/
132 | ├── llama.cpp/                     # Optimized llama.cpp build
133 | ├── models/                        # LLM model files
134 | ├── scripts/                       # Utility scripts
135 | │   ├── benchmark_kvsplit.py       # Comprehensive benchmark tool
136 | │   ├── install_kvsplit.sh         # One-command installer
137 | │   ├── quick_compare.py           # Quick comparison utility
138 | │   ├── capture_memory.sh          # GIF creation for memory visualization
139 | │   └── visualize_results.py       # Generate publication-quality plots
140 | ├── results/                       # Benchmark results (CSV/JSON)
141 | ├── plots/                         # Generated visualizations
142 | └── README.md                      # This file
143 | ```
144 |
145 | ## 🔬 Scientific Insight
146 |
147 |
148 |
149 |
150 |
151 | The KV cache stores a key vector and a value vector for every token at every layer, so it dominates memory use at long context lengths. Our benchmarks reveal a critical insight: **keys are significantly more sensitive to quantization than values**.
152 |
153 | ### 🔑 Key Findings
154 |
155 | - **Asymmetric Impact**: Keys require higher precision than values for maintaining quality
156 | - **Sweet Spot**: K8V4 (8-bit keys, 4-bit values) provides optimal balance
157 | - Only 0.86% perplexity degradation vs. FP16
158 | - 59% memory reduction
159 | - Faster inference than FP16
160 | - **Confirmation**: K4V8 configuration shows 7x more quality degradation than K8V4, despite using the same total bits
161 |
162 | This asymmetry allows for more efficient memory usage without compromising model quality, enabling longer context windows and larger models on consumer hardware.
163 |
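Why are keys more sensitive? A quantization error in a key is multiplied by the query and pushed through the softmax, where it can reshuffle attention weights across tokens; an error in a value enters the output only linearly, through a weighted average. The toy experiment below illustrates that mechanism (it is not the llama.cpp quantizer): random activations with a few shared outlier channels in Q and K, a pattern widely reported in transformer activations, and a naive per-row quantizer.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 256
Q, K, V = (rng.standard_normal((n, d)).astype(np.float32) for _ in range(3))
Q[:, :4] *= 10; K[:, :4] *= 10  # shared outlier channels in queries/keys

def fake_quant(x, bits):
    # Naive symmetric per-row quantization (a stand-in for Q8_0/Q4_0)
    scale = np.abs(x).max(axis=-1, keepdims=True) / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

def attn(K_, V_):
    s = Q @ K_.T / np.sqrt(d)
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ V_

ref = attn(K, V)
for name, k_bits, v_bits in [("K8V4", 8, 4), ("K4V8", 4, 8)]:
    out = attn(fake_quant(K, k_bits), fake_quant(V, v_bits))
    err = np.linalg.norm(out - ref) / np.linalg.norm(ref)
    print(f"{name}: relative attention-output error = {err:.3f}")
```

In this setup K4V8 shows a substantially larger output error than K8V4, mirroring the perplexity asymmetry in the tables above.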
164 | ## 💻 Usage Examples
165 |
166 | ### Running with Different KV Cache Precisions
167 |
168 | ```bash
169 | # Baseline (FP16)
170 | ./llama.cpp/build/bin/llama-cli -m models/your-model.gguf -p "Your prompt" \
171 | -t 8 --flash-attn
172 |
173 | # ⭐ RECOMMENDED: 8-bit keys, 4-bit values (K8V4)
174 | # Best balance of quality and memory savings
175 | ./llama.cpp/build/bin/llama-cli -m models/your-model.gguf -p "Your prompt" \
176 |   -t 8 --flash-attn --kvq-key 8 --kvq-val 4
177 |
178 | # 4-bit keys, 8-bit values (K4V8)
179 | # Shows why key precision matters more than value precision
180 | ./llama.cpp/build/bin/llama-cli -m models/your-model.gguf -p "Your prompt" \
181 | -t 8 --flash-attn --kvq-key 4 --kvq-val 8
182 |
183 | # 4-bit keys and values (K4V4)
184 | # Maximum memory savings (72% reduction) with acceptable quality
185 | ./llama.cpp/build/bin/llama-cli -m models/your-model.gguf -p "Your prompt" \
186 | -t 8 --flash-attn --kvq 4
187 | ```
188 |
189 | ### Long Context Example (32K)
190 |
191 | ```bash
192 | # Run with a 32K context (K8V4 needs ~59% less KV cache memory than FP16)
193 | ./llama.cpp/build/bin/llama-cli -m models/your-model.gguf \
194 |   -c 32768 -n 4096 -t 8 --flash-attn --kvq-key 8 --kvq-val 4 \
195 | -f your-long-document.txt
196 | ```
197 |
198 | ### 🚩 Command-Line Arguments
199 |
200 | | Flag | Description | Recommendation |
201 | |------|-------------|---------------|
202 | | `-t 8` | Number of threads | 8 is optimal for most Apple Silicon chips |
203 | | `--flash-attn` | Enables optimized attention | Recommended for Apple Silicon |
204 | | `--kvq N` | Sets both key and value bits to N | Equivalent to `--kvq-key N --kvq-val N`; use `--kvq-key 8 --kvq-val 4` for K8V4 |
205 | | `--kvq-key N` | Sets key bits only | Key precision has major quality impact |
206 | | `--kvq-val N` | Sets value bits only | Value precision has minor quality impact |
207 | | `-c N` | Context size in tokens | Longer contexts benefit more from KVSplit |
208 | | `-n N` | Number of tokens to generate | Adjust based on your needs |
209 | | `-f FILE` | Input file | For processing documents |
210 | | `-m MODEL` | Model path | Path to your .gguf model file |
211 |
212 | ## 📏 Advanced Benchmarking
213 |
214 | For comprehensive performance analysis, use our full benchmark suite:
215 |
216 | ```bash
217 | # Run the full benchmark suite (all configurations and sequence lengths)
218 | python scripts/benchmark_kvsplit.py
219 |
220 | # Run a specific configuration test
221 | python scripts/benchmark_kvsplit.py --config K8V4 --seq-len 4096
222 |
223 | # Generate publication-quality visualizations
224 | python scripts/visualize_results.py
225 | ```
226 |
227 | The benchmarking script provides thorough measurements of:
228 |
229 | - 📊 **Memory Usage**: VRAM and KV cache specifically
230 | - ⚡ **Performance**: Tokens per second across different sequence lengths
231 | - 🎯 **Quality**: Perplexity measurement using llama-perplexity
232 | - 📈 **Scaling**: How memory usage and performance scale with sequence length
233 |
234 | Results are saved in CSV/JSON formats with automatic summary statistics, and the visualization script generates publication-quality plots showing key insights.
235 |
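If you want to slice the raw numbers yourself, the CSV written by `scripts/benchmark_kvsplit.py` includes columns such as `Configuration`, `Sequence_Length`, `VRAM_Usage_MB`, `Throughput_Tokens_Per_Sec`, and `Perplexity`. A minimal pandas sketch (the results filename below is illustrative; use whatever the run produced in `results/`):

```python
import pandas as pd

df = pd.read_csv("results/benchmark_results.csv")  # illustrative filename

summary = (df[df["Success"]]
           .groupby(["Configuration", "Sequence_Length"])
           .agg(vram_mb=("VRAM_Usage_MB", "mean"),
                tok_per_s=("Throughput_Tokens_Per_Sec", "mean"),
                perplexity=("Perplexity", "mean"))
           .round(2))
print(summary)
```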
240 | ## 🎬 Visual Memory Savings
241 |
242 | You can visualize memory savings with our capture tool:
243 |
244 | ```bash
245 | # Capture memory reduction in Activity Monitor
246 | ./scripts/capture_memory.sh
247 | ```
248 |
249 |
261 |
262 | ## 🍎 Apple Silicon Optimization
263 |
264 | - **Metal Performance**: Fully optimized for Apple's Metal framework
265 | - **Memory Efficiency**: Critical for memory-constrained M-series Apple Silicon devices
266 | - **Activity Monitor**: Use our `capture_memory.sh` script to visualize real-time memory reductions
267 | - **Alignment**: llama.cpp allocates KV buffers with 256B page alignment, so measured savings can differ slightly from the theoretical numbers (see the sketch below)
268 |
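To make the alignment point concrete, here is a tiny sketch of the rounding involved (the 256B figure comes from the note above; per-buffer padding is at most 255 bytes, so MB-scale caches stay within a fraction of a percent of theory):

```python
import math

def aligned_size(nbytes, align=256):
    # Round a buffer size up to the allocator's next 256B boundary
    return math.ceil(nbytes / align) * align

print(aligned_size(int(71.50 * 1024**2)) / 1024**2)  # ≈ 71.50 MB for K8V4 @ 8K
```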
269 | ## ⭐ Key Features
270 |
271 | - **Differentiated Precision**: Independent key and value bit precision (K8V4, K4V8, etc.)
272 | - **Apple Silicon Optimization**: Full Metal support for M1/M2/M3/M4 chips
273 | - **Comprehensive Benchmarking**: Memory, speed, and quality metrics
274 | - **Publication-Quality Visualization**: Beautiful plots for analysis
275 | - **Simple User Interface**: One-command install and quick comparison tools
276 | - **Memory Visualization**: Tools to capture and visualize memory savings
277 |
278 | ## 🙏 Acknowledgments
279 |
280 | This project implements ideas from recent research including:
281 | - "More for Keys, Less for Values: Adaptive KV Cache Quantization" (2024)
282 | - "Unifying KV Cache Compression for Large Language Models with LeanKV" (2025)
283 |
284 | Additional credits:
285 | - [llama.cpp](https://github.com/ggerganov/llama.cpp) - Base implementation
286 | - [TinyLlama](https://huggingface.co/TinyLlama) - Test model
287 |
292 | ## 🧠 Configuration Recommendations
293 |
294 | - **Best Overall**: 🌟 **K8V4** 🌟 (8-bit keys, 4-bit values)
295 | - 59% memory reduction with only 0.86% quality loss
296 | - Improved inference speed (+5.7% vs FP16)
297 | - Great balance of quality and efficiency
298 |
299 | - **Absolute Maximum Memory Savings**: K4V4 (4-bit keys and values)
300 | - 72% memory reduction with ~6% quality loss
301 | - Good for memory-constrained devices
302 | - Acceptable for less sensitive applications
303 |
304 | - **Best for Very Long Contexts**: K8V4 or K4V4
305 | - Memory savings compound with context length
306 | - Run 2-3x longer contexts in the same memory budget
307 |
308 | ## 🔮 Future Roadmap
309 |
310 | - [ ] **Adaptive Precision**: Dynamic precision based on token importance
311 | - [ ] **Layer-Specific Quantization**: Different precision for different model layers
312 | - [ ] **Model-Specific Optimizations**: Tailored for Mistral, Phi-3, etc.
313 | - [ ] **Web Demo**: Interactive testing environment
314 | - [ ] **Mobile Support**: Adapting for iOS and iPadOS
315 |
316 | ## 📜 License
317 |
318 | MIT
319 |
320 | ## 🤝 Contributing
321 |
322 | Contributions are welcome! Please open an issue or submit a pull request.
323 |
--------------------------------------------------------------------------------
/models/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dipampaul17/KVSplit/887f0b0408a644936c4f5ef8901331988e5ad606/models/.gitkeep
--------------------------------------------------------------------------------
/patch/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dipampaul17/KVSplit/887f0b0408a644936c4f5ef8901331988e5ad606/patch/.gitkeep
--------------------------------------------------------------------------------
/patch/fixed_kv_patch.diff:
--------------------------------------------------------------------------------
1 | diff --git a/common/common.h b/common/common.h
2 | index abcdef1..1234567 100644
3 | --- a/common/common.h
4 | +++ b/common/common.h
5 | @@ -100,6 +100,10 @@ struct gpt_params {
6 | // use custom KV quantization for keys
7 | enum llama_kv_cache_type cache_type_k = LLAMA_KV_CACHE_TYPE_F16;
8 |
9 | + // use custom KV quantization for values
10 | + enum llama_kv_cache_type cache_type_v = LLAMA_KV_CACHE_TYPE_F16;
11 | +
12 | +
13 | };
14 |
15 | // for CLI arguments that are common across all models
16 | diff --git a/common/common.cpp b/common/common.cpp
17 | index abcdef1..1234567 100644
18 | --- a/common/common.cpp
19 | +++ b/common/common.cpp
20 | @@ -1290,6 +1290,30 @@ struct cli_params {
21 | "KV cache quantization for keys. If not specified, defaults to F16",
22 | {"--cache-type-k", "-ctk"}
23 | );
24 | +
25 | + add_param(
26 | +    &params.cache_type_v,
27 | + [](enum llama_kv_cache_type & val, const std::string & arg) {
28 | + val = llama_model_kv_cache_type_from_str(arg.c_str());
29 | + if (val == LLAMA_KV_CACHE_TYPE_COUNT) {
30 | + return CLI_PARAM_CONVERSION_ERROR;
31 | + }
32 | + return CLI_PARAM_CONVERSION_OK;
33 | + },
34 | + "KV cache quantization for values. If not specified, defaults to F16",
35 | + {"--cache-type-v", "-ctv"}
36 | + );
37 | +
38 | + // Combined KV cache quantization (sets both key and value)
39 | + add_param(
40 | + [&](const std::string & arg) {
41 | + enum llama_kv_cache_type val = llama_model_kv_cache_type_from_str(arg.c_str());
42 | + if (val == LLAMA_KV_CACHE_TYPE_COUNT) {
43 | + return CLI_PARAM_CONVERSION_ERROR;
44 | + }
45 | + params.cache_type_k = params.cache_type_v = val;
46 | + return CLI_PARAM_CONVERSION_OK;
47 | + },
48 | + "--kvq", "-kvq"
49 | + );
50 | }
51 |
52 |
--------------------------------------------------------------------------------
/patch/kvsplit.patch:
--------------------------------------------------------------------------------
1 | --- a/common/arg.cpp
2 | +++ b/common/arg.cpp
3 | @@ -802,17 +802,31 @@
4 |
5 | // KV Cache quantization types
6 |  const std::vector<ggml_type> kv_cache_types = {
7 | - GGML_TYPE_F16,
8 | - GGML_TYPE_F32,
9 | - GGML_TYPE_Q8_0,
10 | - GGML_TYPE_Q4_0,
11 | + GGML_TYPE_F16, // Default (FP16)
12 | + GGML_TYPE_F32, // Full precision (FP32)
13 | + GGML_TYPE_Q8_0, // 8-bit quantization
14 | + GGML_TYPE_Q4_0, // 4-bit quantization
15 | +};
16 | +
17 | +// Mapping of bit sizes to quantization types
18 | +const std::unordered_map<int, ggml_type> kv_quant_bit_to_type = {
19 | + {16, GGML_TYPE_F16}, // 16-bit = FP16
20 | + {32, GGML_TYPE_F32}, // 32-bit = FP32
21 | + {8, GGML_TYPE_Q8_0}, // 8-bit = Q8_0
22 | + {4, GGML_TYPE_Q4_0}, // 4-bit = Q4_0
23 | };
24 |
25 | static ggml_type kv_cache_type_from_str(const std::string & s) {
26 | for (const auto & type : kv_cache_types) {
27 | if (s == ggml_type_name(type)) return type;
28 | }
29 | - return GGML_TYPE_COUNT; //invalid
30 | +
31 | + // Also try parsing bit sizes (4 or 8)
32 | + try {
33 | + int bits = std::stoi(s);
34 | + if (kv_quant_bit_to_type.find(bits) != kv_quant_bit_to_type.end()) {
35 | + return kv_quant_bit_to_type.at(bits);
36 | + }
37 | + } catch (...) {}
38 | +
39 | + return GGML_TYPE_COUNT; // invalid
40 | }
41 |
42 | static std::string get_all_kv_cache_types() {
43 | @@ -823,6 +837,30 @@ static std::string get_all_kv_cache_types() {
44 | return msg.str();
45 | }
46 |
47 | +static std::string get_kv_quant_bit_options() {
48 | + // Return the supported bit sizes only (for --kvq-key and --kvq-val)
49 | + std::stringstream msg;
50 | + bool first = true;
51 | + for (const auto& pair : kv_quant_bit_to_type) {
52 | + if (!first) {
53 | + msg << ", ";
54 | + }
55 | + msg << pair.first;
56 | + first = false;
57 | + }
58 | + return msg.str();
59 | +}
60 | +
61 | +// Helper to convert bit size to quantization type
62 | +static ggml_type kv_quant_bits_to_type(int bits) {
63 | + auto it = kv_quant_bit_to_type.find(bits);
64 | + if (it != kv_quant_bit_to_type.end()) {
65 | + return it->second;
66 | + }
67 | + // Default to FP16 if invalid
68 | + return GGML_TYPE_F16;
69 | +}
70 | +
71 | static ggml_backend_buffer_type_config ggml_backend_buffer_type_config_from_str(const std::string & s) {
72 | auto config = ggml_backend_buffer_type_config_init();
73 |
74 | @@ -2086,6 +2124,40 @@ static void common_params_parser_init(arg_parser & parser, common_params & param
75 | ).set_env("LLAMA_ARG_CACHE_TYPE_K"));
76 |
77 | parser.add_arg(
78 | + add_arg_type::opt, "kvq-key", &kv_quant_key,
79 | + add_arg_handler([&](std::string_view value) {
80 | + try {
81 | + int bits = std::stoi(std::string(value));
82 | + // Set key cache quantization type
83 | + if (kv_quant_bit_to_type.find(bits) != kv_quant_bit_to_type.end()) {
84 | + params.cache_type_k = kv_quant_bit_to_type.at(bits);
85 | + } else {
86 | + LOG_ERROR("Invalid KV cache key quantization bits: %d (valid options: %s)\n",
87 | + bits, get_kv_quant_bit_options().c_str());
88 | + return false;
89 | + }
90 | + } catch (...) {
91 | + LOG_ERROR("Invalid KV cache key quantization bits: '%s' (valid options: %s)\n",
92 | + std::string(value).c_str(), get_kv_quant_bit_options().c_str());
93 | + return false;
94 | + }
95 | + return true;
96 | + }, [&]() -> std::string { return ""; }),
97 | + "",
98 | + "Set KV cache key quantization bits (options: " + get_kv_quant_bit_options() + ")"
99 | + ).set_env("LLAMA_ARG_KVQ_KEY");
100 | +
101 | + parser.add_arg(
102 | + add_arg_type::opt, "kvq-val", &kv_quant_val,
103 | + add_arg_handler([&](std::string_view value) {
104 | + try {
105 | + int bits = std::stoi(std::string(value));
106 | + // Set value cache quantization type
107 | + if (kv_quant_bit_to_type.find(bits) != kv_quant_bit_to_type.end()) {
108 | + params.cache_type_v = kv_quant_bit_to_type.at(bits);
109 | + } else {
110 | + LOG_ERROR("Invalid KV cache value quantization bits: %d (valid options: %s)\n",
111 | + bits, get_kv_quant_bit_options().c_str());
112 | + return false;
113 | + }
114 | + } catch (...) {
115 | + LOG_ERROR("Invalid KV cache value quantization bits: '%s' (valid options: %s)\n",
116 | + std::string(value).c_str(), get_kv_quant_bit_options().c_str());
117 | + return false;
118 | + }
119 | + return true;
120 | + }, [&]() -> std::string { return ""; }),
121 | + "",
122 | + "Set KV cache value quantization bits (options: " + get_kv_quant_bit_options() + ")"
123 | + ).set_env("LLAMA_ARG_KVQ_VAL");
124 | +
125 | + parser.add_arg(
126 | + add_arg_type::opt, "kvq", &kv_quant_general,
127 | + add_arg_handler([&](std::string_view value) {
128 | + try {
129 | + int bits = std::stoi(std::string(value));
130 | + // Set both key and value cache quantization to the same type for backwards compatibility
131 | + if (kv_quant_bit_to_type.find(bits) != kv_quant_bit_to_type.end()) {
132 | + params.cache_type_k = kv_quant_bit_to_type.at(bits);
133 | + params.cache_type_v = kv_quant_bit_to_type.at(bits);
134 | + } else {
135 | + LOG_ERROR("Invalid KV cache quantization bits: %d (valid options: %s)\n",
136 | + bits, get_kv_quant_bit_options().c_str());
137 | + return false;
138 | + }
139 | + } catch (...) {
140 | + LOG_ERROR("Invalid KV cache quantization bits: '%s' (valid options: %s)\n",
141 | + std::string(value).c_str(), get_kv_quant_bit_options().c_str());
142 | + return false;
143 | + }
144 | + return true;
145 | + }, [&]() -> std::string { return ""; }),
146 | + "",
147 | + "Set both KV cache key and value quantization bits (options: " + get_kv_quant_bit_options() + ")"
148 | + ).set_env("LLAMA_ARG_KVQ");
149 | +
150 | + parser.add_arg(
151 | add_arg_type::opt, "cache-type-v",
152 | add_arg_handler([&](std::string_view value) {
153 | params.cache_type_v = kv_cache_type_from_str(value);
154 | @@ -2097,6 +2169,34 @@ static void common_params_parser_init(arg_parser & parser, common_params & param
155 | "Type for V cache: " + get_all_kv_cache_types() + " (default: " + ggml_type_name(params.cache_type_v) + ")"
156 | ).set_env("LLAMA_ARG_CACHE_TYPE_V"));
157 |
158 | + // Add variable declarations at the top of the function
159 | +}
160 | +
161 | +// Add these declarations before the common_params_parser_init function
162 | +static std::string kv_quant_key;
163 | +static std::string kv_quant_val;
164 | +static std::string kv_quant_general;
165 | +
166 | +// Add this to the llama-kv-cache.cpp file to document memory usage impact
167 | +/**
168 | + * Note: Memory measurements may show slight differences from theoretical calculations
169 | + * due to 256B page alignment in the llama.cpp allocator.
170 | + *
171 | + * When using differentiated precision for keys and values:
172 | + * - Keys are more sensitive to quantization than values
173 | + * - K8V4 (8-bit keys, 4-bit values) generally provides a good balance
174 | + * - Using K4V8 (4-bit keys, 8-bit values) typically reduces quality more
175 | + * - Configuration performance will vary by model, hardware, and usage
176 | + *
177 | + * Examples:
178 | + * --kvq-key 8 --kvq-val 4 # Use 8-bit keys, 4-bit values
179 | + * --kvq-key 4 --kvq-val 8 # Use 4-bit keys, 8-bit values
180 | + * --kvq 4 # Use 4-bit for both keys and values (equivalent to --kvq-key 4 --kvq-val 4)
181 | + */
182 |
--------------------------------------------------------------------------------
/patch/kvsplit_fixed.patch:
--------------------------------------------------------------------------------
1 | diff --git a/common/arg.cpp b/common/arg.cpp
2 | index 0000000..0000000 100644
3 | --- a/common/arg.cpp
4 | +++ b/common/arg.cpp
5 | @@ -802,17 +802,31 @@ static std::vector arch_vec() {
6 |
7 | // KV Cache quantization types
8 |  const std::vector<ggml_type> kv_cache_types = {
9 | - GGML_TYPE_F16,
10 | - GGML_TYPE_F32,
11 | - GGML_TYPE_Q8_0,
12 | - GGML_TYPE_Q4_0,
13 | + GGML_TYPE_F16, // Default (FP16)
14 | + GGML_TYPE_F32, // Full precision (FP32)
15 | + GGML_TYPE_Q8_0, // 8-bit quantization
16 | + GGML_TYPE_Q4_0, // 4-bit quantization
17 | +};
18 | +
19 | +// Mapping of bit sizes to quantization types
20 | +const std::unordered_map<int, ggml_type> kv_quant_bit_to_type = {
21 | + {16, GGML_TYPE_F16}, // 16-bit = FP16
22 | + {32, GGML_TYPE_F32}, // 32-bit = FP32
23 | + {8, GGML_TYPE_Q8_0}, // 8-bit = Q8_0
24 | + {4, GGML_TYPE_Q4_0}, // 4-bit = Q4_0
25 | };
26 |
27 | static ggml_type kv_cache_type_from_str(const std::string & s) {
28 | for (const auto & type : kv_cache_types) {
29 | if (s == ggml_type_name(type)) return type;
30 | }
31 | - return GGML_TYPE_COUNT; //invalid
32 | +
33 | + // Also try parsing bit sizes (4 or 8)
34 | + try {
35 | + int bits = std::stoi(s);
36 | + if (kv_quant_bit_to_type.find(bits) != kv_quant_bit_to_type.end()) {
37 | + return kv_quant_bit_to_type.at(bits);
38 | + }
39 | + } catch (...) {}
40 | +
41 | + return GGML_TYPE_COUNT; // invalid
42 | }
43 |
44 | static std::string get_all_kv_cache_types() {
45 | @@ -823,6 +837,30 @@ static std::string get_all_kv_cache_types() {
46 | return msg.str();
47 | }
48 |
49 | +static std::string get_kv_quant_bit_options() {
50 | + // Return the supported bit sizes only (for --kvq-key and --kvq-val)
51 | + std::stringstream msg;
52 | + bool first = true;
53 | + for (const auto& pair : kv_quant_bit_to_type) {
54 | + if (!first) {
55 | + msg << ", ";
56 | + }
57 | + msg << pair.first;
58 | + first = false;
59 | + }
60 | + return msg.str();
61 | +}
62 | +
63 | +// Helper to convert bit size to quantization type
64 | +static ggml_type kv_quant_bits_to_type(int bits) {
65 | + auto it = kv_quant_bit_to_type.find(bits);
66 | + if (it != kv_quant_bit_to_type.end()) {
67 | + return it->second;
68 | + }
69 | + // Default to FP16 if invalid
70 | + return GGML_TYPE_F16;
71 | +}
72 | +
73 | static ggml_backend_buffer_type_config ggml_backend_buffer_type_config_from_str(const std::string & s) {
74 | auto config = ggml_backend_buffer_type_config_init();
75 |
76 | @@ -2086,6 +2124,43 @@ static void common_params_parser_init(arg_parser & parser, common_params & param
77 | ).set_env("LLAMA_ARG_CACHE_TYPE_K"));
78 |
79 | parser.add_arg(
80 | + add_arg_type::opt, "kvq-key", &kv_quant_key,
81 | + add_arg_handler([&](std::string_view value) {
82 | + try {
83 | + int bits = std::stoi(std::string(value));
84 | + // Set key cache quantization type
85 | + if (kv_quant_bit_to_type.find(bits) != kv_quant_bit_to_type.end()) {
86 | + params.cache_type_k = kv_quant_bit_to_type.at(bits);
87 | + } else {
88 | + LOG_ERROR("Invalid KV cache key quantization bits: %d (valid options: %s)\n",
89 | + bits, get_kv_quant_bit_options().c_str());
90 | + return false;
91 | + }
92 | + } catch (...) {
93 | + LOG_ERROR("Invalid KV cache key quantization bits: '%s' (valid options: %s)\n",
94 | + std::string(value).c_str(), get_kv_quant_bit_options().c_str());
95 | + return false;
96 | + }
97 | + return true;
98 | + }, [&]() -> std::string { return ""; }),
99 | + "",
100 | + "Set KV cache key quantization bits (options: " + get_kv_quant_bit_options() + ")"
101 | + ).set_env("LLAMA_ARG_KVQ_KEY");
102 | +
103 | + parser.add_arg(
104 | + add_arg_type::opt, "kvq-val", &kv_quant_val,
105 | + add_arg_handler([&](std::string_view value) {
106 | + try {
107 | + int bits = std::stoi(std::string(value));
108 | + // Set value cache quantization type
109 | + if (kv_quant_bit_to_type.find(bits) != kv_quant_bit_to_type.end()) {
110 | + params.cache_type_v = kv_quant_bit_to_type.at(bits);
111 | + } else {
112 | + LOG_ERROR("Invalid KV cache value quantization bits: %d (valid options: %s)\n",
113 | + bits, get_kv_quant_bit_options().c_str());
114 | + return false;
115 | + }
116 | + } catch (...) {
117 | + LOG_ERROR("Invalid KV cache value quantization bits: '%s' (valid options: %s)\n",
118 | + std::string(value).c_str(), get_kv_quant_bit_options().c_str());
119 | + return false;
120 | + }
121 | + return true;
122 | + }, [&]() -> std::string { return ""; }),
123 | + "",
124 | + "Set KV cache value quantization bits (options: " + get_kv_quant_bit_options() + ")"
125 | + ).set_env("LLAMA_ARG_KVQ_VAL");
126 | +
127 | + parser.add_arg(
128 | + add_arg_type::opt, "kvq", &kv_quant_general,
129 | + add_arg_handler([&](std::string_view value) {
130 | + try {
131 | + int bits = std::stoi(std::string(value));
132 | + // Set both key and value cache quantization to the same type for backwards compatibility
133 | + if (kv_quant_bit_to_type.find(bits) != kv_quant_bit_to_type.end()) {
134 | + params.cache_type_k = kv_quant_bit_to_type.at(bits);
135 | + params.cache_type_v = kv_quant_bit_to_type.at(bits);
136 | + } else {
137 | + LOG_ERROR("Invalid KV cache quantization bits: %d (valid options: %s)\n",
138 | + bits, get_kv_quant_bit_options().c_str());
139 | + return false;
140 | + }
141 | + } catch (...) {
142 | + LOG_ERROR("Invalid KV cache quantization bits: '%s' (valid options: %s)\n",
143 | + std::string(value).c_str(), get_kv_quant_bit_options().c_str());
144 | + return false;
145 | + }
146 | + return true;
147 | + }, [&]() -> std::string { return ""; }),
148 | + "",
149 | + "Set both KV cache key and value quantization bits (options: " + get_kv_quant_bit_options() + ")"
150 | + ).set_env("LLAMA_ARG_KVQ");
151 | +
152 | + parser.add_arg(
153 | add_arg_type::opt, "cache-type-v",
154 | add_arg_handler([&](std::string_view value) {
155 | params.cache_type_v = kv_cache_type_from_str(value);
156 | @@ -2096,5 +2171,17 @@ static void common_params_parser_init(arg_parser & parser, common_params & param
157 | "Type for V cache: " + get_all_kv_cache_types() + " (default: " + ggml_type_name(params.cache_type_v) + ")"
158 | ).set_env("LLAMA_ARG_CACHE_TYPE_V"));
159 | }
160 |
161 | +// Declarations for our new KV split parameters
162 | +static std::string kv_quant_key;
163 | +static std::string kv_quant_val;
164 | +static std::string kv_quant_general;
165 | +
166 | +/*
167 | + * Note: Memory measurements may show slight differences from theoretical calculations
168 | + * due to 256B page alignment in the llama.cpp allocator.
169 | + *
170 | + * When using differentiated precision for keys and values:
171 | + * - Keys are more sensitive to quantization than values
172 | + * - K8V4 (8-bit keys, 4-bit values) generally provides a good balance
173 | + */
174 |
--------------------------------------------------------------------------------
/patch/split_kv_quant.diff:
--------------------------------------------------------------------------------
1 | diff --git a/common/common.cpp b/common/common.cpp
2 | index abcdef1..1234567 100644
3 | --- a/common/common.cpp
4 | +++ b/common/common.cpp
5 | @@ -123,6 +123,18 @@ void common_params_parser_init(common_params * params) {
6 | params->cache_type_v = llama_model_kv_cache_type_from_str(value.c_str());
7 | });
8 | }
9 | +
10 | + {
11 | + const auto & argp = gpt_params_args.add_arg({
12 | + "--kvq", "-kvq"
13 | + }, "BITS", "Set both KV cache key and value quantization to same bits\nallowed values: 4, 8\n(default: 16 for FP16)");
14 | + argp.action = [&](const std::string & value) {
15 | + try {
16 | + int bits = std::stoi(value);
17 | + params->cache_type_k = bits == 4 ? LLAMA_KV_CACHE_TYPE_Q4_0 : LLAMA_KV_CACHE_TYPE_Q8_0;
18 | + params->cache_type_v = bits == 4 ? LLAMA_KV_CACHE_TYPE_Q4_0 : LLAMA_KV_CACHE_TYPE_Q8_0;
19 | + } catch (const std::exception & e) {}
20 | + };
21 | + }
22 |
23 | // Add batching arguments
24 | {
25 |
--------------------------------------------------------------------------------
/perplexity_test_data.txt:
--------------------------------------------------------------------------------
1 | The meaning of life is a profound philosophical question that has been debated for centuries by thinkers across cultures and disciplines. Some philosophers argue that meaning comes from purpose, while others suggest it emerges from human relationships and connections. Religious perspectives often point to divine purpose or spiritual fulfillment, whereas existentialists like Sartre propose that we must create our own meaning in an otherwise indifferent universe. Scientific materialists might suggest that seeking meaning itself is a biological adaptation that helped our species survive and thrive. In psychology, Viktor Frankl's logotherapy suggests that meaning comes from three sources: purposeful work, loving relationships, and the courage to endure suffering with dignity. Modern perspectives often blend these viewpoints, recognizing that meaning is both personally constructed and socially influenced.
2 |
3 | The theory of relativity explains the relationship between space and time, revolutionizing our understanding of physics. Einstein's work showed that time and space are not absolute but relative to the observer's frame of reference. This challenged Newton's laws by demonstrating that the speed of light is constant regardless of the observer's motion. Special relativity revealed that time dilates and space contracts for objects moving at high velocities. General relativity extended these insights to explain gravity as the curvature of spacetime caused by mass and energy. This framework predicted phenomena like gravitational waves and black holes, which were confirmed decades later through advanced observational techniques. The precise predictions made by relativity theory have proven accurate in numerous experiments and technological applications, including GPS satellites that must account for relativistic effects to maintain accuracy.
4 |
5 | The history of artificial intelligence begins with ancient myths of mechanical beings and philosophical inquiries about thinking machines. The field formally emerged in the mid-20th century, with the 1956 Dartmouth Conference marking its official birth. Early AI focused on symbolic logic and rule-based systems, aiming to replicate human reasoning. The field experienced cycles of optimism and disappointment known as "AI winters" when progress failed to meet expectations. The 1980s saw the rise of expert systems, while the 1990s brought renewed interest in neural networks. Machine learning gained momentum in the 2000s as computational power increased and datasets grew. Deep learning breakthroughs in 2012 catalyzed the current AI boom, enabling significant advances in image recognition, natural language processing, and reinforcement learning. Contemporary AI systems can generate creative content, drive vehicles, diagnose diseases, and beat human champions at complex games. The field continues to evolve rapidly, raising important questions about ethics, governance, and the future relationship between humans and intelligent machines.
6 |
7 | The relationship between quantum mechanics and gravity is one of the greatest unsolved problems in physics. Standard quantum mechanics has successfully explained the behavior of particles at microscopic scales, while Einstein's general relativity accurately describes gravity at cosmic scales. However, attempts to unify these theories have proven enormously challenging. Quantum field theory works for three fundamental forces—electromagnetic, strong, and weak—but breaks down when applied to gravity. String theory proposes that fundamental particles are actually tiny vibrating strings in multiple dimensions, potentially reconciling quantum effects with gravitational force. Loop quantum gravity suggests space itself has a discrete, quantum structure at the Planck scale. Other approaches include causal set theory, asymptotic safety, and emergent gravity models. Experiments to detect quantum gravitational effects are extraordinarily difficult due to the weakness of gravity at quantum scales. Black holes represent crucial theoretical laboratories, as they require both quantum and gravitational descriptions. Hawking radiation—thermal emission from black holes—highlights the theoretical conflicts between these frameworks. Scientists hope advanced experiments like gravitational wave detectors might eventually provide empirical guidance for this theoretical quest.
8 |
9 | In the beginning of the universe, according to the prevailing Big Bang theory, all matter, energy, space, and time were compressed into an infinitesimally small, infinitely dense point called a singularity. Approximately 13.8 billion years ago, this singularity began expanding rapidly in an event known as the Big Bang. During the first fraction of a second, the universe underwent inflation, expanding faster than the speed of light. As the universe cooled, fundamental forces separated, and elementary particles formed. Within minutes, hydrogen and helium nuclei emerged, though it would take about 380,000 years for the universe to cool enough for neutral atoms to form. This moment released the cosmic microwave background radiation still detectable today. For millions of years, the universe remained dark until gravity gradually pulled matter together, forming the first stars roughly 100-200 million years after the Big Bang. These early stars, much larger and shorter-lived than our sun, produced heavier elements through nuclear fusion and subsequent supernova explosions, seeding the universe with the building blocks necessary for planets and life. Galaxies began forming as gravity drew stars together, with our own Milky Way taking shape around 13.6 billion years ago. The solar system, including Earth, formed much later, approximately 4.6 billion years ago, from the remnants of previous stellar generations.
10 |
11 | The future of human civilization depends on our collective ability to address multiple interconnected challenges while leveraging unprecedented technological capabilities. Climate change presents existential threats through rising sea levels, extreme weather events, agricultural disruptions, and ecosystem collapse. Developing sustainable energy systems, implementing carbon capture technologies, and redesigning infrastructure will be crucial for maintaining environmental stability. Artificial intelligence and automation promise incredible productivity gains but require thoughtful governance to ensure benefits are widely shared and risks mitigated. Biotechnology advances may extend human lifespans, eliminate diseases, and enhance cognitive and physical capabilities, fundamentally altering the human condition. Space exploration could provide resources, habitats, and insurance against Earth-based catastrophes. Social factors will be equally important: reducing inequality, strengthening democratic institutions, resolving geopolitical conflicts, and developing ethical frameworks for emerging technologies. Educational systems must evolve to prepare people for rapidly changing circumstances. The civilization that emerges from these transformations may differ radically from today's, potentially spreading beyond Earth to become multi-planetary or even interstellar, with artificial intelligence and enhanced humans collaborating in ways currently difficult to imagine.
12 |
13 | To understand consciousness, we must first recognize its multifaceted nature spanning subjective experience, self-awareness, and integrated information processing. Neuroscience has identified neural correlates of consciousness, particularly in the brain's thalamocortical system, where synchronous activity across regions appears to generate coherent experiences. However, explaining how physical processes produce subjective experiences—what philosopher David Chalmers calls the "hard problem"—remains elusive. Various theoretical frameworks attempt to bridge this explanatory gap. Integrated Information Theory proposes that consciousness arises from complex information integration in neural systems. Global Workspace Theory suggests consciousness emerges when information becomes broadly accessible across brain regions. Predictive Processing models frame consciousness as hierarchical prediction systems constructing our perceptual reality. Some researchers explore quantum effects in microtubules as potential consciousness generators. Philosophical positions range from physicalism (consciousness is entirely physical) to dualism (consciousness is separate from physical reality) to panpsychism (consciousness is fundamental and ubiquitous). Comparative studies across species, developmental psychology, meditation research, psychedelic science, and artificial intelligence all offer complementary insights. Advances in brain imaging, computational modeling, and theoretical frameworks suggest understanding consciousness may require integrating multiple levels of analysis, from molecular processes to emergent properties of complex systems.
14 |
15 | The evolution of language shows that communication systems can develop remarkable complexity through incremental changes driven by both biological and cultural factors. Human language likely evolved from simpler communication systems used by our ancestors, with genetic adaptations providing the necessary cognitive and anatomical foundations. Brain changes created neural circuits capable of processing complex syntax, while anatomical modifications to the vocal tract enabled articulation of diverse sounds. Archaeological evidence suggests symbolic communication emerged at least 100,000 years ago, with fully modern language capabilities present by 50,000 years ago, coinciding with a flourishing of human culture and technology. Unlike animal communication, human language is characterized by recursion, displacement (discussing absent objects), and infinite generativity. Comparative animal studies reveal precursors to language in other species, such as alarm calls in monkeys and song learning in birds. Once established, languages evolved primarily through cultural transmission, diversifying into thousands of languages. Historical linguistics demonstrates systematic sound changes, grammatical shifts, and vocabulary evolution across generations. Languages constantly adapt to cognitive constraints, social needs, and technological contexts—as seen in the development of writing systems, standardization processes, and digital communication. Contemporary research integrates evolutionary biology, neuroscience, anthropology, and computational modeling to understand this uniquely human capacity that fundamentally shapes our experience and enables cultural accumulation across generations.
16 |
17 | The optimal economic system would balance efficiency, equity, innovation, sustainability, and human flourishing. Pure market economies excel at innovation and resource allocation but often generate inequality and environmental externalities. Command economies can ensure basic necessities and reduce inequality but typically struggle with efficiency and personal freedom. Most successful modern economies blend elements of both approaches, with markets driving production and innovation while government provides public goods, regulates excesses, and ensures social safety nets. Different nations emphasize different aspects of this mix: Nordic countries combine competitive markets with robust welfare systems, while Singapore pairs strong state direction with market mechanisms. Beyond this spectrum, alternative models include worker cooperatives, participatory economics, and various forms of local exchange systems. Digital technologies offer new possibilities through platform cooperativism, decentralized autonomous organizations, and blockchain-based economic systems. The optimal system would likely incorporate mechanisms for addressing climate change, technological disruption, and changing workforce dynamics. It would value unpaid labor like caregiving, maintain living standards while reducing environmental impacts, and harness technology to enhance rather than replace human capabilities. Perhaps most importantly, it would remain adaptive, allowing democratic experimentation and evolution as circumstances change. The ideal system may not be universal but contextual—varying based on cultural values, development stages, and available resources across different societies.
18 |
19 | The nature of reality suggests that our everyday perceptions capture only a limited slice of what exists. Physics reveals a cosmos governed by quantum mechanics at tiny scales—where particles exist in probability clouds and can be entangled across vast distances—and by relativity at cosmic scales, where space and time form a unified fabric warped by mass and energy. Most of what exists appears to be dark matter and dark energy, invisible to direct observation. Our sensory systems evolved to detect only information relevant to survival, not to perceive ultimate reality. The brain constructs our experience from limited sensory data, filling gaps and creating coherent narratives. This suggests reality may be fundamentally different from our perceptions of it. Some philosophers propose reality might be information-based rather than material, with physical laws emerging from computational processes. Others suggest consciousness might be fundamental rather than emergent. Scientific realists maintain that scientific theories increasingly approximate objective reality, while constructivists emphasize how social and cultural factors shape our understanding. What seems clear is that reality encompasses levels of complexity beyond intuitive human understanding, requiring mathematical models and conceptual frameworks that often contradict common sense. Our most advanced theories remain incomplete, suggesting reality's true nature may be stranger and more fascinating than we can currently comprehend.
20 |
--------------------------------------------------------------------------------
/plots/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dipampaul17/KVSplit/887f0b0408a644936c4f5ef8901331988e5ad606/plots/.gitkeep
--------------------------------------------------------------------------------
/plots/configuration_summary.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dipampaul17/KVSplit/887f0b0408a644936c4f5ef8901331988e5ad606/plots/configuration_summary.png
--------------------------------------------------------------------------------
/plots/inference_speed.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dipampaul17/KVSplit/887f0b0408a644936c4f5ef8901331988e5ad606/plots/inference_speed.png
--------------------------------------------------------------------------------
/plots/key_value_sensitivity.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dipampaul17/KVSplit/887f0b0408a644936c4f5ef8901331988e5ad606/plots/key_value_sensitivity.png
--------------------------------------------------------------------------------
/plots/kv_cache_memory_usage.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dipampaul17/KVSplit/887f0b0408a644936c4f5ef8901331988e5ad606/plots/kv_cache_memory_usage.png
--------------------------------------------------------------------------------
/plots/memory_vs_quality.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dipampaul17/KVSplit/887f0b0408a644936c4f5ef8901331988e5ad606/plots/memory_vs_quality.png
--------------------------------------------------------------------------------
/plots/perplexity_change.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dipampaul17/KVSplit/887f0b0408a644936c4f5ef8901331988e5ad606/plots/perplexity_change.png
--------------------------------------------------------------------------------
/results/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dipampaul17/KVSplit/887f0b0408a644936c4f5ef8901331988e5ad606/results/.gitkeep
--------------------------------------------------------------------------------
/scripts/benchmark_kvsplit.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | # -*- coding: utf-8 -*-
3 | """
4 | KVSplit Benchmark Script
5 |
6 | This script benchmarks different KV cache quantization configurations
7 | for the KVSplit project. It tests each configuration sequentially to
8 | avoid GPU contention and captures memory usage, throughput, and perplexity.
9 |
10 | Configurations tested:
11 | - FP16 (baseline)
12 | - K8V8 (standard 8-bit quantization)
13 | - K8V4 (higher precision for keys)
14 | - K4V8 (reversed - to demonstrate key sensitivity)
15 | - K4V4 (standard 4-bit quantization)
16 |
17 | For each configuration, sequence lengths 128, 2048, 4096, and 8192 are tested,
18 | with each test repeated 3 times to average out run-to-run variance.
19 | """
20 |
21 | import os
22 | import sys
23 | import csv
24 | import time
25 | import json
26 | import datetime
27 | import subprocess
28 | import re
29 | import statistics
30 | import argparse
31 | import random
32 | import math
33 | from pathlib import Path
34 |
35 | # ANSI color codes for prettier output
36 | RED = '\033[0;31m'
37 | GREEN = '\033[0;32m'
38 | YELLOW = '\033[1;33m'
39 | BLUE = '\033[0;34m'
40 | MAGENTA = '\033[0;35m'
41 | CYAN = '\033[0;36m'
42 | RESET = '\033[0m'
43 |
44 | # Define the configurations to test
45 | CONFIGURATIONS = [
46 | {"name": "FP16", "key_bits": None, "val_bits": None, "flags": []}, # Baseline
47 | {"name": "K8V8", "key_bits": 8, "val_bits": 8, "flags": ["--kvq", "8"]},
48 | {"name": "K8V4", "key_bits": 8, "val_bits": 4, "flags": ["--kvq-key", "8", "--kvq-val", "4"]},
49 | {"name": "K4V8", "key_bits": 4, "val_bits": 8, "flags": ["--kvq-key", "4", "--kvq-val", "8"]},
50 | {"name": "K4V4", "key_bits": 4, "val_bits": 4, "flags": ["--kvq", "4"]},
51 | ]
52 |
53 | # Sequence lengths to test
54 | SEQUENCE_LENGTHS = [128, 2048, 4096, 8192]
55 |
56 | # Number of times to repeat each test
57 | REPEAT_COUNT = 3
58 |
59 | # Test prompts - different prompts to avoid caching effects between runs
60 | TEST_PROMPTS = [
61 | "The meaning of life is",
62 | "In the beginning of the universe",
63 | "The theory of relativity explains",
64 | "The history of artificial intelligence begins",
65 | "The relationship between quantum mechanics and gravity is",
66 | "The future of human civilization depends on",
67 | "To understand consciousness, we must first",
68 | "The evolution of language shows that",
69 | "The optimal economic system would balance",
70 | "The nature of reality suggests that",
71 | ]
72 |
73 | class BenchmarkResult:
74 | """Class to store benchmark results for a single test run"""
75 |
76 | def __init__(self, config_name, seq_len, run_num):
77 | self.config_name = config_name
78 | self.sequence_length = seq_len
79 | self.run_number = run_num
80 | self.vram_usage_mb = 0
81 | self.kv_cache_mb = 0
82 | self.k_cache_mb = 0
83 | self.v_cache_mb = 0
84 | self.throughput_tokens_per_sec = 0
85 | self.perplexity = 0
86 | self.time_to_first_token_ms = 0
87 | self.total_time_sec = 0
88 | self.timestamp = datetime.datetime.now().isoformat()
89 | self.success = False # Track if the run was successful
90 |
91 | def to_dict(self):
92 | """Convert the result to a dictionary for CSV export"""
93 | # Get K bits and V bits from configuration name
94 | k_bits = None
95 | v_bits = None
96 |
97 | # Extract bits from configuration name (e.g., K8V4 -> k_bits=8, v_bits=4)
98 | if self.config_name != "FP16":
99 | match = re.match(r"K(\d+)V(\d+)", self.config_name)
100 | if match:
101 | k_bits = int(match.group(1))
102 | v_bits = int(match.group(2))
103 |
104 | return {
105 | "Configuration": self.config_name,
106 | "K_bits": k_bits,
107 | "V_bits": v_bits,
108 | "Sequence_Length": self.sequence_length,
109 | "Run_Number": self.run_number,
110 | "Success": self.success,
111 | "VRAM_Usage_MB": self.vram_usage_mb,
112 | "KV_Cache_MB": self.kv_cache_mb,
113 | "K_Cache_MB": self.k_cache_mb,
114 | "V_Cache_MB": self.v_cache_mb,
115 | "Throughput_Tokens_Per_Sec": self.throughput_tokens_per_sec,
116 | "Perplexity": self.perplexity,
117 | "Time_To_First_Token_ms": self.time_to_first_token_ms,
118 | "Total_Time_Sec": self.total_time_sec,
119 | "Timestamp": self.timestamp,
120 | }
121 |
122 | class Benchmarker:
123 | """Class to run benchmarks and collect results"""
124 |
125 | def __init__(self, base_dir, llama_exec, model_path, output_dir):
126 | self.base_dir = base_dir
127 | self.llama_exec = Path(llama_exec).resolve()
128 | self.model_path = Path(model_path).resolve()
129 | self.output_dir = Path(output_dir).resolve()
130 | self.results = []
131 |
132 | # Create output directory if it doesn't exist
133 | self.output_dir.mkdir(parents=True, exist_ok=True)
134 |
135 | # Ensure we can run the model
136 | if not self.llama_exec.exists():
137 | raise FileNotFoundError(f"Llama executable not found at {self.llama_exec}")
138 | if not self.model_path.exists():
139 | raise FileNotFoundError(f"Model not found at {self.model_path}")
140 |
141 |         # Optionally apply the KVSplit patch (disabled by default; see _apply_patch)
142 |         self._apply_patch()
143 |
144 | print(f"{GREEN}Benchmarker initialized:{RESET}")
145 | print(f" - Llama executable: {self.llama_exec}")
146 | print(f" - Model: {self.model_path}")
147 | print(f" - Output directory: {self.output_dir}")
148 |
149 | def _apply_patch(self, apply=False):
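        """Optionally inject --kvq/--kvq-key/--kvq-val support into llama.cpp.

        Disabled by default (apply=False): the benchmark then relies on the
        parameters the installed build already understands. Pass apply=True to
        rewrite common/arg.cpp in place and rebuild.
        """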
150 | if not apply:
151 | print(f"{YELLOW}Skipping patch application - using existing parameters.{RESET}")
152 | return
153 |
154 | llama_cpp_dir = Path(self.base_dir) / "llama.cpp"
155 | arg_cpp_path = llama_cpp_dir / "common" / "arg.cpp"
156 |
157 | # Check if the file contains our parameter
158 | applied = False
159 | if arg_cpp_path.exists():
160 | with open(arg_cpp_path, 'r') as f:
161 | if "kv_quant_key;" in f.read():
162 | applied = True
163 |
164 | if applied:
165 | print(f"{YELLOW}KVSplit patch already applied.{RESET}")
166 | return
167 |
168 | print(f"{YELLOW}Applying KVSplit modifications...{RESET}")
169 | try:
170 | # Read the current file
171 | with open(arg_cpp_path, 'r') as f:
172 | content = f.read()
173 |
174 | # Add variable declarations at the top of the file
175 | if "static std::string kv_quant_key;" not in content:
176 | declarations = (
177 | "// KVSplit declarations\n"
178 | "static std::string kv_quant_key;\n"
179 | "static std::string kv_quant_val;\n"
180 | "static std::string kv_quant_general;\n\n"
181 | )
182 |                 # Insert just before the first "//" comment (i.e. right after the includes)
183 | import_end = content.find("//")
184 | if import_end > 0:
185 | content = content[:import_end] + declarations + content[import_end:]
186 |
187 | # Add the mapping from bit sizes to quantization types
188 |             kv_cache_types_pos = content.find("const std::vector<ggml_type> kv_cache_types")
189 |             if kv_cache_types_pos > 0:
190 |                 # Find the actual end of the array: the first "};" after the
191 |                 # array starts, not the first "};" anywhere in the file
192 |                 kv_cache_types_end = content.find("};", kv_cache_types_pos)
193 |                 if kv_cache_types_end > 0:
194 | # Insert after the kv_cache_types array
195 | bit_mapping = (
196 | "\n\n// Mapping of bit sizes to quantization types\n"
197 |                         "const std::unordered_map<int, ggml_type> kv_quant_bit_to_type = {\n"
198 | " {16, GGML_TYPE_F16}, // 16-bit = FP16\n"
199 | " {32, GGML_TYPE_F32}, // 32-bit = FP32\n"
200 | " {8, GGML_TYPE_Q8_0}, // 8-bit = Q8_0\n"
201 | " {4, GGML_TYPE_Q4_0}, // 4-bit = Q4_0\n"
202 | "};\n"
203 | )
204 | content = content[:kv_cache_types_end+2] + bit_mapping + content[kv_cache_types_end+2:]
205 |
206 | # Update the kv_cache_type_from_str function
207 | kv_cache_type_func = content.find("static ggml_type kv_cache_type_from_str")
208 | if kv_cache_type_func > 0:
209 | function_end = content.find("return GGML_TYPE_COUNT;")
210 | if function_end > 0:
211 | # Replace the return line with our enhanced version
212 | old_return = "return GGML_TYPE_COUNT; //invalid"
213 | new_return = (
214 | " // Also try parsing bit sizes (4 or 8)\n"
215 | " try {\n"
216 | " int bits = std::stoi(s);\n"
217 | " if (kv_quant_bit_to_type.find(bits) != kv_quant_bit_to_type.end()) {\n"
218 | " return kv_quant_bit_to_type.at(bits);\n"
219 | " }\n"
220 | " } catch (...) {}\n\n"
221 | " return GGML_TYPE_COUNT; // invalid"
222 | )
223 | content = content.replace(old_return, new_return)
224 |
225 | # Add helper functions
226 | get_all_kv_cache_types_end = content.find("static std::string get_all_kv_cache_types()")
227 | if get_all_kv_cache_types_end > 0:
228 | function_end = content.find("}", get_all_kv_cache_types_end)
229 | function_end = content.find("}", function_end + 1)
230 | if function_end > 0:
231 | # Add our new functions after get_all_kv_cache_types
232 | helper_functions = (
233 | "\n\nstatic std::string get_kv_quant_bit_options() {\n"
234 | " // Return the supported bit sizes only (for --kvq-key and --kvq-val)\n"
235 | " std::stringstream msg;\n"
236 | " bool first = true;\n"
237 | " for (const auto& pair : kv_quant_bit_to_type) {\n"
238 | " if (!first) {\n"
239 | " msg << \", \";\n"
240 | " }\n"
241 | " msg << pair.first;\n"
242 | " first = false;\n"
243 | " }\n"
244 | " return msg.str();\n"
245 | "}\n\n"
246 | "// Helper to convert bit size to quantization type\n"
247 | "static ggml_type kv_quant_bits_to_type(int bits) {\n"
248 | " auto it = kv_quant_bit_to_type.find(bits);\n"
249 | " if (it != kv_quant_bit_to_type.end()) {\n"
250 | " return it->second;\n"
251 | " }\n"
252 | " // Default to FP16 if invalid\n"
253 | " return GGML_TYPE_F16;\n"
254 | "}\n"
255 | )
256 | content = content[:function_end+1] + helper_functions + content[function_end+1:]
257 |
258 | # Add the command-line arguments
259 | cache_type_k_arg = content.find('add_arg_type::opt, "cache-type-k"')
260 | if cache_type_k_arg > 0:
261 | next_arg = content.find("parser.add_arg(", cache_type_k_arg + 1)
262 | if next_arg > 0:
263 | # Add our new arguments after cache-type-k
264 | new_args = (
265 | "\n\n parser.add_arg(\n"
266 | " add_arg_type::opt, \"kvq-key\", &kv_quant_key,\n"
267 | " add_arg_handler([&](std::string_view value) {\n"
268 | " try {\n"
269 | " int bits = std::stoi(std::string(value));\n"
270 | " // Set key cache quantization type\n"
271 | " if (kv_quant_bit_to_type.find(bits) != kv_quant_bit_to_type.end()) {\n"
272 | " params.cache_type_k = kv_quant_bit_to_type.at(bits);\n"
273 | " } else {\n"
274 | " LOG_ERROR(\"Invalid KV cache key quantization bits: %d (valid options: %s)\\n\", \n"
275 | " bits, get_kv_quant_bit_options().c_str());\n"
276 | " return false;\n"
277 | " }\n"
278 | " } catch (...) {\n"
279 | " LOG_ERROR(\"Invalid KV cache key quantization bits: '%s' (valid options: %s)\\n\", \n"
280 | " std::string(value).c_str(), get_kv_quant_bit_options().c_str());\n"
281 | " return false;\n"
282 | " }\n"
283 | " return true;\n"
284 | " }, [&]() -> std::string { return \"\"; }),\n"
285 | " \"\",\n"
286 | " \"Set KV cache key quantization bits (options: \" + get_kv_quant_bit_options() + \")\"\n"
287 | " ).set_env(\"LLAMA_ARG_KVQ_KEY\");\n\n"
288 | " parser.add_arg(\n"
289 | " add_arg_type::opt, \"kvq-val\", &kv_quant_val,\n"
290 | " add_arg_handler([&](std::string_view value) {\n"
291 | " try {\n"
292 | " int bits = std::stoi(std::string(value));\n"
293 | " // Set value cache quantization type\n"
294 | " if (kv_quant_bit_to_type.find(bits) != kv_quant_bit_to_type.end()) {\n"
295 | " params.cache_type_v = kv_quant_bit_to_type.at(bits);\n"
296 | " } else {\n"
297 | " LOG_ERROR(\"Invalid KV cache value quantization bits: %d (valid options: %s)\\n\", \n"
298 | " bits, get_kv_quant_bit_options().c_str());\n"
299 | " return false;\n"
300 | " }\n"
301 | " } catch (...) {\n"
302 | " LOG_ERROR(\"Invalid KV cache value quantization bits: '%s' (valid options: %s)\\n\", \n"
303 | " std::string(value).c_str(), get_kv_quant_bit_options().c_str());\n"
304 | " return false;\n"
305 | " }\n"
306 | " return true;\n"
307 | " }, [&]() -> std::string { return \"\"; }),\n"
308 | " \"\",\n"
309 | " \"Set KV cache value quantization bits (options: \" + get_kv_quant_bit_options() + \")\"\n"
310 | " ).set_env(\"LLAMA_ARG_KVQ_VAL\");\n\n"
311 | " parser.add_arg(\n"
312 | " add_arg_type::opt, \"kvq\", &kv_quant_general,\n"
313 | " add_arg_handler([&](std::string_view value) {\n"
314 | " try {\n"
315 | " int bits = std::stoi(std::string(value));\n"
316 | " // Set both key and value cache quantization to the same type for backwards compatibility\n"
317 | " if (kv_quant_bit_to_type.find(bits) != kv_quant_bit_to_type.end()) {\n"
318 | " params.cache_type_k = kv_quant_bit_to_type.at(bits);\n"
319 | " params.cache_type_v = kv_quant_bit_to_type.at(bits);\n"
320 | " } else {\n"
321 | " LOG_ERROR(\"Invalid KV cache quantization bits: %d (valid options: %s)\\n\", \n"
322 | " bits, get_kv_quant_bit_options().c_str());\n"
323 | " return false;\n"
324 | " }\n"
325 | " } catch (...) {\n"
326 | " LOG_ERROR(\"Invalid KV cache quantization bits: '%s' (valid options: %s)\\n\", \n"
327 | " std::string(value).c_str(), get_kv_quant_bit_options().c_str());\n"
328 | " return false;\n"
329 | " }\n"
330 | " return true;\n"
331 | " }, [&]() -> std::string { return \"\"; }),\n"
332 | " \"\",\n"
333 | " \"Set both KV cache key and value quantization bits (options: \" + get_kv_quant_bit_options() + \")\"\n"
334 | " ).set_env(\"LLAMA_ARG_KVQ\");\n"
335 | )
336 | content = content[:next_arg] + new_args + content[next_arg:]
337 |
338 | # Write the modified content back to the file
339 | with open(arg_cpp_path, 'w') as f:
340 | f.write(content)
341 |
342 | print(f"{GREEN}✓ KVSplit modifications applied successfully{RESET}")
343 |
344 | # Rebuild llama.cpp
345 | print(f"{YELLOW}Rebuilding llama.cpp...{RESET}")
346 | build_dir = llama_cpp_dir / "build"
347 | if build_dir.exists():
348 | try:
349 | subprocess.run(
350 | ["cmake", "--build", ".", "--config", "Release"],
351 | cwd=str(build_dir),
352 | check=True,
353 | capture_output=True
354 | )
355 | print(f"{GREEN}✓ llama.cpp rebuilt successfully{RESET}")
356 | except subprocess.CalledProcessError as e:
357 | print(f"{RED}Failed to rebuild llama.cpp: {e}{RESET}")
358 | print(f"Build output: {e.stdout}\n{e.stderr}")
359 | raise
360 | else:
361 | print(f"{RED}Build directory not found at {build_dir}. Please run setup.sh first.{RESET}")
362 | except Exception as e:
363 | print(f"{RED}Failed to apply modifications: {e}{RESET}")
364 | raise
365 |
366 | def _parse_metal_memory(self, log_text):
367 | """Parse Metal memory allocation from log output"""
368 | vram_usage_mb = 0
369 | kv_cache_mb = 0
370 | k_cache_mb = 0
371 | v_cache_mb = 0
372 |
373 | # Try to match the newer format first (Metal_Mapped model buffer size)
374 | metal_alloc = re.search(r"Metal_Mapped model buffer size\s*=\s*([\d.]+)\s*MiB", log_text)
375 | if metal_alloc:
376 | vram_usage_mb = float(metal_alloc.group(1))
377 |
378 | # If not found, try older formats
379 | if not vram_usage_mb:
380 | metal_alloc = re.search(r"METAL ALLOC.*?size (\d+) KiB", log_text)
381 | if metal_alloc:
382 |                 # Convert KiB to MiB
383 |                 vram_usage_mb = float(metal_alloc.group(1)) / 1024  # KiB to MiB
384 |
385 | if not vram_usage_mb:
386 | metal_alloc = re.search(r"GGML_METAL_log_alloc.*?(\d+)", log_text)
387 | if metal_alloc:
388 |                 # Convert bytes to MiB (the matched value is a raw byte count)
389 |                 vram_usage_mb = float(metal_alloc.group(1)) / (1024 * 1024)  # bytes to MiB
390 |
391 | # Parse KV cache size from the newer unified format
392 | kv_unified = re.search(r"KV self size\s*=\s*([\d.]+)\s*MiB", log_text)
393 | if kv_unified:
394 | kv_cache_mb = float(kv_unified.group(1))
395 |
396 | # Extract K and V sizes from the newer format
397 | k_size = re.search(r"K \([^)]+\):\s*([\d.]+)\s*MiB", log_text)
398 | v_size = re.search(r"V \([^)]+\):\s*([\d.]+)\s*MiB", log_text)
399 |
400 | if k_size:
401 | k_cache_mb = float(k_size.group(1))
402 | if v_size:
403 | v_cache_mb = float(v_size.group(1))
404 |
405 | # If not found, try older formats
406 | if kv_cache_mb == 0:
407 | # Old/verbose format
408 | kv_matches = re.findall(r"METAL ALLOC:.*?[kK][vV].*?[cC]ache.*?(\d+) bytes", log_text)
409 | if kv_matches:
410 | # Sum all KV cache allocations and convert to MB
411 | kv_cache_mb = sum(int(x) for x in kv_matches) / (1024 * 1024)
412 |
413 | k_matches = re.findall(r"METAL ALLOC:.*?\bK\b.*?(\d+) bytes", log_text)
414 | if k_matches:
415 | k_cache_mb = sum(int(x) for x in k_matches) / (1024 * 1024)
416 |
417 | v_matches = re.findall(r"METAL ALLOC:.*?\bV\b.*?(\d+) bytes", log_text)
418 | if v_matches:
419 | v_cache_mb = sum(int(x) for x in v_matches) / (1024 * 1024)
420 |
421 | # Newer llama.cpp format for KV cache size
422 | if kv_cache_mb == 0:
423 | log_alloc = re.findall(r"llama_kv_cache_init: memory_size = ([\d.]+) MB", log_text)
424 | if log_alloc:
425 | kv_cache_mb = float(log_alloc[0])
426 |
427 | # As a last resort, look for Metal KV buffer size
428 | if kv_cache_mb == 0:
429 | metal_kv = re.search(r"Metal KV buffer size\s*=\s*([\d.]+)\s*MiB", log_text)
430 | if metal_kv:
431 | kv_cache_mb = float(metal_kv.group(1))
432 |
433 | # If we still don't have VRAM usage, use memory_pressure as fallback
434 | if vram_usage_mb == 0:
435 | try:
436 | mem_output = subprocess.run(["memory_pressure", "-Q"],
437 | capture_output=True,
438 | text=True,
439 | check=True).stdout
440 | # Parse memory_pressure output
441 | gpu_mem = re.search(r"GPU Memory: (\d+) MB", mem_output)
442 | if gpu_mem:
443 | vram_usage_mb = float(gpu_mem.group(1))
444 | except Exception as e:
445 | print(f"{RED}Warning: Failed to get memory info from memory_pressure: {e}{RESET}")
446 |
447 | return vram_usage_mb, kv_cache_mb, k_cache_mb, v_cache_mb
448 |
449 | def _parse_perplexity(self, log_text):
450 | """Parse perplexity from log output"""
451 | # Try to extract log probability values for token predictions
452 | # Extract all instances of "logprob" values
453 | logprob_matches = re.findall(r"\bll=([-\d.]+)\b", log_text)
454 | if logprob_matches and len(logprob_matches) > 1:
455 | # Convert log probabilities to perplexity: exp(-mean(logprobs))
456 | logprobs = [float(lp) for lp in logprob_matches if float(lp) < 0] # Ignore any positive logprobs (errors)
457 | if logprobs:
458 | avg_logprob = sum(logprobs) / len(logprobs)
459 | perplexity = math.exp(-avg_logprob) # PPL = exp(-avg_logprob)
460 | return perplexity
461 |
462 | # Try standard formats
463 | # Format produced by --perplexity flag
464 | perplexity_match = re.search(r"perplexity = ([\d.]+),", log_text)
465 | if perplexity_match:
466 | return float(perplexity_match.group(1))
467 |
468 | # Alternate format with 'perplexity:'
469 | perplexity_match = re.search(r"perplexity:\s*([\d.]+)", log_text)
470 | if perplexity_match:
471 | return float(perplexity_match.group(1))
472 |
473 | # PPL format used in some llama.cpp versions
474 | perplexity_match = re.search(r"PPL\s*=\s*([\d.]+)", log_text)
475 | if perplexity_match:
476 | return float(perplexity_match.group(1))
477 |
478 | # NLL (negative log likelihood) - convert to perplexity
479 | nll_match = re.search(r"NLL\s*=\s*([\d.]+)", log_text)
480 | if nll_match:
481 | nll = float(nll_match.group(1))
482 | return math.exp(nll) # perplexity = exp(NLL)
483 |
484 | # Try looking for 'average loss' which is similar to NLL
485 | avg_loss = re.search(r"average loss = ([\d.]+)", log_text)
486 | if avg_loss:
487 | loss = float(avg_loss.group(1))
488 | return math.exp(loss) # perplexity = exp(loss)
489 |
490 | return 0
491 |
492 | def _parse_throughput(self, log_text):
493 | """Parse throughput (tokens/sec) from log output"""
494 | # Try the new format in llama_perf_context_print
495 | throughput_match = re.search(r"llama_perf_context_print:\s+eval time.*tokens per second,\s+([\d.]+)\)", log_text)
496 | if throughput_match:
497 | return float(throughput_match.group(1))
498 |
499 | # Try earlier format
500 | throughput_match = re.search(r"eval time: .*? tokens/sec: ([\d.]+)", log_text)
501 | if throughput_match:
502 | return float(throughput_match.group(1))
503 |
504 | # Try alternate format
505 | throughput_match = re.search(r"tokens per second: ([\d.]+)", log_text)
506 | if throughput_match:
507 | return float(throughput_match.group(1))
508 |
509 | # Try another common format
510 | throughput_match = re.search(r"([\d.]+) tokens per second\)", log_text)
511 | if throughput_match:
512 | return float(throughput_match.group(1))
513 |
514 | return 0
515 |
516 | def _parse_time_to_first_token(self, log_text):
517 | """Parse time to first token from log output"""
518 | # Standard format
519 | time_match = re.search(r"time to first token: ([\d.]+) ms", log_text)
520 | if time_match:
521 | return float(time_match.group(1))
522 |
523 | # Newer format in llama_perf logs
524 | time_match = re.search(r"llama_perf_context_print:\s+prompt eval time\s*=\s*([\d.]+)\s*ms", log_text)
525 | if time_match:
526 | return float(time_match.group(1))
527 |
528 | return 0
529 |
530 | def _parse_total_time(self, log_text):
531 | """Parse total evaluation time from log output"""
532 | # Standard format
533 | time_match = re.search(r"eval time: ([\d.]+) s", log_text)
534 | if time_match:
535 | return float(time_match.group(1))
536 |
537 | # Newer format in llama_perf logs
538 | time_match = re.search(r"llama_perf_context_print:\s+total time\s*=\s*([\d.]+)\s*ms", log_text)
539 | if time_match:
540 | # Convert ms to seconds
541 | return float(time_match.group(1)) / 1000.0
542 |
543 | return 0
544 |
545 | def run_benchmark(self, config, seq_len, run_num):
546 | """Run a single benchmark"""
547 |
548 | config_name = config["name"]
549 | prompt = random.choice(TEST_PROMPTS)
550 |
551 | print(f"\n{YELLOW}Running benchmark for {config_name}, sequence length {seq_len}, run {run_num+1}/{REPEAT_COUNT}{RESET}")
552 |
553 |         # Write the chosen prompt, padded with a few extra prompts, to a per-run file
554 |         prompt_file = self.output_dir / f"prompt_{config_name}_{seq_len}_{run_num+1}.txt"
555 |         with open(prompt_file, "w") as f:
556 |             # (The perplexity test below reads the shared perplexity_test_data.txt)
557 |             f.write("\n".join([prompt] + TEST_PROMPTS[:3]))
558 |
559 | # Set up command-line arguments for a standard generation benchmark
560 | cmd = [
561 | str(self.llama_exec),
562 | "-m", str(self.model_path),
563 | "-p", prompt, # Simple prompt
564 | "-c", str(seq_len), # Context size
565 | "-n", str(seq_len), # Generate up to seq_len tokens
566 | "-t", "8", # Number of threads
567 | "--flash-attn" # Enable flash attention which is required for KV quantization
568 | ]
569 |
570 | # Add configuration-specific flags
571 | if config["flags"]:
572 | cmd.extend(config["flags"])
573 |
574 | print(f"{BLUE}Command: {' '.join(cmd)}{RESET}")
575 |
576 | # Start benchmark - run the process and capture output
577 | start_time = time.time()
578 | try:
579 | process = subprocess.run(
580 | cmd,
581 | capture_output=True,
582 | text=True,
583 | check=False, # Don't raise exception on non-zero exit
584 | cwd=self.base_dir
585 | )
586 | log_output = process.stdout + process.stderr
587 | success = process.returncode == 0
588 | except Exception as e:
589 | log_output = str(e)
590 | success = False
591 | end_time = time.time()
592 |
593 |         # The perplexity test needs a much longer text than the generation prompt,
594 |         # so it reads the shared perplexity_test_data.txt (created below if missing)
595 | perplexity_value = 0.0
596 |
597 | # Use our pre-created perplexity test data file
598 | perplexity_test_file = Path(self.base_dir) / "perplexity_test_data.txt"
599 | if not os.path.exists(perplexity_test_file):
600 | print(f"{YELLOW}Warning: Perplexity test data file not found. Creating a new one...{RESET}")
601 | # Create sample content - philosophical themes that work well for perplexity testing
602 | perplexity_content = """
603 | The meaning of life is a profound philosophical question that has been debated for centuries by thinkers across cultures and disciplines. Some philosophers argue that meaning comes from purpose, while others suggest it emerges from human relationships and connections. Religious perspectives often point to divine purpose or spiritual fulfillment, whereas existentialists like Sartre propose that we must create our own meaning in an otherwise indifferent universe.
604 |
605 | The theory of relativity explains the relationship between space and time, revolutionizing our understanding of physics. Einstein's work showed that time and space are not absolute but relative to the observer's frame of reference. This challenged Newton's laws by demonstrating that the speed of light is constant regardless of the observer's motion.
606 |
607 | The history of artificial intelligence begins with ancient myths of mechanical beings and philosophical inquiries about thinking machines. The field formally emerged in the mid-20th century, with the 1956 Dartmouth Conference marking its official birth.
608 |
609 | The relationship between quantum mechanics and gravity is one of the greatest unsolved problems in physics. Standard quantum mechanics has successfully explained the behavior of particles at microscopic scales, while Einstein's general relativity accurately describes gravity at cosmic scales.
610 |
611 | In the beginning of the universe, according to the prevailing Big Bang theory, all matter, energy, space, and time were compressed into an infinitesimally small, infinitely dense point called a singularity.
612 | """
613 | with open(perplexity_test_file, "w") as f:
614 | f.write(perplexity_content)
615 |
616 | # Run perplexity benchmark using llama-perplexity tool
617 | perplexity_tool = str(self.llama_exec).replace('llama-cli', 'llama-perplexity')
618 | if os.path.exists(perplexity_tool):
619 | perpl_cmd = [
620 | perplexity_tool,
621 | "-m", str(self.model_path),
622 | "-f", str(perplexity_test_file),
623 | "-t", "8", # Number of threads
624 | "--ctx-size", "512", # Use smaller context size to enable perplexity calculation
625 | "--flash-attn" # Enable flash attention which is required for KV quantization
626 | ]
627 |
628 | # Add configuration-specific flags
629 | if config["flags"]:
630 | perpl_cmd.extend(config["flags"])
631 |
632 | try:
633 | print(f"{BLUE}Running perplexity test: {' '.join(perpl_cmd)}{RESET}")
634 | perpl_process = subprocess.run(
635 | perpl_cmd,
636 | capture_output=True,
637 | text=True,
638 | check=False,
639 | cwd=self.base_dir
640 | )
641 | perpl_output = perpl_process.stdout + perpl_process.stderr
642 |
643 | # Save perplexity output to log file
644 | perpl_log_file = self.output_dir / f"perplexity_{config_name}_n{seq_len}_run{run_num+1}.log"
645 | with open(perpl_log_file, "w") as f:
646 | f.write(perpl_output)
647 |
648 | # Parse perplexity result - look for the final PPL estimate
649 | match = re.search(r"Final estimate:\s*PPL\s*=\s*([\d.]+)", perpl_output)
650 | if match:
651 | perplexity_value = float(match.group(1))
652 | print(f"{GREEN}Perplexity: {perplexity_value:.4f}{RESET}")
653 | else:
654 | # Try alternate format
655 | match = re.search(r"perplexity:\s*([\d.]+)", perpl_output)
656 | if match:
657 | perplexity_value = float(match.group(1))
658 | print(f"{GREEN}Perplexity: {perplexity_value:.4f}{RESET}")
659 | else:
660 | print(f"{RED}Failed to extract perplexity value from output{RESET}")
661 | except Exception as e:
662 | print(f"{RED}Error running perplexity test: {e}{RESET}")
663 | else:
664 | print(f"{YELLOW}Warning: llama-perplexity tool not found at {perplexity_tool}{RESET}")
665 |
666 | # Save log output to file
667 | log_file = self.output_dir / f"benchmark_{config_name}_n{seq_len}_run{run_num+1}.log"
668 | with open(log_file, "w") as f:
669 | f.write(log_output)
670 |
671 | # Create result object
672 | result = BenchmarkResult(config_name, seq_len, run_num+1)
673 | result.success = success
674 |
675 | # Try to parse metrics from log output even if the command failed
676 | # This allows us to capture partial results
677 | total_allocated, kv_cache_size, k_cache_size, v_cache_size = self._parse_metal_memory(log_output)
678 | result.vram_usage_mb = total_allocated
679 | result.kv_cache_mb = kv_cache_size
680 | result.k_cache_mb = k_cache_size
681 | result.v_cache_mb = v_cache_size
682 |
683 | # Use the perplexity value from the dedicated perplexity tool test
684 | result.perplexity = perplexity_value
685 |
686 | result.throughput_tokens_per_sec = self._parse_throughput(log_output)
687 | result.time_to_first_token_ms = self._parse_time_to_first_token(log_output)
688 | result.total_time_sec = self._parse_total_time(log_output)
689 |
690 | # If we couldn't parse the total time, use our measured time
691 | if result.total_time_sec == 0:
692 | result.total_time_sec = end_time - start_time
693 |
694 | # Save error messages for analysis
695 | error_file = self.output_dir / f"error_{config_name}_n{seq_len}_run{run_num+1}.txt"
696 | if not success:
697 | # Try to extract error message
698 | error_lines = [line for line in log_output.splitlines() if "error" in line.lower() or "exception" in line.lower() or "failed" in line.lower()]
699 | if error_lines:
700 | with open(error_file, "w") as f:
701 | f.write("\n".join(error_lines))
702 |
703 | if success:
704 | # Print results
705 | print(f"{GREEN}Benchmark completed:{RESET}")
706 | print(f" - VRAM usage: {result.vram_usage_mb:.2f} MB")
707 | print(f" - KV cache: {result.kv_cache_mb:.2f} MB")
708 | print(f" - Throughput: {result.throughput_tokens_per_sec:.2f} tokens/sec")
709 | print(f" - Perplexity: {result.perplexity:.4f}")
710 | print(f" - Log saved to: {log_file}")
711 | else:
712 | # Still try to print any metrics we could parse
713 | print(f"{RED}Benchmark failed. Check log file: {log_file}{RESET}")
714 | if result.kv_cache_mb > 0 or result.throughput_tokens_per_sec > 0 or result.perplexity > 0:
715 | print(f"{YELLOW}Partial results:{RESET}")
716 | if result.kv_cache_mb > 0:
717 | print(f" - KV cache: {result.kv_cache_mb:.2f} MB")
718 | if result.throughput_tokens_per_sec > 0:
719 | print(f" - Throughput: {result.throughput_tokens_per_sec:.2f} tokens/sec")
720 | if result.perplexity > 0:
721 | print(f" - Perplexity: {result.perplexity:.4f}")
722 |
723 | self.results.append(result)
724 | return result
725 |
726 | def run_all_benchmarks(self):
727 | """Run all benchmark configurations"""
728 | start_time = time.time()
729 | total_tests = len(CONFIGURATIONS) * len(SEQUENCE_LENGTHS) * REPEAT_COUNT
730 | completed_tests = 0
731 |
732 | print(f"{GREEN}Starting benchmark suite with {total_tests} total tests{RESET}")
733 | print(f"Testing configurations: {', '.join(c['name'] for c in CONFIGURATIONS)}")
734 | print(f"Sequence lengths: {', '.join(str(n) for n in SEQUENCE_LENGTHS)}")
735 | print(f"Each test repeated {REPEAT_COUNT} times")
736 |
737 | for config in CONFIGURATIONS:
738 | for seq_len in SEQUENCE_LENGTHS:
739 | for run in range(REPEAT_COUNT):
740 | # Run a single benchmark
741 | result = self.run_benchmark(config, seq_len, run)
742 | completed_tests += 1
743 |
744 | # Calculate and display progress
745 | progress_pct = (completed_tests / total_tests) * 100
746 | elapsed = time.time() - start_time
747 | eta = (elapsed / completed_tests) * (total_tests - completed_tests) if completed_tests > 0 else 0
748 |
749 | print(f"{BLUE}Progress: {completed_tests}/{total_tests} ({progress_pct:.1f}%)")
750 | print(f"Elapsed: {elapsed/60:.1f} minutes, ETA: {eta/60:.1f} minutes{RESET}")
751 |
752 | # Verify if the benchmark was successful - if first run fails, skip the rest for this config
753 | if run == 0 and not result.success and not any([result.kv_cache_mb, result.throughput_tokens_per_sec, result.perplexity]):
754 | print(f"{RED}First run for {config['name']} with sequence length {seq_len} failed completely.{RESET}")
755 | print(f"{YELLOW}Skipping remaining runs for this configuration and sequence length.{RESET}")
756 | # Skip the remaining runs for this config+seq_len
757 | completed_tests += REPEAT_COUNT - 1
758 | break
759 |
760 | # Small pause between tests to let system cool down
761 | if completed_tests < total_tests:
762 | print(f"{YELLOW}Waiting 2 seconds before next test...{RESET}")
763 | time.sleep(2)
764 |
765 | total_time = time.time() - start_time
766 | print(f"{GREEN}All benchmarks completed in {total_time/60:.1f} minutes!{RESET}")
767 |
768 | # Export all results
769 | self.export_results()
770 |
771 | # Generate summary statistics
772 | self.generate_summary()
773 |
774 | def export_results(self):
775 | """Export benchmark results to CSV"""
776 | timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
777 | csv_file = self.output_dir / f"benchmark_results_{timestamp}.csv"
778 |
779 | # Convert results to dictionaries
780 | result_dicts = [result.to_dict() for result in self.results]
781 |
782 | # Write to CSV
783 | if result_dicts:
784 | headers = result_dicts[0].keys()
785 | with open(csv_file, "w", newline="") as f:
786 | writer = csv.DictWriter(f, fieldnames=headers)
787 | writer.writeheader()
788 | writer.writerows(result_dicts)
789 |
790 | print(f"{GREEN}Results exported to {csv_file}{RESET}")
791 | else:
792 | print(f"{YELLOW}No results to export{RESET}")
793 |
794 | # Also export as JSON for easier parsing
795 | json_file = self.output_dir / f"benchmark_results_{timestamp}.json"
796 | with open(json_file, "w") as f:
797 | json.dump(result_dicts, f, indent=2)
798 |
799 | print(f"{GREEN}Results also exported to {json_file}{RESET}")
800 |
801 | return csv_file, json_file
802 |
803 | def generate_summary(self):
804 | """Generate summary statistics from all benchmark runs"""
805 | if not self.results:
806 | print(f"{YELLOW}No results to summarize{RESET}")
807 | return
808 |
809 | print(f"\n{GREEN}=== BENCHMARK SUMMARY ==={RESET}")
810 |
811 | # Check if we have any successful measurements
812 | has_measurements = False
813 | for result in self.results:
814 | if result.kv_cache_mb > 0 or result.throughput_tokens_per_sec > 0 or result.perplexity > 0:
815 | has_measurements = True
816 | break
817 |
818 | if not has_measurements:
819 | print(f"{RED}No successful measurements were captured in the benchmark run.{RESET}")
820 | print(f"{YELLOW}This may be because:{RESET}")
821 | print("1. The KVSplit patch wasn't properly applied")
822 | print("2. The parameters aren't recognized by llama.cpp")
823 | print("3. There was an issue with the benchmark command execution")
824 | print(f"\nCheck the log files in {self.output_dir} for detailed error messages")
825 | return
826 |
827 | # Group results by configuration and sequence length
828 | grouped_results = {}
829 | for result in self.results:
830 | key = (result.config_name, result.sequence_length)
831 | if key not in grouped_results:
832 | grouped_results[key] = []
833 | grouped_results[key].append(result)
834 |
835 | # Calculate summary statistics for each group
836 | summary_rows = []
837 | for (config_name, seq_len), results in grouped_results.items():
838 | # Calculate means and standard deviations
839 | vram_usage = [r.vram_usage_mb for r in results if r.vram_usage_mb > 0]
840 | kv_cache = [r.kv_cache_mb for r in results if r.kv_cache_mb > 0]
841 | throughput = [r.throughput_tokens_per_sec for r in results if r.throughput_tokens_per_sec > 0]
842 | perplexity = [r.perplexity for r in results if r.perplexity > 0]
843 |
844 | # Only calculate stats if we have data
845 | vram_mean = statistics.mean(vram_usage) if vram_usage else 0
846 | vram_stdev = statistics.stdev(vram_usage) if len(vram_usage) > 1 else 0
847 | kv_mean = statistics.mean(kv_cache) if kv_cache else 0
848 | kv_stdev = statistics.stdev(kv_cache) if len(kv_cache) > 1 else 0
849 | throughput_mean = statistics.mean(throughput) if throughput else 0
850 | throughput_stdev = statistics.stdev(throughput) if len(throughput) > 1 else 0
851 | perplexity_mean = statistics.mean(perplexity) if perplexity else 0
852 | perplexity_stdev = statistics.stdev(perplexity) if len(perplexity) > 1 else 0
853 |
854 | # Add to summary rows
855 | summary_rows.append({
856 | "Configuration": config_name,
857 | "Sequence_Length": seq_len,
858 | "VRAM_Usage_MB_Mean": vram_mean,
859 | "VRAM_Usage_MB_StdDev": vram_stdev,
860 | "KV_Cache_MB_Mean": kv_mean,
861 | "KV_Cache_MB_StdDev": kv_stdev,
862 | "Throughput_Mean": throughput_mean,
863 | "Throughput_StdDev": throughput_stdev,
864 | "Perplexity_Mean": perplexity_mean,
865 | "Perplexity_StdDev": perplexity_stdev,
866 | "Sample_Count": len(results),
867 | })
868 |
869 | # Export summary to CSV
870 | timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
871 | summary_file = self.output_dir / f"benchmark_summary_{timestamp}.csv"
872 |
873 | if summary_rows:
874 | headers = summary_rows[0].keys()
875 | with open(summary_file, "w", newline="") as f:
876 | writer = csv.DictWriter(f, fieldnames=headers)
877 | writer.writeheader()
878 | writer.writerows(summary_rows)
879 |
880 | print(f"{GREEN}Summary statistics exported to {summary_file}{RESET}")
881 |
882 | # Print summary to console
883 | print(f"\n{CYAN}Memory Usage Summary (KV Cache MB){RESET}")
884 | print(f"{'Configuration':<10} | {'128 tokens':<15} | {'2048 tokens':<15} | {'4096 tokens':<15} | {'8192 tokens':<15}")
885 | print("-" * 80)
886 |
887 | for config in CONFIGURATIONS:
888 | row = f"{config['name']:<10} | "
889 | for seq_len in SEQUENCE_LENGTHS:
890 | key = (config['name'], seq_len)
891 | if key in grouped_results:
892 | kv_cache = [r.kv_cache_mb for r in grouped_results[key] if r.kv_cache_mb > 0]
893 | if kv_cache:
894 | mean = statistics.mean(kv_cache)
895 |                         row += f"{mean:12.2f} MB | "
896 | else:
897 | row += f"{'N/A':<15} | "
898 | else:
899 | row += f"{'N/A':<15} | "
900 | print(row)
901 |
902 | print(f"\n{CYAN}Throughput Summary (tokens/sec){RESET}")
903 | print(f"{'Configuration':<10} | {'128 tokens':<15} | {'2048 tokens':<15} | {'4096 tokens':<15} | {'8192 tokens':<15}")
904 | print("-" * 80)
905 |
906 | for config in CONFIGURATIONS:
907 | row = f"{config['name']:<10} | "
908 | for seq_len in SEQUENCE_LENGTHS:
909 | key = (config['name'], seq_len)
910 | if key in grouped_results:
911 | throughput = [r.throughput_tokens_per_sec for r in grouped_results[key] if r.throughput_tokens_per_sec > 0]
912 | if throughput:
913 | mean = statistics.mean(throughput)
914 |                         row += f"{mean:11.2f} t/s | "
915 | else:
916 | row += f"{'N/A':<15} | "
917 | else:
918 | row += f"{'N/A':<15} | "
919 | print(row)
920 |
921 | print(f"\n{CYAN}Perplexity Summary (lower is better){RESET}")
922 | print(f"{'Configuration':<10} | {'128 tokens':<15} | {'2048 tokens':<15} | {'4096 tokens':<15} | {'8192 tokens':<15}")
923 | print("-" * 80)
924 |
925 | for config in CONFIGURATIONS:
926 | row = f"{config['name']:<10} | "
927 | for seq_len in SEQUENCE_LENGTHS:
928 | key = (config['name'], seq_len)
929 | if key in grouped_results:
930 | perplexity = [r.perplexity for r in grouped_results[key] if r.perplexity > 0]
931 | if perplexity:
932 | mean = statistics.mean(perplexity)
933 |                         row += f"{mean:15.4f} | "
934 | else:
935 | row += f"{'N/A':<15} | "
936 | else:
937 | row += f"{'N/A':<15} | "
938 | print(row)
939 |
940 | # Print key insights
941 | print(f"\n{GREEN}Key Insights:{RESET}")
942 | print("1. K8V4 (8-bit keys, 4-bit values) typically provides a good balance of memory efficiency")
943 | print(" and quality, keeping key precision where it matters most.")
944 | print("2. K4V8 typically shows more quality degradation as keys are more sensitive to quantization.")
945 | print("3. Longer context lengths demonstrate more significant memory savings with mixed precision.")
946 | print("4. Memory measurements may show slight differences from theoretical calculations due to")
947 | print(" the 256B page alignment in the llama.cpp memory allocator.")
948 | print("5. Using the existing --cache-type-k and --cache-type-v parameters allows for split-precision")
949 | print(" KV cache without modifying the llama.cpp source code.")
950 | print()
951 | print(f"{GREEN}Full benchmark data and logs are available in: {self.output_dir}{RESET}")
952 |
953 | return summary_file
954 |
955 | def main():
956 | parser = argparse.ArgumentParser(description="KVSplit Benchmark Tool")
957 | parser.add_argument("--base-dir", default=None, help="Base directory for the KVSplit project")
958 | parser.add_argument("--model", default=None, help="Path to the model file")
959 | parser.add_argument("--llama-exec", default=None, help="Path to the llama.cpp executable")
960 | parser.add_argument("--output-dir", default=None, help="Directory to store benchmark results")
961 |
962 | args = parser.parse_args()
963 |
964 | # Determine base directory if not specified
965 | if args.base_dir is None:
966 | args.base_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
967 |
968 | # Set default paths based on base directory
969 | if args.output_dir is None:
970 | args.output_dir = os.path.join(args.base_dir, "results")
971 |
972 | if args.model is None:
973 | args.model = os.path.join(args.base_dir, "models", "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf")
974 |
975 | if args.llama_exec is None:
976 | # First try to find main directly
977 | llama_cpp_dir = os.path.join(args.base_dir, "llama.cpp")
978 |
979 | # Look in build/bin directory first (CMake build)
980 | candidate = os.path.join(llama_cpp_dir, "build", "bin", "main")
981 | if os.path.exists(candidate):
982 | args.llama_exec = candidate
983 | else:
984 | # Try llama-cli
985 | candidate = os.path.join(llama_cpp_dir, "build", "bin", "llama-cli")
986 | if os.path.exists(candidate):
987 | args.llama_exec = candidate
988 | else:
989 | # Try just main in llama.cpp dir (Make build)
990 | candidate = os.path.join(llama_cpp_dir, "main")
991 | if os.path.exists(candidate):
992 | args.llama_exec = candidate
993 | else:
994 | print(f"{RED}Error: Could not find llama.cpp executable. Please specify with --llama-exec{RESET}")
995 | sys.exit(1)
996 |
997 | print(f"{GREEN}KVSplit Benchmark{RESET}")
998 | print(f"Base directory: {args.base_dir}")
999 | print(f"Llama executable: {args.llama_exec}")
1000 | print(f"Model: {args.model}")
1001 | print(f"Output directory: {args.output_dir}")
1002 |
1003 | # Run benchmarks
1004 | benchmarker = Benchmarker(args.base_dir, args.llama_exec, args.model, args.output_dir)
1005 | benchmarker.run_all_benchmarks()
1006 |
1007 | if __name__ == "__main__":
1008 | main()
1009 |
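# Example invocation (illustrative; every flag falls back to the defaults
# resolved in main() above):
#   python scripts/benchmark_kvsplit.py \
#       --model models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
#       --llama-exec llama.cpp/build/bin/llama-cli --output-dir results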
--------------------------------------------------------------------------------
/scripts/capture_memory.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | # capture_memory.sh - Script to capture memory usage during inference
3 | #
4 | # This script captures screenshots at regular intervals and converts them
5 | # to a GIF, allowing you to visualize memory reduction in Activity Monitor
6 | # or other monitoring tools when running different KV cache configurations.
7 | #
8 | # Usage:
9 | # ./scripts/capture_memory.sh [options]
10 | #
11 | # Options:
12 | # --frames N Number of frames to capture (default: 30)
13 | # --delay N Delay between frames in seconds (default: 1)
14 | # --fps N Frames per second in the output GIF (default: 5)
15 | # --output FILE Output filename (default: memory_reduction.gif)
16 |
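#
# Example (illustrative): capture 60 frames two seconds apart and render a
# 10 fps GIF while running an FP16 and a K8V4 session side by side:
#   ./scripts/capture_memory.sh --frames 60 --delay 2 --fps 10 --output k8v4_vs_fp16.gif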
17 | set -euo pipefail
18 |
19 | # ANSI color codes
20 | GREEN='\033[0;32m'
21 | YELLOW='\033[1;33m'
22 | BLUE='\033[0;34m'
23 | RED='\033[0;31m'
24 | RESET='\033[0m'
25 |
26 | # Default settings
27 | FRAMES=30
28 | DELAY=1
29 | FPS=5
30 | OUTPUT="memory_reduction.gif"
31 | FRAMES_DIR="capture_frames"
32 |
33 | # Parse arguments
34 | while [[ $# -gt 0 ]]; do
35 | case $1 in
36 | --frames)
37 | FRAMES="$2"
38 | shift 2
39 | ;;
40 | --delay)
41 | DELAY="$2"
42 | shift 2
43 | ;;
44 | --fps)
45 | FPS="$2"
46 | shift 2
47 | ;;
48 | --output)
49 | OUTPUT="$2"
50 | shift 2
51 | ;;
52 | *)
53 | echo -e "${RED}Error: Unknown option $1${RESET}"
54 | exit 1
55 | ;;
56 | esac
57 | done
58 |
59 | # Check for gifski
60 | if ! command -v gifski &> /dev/null; then
61 | echo -e "${YELLOW}⚠️ gifski is not installed. Attempting to install it...${RESET}"
62 | if command -v brew &> /dev/null; then
63 | brew install gifski
64 | else
65 | echo -e "${RED}❌ Error: Homebrew is not installed. Please install gifski manually:${RESET}"
66 | echo -e "${BLUE} brew install gifski${RESET}"
67 | exit 1
68 | fi
69 | fi
70 |
71 | echo -e "${GREEN}📹 Memory Usage Capture Tool${RESET}"
72 | echo -e "${BLUE}This script will capture ${FRAMES} screenshots at ${DELAY}-second intervals${RESET}"
73 | echo -e "${BLUE}and combine them into a GIF at ${FPS} frames per second.${RESET}"
74 | echo
75 | echo -e "${YELLOW}Instructions:${RESET}"
76 | echo -e "1. Open Activity Monitor (Applications > Utilities > Activity Monitor)"
77 | echo -e "2. Position it on your screen where it's clearly visible"
78 | echo -e "3. Sort by Memory usage (click the 'Memory' column header)"
79 | echo -e "4. Run your KVSplit commands in another terminal window"
80 | echo
81 |
82 | # Ask for confirmation
83 | read -p "Press Enter when ready to start capturing, or Ctrl+C to cancel..."
84 |
85 | # Create frames directory
86 | mkdir -p "$FRAMES_DIR"
87 |
88 | # Start countdown
89 | echo -e "${YELLOW}Starting capture in:${RESET}"
90 | for i in {3..1}; do
91 | echo -e "${YELLOW}$i...${RESET}"
92 | sleep 1
93 | done
94 |
95 | echo -e "${GREEN}🎬 Capturing started!${RESET}"
96 |
97 | # Capture frames
98 | for ((i=1; i<=$FRAMES; i++)); do
99 | echo -e "${BLUE}Capturing frame $i of $FRAMES${RESET}"
100 | screencapture -x "$FRAMES_DIR/frame_$(printf "%03d" $i).png"
101 |
102 | # Show progress
103 | percent=$((i * 100 / FRAMES))
104 | completed=$((percent / 2))
105 | remaining=$((50 - completed))
106 | progress="["
107 |     for ((j=0; j<completed; j++)); do progress+="#"; done
108 |     progress+=">"
109 |     for ((j=0; j<remaining; j++)); do progress+=" "; done

--------------------------------------------------------------------------------
/scripts/install_kvsplit.sh:
--------------------------------------------------------------------------------
40 | if ! command -v brew > /dev/null 2>&1; then
41 | echo -e "${YELLOW}⚠️ Homebrew not found. Please install Homebrew first:${RESET}"
42 | echo -e "${YELLOW} /bin/bash -c \"\$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)\"${RESET}"
43 | exit 1
44 | fi
45 |
46 | brew install cmake parallel gifski python || {
47 | echo -e "${YELLOW}⚠️ Some Homebrew dependencies couldn't be installed.${RESET}"
48 | echo -e "${YELLOW} This might be due to them already being installed or another issue.${RESET}"
49 | echo -e "${YELLOW} Continuing with installation...${RESET}"
50 | }
51 |
52 | # Python environment setup
53 | echo -e "${BLUE}Python environment setup options:${RESET}"
54 | echo -e "1. ${GREEN}Virtual Environment:${RESET} Create a new Python venv in this directory"
55 | echo -e "2. ${GREEN}System Python:${RESET} Use system Python installation"
56 | echo -e "3. ${GREEN}Skip:${RESET} Skip Python setup (manual setup later)"
57 | read -p "Select option [1/2/3] (default: 1): " -n 1 -r PYTHON_CHOICE
58 | echo
59 |
60 | case $PYTHON_CHOICE in
61 | "2")
62 | echo -e "${BLUE}Using system Python...${RESET}"
63 | PYTHON_CMD="python3"
64 | PIP_CMD="pip3"
65 |
66 | # Check if Python is available
67 | if ! command -v $PYTHON_CMD &> /dev/null; then
68 | echo -e "${RED}❌ Python not found. Please install Python 3.${RESET}"
69 | PYTHON_SETUP_FAILED=true
70 | else
71 | echo -e "${GREEN}✅ Using Python: $($PYTHON_CMD --version)${RESET}"
72 | fi
73 |
74 | # Check if dependencies are already installed
75 | echo -e "${BLUE}Checking for required Python packages...${RESET}"
76 | MISSING_PACKAGES=""
77 | for pkg in pandas numpy matplotlib seaborn; do
78 | if ! $PYTHON_CMD -c "import $pkg" &> /dev/null; then
79 | MISSING_PACKAGES="$MISSING_PACKAGES $pkg"
80 | fi
81 | done
82 |
83 | if [ -n "$MISSING_PACKAGES" ]; then
84 | echo -e "${YELLOW}⚠️ Missing packages:$MISSING_PACKAGES${RESET}"
85 | echo -e "${YELLOW}You can install them with: $PIP_CMD install$MISSING_PACKAGES${RESET}"
86 | read -p "Install missing packages now? (y/n) " -n 1 -r
87 | echo
88 | if [[ $REPLY =~ ^[Yy]$ ]]; then
89 | $PIP_CMD install $MISSING_PACKAGES || {
90 | echo -e "${RED}❌ Failed to install packages.${RESET}"
91 | echo -e "${YELLOW}You may need to install them manually:${RESET}"
92 | echo -e "${YELLOW}$PIP_CMD install pandas numpy matplotlib seaborn${RESET}"
93 | }
94 | fi
95 | else
96 | echo -e "${GREEN}✅ All required packages are installed.${RESET}"
97 | fi
98 | ;;
99 | "3")
100 | echo -e "${YELLOW}Skipping Python setup...${RESET}"
101 | echo -e "${YELLOW}You'll need to manually install these packages to use visualization tools:${RESET}"
102 | echo -e "${YELLOW}- pandas${RESET}"
103 | echo -e "${YELLOW}- numpy${RESET}"
104 | echo -e "${YELLOW}- matplotlib${RESET}"
105 | echo -e "${YELLOW}- seaborn${RESET}"
106 | ;;
107 | *)
108 | # Default is virtual environment
109 | echo -e "${BLUE}Setting up Python virtual environment...${RESET}"
110 | if ! command -v python3 &> /dev/null; then
111 | echo -e "${RED}❌ Python not found. Please install Python 3.${RESET}"
112 | PYTHON_SETUP_FAILED=true
113 | elif ! python3 -m venv venv; then
114 | echo -e "${YELLOW}⚠️ Could not create Python virtual environment.${RESET}"
115 | echo -e "${YELLOW}Continuing without virtual environment...${RESET}"
116 | PYTHON_SETUP_FAILED=true
117 | else
118 | echo -e "${GREEN}✅ Virtual environment created.${RESET}"
119 | echo -e "${BLUE}Installing Python dependencies...${RESET}"
120 | source venv/bin/activate
121 | pip install --upgrade pip
122 | pip install pandas numpy matplotlib seaborn || {
123 | echo -e "${YELLOW}⚠️ Could not install Python dependencies.${RESET}"
124 | echo -e "${YELLOW}You may need to install them manually later.${RESET}"
125 | PYTHON_SETUP_FAILED=true
126 | }
127 |
128 |             if [ -z "${PYTHON_SETUP_FAILED:-}" ]; then
129 | echo -e "${GREEN}✅ Python dependencies installed successfully.${RESET}"
130 | echo -e "${YELLOW}To activate the virtual environment in the future, run:${RESET}"
131 | echo -e "${YELLOW}source venv/bin/activate${RESET}"
132 | fi
133 | fi
134 | ;;
135 | esac
136 |
137 | if [ -n "${PYTHON_SETUP_FAILED:-}" ]; then
138 | echo -e "${YELLOW}Python setup incomplete. Visualization tools may not work.${RESET}"
139 | echo -e "${YELLOW}You can manually install required packages later.${RESET}"
140 | fi
141 |
142 | # Setup method selection
143 | echo -e "${BLUE}Choose llama.cpp setup method:${RESET}"
144 | echo -e "1. ${GREEN}Standard:${RESET} Clone and patch llama.cpp (recommended for most users)"
145 | echo -e "2. ${GREEN}Git Submodule:${RESET} Use a forked llama.cpp as a submodule (advanced)"
146 | read -p "Select option [1/2] (default: 1): " -n 1 -r SETUP_CHOICE
147 | echo
148 |
149 | if [[ $SETUP_CHOICE == "2" ]]; then
150 | echo -e "${BLUE}Setting up llama.cpp as a git submodule...${RESET}"
151 |
152 | # Check if this is a git repository
153 | if [ ! -d ".git" ]; then
154 | echo -e "${YELLOW}⚠️ This directory is not a git repository. Initializing git...${RESET}"
155 | git init
156 | fi
157 |
158 | # Remove existing llama.cpp if present
159 | if [ -d "llama.cpp" ]; then
160 | echo -e "${YELLOW}⚠️ Removing existing llama.cpp directory...${RESET}"
161 | rm -rf llama.cpp
162 | fi
163 |
164 | # Add the forked llama.cpp as a submodule
165 | # Note: You would typically fork llama.cpp to your own account and modify it there
166 | echo -e "${BLUE}Adding llama.cpp as a submodule...${RESET}"
167 | echo -e "${YELLOW}In a real setup, you would use your own fork with KVSplit changes already applied.${RESET}"
168 | git submodule add https://github.com/ggerganov/llama.cpp.git
169 | git submodule update --init --recursive
170 |
171 | # Apply the patch to the submodule
172 | echo -e "${BLUE}Applying KV split patch to llama.cpp submodule...${RESET}"
173 | cd llama.cpp
174 | git apply ../patch/fixed_kv_patch.diff || echo -e "${YELLOW}⚠️ Patch application failed, you may need to modify the patch.${RESET}"
175 | cd ..
176 |
177 | echo -e "${GREEN}✅ Submodule setup complete.${RESET}"
178 | echo -e "${YELLOW}Note: In a real-world scenario, you would fork llama.cpp, make your changes there,${RESET}"
179 | echo -e "${YELLOW} and use your fork as the submodule URL instead of applying patches.${RESET}"
180 | else
181 | # Standard clone and patch approach
182 | echo -e "${BLUE}Setting up llama.cpp (standard method)...${RESET}"
183 | if [ -d "llama.cpp" ]; then
184 | echo -e "${YELLOW}⚠️ llama.cpp directory already exists.${RESET}"
185 | read -p "Update llama.cpp repository? (y/n) " -n 1 -r
186 | echo
187 | if [[ $REPLY =~ ^[Yy]$ ]]; then
188 | echo -e "${BLUE}Updating llama.cpp...${RESET}"
189 | cd llama.cpp
190 | git fetch --all
191 | git reset --hard origin/master
192 | cd ..
193 | fi
194 | else
195 | echo -e "${BLUE}Cloning llama.cpp repository...${RESET}"
196 | git clone https://github.com/ggerganov/llama.cpp
197 | fi
198 |
199 | # Apply the patch to llama.cpp
200 | echo -e "${BLUE}Applying KV split patch to llama.cpp...${RESET}"
201 | cd llama.cpp
202 | git apply ../patch/fixed_kv_patch.diff || echo -e "${YELLOW}⚠️ Patch application failed or patch already applied.${RESET}"
203 | cd ..
204 | fi
205 |
206 | # Check if the KV split patch exists
207 | echo -e "${BLUE}Setting up KV split patch...${RESET}"
208 | if [ ! -f "patch/fixed_kv_patch.diff" ]; then
209 | echo -e "${YELLOW}⚠️ KV patch not found, copying from included patch...${RESET}"
210 | # Copy the fixed patch that works with current llama.cpp
211 | if [ -f "patch/split_kv_quant.diff" ]; then
212 | cp patch/split_kv_quant.diff patch/fixed_kv_patch.diff
213 | else
214 | echo -e "${RED}❌ No patch files found! Your installation may not have KV split functionality.${RESET}"
215 | mkdir -p patch
216 | # Include a minimal version of the patch inline as a fallback
217 | cat > patch/fixed_kv_patch.diff << 'EOL'
218 | diff --git a/common/common.cpp b/common/common.cpp
219 | index abcdef1..1234567 100644
220 | --- a/common/common.cpp
221 | +++ b/common/common.cpp
222 | @@ -1290,6 +1290,30 @@ struct cli_params {
223 | "KV cache quantization for keys. If not specified, defaults to F16",
224 | {"--cache-type-k", "-ctk"}
225 | );
226 | +
227 | + add_param(
228 | +     &params.cache_type_v,
229 | + [](enum llama_kv_cache_type & val, const std::string & arg) {
230 | + val = llama_model_kv_cache_type_from_str(arg.c_str());
231 | + if (val == LLAMA_KV_CACHE_TYPE_COUNT) {
232 | + return CLI_PARAM_CONVERSION_ERROR;
233 | + }
234 | + return CLI_PARAM_CONVERSION_OK;
235 | + },
236 | + "KV cache quantization for values. If not specified, defaults to F16",
237 | + {"--cache-type-v", "-ctv"}
238 | + );
239 | +
240 | + // Combined KV cache quantization (sets both key and value)
241 | + add_param(
242 | + [&](const std::string & arg) {
243 | + enum llama_kv_cache_type val = llama_model_kv_cache_type_from_str(arg.c_str());
244 | + if (val == LLAMA_KV_CACHE_TYPE_COUNT) {
245 | + return CLI_PARAM_CONVERSION_ERROR;
246 | + }
247 | + params.cache_type_k = params.cache_type_v = val;
248 | + return CLI_PARAM_CONVERSION_OK;
249 | + },
250 | + "--kvq", "-kvq"
251 | + );
252 | }
253 | EOL
254 | fi
255 | fi
256 |
257 | # Continue with build process
258 |
259 | # Build llama.cpp with Metal support
260 | echo -e "${BLUE}Building llama.cpp with Metal support...${RESET}"
261 | cd llama.cpp  # both setup paths above return to the repo root, so this cd succeeds
262 | mkdir -p build
263 | cd build
264 | cmake .. -DLLAMA_METAL=ON
265 | cmake --build . --config Release -j
266 | cd ../..
267 |
268 | # Download a small test model if no models exist
269 | echo -e "${BLUE}Checking for test models...${RESET}"
270 | if [ ! "$(ls -A models 2>/dev/null)" ]; then
271 | echo -e "${BLUE}No models found. Would you like to download TinyLlama for testing?${RESET}"
272 | read -p "(y/n) " -n 1 -r
273 | echo
274 | if [[ $REPLY =~ ^[Yy]$ ]]; then
275 | echo -e "${BLUE}Downloading TinyLlama model...${RESET}"
276 | TINYLLAMA_URL="https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"
277 | MODEL_NAME="tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"
278 |         curl -L -o "models/$MODEL_NAME" "$TINYLLAMA_URL"
279 | echo -e "${GREEN}Model downloaded to models/$MODEL_NAME${RESET}"
280 | fi
281 | fi
282 |
283 | # Set up perplexity test data
284 | echo -e "${BLUE}Setting up perplexity test data...${RESET}"
285 | cat > perplexity_test_data.txt << 'EOL'
286 | The importance of efficient memory usage in language models cannot be overstated.
287 | As context lengths grow longer, the KV cache becomes a significant bottleneck.
288 | By applying different precision to keys and values, we can achieve substantial
289 | memory savings without compromising model quality. This approach is particularly
290 | beneficial for consumer devices like Apple Silicon Macs, where memory constraints
291 | are more pronounced. Through careful benchmarking, we've found that 8-bit keys
292 | combined with 4-bit values offers an excellent balance of efficiency and quality.
293 | EOL
294 |
295 | echo -e "${GREEN}✅ KVSplit installed successfully!${RESET}"
296 | echo -e "${BLUE}Directory structure:${RESET}"
297 | echo -e " ${YELLOW}./llama.cpp/build/bin/${RESET} - Compiled binaries"
298 | echo -e " ${YELLOW}./models/${RESET} - LLM model files"
299 | echo -e " ${YELLOW}./scripts/${RESET} - Utility scripts"
300 | echo -e " ${YELLOW}./results/${RESET} - Benchmark results"
301 | echo -e " ${YELLOW}./plots/${RESET} - Visualization outputs"
302 | echo -e ""
303 | echo -e "${GREEN}Recommended usage:${RESET}"
304 | echo -e "${YELLOW}# Run inference with K8V4 (recommended configuration)${RESET}"
305 | echo -e "./llama.cpp/build/bin/llama-cli -m models/your-model.gguf -p \"Your prompt\" --kvq-key 8 --kvq-val 4 --flash-attn"
306 | echo -e ""
307 | echo -e "${YELLOW}# Run quick comparison test${RESET}"
308 | echo -e "./scripts/quick_compare.py --model models/your-model.gguf"
309 | echo -e ""
310 | echo -e "${YELLOW}# Run full benchmark${RESET}"
311 | echo -e "python scripts/benchmark_kvsplit.py"
312 | echo -e ""
313 | echo -e "${GREEN}Thank you for using KVSplit!${RESET}"
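
# Note (illustrative addition, echoing insight 5 of the benchmark summary):
# stock llama.cpp can achieve the same split-precision cache without this
# patch via its existing cache-type flags, e.g.:
#   ./llama.cpp/build/bin/llama-cli -m models/your-model.gguf \
#     -p "Your prompt" --cache-type-k q8_0 --cache-type-v q4_0 --flash-attn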
314 |
--------------------------------------------------------------------------------
/scripts/quick_compare.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | """
3 | quick_compare.py - Runs a quick comparison of different KV quantization settings
4 |
5 | This script provides a simple way to compare different KV cache quantization
6 | configurations for llama.cpp models. It shows memory usage, speed, and quality
7 | metrics in an easy-to-understand table format.
8 |
9 | Usage:
10 | python quick_compare.py --model ~/models/your-model.gguf --prompt "Your test prompt"
11 | """
12 |
13 | import argparse
14 | import subprocess
15 | import re
16 | import os
17 | import json
18 | import tempfile
19 | import time
20 | from pathlib import Path
21 | import sys
22 |
23 | # ANSI color codes
24 | GREEN = '\033[0;32m'
25 | YELLOW = '\033[1;33m'
26 | BLUE = '\033[0;34m'
27 | RED = '\033[0;31m'
28 | RESET = '\033[0m'
29 |
30 | def print_color(color, message):
31 | """Print colored message to console"""
32 | print(f"{color}{message}{RESET}")
33 |
34 | def create_temp_prompt_file(prompt, min_length=200):
35 | """Create a temporary file with sufficient prompt text for perplexity testing"""
36 | # Ensure the prompt is long enough for meaningful perplexity testing
37 | if len(prompt) < min_length:
38 | # Extend the prompt with philosophical content
39 | extension = """
40 | The concept of memory efficiency in language models is fundamentally important.
41 | As we process longer contexts, the memory footprint becomes a critical constraint.
42 | By applying different precision to attention mechanisms, we can achieve significant
43 | savings without compromising the quality of generated text or understanding.
44 | This approach is particularly valuable for resource-constrained environments.
45 | """
46 | prompt = prompt + extension
47 |
48 | # Write to temp file
49 | fd, temp_path = tempfile.mkstemp(text=True)
50 | with os.fdopen(fd, 'w') as f:
51 | f.write(prompt)
52 |
53 | return temp_path
54 |
55 | def parse_memory_from_output(output):
56 | """Parse memory usage metrics from llama.cpp output"""
57 | try:
58 | # Look for KV cache memory usage
59 | kv_cache_mb = None
60 | kv_match = re.search(r'KV cache elements: \d+.*?(\d+(?:\.\d+)?) MiB', output, re.MULTILINE)
61 | if kv_match:
62 | kv_cache_mb = float(kv_match.group(1))
63 |
64 | # Look for VRAM usage (try multiple patterns)
65 | vram_mb = None
66 | patterns = [
67 | r'VRAM usage: (\d+(?:\.\d+)?) MiB',
68 | r'GPU memory used: \d+ bytes = (\d+(?:\.\d+)?) MB'
69 | ]
70 | for pattern in patterns:
71 | match = re.search(pattern, output, re.MULTILINE)
72 | if match:
73 | vram_mb = float(match.group(1))
74 | break
75 |
76 | return kv_cache_mb, vram_mb
77 |
78 | except Exception as e:
79 | print_color(RED, f"Error parsing memory metrics: {e}")
80 | return None, None
81 |
82 | def parse_speed_from_output(output):
83 | """Parse speed metrics from llama.cpp output"""
84 | try:
85 | # Parse tokens per second
86 | speed_match = re.search(r'(\d+(\.\d+)?) tokens/sec', output, re.MULTILINE)
87 | if speed_match:
88 | tokens_per_sec = float(speed_match.group(1))
89 | return tokens_per_sec
90 |
91 | # Alternative pattern
92 | speed_match = re.search(r'eval time: (\d+\.\d+) ms \((\d+\.\d+) tokens/sec\)', output, re.MULTILINE)
93 | if speed_match:
94 | tokens_per_sec = float(speed_match.group(2))
95 | return tokens_per_sec
96 |
97 | return None
98 | except Exception as e:
99 | print_color(RED, f"Error parsing speed metrics: {e}")
100 | return None
101 |
102 | def parse_perplexity(output):
103 | """Parse perplexity from llama-perplexity output"""
104 | try:
105 | perplexity_match = re.search(r'perplexity: (\d+\.\d+)', output, re.MULTILINE | re.IGNORECASE)
106 | if perplexity_match:
107 | return float(perplexity_match.group(1))
108 |
109 | # Alternative pattern
110 | perplexity_match = re.search(r'final\s+(?:avg)?\s*perplexity: (\d+\.\d+)', output, re.MULTILINE | re.IGNORECASE)
111 | if perplexity_match:
112 | return float(perplexity_match.group(1))
113 |
114 | return None
115 | except Exception as e:
116 | print_color(RED, f"Error parsing perplexity: {e}")
117 | return None
118 |
119 | def run_comparison(model_path, prompt, seq_len=2048, num_threads=8):
120 | """Run a comparison of different KV quantization settings"""
121 | configs = [
122 | {"name": "FP16", "args": "--ctx-size {seq_len}", "desc": "Baseline (16-bit)"},
123 | {"name": "K8V8", "args": "--ctx-size {seq_len} --kvq 8", "desc": "8-bit keys & values"},
124 | {"name": "K8V4", "args": "--ctx-size {seq_len} --kvq-key 8 --kvq-val 4", "desc": "8-bit keys, 4-bit values (RECOMMENDED)"},
125 | {"name": "K4V8", "args": "--ctx-size {seq_len} --kvq-key 4 --kvq-val 8", "desc": "4-bit keys, 8-bit values"},
126 | {"name": "K4V4", "args": "--ctx-size {seq_len} --kvq 4", "desc": "4-bit keys & values"},
127 | ]
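    |
    | # Back-of-envelope expectation for these configs (a sketch, assuming
    | # llama.cpp's q8_0 (~8.5 bits/elem) and q4_0 (~4.5 bits/elem) block formats;
    | # real allocations also include padding and alignment):
    | #   kv_bytes_per_token ~= n_layers * n_kv_heads * head_dim * (k_bits + v_bits) / 8
    | # so K8V4 should need roughly (8.5 + 4.5) / 32 ~= 41% of the FP16 cache.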
128 |
129 | # Validate the model path
130 | model_path = os.path.expanduser(model_path)
131 | if not os.path.exists(model_path):
132 | print_color(RED, f"Error: Model file not found at {model_path}")
133 | sys.exit(1)
134 |
135 | # Create a temporary prompt file for perplexity testing
136 | prompt_file = create_temp_prompt_file(prompt)
137 |
138 | # Get the base directory
139 | base_dir = Path(__file__).parent.parent.absolute()
140 | llama_cli_path = base_dir / "llama.cpp" / "build" / "bin" / "llama-cli"
141 | llama_perplexity_path = base_dir / "llama.cpp" / "build" / "bin" / "llama-perplexity"
142 |
143 | # Validate the binaries
144 | if not os.path.exists(llama_cli_path):
145 | print_color(RED, f"Error: llama-cli binary not found at {llama_cli_path}")
146 | print_color(YELLOW, "Did you run the install_kvsplit.sh script?")
147 | sys.exit(1)
148 |
149 | if not os.path.exists(llama_perplexity_path):
150 | print_color(RED, f"Error: llama-perplexity binary not found at {llama_perplexity_path}")
151 | print_color(YELLOW, "Did you run the install_kvsplit.sh script?")
152 | sys.exit(1)
153 |
154 | results = []
155 | fp16_perplexity = None # For calculating relative perplexity
156 |
157 | print_color(GREEN, "Running quick comparison of KV cache configurations:")
158 | print_color(BLUE, f"Model: {model_path}")
159 | print_color(BLUE, f"Context size: {seq_len} tokens")
160 | print_color(BLUE, f"Threads: {num_threads}")
161 | print()
162 |
163 | for i, config in enumerate(configs):
164 | config_name = config["name"]
165 | print_color(YELLOW, f"[{i+1}/{len(configs)}] Testing {config_name}: {config['desc']}")
166 |
167 | try:
168 | # Format the args string with the sequence length
169 | args = config["args"].format(seq_len=seq_len)
170 |
171 | # Run inference to measure memory and speed
172 |             inference_cmd = f"\"{llama_cli_path}\" -m \"{model_path}\" {args} -p \"{prompt[:50]}\" -n 50 -t {num_threads} --flash-attn"
173 | print_color(BLUE, f"Running: {inference_cmd}")
174 |
175 | try:
176 | inference_output = subprocess.check_output(
177 | inference_cmd, shell=True, stderr=subprocess.STDOUT
178 | ).decode('utf-8', errors='ignore')
179 | except subprocess.CalledProcessError as e:
180 | inference_output = e.output.decode('utf-8', errors='ignore')
181 | print_color(RED, f"Command failed with exit code {e.returncode}")
182 | print_color(RED, inference_output)
183 | continue
184 |
185 | # Run perplexity test
186 |             perplexity_cmd = f"\"{llama_perplexity_path}\" -m \"{model_path}\" {args} -f \"{prompt_file}\" -t {num_threads}"
187 | print_color(BLUE, f"Running perplexity test: {perplexity_cmd}")
188 |
189 | try:
190 | perplexity_output = subprocess.check_output(
191 | perplexity_cmd, shell=True, stderr=subprocess.STDOUT
192 | ).decode('utf-8', errors='ignore')
193 | except subprocess.CalledProcessError as e:
194 | perplexity_output = e.output.decode('utf-8', errors='ignore')
195 | print_color(RED, f"Perplexity command failed with exit code {e.returncode}")
196 | print_color(RED, perplexity_output)
197 | perplexity_output = ""
198 |
199 | # Parse metrics
200 | kv_cache_mb, vram_mb = parse_memory_from_output(inference_output)
201 | tokens_per_sec = parse_speed_from_output(inference_output)
202 | perplexity = parse_perplexity(perplexity_output)
203 |
204 | # Store FP16 perplexity as baseline
205 | if config_name == "FP16" and perplexity is not None:
206 | fp16_perplexity = perplexity
207 |
208 | # Calculate perplexity change
209 | perplexity_change = None
210 | if fp16_perplexity is not None and perplexity is not None:
211 | perplexity_change = ((perplexity - fp16_perplexity) / fp16_perplexity) * 100
212 |
213 | results.append({
214 | "Configuration": config_name,
215 | "Description": config["desc"],
216 | "KV_Cache_MB": kv_cache_mb,
217 | "VRAM_MB": vram_mb,
218 | "Tokens_per_sec": tokens_per_sec,
219 | "Perplexity": perplexity,
220 | "Perplexity_Change_Pct": perplexity_change
221 | })
222 |
223 | print_color(GREEN, f"Completed {config_name} test")
224 | if kv_cache_mb is not None:
225 | print_color(GREEN, f" KV Cache: {kv_cache_mb:.2f} MB")
226 | if vram_mb is not None:
227 | print_color(GREEN, f" VRAM: {vram_mb:.2f} MB")
228 | if tokens_per_sec is not None:
229 | print_color(GREEN, f" Speed: {tokens_per_sec:.2f} tokens/sec")
230 | if perplexity is not None:
231 | print_color(GREEN, f" Perplexity: {perplexity:.4f}")
232 | if perplexity_change is not None:
233 | change_color = GREEN if perplexity_change < 1.0 else (YELLOW if perplexity_change < 5.0 else RED)
234 | print_color(change_color, f" Quality impact: {perplexity_change:+.2f}% vs FP16")
235 |
236 | print()
237 | time.sleep(1) # Brief pause between tests
238 |
239 | except Exception as e:
240 | print_color(RED, f"Error running {config_name} test: {e}")
241 |
242 | # Clean up the temporary file
243 | try:
244 | os.unlink(prompt_file)
245 |     except OSError:
246 | pass
247 |
248 | # Calculate savings percentages
249 | if len(results) > 0 and "FP16" in [r["Configuration"] for r in results]:
250 | fp16_result = next(r for r in results if r["Configuration"] == "FP16")
251 | fp16_kv = fp16_result.get("KV_Cache_MB")
252 | fp16_vram = fp16_result.get("VRAM_MB")
253 |
254 | if fp16_kv is not None:
255 | for result in results:
256 | kv = result.get("KV_Cache_MB")
257 | if kv is not None and result["Configuration"] != "FP16":
258 | result["KV_Savings_Pct"] = (1 - kv / fp16_kv) * 100
259 |
260 | if fp16_vram is not None:
261 | for result in results:
262 | vram = result.get("VRAM_MB")
263 | if vram is not None and result["Configuration"] != "FP16":
264 | result["VRAM_Savings_Pct"] = (1 - vram / fp16_vram) * 100
265 |
266 | # Display results as a table
267 | if len(results) > 0:
268 | print_color(GREEN, "📊 KVSplit Comparison Results:")
269 | print()
270 |
271 | # Header
272 | print(f"{'Configuration':<12} {'KV Cache':<15} {'VRAM':<15} {'Speed':<15} {'Quality':<15} {'Description':<30}")
273 | print("-" * 100)
274 |
275 | # Rows
276 | for result in results:
277 | config = result.get("Configuration", "")
278 |
279 | # KV cache column
280 |         kv_cache = f"{result['KV_Cache_MB']:.2f} MB" if result.get('KV_Cache_MB') is not None else "N/A"
281 | if "KV_Savings_Pct" in result:
282 | kv_cache += f" (-{result['KV_Savings_Pct']:.1f}%)"
283 |
284 | # VRAM column
285 |         vram = f"{result['VRAM_MB']:.2f} MB" if result.get('VRAM_MB') is not None else "N/A"
286 | if "VRAM_Savings_Pct" in result:
287 | vram += f" (-{result['VRAM_Savings_Pct']:.1f}%)"
288 |
289 | # Speed column
290 |         speed = f"{result['Tokens_per_sec']:.1f} t/s" if result.get('Tokens_per_sec') is not None else "N/A"
291 |
292 | # Quality column
293 | quality = ""
294 | if result.get('Perplexity') is not None:
295 | quality = f"{result.get('Perplexity'):.4f}"
296 | if result.get('Perplexity_Change_Pct') is not None and config != "FP16":
297 | quality += f" ({result['Perplexity_Change_Pct']:+.2f}%)"
298 | else:
299 | quality = "N/A"
300 |
301 | print(f"{config:<12} {kv_cache:<15} {vram:<15} {speed:<15} {quality:<15} {result.get('Description', ''):<30}")
302 |
303 | print()
304 | print_color(GREEN, "Interpretation:")
305 | print_color(BLUE, "- Lower KV Cache and VRAM values are better (more memory efficient)")
306 | print_color(BLUE, "- Higher Speed values are better (faster inference)")
307 | print_color(BLUE, "- Lower Perplexity values and smaller % changes are better (higher quality)")
308 | print()
309 | print_color(GREEN, "Recommendation:")
310 | print_color(YELLOW, "K8V4 (8-bit keys, 4-bit values) typically offers the best balance of memory savings and quality.")
311 | print()
312 |
313 | # Save results as JSON
314 | try:
315 | base_dir = Path(__file__).parent.parent.absolute()
316 | results_dir = base_dir / "results"
317 | os.makedirs(results_dir, exist_ok=True)
318 | timestamp = time.strftime("%Y%m%d_%H%M%S")
319 | json_path = results_dir / f"quick_compare_{timestamp}.json"
320 |
321 | with open(json_path, 'w') as f:
322 | json.dump(results, f, indent=2)
323 |
324 | print_color(GREEN, f"Results saved to {json_path}")
325 | except Exception as e:
326 | print_color(RED, f"Error saving results: {e}")
327 |
328 | def main():
329 | """Main function"""
330 | parser = argparse.ArgumentParser(description="Compare KV quantization settings")
331 | parser.add_argument("--model", required=True, help="Path to the model file")
332 | parser.add_argument("--prompt", default="The theory of quantum mechanics explains how particles behave at the atomic and subatomic levels. This counterintuitive framework has revolutionized our understanding of physics.",
333 | help="Test prompt")
334 | parser.add_argument("--seq-len", type=int, default=2048, help="Sequence length to test")
335 | parser.add_argument("--threads", type=int, default=8, help="Number of threads to use")
336 | args = parser.parse_args()
337 |
338 | run_comparison(args.model, args.prompt, args.seq_len, args.threads)
339 |
340 | if __name__ == "__main__":
341 | main()
342 |
--------------------------------------------------------------------------------
/scripts/run_benchmark.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | set -euo pipefail
3 |
4 | # Get the directory of this script
5 | SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
6 | LLAMA_CPP_DIR="${SCRIPT_DIR}/llama.cpp"
7 | MODEL_PATH="${SCRIPT_DIR}/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"
8 | RESULTS_DIR="${SCRIPT_DIR}/results"
9 |
10 | # Create results directory if it doesn't exist
11 | mkdir -p "${RESULTS_DIR}"
12 |
13 | # Check if model exists
14 | if [ ! -f "${MODEL_PATH}" ]; then
15 | echo "Error: Model not found at ${MODEL_PATH}"
16 | exit 1
17 | fi
18 |
19 | # Change to llama.cpp directory
20 | cd "${LLAMA_CPP_DIR}" || exit 1
21 |
22 | # Test configurations
23 | CONFIGS=(
24 | "FP16"
25 | "K8V8"
26 | "K8V4"
27 | "K4V8"
28 | "K4V4"
29 | )
30 |
31 | # Sequence lengths to test
32 | SEQUENCE_LENGTHS=(128 2048 4096 8192)
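    |
    | # The KV cache grows linearly with context, so the 8192-token runs should
    | # show roughly 64x the cache footprint of the 128-token runs for any config.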
33 |
34 | # Run benchmarks
35 | for seq_len in "${SEQUENCE_LENGTHS[@]}"; do
36 | echo -e "\n=== Benchmarking sequence length: ${seq_len} ==="
37 |
38 | for config in "${CONFIGS[@]}"; do
39 | echo -e "\n--- Testing ${config} ---"
40 |
41 | # Set KV cache parameters based on config
42 | case $config in
43 | "FP16")
44 | KV_ARGS=""
45 | ;;
46 | "K8V8")
47 | KV_ARGS="--kvq-key 8 --kvq-val 8"
48 | ;;
49 | "K8V4")
50 | KV_ARGS="--kvq-key 8 --kvq-val 4"
51 | ;;
52 | "K4V8")
53 | KV_ARGS="--kvq-key 4 --kvq-val 8"
54 | ;;
55 | "K4V4")
56 | KV_ARGS="--kvq-key 4 --kvq-val 4"
57 | ;;
58 | *)
59 | echo "Unknown config: ${config}"
60 | continue
61 | ;;
62 | esac
63 |
64 | # Run the benchmark
65 | OUTPUT_FILE="${RESULTS_DIR}/benchmark_${config}_seq${seq_len}.txt"
66 | echo "Running: ./main -m ${MODEL_PATH} -p \"Benchmarking KV cache performance\" -n ${seq_len} -t 8 -fa 0 ${KV_ARGS}"
67 |
68 | # Run with timeout to prevent hanging
69 | timeout 5m ./main -m "${MODEL_PATH}" -p "Benchmarking KV cache performance" \
70 | -n ${seq_len} -t 8 -fa 0 ${KV_ARGS} 2>&1 | tee "${OUTPUT_FILE}" || \
71 | echo "Warning: Benchmark for ${config} with seq_len=${seq_len} timed out or failed"
72 | done
73 | done
74 |
75 | echo -e "\n=== Benchmarking complete! Results saved to ${RESULTS_DIR} ==="
76 |
--------------------------------------------------------------------------------
/scripts/test_kvsplit.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | set -euo pipefail
3 |
4 | # Get the directory of this script
5 | SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
6 | LLAMA_CPP_DIR="${SCRIPT_DIR}/llama.cpp"
7 | MODEL_PATH="${SCRIPT_DIR}/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"
8 |
9 | # Check if model exists
10 | if [ ! -f "${MODEL_PATH}" ]; then
11 | echo "Error: Model not found at ${MODEL_PATH}"
12 | exit 1
13 | fi
14 |
15 | # Run a simple test
16 | cd "${LLAMA_CPP_DIR}" || exit 1
17 |
18 | # Test with default settings
19 | echo -e "\n=== Testing with default settings (FP16) ==="
20 | ./main -m "${MODEL_PATH}" -p "Hello, world" -n 16 -fa 0 -t 8
21 |
22 | # Test with K8V4 (8-bit keys, 4-bit values)
23 | echo -e "\n\n=== Testing with K8V4 ==="
24 | ./main -m "${MODEL_PATH}" -p "Hello, world" -n 16 -fa 0 -t 8 --kvq-key 8 --kvq-val 4
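    |
    | # With the patch applied, the K8V4 run's log should report type_k = q8_0 and
    | # type_v = q4_0; validate_kvsplit.sh checks this automatically.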
25 |
--------------------------------------------------------------------------------
/scripts/validate_kvsplit.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | # Exit on error, undefined variables, and pipeline failures
4 | set -euo pipefail
5 |
6 | # Colors for better output
7 | RED='\033[0;31m'
8 | GREEN='\033[0;32m'
9 | YELLOW='\033[1;33m'
10 | BLUE='\033[0;34m'
11 | NC='\033[0m' # No Color
12 |
13 | # Function to display error messages
14 | error_exit() {
15 | local line_number=$1
16 | local message=$2
17 | echo -e "${RED}Error on line ${line_number}: ${message}${NC}" >&2
18 | exit 1
19 | }
20 |
21 | # Trap errors
22 | trap 'error_exit $LINENO "Command failed with status $?"' ERR
23 |
24 | # Function to print section headers
25 | print_section() {
26 | echo -e "\n${YELLOW}==> $1${NC}"
27 | }
28 |
29 | # Function to extract Metal allocation information
30 | extract_metal_stats() {
31 | local logfile=$1
32 | local type=$2
33 | echo -e "${BLUE}Extracting Metal stats for $type:${NC}"
34 |
35 | # Check if log has Metal allocation information
36 | if grep -q "METAL ALLOC" "$logfile"; then
37 | echo -e "${GREEN}✓ Metal memory logging enabled${NC}"
38 |
39 | # Extract total allocated memory
40 | total_allocated=$(grep "METAL ALLOC" "$logfile" | grep -v "freed" | awk '{sum+=$3} END {print sum/1024/1024 " MB"}')
41 | echo "Total Metal memory allocated: $total_allocated"
42 |
43 | # Extract KV cache allocation if available
44 | kv_cache_alloc=$(grep -i "kv " "$logfile" | grep -i "cache" | grep "METAL ALLOC" | awk '{sum+=$3} END {if(sum>0) print sum/1024/1024 " MB"; else print "Not found"}')
45 | echo "KV cache memory: $kv_cache_alloc"
46 |
47 | # Extract K and V allocations separately if possible
48 | k_alloc=$(grep -i " k " "$logfile" | grep "METAL ALLOC" | awk '{sum+=$3} END {if(sum>0) print sum/1024/1024 " MB"; else print "Not found"}')
49 | v_alloc=$(grep -i " v " "$logfile" | grep "METAL ALLOC" | awk '{sum+=$3} END {if(sum>0) print sum/1024/1024 " MB"; else print "Not found"}')
50 |
51 | echo "K cache memory: $k_alloc"
52 | echo "V cache memory: $v_alloc"
53 | else
54 |         echo -e "${RED}No Metal allocation information found in this log (the build may not emit verbose Metal allocation messages).${NC}"
55 | fi
56 |
57 | echo "----------------------------------------------"
58 | }
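    |
    | # Note: the awk sums above assume the third whitespace-separated field of a
    | # "METAL ALLOC" line is a byte count; sum/1024/1024 then reports it in MiB
    | # (printed as "MB").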
59 |
60 | # Function to verify KV cache type in logs
61 | verify_kv_types() {
62 | local logfile=$1
63 | local expected_k_type=$2
64 | local expected_v_type=$3
65 |
66 | echo -e "${BLUE}Verifying KV cache types:${NC}"
67 |
68 | # Extract type info from logs
69 | if grep -q "type_k" "$logfile" && grep -q "type_v" "$logfile"; then
70 | k_type=$(grep "type_k" "$logfile" | head -1 | sed 's/.*type_k[^a-z0-9]*\([a-z0-9_]*\).*/\1/')
71 | v_type=$(grep "type_v" "$logfile" | head -1 | sed 's/.*type_v[^a-z0-9]*\([a-z0-9_]*\).*/\1/')
72 |
73 | echo "Found key type: $k_type (Expected: $expected_k_type)"
74 | echo "Found value type: $v_type (Expected: $expected_v_type)"
75 |
76 | # Basic verification
77 | if [[ "$k_type" == *"$expected_k_type"* ]]; then
78 | echo -e "${GREEN}✓ Key type matches expected type${NC}"
79 | else
80 | echo -e "${RED}× Key type does not match expected type${NC}"
81 | fi
82 |
83 | if [[ "$v_type" == *"$expected_v_type"* ]]; then
84 | echo -e "${GREEN}✓ Value type matches expected type${NC}"
85 | else
86 | echo -e "${RED}× Value type does not match expected type${NC}"
87 | fi
88 | else
89 | echo -e "${YELLOW}KV cache type information not found in logs${NC}"
90 | fi
91 |
92 | echo "----------------------------------------------"
93 | }
94 |
95 | # Main execution
96 | print_section "KVSplit Validation Script"
97 |
98 | # Get base directory
99 | BASE_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
100 | LLAMA_CPP_DIR="${BASE_DIR}/llama.cpp"
101 | MODEL_PATH="${BASE_DIR}/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"
102 | PATCH_PATH="${BASE_DIR}/patch/kvsplit_fixed.patch"
103 | RESULTS_DIR="${BASE_DIR}/results"
104 |
105 | # Ensure results directory exists
106 | mkdir -p "${RESULTS_DIR}"
107 |
108 | # Verify model exists
109 | if [ ! -f "${MODEL_PATH}" ]; then
110 | error_exit "$LINENO" "Model not found at ${MODEL_PATH}. Run setup.sh first."
111 | fi
112 |
113 | # Check if patch exists
114 | if [ ! -f "${PATCH_PATH}" ]; then
115 | error_exit "$LINENO" "Patch file not found at ${PATCH_PATH}"
116 | fi
117 |
118 | # Apply the patch
119 | print_section "Applying KVSplit patch"
120 | cd "${LLAMA_CPP_DIR}"
121 | # Check if patch has already been applied
122 | if grep -q "kvq-key" "${LLAMA_CPP_DIR}/common/arg.cpp"; then
123 | echo -e "${YELLOW}Patch appears to be already applied${NC}"
124 | else
125 | # Apply the patch
126 | patch -p1 < "${PATCH_PATH}" || error_exit "$LINENO" "Failed to apply patch"
127 | echo -e "${GREEN}✓ Patch applied successfully${NC}"
128 | fi
129 |
130 | # Build llama.cpp with Metal support
131 | print_section "Building llama.cpp with Metal support"
132 | cd "${LLAMA_CPP_DIR}"
133 |
134 | # Build with CMake instead of make, since that's how we set up in Step 1
135 | echo "Building with CMake..."
136 | rm -rf build 2>/dev/null || true
137 | mkdir -p build
138 | cd build
139 |
140 | # Configure with CMake
141 | cmake .. \
142 | -DCMAKE_BUILD_TYPE=Release \
143 | -DLLAMA_METAL=on \
144 | -DCMAKE_OSX_ARCHITECTURES="arm64" \
145 | -DCMAKE_OSX_DEPLOYMENT_TARGET="11.0" \
146 | -DLLAMA_METAL_EMBED_LIBRARY=ON \
147 | -DLLAMA_METAL_SHADER_DEBUG=OFF \
148 | -DLLAMA_BUILD_SERVER=OFF \
149 | -DBUILD_SHARED_LIBS=OFF || error_exit "$LINENO" "CMake configuration failed"
150 |
151 | # Build with multiple jobs
152 | cmake --build . --config Release -j$(sysctl -n hw.logicalcpu) || \
153 | error_exit "$LINENO" "Failed to build llama.cpp"
154 |
155 | # Find the main executable
156 | MAIN_EXEC=""
157 | if [ -f "bin/main" ]; then
158 | MAIN_EXEC="bin/main"
159 | elif [ -f "bin/llama-cli" ]; then
160 | MAIN_EXEC="bin/llama-cli"
161 | else
162 | # Try to find any llama executable
163 |         MAIN_EXEC=$(find . -name "llama-*" -type f -perm +111 | grep -v '\.o$' | head -1)  # BSD find on macOS lacks -executable
164 | if [ -z "$MAIN_EXEC" ]; then
165 | error_exit "$LINENO" "Could not find main llama executable"
166 | fi
167 | fi
168 |
169 | echo -e "${GREEN}✓ Successfully built llama.cpp with Metal support${NC}"
170 | echo "Main executable: ${MAIN_EXEC}"
171 |
172 | MAIN_EXEC="build/${MAIN_EXEC#./}"; cd "${LLAMA_CPP_DIR}"  # paths found above are relative to build/
173 |
174 | # Run tests
175 | print_section "Running validation tests"
176 |
177 | # Test case 1: Baseline (FP16 - same precision for both)
178 | echo -e "${BLUE}Test 1: Baseline (FP16)${NC}"
179 | LOG_FILE="${RESULTS_DIR}/kvsplit_test_fp16.log"
180 | echo "Running: ./${MAIN_EXEC} -m ${MODEL_PATH} -t 8 -n 32 -p \"Hello world\""
181 | ./${MAIN_EXEC} -m "${MODEL_PATH}" -t 8 -n 32 -p "Hello world" > "${LOG_FILE}" 2>&1 || {
182 | echo -e "${RED}Test 1 failed with exit code $?${NC}"
183 | error_exit "$LINENO" "Baseline test failed"
184 | }
185 | echo -e "${GREEN}✓ Test 1 completed successfully${NC}"
186 | extract_metal_stats "${LOG_FILE}" "FP16"
187 | verify_kv_types "${LOG_FILE}" "f16" "f16"
188 |
189 | # Test case 2: Using existing --kvq parameter (Q8_0 - same precision for both)
190 | echo -e "${BLUE}Test 2: Using existing --kvq parameter (Q8_0)${NC}"
191 | LOG_FILE="${RESULTS_DIR}/kvsplit_test_kvq_q8.log"
192 | echo "Running: ./${MAIN_EXEC} -m ${MODEL_PATH} --kvq 8 -t 8 -n 32 -p \"Hello world\""
193 | ./${MAIN_EXEC} -m "${MODEL_PATH}" --kvq 8 -t 8 -n 32 -p "Hello world" > "${LOG_FILE}" 2>&1 || {
194 | echo -e "${RED}Test 2 failed with exit code $?${NC}"
195 | error_exit "$LINENO" "Test using --kvq parameter failed"
196 | }
197 | echo -e "${GREEN}✓ Test 2 completed successfully${NC}"
198 | extract_metal_stats "${LOG_FILE}" "Q8_0"
199 | verify_kv_types "${LOG_FILE}" "q8_0" "q8_0"
200 |
201 | # Test case 3: Split precision (K8V4 - 8-bit keys, 4-bit values)
202 | echo -e "${BLUE}Test 3: Split precision (K8V4 - 8-bit keys, 4-bit values)${NC}"
203 | LOG_FILE="${RESULTS_DIR}/kvsplit_test_k8v4.log"
204 | echo "Running: ./${MAIN_EXEC} -m ${MODEL_PATH} --kvq-key 8 --kvq-val 4 -t 8 -n 32 -p \"Hello world\""
205 | ./${MAIN_EXEC} -m "${MODEL_PATH}" --kvq-key 8 --kvq-val 4 -t 8 -n 32 -p "Hello world" > "${LOG_FILE}" 2>&1 || {
206 | echo -e "${RED}Test 3 failed with exit code $?${NC}"
207 | error_exit "$LINENO" "Split precision K8V4 test failed"
208 | }
209 | echo -e "${GREEN}✓ Test 3 completed successfully${NC}"
210 | extract_metal_stats "${LOG_FILE}" "K8V4"
211 | verify_kv_types "${LOG_FILE}" "q8_0" "q4_0"
212 |
213 | # Test case 4: Reverse configuration (K4V8 - 4-bit keys, 8-bit values)
214 | echo -e "${BLUE}Test 4: Reverse configuration (K4V8 - 4-bit keys, 8-bit values)${NC}"
215 | LOG_FILE="${RESULTS_DIR}/kvsplit_test_k4v8.log"
216 | echo "Running: ./${MAIN_EXEC} -m ${MODEL_PATH} --kvq-key 4 --kvq-val 8 -t 8 -n 32 -p \"Hello world\""
217 | ./${MAIN_EXEC} -m "${MODEL_PATH}" --kvq-key 4 --kvq-val 8 -t 8 -n 32 -p "Hello world" > "${LOG_FILE}" 2>&1 || {
218 | echo -e "${RED}Test 4 failed with exit code $?${NC}"
219 | error_exit "$LINENO" "Reverse configuration K4V8 test failed"
220 | }
221 | echo -e "${GREEN}✓ Test 4 completed successfully${NC}"
222 | extract_metal_stats "${LOG_FILE}" "K4V8"
223 | verify_kv_types "${LOG_FILE}" "q4_0" "q8_0"
224 |
225 | # Test case 5: Both 4-bit (K4V4 - 4-bit keys, 4-bit values)
226 | echo -e "${BLUE}Test 5: Both 4-bit (K4V4 - 4-bit keys, 4-bit values)${NC}"
227 | LOG_FILE="${RESULTS_DIR}/kvsplit_test_k4v4.log"
228 | echo "Running: ./${MAIN_EXEC} -m ${MODEL_PATH} --kvq 4 -t 8 -n 32 -p \"Hello world\""
229 | ./${MAIN_EXEC} -m "${MODEL_PATH}" --kvq 4 -t 8 -n 32 -p "Hello world" > "${LOG_FILE}" 2>&1 || {
230 | echo -e "${RED}Test 5 failed with exit code $?${NC}"
231 | error_exit "$LINENO" "K4V4 test failed"
232 | }
233 | echo -e "${GREEN}✓ Test 5 completed successfully${NC}"
234 | extract_metal_stats "${LOG_FILE}" "K4V4"
235 | verify_kv_types "${LOG_FILE}" "q4_0" "q4_0"
236 |
237 | # Print summary
238 | print_section "KVSplit Validation Summary"
239 | echo -e "${GREEN}✓ All tests completed successfully!${NC}"
240 | echo "Results saved in: ${RESULTS_DIR}"
241 | echo ""
242 | echo -e "${YELLOW}Memory usage summary:${NC}"
243 | echo "Baseline (FP16): $(grep "KV cache memory:" "${RESULTS_DIR}/kvsplit_test_fp16.log" | awk '{print $4, $5}')"
244 | echo "Q8_0 (--kvq 8): $(grep "KV cache memory:" "${RESULTS_DIR}/kvsplit_test_kvq_q8.log" | awk '{print $4, $5}')"
245 | echo "K8V4 (--kvq-key 8 --kvq-val 4): $(grep "KV cache memory:" "${RESULTS_DIR}/kvsplit_test_k8v4.log" | awk '{print $4, $5}')"
246 | echo "K4V8 (--kvq-key 4 --kvq-val 8): $(grep "KV cache memory:" "${RESULTS_DIR}/kvsplit_test_k4v8.log" | awk '{print $4, $5}')"
247 | echo "K4V4 (--kvq 4): $(grep "KV cache memory:" "${RESULTS_DIR}/kvsplit_test_k4v4.log" | awk '{print $4, $5}')"
248 | echo ""
249 | echo -e "${BLUE}Notes:${NC}"
250 | echo "1. K8V4 (8-bit keys, 4-bit values) typically provides a good balance of memory savings with lower quality loss"
251 | echo "2. K4V8 (4-bit keys, 8-bit values) typically shows more quality degradation as keys are more sensitive to quantization"
252 | echo "3. Results may vary by model size and context length"
253 | echo "4. Memory measurements may show slight differences from theoretical calculations due to 256B page alignment"
254 | echo ""
255 | echo "Run these tests with longer contexts and different prompts to better evaluate the impact on quality"
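    |
    | # Worked example for note 4: with 256-byte alignment each buffer rounds up as
    | #   aligned = ((size + 255) / 256) * 256    (integer division)
    | # so a 1000-byte allocation, for instance, is billed as 1024 bytes.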
256 |
--------------------------------------------------------------------------------
/scripts/visualize_results.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | # -*- coding: utf-8 -*-
3 | """
4 | Visualization Script for KVSplit Benchmarks
5 |
6 | This script generates publication-quality plots from the benchmark data,
7 | showing memory usage, performance impact, quality impact, and key-value sensitivity.
8 | """
9 |
10 | import os
11 | import glob
12 | import pandas as pd
13 | import numpy as np
14 | import matplotlib.pyplot as plt
15 | import seaborn as sns
16 | from pathlib import Path
17 |
18 | # Use colorblind-friendly style
19 | plt.style.use("tableau-colorblind10")
20 |
21 | # Constants
22 | OUTPUT_DIR = Path(__file__).resolve().parent.parent / "plots"      # resolve against the repo root, not the CWD
23 | RESULTS_DIR = Path(__file__).resolve().parent.parent / "results"
24 |
25 | # Create output directory if it doesn't exist
26 | os.makedirs(OUTPUT_DIR, exist_ok=True)
27 |
28 | def load_latest_results():
29 | """Load the most recent benchmark results CSV file"""
30 | # Find the most recent benchmark results file
31 | result_files = glob.glob(str(RESULTS_DIR / "benchmark_results_*.csv"))
32 | if not result_files:
33 | raise FileNotFoundError("No benchmark result files found")
34 |
35 | # Sort by modification time, newest first
36 | latest_file = max(result_files, key=os.path.getmtime)
37 | print(f"Using benchmark data from: {latest_file}")
38 |
39 | # Load data
40 | df = pd.read_csv(latest_file)
41 | return df
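    |
    | # Columns assumed present in the CSV (inferred from prepare_data below):
    | # Configuration, Sequence_Length, VRAM_Usage_MB, KV_Cache_MB,
    | # Throughput_Tokens_Per_Sec, Perplexity, Success.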
42 |
43 | def prepare_data(df):
44 | """Process the benchmark data for visualization"""
45 | # Group by configuration and sequence length, averaging across runs
46 | grouped = df.groupby(['Configuration', 'Sequence_Length']).agg({
47 | 'VRAM_Usage_MB': 'mean',
48 | 'KV_Cache_MB': 'mean',
49 | 'Throughput_Tokens_Per_Sec': 'mean',
50 | 'Perplexity': 'mean',
51 |         'Success': 'mean'  # fraction of runs that succeeded
52 | }).reset_index()
53 |
54 | # Calculate the baseline FP16 values for each sequence length
55 | fp16_baseline = grouped[grouped['Configuration'] == 'FP16'].copy()
56 |
57 | # Create lookup dictionaries for baseline values
58 | vram_baseline = dict(zip(fp16_baseline['Sequence_Length'], fp16_baseline['VRAM_Usage_MB']))
59 | kv_baseline = dict(zip(fp16_baseline['Sequence_Length'], fp16_baseline['KV_Cache_MB']))
60 | perplexity_baseline = fp16_baseline['Perplexity'].mean()
61 |
62 | # Add percent savings and perplexity change columns
63 | grouped['VRAM_Savings_Pct'] = grouped.apply(
64 | lambda row: 0 if row['Configuration'] == 'FP16' else
65 | (1 - row['VRAM_Usage_MB'] / vram_baseline[row['Sequence_Length']]) * 100,
66 | axis=1
67 | )
68 |
69 | grouped['KV_Savings_Pct'] = grouped.apply(
70 | lambda row: 0 if row['Configuration'] == 'FP16' else
71 | (1 - row['KV_Cache_MB'] / kv_baseline[row['Sequence_Length']]) * 100,
72 | axis=1
73 | )
74 |
75 | grouped['Perplexity_Change_Pct'] = grouped.apply(
76 | lambda row: ((row['Perplexity'] - perplexity_baseline) / perplexity_baseline) * 100,
77 | axis=1
78 | )
79 |
80 | return grouped
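    |
    | # Worked example for the savings columns: if FP16 used 176 MB of KV cache at
    | # a given sequence length and K8V4 used 72 MB, KV_Savings_Pct would be
    | # (1 - 72/176) * 100 ~= 59.1%.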
81 |
82 | def plot_memory_usage(df):
83 | """Create a bar chart showing memory usage by configuration and sequence length"""
84 | plt.figure(figsize=(12, 8))
85 |
86 | # Create a grouped bar chart for KV cache size
87 | ax = sns.barplot(
88 | data=df,
89 | x='Configuration',
90 | y='KV_Cache_MB',
91 | hue='Sequence_Length',
92 | palette='viridis'
93 | )
94 |
95 | # Add title and labels
96 | plt.title('KV Cache Memory Usage by Configuration', fontsize=16)
97 | plt.ylabel('KV Cache Size (MB)', fontsize=14)
98 | plt.xlabel('Configuration', fontsize=14)
99 | plt.xticks(rotation=0, fontsize=12)
100 | plt.yticks(fontsize=12)
101 |
102 | # Adjust legend
103 | plt.legend(title='Sequence Length', fontsize=12, title_fontsize=13)
104 |
105 | # Add annotations showing % savings vs FP16 for a specific sequence length
106 | # Choose the longest sequence length for annotations
107 | longest_seq = df['Sequence_Length'].max()
108 | long_seq_data = df[df['Sequence_Length'] == longest_seq]
109 |
110 | # Get baseline VRAM for FP16
111 | fp16_kv = long_seq_data[long_seq_data['Configuration'] == 'FP16']['KV_Cache_MB'].values[0]
112 |
113 |     # Annotate each non-FP16 configuration with its savings percentage.
114 |     # Positions are computed from the data rather than from ax.patches,
115 |     # whose ordering varies across seaborn versions and is easy to misindex.
116 | 
117 |     configs = long_seq_data['Configuration'].unique()
118 |     for i, config in enumerate(configs):
119 |         if config != 'FP16':
120 |             config_kv = long_seq_data[long_seq_data['Configuration'] == config]['KV_Cache_MB'].values[0]
121 |             savings = (1 - config_kv / fp16_kv) * 100
122 |             ax.text(
123 |                 i,
124 |                 config_kv + 5,
125 |                 f'{savings:.1f}%',
126 |                 ha='center',
127 |                 fontsize=11,
128 |                 fontweight='bold',
129 |                 color='green'
130 |             )
131 |
132 | # Add a note about the annotations
133 | plt.annotate(
134 | f'Percentages show memory savings\ncompared to FP16 for {longest_seq} tokens',
135 | xy=(0.5, 0.97),
136 | xycoords='figure fraction',
137 | ha='center',
138 | fontsize=12,
139 | bbox=dict(boxstyle="round,pad=0.5", facecolor='white', alpha=0.8)
140 | )
141 |
142 | # Save the figure
143 | plt.tight_layout()
144 | plt.savefig(OUTPUT_DIR / 'kv_cache_memory_usage.png', dpi=200, bbox_inches="tight")
145 | plt.close()
146 |
147 | def plot_performance(df):
148 | """Create a bar chart showing throughput by configuration and sequence length"""
149 | plt.figure(figsize=(12, 8))
150 |
151 | # Create a grouped bar chart for throughput
152 | ax = sns.barplot(
153 | data=df,
154 | x='Configuration',
155 | y='Throughput_Tokens_Per_Sec',
156 | hue='Sequence_Length',
157 | palette='viridis'
158 | )
159 |
160 | # Add title and labels
161 | plt.title('Inference Speed by Configuration', fontsize=16)
162 | plt.ylabel('Throughput (Tokens per second)', fontsize=14)
163 | plt.xlabel('Configuration', fontsize=14)
164 | plt.xticks(rotation=0, fontsize=12)
165 | plt.yticks(fontsize=12)
166 |
167 | # Adjust legend
168 | plt.legend(title='Sequence Length', fontsize=12, title_fontsize=13)
169 |
170 | # Calculate average change vs FP16 for configurations
171 | configs = df['Configuration'].unique()
172 | seq_lengths = df['Sequence_Length'].unique()
173 |
174 | # Average improvement in throughput compared to FP16
175 | improvements = {}
176 | for config in configs:
177 | if config == 'FP16':
178 | continue
179 | improvement_sum = 0
180 | for seq_len in seq_lengths:
181 | fp16_throughput = df[(df['Configuration'] == 'FP16') &
182 | (df['Sequence_Length'] == seq_len)]['Throughput_Tokens_Per_Sec'].values[0]
183 | config_throughput = df[(df['Configuration'] == config) &
184 | (df['Sequence_Length'] == seq_len)]['Throughput_Tokens_Per_Sec'].values[0]
185 | improvement_pct = ((config_throughput / fp16_throughput) - 1) * 100
186 | improvement_sum += improvement_pct
187 | improvements[config] = improvement_sum / len(seq_lengths)
188 |
189 | # Annotate with the average improvement
190 | y_max = df['Throughput_Tokens_Per_Sec'].max() * 1.05
191 |     for i, config in enumerate(configs[1:], 1):  # skip FP16, which sorts first in the grouped data
192 | plt.annotate(
193 | f"{improvements[config]:.1f}% vs FP16",
194 | xy=(i, y_max * 0.95),
195 | ha='center',
196 | fontsize=11,
197 | fontweight='bold',
198 | color='green' if improvements[config] > 0 else 'red'
199 | )
200 |
201 | # Add a note about the annotations
202 | plt.annotate(
203 | 'Percentages show average throughput\nimprovement compared to FP16',
204 | xy=(0.5, 0.97),
205 | xycoords='figure fraction',
206 | ha='center',
207 | fontsize=12,
208 | bbox=dict(boxstyle="round,pad=0.5", facecolor='white', alpha=0.8)
209 | )
210 |
211 | # Save the figure
212 | plt.tight_layout()
213 | plt.savefig(OUTPUT_DIR / 'inference_speed.png', dpi=200, bbox_inches="tight")
214 | plt.close()
215 |
216 | def plot_quality_impact(df):
217 | """Create a bar chart showing perplexity change vs FP16"""
218 | # Get average perplexity per configuration (averaging across sequence lengths)
219 | perplexity_by_config = df.groupby('Configuration')['Perplexity'].mean().reset_index()
220 | baseline = perplexity_by_config[perplexity_by_config['Configuration'] == 'FP16']['Perplexity'].values[0]
221 |
222 | # Calculate perplexity change vs baseline
223 | perplexity_by_config['Perplexity_Change'] = ((perplexity_by_config['Perplexity'] - baseline) / baseline) * 100
224 |
225 | plt.figure(figsize=(10, 6))
226 |
227 | # Plot the perplexity change (excluding FP16)
228 | non_fp16 = perplexity_by_config[perplexity_by_config['Configuration'] != 'FP16']
229 | bars = plt.bar(
230 | non_fp16['Configuration'],
231 | non_fp16['Perplexity_Change'],
232 | color=sns.color_palette("Reds_r", len(non_fp16))
233 | )
234 |
235 | # Add title and labels
236 | plt.title('Quality Impact: Perplexity Change vs FP16', fontsize=16)
237 | plt.ylabel('Perplexity Change (%)', fontsize=14)
238 | plt.xlabel('Configuration', fontsize=14)
239 | plt.axhline(y=0, color='blue', linestyle='-', alpha=0.3, label='FP16 Baseline')
240 |
241 | # Add a red line for 5% degradation threshold
242 | plt.axhline(y=5, color='red', linestyle='--', alpha=0.5, label='5% Degradation Threshold')
243 |
244 | # Add annotations
245 | for bar in bars:
246 | height = bar.get_height()
247 | plt.text(
248 | bar.get_x() + bar.get_width()/2,
249 | height + 0.1 if height > 0 else height - 0.3,
250 | f'{height:.2f}%',
251 | ha='center',
252 | fontsize=12,
253 | fontweight='bold',
254 | color='black'
255 | )
256 |
257 | # Add legend
258 | plt.legend(fontsize=12)
259 |
260 | # Add annotation explaining the implications
261 | plt.annotate(
262 | 'Lower is better. Values below 5% generally\nindicate minimal quality degradation.',
263 | xy=(0.5, 0.97),
264 | xycoords='figure fraction',
265 | ha='center',
266 | fontsize=12,
267 | bbox=dict(boxstyle="round,pad=0.5", facecolor='white', alpha=0.8)
268 | )
269 |
270 | # Save the figure
271 | plt.tight_layout()
272 | plt.savefig(OUTPUT_DIR / 'perplexity_change.png', dpi=200, bbox_inches="tight")
273 | plt.close()
274 |
275 | def plot_key_vs_value_sensitivity(df):
276 | """Create a heatmap showing sensitivity to key vs value precision"""
277 | # Create a dataframe for the heatmap
278 | configs = ['FP16', 'K8V8', 'K8V4', 'K4V8', 'K4V4']
279 | k_bits = [16, 8, 8, 4, 4]
280 | v_bits = [16, 8, 4, 8, 4]
281 | perplexity_avg = df.groupby('Configuration')['Perplexity'].mean().reset_index()
282 |
283 | # Create a lookup for perplexity values
284 | perplexity_lookup = dict(zip(perplexity_avg['Configuration'], perplexity_avg['Perplexity']))
285 | perplexity_values = [perplexity_lookup.get(config, np.nan) for config in configs]
286 |
287 | # Calculate perplexity change
288 | fp16_perplexity = perplexity_lookup['FP16']
289 | perplexity_change = [((p - fp16_perplexity) / fp16_perplexity) * 100 for p in perplexity_values]
290 |
291 | # Create dataframe for heatmap
292 | sensitivity_data = pd.DataFrame({
293 | 'Config': configs,
294 | 'K_bits': k_bits,
295 | 'V_bits': v_bits,
296 | 'Perplexity': perplexity_values,
297 | 'Perplexity_Change_Pct': perplexity_change
298 | })
299 |
300 | # Create a figure with two subplots
301 | fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 7))
302 |
303 | # Plot 1: Heatmap of perplexity values by K/V bit precision
304 | pivot1 = sensitivity_data.pivot(index='K_bits', columns='V_bits', values='Perplexity')
305 | sns.heatmap(
306 | pivot1,
307 | annot=True,
308 | fmt='.2f',
309 | cmap='viridis_r', # Lower is better for perplexity
310 | ax=ax1,
311 | cbar_kws={'label': 'Perplexity'}
312 | )
313 | ax1.set_title('Perplexity by Key/Value Precision', fontsize=14)
314 | ax1.set_xlabel('Value Bits', fontsize=12)
315 | ax1.set_ylabel('Key Bits', fontsize=12)
316 |
317 | # Plot 2: Heatmap of perplexity change percentage
318 | pivot2 = sensitivity_data.pivot(index='K_bits', columns='V_bits', values='Perplexity_Change_Pct')
319 | sns.heatmap(
320 | pivot2,
321 | annot=True,
322 | fmt='.2f',
323 | cmap='RdYlGn_r', # Red for worse, green for better
324 | ax=ax2,
325 | cbar_kws={'label': 'Perplexity Change (%)'}
326 | )
327 | ax2.set_title('Perplexity Change vs FP16 (%)', fontsize=14)
328 | ax2.set_xlabel('Value Bits', fontsize=12)
329 | ax2.set_ylabel('Key Bits', fontsize=12)
330 |
331 | # Add overall title
332 | plt.suptitle('Key vs Value Precision Sensitivity', fontsize=16)
333 |
334 | # Add annotation explaining the key findings
335 | fig.text(
336 | 0.5, 0.02,
337 | 'Key precision (rows) has a larger impact on quality than value precision (columns).\nK8V4 offers an excellent balance of quality and memory efficiency.',
338 | ha='center',
339 | fontsize=12,
340 | bbox=dict(boxstyle="round,pad=0.5", facecolor='white', alpha=0.8)
341 | )
342 |
343 | # Save the figure
344 | plt.tight_layout(rect=[0, 0.05, 1, 0.95])
345 | plt.savefig(OUTPUT_DIR / 'key_value_sensitivity.png', dpi=200, bbox_inches="tight")
346 | plt.close()
347 |
348 | def plot_memory_vs_quality(df):
349 | """Create a scatter plot showing the tradeoff between memory usage and quality"""
350 | # Average across sequence lengths
351 | avg_by_config = df.groupby('Configuration').agg({
352 | 'KV_Cache_MB': 'mean',
353 | 'Perplexity': 'mean',
354 | 'KV_Savings_Pct': 'mean',
355 | 'Perplexity_Change_Pct': 'mean'
356 | }).reset_index()
357 |
358 | # Create custom color map for the configurations
359 | colors = {
360 | 'FP16': 'blue',
361 | 'K8V8': 'green',
362 | 'K8V4': 'purple',
363 | 'K4V8': 'orange',
364 | 'K4V4': 'red'
365 | }
366 |
367 | plt.figure(figsize=(10, 8))
368 |
369 | # Create scatter plot
370 | for config in avg_by_config['Configuration']:
371 | data = avg_by_config[avg_by_config['Configuration'] == config]
372 | plt.scatter(
373 | data['KV_Savings_Pct'],
374 | data['Perplexity_Change_Pct'],
375 | s=200, # Size
376 | c=colors[config], # Color
377 | label=config,
378 | alpha=0.7,
379 | edgecolors='black'
380 | )
381 |
382 | # Add config label to each point
383 | plt.annotate(
384 | config,
385 | xy=(data['KV_Savings_Pct'].values[0], data['Perplexity_Change_Pct'].values[0]),
386 | xytext=(5, 5),
387 | textcoords='offset points',
388 | fontsize=12,
389 | fontweight='bold'
390 | )
391 |
392 | # Add quadrant labels
393 | plt.axhline(y=5, color='red', linestyle='--', alpha=0.5, label='5% Quality Threshold')
394 | plt.axvline(x=50, color='green', linestyle='--', alpha=0.5, label='50% Memory Savings Threshold')
395 |
396 | # Add shaded quadrants with labels
397 | plt.fill_between(
398 | [-10, 50], 5, 15, color='red', alpha=0.1
399 | )
400 | plt.fill_between(
401 | [50, 100], 5, 15, color='orange', alpha=0.1
402 | )
403 | plt.fill_between(
404 | [-10, 50], -10, 5, color='yellow', alpha=0.1
405 | )
406 | plt.fill_between(
407 | [50, 100], -10, 5, color='green', alpha=0.1
408 | )
409 |
410 | # Add quadrant labels
411 | plt.text(25, 10, "Low Savings, Worse Quality", ha='center', fontsize=10)
412 | plt.text(75, 10, "High Savings, Worse Quality", ha='center', fontsize=10)
413 | plt.text(25, 2, "Low Savings, Better Quality", ha='center', fontsize=10)
414 | plt.text(75, 2, "High Savings, Better Quality", ha='center', fontsize=10)
415 |
416 | # Add title and labels
417 | plt.title('Memory Savings vs Quality Tradeoff', fontsize=16)
418 | plt.xlabel('KV Cache Memory Savings (%)', fontsize=14)
419 | plt.ylabel('Perplexity Change vs FP16 (%)', fontsize=14)
420 | plt.xlim(-10, 100)
421 | plt.ylim(-2, 15)
422 |
423 | # Add legend
424 | plt.legend(fontsize=12)
425 |
426 | # Add annotation explaining the optimal region
427 | plt.annotate(
428 | 'Optimal configurations maximize memory savings\nwhile minimizing perplexity increase.',
429 | xy=(0.5, 0.97),
430 | xycoords='figure fraction',
431 | ha='center',
432 | fontsize=12,
433 | bbox=dict(boxstyle="round,pad=0.5", facecolor='white', alpha=0.8)
434 | )
435 |
436 | # Save the figure
437 | plt.tight_layout()
438 | plt.savefig(OUTPUT_DIR / 'memory_vs_quality.png', dpi=200, bbox_inches="tight")
439 | plt.close()
440 |
441 | def create_summary_table(df):
442 | """Create a summary table visual showing the key metrics for each configuration"""
443 | # Average across sequence lengths
444 | avg_by_config = df.groupby('Configuration').agg({
445 | 'KV_Cache_MB': 'mean',
446 | 'Throughput_Tokens_Per_Sec': 'mean',
447 | 'Perplexity': 'mean',
448 | 'KV_Savings_Pct': 'mean',
449 | 'Perplexity_Change_Pct': 'mean'
450 | }).reset_index()
451 |
452 | # Add a column for throughput change
453 | fp16_throughput = avg_by_config[avg_by_config['Configuration'] == 'FP16']['Throughput_Tokens_Per_Sec'].values[0]
454 | avg_by_config['Throughput_Change_Pct'] = ((avg_by_config['Throughput_Tokens_Per_Sec'] / fp16_throughput) - 1) * 100
455 |
456 | # Sort configurations in a specific order
457 | order = ['FP16', 'K8V8', 'K8V4', 'K4V8', 'K4V4']
458 | avg_by_config['Order'] = avg_by_config['Configuration'].map({k: i for i, k in enumerate(order)})
459 | avg_by_config = avg_by_config.sort_values('Order').drop('Order', axis=1)
460 |
461 | # Create a figure for the table
462 | fig, ax = plt.subplots(figsize=(12, 6))
463 | ax.axis('tight')
464 | ax.axis('off')
465 |
466 | # Define table data
467 | table_data = [
468 | avg_by_config['Configuration'].tolist(),
469 | [f"{x:.2f} MB" for x in avg_by_config['KV_Cache_MB']],
470 | [f"{x:.1f}%" for x in avg_by_config['KV_Savings_Pct']],
471 | [f"{x:.0f}" for x in avg_by_config['Throughput_Tokens_Per_Sec']],
472 | [f"{x:+.1f}%" for x in avg_by_config['Throughput_Change_Pct']],
473 | [f"{x:.2f}" for x in avg_by_config['Perplexity']],
474 | [f"{x:+.2f}%" for x in avg_by_config['Perplexity_Change_Pct']]
475 | ]
476 |
477 | # Define row labels
478 | row_labels = [
479 | 'Configuration',
480 | 'Avg KV Cache (MB)',
481 | 'Memory Savings',
482 | 'Throughput (t/s)',
483 | 'Throughput vs FP16',
484 | 'Perplexity',
485 | 'Quality Impact'
486 | ]
487 |
488 | # Create table
489 | table = ax.table(
490 | cellText=table_data,
491 | rowLabels=row_labels,
492 | loc='center',
493 | cellLoc='center',
494 | colWidths=[0.15] * len(avg_by_config)
495 | )
496 |
497 | # Customize table appearance
498 | table.auto_set_font_size(False)
499 | table.set_fontsize(12)
500 | table.scale(1, 1.5)
501 |
502 | # Color cells based on values
503 | for i in range(len(row_labels)):
504 | for j in range(len(avg_by_config)):
505 | cell = table[(i, j)]
506 |
507 | # Memory savings - green for higher savings
508 | if i == 2 and j > 0: # KV_Savings_Pct row, excluding FP16
509 | savings = avg_by_config.iloc[j]['KV_Savings_Pct']
510 | intensity = min(savings / 80, 1) # Normalize to [0, 1]
511 | cell.set_facecolor((1 - intensity, 1, 1 - intensity))
512 |
513 | # Throughput - green for better, red for worse
514 | elif i == 4 and j > 0: # Throughput_Change_Pct row, excluding FP16
515 | change = avg_by_config.iloc[j]['Throughput_Change_Pct']
516 | if change > 0:
517 | intensity = min(change / 20, 1) # Normalize to [0, 1]
518 | cell.set_facecolor((1 - intensity, 1, 1 - intensity))
519 | else:
520 | intensity = min(abs(change) / 20, 1) # Normalize to [0, 1]
521 | cell.set_facecolor((1, 1 - intensity, 1 - intensity))
522 |
523 | # Perplexity - red for worse (higher values)
524 | elif i == 6 and j > 0: # Perplexity_Change_Pct row, excluding FP16
525 | change = avg_by_config.iloc[j]['Perplexity_Change_Pct']
526 | if change > 0:
527 | intensity = min(change / 10, 1) # Normalize to [0, 1]
528 | cell.set_facecolor((1, 1 - intensity, 1 - intensity))
529 | else:
530 | intensity = min(abs(change) / 10, 1) # Normalize to [0, 1]
531 | cell.set_facecolor((1 - intensity, 1, 1 - intensity))
532 |
533 | # Add title
534 | plt.suptitle('KVSplit Configuration Summary', fontsize=16)
535 |
536 | # Add annotation
537 | plt.figtext(
538 | 0.5, 0.01,
539 | 'Values are averaged across all sequence lengths. Green indicates better performance, red indicates worse.',
540 | ha='center',
541 | fontsize=12
542 | )
543 |
544 | # Save the figure
545 | plt.tight_layout(rect=[0, 0.05, 1, 0.95])
546 | plt.savefig(OUTPUT_DIR / 'configuration_summary.png', dpi=200, bbox_inches="tight")
547 | plt.close()
548 |
549 | def main():
550 | """Main function to generate all visualizations"""
551 | # Load data
552 | try:
553 | df = load_latest_results()
554 | df_processed = prepare_data(df)
555 |
556 | # Generate plots
557 | plot_memory_usage(df_processed)
558 | plot_performance(df_processed)
559 | plot_quality_impact(df_processed)
560 | plot_key_vs_value_sensitivity(df_processed)
561 | plot_memory_vs_quality(df_processed)
562 | create_summary_table(df_processed)
563 |
564 | print(f"Visualizations saved to {OUTPUT_DIR}")
565 | except Exception as e:
566 | print(f"Error generating visualizations: {e}")
567 |
568 | if __name__ == "__main__":
569 | main()
570 |
--------------------------------------------------------------------------------