├── .github
│   └── workflows
│       └── ci.yml
├── .gitignore
├── LICENSE
├── README.md
├── models
│   └── .gitkeep
├── patch
│   ├── .gitkeep
│   ├── fixed_kv_patch.diff
│   ├── kvsplit.patch
│   ├── kvsplit_fixed.patch
│   └── split_kv_quant.diff
├── perplexity_test_data.txt
├── plots
│   ├── .gitkeep
│   ├── configuration_summary.png
│   ├── inference_speed.png
│   ├── key_value_sensitivity.png
│   ├── kv_cache_memory_usage.png
│   ├── memory_vs_quality.png
│   └── perplexity_change.png
├── results
│   └── .gitkeep
└── scripts
    ├── benchmark_kvsplit.py
    ├── capture_memory.sh
    ├── install_kvsplit.sh
    ├── quick_compare.py
    ├── run_benchmark.sh
    ├── test_kvsplit.sh
    ├── validate_kvsplit.sh
    └── visualize_results.py

/.github/workflows/ci.yml:
--------------------------------------------------------------------------------
 1 | name: Build & Validate
 2 | 
 3 | on:
 4 |   push:
 5 |     branches: [ main ]
 6 |   pull_request:
 7 |     branches: [ main ]
 8 | 
 9 | jobs:
10 |   build:
11 |     runs-on: macos-14
12 | 
13 |     steps:
14 |     - uses: actions/checkout@v3
15 | 
16 |     - name: Install dependencies
17 |       run: brew install cmake
18 | 
19 |     - name: Cache llama.cpp
20 |       uses: actions/cache@v3
21 |       with:
22 |         path: llama.cpp
23 |         key: llama-cpp-${{ hashFiles('patch/fixed_kv_patch.diff') }}
24 | 
25 |     - name: Clone llama.cpp
26 |       run: |
27 |         git clone https://github.com/ggerganov/llama.cpp || echo "llama.cpp already exists"
28 | 
29 |     - name: Apply patch
30 |       run: |
31 |         cd llama.cpp
32 |         git apply ../patch/fixed_kv_patch.diff || echo "Could not apply patch, checking file exists"
33 |         cat ../patch/fixed_kv_patch.diff
34 |         ls -la common/common.cpp common/common.h
35 | 
36 |     - name: Build llama.cpp
37 |       run: |
38 |         cd llama.cpp
39 |         mkdir -p build
40 |         cd build
41 |         # macos-14 runners are Apple Silicon (arm64); the x86-only AVX flags do not apply here
42 |         cmake .. -DLLAMA_METAL=OFF
43 |         cmake --build . --config Release -j
44 | 
45 |     - name: Verify build
46 |       run: |
47 |         cd llama.cpp/build
48 |         ls -la bin
49 |         if [ -f "bin/llama-cli" ]; then
50 |           echo "✅ llama-cli built successfully"
51 |           ./bin/llama-cli -h
52 |         else
53 |           echo "❌ Failed to build llama-cli"
54 |           exit 1
55 |         fi
56 | 
57 |     - name: Python syntax check
58 |       run: |
59 |         python -m py_compile scripts/benchmark_kvsplit.py
60 |         python -m py_compile scripts/quick_compare.py
61 |         python -m py_compile scripts/visualize_results.py
62 |         echo "✅ Python scripts pass syntax check"
63 | 
64 |     - name: Shell script check
65 |       run: |
66 |         bash -n scripts/install_kvsplit.sh
67 |         bash -n scripts/capture_memory.sh
68 |         echo "✅ Shell scripts pass syntax check"

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
 1 | # Ignore llama.cpp directory (users will clone it during installation)
 2 | /llama.cpp/
 3 | 
 4 | # Ignore models directory (large files)
 5 | /models/*
 6 | !models/.gitkeep
 7 | 
 8 | # Python virtual environment
 9 | /venv/
10 | /__pycache__/
11 | *.py[cod]
12 | *$py.class
13 | 
14 | # Build artifacts
15 | *.o
16 | *.so
17 | *.dylib
18 | *.a
19 | *.exe
20 | *.out
21 | 
22 | # Logs and results (except examples)
23 | /results/*
24 | !results/.gitkeep
25 | 
26 | # OS specific files
27 | .DS_Store
28 | Thumbs.db
29 | 
30 | # Editor files
31 | .vscode/
32 | .idea/
33 | *.swp
34 | *.swo
35 | 
36 | # Temporary files
37 | /capture_frames/
38 | *.tmp
39 | *~
40 | 
41 | # Keep directory structure with .gitkeep files
42 | !*/

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT
License 2 | 3 | Copyright (c) 2025 dipampaul17 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | 23 | --- 24 | 25 | Implementation based on concepts from: 26 | - "Unifying KV Cache Compression for Large Language Models with LeanKV" (Zhang et al., 2025) 27 | - "More for Keys, Less for Values: Adaptive KV Cache Quantization" (2024) 28 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |
2 | 
 3 | # 🚀 KVSplit
 4 | 
 5 | **Differentiated KV Cache Quantization for Apple Silicon**
 6 | 
 7 | [![GitHub Stars](https://img.shields.io/github/stars/dipampaul17/KVSplit?style=for-the-badge&logo=github)](https://github.com/dipampaul17/KVSplit/stargazers)
 8 | [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg?style=for-the-badge)](LICENSE)
 9 | [![Platform](https://img.shields.io/badge/Platform-Apple%20Silicon-black?style=for-the-badge&logo=apple)]()
10 | 
11 | ![KV Cache Memory Usage](plots/kv_cache_memory_usage.png)
12 | 
13 | 
14 | 
15 | ## 📌 Overview
16 | 
17 | Run **larger context windows** and **heavier LLMs** on your Mac by applying different quantization precision to keys and values in the attention mechanism's KV cache. KVSplit enables you to:
18 | 
19 | - **Reduce memory usage by up to 72%** with minimal quality loss
20 | - **Run 2-3x longer contexts** in the same memory budget
21 | - **Maintain or improve inference speed** compared to FP16
22 | - **Optimize for Apple Silicon** with full Metal support
23 | 
24 | ## Key Findings
25 | 
26 | | Configuration | VRAM @ 8K tokens (% of FP16) | Tokens/sec | Perplexity Change |
27 | |---------------|------------------------------|------------|-------------------|
28 | | FP16 (base) | 176.00 MB (100%) | 54,360 | -- |
29 | | K8V8 (8-bit) | 93.50 MB (53%) | 51,503 | +0.03% |
30 | | **K8V4** | **71.50 MB (41%)** | **57,438** | **+0.86%** |
31 | | K4V8 | 71.50 MB (41%) | 58,690 | +6.06% |
32 | | K4V4 (4-bit) | 49.50 MB (28%) | 55,193 | +6.15% |
33 | 
34 | ### Memory Savings by Sequence Length
35 | 
36 | | Configuration | 128 tokens | 2048 tokens | 4096 tokens | 8192 tokens |
37 | |---------------|------------|-------------|-------------|-------------|
38 | | FP16 (baseline) | 5.50 MB | 44.00 MB | 88.00 MB | 176.00 MB |
39 | | K8V8 (8-bit) | 2.92 MB | 23.38 MB | 46.75 MB | 93.50 MB |
40 | | K8V4 (mixed) | 2.23 MB | 17.88 MB | 35.75 MB | 71.50 MB |
41 | | K4V8 (mixed) | 2.23 MB | 17.88 MB | 35.75 MB | 71.50 MB |
42 | | K4V4 (4-bit) | 1.55 MB | 12.38 MB | 24.75 MB | 49.50 MB |
43 | 
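These figures follow directly from the KV cache size formula: each token stores one key and one value vector per layer and KV head. The minimal sketch below reproduces the table; the model dimensions (22 layers, 4 KV heads, head dimension 64) are assumptions matching the TinyLlama-1.1B test model, and the per-element costs come from llama.cpp's block formats (FP16: 2 bytes per element; Q8_0: 34 bytes per 32 elements; Q4_0: 18 bytes per 32 elements).

```python
# Hedged sketch: expected KV cache size for a given sequence length and K/V bit
# widths. Dimensions assume a TinyLlama-1.1B-style model; adjust for yours.
BYTES_PER_ELEMENT = {16: 2.0, 8: 34 / 32, 4: 18 / 32}  # FP16, Q8_0, Q4_0

def kv_cache_mb(seq_len, key_bits=16, val_bits=16,
                n_layers=22, n_kv_heads=4, head_dim=64):
    elements = n_layers * n_kv_heads * head_dim * seq_len  # per K (or V) side
    total = elements * (BYTES_PER_ELEMENT[key_bits] + BYTES_PER_ELEMENT[val_bits])
    return total / 1024 ** 2

print(f"FP16 @ 8192: {kv_cache_mb(8192):.2f} MB")        # 176.00 MB
print(f"K8V4 @ 8192: {kv_cache_mb(8192, 8, 4):.2f} MB")  # 71.50 MB
```

Actual allocations can differ slightly because llama.cpp aligns KV buffers (see the note on 256B alignment later in this README).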
44 | ## Features
45 | 
46 | - Independent quantization of keys and values in the KV cache
47 | - Optimized for Apple Silicon with Metal support
48 | - Comprehensive benchmarking suite with perplexity measurement
49 | - Memory usage and performance analysis tools
50 | - Publication-quality visualization tools
51 | - Easy setup and usage
52 | 
53 | ## Prerequisites
54 | 
55 | - macOS (tested on Apple Silicon)
56 | - Homebrew package manager
57 | - Xcode Command Line Tools
58 | 
59 | ## ⚡ Flexible Installation
60 | 
61 | ```bash
62 | # Clone the repository
63 | git clone https://github.com/dipampaul17/KVSplit.git
64 | cd KVSplit
65 | 
66 | # Run the installer script
67 | chmod +x scripts/install_kvsplit.sh
68 | ./scripts/install_kvsplit.sh
69 | ```
70 | 
71 | The installer provides flexible options:
72 | 
73 | ### 🐍 Python Setup Options
74 | - **Virtual Environment** (default): Creates a standalone Python environment in the project folder
75 | - **System Python**: Uses your existing Python installation instead of creating a virtual environment
76 | - **Skip Python Setup**: For users who prefer to manage their Python environment manually
77 | 
78 | ### 🔄 llama.cpp Integration Options
79 | - **Standard Method** (default): Clones llama.cpp and applies the KV split patch
80 | - **Git Submodule Method**: Adds llama.cpp as a git submodule (ideal for advanced users or development)
81 | 
82 | The installer will:
83 | - Set up the project structure with your preferred configuration
84 | - Configure llama.cpp with Metal support optimized for Apple Silicon
85 | - Enable differentiated KV cache quantization
86 | - Offer to download a small test model (optional)
87 | - Set up visualization tools based on your Python preferences
88 | 
89 | ## 🏎️ Quick Comparison
90 | 
91 | Want to see the benefits immediately? Run a quick comparison with your model:
92 | 
93 | ```bash
94 | # Run quick comparison with different configurations
95 | python scripts/quick_compare.py --model models/your-model.gguf
96 | ```
97 | 
98 | This will show you a side-by-side comparison of FP16, K8V8, K8V4, K4V8, and K4V4 with memory usage, speed, and quality metrics.
99 | 
100 | ## 📊 Impressive Results
101 | 
102 | 
103 | ![Memory vs Quality](plots/memory_vs_quality.png)
104 | 
105 | 
106 | ### 📉 Memory Reduction
107 | 
108 | | Configuration | VRAM @ 8K tokens | Memory Savings | Quality Impact |
109 | |---------------|-----------------|----------------|----------------|
110 | | FP16 (base) | 176.00 MB | — | — |
111 | | K8V8 (8-bit) | 93.50 MB | 47% | +0.03% |
112 | | **K8V4** | **71.50 MB** | **59%** | **+0.86%** |
113 | | K4V8 | 71.50 MB | 59% | +6.06% |
114 | | K4V4 (4-bit) | 49.50 MB | 72% | +6.15% |
115 | 
116 | ### 📈 Performance Impact
117 | 
118 | Using KVSplit doesn't just save memory; in these benchmarks it often **improves inference speed**, by up to 8% over FP16.
119 | 
120 | | Configuration | Tokens/sec (8K ctx) | Speedup vs FP16 |
121 | |---------------|---------------------|----------------|
122 | | FP16 | 54,360 | — |
123 | | K8V8 | 51,503 | -5.3% |
124 | | **K8V4** | **57,438** | **+5.7%** |
125 | | K4V8 | 58,690 | +8.0% |
126 | | K4V4 | 55,193 | +1.5% |
127 | 
128 | ## 🧠 Project Structure
129 | 
130 | ```
131 | kvsplit/
132 | ├── llama.cpp/               # Optimized llama.cpp build
133 | ├── models/                  # LLM model files
134 | ├── scripts/                 # Utility scripts
135 | │   ├── benchmark_kvsplit.py # Comprehensive benchmark tool
136 | │   ├── install_kvsplit.sh   # One-command installer
137 | │   ├── quick_compare.py     # Quick comparison utility
138 | │   ├── capture_memory.sh    # GIF creation for memory visualization
139 | │   └── visualize_results.py # Generate publication-quality plots
140 | ├── results/                 # Benchmark results (CSV/JSON)
141 | ├── plots/                   # Generated visualizations
142 | └── README.md                # This file
143 | ```
144 | 
145 | ## 🔬 Scientific Insight
146 | 
147 | 
148 | ![Configuration Summary](plots/configuration_summary.png)
149 | 
150 | 
151 | KV cache memory is dominated by storing key and value vectors for each token. Our research has revealed a critical insight: **keys are significantly more sensitive to quantization than values**.
152 | 
153 | ### 🔑 Key Findings
154 | 
155 | - **Asymmetric Impact**: Keys require higher precision than values for maintaining quality
156 | - **Sweet Spot**: K8V4 (8-bit keys, 4-bit values) provides the best balance
157 |   - Only 0.86% perplexity degradation vs. FP16
158 |   - 59% memory reduction
159 |   - Faster inference than FP16
160 | - **Confirmation**: the K4V8 configuration shows roughly 7x more quality degradation than K8V4, despite using the same total bits
161 | 
162 | This asymmetry allows for more efficient memory usage without compromising model quality, enabling longer context windows and larger models on consumer hardware.
163 | 
164 | ## 💻 Usage Examples
165 | 
166 | ### Running with Different KV Cache Precisions
167 | 
168 | ```bash
169 | # Baseline (FP16)
170 | ./llama.cpp/build/bin/llama-cli -m models/your-model.gguf -p "Your prompt" \
171 |     -t 8 --flash-attn
172 | 
173 | # ⭐ RECOMMENDED: 8-bit keys, 4-bit values (K8V4)
174 | # Best balance of quality and memory savings
175 | ./llama.cpp/build/bin/llama-cli -m models/your-model.gguf -p "Your prompt" \
176 |     -t 8 --flash-attn --kvq-key 8 --kvq-val 4
177 | 
178 | # 4-bit keys, 8-bit values (K4V8)
179 | # Shows why key precision matters more than value precision
180 | ./llama.cpp/build/bin/llama-cli -m models/your-model.gguf -p "Your prompt" \
181 |     -t 8 --flash-attn --kvq-key 4 --kvq-val 8
182 | 
183 | # 4-bit keys and values (K4V4)
184 | # Maximum memory savings (72% reduction) with acceptable quality
185 | ./llama.cpp/build/bin/llama-cli -m models/your-model.gguf -p "Your prompt" \
186 |     -t 8 --flash-attn --kvq 4
187 | ```
188 | 
189 | ### Long Context Example (32K)
190 | 
191 | ```bash
192 | # Run with a 32K context (the KV cache would need ~1.4GB in FP16;
193 | # K8V4 cuts that by ~59%, to roughly 575MB)
194 | ./llama.cpp/build/bin/llama-cli -m models/your-model.gguf \
195 |     -c 32768 -n 4096 -t 8 --flash-attn --kvq-key 8 --kvq-val 4 \
196 |     -f your-long-document.txt
197 | ```
198 | 
199 | ### 🚩 Command-Line Arguments
200 | 
201 | | Flag | Description | Recommendation |
202 | |------|-------------|---------------|
203 | | `-t 8` | Number of threads | 8 is optimal for most Apple Silicon chips |
204 | | `--flash-attn` | Enables optimized attention | Recommended for Apple Silicon |
205 | | `--kvq N` | Sets both key and value bits to N | Use `--kvq 4` for the K4V4 configuration |
206 | | `--kvq-key N` | Sets key bits only | Key precision has major quality impact; prefer 8 |
207 | | `--kvq-val N` | Sets value bits only | Value precision has minor quality impact; 4 is usually safe |
208 | | `-c N` | Context size in tokens | Longer contexts benefit more from KVSplit |
209 | | `-n N` | Number of tokens to generate | Adjust based on your needs |
210 | | `-f FILE` | Input file | For processing documents |
211 | | `-m MODEL` | Model path | Path to your .gguf model file |
212 | 
213 | ## 📏 Advanced Benchmarking
214 | 
215 | For comprehensive performance analysis, use our full benchmark suite:
216 | 
217 | ```bash
218 | # Run the full benchmark suite (all configurations and sequence lengths)
219 | python scripts/benchmark_kvsplit.py
220 | 
221 | # Run a specific configuration test
222 | python scripts/benchmark_kvsplit.py --config K8V4 --seq-len 4096
223 | 
224 | # Generate publication-quality visualizations
225 | python scripts/visualize_results.py
226 | ```
227 | 
228 | The benchmarking script provides thorough measurements of:
229 | 
230 | - 📊 **Memory Usage**:
VRAM and KV cache specifically
231 | - ⚡ **Performance**: Tokens per second across different sequence lengths
232 | - 🎯 **Quality**: Perplexity measurement using llama-perplexity
233 | - 📈 **Scaling**: How memory usage and performance scale with sequence length
234 | 
235 | Results are saved in CSV/JSON formats with automatic summary statistics, and the visualization script generates publication-quality plots showing key insights.
236 | 
237 | ## 🎬 Visual Memory Savings
238 | 
239 | You can visualize memory savings with our capture tool:
240 | 
241 | ```bash
242 | # Capture memory reduction in Activity Monitor
243 | ./scripts/capture_memory.sh
244 | ```
245 | 
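If you prefer raw numbers to a screen recording, a minimal sketch like the following samples the resident set size of a running `llama-cli` process once per second (macOS `ps`; the process name is an assumption, so adjust it if you renamed the binary):

```bash
# Hedged sketch: print the RSS of a running llama-cli in MB, once per second.
while pgrep -x llama-cli > /dev/null; do
    ps -o rss= -p "$(pgrep -x llama-cli | head -n 1)" | awk '{printf "%.1f MB\n", $1 / 1024}'
    sleep 1
done
```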
| ![Memory Usage](plots/kv_cache_memory_usage.png) | ![Key-Value Sensitivity](plots/key_value_sensitivity.png) |
|:--:|:--:|
| ![Quality Impact](plots/perplexity_change.png) | ![Speed Impact](plots/inference_speed.png) |
260 | 
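As a back-of-envelope illustration of the precision gap discussed under Scientific Insight, the toy sketch below applies simplified Q8_0- and Q4_0-style block quantization (one scale per 32-element block, as in llama.cpp, but without the packed storage) to random data and compares reconstruction error. This is not the llama.cpp code path, and by itself it says nothing about keys versus values; it only shows how much coarser 4-bit steps are, which is exactly the error that key vectors tolerate poorly:

```python
# Toy sketch of block quantization error (simplified Q8_0/Q4_0 analogues).
import numpy as np

def block_quant_rmse(x, qmax):
    """Quantize 32-element blocks to integers in [-qmax-1, qmax], return RMSE."""
    blocks = x.reshape(-1, 32)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                      # avoid division by zero
    q = np.clip(np.round(blocks / scale), -qmax - 1, qmax)
    return float(np.sqrt(((q * scale - blocks) ** 2).mean()))

rng = np.random.default_rng(0)
x = rng.standard_normal(32 * 4096).astype(np.float32)
print(f"8-bit (Q8_0-like) RMSE: {block_quant_rmse(x, 127):.4f}")  # ~0.005
print(f"4-bit (Q4_0-like) RMSE: {block_quant_rmse(x, 7):.4f}")    # ~0.09
```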
261 | 
262 | ## 🍎 Apple Silicon Optimization
263 | 
264 | - **Metal Performance**: Fully optimized for Apple's Metal framework
265 | - **Memory Efficiency**: Critical for memory-constrained Apple Silicon (M-series) devices
266 | - **Activity Monitor**: Use our `capture_memory.sh` script to visualize real-time memory reductions
267 | - **Alignment**: 256B page alignment in llama.cpp means actual memory savings might differ slightly from theoretical calculations
268 | 
269 | ## ⭐ Key Features
270 | 
271 | - **Differentiated Precision**: Independent key and value bit precision (K8V4, K4V8, etc.)
272 | - **Apple Silicon Optimization**: Full Metal support for M1/M2/M3/M4 chips
273 | - **Comprehensive Benchmarking**: Memory, speed, and quality metrics
274 | - **Publication-Quality Visualization**: Beautiful plots for analysis
275 | - **Simple User Interface**: One-command install and quick comparison tools
276 | - **Memory Visualization**: Tools to capture and visualize memory savings
277 | 
278 | ## 🙏 Acknowledgments
279 | 
280 | This project implements ideas from recent research, including:
281 | - "More for Keys, Less for Values: Adaptive KV Cache Quantization" (2024)
282 | - "Unifying KV Cache Compression for Large Language Models with LeanKV" (2025)
283 | 
284 | Additional credits:
285 | - [llama.cpp](https://github.com/ggerganov/llama.cpp) - Base implementation
286 | - [TinyLlama](https://huggingface.co/TinyLlama) - Test model
287 | 
288 | ## 🧠 Configuration Recommendations
289 | 
290 | - **Best Overall**: 🌟 **K8V4** 🌟 (8-bit keys, 4-bit values)
291 |   - 59% memory reduction with only 0.86% quality loss
292 |   - Improved inference speed (+5.7% vs FP16)
293 |   - Great balance of quality and efficiency
294 | 
295 | - **Absolute Maximum Memory Savings**: K4V4 (4-bit keys and values)
296 |   - 72% memory reduction with ~6% quality loss
297 |   - Good for memory-constrained devices
298 |   - Acceptable for less sensitive applications
299 | 
300 | - **Best for Very Long Contexts**: K8V4 or K4V4
301 |   - Memory savings compound with context length
302 |   - Run 2-3x longer contexts in the same memory budget
303 | 
304 | ## 🔮 Future Roadmap
305 | 
306 | - [ ] **Adaptive Precision**: Dynamic precision based on token importance
307 | - [ ] **Layer-Specific Quantization**: Different precision for different model layers
308 | - [ ] **Model-Specific Optimizations**: Tailored for Mistral, Phi-3, etc.
309 | - [ ] **Web Demo**: Interactive testing environment
310 | - [ ] **Mobile Support**: Adapting for iOS and iPadOS
311 | 
312 | ## 📜 License
313 | 
314 | MIT
315 | 
316 | ## 🤝 Contributing
317 | 
318 | Contributions are welcome! Please open an issue or submit a pull request.
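## 📎 Appendix: Working with Benchmark Output

Each benchmark run is written as one CSV row whose columns match `BenchmarkResult.to_dict()` in `scripts/benchmark_kvsplit.py`. A minimal sketch for summarizing such a file (the path is an example; point it at whatever the script wrote into `results/`):

```python
# Summarize KVSplit benchmark CSV rows by configuration. Column names come from
# BenchmarkResult.to_dict() in scripts/benchmark_kvsplit.py; the path is an example.
import pandas as pd

df = pd.read_csv("results/benchmark.csv")
summary = df.groupby("Configuration").agg(
    vram_mb=("VRAM_Usage_MB", "mean"),
    tokens_per_sec=("Throughput_Tokens_Per_Sec", "mean"),
    perplexity=("Perplexity", "mean"),
)
print(summary.sort_values("vram_mb"))
```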
323 | -------------------------------------------------------------------------------- /models/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dipampaul17/KVSplit/887f0b0408a644936c4f5ef8901331988e5ad606/models/.gitkeep -------------------------------------------------------------------------------- /patch/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dipampaul17/KVSplit/887f0b0408a644936c4f5ef8901331988e5ad606/patch/.gitkeep -------------------------------------------------------------------------------- /patch/fixed_kv_patch.diff: -------------------------------------------------------------------------------- 1 | diff --git a/common/common.h b/common/common.h 2 | index abcdef1..1234567 100644 3 | --- a/common/common.h 4 | +++ b/common/common.h 5 | @@ -100,6 +100,10 @@ struct gpt_params { 6 | // use custom KV quantization for keys 7 | enum llama_kv_cache_type cache_type_k = LLAMA_KV_CACHE_TYPE_F16; 8 | 9 | + // use custom KV quantization for values 10 | + enum llama_kv_cache_type cache_type_v = LLAMA_KV_CACHE_TYPE_F16; 11 | + 12 | + 13 | }; 14 | 15 | // for CLI arguments that are common across all models 16 | diff --git a/common/common.cpp b/common/common.cpp 17 | index abcdef1..1234567 100644 18 | --- a/common/common.cpp 19 | +++ b/common/common.cpp 20 | @@ -1290,6 +1290,30 @@ struct cli_params { 21 | "KV cache quantization for keys. If not specified, defaults to F16", 22 | {"--cache-type-k", "-ctk"} 23 | ); 24 | + 25 | + add_param( 26 | + ¶ms.cache_type_v, 27 | + [](enum llama_kv_cache_type & val, const std::string & arg) { 28 | + val = llama_model_kv_cache_type_from_str(arg.c_str()); 29 | + if (val == LLAMA_KV_CACHE_TYPE_COUNT) { 30 | + return CLI_PARAM_CONVERSION_ERROR; 31 | + } 32 | + return CLI_PARAM_CONVERSION_OK; 33 | + }, 34 | + "KV cache quantization for values. 
If not specified, defaults to F16", 35 | + {"--cache-type-v", "-ctv"} 36 | + ); 37 | + 38 | + // Combined KV cache quantization (sets both key and value) 39 | + add_param( 40 | + [&](const std::string & arg) { 41 | + enum llama_kv_cache_type val = llama_model_kv_cache_type_from_str(arg.c_str()); 42 | + if (val == LLAMA_KV_CACHE_TYPE_COUNT) { 43 | + return CLI_PARAM_CONVERSION_ERROR; 44 | + } 45 | + params.cache_type_k = params.cache_type_v = val; 46 | + return CLI_PARAM_CONVERSION_OK; 47 | + }, 48 | + "--kvq", "-kvq" 49 | + ); 50 | } 51 | 52 | -------------------------------------------------------------------------------- /patch/kvsplit.patch: -------------------------------------------------------------------------------- 1 | --- a/common/arg.cpp 2 | +++ b/common/arg.cpp 3 | @@ -802,17 +802,31 @@ 4 | 5 | // KV Cache quantization types 6 | const std::vector kv_cache_types = { 7 | - GGML_TYPE_F16, 8 | - GGML_TYPE_F32, 9 | - GGML_TYPE_Q8_0, 10 | - GGML_TYPE_Q4_0, 11 | + GGML_TYPE_F16, // Default (FP16) 12 | + GGML_TYPE_F32, // Full precision (FP32) 13 | + GGML_TYPE_Q8_0, // 8-bit quantization 14 | + GGML_TYPE_Q4_0, // 4-bit quantization 15 | +}; 16 | + 17 | +// Mapping of bit sizes to quantization types 18 | +const std::unordered_map kv_quant_bit_to_type = { 19 | + {16, GGML_TYPE_F16}, // 16-bit = FP16 20 | + {32, GGML_TYPE_F32}, // 32-bit = FP32 21 | + {8, GGML_TYPE_Q8_0}, // 8-bit = Q8_0 22 | + {4, GGML_TYPE_Q4_0}, // 4-bit = Q4_0 23 | }; 24 | 25 | static ggml_type kv_cache_type_from_str(const std::string & s) { 26 | for (const auto & type : kv_cache_types) { 27 | if (s == ggml_type_name(type)) return type; 28 | } 29 | - return GGML_TYPE_COUNT; //invalid 30 | + 31 | + // Also try parsing bit sizes (4 or 8) 32 | + try { 33 | + int bits = std::stoi(s); 34 | + if (kv_quant_bit_to_type.find(bits) != kv_quant_bit_to_type.end()) { 35 | + return kv_quant_bit_to_type.at(bits); 36 | + } 37 | + } catch (...) 
{} 38 | + 39 | + return GGML_TYPE_COUNT; // invalid 40 | } 41 | 42 | static std::string get_all_kv_cache_types() { 43 | @@ -823,6 +837,30 @@ static std::string get_all_kv_cache_types() { 44 | return msg.str(); 45 | } 46 | 47 | +static std::string get_kv_quant_bit_options() { 48 | + // Return the supported bit sizes only (for --kvq-key and --kvq-val) 49 | + std::stringstream msg; 50 | + bool first = true; 51 | + for (const auto& pair : kv_quant_bit_to_type) { 52 | + if (!first) { 53 | + msg << ", "; 54 | + } 55 | + msg << pair.first; 56 | + first = false; 57 | + } 58 | + return msg.str(); 59 | +} 60 | + 61 | +// Helper to convert bit size to quantization type 62 | +static ggml_type kv_quant_bits_to_type(int bits) { 63 | + auto it = kv_quant_bit_to_type.find(bits); 64 | + if (it != kv_quant_bit_to_type.end()) { 65 | + return it->second; 66 | + } 67 | + // Default to FP16 if invalid 68 | + return GGML_TYPE_F16; 69 | +} 70 | + 71 | static ggml_backend_buffer_type_config ggml_backend_buffer_type_config_from_str(const std::string & s) { 72 | auto config = ggml_backend_buffer_type_config_init(); 73 | 74 | @@ -2086,6 +2124,40 @@ static void common_params_parser_init(arg_parser & parser, common_params & param 75 | ).set_env("LLAMA_ARG_CACHE_TYPE_K")); 76 | 77 | parser.add_arg( 78 | + add_arg_type::opt, "kvq-key", &kv_quant_key, 79 | + add_arg_handler([&](std::string_view value) { 80 | + try { 81 | + int bits = std::stoi(std::string(value)); 82 | + // Set key cache quantization type 83 | + if (kv_quant_bit_to_type.find(bits) != kv_quant_bit_to_type.end()) { 84 | + params.cache_type_k = kv_quant_bit_to_type.at(bits); 85 | + } else { 86 | + LOG_ERROR("Invalid KV cache key quantization bits: %d (valid options: %s)\n", 87 | + bits, get_kv_quant_bit_options().c_str()); 88 | + return false; 89 | + } 90 | + } catch (...) { 91 | + LOG_ERROR("Invalid KV cache key quantization bits: '%s' (valid options: %s)\n", 92 | + std::string(value).c_str(), get_kv_quant_bit_options().c_str()); 93 | + return false; 94 | + } 95 | + return true; 96 | + }, [&]() -> std::string { return ""; }), 97 | + "", 98 | + "Set KV cache key quantization bits (options: " + get_kv_quant_bit_options() + ")" 99 | + ).set_env("LLAMA_ARG_KVQ_KEY"); 100 | + 101 | + parser.add_arg( 102 | + add_arg_type::opt, "kvq-val", &kv_quant_val, 103 | + add_arg_handler([&](std::string_view value) { 104 | + try { 105 | + int bits = std::stoi(std::string(value)); 106 | + // Set value cache quantization type 107 | + if (kv_quant_bit_to_type.find(bits) != kv_quant_bit_to_type.end()) { 108 | + params.cache_type_v = kv_quant_bit_to_type.at(bits); 109 | + } else { 110 | + LOG_ERROR("Invalid KV cache value quantization bits: %d (valid options: %s)\n", 111 | + bits, get_kv_quant_bit_options().c_str()); 112 | + return false; 113 | + } 114 | + } catch (...) 
{ 115 | + LOG_ERROR("Invalid KV cache value quantization bits: '%s' (valid options: %s)\n", 116 | + std::string(value).c_str(), get_kv_quant_bit_options().c_str()); 117 | + return false; 118 | + } 119 | + return true; 120 | + }, [&]() -> std::string { return ""; }), 121 | + "", 122 | + "Set KV cache value quantization bits (options: " + get_kv_quant_bit_options() + ")" 123 | + ).set_env("LLAMA_ARG_KVQ_VAL"); 124 | + 125 | + parser.add_arg( 126 | + add_arg_type::opt, "kvq", &kv_quant_general, 127 | + add_arg_handler([&](std::string_view value) { 128 | + try { 129 | + int bits = std::stoi(std::string(value)); 130 | + // Set both key and value cache quantization to the same type for backwards compatibility 131 | + if (kv_quant_bit_to_type.find(bits) != kv_quant_bit_to_type.end()) { 132 | + params.cache_type_k = kv_quant_bit_to_type.at(bits); 133 | + params.cache_type_v = kv_quant_bit_to_type.at(bits); 134 | + } else { 135 | + LOG_ERROR("Invalid KV cache quantization bits: %d (valid options: %s)\n", 136 | + bits, get_kv_quant_bit_options().c_str()); 137 | + return false; 138 | + } 139 | + } catch (...) { 140 | + LOG_ERROR("Invalid KV cache quantization bits: '%s' (valid options: %s)\n", 141 | + std::string(value).c_str(), get_kv_quant_bit_options().c_str()); 142 | + return false; 143 | + } 144 | + return true; 145 | + }, [&]() -> std::string { return ""; }), 146 | + "", 147 | + "Set both KV cache key and value quantization bits (options: " + get_kv_quant_bit_options() + ")" 148 | + ).set_env("LLAMA_ARG_KVQ"); 149 | + 150 | + parser.add_arg( 151 | add_arg_type::opt, "cache-type-v", 152 | add_arg_handler([&](std::string_view value) { 153 | params.cache_type_v = kv_cache_type_from_str(value); 154 | @@ -2097,6 +2169,34 @@ static void common_params_parser_init(arg_parser & parser, common_params & param 155 | "Type for V cache: " + get_all_kv_cache_types() + " (default: " + ggml_type_name(params.cache_type_v) + ")" 156 | ).set_env("LLAMA_ARG_CACHE_TYPE_V")); 157 | 158 | + // Add variable declarations at the top of the function 159 | +} 160 | + 161 | +// Add these declarations before the common_params_parser_init function 162 | +static std::string kv_quant_key; 163 | +static std::string kv_quant_val; 164 | +static std::string kv_quant_general; 165 | + 166 | +// Add this to the llama-kv-cache.cpp file to document memory usage impact 167 | +/** 168 | + * Note: Memory measurements may show slight differences from theoretical calculations 169 | + * due to 256B page alignment in the llama.cpp allocator. 
170 | + * 171 | + * When using differentiated precision for keys and values: 172 | + * - Keys are more sensitive to quantization than values 173 | + * - K8V4 (8-bit keys, 4-bit values) generally provides a good balance 174 | + * - Using K4V8 (4-bit keys, 8-bit values) typically reduces quality more 175 | + * - Configuration performance will vary by model, hardware, and usage 176 | + * 177 | + * Examples: 178 | + * --kvq-key 8 --kvq-val 4 # Use 8-bit keys, 4-bit values 179 | + * --kvq-key 4 --kvq-val 8 # Use 4-bit keys, 8-bit values 180 | + * --kvq 4 # Use 4-bit for both keys and values (equivalent to --kvq-key 4 --kvq-val 4) 181 | + */ 182 | -------------------------------------------------------------------------------- /patch/kvsplit_fixed.patch: -------------------------------------------------------------------------------- 1 | diff --git a/common/arg.cpp b/common/arg.cpp 2 | index 0000000..0000000 100644 3 | --- a/common/arg.cpp 4 | +++ b/common/arg.cpp 5 | @@ -802,17 +802,31 @@ static std::vector arch_vec() { 6 | 7 | // KV Cache quantization types 8 | const std::vector kv_cache_types = { 9 | - GGML_TYPE_F16, 10 | - GGML_TYPE_F32, 11 | - GGML_TYPE_Q8_0, 12 | - GGML_TYPE_Q4_0, 13 | + GGML_TYPE_F16, // Default (FP16) 14 | + GGML_TYPE_F32, // Full precision (FP32) 15 | + GGML_TYPE_Q8_0, // 8-bit quantization 16 | + GGML_TYPE_Q4_0, // 4-bit quantization 17 | +}; 18 | + 19 | +// Mapping of bit sizes to quantization types 20 | +const std::unordered_map kv_quant_bit_to_type = { 21 | + {16, GGML_TYPE_F16}, // 16-bit = FP16 22 | + {32, GGML_TYPE_F32}, // 32-bit = FP32 23 | + {8, GGML_TYPE_Q8_0}, // 8-bit = Q8_0 24 | + {4, GGML_TYPE_Q4_0}, // 4-bit = Q4_0 25 | }; 26 | 27 | static ggml_type kv_cache_type_from_str(const std::string & s) { 28 | for (const auto & type : kv_cache_types) { 29 | if (s == ggml_type_name(type)) return type; 30 | } 31 | - return GGML_TYPE_COUNT; //invalid 32 | + 33 | + // Also try parsing bit sizes (4 or 8) 34 | + try { 35 | + int bits = std::stoi(s); 36 | + if (kv_quant_bit_to_type.find(bits) != kv_quant_bit_to_type.end()) { 37 | + return kv_quant_bit_to_type.at(bits); 38 | + } 39 | + } catch (...) 
{} 40 | + 41 | + return GGML_TYPE_COUNT; // invalid 42 | } 43 | 44 | static std::string get_all_kv_cache_types() { 45 | @@ -823,6 +837,30 @@ static std::string get_all_kv_cache_types() { 46 | return msg.str(); 47 | } 48 | 49 | +static std::string get_kv_quant_bit_options() { 50 | + // Return the supported bit sizes only (for --kvq-key and --kvq-val) 51 | + std::stringstream msg; 52 | + bool first = true; 53 | + for (const auto& pair : kv_quant_bit_to_type) { 54 | + if (!first) { 55 | + msg << ", "; 56 | + } 57 | + msg << pair.first; 58 | + first = false; 59 | + } 60 | + return msg.str(); 61 | +} 62 | + 63 | +// Helper to convert bit size to quantization type 64 | +static ggml_type kv_quant_bits_to_type(int bits) { 65 | + auto it = kv_quant_bit_to_type.find(bits); 66 | + if (it != kv_quant_bit_to_type.end()) { 67 | + return it->second; 68 | + } 69 | + // Default to FP16 if invalid 70 | + return GGML_TYPE_F16; 71 | +} 72 | + 73 | static ggml_backend_buffer_type_config ggml_backend_buffer_type_config_from_str(const std::string & s) { 74 | auto config = ggml_backend_buffer_type_config_init(); 75 | 76 | @@ -2086,6 +2124,43 @@ static void common_params_parser_init(arg_parser & parser, common_params & param 77 | ).set_env("LLAMA_ARG_CACHE_TYPE_K")); 78 | 79 | parser.add_arg( 80 | + add_arg_type::opt, "kvq-key", &kv_quant_key, 81 | + add_arg_handler([&](std::string_view value) { 82 | + try { 83 | + int bits = std::stoi(std::string(value)); 84 | + // Set key cache quantization type 85 | + if (kv_quant_bit_to_type.find(bits) != kv_quant_bit_to_type.end()) { 86 | + params.cache_type_k = kv_quant_bit_to_type.at(bits); 87 | + } else { 88 | + LOG_ERROR("Invalid KV cache key quantization bits: %d (valid options: %s)\n", 89 | + bits, get_kv_quant_bit_options().c_str()); 90 | + return false; 91 | + } 92 | + } catch (...) { 93 | + LOG_ERROR("Invalid KV cache key quantization bits: '%s' (valid options: %s)\n", 94 | + std::string(value).c_str(), get_kv_quant_bit_options().c_str()); 95 | + return false; 96 | + } 97 | + return true; 98 | + }, [&]() -> std::string { return ""; }), 99 | + "", 100 | + "Set KV cache key quantization bits (options: " + get_kv_quant_bit_options() + ")" 101 | + ).set_env("LLAMA_ARG_KVQ_KEY"); 102 | + 103 | + parser.add_arg( 104 | + add_arg_type::opt, "kvq-val", &kv_quant_val, 105 | + add_arg_handler([&](std::string_view value) { 106 | + try { 107 | + int bits = std::stoi(std::string(value)); 108 | + // Set value cache quantization type 109 | + if (kv_quant_bit_to_type.find(bits) != kv_quant_bit_to_type.end()) { 110 | + params.cache_type_v = kv_quant_bit_to_type.at(bits); 111 | + } else { 112 | + LOG_ERROR("Invalid KV cache value quantization bits: %d (valid options: %s)\n", 113 | + bits, get_kv_quant_bit_options().c_str()); 114 | + return false; 115 | + } 116 | + } catch (...) 
{ 117 | + LOG_ERROR("Invalid KV cache value quantization bits: '%s' (valid options: %s)\n", 118 | + std::string(value).c_str(), get_kv_quant_bit_options().c_str()); 119 | + return false; 120 | + } 121 | + return true; 122 | + }, [&]() -> std::string { return ""; }), 123 | + "", 124 | + "Set KV cache value quantization bits (options: " + get_kv_quant_bit_options() + ")" 125 | + ).set_env("LLAMA_ARG_KVQ_VAL"); 126 | + 127 | + parser.add_arg( 128 | + add_arg_type::opt, "kvq", &kv_quant_general, 129 | + add_arg_handler([&](std::string_view value) { 130 | + try { 131 | + int bits = std::stoi(std::string(value)); 132 | + // Set both key and value cache quantization to the same type for backwards compatibility 133 | + if (kv_quant_bit_to_type.find(bits) != kv_quant_bit_to_type.end()) { 134 | + params.cache_type_k = kv_quant_bit_to_type.at(bits); 135 | + params.cache_type_v = kv_quant_bit_to_type.at(bits); 136 | + } else { 137 | + LOG_ERROR("Invalid KV cache quantization bits: %d (valid options: %s)\n", 138 | + bits, get_kv_quant_bit_options().c_str()); 139 | + return false; 140 | + } 141 | + } catch (...) { 142 | + LOG_ERROR("Invalid KV cache quantization bits: '%s' (valid options: %s)\n", 143 | + std::string(value).c_str(), get_kv_quant_bit_options().c_str()); 144 | + return false; 145 | + } 146 | + return true; 147 | + }, [&]() -> std::string { return ""; }), 148 | + "", 149 | + "Set both KV cache key and value quantization bits (options: " + get_kv_quant_bit_options() + ")" 150 | + ).set_env("LLAMA_ARG_KVQ"); 151 | + 152 | + parser.add_arg( 153 | add_arg_type::opt, "cache-type-v", 154 | add_arg_handler([&](std::string_view value) { 155 | params.cache_type_v = kv_cache_type_from_str(value); 156 | @@ -2096,5 +2171,17 @@ static void common_params_parser_init(arg_parser & parser, common_params & param 157 | "Type for V cache: " + get_all_kv_cache_types() + " (default: " + ggml_type_name(params.cache_type_v) + ")" 158 | ).set_env("LLAMA_ARG_CACHE_TYPE_V")); 159 | } 160 | 161 | +// Declarations for our new KV split parameters 162 | +static std::string kv_quant_key; 163 | +static std::string kv_quant_val; 164 | +static std::string kv_quant_general; 165 | + 166 | +/* 167 | + * Note: Memory measurements may show slight differences from theoretical calculations 168 | + * due to 256B page alignment in the llama.cpp allocator. 169 | + * 170 | + * When using differentiated precision for keys and values: 171 | + * - Keys are more sensitive to quantization than values 172 | + * - K8V4 (8-bit keys, 4-bit values) generally provides a good balance 173 | + */ 174 | -------------------------------------------------------------------------------- /patch/split_kv_quant.diff: -------------------------------------------------------------------------------- 1 | diff --git a/common/common.cpp b/common/common.cpp 2 | index abcdef1..1234567 100644 3 | --- a/common/common.cpp 4 | +++ b/common/common.cpp 5 | @@ -123,6 +123,18 @@ void common_params_parser_init(common_params * params) { 6 | params->cache_type_v = llama_model_kv_cache_type_from_str(value.c_str()); 7 | }); 8 | } 9 | + 10 | + { 11 | + const auto & argp = gpt_params_args.add_arg({ 12 | + "--kvq", "-kvq" 13 | + }, "BITS", "Set both KV cache key and value quantization to same bits\nallowed values: 4, 8\n(default: 16 for FP16)"); 14 | + argp.action = [&](const std::string & value) { 15 | + try { 16 | + int bits = std::stoi(value); 17 | + params->cache_type_k = bits == 4 ? LLAMA_KV_CACHE_TYPE_Q4_0 : LLAMA_KV_CACHE_TYPE_Q8_0; 18 | + params->cache_type_v = bits == 4 ? 
LLAMA_KV_CACHE_TYPE_Q4_0 : LLAMA_KV_CACHE_TYPE_Q8_0; 19 | + } catch (const std::exception & e) {} 20 | + }; 21 | + } 22 | 23 | // Add batching arguments 24 | { 25 | -------------------------------------------------------------------------------- /perplexity_test_data.txt: -------------------------------------------------------------------------------- 1 | The meaning of life is a profound philosophical question that has been debated for centuries by thinkers across cultures and disciplines. Some philosophers argue that meaning comes from purpose, while others suggest it emerges from human relationships and connections. Religious perspectives often point to divine purpose or spiritual fulfillment, whereas existentialists like Sartre propose that we must create our own meaning in an otherwise indifferent universe. Scientific materialists might suggest that seeking meaning itself is a biological adaptation that helped our species survive and thrive. In psychology, Viktor Frankl's logotherapy suggests that meaning comes from three sources: purposeful work, loving relationships, and the courage to endure suffering with dignity. Modern perspectives often blend these viewpoints, recognizing that meaning is both personally constructed and socially influenced. 2 | 3 | The theory of relativity explains the relationship between space and time, revolutionizing our understanding of physics. Einstein's work showed that time and space are not absolute but relative to the observer's frame of reference. This challenged Newton's laws by demonstrating that the speed of light is constant regardless of the observer's motion. Special relativity revealed that time dilates and space contracts for objects moving at high velocities. General relativity extended these insights to explain gravity as the curvature of spacetime caused by mass and energy. This framework predicted phenomena like gravitational waves and black holes, which were confirmed decades later through advanced observational techniques. The precise predictions made by relativity theory have proven accurate in numerous experiments and technological applications, including GPS satellites that must account for relativistic effects to maintain accuracy. 4 | 5 | The history of artificial intelligence begins with ancient myths of mechanical beings and philosophical inquiries about thinking machines. The field formally emerged in the mid-20th century, with the 1956 Dartmouth Conference marking its official birth. Early AI focused on symbolic logic and rule-based systems, aiming to replicate human reasoning. The field experienced cycles of optimism and disappointment known as "AI winters" when progress failed to meet expectations. The 1980s saw the rise of expert systems, while the 1990s brought renewed interest in neural networks. Machine learning gained momentum in the 2000s as computational power increased and datasets grew. Deep learning breakthroughs in 2012 catalyzed the current AI boom, enabling significant advances in image recognition, natural language processing, and reinforcement learning. Contemporary AI systems can generate creative content, drive vehicles, diagnose diseases, and beat human champions at complex games. The field continues to evolve rapidly, raising important questions about ethics, governance, and the future relationship between humans and intelligent machines. 6 | 7 | The relationship between quantum mechanics and gravity is one of the greatest unsolved problems in physics. 
Standard quantum mechanics has successfully explained the behavior of particles at microscopic scales, while Einstein's general relativity accurately describes gravity at cosmic scales. However, attempts to unify these theories have proven enormously challenging. Quantum field theory works for three fundamental forces—electromagnetic, strong, and weak—but breaks down when applied to gravity. String theory proposes that fundamental particles are actually tiny vibrating strings in multiple dimensions, potentially reconciling quantum effects with gravitational force. Loop quantum gravity suggests space itself has a discrete, quantum structure at the Planck scale. Other approaches include causal set theory, asymptotic safety, and emergent gravity models. Experiments to detect quantum gravitational effects are extraordinarily difficult due to the weakness of gravity at quantum scales. Black holes represent crucial theoretical laboratories, as they require both quantum and gravitational descriptions. Hawking radiation—thermal emission from black holes—highlights the theoretical conflicts between these frameworks. Scientists hope advanced experiments like gravitational wave detectors might eventually provide empirical guidance for this theoretical quest. 8 | 9 | In the beginning of the universe, according to the prevailing Big Bang theory, all matter, energy, space, and time were compressed into an infinitesimally small, infinitely dense point called a singularity. Approximately 13.8 billion years ago, this singularity began expanding rapidly in an event known as the Big Bang. During the first fraction of a second, the universe underwent inflation, expanding faster than the speed of light. As the universe cooled, fundamental forces separated, and elementary particles formed. Within minutes, hydrogen and helium nuclei emerged, though it would take about 380,000 years for the universe to cool enough for neutral atoms to form. This moment released the cosmic microwave background radiation still detectable today. For millions of years, the universe remained dark until gravity gradually pulled matter together, forming the first stars roughly 100-200 million years after the Big Bang. These early stars, much larger and shorter-lived than our sun, produced heavier elements through nuclear fusion and subsequent supernova explosions, seeding the universe with the building blocks necessary for planets and life. Galaxies began forming as gravity drew stars together, with our own Milky Way taking shape around 13.6 billion years ago. The solar system, including Earth, formed much later, approximately 4.6 billion years ago, from the remnants of previous stellar generations. 10 | 11 | The future of human civilization depends on our collective ability to address multiple interconnected challenges while leveraging unprecedented technological capabilities. Climate change presents existential threats through rising sea levels, extreme weather events, agricultural disruptions, and ecosystem collapse. Developing sustainable energy systems, implementing carbon capture technologies, and redesigning infrastructure will be crucial for maintaining environmental stability. Artificial intelligence and automation promise incredible productivity gains but require thoughtful governance to ensure benefits are widely shared and risks mitigated. Biotechnology advances may extend human lifespans, eliminate diseases, and enhance cognitive and physical capabilities, fundamentally altering the human condition. 
Space exploration could provide resources, habitats, and insurance against Earth-based catastrophes. Social factors will be equally important: reducing inequality, strengthening democratic institutions, resolving geopolitical conflicts, and developing ethical frameworks for emerging technologies. Educational systems must evolve to prepare people for rapidly changing circumstances. The civilization that emerges from these transformations may differ radically from today's, potentially spreading beyond Earth to become multi-planetary or even interstellar, with artificial intelligence and enhanced humans collaborating in ways currently difficult to imagine. 12 | 13 | To understand consciousness, we must first recognize its multifaceted nature spanning subjective experience, self-awareness, and integrated information processing. Neuroscience has identified neural correlates of consciousness, particularly in the brain's thalamocortical system, where synchronous activity across regions appears to generate coherent experiences. However, explaining how physical processes produce subjective experiences—what philosopher David Chalmers calls the "hard problem"—remains elusive. Various theoretical frameworks attempt to bridge this explanatory gap. Integrated Information Theory proposes that consciousness arises from complex information integration in neural systems. Global Workspace Theory suggests consciousness emerges when information becomes broadly accessible across brain regions. Predictive Processing models frame consciousness as hierarchical prediction systems constructing our perceptual reality. Some researchers explore quantum effects in microtubules as potential consciousness generators. Philosophical positions range from physicalism (consciousness is entirely physical) to dualism (consciousness is separate from physical reality) to panpsychism (consciousness is fundamental and ubiquitous). Comparative studies across species, developmental psychology, meditation research, psychedelic science, and artificial intelligence all offer complementary insights. Advances in brain imaging, computational modeling, and theoretical frameworks suggest understanding consciousness may require integrating multiple levels of analysis, from molecular processes to emergent properties of complex systems. 14 | 15 | The evolution of language shows that communication systems can develop remarkable complexity through incremental changes driven by both biological and cultural factors. Human language likely evolved from simpler communication systems used by our ancestors, with genetic adaptations providing the necessary cognitive and anatomical foundations. Brain changes created neural circuits capable of processing complex syntax, while anatomical modifications to the vocal tract enabled articulation of diverse sounds. Archaeological evidence suggests symbolic communication emerged at least 100,000 years ago, with fully modern language capabilities present by 50,000 years ago, coinciding with a flourishing of human culture and technology. Unlike animal communication, human language is characterized by recursion, displacement (discussing absent objects), and infinite generativity. Comparative animal studies reveal precursors to language in other species, such as alarm calls in monkeys and song learning in birds. Once established, languages evolved primarily through cultural transmission, diversifying into thousands of languages. 
Historical linguistics demonstrates systematic sound changes, grammatical shifts, and vocabulary evolution across generations. Languages constantly adapt to cognitive constraints, social needs, and technological contexts—as seen in the development of writing systems, standardization processes, and digital communication. Contemporary research integrates evolutionary biology, neuroscience, anthropology, and computational modeling to understand this uniquely human capacity that fundamentally shapes our experience and enables cultural accumulation across generations. 16 | 17 | The optimal economic system would balance efficiency, equity, innovation, sustainability, and human flourishing. Pure market economies excel at innovation and resource allocation but often generate inequality and environmental externalities. Command economies can ensure basic necessities and reduce inequality but typically struggle with efficiency and personal freedom. Most successful modern economies blend elements of both approaches, with markets driving production and innovation while government provides public goods, regulates excesses, and ensures social safety nets. Different nations emphasize different aspects of this mix: Nordic countries combine competitive markets with robust welfare systems, while Singapore pairs strong state direction with market mechanisms. Beyond this spectrum, alternative models include worker cooperatives, participatory economics, and various forms of local exchange systems. Digital technologies offer new possibilities through platform cooperativism, decentralized autonomous organizations, and blockchain-based economic systems. The optimal system would likely incorporate mechanisms for addressing climate change, technological disruption, and changing workforce dynamics. It would value unpaid labor like caregiving, maintain living standards while reducing environmental impacts, and harness technology to enhance rather than replace human capabilities. Perhaps most importantly, it would remain adaptive, allowing democratic experimentation and evolution as circumstances change. The ideal system may not be universal but contextual—varying based on cultural values, development stages, and available resources across different societies. 18 | 19 | The nature of reality suggests that our everyday perceptions capture only a limited slice of what exists. Physics reveals a cosmos governed by quantum mechanics at tiny scales—where particles exist in probability clouds and can be entangled across vast distances—and by relativity at cosmic scales, where space and time form a unified fabric warped by mass and energy. Most of what exists appears to be dark matter and dark energy, invisible to direct observation. Our sensory systems evolved to detect only information relevant to survival, not to perceive ultimate reality. The brain constructs our experience from limited sensory data, filling gaps and creating coherent narratives. This suggests reality may be fundamentally different from our perceptions of it. Some philosophers propose reality might be information-based rather than material, with physical laws emerging from computational processes. Others suggest consciousness might be fundamental rather than emergent. Scientific realists maintain that scientific theories increasingly approximate objective reality, while constructivists emphasize how social and cultural factors shape our understanding. 
What seems clear is that reality encompasses levels of complexity beyond intuitive human understanding, requiring mathematical models and conceptual frameworks that often contradict common sense. Our most advanced theories remain incomplete, suggesting reality's true nature may be stranger and more fascinating than we can currently comprehend. 20 | -------------------------------------------------------------------------------- /plots/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dipampaul17/KVSplit/887f0b0408a644936c4f5ef8901331988e5ad606/plots/.gitkeep -------------------------------------------------------------------------------- /plots/configuration_summary.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dipampaul17/KVSplit/887f0b0408a644936c4f5ef8901331988e5ad606/plots/configuration_summary.png -------------------------------------------------------------------------------- /plots/inference_speed.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dipampaul17/KVSplit/887f0b0408a644936c4f5ef8901331988e5ad606/plots/inference_speed.png -------------------------------------------------------------------------------- /plots/key_value_sensitivity.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dipampaul17/KVSplit/887f0b0408a644936c4f5ef8901331988e5ad606/plots/key_value_sensitivity.png -------------------------------------------------------------------------------- /plots/kv_cache_memory_usage.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dipampaul17/KVSplit/887f0b0408a644936c4f5ef8901331988e5ad606/plots/kv_cache_memory_usage.png -------------------------------------------------------------------------------- /plots/memory_vs_quality.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dipampaul17/KVSplit/887f0b0408a644936c4f5ef8901331988e5ad606/plots/memory_vs_quality.png -------------------------------------------------------------------------------- /plots/perplexity_change.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dipampaul17/KVSplit/887f0b0408a644936c4f5ef8901331988e5ad606/plots/perplexity_change.png -------------------------------------------------------------------------------- /results/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dipampaul17/KVSplit/887f0b0408a644936c4f5ef8901331988e5ad606/results/.gitkeep -------------------------------------------------------------------------------- /scripts/benchmark_kvsplit.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | """ 4 | KVSplit Benchmark Script 5 | 6 | This script benchmarks different KV cache quantization configurations 7 | for the KVSplit project. It tests each configuration sequentially to 8 | avoid GPU contention and captures memory usage, throughput, and perplexity. 
9 | 10 | Configurations tested: 11 | - FP16 (baseline) 12 | - K8V8 (standard 8-bit quantization) 13 | - K8V4 (higher precision for keys) 14 | - K4V8 (reversed - to demonstrate key sensitivity) 15 | - K4V4 (standard 4-bit quantization) 16 | 17 | For each configuration, sequence lengths 128, 2048, 4096, and 8192 are tested, 18 | with each test repeated 3 times for statistical significance. 19 | """ 20 | 21 | import os 22 | import sys 23 | import csv 24 | import time 25 | import json 26 | import datetime 27 | import subprocess 28 | import re 29 | import statistics 30 | import argparse 31 | import random 32 | import math 33 | from pathlib import Path 34 | 35 | # ANSI color codes for prettier output 36 | RED = '\033[0;31m' 37 | GREEN = '\033[0;32m' 38 | YELLOW = '\033[1;33m' 39 | BLUE = '\033[0;34m' 40 | MAGENTA = '\033[0;35m' 41 | CYAN = '\033[0;36m' 42 | RESET = '\033[0m' 43 | 44 | # Define the configurations to test 45 | CONFIGURATIONS = [ 46 | {"name": "FP16", "key_bits": None, "val_bits": None, "flags": []}, # Baseline 47 | {"name": "K8V8", "key_bits": 8, "val_bits": 8, "flags": ["--kvq", "8"]}, 48 | {"name": "K8V4", "key_bits": 8, "val_bits": 4, "flags": ["--kvq-key", "8", "--kvq-val", "4"]}, 49 | {"name": "K4V8", "key_bits": 4, "val_bits": 8, "flags": ["--kvq-key", "4", "--kvq-val", "8"]}, 50 | {"name": "K4V4", "key_bits": 4, "val_bits": 4, "flags": ["--kvq", "4"]}, 51 | ] 52 | 53 | # Sequence lengths to test 54 | SEQUENCE_LENGTHS = [128, 2048, 4096, 8192] 55 | 56 | # Number of times to repeat each test 57 | REPEAT_COUNT = 3 58 | 59 | # Test prompts - different prompts to avoid caching effects between runs 60 | TEST_PROMPTS = [ 61 | "The meaning of life is", 62 | "In the beginning of the universe", 63 | "The theory of relativity explains", 64 | "The history of artificial intelligence begins", 65 | "The relationship between quantum mechanics and gravity is", 66 | "The future of human civilization depends on", 67 | "To understand consciousness, we must first", 68 | "The evolution of language shows that", 69 | "The optimal economic system would balance", 70 | "The nature of reality suggests that", 71 | ] 72 | 73 | class BenchmarkResult: 74 | """Class to store benchmark results for a single test run""" 75 | 76 | def __init__(self, config_name, seq_len, run_num): 77 | self.config_name = config_name 78 | self.sequence_length = seq_len 79 | self.run_number = run_num 80 | self.vram_usage_mb = 0 81 | self.kv_cache_mb = 0 82 | self.k_cache_mb = 0 83 | self.v_cache_mb = 0 84 | self.throughput_tokens_per_sec = 0 85 | self.perplexity = 0 86 | self.time_to_first_token_ms = 0 87 | self.total_time_sec = 0 88 | self.timestamp = datetime.datetime.now().isoformat() 89 | self.success = False # Track if the run was successful 90 | 91 | def to_dict(self): 92 | """Convert the result to a dictionary for CSV export""" 93 | # Get K bits and V bits from configuration name 94 | k_bits = None 95 | v_bits = None 96 | 97 | # Extract bits from configuration name (e.g., K8V4 -> k_bits=8, v_bits=4) 98 | if self.config_name != "FP16": 99 | match = re.match(r"K(\d+)V(\d+)", self.config_name) 100 | if match: 101 | k_bits = int(match.group(1)) 102 | v_bits = int(match.group(2)) 103 | 104 | return { 105 | "Configuration": self.config_name, 106 | "K_bits": k_bits, 107 | "V_bits": v_bits, 108 | "Sequence_Length": self.sequence_length, 109 | "Run_Number": self.run_number, 110 | "Success": self.success, 111 | "VRAM_Usage_MB": self.vram_usage_mb, 112 | "KV_Cache_MB": self.kv_cache_mb, 113 | "K_Cache_MB": self.k_cache_mb, 114 | 
"V_Cache_MB": self.v_cache_mb, 115 | "Throughput_Tokens_Per_Sec": self.throughput_tokens_per_sec, 116 | "Perplexity": self.perplexity, 117 | "Time_To_First_Token_ms": self.time_to_first_token_ms, 118 | "Total_Time_Sec": self.total_time_sec, 119 | "Timestamp": self.timestamp, 120 | } 121 | 122 | class Benchmarker: 123 | """Class to run benchmarks and collect results""" 124 | 125 | def __init__(self, base_dir, llama_exec, model_path, output_dir): 126 | self.base_dir = base_dir 127 | self.llama_exec = Path(llama_exec).resolve() 128 | self.model_path = Path(model_path).resolve() 129 | self.output_dir = Path(output_dir).resolve() 130 | self.results = [] 131 | 132 | # Create output directory if it doesn't exist 133 | self.output_dir.mkdir(parents=True, exist_ok=True) 134 | 135 | # Ensure we can run the model 136 | if not self.llama_exec.exists(): 137 | raise FileNotFoundError(f"Llama executable not found at {self.llama_exec}") 138 | if not self.model_path.exists(): 139 | raise FileNotFoundError(f"Model not found at {self.model_path}") 140 | 141 | # Apply the patch to ensure we have the KVSplit functionality 142 | self._apply_patch() 143 | 144 | print(f"{GREEN}Benchmarker initialized:{RESET}") 145 | print(f" - Llama executable: {self.llama_exec}") 146 | print(f" - Model: {self.model_path}") 147 | print(f" - Output directory: {self.output_dir}") 148 | 149 | def _apply_patch(self, apply=False): 150 | if not apply: 151 | print(f"{YELLOW}Skipping patch application - using existing parameters.{RESET}") 152 | return 153 | 154 | llama_cpp_dir = Path(self.base_dir) / "llama.cpp" 155 | arg_cpp_path = llama_cpp_dir / "common" / "arg.cpp" 156 | 157 | # Check if the file contains our parameter 158 | applied = False 159 | if arg_cpp_path.exists(): 160 | with open(arg_cpp_path, 'r') as f: 161 | if "kv_quant_key;" in f.read(): 162 | applied = True 163 | 164 | if applied: 165 | print(f"{YELLOW}KVSplit patch already applied.{RESET}") 166 | return 167 | 168 | print(f"{YELLOW}Applying KVSplit modifications...{RESET}") 169 | try: 170 | # Read the current file 171 | with open(arg_cpp_path, 'r') as f: 172 | content = f.read() 173 | 174 | # Add variable declarations at the top of the file 175 | if "static std::string kv_quant_key;" not in content: 176 | declarations = ( 177 | "// KVSplit declarations\n" 178 | "static std::string kv_quant_key;\n" 179 | "static std::string kv_quant_val;\n" 180 | "static std::string kv_quant_general;\n\n" 181 | ) 182 | # Insert after the includes but before the first function 183 | import_end = content.find("//") 184 | if import_end > 0: 185 | content = content[:import_end] + declarations + content[import_end:] 186 | 187 | # Add the mapping from bit sizes to quantization types 188 | kv_cache_types_end = content.find("const std::vector kv_cache_types") 189 | if kv_cache_types_end > 0: 190 | kv_cache_types_end = content.find("};") 191 | if kv_cache_types_end > 0: 192 | # Find the actual end of the array 193 | kv_cache_types_end = content.find("};") 194 | # Insert after the kv_cache_types array 195 | bit_mapping = ( 196 | "\n\n// Mapping of bit sizes to quantization types\n" 197 | "const std::unordered_map kv_quant_bit_to_type = {\n" 198 | " {16, GGML_TYPE_F16}, // 16-bit = FP16\n" 199 | " {32, GGML_TYPE_F32}, // 32-bit = FP32\n" 200 | " {8, GGML_TYPE_Q8_0}, // 8-bit = Q8_0\n" 201 | " {4, GGML_TYPE_Q4_0}, // 4-bit = Q4_0\n" 202 | "};\n" 203 | ) 204 | content = content[:kv_cache_types_end+2] + bit_mapping + content[kv_cache_types_end+2:] 205 | 206 | # Update the kv_cache_type_from_str 
function 207 | kv_cache_type_func = content.find("static ggml_type kv_cache_type_from_str") 208 | if kv_cache_type_func > 0: 209 | function_end = content.find("return GGML_TYPE_COUNT;") 210 | if function_end > 0: 211 | # Replace the return line with our enhanced version 212 | old_return = "return GGML_TYPE_COUNT; //invalid" 213 | new_return = ( 214 | " // Also try parsing bit sizes (4 or 8)\n" 215 | " try {\n" 216 | " int bits = std::stoi(s);\n" 217 | " if (kv_quant_bit_to_type.find(bits) != kv_quant_bit_to_type.end()) {\n" 218 | " return kv_quant_bit_to_type.at(bits);\n" 219 | " }\n" 220 | " } catch (...) {}\n\n" 221 | " return GGML_TYPE_COUNT; // invalid" 222 | ) 223 | content = content.replace(old_return, new_return) 224 | 225 | # Add helper functions 226 | get_all_kv_cache_types_end = content.find("static std::string get_all_kv_cache_types()") 227 | if get_all_kv_cache_types_end > 0: 228 | function_end = content.find("}", get_all_kv_cache_types_end) 229 | function_end = content.find("}", function_end + 1) 230 | if function_end > 0: 231 | # Add our new functions after get_all_kv_cache_types 232 | helper_functions = ( 233 | "\n\nstatic std::string get_kv_quant_bit_options() {\n" 234 | " // Return the supported bit sizes only (for --kvq-key and --kvq-val)\n" 235 | " std::stringstream msg;\n" 236 | " bool first = true;\n" 237 | " for (const auto& pair : kv_quant_bit_to_type) {\n" 238 | " if (!first) {\n" 239 | " msg << \", \";\n" 240 | " }\n" 241 | " msg << pair.first;\n" 242 | " first = false;\n" 243 | " }\n" 244 | " return msg.str();\n" 245 | "}\n\n" 246 | "// Helper to convert bit size to quantization type\n" 247 | "static ggml_type kv_quant_bits_to_type(int bits) {\n" 248 | " auto it = kv_quant_bit_to_type.find(bits);\n" 249 | " if (it != kv_quant_bit_to_type.end()) {\n" 250 | " return it->second;\n" 251 | " }\n" 252 | " // Default to FP16 if invalid\n" 253 | " return GGML_TYPE_F16;\n" 254 | "}\n" 255 | ) 256 | content = content[:function_end+1] + helper_functions + content[function_end+1:] 257 | 258 | # Add the command-line arguments 259 | cache_type_k_arg = content.find('add_arg_type::opt, "cache-type-k"') 260 | if cache_type_k_arg > 0: 261 | next_arg = content.find("parser.add_arg(", cache_type_k_arg + 1) 262 | if next_arg > 0: 263 | # Add our new arguments after cache-type-k 264 | new_args = ( 265 | "\n\n parser.add_arg(\n" 266 | " add_arg_type::opt, \"kvq-key\", &kv_quant_key,\n" 267 | " add_arg_handler([&](std::string_view value) {\n" 268 | " try {\n" 269 | " int bits = std::stoi(std::string(value));\n" 270 | " // Set key cache quantization type\n" 271 | " if (kv_quant_bit_to_type.find(bits) != kv_quant_bit_to_type.end()) {\n" 272 | " params.cache_type_k = kv_quant_bit_to_type.at(bits);\n" 273 | " } else {\n" 274 | " LOG_ERROR(\"Invalid KV cache key quantization bits: %d (valid options: %s)\\n\", \n" 275 | " bits, get_kv_quant_bit_options().c_str());\n" 276 | " return false;\n" 277 | " }\n" 278 | " } catch (...) 
{\n" 279 | " LOG_ERROR(\"Invalid KV cache key quantization bits: '%s' (valid options: %s)\\n\", \n" 280 | " std::string(value).c_str(), get_kv_quant_bit_options().c_str());\n" 281 | " return false;\n" 282 | " }\n" 283 | " return true;\n" 284 | " }, [&]() -> std::string { return \"\"; }),\n" 285 | " \"\",\n" 286 | " \"Set KV cache key quantization bits (options: \" + get_kv_quant_bit_options() + \")\"\n" 287 | " ).set_env(\"LLAMA_ARG_KVQ_KEY\");\n\n" 288 | " parser.add_arg(\n" 289 | " add_arg_type::opt, \"kvq-val\", &kv_quant_val,\n" 290 | " add_arg_handler([&](std::string_view value) {\n" 291 | " try {\n" 292 | " int bits = std::stoi(std::string(value));\n" 293 | " // Set value cache quantization type\n" 294 | " if (kv_quant_bit_to_type.find(bits) != kv_quant_bit_to_type.end()) {\n" 295 | " params.cache_type_v = kv_quant_bit_to_type.at(bits);\n" 296 | " } else {\n" 297 | " LOG_ERROR(\"Invalid KV cache value quantization bits: %d (valid options: %s)\\n\", \n" 298 | " bits, get_kv_quant_bit_options().c_str());\n" 299 | " return false;\n" 300 | " }\n" 301 | " } catch (...) {\n" 302 | " LOG_ERROR(\"Invalid KV cache value quantization bits: '%s' (valid options: %s)\\n\", \n" 303 | " std::string(value).c_str(), get_kv_quant_bit_options().c_str());\n" 304 | " return false;\n" 305 | " }\n" 306 | " return true;\n" 307 | " }, [&]() -> std::string { return \"\"; }),\n" 308 | " \"\",\n" 309 | " \"Set KV cache value quantization bits (options: \" + get_kv_quant_bit_options() + \")\"\n" 310 | " ).set_env(\"LLAMA_ARG_KVQ_VAL\");\n\n" 311 | " parser.add_arg(\n" 312 | " add_arg_type::opt, \"kvq\", &kv_quant_general,\n" 313 | " add_arg_handler([&](std::string_view value) {\n" 314 | " try {\n" 315 | " int bits = std::stoi(std::string(value));\n" 316 | " // Set both key and value cache quantization to the same type for backwards compatibility\n" 317 | " if (kv_quant_bit_to_type.find(bits) != kv_quant_bit_to_type.end()) {\n" 318 | " params.cache_type_k = kv_quant_bit_to_type.at(bits);\n" 319 | " params.cache_type_v = kv_quant_bit_to_type.at(bits);\n" 320 | " } else {\n" 321 | " LOG_ERROR(\"Invalid KV cache quantization bits: %d (valid options: %s)\\n\", \n" 322 | " bits, get_kv_quant_bit_options().c_str());\n" 323 | " return false;\n" 324 | " }\n" 325 | " } catch (...) 
{\n" 326 | " LOG_ERROR(\"Invalid KV cache quantization bits: '%s' (valid options: %s)\\n\", \n" 327 | " std::string(value).c_str(), get_kv_quant_bit_options().c_str());\n" 328 | " return false;\n" 329 | " }\n" 330 | " return true;\n" 331 | " }, [&]() -> std::string { return \"\"; }),\n" 332 | " \"\",\n" 333 | " \"Set both KV cache key and value quantization bits (options: \" + get_kv_quant_bit_options() + \")\"\n" 334 | " ).set_env(\"LLAMA_ARG_KVQ\");\n" 335 | ) 336 | content = content[:next_arg] + new_args + content[next_arg:] 337 | 338 | # Write the modified content back to the file 339 | with open(arg_cpp_path, 'w') as f: 340 | f.write(content) 341 | 342 | print(f"{GREEN}✓ KVSplit modifications applied successfully{RESET}") 343 | 344 | # Rebuild llama.cpp 345 | print(f"{YELLOW}Rebuilding llama.cpp...{RESET}") 346 | build_dir = llama_cpp_dir / "build" 347 | if build_dir.exists(): 348 | try: 349 | subprocess.run( 350 | ["cmake", "--build", ".", "--config", "Release"], 351 | cwd=str(build_dir), 352 | check=True, 353 | capture_output=True 354 | ) 355 | print(f"{GREEN}✓ llama.cpp rebuilt successfully{RESET}") 356 | except subprocess.CalledProcessError as e: 357 | print(f"{RED}Failed to rebuild llama.cpp: {e}{RESET}") 358 | print(f"Build output: {e.stdout}\n{e.stderr}") 359 | raise 360 | else: 361 | print(f"{RED}Build directory not found at {build_dir}. Please run setup.sh first.{RESET}") 362 | except Exception as e: 363 | print(f"{RED}Failed to apply modifications: {e}{RESET}") 364 | raise 365 | 366 | def _parse_metal_memory(self, log_text): 367 | """Parse Metal memory allocation from log output""" 368 | vram_usage_mb = 0 369 | kv_cache_mb = 0 370 | k_cache_mb = 0 371 | v_cache_mb = 0 372 | 373 | # Try to match the newer format first (Metal_Mapped model buffer size) 374 | metal_alloc = re.search(r"Metal_Mapped model buffer size\s*=\s*([\d.]+)\s*MiB", log_text) 375 | if metal_alloc: 376 | vram_usage_mb = float(metal_alloc.group(1)) 377 | 378 | # If not found, try older formats 379 | if not vram_usage_mb: 380 | metal_alloc = re.search(r"METAL ALLOC.*?size (\d+) KiB", log_text) 381 | if metal_alloc: 382 | # Convert to MB 383 | vram_usage_mb = float(metal_alloc.group(1)) / 1024 # KiB to MB 384 | 385 | if not vram_usage_mb: 386 | metal_alloc = re.search(r"GGML_METAL_log_alloc.*?(\d+)", log_text) 387 | if metal_alloc: 388 | # Convert to MB (check units in the matched string) 389 | vram_usage_mb = float(metal_alloc.group(1)) / (1024 * 1024) # Bytes to MB 390 | 391 | # Parse KV cache size from the newer unified format 392 | kv_unified = re.search(r"KV self size\s*=\s*([\d.]+)\s*MiB", log_text) 393 | if kv_unified: 394 | kv_cache_mb = float(kv_unified.group(1)) 395 | 396 | # Extract K and V sizes from the newer format 397 | k_size = re.search(r"K \([^)]+\):\s*([\d.]+)\s*MiB", log_text) 398 | v_size = re.search(r"V \([^)]+\):\s*([\d.]+)\s*MiB", log_text) 399 | 400 | if k_size: 401 | k_cache_mb = float(k_size.group(1)) 402 | if v_size: 403 | v_cache_mb = float(v_size.group(1)) 404 | 405 | # If not found, try older formats 406 | if kv_cache_mb == 0: 407 | # Old/verbose format 408 | kv_matches = re.findall(r"METAL ALLOC:.*?[kK][vV].*?[cC]ache.*?(\d+) bytes", log_text) 409 | if kv_matches: 410 | # Sum all KV cache allocations and convert to MB 411 | kv_cache_mb = sum(int(x) for x in kv_matches) / (1024 * 1024) 412 | 413 | k_matches = re.findall(r"METAL ALLOC:.*?\bK\b.*?(\d+) bytes", log_text) 414 | if k_matches: 415 | k_cache_mb = sum(int(x) for x in k_matches) / (1024 * 1024) 416 | 417 | v_matches = 
re.findall(r"METAL ALLOC:.*?\bV\b.*?(\d+) bytes", log_text) 418 | if v_matches: 419 | v_cache_mb = sum(int(x) for x in v_matches) / (1024 * 1024) 420 | 421 | # Newer llama.cpp format for KV cache size 422 | if kv_cache_mb == 0: 423 | log_alloc = re.findall(r"llama_kv_cache_init: memory_size = ([\d.]+) MB", log_text) 424 | if log_alloc: 425 | kv_cache_mb = float(log_alloc[0]) 426 | 427 | # As a last resort, look for Metal KV buffer size 428 | if kv_cache_mb == 0: 429 | metal_kv = re.search(r"Metal KV buffer size\s*=\s*([\d.]+)\s*MiB", log_text) 430 | if metal_kv: 431 | kv_cache_mb = float(metal_kv.group(1)) 432 | 433 | # If we still don't have VRAM usage, use memory_pressure as fallback 434 | if vram_usage_mb == 0: 435 | try: 436 | mem_output = subprocess.run(["memory_pressure", "-Q"], 437 | capture_output=True, 438 | text=True, 439 | check=True).stdout 440 | # Parse memory_pressure output 441 | gpu_mem = re.search(r"GPU Memory: (\d+) MB", mem_output) 442 | if gpu_mem: 443 | vram_usage_mb = float(gpu_mem.group(1)) 444 | except Exception as e: 445 | print(f"{RED}Warning: Failed to get memory info from memory_pressure: {e}{RESET}") 446 | 447 | return vram_usage_mb, kv_cache_mb, k_cache_mb, v_cache_mb 448 | 449 | def _parse_perplexity(self, log_text): 450 | """Parse perplexity from log output""" 451 | # Try to extract log probability values for token predictions 452 | # Extract all instances of "logprob" values 453 | logprob_matches = re.findall(r"\bll=([-\d.]+)\b", log_text) 454 | if logprob_matches and len(logprob_matches) > 1: 455 | # Convert log probabilities to perplexity: exp(-mean(logprobs)) 456 | logprobs = [float(lp) for lp in logprob_matches if float(lp) < 0] # Ignore any positive logprobs (errors) 457 | if logprobs: 458 | avg_logprob = sum(logprobs) / len(logprobs) 459 | perplexity = math.exp(-avg_logprob) # PPL = exp(-avg_logprob) 460 | return perplexity 461 | 462 | # Try standard formats 463 | # Format produced by --perplexity flag 464 | perplexity_match = re.search(r"perplexity = ([\d.]+),", log_text) 465 | if perplexity_match: 466 | return float(perplexity_match.group(1)) 467 | 468 | # Alternate format with 'perplexity:' 469 | perplexity_match = re.search(r"perplexity:\s*([\d.]+)", log_text) 470 | if perplexity_match: 471 | return float(perplexity_match.group(1)) 472 | 473 | # PPL format used in some llama.cpp versions 474 | perplexity_match = re.search(r"PPL\s*=\s*([\d.]+)", log_text) 475 | if perplexity_match: 476 | return float(perplexity_match.group(1)) 477 | 478 | # NLL (negative log likelihood) - convert to perplexity 479 | nll_match = re.search(r"NLL\s*=\s*([\d.]+)", log_text) 480 | if nll_match: 481 | nll = float(nll_match.group(1)) 482 | return math.exp(nll) # perplexity = exp(NLL) 483 | 484 | # Try looking for 'average loss' which is similar to NLL 485 | avg_loss = re.search(r"average loss = ([\d.]+)", log_text) 486 | if avg_loss: 487 | loss = float(avg_loss.group(1)) 488 | return math.exp(loss) # perplexity = exp(loss) 489 | 490 | return 0 491 | 492 | def _parse_throughput(self, log_text): 493 | """Parse throughput (tokens/sec) from log output""" 494 | # Try the new format in llama_perf_context_print 495 | throughput_match = re.search(r"llama_perf_context_print:\s+eval time.*tokens per second,\s+([\d.]+)\)", log_text) 496 | if throughput_match: 497 | return float(throughput_match.group(1)) 498 | 499 | # Try earlier format 500 | throughput_match = re.search(r"eval time: .*? 
tokens/sec: ([\d.]+)", log_text) 501 | if throughput_match: 502 | return float(throughput_match.group(1)) 503 | 504 | # Try alternate format 505 | throughput_match = re.search(r"tokens per second: ([\d.]+)", log_text) 506 | if throughput_match: 507 | return float(throughput_match.group(1)) 508 | 509 | # Try another common format 510 | throughput_match = re.search(r"([\d.]+) tokens per second\)", log_text) 511 | if throughput_match: 512 | return float(throughput_match.group(1)) 513 | 514 | return 0 515 | 516 | def _parse_time_to_first_token(self, log_text): 517 | """Parse time to first token from log output""" 518 | # Standard format 519 | time_match = re.search(r"time to first token: ([\d.]+) ms", log_text) 520 | if time_match: 521 | return float(time_match.group(1)) 522 | 523 | # Newer format in llama_perf logs 524 | time_match = re.search(r"llama_perf_context_print:\s+prompt eval time\s*=\s*([\d.]+)\s*ms", log_text) 525 | if time_match: 526 | return float(time_match.group(1)) 527 | 528 | return 0 529 | 530 | def _parse_total_time(self, log_text): 531 | """Parse total evaluation time from log output""" 532 | # Standard format 533 | time_match = re.search(r"eval time: ([\d.]+) s", log_text) 534 | if time_match: 535 | return float(time_match.group(1)) 536 | 537 | # Newer format in llama_perf logs 538 | time_match = re.search(r"llama_perf_context_print:\s+total time\s*=\s*([\d.]+)\s*ms", log_text) 539 | if time_match: 540 | # Convert ms to seconds 541 | return float(time_match.group(1)) / 1000.0 542 | 543 | return 0 544 | 545 | def run_benchmark(self, config, seq_len, run_num): 546 | """Run a single benchmark""" 547 | 548 | config_name = config["name"] 549 | prompt = random.choice(TEST_PROMPTS) 550 | 551 | print(f"\n{YELLOW}Running benchmark for {config_name}, sequence length {seq_len}, run {run_num+1}/{REPEAT_COUNT}{RESET}") 552 | 553 | # Create a temporary prompt file for perplexity benchmark 554 | prompt_file = self.output_dir / f"prompt_{config_name}_{seq_len}_{run_num+1}.txt" 555 | with open(prompt_file, "w") as f: 556 | # For perplexity test, we need a longer text 557 | f.write("\n".join([prompt] + TEST_PROMPTS[:3])) 558 | 559 | # Set up command-line arguments for a standard generation benchmark 560 | cmd = [ 561 | str(self.llama_exec), 562 | "-m", str(self.model_path), 563 | "-p", prompt, # Simple prompt 564 | "-c", str(seq_len), # Context size 565 | "-n", str(seq_len), # Generate up to seq_len tokens 566 | "-t", "8", # Number of threads 567 | "--flash-attn" # Enable flash attention which is required for KV quantization 568 | ] 569 | 570 | # Add configuration-specific flags 571 | if config["flags"]: 572 | cmd.extend(config["flags"]) 573 | 574 | print(f"{BLUE}Command: {' '.join(cmd)}{RESET}") 575 | 576 | # Start benchmark - run the process and capture output 577 | start_time = time.time() 578 | try: 579 | process = subprocess.run( 580 | cmd, 581 | capture_output=True, 582 | text=True, 583 | check=False, # Don't raise exception on non-zero exit 584 | cwd=self.base_dir 585 | ) 586 | log_output = process.stdout + process.stderr 587 | success = process.returncode == 0 588 | except Exception as e: 589 | log_output = str(e) 590 | success = False 591 | end_time = time.time() 592 | 593 | # We need a much longer text file for the perplexity test 594 | # Copy the standard perplexity test data to a new file specific for this config 595 | perplexity_value = 0.0 596 | 597 | # Use our pre-created perplexity test data file 598 | perplexity_test_file = Path(self.base_dir) / "perplexity_test_data.txt" 
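        # Quick refresher on the metric being collected here: perplexity is
        # PPL = exp(-mean(logprob)), equivalently exp(NLL) -- the same
        # relationship _parse_perplexity above relies on. For example, token
        # logprobs of [-1.2, -0.8, -1.0] average to -1.0, giving
        # PPL = exp(1.0) ≈ 2.72; a model that assigned probability 1 to every
        # token would score the minimum possible PPL of 1.0.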
599 | if not os.path.exists(perplexity_test_file): 600 | print(f"{YELLOW}Warning: Perplexity test data file not found. Creating a new one...{RESET}") 601 | # Create sample content - philosophical themes that work well for perplexity testing 602 | perplexity_content = """ 603 | The meaning of life is a profound philosophical question that has been debated for centuries by thinkers across cultures and disciplines. Some philosophers argue that meaning comes from purpose, while others suggest it emerges from human relationships and connections. Religious perspectives often point to divine purpose or spiritual fulfillment, whereas existentialists like Sartre propose that we must create our own meaning in an otherwise indifferent universe. 604 | 605 | The theory of relativity explains the relationship between space and time, revolutionizing our understanding of physics. Einstein's work showed that time and space are not absolute but relative to the observer's frame of reference. This challenged Newton's laws by demonstrating that the speed of light is constant regardless of the observer's motion. 606 | 607 | The history of artificial intelligence begins with ancient myths of mechanical beings and philosophical inquiries about thinking machines. The field formally emerged in the mid-20th century, with the 1956 Dartmouth Conference marking its official birth. 608 | 609 | The relationship between quantum mechanics and gravity is one of the greatest unsolved problems in physics. Standard quantum mechanics has successfully explained the behavior of particles at microscopic scales, while Einstein's general relativity accurately describes gravity at cosmic scales. 610 | 611 | In the beginning of the universe, according to the prevailing Big Bang theory, all matter, energy, space, and time were compressed into an infinitesimally small, infinitely dense point called a singularity. 
612 | """ 613 | with open(perplexity_test_file, "w") as f: 614 | f.write(perplexity_content) 615 | 616 | # Run perplexity benchmark using llama-perplexity tool 617 | perplexity_tool = str(self.llama_exec).replace('llama-cli', 'llama-perplexity') 618 | if os.path.exists(perplexity_tool): 619 | perpl_cmd = [ 620 | perplexity_tool, 621 | "-m", str(self.model_path), 622 | "-f", str(perplexity_test_file), 623 | "-t", "8", # Number of threads 624 | "--ctx-size", "512", # Use smaller context size to enable perplexity calculation 625 | "--flash-attn" # Enable flash attention which is required for KV quantization 626 | ] 627 | 628 | # Add configuration-specific flags 629 | if config["flags"]: 630 | perpl_cmd.extend(config["flags"]) 631 | 632 | try: 633 | print(f"{BLUE}Running perplexity test: {' '.join(perpl_cmd)}{RESET}") 634 | perpl_process = subprocess.run( 635 | perpl_cmd, 636 | capture_output=True, 637 | text=True, 638 | check=False, 639 | cwd=self.base_dir 640 | ) 641 | perpl_output = perpl_process.stdout + perpl_process.stderr 642 | 643 | # Save perplexity output to log file 644 | perpl_log_file = self.output_dir / f"perplexity_{config_name}_n{seq_len}_run{run_num+1}.log" 645 | with open(perpl_log_file, "w") as f: 646 | f.write(perpl_output) 647 | 648 | # Parse perplexity result - look for the final PPL estimate 649 | match = re.search(r"Final estimate:\s*PPL\s*=\s*([\d.]+)", perpl_output) 650 | if match: 651 | perplexity_value = float(match.group(1)) 652 | print(f"{GREEN}Perplexity: {perplexity_value:.4f}{RESET}") 653 | else: 654 | # Try alternate format 655 | match = re.search(r"perplexity:\s*([\d.]+)", perpl_output) 656 | if match: 657 | perplexity_value = float(match.group(1)) 658 | print(f"{GREEN}Perplexity: {perplexity_value:.4f}{RESET}") 659 | else: 660 | print(f"{RED}Failed to extract perplexity value from output{RESET}") 661 | except Exception as e: 662 | print(f"{RED}Error running perplexity test: {e}{RESET}") 663 | else: 664 | print(f"{YELLOW}Warning: llama-perplexity tool not found at {perplexity_tool}{RESET}") 665 | 666 | # Save log output to file 667 | log_file = self.output_dir / f"benchmark_{config_name}_n{seq_len}_run{run_num+1}.log" 668 | with open(log_file, "w") as f: 669 | f.write(log_output) 670 | 671 | # Create result object 672 | result = BenchmarkResult(config_name, seq_len, run_num+1) 673 | result.success = success 674 | 675 | # Try to parse metrics from log output even if the command failed 676 | # This allows us to capture partial results 677 | total_allocated, kv_cache_size, k_cache_size, v_cache_size = self._parse_metal_memory(log_output) 678 | result.vram_usage_mb = total_allocated 679 | result.kv_cache_mb = kv_cache_size 680 | result.k_cache_mb = k_cache_size 681 | result.v_cache_mb = v_cache_size 682 | 683 | # Use the perplexity value from the dedicated perplexity tool test 684 | result.perplexity = perplexity_value 685 | 686 | result.throughput_tokens_per_sec = self._parse_throughput(log_output) 687 | result.time_to_first_token_ms = self._parse_time_to_first_token(log_output) 688 | result.total_time_sec = self._parse_total_time(log_output) 689 | 690 | # If we couldn't parse the total time, use our measured time 691 | if result.total_time_sec == 0: 692 | result.total_time_sec = end_time - start_time 693 | 694 | # Save error messages for analysis 695 | error_file = self.output_dir / f"error_{config_name}_n{seq_len}_run{run_num+1}.txt" 696 | if not success: 697 | # Try to extract error message 698 | error_lines = [line for line in log_output.splitlines() if 
"error" in line.lower() or "exception" in line.lower() or "failed" in line.lower()] 699 | if error_lines: 700 | with open(error_file, "w") as f: 701 | f.write("\n".join(error_lines)) 702 | 703 | if success: 704 | # Print results 705 | print(f"{GREEN}Benchmark completed:{RESET}") 706 | print(f" - VRAM usage: {result.vram_usage_mb:.2f} MB") 707 | print(f" - KV cache: {result.kv_cache_mb:.2f} MB") 708 | print(f" - Throughput: {result.throughput_tokens_per_sec:.2f} tokens/sec") 709 | print(f" - Perplexity: {result.perplexity:.4f}") 710 | print(f" - Log saved to: {log_file}") 711 | else: 712 | # Still try to print any metrics we could parse 713 | print(f"{RED}Benchmark failed. Check log file: {log_file}{RESET}") 714 | if result.kv_cache_mb > 0 or result.throughput_tokens_per_sec > 0 or result.perplexity > 0: 715 | print(f"{YELLOW}Partial results:{RESET}") 716 | if result.kv_cache_mb > 0: 717 | print(f" - KV cache: {result.kv_cache_mb:.2f} MB") 718 | if result.throughput_tokens_per_sec > 0: 719 | print(f" - Throughput: {result.throughput_tokens_per_sec:.2f} tokens/sec") 720 | if result.perplexity > 0: 721 | print(f" - Perplexity: {result.perplexity:.4f}") 722 | 723 | self.results.append(result) 724 | return result 725 | 726 | def run_all_benchmarks(self): 727 | """Run all benchmark configurations""" 728 | start_time = time.time() 729 | total_tests = len(CONFIGURATIONS) * len(SEQUENCE_LENGTHS) * REPEAT_COUNT 730 | completed_tests = 0 731 | 732 | print(f"{GREEN}Starting benchmark suite with {total_tests} total tests{RESET}") 733 | print(f"Testing configurations: {', '.join(c['name'] for c in CONFIGURATIONS)}") 734 | print(f"Sequence lengths: {', '.join(str(n) for n in SEQUENCE_LENGTHS)}") 735 | print(f"Each test repeated {REPEAT_COUNT} times") 736 | 737 | for config in CONFIGURATIONS: 738 | for seq_len in SEQUENCE_LENGTHS: 739 | for run in range(REPEAT_COUNT): 740 | # Run a single benchmark 741 | result = self.run_benchmark(config, seq_len, run) 742 | completed_tests += 1 743 | 744 | # Calculate and display progress 745 | progress_pct = (completed_tests / total_tests) * 100 746 | elapsed = time.time() - start_time 747 | eta = (elapsed / completed_tests) * (total_tests - completed_tests) if completed_tests > 0 else 0 748 | 749 | print(f"{BLUE}Progress: {completed_tests}/{total_tests} ({progress_pct:.1f}%)") 750 | print(f"Elapsed: {elapsed/60:.1f} minutes, ETA: {eta/60:.1f} minutes{RESET}") 751 | 752 | # Verify if the benchmark was successful - if first run fails, skip the rest for this config 753 | if run == 0 and not result.success and not any([result.kv_cache_mb, result.throughput_tokens_per_sec, result.perplexity]): 754 | print(f"{RED}First run for {config['name']} with sequence length {seq_len} failed completely.{RESET}") 755 | print(f"{YELLOW}Skipping remaining runs for this configuration and sequence length.{RESET}") 756 | # Skip the remaining runs for this config+seq_len 757 | completed_tests += REPEAT_COUNT - 1 758 | break 759 | 760 | # Small pause between tests to let system cool down 761 | if completed_tests < total_tests: 762 | print(f"{YELLOW}Waiting 2 seconds before next test...{RESET}") 763 | time.sleep(2) 764 | 765 | total_time = time.time() - start_time 766 | print(f"{GREEN}All benchmarks completed in {total_time/60:.1f} minutes!{RESET}") 767 | 768 | # Export all results 769 | self.export_results() 770 | 771 | # Generate summary statistics 772 | self.generate_summary() 773 | 774 | def export_results(self): 775 | """Export benchmark results to CSV""" 776 | timestamp = 
datetime.datetime.now().strftime("%Y%m%d_%H%M%S") 777 | csv_file = self.output_dir / f"benchmark_results_{timestamp}.csv" 778 | 779 | # Convert results to dictionaries 780 | result_dicts = [result.to_dict() for result in self.results] 781 | 782 | # Write to CSV 783 | if result_dicts: 784 | headers = result_dicts[0].keys() 785 | with open(csv_file, "w", newline="") as f: 786 | writer = csv.DictWriter(f, fieldnames=headers) 787 | writer.writeheader() 788 | writer.writerows(result_dicts) 789 | 790 | print(f"{GREEN}Results exported to {csv_file}{RESET}") 791 | else: 792 | print(f"{YELLOW}No results to export{RESET}") 793 | 794 | # Also export as JSON for easier parsing 795 | json_file = self.output_dir / f"benchmark_results_{timestamp}.json" 796 | with open(json_file, "w") as f: 797 | json.dump(result_dicts, f, indent=2) 798 | 799 | print(f"{GREEN}Results also exported to {json_file}{RESET}") 800 | 801 | return csv_file, json_file 802 | 803 | def generate_summary(self): 804 | """Generate summary statistics from all benchmark runs""" 805 | if not self.results: 806 | print(f"{YELLOW}No results to summarize{RESET}") 807 | return 808 | 809 | print(f"\n{GREEN}=== BENCHMARK SUMMARY ==={RESET}") 810 | 811 | # Check if we have any successful measurements 812 | has_measurements = False 813 | for result in self.results: 814 | if result.kv_cache_mb > 0 or result.throughput_tokens_per_sec > 0 or result.perplexity > 0: 815 | has_measurements = True 816 | break 817 | 818 | if not has_measurements: 819 | print(f"{RED}No successful measurements were captured in the benchmark run.{RESET}") 820 | print(f"{YELLOW}This may be because:{RESET}") 821 | print("1. The KVSplit patch wasn't properly applied") 822 | print("2. The parameters aren't recognized by llama.cpp") 823 | print("3. 
There was an issue with the benchmark command execution") 824 | print(f"\nCheck the log files in {self.output_dir} for detailed error messages") 825 | return 826 | 827 | # Group results by configuration and sequence length 828 | grouped_results = {} 829 | for result in self.results: 830 | key = (result.config_name, result.sequence_length) 831 | if key not in grouped_results: 832 | grouped_results[key] = [] 833 | grouped_results[key].append(result) 834 | 835 | # Calculate summary statistics for each group 836 | summary_rows = [] 837 | for (config_name, seq_len), results in grouped_results.items(): 838 | # Calculate means and standard deviations 839 | vram_usage = [r.vram_usage_mb for r in results if r.vram_usage_mb > 0] 840 | kv_cache = [r.kv_cache_mb for r in results if r.kv_cache_mb > 0] 841 | throughput = [r.throughput_tokens_per_sec for r in results if r.throughput_tokens_per_sec > 0] 842 | perplexity = [r.perplexity for r in results if r.perplexity > 0] 843 | 844 | # Only calculate stats if we have data 845 | vram_mean = statistics.mean(vram_usage) if vram_usage else 0 846 | vram_stdev = statistics.stdev(vram_usage) if len(vram_usage) > 1 else 0 847 | kv_mean = statistics.mean(kv_cache) if kv_cache else 0 848 | kv_stdev = statistics.stdev(kv_cache) if len(kv_cache) > 1 else 0 849 | throughput_mean = statistics.mean(throughput) if throughput else 0 850 | throughput_stdev = statistics.stdev(throughput) if len(throughput) > 1 else 0 851 | perplexity_mean = statistics.mean(perplexity) if perplexity else 0 852 | perplexity_stdev = statistics.stdev(perplexity) if len(perplexity) > 1 else 0 853 | 854 | # Add to summary rows 855 | summary_rows.append({ 856 | "Configuration": config_name, 857 | "Sequence_Length": seq_len, 858 | "VRAM_Usage_MB_Mean": vram_mean, 859 | "VRAM_Usage_MB_StdDev": vram_stdev, 860 | "KV_Cache_MB_Mean": kv_mean, 861 | "KV_Cache_MB_StdDev": kv_stdev, 862 | "Throughput_Mean": throughput_mean, 863 | "Throughput_StdDev": throughput_stdev, 864 | "Perplexity_Mean": perplexity_mean, 865 | "Perplexity_StdDev": perplexity_stdev, 866 | "Sample_Count": len(results), 867 | }) 868 | 869 | # Export summary to CSV 870 | timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S") 871 | summary_file = self.output_dir / f"benchmark_summary_{timestamp}.csv" 872 | 873 | if summary_rows: 874 | headers = summary_rows[0].keys() 875 | with open(summary_file, "w", newline="") as f: 876 | writer = csv.DictWriter(f, fieldnames=headers) 877 | writer.writeheader() 878 | writer.writerows(summary_rows) 879 | 880 | print(f"{GREEN}Summary statistics exported to {summary_file}{RESET}") 881 | 882 | # Print summary to console 883 | print(f"\n{CYAN}Memory Usage Summary (KV Cache MB){RESET}") 884 | print(f"{'Configuration':<10} | {'128 tokens':<15} | {'2048 tokens':<15} | {'4096 tokens':<15} | {'8192 tokens':<15}") 885 | print("-" * 80) 886 | 887 | for config in CONFIGURATIONS: 888 | row = f"{config['name']:<10} | " 889 | for seq_len in SEQUENCE_LENGTHS: 890 | key = (config['name'], seq_len) 891 | if key in grouped_results: 892 | kv_cache = [r.kv_cache_mb for r in grouped_results[key] if r.kv_cache_mb > 0] 893 | if kv_cache: 894 | mean = statistics.mean(kv_cache) 895 | row += f"{mean:6.2f} MB | " 896 | else: 897 | row += f"{'N/A':<15} | " 898 | else: 899 | row += f"{'N/A':<15} | " 900 | print(row) 901 | 902 | print(f"\n{CYAN}Throughput Summary (tokens/sec){RESET}") 903 | print(f"{'Configuration':<10} | {'128 tokens':<15} | {'2048 tokens':<15} | {'4096 tokens':<15} | {'8192 tokens':<15}") 904 | print("-" * 
80) 905 | 906 | for config in CONFIGURATIONS: 907 | row = f"{config['name']:<10} | " 908 | for seq_len in SEQUENCE_LENGTHS: 909 | key = (config['name'], seq_len) 910 | if key in grouped_results: 911 | throughput = [r.throughput_tokens_per_sec for r in grouped_results[key] if r.throughput_tokens_per_sec > 0] 912 | if throughput: 913 | mean = statistics.mean(throughput) 914 | row += f"{mean:6.2f} t/s | " 915 | else: 916 | row += f"{'N/A':<15} | " 917 | else: 918 | row += f"{'N/A':<15} | " 919 | print(row) 920 | 921 | print(f"\n{CYAN}Perplexity Summary (lower is better){RESET}") 922 | print(f"{'Configuration':<10} | {'128 tokens':<15} | {'2048 tokens':<15} | {'4096 tokens':<15} | {'8192 tokens':<15}") 923 | print("-" * 80) 924 | 925 | for config in CONFIGURATIONS: 926 | row = f"{config['name']:<10} | " 927 | for seq_len in SEQUENCE_LENGTHS: 928 | key = (config['name'], seq_len) 929 | if key in grouped_results: 930 | perplexity = [r.perplexity for r in grouped_results[key] if r.perplexity > 0] 931 | if perplexity: 932 | mean = statistics.mean(perplexity) 933 | row += f"{mean:6.4f} | " 934 | else: 935 | row += f"{'N/A':<15} | " 936 | else: 937 | row += f"{'N/A':<15} | " 938 | print(row) 939 | 940 | # Print key insights 941 | print(f"\n{GREEN}Key Insights:{RESET}") 942 | print("1. K8V4 (8-bit keys, 4-bit values) typically provides a good balance of memory efficiency") 943 | print(" and quality, keeping key precision where it matters most.") 944 | print("2. K4V8 typically shows more quality degradation as keys are more sensitive to quantization.") 945 | print("3. Longer context lengths demonstrate more significant memory savings with mixed precision.") 946 | print("4. Memory measurements may show slight differences from theoretical calculations due to") 947 | print(" the 256B page alignment in the llama.cpp memory allocator.") 948 | print("5. 
Using the existing --cache-type-k and --cache-type-v parameters allows for split-precision") 949 | print(" KV cache without modifying the llama.cpp source code.") 950 | print() 951 | print(f"{GREEN}Full benchmark data and logs are available in: {self.output_dir}{RESET}") 952 | 953 | return summary_file 954 | 955 | def main(): 956 | parser = argparse.ArgumentParser(description="KVSplit Benchmark Tool") 957 | parser.add_argument("--base-dir", default=None, help="Base directory for the KVSplit project") 958 | parser.add_argument("--model", default=None, help="Path to the model file") 959 | parser.add_argument("--llama-exec", default=None, help="Path to the llama.cpp executable") 960 | parser.add_argument("--output-dir", default=None, help="Directory to store benchmark results") 961 | 962 | args = parser.parse_args() 963 | 964 | # Determine base directory if not specified 965 | if args.base_dir is None: 966 | args.base_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), "..")) 967 | 968 | # Set default paths based on base directory 969 | if args.output_dir is None: 970 | args.output_dir = os.path.join(args.base_dir, "results") 971 | 972 | if args.model is None: 973 | args.model = os.path.join(args.base_dir, "models", "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf") 974 | 975 | if args.llama_exec is None: 976 | # First try to find main directly 977 | llama_cpp_dir = os.path.join(args.base_dir, "llama.cpp") 978 | 979 | # Look in build/bin directory first (CMake build) 980 | candidate = os.path.join(llama_cpp_dir, "build", "bin", "main") 981 | if os.path.exists(candidate): 982 | args.llama_exec = candidate 983 | else: 984 | # Try llama-cli 985 | candidate = os.path.join(llama_cpp_dir, "build", "bin", "llama-cli") 986 | if os.path.exists(candidate): 987 | args.llama_exec = candidate 988 | else: 989 | # Try just main in llama.cpp dir (Make build) 990 | candidate = os.path.join(llama_cpp_dir, "main") 991 | if os.path.exists(candidate): 992 | args.llama_exec = candidate 993 | else: 994 | print(f"{RED}Error: Could not find llama.cpp executable. Please specify with --llama-exec{RESET}") 995 | sys.exit(1) 996 | 997 | print(f"{GREEN}KVSplit Benchmark{RESET}") 998 | print(f"Base directory: {args.base_dir}") 999 | print(f"Llama executable: {args.llama_exec}") 1000 | print(f"Model: {args.model}") 1001 | print(f"Output directory: {args.output_dir}") 1002 | 1003 | # Run benchmarks 1004 | benchmarker = Benchmarker(args.base_dir, args.llama_exec, args.model, args.output_dir) 1005 | benchmarker.run_all_benchmarks() 1006 | 1007 | if __name__ == "__main__": 1008 | main() 1009 | -------------------------------------------------------------------------------- /scripts/capture_memory.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # capture_memory.sh - Script to capture memory usage during inference 3 | # 4 | # This script captures screenshots at regular intervals and converts them 5 | # to a GIF, allowing you to visualize memory reduction in Activity Monitor 6 | # or other monitoring tools when running different KV cache configurations. 
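# For example, to capture one frame per second for a minute and render a
# 10 fps GIF (illustrative values; the flags are defined below):
#   ./scripts/capture_memory.sh --frames 60 --delay 1 --fps 10 --output k8v4_vs_fp16.gif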
7 | #
8 | # Usage:
9 | #   ./scripts/capture_memory.sh [options]
10 | #
11 | # Options:
12 | #   --frames N     Number of frames to capture (default: 30)
13 | #   --delay N      Delay between frames in seconds (default: 1)
14 | #   --fps N        Frames per second in the output GIF (default: 5)
15 | #   --output FILE  Output filename (default: memory_reduction.gif)
16 | 
17 | set -euo pipefail
18 | 
19 | # ANSI color codes
20 | GREEN='\033[0;32m'
21 | YELLOW='\033[1;33m'
22 | BLUE='\033[0;34m'
23 | RED='\033[0;31m'
24 | RESET='\033[0m'
25 | 
26 | # Default settings
27 | FRAMES=30
28 | DELAY=1
29 | FPS=5
30 | OUTPUT="memory_reduction.gif"
31 | FRAMES_DIR="capture_frames"
32 | 
33 | # Parse arguments
34 | while [[ $# -gt 0 ]]; do
35 |   case $1 in
36 |     --frames)
37 |       FRAMES="$2"
38 |       shift 2
39 |       ;;
40 |     --delay)
41 |       DELAY="$2"
42 |       shift 2
43 |       ;;
44 |     --fps)
45 |       FPS="$2"
46 |       shift 2
47 |       ;;
48 |     --output)
49 |       OUTPUT="$2"
50 |       shift 2
51 |       ;;
52 |     *)
53 |       echo -e "${RED}Error: Unknown option $1${RESET}"
54 |       exit 1
55 |       ;;
56 |   esac
57 | done
58 | 
59 | # Check for gifski
60 | if ! command -v gifski &> /dev/null; then
61 |   echo -e "${YELLOW}⚠️ gifski is not installed. Attempting to install it...${RESET}"
62 |   if command -v brew &> /dev/null; then
63 |     brew install gifski
64 |   else
65 |     echo -e "${RED}❌ Error: Homebrew is not installed. Please install gifski manually:${RESET}"
66 |     echo -e "${BLUE}   brew install gifski${RESET}"
67 |     exit 1
68 |   fi
69 | fi
70 | 
71 | echo -e "${GREEN}📹 Memory Usage Capture Tool${RESET}"
72 | echo -e "${BLUE}This script will capture ${FRAMES} screenshots at ${DELAY}-second intervals${RESET}"
73 | echo -e "${BLUE}and combine them into a GIF at ${FPS} frames per second.${RESET}"
74 | echo
75 | echo -e "${YELLOW}Instructions:${RESET}"
76 | echo -e "1. Open Activity Monitor (Applications > Utilities > Activity Monitor)"
77 | echo -e "2. Position it on your screen where it's clearly visible"
78 | echo -e "3. Sort by Memory usage (click the 'Memory' column header)"
79 | echo -e "4. Run your KVSplit commands in another terminal window"
80 | echo
81 | 
82 | # Ask for confirmation
83 | read -p "Press Enter when ready to start capturing, or Ctrl+C to cancel..."
84 | 
85 | # Create frames directory
86 | mkdir -p "$FRAMES_DIR"
87 | 
88 | # Start countdown
89 | echo -e "${YELLOW}Starting capture in:${RESET}"
90 | for i in {3..1}; do
91 |   echo -e "${YELLOW}$i...${RESET}"
92 |   sleep 1
93 | done
94 | 
95 | echo -e "${GREEN}🎬 Capturing started!${RESET}"
96 | 
97 | # Capture frames
98 | for ((i=1; i<=$FRAMES; i++)); do
99 |   echo -e "${BLUE}Capturing frame $i of $FRAMES${RESET}"
100 |   screencapture -x "$FRAMES_DIR/frame_$(printf "%03d" $i).png"
101 | 
102 |   # Show progress
103 |   percent=$((i * 100 / FRAMES))
104 |   completed=$((percent / 2))
105 |   remaining=$((50 - completed))
106 |   progress="["
107 |   for ((j=0; j<completed; j++)); do progress+="#"; done
108 |   for ((j=0; j<remaining; j++)); do progress+=" "; done
109 |   progress+="] ${percent}%"
110 |   echo -e "${GREEN}${progress}${RESET}"
111 | 
112 |   # Wait before grabbing the next frame
113 |   sleep "$DELAY"
114 | done
115 | 
116 | echo -e "${GREEN}✅ Capture complete!${RESET}"
117 | 
118 | # Combine the captured frames into a GIF with gifski
119 | echo -e "${BLUE}Converting ${FRAMES} frames to a GIF at ${FPS} fps...${RESET}"
120 | gifski --fps "$FPS" -o "$OUTPUT" "$FRAMES_DIR"/frame_*.png
121 | 
122 | # Clean up the captured frames
123 | rm -rf "$FRAMES_DIR"
124 | 
125 | echo -e "${GREEN}🎞️ Done! GIF saved to ${OUTPUT}${RESET}"
126 | 
-------------------------------------------------------------------------------- /scripts/install_kvsplit.sh: --------------------------------------------------------------------------------
1 | #!/bin/bash
2 | # install_kvsplit.sh - One-step installer for KVSplit
3 | #
4 | # Sets up everything needed to run KVSplit:
5 | #   - Homebrew dependencies (cmake, parallel, gifski, python)
6 | #   - A Python environment for the analysis scripts
7 | #   - llama.cpp, cloned, patched, and built with Metal support
8 | #   - An optional small test model plus perplexity test data
9 | #
10 | # Usage:
11 | #   ./scripts/install_kvsplit.sh
12 | 
13 | set -euo pipefail
14 | 
15 | # ANSI color codes
16 | GREEN='\033[0;32m'
17 | YELLOW='\033[1;33m'
18 | BLUE='\033[0;34m'
19 | RED='\033[0;31m'
20 | RESET='\033[0m'
21 | 
22 | # Track whether the optional Python setup succeeded
23 | PYTHON_SETUP_FAILED=""
24 | 
25 | echo -e "${GREEN}🚀 KVSplit Installer${RESET}"
26 | echo -e "${BLUE}Differentiated KV cache quantization for llama.cpp${RESET}"
27 | echo
28 | 
29 | # Run from the repository root so relative paths resolve correctly
30 | cd "$(dirname "$0")/.."
31 | 
32 | # Make sure the expected directories exist
33 | mkdir -p models patch results plots
34 | 
35 | # Install build and tooling dependencies with Homebrew
36 | echo -e "${BLUE}Installing dependencies via Homebrew...${RESET}"
37 | 
38 | # Bail out early with instructions if Homebrew itself is missing,
39 | # since everything below assumes brew is on the PATH
40 | if ! command -v brew >/dev/null 2>&1; then
41 |   echo -e "${YELLOW}⚠️ Homebrew not found. Please install Homebrew first:${RESET}"
42 |   echo -e "${YELLOW}   /bin/bash -c \"\$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)\"${RESET}"
43 |   exit 1
44 | fi
45 | 
46 | brew install cmake parallel gifski python || {
47 |   echo -e "${YELLOW}⚠️ Some Homebrew dependencies couldn't be installed.${RESET}"
48 |   echo -e "${YELLOW}   This might be due to them already being installed or another issue.${RESET}"
49 |   echo -e "${YELLOW}   Continuing with installation...${RESET}"
50 | }
51 | 
52 | # Python environment setup
53 | echo -e "${BLUE}Python environment setup options:${RESET}"
54 | echo -e "1.
${GREEN}Virtual Environment:${RESET} Create a new Python venv in this directory" 55 | echo -e "2. ${GREEN}System Python:${RESET} Use system Python installation" 56 | echo -e "3. ${GREEN}Skip:${RESET} Skip Python setup (manual setup later)" 57 | read -p "Select option [1/2/3] (default: 1): " -n 1 -r PYTHON_CHOICE 58 | echo 59 | 60 | case $PYTHON_CHOICE in 61 | "2") 62 | echo -e "${BLUE}Using system Python...${RESET}" 63 | PYTHON_CMD="python3" 64 | PIP_CMD="pip3" 65 | 66 | # Check if Python is available 67 | if ! command -v $PYTHON_CMD &> /dev/null; then 68 | echo -e "${RED}❌ Python not found. Please install Python 3.${RESET}" 69 | PYTHON_SETUP_FAILED=true 70 | else 71 | echo -e "${GREEN}✅ Using Python: $($PYTHON_CMD --version)${RESET}" 72 | fi 73 | 74 | # Check if dependencies are already installed 75 | echo -e "${BLUE}Checking for required Python packages...${RESET}" 76 | MISSING_PACKAGES="" 77 | for pkg in pandas numpy matplotlib seaborn; do 78 | if ! $PYTHON_CMD -c "import $pkg" &> /dev/null; then 79 | MISSING_PACKAGES="$MISSING_PACKAGES $pkg" 80 | fi 81 | done 82 | 83 | if [ -n "$MISSING_PACKAGES" ]; then 84 | echo -e "${YELLOW}⚠️ Missing packages:$MISSING_PACKAGES${RESET}" 85 | echo -e "${YELLOW}You can install them with: $PIP_CMD install$MISSING_PACKAGES${RESET}" 86 | read -p "Install missing packages now? (y/n) " -n 1 -r 87 | echo 88 | if [[ $REPLY =~ ^[Yy]$ ]]; then 89 | $PIP_CMD install $MISSING_PACKAGES || { 90 | echo -e "${RED}❌ Failed to install packages.${RESET}" 91 | echo -e "${YELLOW}You may need to install them manually:${RESET}" 92 | echo -e "${YELLOW}$PIP_CMD install pandas numpy matplotlib seaborn${RESET}" 93 | } 94 | fi 95 | else 96 | echo -e "${GREEN}✅ All required packages are installed.${RESET}" 97 | fi 98 | ;; 99 | "3") 100 | echo -e "${YELLOW}Skipping Python setup...${RESET}" 101 | echo -e "${YELLOW}You'll need to manually install these packages to use visualization tools:${RESET}" 102 | echo -e "${YELLOW}- pandas${RESET}" 103 | echo -e "${YELLOW}- numpy${RESET}" 104 | echo -e "${YELLOW}- matplotlib${RESET}" 105 | echo -e "${YELLOW}- seaborn${RESET}" 106 | ;; 107 | *) 108 | # Default is virtual environment 109 | echo -e "${BLUE}Setting up Python virtual environment...${RESET}" 110 | if ! command -v python3 &> /dev/null; then 111 | echo -e "${RED}❌ Python not found. Please install Python 3.${RESET}" 112 | PYTHON_SETUP_FAILED=true 113 | elif ! python3 -m venv venv; then 114 | echo -e "${YELLOW}⚠️ Could not create Python virtual environment.${RESET}" 115 | echo -e "${YELLOW}Continuing without virtual environment...${RESET}" 116 | PYTHON_SETUP_FAILED=true 117 | else 118 | echo -e "${GREEN}✅ Virtual environment created.${RESET}" 119 | echo -e "${BLUE}Installing Python dependencies...${RESET}" 120 | source venv/bin/activate 121 | pip install --upgrade pip 122 | pip install pandas numpy matplotlib seaborn || { 123 | echo -e "${YELLOW}⚠️ Could not install Python dependencies.${RESET}" 124 | echo -e "${YELLOW}You may need to install them manually later.${RESET}" 125 | PYTHON_SETUP_FAILED=true 126 | } 127 | 128 | if [ -z "$PYTHON_SETUP_FAILED" ]; then 129 | echo -e "${GREEN}✅ Python dependencies installed successfully.${RESET}" 130 | echo -e "${YELLOW}To activate the virtual environment in the future, run:${RESET}" 131 | echo -e "${YELLOW}source venv/bin/activate${RESET}" 132 | fi 133 | fi 134 | ;; 135 | esac 136 | 137 | if [ -n "$PYTHON_SETUP_FAILED" ]; then 138 | echo -e "${YELLOW}Python setup incomplete. 
Visualization tools may not work.${RESET}" 139 | echo -e "${YELLOW}You can manually install required packages later.${RESET}" 140 | fi 141 | 142 | # Setup method selection 143 | echo -e "${BLUE}Choose llama.cpp setup method:${RESET}" 144 | echo -e "1. ${GREEN}Standard:${RESET} Clone and patch llama.cpp (recommended for most users)" 145 | echo -e "2. ${GREEN}Git Submodule:${RESET} Use a forked llama.cpp as a submodule (advanced)" 146 | read -p "Select option [1/2] (default: 1): " -n 1 -r SETUP_CHOICE 147 | echo 148 | 149 | if [[ $SETUP_CHOICE == "2" ]]; then 150 | echo -e "${BLUE}Setting up llama.cpp as a git submodule...${RESET}" 151 | 152 | # Check if this is a git repository 153 | if [ ! -d ".git" ]; then 154 | echo -e "${YELLOW}⚠️ This directory is not a git repository. Initializing git...${RESET}" 155 | git init 156 | fi 157 | 158 | # Remove existing llama.cpp if present 159 | if [ -d "llama.cpp" ]; then 160 | echo -e "${YELLOW}⚠️ Removing existing llama.cpp directory...${RESET}" 161 | rm -rf llama.cpp 162 | fi 163 | 164 | # Add the forked llama.cpp as a submodule 165 | # Note: You would typically fork llama.cpp to your own account and modify it there 166 | echo -e "${BLUE}Adding llama.cpp as a submodule...${RESET}" 167 | echo -e "${YELLOW}In a real setup, you would use your own fork with KVSplit changes already applied.${RESET}" 168 | git submodule add https://github.com/ggerganov/llama.cpp.git 169 | git submodule update --init --recursive 170 | 171 | # Apply the patch to the submodule 172 | echo -e "${BLUE}Applying KV split patch to llama.cpp submodule...${RESET}" 173 | cd llama.cpp 174 | git apply ../patch/fixed_kv_patch.diff || echo -e "${YELLOW}⚠️ Patch application failed, you may need to modify the patch.${RESET}" 175 | cd .. 176 | 177 | echo -e "${GREEN}✅ Submodule setup complete.${RESET}" 178 | echo -e "${YELLOW}Note: In a real-world scenario, you would fork llama.cpp, make your changes there,${RESET}" 179 | echo -e "${YELLOW} and use your fork as the submodule URL instead of applying patches.${RESET}" 180 | else 181 | # Standard clone and patch approach 182 | echo -e "${BLUE}Setting up llama.cpp (standard method)...${RESET}" 183 | if [ -d "llama.cpp" ]; then 184 | echo -e "${YELLOW}⚠️ llama.cpp directory already exists.${RESET}" 185 | read -p "Update llama.cpp repository? (y/n) " -n 1 -r 186 | echo 187 | if [[ $REPLY =~ ^[Yy]$ ]]; then 188 | echo -e "${BLUE}Updating llama.cpp...${RESET}" 189 | cd llama.cpp 190 | git fetch --all 191 | git reset --hard origin/master 192 | cd .. 193 | fi 194 | else 195 | echo -e "${BLUE}Cloning llama.cpp repository...${RESET}" 196 | git clone https://github.com/ggerganov/llama.cpp 197 | fi 198 | 199 | # Apply the patch to llama.cpp 200 | echo -e "${BLUE}Applying KV split patch to llama.cpp...${RESET}" 201 | cd llama.cpp 202 | git apply ../patch/fixed_kv_patch.diff || echo -e "${YELLOW}⚠️ Patch application failed or patch already applied.${RESET}" 203 | cd .. 204 | fi 205 | 206 | # Check if the KV split patch exists 207 | echo -e "${BLUE}Setting up KV split patch...${RESET}" 208 | if [ ! -f "patch/fixed_kv_patch.diff" ]; then 209 | echo -e "${YELLOW}⚠️ KV patch not found, copying from included patch...${RESET}" 210 | # Copy the fixed patch that works with current llama.cpp 211 | if [ -f "patch/split_kv_quant.diff" ]; then 212 | cp patch/split_kv_quant.diff patch/fixed_kv_patch.diff 213 | else 214 | echo -e "${RED}❌ No patch files found! 
Your installation may not have KV split functionality.${RESET}"
215 |     mkdir -p patch
216 |     # Include a minimal version of the patch inline as a fallback
217 |     cat > patch/fixed_kv_patch.diff << 'EOL'
218 | diff --git a/common/common.cpp b/common/common.cpp
219 | index abcdef1..1234567 100644
220 | --- a/common/common.cpp
221 | +++ b/common/common.cpp
222 | @@ -1290,6 +1290,30 @@ struct cli_params {
223 |          "KV cache quantization for keys. If not specified, defaults to F16",
224 |          {"--cache-type-k", "-ctk"}
225 |      );
226 | +
227 | +    add_param(
228 | +        &params.cache_type_v,
229 | +        [](enum llama_kv_cache_type & val, const std::string & arg) {
230 | +            val = llama_model_kv_cache_type_from_str(arg.c_str());
231 | +            if (val == LLAMA_KV_CACHE_TYPE_COUNT) {
232 | +                return CLI_PARAM_CONVERSION_ERROR;
233 | +            }
234 | +            return CLI_PARAM_CONVERSION_OK;
235 | +        },
236 | +        "KV cache quantization for values. If not specified, defaults to F16",
237 | +        {"--cache-type-v", "-ctv"}
238 | +    );
239 | +
240 | +    // Combined KV cache quantization (sets both key and value)
241 | +    add_param(
242 | +        [&](const std::string & arg) {
243 | +            enum llama_kv_cache_type val = llama_model_kv_cache_type_from_str(arg.c_str());
244 | +            if (val == LLAMA_KV_CACHE_TYPE_COUNT) {
245 | +                return CLI_PARAM_CONVERSION_ERROR;
246 | +            }
247 | +            params.cache_type_k = params.cache_type_v = val;
248 | +            return CLI_PARAM_CONVERSION_OK;
249 | +        },
250 | +        "--kvq", "-kvq"
251 | +    );
252 |  }
253 | EOL
254 |   fi
255 | fi
256 | 
257 | # Continue with build process
258 | 
259 | # Build llama.cpp with Metal support
260 | echo -e "${BLUE}Building llama.cpp with Metal support...${RESET}"
261 | cd llama.cpp 2>/dev/null || true  # Only cd if not already in llama.cpp
262 | mkdir -p build
263 | cd build
264 | cmake .. -DLLAMA_METAL=ON
265 | cmake --build . --config Release -j
266 | cd ../..
267 | 
268 | # Download a small test model if no models exist
269 | echo -e "${BLUE}Checking for test models...${RESET}"
270 | if [ ! "$(ls -A models 2>/dev/null)" ]; then
271 |   echo -e "${BLUE}No models found. Would you like to download TinyLlama for testing?${RESET}"
272 |   read -p "(y/n) " -n 1 -r
273 |   echo
274 |   if [[ $REPLY =~ ^[Yy]$ ]]; then
275 |     echo -e "${BLUE}Downloading TinyLlama model...${RESET}"
276 |     TINYLLAMA_URL="https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"
277 |     MODEL_NAME="tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"
278 |     curl -L -o models/$MODEL_NAME $TINYLLAMA_URL
279 |     echo -e "${GREEN}Model downloaded to models/$MODEL_NAME${RESET}"
280 |   fi
281 | fi
282 | 
283 | # Set up perplexity test data
284 | echo -e "${BLUE}Setting up perplexity test data...${RESET}"
285 | cat > perplexity_test_data.txt << 'EOL'
286 | The importance of efficient memory usage in language models cannot be overstated.
287 | As context lengths grow longer, the KV cache becomes a significant bottleneck.
288 | By applying different precision to keys and values, we can achieve substantial
289 | memory savings without compromising model quality. This approach is particularly
290 | beneficial for consumer devices like Apple Silicon Macs, where memory constraints
291 | are more pronounced. Through careful benchmarking, we've found that 8-bit keys
292 | combined with 4-bit values offers an excellent balance of efficiency and quality.
293 | EOL
294 | 
295 | echo -e "${GREEN}✅ KVSplit installed successfully!${RESET}"
296 | echo -e "${BLUE}Directory structure:${RESET}"
297 | echo -e "  ${YELLOW}./llama.cpp/build/bin/${RESET} - Compiled binaries"
298 | echo -e "  ${YELLOW}./models/${RESET} - LLM model files"
299 | echo -e "  ${YELLOW}./scripts/${RESET} - Utility scripts"
300 | echo -e "  ${YELLOW}./results/${RESET} - Benchmark results"
301 | echo -e "  ${YELLOW}./plots/${RESET} - Visualization outputs"
302 | echo -e ""
303 | echo -e "${GREEN}Recommended usage:${RESET}"
304 | echo -e "${YELLOW}# Run inference with K8V4 (recommended configuration)${RESET}"
305 | echo -e "./llama.cpp/build/bin/llama-cli -m models/your-model.gguf -p \"Your prompt\" --kvq-key 8 --kvq-val 4 --flash-attn"
306 | echo -e ""
307 | echo -e "${YELLOW}# Run quick comparison test${RESET}"
308 | echo -e "./scripts/quick_compare.py --model models/your-model.gguf"
309 | echo -e ""
310 | echo -e "${YELLOW}# Run full benchmark${RESET}"
311 | echo -e "python scripts/benchmark_kvsplit.py"
312 | echo -e ""
313 | echo -e "${GREEN}Thank you for using KVSplit!${RESET}"
314 | 
-------------------------------------------------------------------------------- /scripts/quick_compare.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | """
3 | quick_compare.py - Runs a quick comparison of different KV quantization settings
4 | 
5 | This script provides a simple way to compare different KV cache quantization
6 | configurations for llama.cpp models. It shows memory usage, speed, and quality
7 | metrics in an easy-to-understand table format.
8 | 
9 | Usage:
10 |     python quick_compare.py --model ~/models/your-model.gguf --prompt "Your test prompt"
11 | """
12 | 
13 | import argparse
14 | import subprocess
15 | import re
16 | import os
17 | import json
18 | import tempfile
19 | import time
20 | from pathlib import Path
21 | import sys
22 | 
23 | # ANSI color codes
24 | GREEN = '\033[0;32m'
25 | YELLOW = '\033[1;33m'
26 | BLUE = '\033[0;34m'
27 | RED = '\033[0;31m'
28 | RESET = '\033[0m'
29 | 
30 | def print_color(color, message):
31 |     """Print colored message to console"""
32 |     print(f"{color}{message}{RESET}")
33 | 
34 | def create_temp_prompt_file(prompt, min_length=200):
35 |     """Create a temporary file with sufficient prompt text for perplexity testing"""
36 |     # Ensure the prompt is long enough for meaningful perplexity testing
37 |     if len(prompt) < min_length:
38 |         # Extend the prompt with philosophical content
39 |         extension = """
40 | The concept of memory efficiency in language models is fundamentally important.
41 | As we process longer contexts, the memory footprint becomes a critical constraint.
42 | By applying different precision to attention mechanisms, we can achieve significant
43 | savings without compromising the quality of generated text or understanding.
44 | This approach is particularly valuable for resource-constrained environments.
45 | """
46 |         prompt = prompt + extension
47 | 
48 |     # Write to temp file
49 |     fd, temp_path = tempfile.mkstemp(text=True)
50 |     with os.fdopen(fd, 'w') as f:
51 |         f.write(prompt)
52 | 
53 |     return temp_path
54 | 
55 | def parse_memory_from_output(output):
56 |     """Parse memory usage metrics from llama.cpp output"""
57 |     try:
58 |         # Look for KV cache memory usage
59 |         kv_cache_mb = None
60 |         kv_match = re.search(r'KV cache elements: \d+.*?(\d+(?:\.\d+)?)
MiB', output, re.MULTILINE) 61 | if kv_match: 62 | kv_cache_mb = float(kv_match.group(1)) 63 | 64 | # Look for VRAM usage (try multiple patterns) 65 | vram_mb = None 66 | patterns = [ 67 | r'VRAM usage: (\d+(?:\.\d+)?) MiB', 68 | r'GPU memory used: \d+ bytes = (\d+(?:\.\d+)?) MB' 69 | ] 70 | for pattern in patterns: 71 | match = re.search(pattern, output, re.MULTILINE) 72 | if match: 73 | vram_mb = float(match.group(1)) 74 | break 75 | 76 | return kv_cache_mb, vram_mb 77 | 78 | except Exception as e: 79 | print_color(RED, f"Error parsing memory metrics: {e}") 80 | return None, None 81 | 82 | def parse_speed_from_output(output): 83 | """Parse speed metrics from llama.cpp output""" 84 | try: 85 | # Parse tokens per second 86 | speed_match = re.search(r'(\d+(\.\d+)?) tokens/sec', output, re.MULTILINE) 87 | if speed_match: 88 | tokens_per_sec = float(speed_match.group(1)) 89 | return tokens_per_sec 90 | 91 | # Alternative pattern 92 | speed_match = re.search(r'eval time: (\d+\.\d+) ms \((\d+\.\d+) tokens/sec\)', output, re.MULTILINE) 93 | if speed_match: 94 | tokens_per_sec = float(speed_match.group(2)) 95 | return tokens_per_sec 96 | 97 | return None 98 | except Exception as e: 99 | print_color(RED, f"Error parsing speed metrics: {e}") 100 | return None 101 | 102 | def parse_perplexity(output): 103 | """Parse perplexity from llama-perplexity output""" 104 | try: 105 | perplexity_match = re.search(r'perplexity: (\d+\.\d+)', output, re.MULTILINE | re.IGNORECASE) 106 | if perplexity_match: 107 | return float(perplexity_match.group(1)) 108 | 109 | # Alternative pattern 110 | perplexity_match = re.search(r'final\s+(?:avg)?\s*perplexity: (\d+\.\d+)', output, re.MULTILINE | re.IGNORECASE) 111 | if perplexity_match: 112 | return float(perplexity_match.group(1)) 113 | 114 | return None 115 | except Exception as e: 116 | print_color(RED, f"Error parsing perplexity: {e}") 117 | return None 118 | 119 | def run_comparison(model_path, prompt, seq_len=2048, num_threads=8): 120 | """Run a comparison of different KV quantization settings""" 121 | configs = [ 122 | {"name": "FP16", "args": "--ctx-size {seq_len}", "desc": "Baseline (16-bit)"}, 123 | {"name": "K8V8", "args": "--ctx-size {seq_len} --kvq 8", "desc": "8-bit keys & values"}, 124 | {"name": "K8V4", "args": "--ctx-size {seq_len} --kvq-key 8 --kvq-val 4", "desc": "8-bit keys, 4-bit values (RECOMMENDED)"}, 125 | {"name": "K4V8", "args": "--ctx-size {seq_len} --kvq-key 4 --kvq-val 8", "desc": "4-bit keys, 8-bit values"}, 126 | {"name": "K4V4", "args": "--ctx-size {seq_len} --kvq 4", "desc": "4-bit keys & values"}, 127 | ] 128 | 129 | # Validate the model path 130 | model_path = os.path.expanduser(model_path) 131 | if not os.path.exists(model_path): 132 | print_color(RED, f"Error: Model file not found at {model_path}") 133 | sys.exit(1) 134 | 135 | # Create a temporary prompt file for perplexity testing 136 | prompt_file = create_temp_prompt_file(prompt) 137 | 138 | # Get the base directory 139 | base_dir = Path(__file__).parent.parent.absolute() 140 | llama_cli_path = base_dir / "llama.cpp" / "build" / "bin" / "llama-cli" 141 | llama_perplexity_path = base_dir / "llama.cpp" / "build" / "bin" / "llama-perplexity" 142 | 143 | # Validate the binaries 144 | if not os.path.exists(llama_cli_path): 145 | print_color(RED, f"Error: llama-cli binary not found at {llama_cli_path}") 146 | print_color(YELLOW, "Did you run the install_kvsplit.sh script?") 147 | sys.exit(1) 148 | 149 | if not os.path.exists(llama_perplexity_path): 150 | print_color(RED, f"Error: 
llama-perplexity binary not found at {llama_perplexity_path}")
151 |         print_color(YELLOW, "Did you run the install_kvsplit.sh script?")
152 |         sys.exit(1)
153 | 
154 |     results = []
155 |     fp16_perplexity = None  # For calculating relative perplexity
156 | 
157 |     print_color(GREEN, "Running quick comparison of KV cache configurations:")
158 |     print_color(BLUE, f"Model: {model_path}")
159 |     print_color(BLUE, f"Context size: {seq_len} tokens")
160 |     print_color(BLUE, f"Threads: {num_threads}")
161 |     print()
162 | 
163 |     for i, config in enumerate(configs):
164 |         config_name = config["name"]
165 |         print_color(YELLOW, f"[{i+1}/{len(configs)}] Testing {config_name}: {config['desc']}")
166 | 
167 |         try:
168 |             # Format the args string with the sequence length
169 |             args = config["args"].format(seq_len=seq_len)
170 | 
171 |             # Run inference to measure memory and speed
172 |             inference_cmd = f"{llama_cli_path} -m {model_path} {args} -p \"{prompt[:50]}\" -n 50 -t {num_threads} --flash-attn"
173 |             print_color(BLUE, f"Running: {inference_cmd}")
174 | 
175 |             try:
176 |                 inference_output = subprocess.check_output(
177 |                     inference_cmd, shell=True, stderr=subprocess.STDOUT
178 |                 ).decode('utf-8', errors='ignore')
179 |             except subprocess.CalledProcessError as e:
180 |                 inference_output = e.output.decode('utf-8', errors='ignore')
181 |                 print_color(RED, f"Command failed with exit code {e.returncode}")
182 |                 print_color(RED, inference_output)
183 |                 continue
184 | 
185 |             # Run perplexity test (flash attention is required for quantized KV caches)
186 |             perplexity_cmd = f"{llama_perplexity_path} -m {model_path} {args} -f {prompt_file} -t {num_threads} --flash-attn"
187 |             print_color(BLUE, f"Running perplexity test: {perplexity_cmd}")
188 | 
189 |             try:
190 |                 perplexity_output = subprocess.check_output(
191 |                     perplexity_cmd, shell=True, stderr=subprocess.STDOUT
192 |                 ).decode('utf-8', errors='ignore')
193 |             except subprocess.CalledProcessError as e:
194 |                 perplexity_output = e.output.decode('utf-8', errors='ignore')
195 |                 print_color(RED, f"Perplexity command failed with exit code {e.returncode}")
196 |                 print_color(RED, perplexity_output)
197 |                 perplexity_output = ""
198 | 
199 |             # Parse metrics
200 |             kv_cache_mb, vram_mb = parse_memory_from_output(inference_output)
201 |             tokens_per_sec = parse_speed_from_output(inference_output)
202 |             perplexity = parse_perplexity(perplexity_output)
203 | 
204 |             # Store FP16 perplexity as baseline
205 |             if config_name == "FP16" and perplexity is not None:
206 |                 fp16_perplexity = perplexity
207 | 
208 |             # Calculate perplexity change
209 |             perplexity_change = None
210 |             if fp16_perplexity is not None and perplexity is not None:
211 |                 perplexity_change = ((perplexity - fp16_perplexity) / fp16_perplexity) * 100
212 | 
213 |             results.append({
214 |                 "Configuration": config_name,
215 |                 "Description": config["desc"],
216 |                 "KV_Cache_MB": kv_cache_mb,
217 |                 "VRAM_MB": vram_mb,
218 |                 "Tokens_per_sec": tokens_per_sec,
219 |                 "Perplexity": perplexity,
220 |                 "Perplexity_Change_Pct": perplexity_change
221 |             })
222 | 
223 |             print_color(GREEN, f"Completed {config_name} test")
224 |             if kv_cache_mb is not None:
225 |                 print_color(GREEN, f"  KV Cache: {kv_cache_mb:.2f} MB")
226 |             if vram_mb is not None:
227 |                 print_color(GREEN, f"  VRAM: {vram_mb:.2f} MB")
228 |             if tokens_per_sec is not None:
229 |                 print_color(GREEN, f"  Speed: {tokens_per_sec:.2f} tokens/sec")
230 |             if perplexity is not None:
231 |                 print_color(GREEN, f"  Perplexity: {perplexity:.4f}")
232 |             if perplexity_change is not None:
233 |                 change_color = GREEN if perplexity_change < 1.0 else (YELLOW if perplexity_change < 5.0 else
RED) 234 | print_color(change_color, f" Quality impact: {perplexity_change:+.2f}% vs FP16") 235 | 236 | print() 237 | time.sleep(1) # Brief pause between tests 238 | 239 | except Exception as e: 240 | print_color(RED, f"Error running {config_name} test: {e}") 241 | 242 | # Clean up the temporary file 243 | try: 244 | os.unlink(prompt_file) 245 | except: 246 | pass 247 | 248 | # Calculate savings percentages 249 | if len(results) > 0 and "FP16" in [r["Configuration"] for r in results]: 250 | fp16_result = next(r for r in results if r["Configuration"] == "FP16") 251 | fp16_kv = fp16_result.get("KV_Cache_MB") 252 | fp16_vram = fp16_result.get("VRAM_MB") 253 | 254 | if fp16_kv is not None: 255 | for result in results: 256 | kv = result.get("KV_Cache_MB") 257 | if kv is not None and result["Configuration"] != "FP16": 258 | result["KV_Savings_Pct"] = (1 - kv / fp16_kv) * 100 259 | 260 | if fp16_vram is not None: 261 | for result in results: 262 | vram = result.get("VRAM_MB") 263 | if vram is not None and result["Configuration"] != "FP16": 264 | result["VRAM_Savings_Pct"] = (1 - vram / fp16_vram) * 100 265 | 266 | # Display results as a table 267 | if len(results) > 0: 268 | print_color(GREEN, "📊 KVSplit Comparison Results:") 269 | print() 270 | 271 | # Header 272 | print(f"{'Configuration':<12} {'KV Cache':<15} {'VRAM':<15} {'Speed':<15} {'Quality':<15} {'Description':<30}") 273 | print("-" * 100) 274 | 275 | # Rows 276 | for result in results: 277 | config = result.get("Configuration", "") 278 | 279 | # KV cache column 280 | kv_cache = f"{result.get('KV_Cache_MB', 'N/A'):.2f} MB" if result.get('KV_Cache_MB') else "N/A" 281 | if "KV_Savings_Pct" in result: 282 | kv_cache += f" (-{result['KV_Savings_Pct']:.1f}%)" 283 | 284 | # VRAM column 285 | vram = f"{result.get('VRAM_MB', 'N/A'):.2f} MB" if result.get('VRAM_MB') else "N/A" 286 | if "VRAM_Savings_Pct" in result: 287 | vram += f" (-{result['VRAM_Savings_Pct']:.1f}%)" 288 | 289 | # Speed column 290 | speed = f"{result.get('Tokens_per_sec', 'N/A'):.1f} t/s" if result.get('Tokens_per_sec') else "N/A" 291 | 292 | # Quality column 293 | quality = "" 294 | if result.get('Perplexity') is not None: 295 | quality = f"{result.get('Perplexity'):.4f}" 296 | if result.get('Perplexity_Change_Pct') is not None and config != "FP16": 297 | quality += f" ({result['Perplexity_Change_Pct']:+.2f}%)" 298 | else: 299 | quality = "N/A" 300 | 301 | print(f"{config:<12} {kv_cache:<15} {vram:<15} {speed:<15} {quality:<15} {result.get('Description', ''):<30}") 302 | 303 | print() 304 | print_color(GREEN, "Interpretation:") 305 | print_color(BLUE, "- Lower KV Cache and VRAM values are better (more memory efficient)") 306 | print_color(BLUE, "- Higher Speed values are better (faster inference)") 307 | print_color(BLUE, "- Lower Perplexity values and smaller % changes are better (higher quality)") 308 | print() 309 | print_color(GREEN, "Recommendation:") 310 | print_color(YELLOW, "K8V4 (8-bit keys, 4-bit values) typically offers the best balance of memory savings and quality.") 311 | print() 312 | 313 | # Save results as JSON 314 | try: 315 | base_dir = Path(__file__).parent.parent.absolute() 316 | results_dir = base_dir / "results" 317 | os.makedirs(results_dir, exist_ok=True) 318 | timestamp = time.strftime("%Y%m%d_%H%M%S") 319 | json_path = results_dir / f"quick_compare_{timestamp}.json" 320 | 321 | with open(json_path, 'w') as f: 322 | json.dump(results, f, indent=2) 323 | 324 | print_color(GREEN, f"Results saved to {json_path}") 325 | except Exception as e: 326 | 
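327 | 
328 | # Shape of the saved JSON, for downstream tooling (keys as built above; values illustrative):
329 | #   [{"Configuration": "K8V4", "Description": ..., "KV_Cache_MB": ..., "VRAM_MB": ...,
330 | #     "Tokens_per_sec": ..., "Perplexity": ..., "Perplexity_Change_Pct": ...}, ...]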
print_color(RED, f"Error saving results: {e}") 327 | 328 | def main(): 329 | """Main function""" 330 | parser = argparse.ArgumentParser(description="Compare KV quantization settings") 331 | parser.add_argument("--model", required=True, help="Path to the model file") 332 | parser.add_argument("--prompt", default="The theory of quantum mechanics explains how particles behave at the atomic and subatomic levels. This counterintuitive framework has revolutionized our understanding of physics.", 333 | help="Test prompt") 334 | parser.add_argument("--seq-len", type=int, default=2048, help="Sequence length to test") 335 | parser.add_argument("--threads", type=int, default=8, help="Number of threads to use") 336 | args = parser.parse_args() 337 | 338 | run_comparison(args.model, args.prompt, args.seq_len, args.threads) 339 | 340 | if __name__ == "__main__": 341 | main() 342 | -------------------------------------------------------------------------------- /scripts/run_benchmark.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | set -euo pipefail 3 | 4 | # Get the directory of this script 5 | SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)" 6 | LLAMA_CPP_DIR="${SCRIPT_DIR}/llama.cpp" 7 | MODEL_PATH="${SCRIPT_DIR}/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf" 8 | RESULTS_DIR="${SCRIPT_DIR}/results" 9 | 10 | # Create results directory if it doesn't exist 11 | mkdir -p "${RESULTS_DIR}" 12 | 13 | # Check if model exists 14 | if [ ! -f "${MODEL_PATH}" ]; then 15 | echo "Error: Model not found at ${MODEL_PATH}" 16 | exit 1 17 | fi 18 | 19 | # Change to llama.cpp directory 20 | cd "${LLAMA_CPP_DIR}" || exit 1 21 | 22 | # Test configurations 23 | CONFIGS=( 24 | "FP16" 25 | "K8V8" 26 | "K8V4" 27 | "K4V8" 28 | "K4V4" 29 | ) 30 | 31 | # Sequence lengths to test 32 | SEQUENCE_LENGTHS=(128 2048 4096 8192) 33 | 34 | # Run benchmarks 35 | for seq_len in "${SEQUENCE_LENGTHS[@]}"; do 36 | echo -e "\n=== Benchmarking sequence length: ${seq_len} ===" 37 | 38 | for config in "${CONFIGS[@]}"; do 39 | echo -e "\n--- Testing ${config} ---" 40 | 41 | # Set KV cache parameters based on config 42 | case $config in 43 | "FP16") 44 | KV_ARGS="" 45 | ;; 46 | "K8V8") 47 | KV_ARGS="--kvq-key 8 --kvq-val 8" 48 | ;; 49 | "K8V4") 50 | KV_ARGS="--kvq-key 8 --kvq-val 4" 51 | ;; 52 | "K4V8") 53 | KV_ARGS="--kvq-key 4 --kvq-val 8" 54 | ;; 55 | "K4V4") 56 | KV_ARGS="--kvq-key 4 --kvq-val 4" 57 | ;; 58 | *) 59 | echo "Unknown config: ${config}" 60 | continue 61 | ;; 62 | esac 63 | 64 | # Run the benchmark 65 | OUTPUT_FILE="${RESULTS_DIR}/benchmark_${config}_seq${seq_len}.txt" 66 | echo "Running: ./main -m ${MODEL_PATH} -p \"Benchmarking KV cache performance\" -n ${seq_len} -t 8 -fa 0 ${KV_ARGS}" 67 | 68 | # Run with timeout to prevent hanging 69 | timeout 5m ./main -m "${MODEL_PATH}" -p "Benchmarking KV cache performance" \ 70 | -n ${seq_len} -t 8 -fa 0 ${KV_ARGS} 2>&1 | tee "${OUTPUT_FILE}" || \ 71 | echo "Warning: Benchmark for ${config} with seq_len=${seq_len} timed out or failed" 72 | done 73 | done 74 | 75 | echo -e "\n=== Benchmarking complete! Results saved to ${RESULTS_DIR} ===" 76 | -------------------------------------------------------------------------------- /scripts/test_kvsplit.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | set -euo pipefail 3 | 4 | # Get the directory of this script 5 | SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." 
&& pwd)" 6 | LLAMA_CPP_DIR="${SCRIPT_DIR}/llama.cpp" 7 | MODEL_PATH="${SCRIPT_DIR}/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf" 8 | 9 | # Check if model exists 10 | if [ ! -f "${MODEL_PATH}" ]; then 11 | echo "Error: Model not found at ${MODEL_PATH}" 12 | exit 1 13 | fi 14 | 15 | # Run a simple test 16 | cd "${LLAMA_CPP_DIR}" || exit 1 17 | 18 | # Test with default settings 19 | echo -e "\n=== Testing with default settings (FP16) ===" 20 | ./main -m "${MODEL_PATH}" -p "Hello, world" -n 16 -t 4 -fa 0 -t 8 21 | 22 | # Test with K8V4 (8-bit keys, 4-bit values) 23 | echo -e "\n\n=== Testing with K8V4 ===" 24 | ./main -m "${MODEL_PATH}" -p "Hello, world" -n 16 -t 4 -fa 0 -t 8 --kvq-key 8 --kvq-val 4 25 | -------------------------------------------------------------------------------- /scripts/validate_kvsplit.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Exit on error, undefined variables, and pipeline failures 4 | set -euo pipefail 5 | 6 | # Colors for better output 7 | RED='\033[0;31m' 8 | GREEN='\033[0;32m' 9 | YELLOW='\033[1;33m' 10 | BLUE='\033[0;34m' 11 | NC='\033[0m' # No Color 12 | 13 | # Function to display error messages 14 | error_exit() { 15 | local line_number=$1 16 | local message=$2 17 | echo -e "${RED}Error on line ${line_number}: ${message}${NC}" >&2 18 | exit 1 19 | } 20 | 21 | # Trap errors 22 | trap 'error_exit $LINENO "Command failed with status $?"' ERR 23 | 24 | # Function to print section headers 25 | print_section() { 26 | echo -e "\n${YELLOW}==> $1${NC}" 27 | } 28 | 29 | # Function to extract Metal allocation information 30 | extract_metal_stats() { 31 | local logfile=$1 32 | local type=$2 33 | echo -e "${BLUE}Extracting Metal stats for $type:${NC}" 34 | 35 | # Check if log has Metal allocation information 36 | if grep -q "METAL ALLOC" "$logfile"; then 37 | echo -e "${GREEN}✓ Metal memory logging enabled${NC}" 38 | 39 | # Extract total allocated memory 40 | total_allocated=$(grep "METAL ALLOC" "$logfile" | grep -v "freed" | awk '{sum+=$3} END {print sum/1024/1024 " MB"}') 41 | echo "Total Metal memory allocated: $total_allocated" 42 | 43 | # Extract KV cache allocation if available 44 | kv_cache_alloc=$(grep -i "kv " "$logfile" | grep -i "cache" | grep "METAL ALLOC" | awk '{sum+=$3} END {if(sum>0) print sum/1024/1024 " MB"; else print "Not found"}') 45 | echo "KV cache memory: $kv_cache_alloc" 46 | 47 | # Extract K and V allocations separately if possible 48 | k_alloc=$(grep -i " k " "$logfile" | grep "METAL ALLOC" | awk '{sum+=$3} END {if(sum>0) print sum/1024/1024 " MB"; else print "Not found"}') 49 | v_alloc=$(grep -i " v " "$logfile" | grep "METAL ALLOC" | awk '{sum+=$3} END {if(sum>0) print sum/1024/1024 " MB"; else print "Not found"}') 50 | 51 | echo "K cache memory: $k_alloc" 52 | echo "V cache memory: $v_alloc" 53 | else 54 | echo -e "${RED}No Metal allocation information found. 
Try using -t 8 flag.${NC}" 55 | fi 56 | 57 | echo "----------------------------------------------" 58 | } 59 | 60 | # Function to verify KV cache type in logs 61 | verify_kv_types() { 62 | local logfile=$1 63 | local expected_k_type=$2 64 | local expected_v_type=$3 65 | 66 | echo -e "${BLUE}Verifying KV cache types:${NC}" 67 | 68 | # Extract type info from logs 69 | if grep -q "type_k" "$logfile" && grep -q "type_v" "$logfile"; then 70 | k_type=$(grep "type_k" "$logfile" | head -1 | sed 's/.*type_k[^a-z0-9]*\([a-z0-9_]*\).*/\1/') 71 | v_type=$(grep "type_v" "$logfile" | head -1 | sed 's/.*type_v[^a-z0-9]*\([a-z0-9_]*\).*/\1/') 72 | 73 | echo "Found key type: $k_type (Expected: $expected_k_type)" 74 | echo "Found value type: $v_type (Expected: $expected_v_type)" 75 | 76 | # Basic verification 77 | if [[ "$k_type" == *"$expected_k_type"* ]]; then 78 | echo -e "${GREEN}✓ Key type matches expected type${NC}" 79 | else 80 | echo -e "${RED}× Key type does not match expected type${NC}" 81 | fi 82 | 83 | if [[ "$v_type" == *"$expected_v_type"* ]]; then 84 | echo -e "${GREEN}✓ Value type matches expected type${NC}" 85 | else 86 | echo -e "${RED}× Value type does not match expected type${NC}" 87 | fi 88 | else 89 | echo -e "${YELLOW}KV cache type information not found in logs${NC}" 90 | fi 91 | 92 | echo "----------------------------------------------" 93 | } 94 | 95 | # Main execution 96 | print_section "KVSplit Validation Script" 97 | 98 | # Get base directory 99 | BASE_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)" 100 | LLAMA_CPP_DIR="${BASE_DIR}/llama.cpp" 101 | MODEL_PATH="${BASE_DIR}/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf" 102 | PATCH_PATH="${BASE_DIR}/patch/kvsplit_fixed.patch" 103 | RESULTS_DIR="${BASE_DIR}/results" 104 | 105 | # Ensure results directory exists 106 | mkdir -p "${RESULTS_DIR}" 107 | 108 | # Verify model exists 109 | if [ ! -f "${MODEL_PATH}" ]; then 110 | error_exit "$LINENO" "Model not found at ${MODEL_PATH}. Run setup.sh first." 111 | fi 112 | 113 | # Check if patch exists 114 | if [ ! -f "${PATCH_PATH}" ]; then 115 | error_exit "$LINENO" "Patch file not found at ${PATCH_PATH}" 116 | fi 117 | 118 | # Apply the patch 119 | print_section "Applying KVSplit patch" 120 | cd "${LLAMA_CPP_DIR}" 121 | # Check if patch has already been applied 122 | if grep -q "kvq-key" "${LLAMA_CPP_DIR}/common/arg.cpp"; then 123 | echo -e "${YELLOW}Patch appears to be already applied${NC}" 124 | else 125 | # Apply the patch 126 | patch -p1 < "${PATCH_PATH}" || error_exit "$LINENO" "Failed to apply patch" 127 | echo -e "${GREEN}✓ Patch applied successfully${NC}" 128 | fi 129 | 130 | # Build llama.cpp with Metal support 131 | print_section "Building llama.cpp with Metal support" 132 | cd "${LLAMA_CPP_DIR}" 133 | 134 | # Build with CMake instead of make, since that's how we set up in Step 1 135 | echo "Building with CMake..." 136 | rm -rf build 2>/dev/null || true 137 | mkdir -p build 138 | cd build 139 | 140 | # Configure with CMake 141 | cmake .. \ 142 | -DCMAKE_BUILD_TYPE=Release \ 143 | -DLLAMA_METAL=on \ 144 | -DCMAKE_OSX_ARCHITECTURES="arm64" \ 145 | -DCMAKE_OSX_DEPLOYMENT_TARGET="11.0" \ 146 | -DLLAMA_METAL_EMBED_LIBRARY=ON \ 147 | -DLLAMA_METAL_SHADER_DEBUG=OFF \ 148 | -DLLAMA_BUILD_SERVER=OFF \ 149 | -DBUILD_SHARED_LIBS=OFF || error_exit "$LINENO" "CMake configuration failed" 150 | 151 | # Build with multiple jobs 152 | cmake --build . 
--config Release -j$(sysctl -n hw.logicalcpu) || \
153 |     error_exit "$LINENO" "Failed to build llama.cpp"
154 | 
155 | # Find the main executable
156 | MAIN_EXEC=""
157 | if [ -f "bin/main" ]; then
158 |     MAIN_EXEC="bin/main"
159 | elif [ -f "bin/llama-cli" ]; then
160 |     MAIN_EXEC="bin/llama-cli"
161 | else
162 |     # Try to find any llama executable (BSD find on macOS has no -executable; use -perm +111)
163 |     MAIN_EXEC=$(find . -name "llama-*" -type f -perm +111 | grep -v '\.o$' | head -1)
164 |     if [ -z "$MAIN_EXEC" ]; then
165 |         error_exit "$LINENO" "Could not find main llama executable"
166 |     fi
167 | fi
168 | MAIN_EXEC="build/${MAIN_EXEC#./}"  # detected relative to build/; the tests below run from the llama.cpp root
169 | echo -e "${GREEN}✓ Successfully built llama.cpp with Metal support${NC}"
170 | echo "Main executable: ${MAIN_EXEC}"
171 | 
172 | cd "${LLAMA_CPP_DIR}"
173 | 
174 | # Run tests
175 | print_section "Running validation tests"
176 | 
177 | # Test case 1: Baseline (FP16 - same precision for both)
178 | echo -e "${BLUE}Test 1: Baseline (FP16)${NC}"
179 | LOG_FILE="${RESULTS_DIR}/kvsplit_test_fp16.log"
180 | echo "Running: ./${MAIN_EXEC} -m ${MODEL_PATH} -t 8 -n 32 -p \"Hello world\""
181 | ./${MAIN_EXEC} -m "${MODEL_PATH}" -t 8 -n 32 -p "Hello world" > "${LOG_FILE}" 2>&1 || {
182 |     echo -e "${RED}Test 1 failed with exit code $?${NC}"
183 |     error_exit "$LINENO" "Baseline test failed"
184 | }
185 | echo -e "${GREEN}✓ Test 1 completed successfully${NC}"
186 | extract_metal_stats "${LOG_FILE}" "FP16"
187 | verify_kv_types "${LOG_FILE}" "f16" "f16"
188 | 
189 | # Test case 2: Using existing --kvq parameter (Q8_0 - same precision for both)
190 | echo -e "${BLUE}Test 2: Using existing --kvq parameter (Q8_0)${NC}"
191 | LOG_FILE="${RESULTS_DIR}/kvsplit_test_kvq_q8.log"
192 | echo "Running: ./${MAIN_EXEC} -m ${MODEL_PATH} --kvq 8 -t 8 -n 32 -p \"Hello world\""
193 | ./${MAIN_EXEC} -m "${MODEL_PATH}" --kvq 8 -t 8 -n 32 -p "Hello world" > "${LOG_FILE}" 2>&1 || {
194 |     echo -e "${RED}Test 2 failed with exit code $?${NC}"
195 |     error_exit "$LINENO" "Test using --kvq parameter failed"
196 | }
197 | echo -e "${GREEN}✓ Test 2 completed successfully${NC}"
198 | extract_metal_stats "${LOG_FILE}" "Q8_0"
199 | verify_kv_types "${LOG_FILE}" "q8_0" "q8_0"
200 | 
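201 | # Back-of-envelope expectation for the next test (not a measurement): FP16 keeps
202 | # 16+16 bits per K/V element while K8V4 keeps roughly 8+4, so the KV cache should
203 | # shrink to about (8+4)/32 = 37.5% of baseline, ignoring quantization block scales
204 | # and the 256B page alignment noted in the summary below.
205 | 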
verify_kv_types "${LOG_FILE}" "q4_0" "q8_0" 224 | 225 | # Test case 5: Both 4-bit (K4V4 - 4-bit keys, 4-bit values) 226 | echo -e "${BLUE}Test 5: Both 4-bit (K4V4 - 4-bit keys, 4-bit values)${NC}" 227 | LOG_FILE="${RESULTS_DIR}/kvsplit_test_k4v4.log" 228 | echo "Running: ./${MAIN_EXEC} -m ${MODEL_PATH} --kvq 4 -t 8 -n 32 -p \"Hello world\"" 229 | ./${MAIN_EXEC} -m "${MODEL_PATH}" --kvq 4 -t 8 -n 32 -p "Hello world" > "${LOG_FILE}" 2>&1 || { 230 | echo -e "${RED}Test 5 failed with exit code $?${NC}" 231 | error_exit "$LINENO" "K4V4 test failed" 232 | } 233 | echo -e "${GREEN}✓ Test 5 completed successfully${NC}" 234 | extract_metal_stats "${LOG_FILE}" "K4V4" 235 | verify_kv_types "${LOG_FILE}" "q4_0" "q4_0" 236 | 237 | # Print summary 238 | print_section "KVSplit Validation Summary" 239 | echo -e "${GREEN}✓ All tests completed successfully!${NC}" 240 | echo "Results saved in: ${RESULTS_DIR}" 241 | echo "" 242 | echo -e "${YELLOW}Memory usage summary:${NC}" 243 | echo "Baseline (FP16): $(grep "KV cache memory:" "${RESULTS_DIR}/kvsplit_test_fp16.log" | awk '{print $4, $5}')" 244 | echo "Q8_0 (--kvq 8): $(grep "KV cache memory:" "${RESULTS_DIR}/kvsplit_test_kvq_q8.log" | awk '{print $4, $5}')" 245 | echo "K8V4 (--kvq-key 8 --kvq-val 4): $(grep "KV cache memory:" "${RESULTS_DIR}/kvsplit_test_k8v4.log" | awk '{print $4, $5}')" 246 | echo "K4V8 (--kvq-key 4 --kvq-val 8): $(grep "KV cache memory:" "${RESULTS_DIR}/kvsplit_test_k4v8.log" | awk '{print $4, $5}')" 247 | echo "K4V4 (--kvq 4): $(grep "KV cache memory:" "${RESULTS_DIR}/kvsplit_test_k4v4.log" | awk '{print $4, $5}')" 248 | echo "" 249 | echo -e "${BLUE}Notes:${NC}" 250 | echo "1. K8V4 (8-bit keys, 4-bit values) typically provides a good balance of memory savings with lower quality loss" 251 | echo "2. K4V8 (4-bit keys, 8-bit values) typically shows more quality degradation as keys are more sensitive to quantization" 252 | echo "3. Results may vary by model size and context length" 253 | echo "4. Memory measurements may show slight differences from theoretical calculations due to 256B page alignment" 254 | echo "" 255 | echo "Run these with longer contexts and different prompts to better evaluate impact on quality" 256 | -------------------------------------------------------------------------------- /scripts/visualize_results.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | """ 4 | Visualization Script for KVSplit Benchmarks 5 | 6 | This script generates publication-quality plots from the benchmark data, 7 | showing memory usage, performance impact, quality impact, and key-value sensitivity. 
8 | """ 9 | 10 | import os 11 | import glob 12 | import pandas as pd 13 | import numpy as np 14 | import matplotlib.pyplot as plt 15 | import seaborn as sns 16 | from pathlib import Path 17 | 18 | # Use colorblind-friendly style 19 | plt.style.use("tableau-colorblind10") 20 | 21 | # Constants 22 | OUTPUT_DIR = Path("../plots") 23 | RESULTS_DIR = Path("../results") 24 | 25 | # Create output directory if it doesn't exist 26 | os.makedirs(OUTPUT_DIR, exist_ok=True) 27 | 28 | def load_latest_results(): 29 | """Load the most recent benchmark results CSV file""" 30 | # Find the most recent benchmark results file 31 | result_files = glob.glob(str(RESULTS_DIR / "benchmark_results_*.csv")) 32 | if not result_files: 33 | raise FileNotFoundError("No benchmark result files found") 34 | 35 | # Sort by modification time, newest first 36 | latest_file = max(result_files, key=os.path.getmtime) 37 | print(f"Using benchmark data from: {latest_file}") 38 | 39 | # Load data 40 | df = pd.read_csv(latest_file) 41 | return df 42 | 43 | def prepare_data(df): 44 | """Process the benchmark data for visualization""" 45 | # Group by configuration and sequence length, averaging across runs 46 | grouped = df.groupby(['Configuration', 'Sequence_Length']).agg({ 47 | 'VRAM_Usage_MB': 'mean', 48 | 'KV_Cache_MB': 'mean', 49 | 'Throughput_Tokens_Per_Sec': 'mean', 50 | 'Perplexity': 'mean', 51 | 'Success': 'mean' # How many runs succeeded 52 | }).reset_index() 53 | 54 | # Calculate the baseline FP16 values for each sequence length 55 | fp16_baseline = grouped[grouped['Configuration'] == 'FP16'].copy() 56 | 57 | # Create lookup dictionaries for baseline values 58 | vram_baseline = dict(zip(fp16_baseline['Sequence_Length'], fp16_baseline['VRAM_Usage_MB'])) 59 | kv_baseline = dict(zip(fp16_baseline['Sequence_Length'], fp16_baseline['KV_Cache_MB'])) 60 | perplexity_baseline = fp16_baseline['Perplexity'].mean() 61 | 62 | # Add percent savings and perplexity change columns 63 | grouped['VRAM_Savings_Pct'] = grouped.apply( 64 | lambda row: 0 if row['Configuration'] == 'FP16' else 65 | (1 - row['VRAM_Usage_MB'] / vram_baseline[row['Sequence_Length']]) * 100, 66 | axis=1 67 | ) 68 | 69 | grouped['KV_Savings_Pct'] = grouped.apply( 70 | lambda row: 0 if row['Configuration'] == 'FP16' else 71 | (1 - row['KV_Cache_MB'] / kv_baseline[row['Sequence_Length']]) * 100, 72 | axis=1 73 | ) 74 | 75 | grouped['Perplexity_Change_Pct'] = grouped.apply( 76 | lambda row: ((row['Perplexity'] - perplexity_baseline) / perplexity_baseline) * 100, 77 | axis=1 78 | ) 79 | 80 | return grouped 81 | 82 | def plot_memory_usage(df): 83 | """Create a bar chart showing memory usage by configuration and sequence length""" 84 | plt.figure(figsize=(12, 8)) 85 | 86 | # Create a grouped bar chart for KV cache size 87 | ax = sns.barplot( 88 | data=df, 89 | x='Configuration', 90 | y='KV_Cache_MB', 91 | hue='Sequence_Length', 92 | palette='viridis' 93 | ) 94 | 95 | # Add title and labels 96 | plt.title('KV Cache Memory Usage by Configuration', fontsize=16) 97 | plt.ylabel('KV Cache Size (MB)', fontsize=14) 98 | plt.xlabel('Configuration', fontsize=14) 99 | plt.xticks(rotation=0, fontsize=12) 100 | plt.yticks(fontsize=12) 101 | 102 | # Adjust legend 103 | plt.legend(title='Sequence Length', fontsize=12, title_fontsize=13) 104 | 105 | # Add annotations showing % savings vs FP16 for a specific sequence length 106 | # Choose the longest sequence length for annotations 107 | longest_seq = df['Sequence_Length'].max() 108 | long_seq_data = df[df['Sequence_Length'] == 
longest_seq] 109 | 110 | # Get baseline VRAM for FP16 111 | fp16_kv = long_seq_data[long_seq_data['Configuration'] == 'FP16']['KV_Cache_MB'].values[0] 112 | 113 | # Annotate each non-FP16 bar with savings percentage 114 | bars = [patch for i, patch in enumerate(ax.patches) 115 | if i % len(df['Sequence_Length'].unique()) == len(df['Sequence_Length'].unique()) - 1] 116 | 117 | configs = long_seq_data['Configuration'].unique() 118 | for i, bar in enumerate(bars): 119 | if i < len(configs) and configs[i] != 'FP16': 120 | config_kv = long_seq_data[long_seq_data['Configuration'] == configs[i]]['KV_Cache_MB'].values[0] 121 | savings = (1 - config_kv / fp16_kv) * 100 122 | ax.text( 123 | bar.get_x() + bar.get_width()/2, 124 | bar.get_height() + 5, 125 | f'{savings:.1f}%', 126 | ha='center', 127 | fontsize=11, 128 | fontweight='bold', 129 | color='green' 130 | ) 131 | 132 | # Add a note about the annotations 133 | plt.annotate( 134 | f'Percentages show memory savings\ncompared to FP16 for {longest_seq} tokens', 135 | xy=(0.5, 0.97), 136 | xycoords='figure fraction', 137 | ha='center', 138 | fontsize=12, 139 | bbox=dict(boxstyle="round,pad=0.5", facecolor='white', alpha=0.8) 140 | ) 141 | 142 | # Save the figure 143 | plt.tight_layout() 144 | plt.savefig(OUTPUT_DIR / 'kv_cache_memory_usage.png', dpi=200, bbox_inches="tight") 145 | plt.close() 146 | 147 | def plot_performance(df): 148 | """Create a bar chart showing throughput by configuration and sequence length""" 149 | plt.figure(figsize=(12, 8)) 150 | 151 | # Create a grouped bar chart for throughput 152 | ax = sns.barplot( 153 | data=df, 154 | x='Configuration', 155 | y='Throughput_Tokens_Per_Sec', 156 | hue='Sequence_Length', 157 | palette='viridis' 158 | ) 159 | 160 | # Add title and labels 161 | plt.title('Inference Speed by Configuration', fontsize=16) 162 | plt.ylabel('Throughput (Tokens per second)', fontsize=14) 163 | plt.xlabel('Configuration', fontsize=14) 164 | plt.xticks(rotation=0, fontsize=12) 165 | plt.yticks(fontsize=12) 166 | 167 | # Adjust legend 168 | plt.legend(title='Sequence Length', fontsize=12, title_fontsize=13) 169 | 170 | # Calculate average change vs FP16 for configurations 171 | configs = df['Configuration'].unique() 172 | seq_lengths = df['Sequence_Length'].unique() 173 | 174 | # Average improvement in throughput compared to FP16 175 | improvements = {} 176 | for config in configs: 177 | if config == 'FP16': 178 | continue 179 | improvement_sum = 0 180 | for seq_len in seq_lengths: 181 | fp16_throughput = df[(df['Configuration'] == 'FP16') & 182 | (df['Sequence_Length'] == seq_len)]['Throughput_Tokens_Per_Sec'].values[0] 183 | config_throughput = df[(df['Configuration'] == config) & 184 | (df['Sequence_Length'] == seq_len)]['Throughput_Tokens_Per_Sec'].values[0] 185 | improvement_pct = ((config_throughput / fp16_throughput) - 1) * 100 186 | improvement_sum += improvement_pct 187 | improvements[config] = improvement_sum / len(seq_lengths) 188 | 189 | # Annotate with the average improvement 190 | y_max = df['Throughput_Tokens_Per_Sec'].max() * 1.05 191 | for i, config in enumerate(configs[1:], 1): # Skip FP16 192 | plt.annotate( 193 | f"{improvements[config]:.1f}% vs FP16", 194 | xy=(i, y_max * 0.95), 195 | ha='center', 196 | fontsize=11, 197 | fontweight='bold', 198 | color='green' if improvements[config] > 0 else 'red' 199 | ) 200 | 201 | # Add a note about the annotations 202 | plt.annotate( 203 | 'Percentages show average throughput\nimprovement compared to FP16', 204 | xy=(0.5, 0.97), 205 | xycoords='figure 
fraction', 206 | ha='center', 207 | fontsize=12, 208 | bbox=dict(boxstyle="round,pad=0.5", facecolor='white', alpha=0.8) 209 | ) 210 | 211 | # Save the figure 212 | plt.tight_layout() 213 | plt.savefig(OUTPUT_DIR / 'inference_speed.png', dpi=200, bbox_inches="tight") 214 | plt.close() 215 | 216 | def plot_quality_impact(df): 217 | """Create a bar chart showing perplexity change vs FP16""" 218 | # Get average perplexity per configuration (averaging across sequence lengths) 219 | perplexity_by_config = df.groupby('Configuration')['Perplexity'].mean().reset_index() 220 | baseline = perplexity_by_config[perplexity_by_config['Configuration'] == 'FP16']['Perplexity'].values[0] 221 | 222 | # Calculate perplexity change vs baseline 223 | perplexity_by_config['Perplexity_Change'] = ((perplexity_by_config['Perplexity'] - baseline) / baseline) * 100 224 | 225 | plt.figure(figsize=(10, 6)) 226 | 227 | # Plot the perplexity change (excluding FP16) 228 | non_fp16 = perplexity_by_config[perplexity_by_config['Configuration'] != 'FP16'] 229 | bars = plt.bar( 230 | non_fp16['Configuration'], 231 | non_fp16['Perplexity_Change'], 232 | color=sns.color_palette("Reds_r", len(non_fp16)) 233 | ) 234 | 235 | # Add title and labels 236 | plt.title('Quality Impact: Perplexity Change vs FP16', fontsize=16) 237 | plt.ylabel('Perplexity Change (%)', fontsize=14) 238 | plt.xlabel('Configuration', fontsize=14) 239 | plt.axhline(y=0, color='blue', linestyle='-', alpha=0.3, label='FP16 Baseline') 240 | 241 | # Add a red line for 5% degradation threshold 242 | plt.axhline(y=5, color='red', linestyle='--', alpha=0.5, label='5% Degradation Threshold') 243 | 244 | # Add annotations 245 | for bar in bars: 246 | height = bar.get_height() 247 | plt.text( 248 | bar.get_x() + bar.get_width()/2, 249 | height + 0.1 if height > 0 else height - 0.3, 250 | f'{height:.2f}%', 251 | ha='center', 252 | fontsize=12, 253 | fontweight='bold', 254 | color='black' 255 | ) 256 | 257 | # Add legend 258 | plt.legend(fontsize=12) 259 | 260 | # Add annotation explaining the implications 261 | plt.annotate( 262 | 'Lower is better. 
Values below 5% generally\nindicate minimal quality degradation.', 263 | xy=(0.5, 0.97), 264 | xycoords='figure fraction', 265 | ha='center', 266 | fontsize=12, 267 | bbox=dict(boxstyle="round,pad=0.5", facecolor='white', alpha=0.8) 268 | ) 269 | 270 | # Save the figure 271 | plt.tight_layout() 272 | plt.savefig(OUTPUT_DIR / 'perplexity_change.png', dpi=200, bbox_inches="tight") 273 | plt.close() 274 | 275 | def plot_key_vs_value_sensitivity(df): 276 | """Create a heatmap showing sensitivity to key vs value precision""" 277 | # Create a dataframe for the heatmap 278 | configs = ['FP16', 'K8V8', 'K8V4', 'K4V8', 'K4V4'] 279 | k_bits = [16, 8, 8, 4, 4] 280 | v_bits = [16, 8, 4, 8, 4] 281 | perplexity_avg = df.groupby('Configuration')['Perplexity'].mean().reset_index() 282 | 283 | # Create a lookup for perplexity values 284 | perplexity_lookup = dict(zip(perplexity_avg['Configuration'], perplexity_avg['Perplexity'])) 285 | perplexity_values = [perplexity_lookup.get(config, np.nan) for config in configs] 286 | 287 | # Calculate perplexity change 288 | fp16_perplexity = perplexity_lookup['FP16'] 289 | perplexity_change = [((p - fp16_perplexity) / fp16_perplexity) * 100 for p in perplexity_values] 290 | 291 | # Create dataframe for heatmap 292 | sensitivity_data = pd.DataFrame({ 293 | 'Config': configs, 294 | 'K_bits': k_bits, 295 | 'V_bits': v_bits, 296 | 'Perplexity': perplexity_values, 297 | 'Perplexity_Change_Pct': perplexity_change 298 | }) 299 | 300 | # Create a figure with two subplots 301 | fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 7)) 302 | 303 | # Plot 1: Heatmap of perplexity values by K/V bit precision 304 | pivot1 = sensitivity_data.pivot(index='K_bits', columns='V_bits', values='Perplexity') 305 | sns.heatmap( 306 | pivot1, 307 | annot=True, 308 | fmt='.2f', 309 | cmap='viridis_r', # Lower is better for perplexity 310 | ax=ax1, 311 | cbar_kws={'label': 'Perplexity'} 312 | ) 313 | ax1.set_title('Perplexity by Key/Value Precision', fontsize=14) 314 | ax1.set_xlabel('Value Bits', fontsize=12) 315 | ax1.set_ylabel('Key Bits', fontsize=12) 316 | 317 | # Plot 2: Heatmap of perplexity change percentage 318 | pivot2 = sensitivity_data.pivot(index='K_bits', columns='V_bits', values='Perplexity_Change_Pct') 319 | sns.heatmap( 320 | pivot2, 321 | annot=True, 322 | fmt='.2f', 323 | cmap='RdYlGn_r', # Red for worse, green for better 324 | ax=ax2, 325 | cbar_kws={'label': 'Perplexity Change (%)'} 326 | ) 327 | ax2.set_title('Perplexity Change vs FP16 (%)', fontsize=14) 328 | ax2.set_xlabel('Value Bits', fontsize=12) 329 | ax2.set_ylabel('Key Bits', fontsize=12) 330 | 331 | # Add overall title 332 | plt.suptitle('Key vs Value Precision Sensitivity', fontsize=16) 333 | 334 | # Add annotation explaining the key findings 335 | fig.text( 336 | 0.5, 0.02, 337 | 'Key precision (rows) has a larger impact on quality than value precision (columns).\nK8V4 offers an excellent balance of quality and memory efficiency.', 338 | ha='center', 339 | fontsize=12, 340 | bbox=dict(boxstyle="round,pad=0.5", facecolor='white', alpha=0.8) 341 | ) 342 | 343 | # Save the figure 344 | plt.tight_layout(rect=[0, 0.05, 1, 0.95]) 345 | plt.savefig(OUTPUT_DIR / 'key_value_sensitivity.png', dpi=200, bbox_inches="tight") 346 | plt.close() 347 | 348 | def plot_memory_vs_quality(df): 349 | """Create a scatter plot showing the tradeoff between memory usage and quality""" 350 | # Average across sequence lengths 351 | avg_by_config = df.groupby('Configuration').agg({ 352 | 'KV_Cache_MB': 'mean', 353 | 'Perplexity': 'mean', 354 
| 'KV_Savings_Pct': 'mean', 355 | 'Perplexity_Change_Pct': 'mean' 356 | }).reset_index() 357 | 358 | # Create custom color map for the configurations 359 | colors = { 360 | 'FP16': 'blue', 361 | 'K8V8': 'green', 362 | 'K8V4': 'purple', 363 | 'K4V8': 'orange', 364 | 'K4V4': 'red' 365 | } 366 | 367 | plt.figure(figsize=(10, 8)) 368 | 369 | # Create scatter plot 370 | for config in avg_by_config['Configuration']: 371 | data = avg_by_config[avg_by_config['Configuration'] == config] 372 | plt.scatter( 373 | data['KV_Savings_Pct'], 374 | data['Perplexity_Change_Pct'], 375 | s=200, # Size 376 | c=colors[config], # Color 377 | label=config, 378 | alpha=0.7, 379 | edgecolors='black' 380 | ) 381 | 382 | # Add config label to each point 383 | plt.annotate( 384 | config, 385 | xy=(data['KV_Savings_Pct'].values[0], data['Perplexity_Change_Pct'].values[0]), 386 | xytext=(5, 5), 387 | textcoords='offset points', 388 | fontsize=12, 389 | fontweight='bold' 390 | ) 391 | 392 | # Add quadrant labels 393 | plt.axhline(y=5, color='red', linestyle='--', alpha=0.5, label='5% Quality Threshold') 394 | plt.axvline(x=50, color='green', linestyle='--', alpha=0.5, label='50% Memory Savings Threshold') 395 | 396 | # Add shaded quadrants with labels 397 | plt.fill_between( 398 | [-10, 50], 5, 15, color='red', alpha=0.1 399 | ) 400 | plt.fill_between( 401 | [50, 100], 5, 15, color='orange', alpha=0.1 402 | ) 403 | plt.fill_between( 404 | [-10, 50], -10, 5, color='yellow', alpha=0.1 405 | ) 406 | plt.fill_between( 407 | [50, 100], -10, 5, color='green', alpha=0.1 408 | ) 409 | 410 | # Add quadrant labels 411 | plt.text(25, 10, "Low Savings, Worse Quality", ha='center', fontsize=10) 412 | plt.text(75, 10, "High Savings, Worse Quality", ha='center', fontsize=10) 413 | plt.text(25, 2, "Low Savings, Better Quality", ha='center', fontsize=10) 414 | plt.text(75, 2, "High Savings, Better Quality", ha='center', fontsize=10) 415 | 416 | # Add title and labels 417 | plt.title('Memory Savings vs Quality Tradeoff', fontsize=16) 418 | plt.xlabel('KV Cache Memory Savings (%)', fontsize=14) 419 | plt.ylabel('Perplexity Change vs FP16 (%)', fontsize=14) 420 | plt.xlim(-10, 100) 421 | plt.ylim(-2, 15) 422 | 423 | # Add legend 424 | plt.legend(fontsize=12) 425 | 426 | # Add annotation explaining the optimal region 427 | plt.annotate( 428 | 'Optimal configurations maximize memory savings\nwhile minimizing perplexity increase.', 429 | xy=(0.5, 0.97), 430 | xycoords='figure fraction', 431 | ha='center', 432 | fontsize=12, 433 | bbox=dict(boxstyle="round,pad=0.5", facecolor='white', alpha=0.8) 434 | ) 435 | 436 | # Save the figure 437 | plt.tight_layout() 438 | plt.savefig(OUTPUT_DIR / 'memory_vs_quality.png', dpi=200, bbox_inches="tight") 439 | plt.close() 440 | 441 | def create_summary_table(df): 442 | """Create a summary table visual showing the key metrics for each configuration""" 443 | # Average across sequence lengths 444 | avg_by_config = df.groupby('Configuration').agg({ 445 | 'KV_Cache_MB': 'mean', 446 | 'Throughput_Tokens_Per_Sec': 'mean', 447 | 'Perplexity': 'mean', 448 | 'KV_Savings_Pct': 'mean', 449 | 'Perplexity_Change_Pct': 'mean' 450 | }).reset_index() 451 | 452 | # Add a column for throughput change 453 | fp16_throughput = avg_by_config[avg_by_config['Configuration'] == 'FP16']['Throughput_Tokens_Per_Sec'].values[0] 454 | avg_by_config['Throughput_Change_Pct'] = ((avg_by_config['Throughput_Tokens_Per_Sec'] / fp16_throughput) - 1) * 100 455 | 456 | # Sort configurations in a specific order 457 | order = ['FP16', 'K8V8', 
'K8V4', 'K4V8', 'K4V4'] 458 | avg_by_config['Order'] = avg_by_config['Configuration'].map({k: i for i, k in enumerate(order)}) 459 | avg_by_config = avg_by_config.sort_values('Order').drop('Order', axis=1) 460 | 461 | # Create a figure for the table 462 | fig, ax = plt.subplots(figsize=(12, 6)) 463 | ax.axis('tight') 464 | ax.axis('off') 465 | 466 | # Define table data 467 | table_data = [ 468 | avg_by_config['Configuration'].tolist(), 469 | [f"{x:.2f} MB" for x in avg_by_config['KV_Cache_MB']], 470 | [f"{x:.1f}%" for x in avg_by_config['KV_Savings_Pct']], 471 | [f"{x:.0f}" for x in avg_by_config['Throughput_Tokens_Per_Sec']], 472 | [f"{x:+.1f}%" for x in avg_by_config['Throughput_Change_Pct']], 473 | [f"{x:.2f}" for x in avg_by_config['Perplexity']], 474 | [f"{x:+.2f}%" for x in avg_by_config['Perplexity_Change_Pct']] 475 | ] 476 | 477 | # Define row labels 478 | row_labels = [ 479 | 'Configuration', 480 | 'Avg KV Cache (MB)', 481 | 'Memory Savings', 482 | 'Throughput (t/s)', 483 | 'Throughput vs FP16', 484 | 'Perplexity', 485 | 'Quality Impact' 486 | ] 487 | 488 | # Create table 489 | table = ax.table( 490 | cellText=table_data, 491 | rowLabels=row_labels, 492 | loc='center', 493 | cellLoc='center', 494 | colWidths=[0.15] * len(avg_by_config) 495 | ) 496 | 497 | # Customize table appearance 498 | table.auto_set_font_size(False) 499 | table.set_fontsize(12) 500 | table.scale(1, 1.5) 501 | 502 | # Color cells based on values 503 | for i in range(len(row_labels)): 504 | for j in range(len(avg_by_config)): 505 | cell = table[(i, j)] 506 | 507 | # Memory savings - green for higher savings 508 | if i == 2 and j > 0: # KV_Savings_Pct row, excluding FP16 509 | savings = avg_by_config.iloc[j]['KV_Savings_Pct'] 510 | intensity = min(savings / 80, 1) # Normalize to [0, 1] 511 | cell.set_facecolor((1 - intensity, 1, 1 - intensity)) 512 | 513 | # Throughput - green for better, red for worse 514 | elif i == 4 and j > 0: # Throughput_Change_Pct row, excluding FP16 515 | change = avg_by_config.iloc[j]['Throughput_Change_Pct'] 516 | if change > 0: 517 | intensity = min(change / 20, 1) # Normalize to [0, 1] 518 | cell.set_facecolor((1 - intensity, 1, 1 - intensity)) 519 | else: 520 | intensity = min(abs(change) / 20, 1) # Normalize to [0, 1] 521 | cell.set_facecolor((1, 1 - intensity, 1 - intensity)) 522 | 523 | # Perplexity - red for worse (higher values) 524 | elif i == 6 and j > 0: # Perplexity_Change_Pct row, excluding FP16 525 | change = avg_by_config.iloc[j]['Perplexity_Change_Pct'] 526 | if change > 0: 527 | intensity = min(change / 10, 1) # Normalize to [0, 1] 528 | cell.set_facecolor((1, 1 - intensity, 1 - intensity)) 529 | else: 530 | intensity = min(abs(change) / 10, 1) # Normalize to [0, 1] 531 | cell.set_facecolor((1 - intensity, 1, 1 - intensity)) 532 | 533 | # Add title 534 | plt.suptitle('KVSplit Configuration Summary', fontsize=16) 535 | 536 | # Add annotation 537 | plt.figtext( 538 | 0.5, 0.01, 539 | 'Values are averaged across all sequence lengths. 
Green indicates better performance, red indicates worse.', 540 | ha='center', 541 | fontsize=12 542 | ) 543 | 544 | # Save the figure 545 | plt.tight_layout(rect=[0, 0.05, 1, 0.95]) 546 | plt.savefig(OUTPUT_DIR / 'configuration_summary.png', dpi=200, bbox_inches="tight") 547 | plt.close() 548 | 549 | def main(): 550 | """Main function to generate all visualizations""" 551 | # Load data 552 | try: 553 | df = load_latest_results() 554 | df_processed = prepare_data(df) 555 | 556 | # Generate plots 557 | plot_memory_usage(df_processed) 558 | plot_performance(df_processed) 559 | plot_quality_impact(df_processed) 560 | plot_key_vs_value_sensitivity(df_processed) 561 | plot_memory_vs_quality(df_processed) 562 | create_summary_table(df_processed) 563 | 564 | print(f"Visualizations saved to {OUTPUT_DIR}") 565 | except Exception as e: 566 | print(f"Error generating visualizations: {e}") 567 | 568 | if __name__ == "__main__": 569 | main() 570 | --------------------------------------------------------------------------------