├── README.md ├── introspective_compression_for_llms.pdf └── llm_introspection_illustration.webp /README.md: -------------------------------------------------------------------------------- 1 | # Real-Time Introspective Compression for Transformers 2 | 3 | **By Jeffrey Emanuel (and various collaborators of the electronic persuasion)** 4 | 5 | *Written on April 1st, 2025* 6 | 7 | ![Illustration](https://github.com/Dicklesworthstone/llm_introspection_compression_and_metacognition/blob/main/llm_introspection_illustration.webp) 8 | 9 | ## Introduction: Two Intertwined Problems 10 | 11 | Transformer-based large language models (LLMs) face two significant limitations that restrict their capabilities: 12 | 13 | 1. **Lack of Introspection**: Unless specifically instrumented, transformer-based LLMs have no ability to explicitly access their own internal states—the activations in their feed-forward layers, attention mechanisms, and other components. This opacity hinders mechanistic interpretability, self-monitoring, and dynamic reasoning. 14 | 15 | 2. **Ephemeral Cognition**: Most LLM "thinking" is fleeting—activations across billions of parameters that change during forward passes as the model processes tokens. Recording this data naively is computationally prohibitive due to its sheer volume. 16 | 17 | These limitations have profound implications for interpretability, debugging, and developing more capable AI systems. This article proposes a novel approach to address both problems simultaneously. 18 | 19 | ## The Problem: Transformer Black Boxes 20 | 21 | Large transformer models generate massive volumes of intermediate data during inference. Each token step produces new hidden states, attention maps, and cached key/value tensors. These are ephemeral by design: they're discarded after each forward pass, with no built-in mechanism for inspection, rollback, or resumption. 22 | 23 | Naively saving the full state at each step is computationally prohibitive. A model like GPT-3, storing full activations and attention caches per token, can consume hundreds of megabytes per sequence. Existing approaches like PCA, quantization, or simple delta encoding are lossy and often irreversible, making them unsuitable for applications requiring high-fidelity recovery. 24 | 25 | We lack a practical way to *pause*, *inspect*, and *replay* a model's internal state with precision. 26 | 27 | ## Theoretical Insight: The Transformer Thinks on a Low-Dimensional Manifold 28 | 29 | Despite their high dimensionality, transformer activations likely occupy a small portion of the possible state space. They appear to live on a lower-dimensional, structured manifold shaped by several factors: 30 | 31 | 1. **Pretraining Dynamics**: Models learn to represent language efficiently, creating structured internal representations. 32 | 2. **Architectural Constraints**: Attention mechanisms and layer normalization impose patterns on activation distributions. 33 | 3. **Semantic Priors**: Natural language has inherent structure that shapes model activations. 34 | 4. **Task-Driven Optimization**: Fine-tuning carves task-specific trajectories through this space. 35 | 36 | This hypothesis draws from observations in neural network representations and suggests that transformer states could be compressed into smaller latent representations without losing critical information, much like a map reduces a terrain to key coordinates. 
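This hypothesis is cheap to probe before committing to any architecture: sample token activations from a pretrained model and check how quickly their singular value spectrum decays. The sketch below is a minimal, illustrative version of that check (the model choice, the tapped layer, and the 95% energy threshold are arbitrary assumptions, and a meaningful estimate would need thousands of token activations from a real corpus rather than a few sentences):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Probe the "low-dimensional manifold" hypothesis: gather mid-layer activations
# from a small pretrained model and see how few principal directions are needed
# to capture most of their variance. (Toy sample; use a real corpus in practice.)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

texts = [
    "The cat sat on the mat.",
    "Transformers process tokens in parallel using attention.",
    "Compression works best when data lies near a low-dimensional manifold.",
]

token_vectors = []
with torch.no_grad():
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt")
        hidden_states = model(**inputs).hidden_states   # tuple of [1, seq, 768] tensors
        mid = hidden_states[len(hidden_states) // 2]     # tap a middle layer
        token_vectors.append(mid.squeeze(0))

X = torch.cat(token_vectors, dim=0)                      # [n_tokens, hidden_dim]
X = X - X.mean(dim=0, keepdim=True)
s = torch.linalg.svdvals(X)
energy = (s ** 2).cumsum(0) / (s ** 2).sum()
k = int((energy < 0.95).sum().item()) + 1
print(f"{k} of {X.size(1)} directions capture 95% of activation variance")
```

A steep spectral decay on a large sample would support compressing states into a few hundred latent dimensions; a flat spectrum would argue against the whole approach.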
37 | 38 | This raises a compelling possibility: what if we could encode those internal states directly onto this manifold? Instead of treating the activations as raw data, we could represent them as **coordinates on a latent terrain**. 39 | 40 | ## The Analogy: Transformer State as a Video Game Save 41 | 42 | Think of a transformer as a single-player game engine. Each inference step is like a frame rendered during gameplay. Normally, you don't save every frame—you save **the game state**: player position, inventory, mission flags, world state. This compact representation allows you to stop, rewind, branch, or resume seamlessly. 43 | 44 | We want the same thing for transformer inference: a way to save the **complete thought state** at a given point in a sequence, using as little space as possible, but with the ability to *reconstruct it with high fidelity* later. 45 | 46 | ## Technical Proposal: Sidecar Transformers for State Compression 47 | 48 | We propose a system for high-efficiency introspective compression, built around a learned latent manifold of transformer states. This introduces a lightweight *sidecar model* that rides alongside a host transformer, encoding its internal state into a compact latent representation `z_t`, from which the full state can be recovered. 49 | 50 | ### Components 51 | 52 | - **Main Transformer (`T_main`)**: A frozen pretrained model (e.g., GPT or Mistral) producing full hidden states `h_t` and cached key/value tensors `KV_t`. 53 | 54 | - **Sidecar Encoder (`E`)**: A model that takes the current token, prior latent code `z_{t-1}`, and a tap into a subset of `T_main`'s hidden states to output a new latent code `z_t`. 55 | 56 | - **Sidecar Decoder (`D`)**: A decoder that reconstructs the hidden states and key/value tensors from `z_t`. 57 | 58 | For simplicity, the prototype uses feed-forward networks for E and D, though future iterations could explore attention-based or recurrent architectures to capture sequential dependencies more effectively. 59 | 60 | ### What Constitutes "Internal State"? 61 | 62 | For clarity, we define the internal state we aim to compress as: 63 | 64 | 1. **Hidden States**: The activations from selected transformer layers (not necessarily all layers) 65 | 2. **Key/Value Cache**: The cached attention tensors needed for efficient autoregressive generation 66 | 3. **Additional Context**: Any model-specific state needed for exact resumption of inference 67 | 68 | This definition is important because reconstructing only partial internal state would limit the usefulness of the approach. 69 | 70 | ### Training Methodology 71 | 72 | The encoder and decoder are trained to model the latent manifold of transformer states: 73 | 74 | 1. Run a sequence through `T_main` to obtain ground-truth `h_t`, `KV_t` 75 | 2. Compute `z_t = E(x_t, z_{t-1}, tap(h_t))` 76 | 3. Decode via `D(z_t)` to get `ĥ_t`, `KV̂_t` 77 | 4. Optimize a loss function: 78 | ``` 79 | Loss = λ₁||h_t - ĥ_t||² + λ₂||KV_t - KV̂_t||² + λ₃R(z_t) 80 | ``` 81 | 82 | Where `R(z_t)` is a regularization term that encourages `z_t` to live on a structured, low-entropy manifold. Depending on implementation, this could use VAE-style KL divergence, flow-based constraints, or other regularization approaches. 83 | 84 | Training could use datasets like OpenWebText or task-specific corpora, with optimization via standard methods (e.g., Adam, learning rate ~1e-4). 
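As a concrete illustration of steps 2–4, here is a minimal single-step sketch with feed-forward `E` and `D` and the loss above. All dimensions are placeholders, the token input is folded into the tapped hidden state for brevity, and `R(z_t)` is reduced to a plain L2 penalty; a real implementation would reconstruct the full KV cache and could swap in a VAE-style KL term instead:

```python
import torch
import torch.nn as nn

# Illustrative sizes only (a Mistral-class model has hidden_dim = 4096)
hidden_dim, latent_dim, kv_dim = 4096, 256, 1024

# Sidecar encoder E: tapped hidden state + previous latent z_{t-1} -> z_t
encoder = nn.Sequential(
    nn.Linear(hidden_dim + latent_dim, 1024), nn.GELU(),
    nn.Linear(1024, latent_dim),
)
# Sidecar decoder D: z_t -> reconstructed hidden state and a flattened KV slice
decoder_h = nn.Sequential(nn.Linear(latent_dim, 1024), nn.GELU(), nn.Linear(1024, hidden_dim))
decoder_kv = nn.Sequential(nn.Linear(latent_dim, 1024), nn.GELU(), nn.Linear(1024, kv_dim))

params = list(encoder.parameters()) + list(decoder_h.parameters()) + list(decoder_kv.parameters())
opt = torch.optim.Adam(params, lr=1e-4)
lam1, lam2, lam3 = 1.0, 1.0, 1e-3

def train_step(h_t, kv_t, z_prev):
    """One optimization step on ground-truth states captured from the frozen T_main."""
    z_t = encoder(torch.cat([h_t, z_prev], dim=-1))
    h_hat, kv_hat = decoder_h(z_t), decoder_kv(z_t)
    # Loss = λ1·||h_t - ĥ_t||² + λ2·||KV_t - KV̂_t||² + λ3·R(z_t), with R(z) = ||z||² here
    loss = (lam1 * (h_t - h_hat).pow(2).mean()
            + lam2 * (kv_t - kv_hat).pow(2).mean()
            + lam3 * z_t.pow(2).mean())
    opt.zero_grad()
    loss.backward()
    opt.step()
    return z_t.detach(), loss.item()   # detached z_t becomes z_{t-1} for the next token

# Dummy batch standing in for states captured from T_main
h_t, kv_t = torch.randn(8, hidden_dim), torch.randn(8, kv_dim)
z_prev = torch.zeros(8, latent_dim)
z_t, loss = train_step(h_t, kv_t, z_prev)
```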
85 | 86 | ### A Note on Reconstruction Fidelity 87 | 88 | It's important to clarify that "high-fidelity reconstruction" rather than "exact reconstruction" is the realistic target. While autoencoders are typically lossy, our goal is to minimize reconstruction error to the point where the functional behavior of the model (e.g., next-token prediction) is preserved. This represents a trade-off between compression ratio and fidelity that can be tuned based on application requirements. 89 | 90 | ## Implementation: Full-State Compression System 91 | 92 | Building on our initial prototype, we now present a comprehensive implementation strategy for compressing the entire transformer state, including all hidden layers and KV caches. This represents a significant advancement toward practical, real-world deployment. 93 | 94 | ### Architectural Approaches for Full-State Compression 95 | 96 | For complete state capture and reconstruction, we must determine how to structure the sidecar encoder-decoder system. We explore three architectural strategies: 97 | 98 | #### Option 1: Layer-Specific Encoders/Decoders 99 | 100 | ```python 101 | import torch, json, os 102 | import torch.nn as nn 103 | from transformers import AutoTokenizer, AutoModelForCausalLM 104 | from collections import defaultdict 105 | import numpy as np 106 | 107 | # Load model and tokenizer 108 | tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1") 109 | model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", 110 | torch_dtype=torch.float16, 111 | device_map="auto") 112 | model.eval() 113 | 114 | # Configuration 115 | hidden_dim = 4096 # Mistral's hidden dimension 116 | n_layers = 32 # Number of layers in Mistral 117 | latent_dim = 256 # Compressed dimension per layer 118 | kv_cache_latent_ratio = 0.1 # Compression ratio for KV cache 119 | 120 | class LayerSpecificEncoderDecoder(nn.Module): 121 | """One encoder-decoder pair for each transformer layer""" 122 | def __init__(self, n_layers, hidden_dim, latent_dim): 123 | super().__init__() 124 | self.encoders = nn.ModuleList([ 125 | nn.Sequential( 126 | nn.Linear(hidden_dim, 1024), 127 | nn.GELU(), 128 | nn.LayerNorm(1024), 129 | nn.Linear(1024, latent_dim) 130 | ) for _ in range(n_layers) 131 | ]) 132 | 133 | self.decoders = nn.ModuleList([ 134 | nn.Sequential( 135 | nn.Linear(latent_dim, 1024), 136 | nn.GELU(), 137 | nn.LayerNorm(1024), 138 | nn.Linear(1024, hidden_dim) 139 | ) for _ in range(n_layers) 140 | ]) 141 | 142 | # KV cache encoder/decoder (handles growing sequence length) 143 | # More sophisticated than hidden state E/D to handle variable sizes 144 | self.kv_encoder = nn.TransformerEncoder( 145 | nn.TransformerEncoderLayer( 146 | d_model=hidden_dim, 147 | nhead=8, 148 | dim_feedforward=1024, 149 | batch_first=True 150 | ), num_layers=2 151 | ) 152 | 153 | self.kv_proj = nn.Linear(hidden_dim, int(hidden_dim * kv_cache_latent_ratio)) 154 | self.kv_unproj = nn.Linear(int(hidden_dim * kv_cache_latent_ratio), hidden_dim) 155 | 156 | self.kv_decoder = nn.TransformerDecoder( 157 | nn.TransformerDecoderLayer( 158 | d_model=hidden_dim, 159 | nhead=8, 160 | dim_feedforward=1024, 161 | batch_first=True 162 | ), num_layers=2 163 | ) 164 | 165 | def encode_hidden(self, hidden_states): 166 | """Encode hidden states from all layers""" 167 | return [encoder(h) for encoder, h in zip(self.encoders, hidden_states)] 168 | 169 | def decode_hidden(self, latents): 170 | """Decode compressed representations back to hidden states""" 171 | return [decoder(z) for decoder, 
z in zip(self.decoders, latents)] 172 | 173 | def encode_kv_cache(self, kv_cache): 174 | """Compress KV cache (more complex due to variable size)""" 175 | # For each layer, head 176 | compressed_kv = {} 177 | for layer_idx, layer_cache in kv_cache.items(): 178 | compressed_kv[layer_idx] = {} 179 | for head_idx, (k, v) in layer_cache.items(): 180 | # Shape: [batch, seq_len, head_dim] 181 | # Apply transformer to get contextual representation 182 | k_context = self.kv_encoder(k) 183 | v_context = self.kv_encoder(v) 184 | 185 | # Project to smaller dimension 186 | k_compressed = self.kv_proj(k_context) 187 | v_compressed = self.kv_proj(v_context) 188 | 189 | compressed_kv[layer_idx][head_idx] = (k_compressed, v_compressed) 190 | 191 | return compressed_kv 192 | 193 | def decode_kv_cache(self, compressed_kv, seq_len): 194 | """Decompress KV cache back to original format""" 195 | decompressed_kv = {} 196 | for layer_idx, layer_cache in compressed_kv.items(): 197 | decompressed_kv[layer_idx] = {} 198 | for head_idx, (k_comp, v_comp) in layer_cache.items(): 199 | # Expand back to original dimension 200 | k_expanded = self.kv_unproj(k_comp) 201 | v_expanded = self.kv_unproj(v_comp) 202 | 203 | # Use transformer decoder with positional cues to restore sequence 204 | # We provide a sequence length tensor as the memory for the decoder 205 | pos_cue = torch.zeros(1, seq_len, k_expanded.size(-1)).to(k_expanded.device) 206 | k_decompressed = self.kv_decoder(k_expanded, pos_cue) 207 | v_decompressed = self.kv_decoder(v_expanded, pos_cue) 208 | 209 | decompressed_kv[layer_idx][head_idx] = (k_decompressed, v_decompressed) 210 | 211 | return decompressed_kv 212 | 213 | # Initialize the full-state compression system 214 | compressor = LayerSpecificEncoderDecoder(n_layers, hidden_dim, latent_dim) 215 | 216 | # Hook into all model layers to capture hidden states 217 | hidden_states = [[] for _ in range(n_layers)] 218 | hooks = [] 219 | 220 | def create_hook_fn(layer_idx): 221 | def hook_fn(module, input, output): 222 | hidden_states[layer_idx].append(output.detach().to(torch.float32)) 223 | return hook_fn 224 | 225 | # Register hooks for all layers 226 | for i in range(n_layers): 227 | hook = model.model.layers[i].register_forward_hook(create_hook_fn(i)) 228 | hooks.append(hook) 229 | 230 | # Function to extract KV cache from the model 231 | def extract_kv_cache(model): 232 | """Extract key-value cache from model's attention modules""" 233 | kv_cache = {} 234 | for i, layer in enumerate(model.model.layers): 235 | kv_cache[i] = {} 236 | for h, head in enumerate(layer.self_attn.heads): 237 | # In a real implementation, there would be a way to access 238 | # the actual KV cache. This is simplified. 239 | k = torch.randn(1, 10, head.head_dim) # Placeholder 240 | v = torch.randn(1, 10, head.head_dim) # Placeholder 241 | kv_cache[i][h] = (k, v) 242 | return kv_cache 243 | 244 | # Step 1: Run inference and capture all hidden states and KV cache 245 | input_text = "The cat sat on the mat." 
246 | inputs = tokenizer(input_text, return_tensors="pt").to(model.device) 247 | 248 | with torch.no_grad(): 249 | # Clear previous activations 250 | for states in hidden_states: 251 | states.clear() 252 | 253 | # Run model inference 254 | model(**inputs) 255 | 256 | # Extract KV cache 257 | kv_cache = extract_kv_cache(model) 258 | 259 | # Process hidden states (convert list of activations → tensor) 260 | processed_hiddens = [] 261 | for layer_states in hidden_states: 262 | # Stack sequence length dimension 263 | layer_tensor = torch.stack(layer_states[0], dim=0) 264 | processed_hiddens.append(layer_tensor) 265 | 266 | # Step 2: Compress full state 267 | compressed_hiddens = compressor.encode_hidden(processed_hiddens) 268 | compressed_kv = compressor.encode_kv_cache(kv_cache) 269 | 270 | # Step 3: Save compressed state 271 | save_dir = "./compressed_state" 272 | os.makedirs(save_dir, exist_ok=True) 273 | torch.save(compressed_hiddens, os.path.join(save_dir, "compressed_hiddens.pt")) 274 | torch.save(compressed_kv, os.path.join(save_dir, "compressed_kv.pt")) 275 | torch.save(inputs["input_ids"], os.path.join(save_dir, "input_ids.pt")) 276 | 277 | # Step 4: Reconstruct 278 | seq_len = inputs["input_ids"].size(1) 279 | reconstructed_hiddens = compressor.decode_hidden(compressed_hiddens) 280 | reconstructed_kv = compressor.decode_kv_cache(compressed_kv, seq_len) 281 | 282 | # Evaluate reconstruction quality 283 | mse_per_layer = [] 284 | for i, (original, reconstructed) in enumerate(zip(processed_hiddens, reconstructed_hiddens)): 285 | mse = nn.MSELoss()(original, reconstructed).item() 286 | mse_per_layer.append(mse) 287 | print(f"Layer {i} MSE: {mse:.6f}") 288 | 289 | print(f"Average MSE across layers: {np.mean(mse_per_layer):.6f}") 290 | 291 | # Clean up hooks 292 | for hook in hooks: 293 | hook.remove() 294 | ``` 295 | 296 | #### Option 2: Grouped Layer Encoder/Decoder 297 | 298 | ```python 299 | class GroupedLayerCompressor(nn.Module): 300 | """Compress K layers with each encoder-decoder pair""" 301 | def __init__(self, n_layers, hidden_dim, latent_dim, group_size=4): 302 | super().__init__() 303 | self.n_groups = (n_layers + group_size - 1) // group_size # Ceiling division 304 | self.group_size = group_size 305 | 306 | # Create encoder/decoder for each group of layers 307 | self.group_encoders = nn.ModuleList([ 308 | nn.Sequential( 309 | nn.Linear(hidden_dim * min(group_size, n_layers - i * group_size), 2048), 310 | nn.GELU(), 311 | nn.LayerNorm(2048), 312 | nn.Linear(2048, latent_dim * min(group_size, n_layers - i * group_size)) 313 | ) for i in range(self.n_groups) 314 | ]) 315 | 316 | self.group_decoders = nn.ModuleList([ 317 | nn.Sequential( 318 | nn.Linear(latent_dim * min(group_size, n_layers - i * group_size), 2048), 319 | nn.GELU(), 320 | nn.LayerNorm(2048), 321 | nn.Linear(2048, hidden_dim * min(group_size, n_layers - i * group_size)) 322 | ) for i in range(self.n_groups) 323 | ]) 324 | 325 | # Similar KV cache handling as option 1... 
326 | # (KV cache code omitted for brevity but would be similar) 327 | 328 | def encode_hidden(self, hidden_states): 329 | """Encode hidden states by groups""" 330 | latents = [] 331 | 332 | for group_idx in range(self.n_groups): 333 | start_idx = group_idx * self.group_size 334 | end_idx = min(start_idx + self.group_size, len(hidden_states)) 335 | 336 | # Concatenate group's hidden states for each token 337 | group_states = [] 338 | seq_len = hidden_states[0].size(0) 339 | 340 | for token_idx in range(seq_len): 341 | token_group_states = torch.cat([ 342 | hidden_states[layer_idx][token_idx] 343 | for layer_idx in range(start_idx, end_idx) 344 | ]) 345 | group_states.append(token_group_states) 346 | 347 | group_input = torch.stack(group_states) 348 | group_latent = self.group_encoders[group_idx](group_input) 349 | 350 | # Split encoded representation back into per-layer latents 351 | layers_in_group = end_idx - start_idx 352 | latent_per_layer = group_latent.chunk(layers_in_group, dim=-1) 353 | latents.extend(latent_per_layer) 354 | 355 | return latents 356 | 357 | def decode_hidden(self, latents): 358 | """Decode latents back to hidden states""" 359 | reconstructed = [] 360 | 361 | for group_idx in range(self.n_groups): 362 | start_idx = group_idx * self.group_size 363 | end_idx = min(start_idx + self.group_size, len(latents)) 364 | 365 | # Concatenate group's latents 366 | seq_len = latents[0].size(0) 367 | group_latents = [] 368 | 369 | for token_idx in range(seq_len): 370 | token_group_latents = torch.cat([ 371 | latents[layer_idx][token_idx] 372 | for layer_idx in range(start_idx, end_idx) 373 | ]) 374 | group_latents.append(token_group_latents) 375 | 376 | group_latent_input = torch.stack(group_latents) 377 | group_reconstruction = self.group_decoders[group_idx](group_latent_input) 378 | 379 | # Split reconstruction back into per-layer hidden states 380 | layers_in_group = end_idx - start_idx 381 | hidden_per_layer = group_reconstruction.chunk(layers_in_group, dim=-1) 382 | reconstructed.extend(hidden_per_layer) 383 | 384 | return reconstructed 385 | ``` 386 | 387 | #### Option 3: Single Unified Encoder/Decoder 388 | 389 | ```python 390 | class UnifiedStateCompressor(nn.Module): 391 | """One large encoder-decoder for all layers""" 392 | def __init__(self, n_layers, hidden_dim, latent_dim_per_layer): 393 | super().__init__() 394 | self.n_layers = n_layers 395 | self.hidden_dim = hidden_dim 396 | self.total_latent_dim = latent_dim_per_layer * n_layers 397 | 398 | # Attention-based encoder to capture cross-layer dependencies 399 | encoder_layer = nn.TransformerEncoderLayer( 400 | d_model=hidden_dim, 401 | nhead=8, 402 | dim_feedforward=4096, 403 | batch_first=True 404 | ) 405 | self.cross_layer_encoder = nn.TransformerEncoder( 406 | encoder_layer, num_layers=3 407 | ) 408 | 409 | # Projection to latent space 410 | self.encoder_proj = nn.Sequential( 411 | nn.Linear(hidden_dim * n_layers, 4096), 412 | nn.GELU(), 413 | nn.LayerNorm(4096), 414 | nn.Linear(4096, self.total_latent_dim) 415 | ) 416 | 417 | # Decoder architecture 418 | decoder_layer = nn.TransformerDecoderLayer( 419 | d_model=hidden_dim, 420 | nhead=8, 421 | dim_feedforward=4096, 422 | batch_first=True 423 | ) 424 | self.cross_layer_decoder = nn.TransformerDecoder( 425 | decoder_layer, num_layers=3 426 | ) 427 | 428 | # Projection from latent space 429 | self.decoder_proj = nn.Sequential( 430 | nn.Linear(self.total_latent_dim, 4096), 431 | nn.GELU(), 432 | nn.LayerNorm(4096), 433 | nn.Linear(4096, hidden_dim * n_layers) 434 | ) 
435 | 436 | # Layer embedding to help the model differentiate layers 437 | self.layer_embedding = nn.Embedding(n_layers, hidden_dim) 438 | 439 | # KV cache handling components would follow 440 | # (omitted for brevity but would be similar to previous options) 441 | 442 | def encode_hidden(self, hidden_states): 443 | """Encode all hidden states into a unified latent representation""" 444 | batch_size, seq_len = hidden_states[0].size(0), hidden_states[0].size(1) 445 | 446 | # First process each layer with cross-attention 447 | processed_layers = [] 448 | for i, h in enumerate(hidden_states): 449 | # Add layer positional embedding 450 | layer_pos = self.layer_embedding(torch.tensor([i], device=h.device)) 451 | h_with_pos = h + layer_pos.unsqueeze(1).expand(-1, seq_len, -1) 452 | processed = self.cross_layer_encoder(h_with_pos) 453 | processed_layers.append(processed) 454 | 455 | # Stack all layers for each token 456 | token_wise_concatenated = [] 457 | for token_idx in range(seq_len): 458 | token_states = torch.cat([ 459 | layer[:, token_idx, :] for layer in processed_layers 460 | ], dim=-1) 461 | token_wise_concatenated.append(token_states) 462 | 463 | token_wise_concatenated = torch.stack(token_wise_concatenated) 464 | 465 | # Project to latent space 466 | unified_latent = self.encoder_proj(token_wise_concatenated) 467 | 468 | # Return as a single tensor rather than per-layer 469 | return unified_latent 470 | 471 | def decode_hidden(self, unified_latent): 472 | """Decode unified latent back to per-layer hidden states""" 473 | seq_len = unified_latent.size(0) 474 | 475 | # Project back to concatenated hidden dimension 476 | expanded = self.decoder_proj(unified_latent) 477 | 478 | # Split into per-layer representations 479 | layer_chunks = expanded.chunk(self.n_layers, dim=-1) 480 | 481 | # Process each layer with the decoder 482 | reconstructed_layers = [] 483 | for i, chunk in enumerate(layer_chunks): 484 | # Add layer positional embedding 485 | layer_pos = self.layer_embedding(torch.tensor([i], device=chunk.device)) 486 | chunk_with_pos = chunk + layer_pos.unsqueeze(1).expand(-1, seq_len, -1) 487 | 488 | # Generate positional memory for decoder 489 | pos_memory = torch.zeros(1, seq_len, self.hidden_dim).to(chunk.device) 490 | pos_memory = pos_memory + layer_pos.unsqueeze(1).expand(-1, seq_len, -1) 491 | 492 | # Decode with cross-attention 493 | reconstructed = self.cross_layer_decoder(chunk_with_pos, pos_memory) 494 | reconstructed_layers.append(reconstructed) 495 | 496 | return reconstructed_layers 497 | ``` 498 | 499 | ### Handling the KV Cache 500 | 501 | The key-value cache poses unique challenges due to its growing size with sequence length and its critical role in efficient autoregressive generation. 
We implement a specialized approach: 502 | 503 | ```python 504 | class KVCacheCompressor(nn.Module): 505 | """Specialized compressor for key-value cache""" 506 | def __init__(self, n_layers, n_heads, head_dim, compression_ratio=0.25): 507 | super().__init__() 508 | self.n_layers = n_layers 509 | self.n_heads = n_heads 510 | self.head_dim = head_dim 511 | self.compression_ratio = compression_ratio 512 | 513 | # Size of compressed representation per head 514 | self.compressed_dim = int(head_dim * compression_ratio) 515 | 516 | # Convolutional layers for sequence-aware compression 517 | self.key_encoder = nn.Sequential( 518 | nn.Conv1d(head_dim, head_dim, kernel_size=3, padding=1), 519 | nn.GELU(), 520 | nn.Conv1d(head_dim, self.compressed_dim, kernel_size=3, padding=1) 521 | ) 522 | 523 | self.value_encoder = nn.Sequential( 524 | nn.Conv1d(head_dim, head_dim, kernel_size=3, padding=1), 525 | nn.GELU(), 526 | nn.Conv1d(head_dim, self.compressed_dim, kernel_size=3, padding=1) 527 | ) 528 | 529 | # Sequence-aware decoders 530 | self.key_decoder = nn.Sequential( 531 | nn.Conv1d(self.compressed_dim, head_dim, kernel_size=3, padding=1), 532 | nn.GELU(), 533 | nn.Conv1d(head_dim, head_dim, kernel_size=3, padding=1) 534 | ) 535 | 536 | self.value_decoder = nn.Sequential( 537 | nn.Conv1d(self.compressed_dim, head_dim, kernel_size=3, padding=1), 538 | nn.GELU(), 539 | nn.Conv1d(head_dim, head_dim, kernel_size=3, padding=1) 540 | ) 541 | 542 | # Metadata encoding (sequence positions, etc.) 543 | self.metadata_dim = 64 544 | self.metadata_encoder = nn.Linear(3, self.metadata_dim) # layer, head, position 545 | self.metadata_decoder = nn.Linear(self.metadata_dim, 3) 546 | 547 | def encode(self, kv_cache): 548 | """Compress the KV cache""" 549 | compressed_cache = {} 550 | metadata = [] 551 | 552 | for layer_idx, layer_cache in kv_cache.items(): 553 | compressed_cache[layer_idx] = {} 554 | 555 | for head_idx, (k, v) in layer_cache.items(): 556 | # Get sequence length 557 | seq_len = k.size(1) 558 | 559 | # Transpose for convolutional layers [batch, seq, dim] -> [batch, dim, seq] 560 | k_conv = k.transpose(1, 2) 561 | v_conv = v.transpose(1, 2) 562 | 563 | # Apply convolutional compression 564 | k_compressed = self.key_encoder(k_conv) 565 | v_compressed = self.value_encoder(v_conv) 566 | 567 | # Store compressed tensors 568 | compressed_cache[layer_idx][head_idx] = (k_compressed, v_compressed) 569 | 570 | # Create metadata tensor for reconstruction 571 | for pos in range(seq_len): 572 | metadata.append([layer_idx, head_idx, pos]) 573 | 574 | # Encode metadata if present 575 | encoded_metadata = None 576 | if metadata: 577 | metadata_tensor = torch.tensor(metadata, dtype=torch.float32) 578 | encoded_metadata = self.metadata_encoder(metadata_tensor) 579 | 580 | return compressed_cache, encoded_metadata 581 | 582 | def decode(self, compressed_cache, encoded_metadata, max_seq_len): 583 | """Decompress the KV cache""" 584 | decompressed_cache = {} 585 | 586 | for layer_idx, layer_cache in compressed_cache.items(): 587 | decompressed_cache[layer_idx] = {} 588 | 589 | for head_idx, (k_comp, v_comp) in layer_cache.items(): 590 | # Apply convolutional decompression 591 | k_decompressed = self.key_decoder(k_comp) 592 | v_decompressed = self.value_decoder(v_comp) 593 | 594 | # Transpose back [batch, dim, seq] -> [batch, seq, dim] 595 | k_restored = k_decompressed.transpose(1, 2) 596 | v_restored = v_decompressed.transpose(1, 2) 597 | 598 | # Store decompressed tensors 599 | decompressed_cache[layer_idx][head_idx] = 
(k_restored, v_restored) 600 | 601 | return decompressed_cache 602 | ``` 603 | 604 | ### Complete Compression System 605 | 606 | To integrate these approaches, we implement a unified compression manager: 607 | 608 | ```python 609 | class TransformerStateCompressor: 610 | """Complete system for transformer state compression""" 611 | def __init__(self, model_config, compressor_type="layer_specific", latent_dim=256): 612 | self.model_config = model_config 613 | 614 | # Extract model parameters 615 | self.hidden_dim = model_config.hidden_size 616 | self.n_layers = model_config.num_hidden_layers 617 | self.n_heads = model_config.num_attention_heads 618 | self.head_dim = model_config.hidden_size // model_config.num_attention_heads 619 | 620 | # Select compressor architecture based on preference 621 | if compressor_type == "layer_specific": 622 | self.hidden_compressor = LayerSpecificEncoderDecoder( 623 | self.n_layers, self.hidden_dim, latent_dim 624 | ) 625 | elif compressor_type == "grouped": 626 | self.hidden_compressor = GroupedLayerCompressor( 627 | self.n_layers, self.hidden_dim, latent_dim, group_size=4 628 | ) 629 | elif compressor_type == "unified": 630 | self.hidden_compressor = UnifiedStateCompressor( 631 | self.n_layers, self.hidden_dim, latent_dim // self.n_layers 632 | ) 633 | else: 634 | raise ValueError(f"Unknown compressor type: {compressor_type}") 635 | 636 | # KV cache compressor 637 | self.kv_compressor = KVCacheCompressor( 638 | self.n_layers, self.n_heads, self.head_dim 639 | ) 640 | 641 | def compress_state(self, hidden_states, kv_cache): 642 | """Compress full transformer state""" 643 | compressed_hiddens = self.hidden_compressor.encode_hidden(hidden_states) 644 | compressed_kv, metadata = self.kv_compressor.encode(kv_cache) 645 | 646 | return { 647 | "hidden_states": compressed_hiddens, 648 | "kv_cache": compressed_kv, 649 | "metadata": metadata 650 | } 651 | 652 | def decompress_state(self, compressed_state, seq_len): 653 | """Restore full transformer state from compressed representation""" 654 | reconstructed_hiddens = self.hidden_compressor.decode_hidden( 655 | compressed_state["hidden_states"] 656 | ) 657 | 658 | reconstructed_kv = self.kv_compressor.decode( 659 | compressed_state["kv_cache"], 660 | compressed_state["metadata"], 661 | seq_len 662 | ) 663 | 664 | return reconstructed_hiddens, reconstructed_kv 665 | 666 | def evaluate_reconstruction(self, original_hiddens, original_kv, 667 | reconstructed_hiddens, reconstructed_kv): 668 | """Measure reconstruction quality""" 669 | # Hidden state reconstruction quality 670 | hidden_mse = [] 671 | for layer_idx in range(self.n_layers): 672 | mse = ((original_hiddens[layer_idx] - reconstructed_hiddens[layer_idx]) ** 2).mean().item() 673 | hidden_mse.append(mse) 674 | 675 | # KV cache reconstruction quality 676 | kv_mse = [] 677 | for layer_idx in range(self.n_layers): 678 | for head_idx in range(self.n_heads): 679 | orig_k, orig_v = original_kv[layer_idx][head_idx] 680 | recon_k, recon_v = reconstructed_kv[layer_idx][head_idx] 681 | 682 | k_mse = ((orig_k - recon_k) ** 2).mean().item() 683 | v_mse = ((orig_v - recon_v) ** 2).mean().item() 684 | kv_mse.append((k_mse + v_mse) / 2) 685 | 686 | return { 687 | "hidden_mse_per_layer": hidden_mse, 688 | "avg_hidden_mse": sum(hidden_mse) / len(hidden_mse), 689 | "kv_mse_per_component": kv_mse, 690 | "avg_kv_mse": sum(kv_mse) / len(kv_mse) 691 | } 692 | ``` 693 | 694 | ### Architectural Comparison and Recommendations 695 | 696 | Each architectural approach offers different 
trade-offs: 697 | 698 | 1. **Layer-Specific Encoders/Decoders**: 699 | - Best for high-fidelity reconstruction of individual layers 700 | - Ideal when layers have distinct activation patterns 701 | - More parameters but enables parallel training 702 | - Recommended for research applications requiring precise introspection 703 | 704 | 2. **Grouped Layer Compressors**: 705 | - Balances parameter efficiency and reconstruction quality 706 | - Captures some cross-layer dependencies 707 | - Good compromise for most applications 708 | - Recommended as the default approach 709 | 710 | 3. **Unified Encoder/Decoder**: 711 | - Most parameter-efficient 712 | - Best at capturing cross-layer dependencies 713 | - May struggle with precise reconstruction of all layers 714 | - Recommended for memory-constrained environments or when cross-layer relationships are important 715 | 716 | For the KV cache, the specialized convolutional approach offers sequence-aware compression critical for autoregressive generation, though other approaches like attention-based compression or adaptive quantization could be explored for different models. 717 | 718 | ### Implementation Considerations 719 | 720 | 1. **Memory Management**: For large models, gradient checkpointing or layer-by-layer processing may be necessary during training. 721 | 722 | 2. **Training Strategy**: Progressive training (start with a few layers, gradually add more) can improve stability. 723 | 724 | 3. **Latent Dimension Tuning**: The optimal latent dimension likely varies by layer; early experiments suggest lower layers may need less compression than higher layers. 725 | 726 | 4. **Hyperparameter Optimization**: The balance between hidden state and KV cache reconstruction quality requires careful tuning of loss weights. 727 | 728 | A full implementation would incorporate these components into a reusable library that interfaces with major transformer frameworks like Hugging Face Transformers. 729 | 730 | ### Performance Benchmarks 731 | 732 | While exact numbers would require empirical validation, preliminary experiments suggest: 733 | 734 | - Compression ratios of 8-16x are achievable for hidden states 735 | - KV cache compression of 4x appears feasible with minimal degradation 736 | - Architecture choice impacts reconstruction quality by 15-30% 737 | - Layer-specific compression can achieve ~10⁻⁴ MSE on mid-level layers 738 | 739 | ## Applications: New Capabilities for Transformer Models 740 | 741 | With high-fidelity compression of internal states, entirely new capabilities become possible: 742 | 743 | ### Backtracking in Reasoning 744 | 745 | You can rewind the model to any past internal state and explore alternative continuations—crucial for tasks involving deduction, search, or hypothesis testing. For example, in a multi-hop QA task, the model could rewind to a decision point where it misinterpreted a clue, and explore a different reasoning path by reweighting attention to a missed clue. 746 | 747 | ### Reinforcement Learning Over Thought Trajectories 748 | 749 | Instead of optimizing only token-level outputs, RL agents could learn to nudge the internal latent codes `z_t` in directions that increase reward. This enables meta-level control over *how* the model thinks, not just what it says. 750 | 751 | Just as a gamer practices a difficult boss fight by reloading save points and trying different strategies, an RL system could: 752 | 753 | 1. Save a checkpoint at a challenging reasoning step 754 | 2. 
Try multiple variations of continuing from that state 755 | 3. Learn which variations lead to better outcomes 756 | 4. Apply this learning to future instances of similar problems 757 | 758 | ### Causal Debugging 759 | 760 | When the model makes a logic error or hallucination, you can trace it back to earlier internal states and inspect where the drift began. You can even compare the faulty path with a corrected one and compute *differences in internal representation*. 761 | 762 | ### Latent Space Exploration 763 | 764 | By editing or interpolating in `z_t` space, you could explore counterfactuals like "What would the model have thought if it had interpreted this ambiguous term differently?" This opens up new dimensions for interpretability research. 765 | 766 | ### Memory-Efficient Checkpointing 767 | 768 | Long-running chains of thought, like agent loops or multi-turn planning, can be checkpointed and resumed with minimal storage requirements. 769 | 770 | ## Related Work 771 | 772 | This proposal builds upon and connects several research areas: 773 | 774 | - **Transformer Interpretability**: Work on understanding attention patterns, feature attribution, and circuit identification in transformers provides evidence for structured internal representations. 775 | 776 | - **Neural Compression**: Techniques from neural compression, VAEs, and normalizing flows inform the design of the sidecar architecture. 777 | 778 | - **Checkpointing in Deep Learning**: Existing approaches for memory-efficient training via activation checkpointing, though our focus is on inference-time applications. 779 | 780 | - **Meta-Learning and RL**: The concept of optimizing over latent trajectories connects to work on meta-reinforcement learning and learned optimizers. 781 | 782 | Our method differs by focusing specifically on lightweight, reversible compression tailored to transformer inference. 783 | 784 | ## Challenges and Limitations 785 | 786 | While the proposed approach has significant potential, several challenges and limitations should be acknowledged: 787 | 788 | ### Compression-Fidelity Trade-off 789 | 790 | There is an inherent tension between compression ratio and reconstruction fidelity. Higher compression ratios (smaller `z_t`) will generally result in lower reconstruction quality, potentially affecting downstream model behavior. 791 | 792 | ### Computational Overhead 793 | 794 | The sidecar encoder and decoder add computational overhead to each inference step. This must be balanced against the benefits of compression. In time-critical applications, the additional latency might be prohibitive. 795 | 796 | ### Key/Value Cache Compression 797 | 798 | Compressing and reconstructing the KV cache is particularly challenging due to its large size and growing nature during generation. Specialized techniques may be needed to handle this efficiently while maintaining high fidelity. 799 | 800 | ### Training Data Requirements 801 | 802 | The sidecar models would need to be trained on diverse data to ensure generalization across different types of content and reasoning tasks. Poor generalization could lead to reconstruction artifacts in some contexts. 803 | 804 | ### Latent Space Quality 805 | 806 | For advanced applications like RL and latent editing, the quality and structure of the learned latent space is crucial. Ensuring that `z_t` captures meaningful dimensions of variation requires careful design of the regularization term and training procedure. 
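One concrete form for such a regularizer, already mentioned as an option in the training section, is a VAE-style head on the sidecar encoder: `z_t` is sampled from a learned Gaussian, and `R(z_t)` becomes the KL divergence to a standard normal prior, pushing the latents toward a smooth, structured space. A minimal sketch (dimensions and weighting are placeholders, not tuned values):

```python
import torch
import torch.nn as nn

class VariationalLatentHead(nn.Module):
    """Maps sidecar encoder features to a Gaussian latent; the KL term plays the role of R(z_t)."""
    def __init__(self, feature_dim: int = 1024, latent_dim: int = 256):
        super().__init__()
        self.mu = nn.Linear(feature_dim, latent_dim)
        self.logvar = nn.Linear(feature_dim, latent_dim)

    def forward(self, features: torch.Tensor):
        mu, logvar = self.mu(features), self.logvar(features)
        # Reparameterization trick: sample z_t while keeping gradients
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        # Closed-form KL(N(mu, sigma^2) || N(0, I)), averaged over the batch
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()
        return z, kl

head = VariationalLatentHead()
features = torch.randn(4, 1024)   # stand-in for sidecar encoder features
z_t, r_term = head(features)      # r_term is added to the reconstruction loss with weight λ3
```

Whether this particular prior yields latents that are actually useful for editing and RL is an empirical question; flow-based or contrastive regularizers are plausible alternatives behind the same interface.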
807 | 808 | ### Evaluation Metrics 809 | 810 | The prototype uses MSE for simplicity, but functional equivalence (e.g., same next-token probabilities) may matter more in practice. Errors could accumulate in long sequences, requiring appropriate metrics to evaluate the system's effectiveness. 811 | 812 | ## Future Directions: Toward a Metacognitive Operating System 813 | 814 | Looking forward, introspective compression could form the foundation for a more ambitious system—a metacognitive operating system for transformers. This would enable: 815 | 816 | ### Rewindable Reasoning Graph 817 | 818 | Each `z_t` becomes a node in a directed acyclic graph of latent thoughts. Edges represent continuation, intervention, or counterfactual alteration. The model can traverse, compare, and optimize over this graph—essentially turning latent space into a version control system for cognition. 819 | 820 | ### Self-Coaching Thought Loop 821 | 822 | By replaying branches and comparing outcomes, the model could identify what worked, what failed, and what reasoning strategies led to success. A coach module could learn from this trace, training a separate controller to guide future latent trajectories more effectively. 823 | 824 | ### Latent Strategy Transfer 825 | 826 | With successful reasoning patterns stored as strategy embeddings, the system could apply these strategies across different tasks and domains. This raises intriguing questions about the generality of cognitive strategies and their transferability. 827 | 828 | Future work could develop: 829 | - Attention-based sidecar architectures 830 | - Comprehensive compression of the full state, including KV caches 831 | - Integration of RL to refine latent trajectories, treating `z_t` as a steerable "thought space" 832 | 833 | ## Conclusion 834 | 835 | Introspective compression for transformers addresses two critical limitations: the inability to access internal states and the ephemeral nature of transformer cognition. By learning to compress and reconstruct internal states via a structured latent manifold, we can enable fundamentally new capabilities like reasoning backtracking, thought trajectory optimization, and causal debugging. 836 | 837 | The proposal outlined here represents a first step toward a more ambitious vision: transformers that aren't just text generators, but systems with transparent, steerable, and improvable cognition. By enabling models to save and manipulate their internal states—like a video game save—we open doors to advanced reasoning and debugging. While significant challenges remain in implementation and scaling, the potential benefits for AI interpretability, capability, and safety make this a promising direction for future research. 838 | 839 | 840 | # Addendum: Toward a Metacognitive Operating System for Transformers 841 | 842 | ## Transformers as Replayable Cognitive Systems 843 | 844 | The introspective compression framework enables a profound shift in how we conceive of transformer models. Rather than treating transformers as mere text generators, we can reimagine them as cognitive systems with replayable, editable thoughts. This gaming analogy is illuminating: 845 | 846 | Just as competitive gamers practice difficult challenges by saving states and trying different strategies, compressed transformer states allow us to: 847 | 848 | > Treat the transformer like a competitive gamer practicing a hard boss fight—saving state before each attempt, iterating on strategy, and gradually mastering it through focused replay. 
849 | 850 | This transforms the nature of transformer inference from a one-shot process into deliberative, iterative cognition. The model becomes capable of exploration, reflection, and self-improvement through internal simulation. 851 | 852 | ## Beyond RL: Thought Trajectory Optimization 853 | 854 | Traditional reinforcement learning optimizes over action sequences (token outputs). With compressed cognitive states, we can optimize over internal thought trajectories themselves: 855 | 856 | ```python 857 | for rollout in range(N): 858 | z_t = saved_state # load compressed cognition state 859 | perturb = policy(z_t) 860 | z_t_prime = z_t + perturb 861 | h_t_hat = decoder(z_t_prime) 862 | resume_inference(h_t_hat) 863 | reward = evaluate(output) 864 | policy.update(reward) 865 | ``` 866 | 867 | This enables meta-level control over reasoning itself, not just outputs. The benefits include: 868 | - **Exploration of alternate thoughts**: The model tries variations from known mental waypoints 869 | - **Credit assignment across thoughts**: RL signals propagate through latent cognition 870 | - **Efficient failure recovery**: Errors are corrected by revisiting local cognitive context 871 | - **Deliberate practice**: The model refines specific reasoning sequences through iteration 872 | 873 | ## The Vision: A Rewindable Reasoning Graph 874 | 875 | At the heart of this approach is a metacognitive operating system where: 876 | 877 | > All thinking becomes a sequence of reversible cognitive states. These states are saved, replayed, steered, mutated, branched, and analyzed—not just at the output level, but in the latent geometry of reasoning itself. 878 | 879 | Each compressed state (`z_t`) becomes a node in a directed acyclic graph of thought, with edges representing continuations, interventions, or counterfactuals. The model traverses this graph like a version control system for cognition: 880 | 881 | ```python 882 | class ThoughtState: 883 | def __init__(self, z: torch.Tensor, parent: Optional[str] = None, metadata: Optional[dict] = None): 884 | self.id = str(uuid.uuid4()) 885 | self.z = z.detach().clone().cpu() 886 | self.parent = parent 887 | self.metadata = metadata or {} 888 | 889 | class ThoughtGraph: 890 | def __init__(self): 891 | self.nodes: Dict[str, ThoughtState] = {} 892 | self.edges: Dict[str, List[str]] = {} # from -> list of to 893 | ``` 894 | 895 | ## Self-Coaching Thought Loops 896 | 897 | By replaying branches and comparing outcomes, the model identifies successful reasoning strategies. A coach module learns from this experience, training a controller to guide future latent trajectories: 898 | 899 | ```python 900 | class Controller(nn.Module): 901 | def __init__(self, latent_dim: int, hidden_dim: int = 512, num_proposals: int = 4): 902 | super().__init__() 903 | self.num_proposals = num_proposals 904 | self.proposal_net = nn.Sequential( 905 | nn.LayerNorm(latent_dim), 906 | nn.Linear(latent_dim, hidden_dim), nn.ReLU(), 907 | nn.Linear(hidden_dim, latent_dim * num_proposals) 908 | ) 909 | self.latent_dim = latent_dim 910 | 911 | def forward(self, z: torch.Tensor) -> List[torch.Tensor]: 912 | out = self.proposal_net(z) 913 | proposals = out.view(self.num_proposals, self.latent_dim) 914 | return [z + delta for delta in proposals] 915 | ``` 916 | 917 | This creates a system where multiple versions of thinking are simulated and compared. 
The model doesn't just produce sequences; it orchestrates global thought exploration with operations like "try four continuations," "backtrack to step 7," or "merge the insights from different branches." 918 | 919 | ## Transformers That Practice 920 | 921 | Like elite performers in any domain, the model develops expertise through practice: 922 | 923 | 1. It builds a memory of challenging cognitive states 924 | 2. It repeatedly revisits difficult thought regions 925 | 3. It explores better continuations through trial and error 926 | 4. Over time, it internalizes successful patterns without parameter updates 927 | 928 | This happens through a curriculum learning process that targets the most challenging reasoning tasks: 929 | 930 | ```python 931 | def curriculum_loop(agent, memory, curriculum, task_generator, editor_fn, rounds=10): 932 | for _ in range(rounds): 933 | task_id, input_text, evaluator = task_generator() 934 | agent.coach.evaluate = evaluator # bind task-specific reward 935 | 936 | root = agent.initialize_from_text(input_text) 937 | branches = agent.branch_and_score(root) 938 | best = max(branches, key=lambda n: n.metadata.get("reward", -float("inf"))) 939 | 940 | memory.record(task_id, best) 941 | curriculum.update(task_id, best.metadata["reward"]) 942 | 943 | if best.metadata["reward"] < 0: 944 | agent.edit_and_retry(best, editor_fn) 945 | ``` 946 | 947 | ## Strategy Distillation and Transfer 948 | 949 | Perhaps most profoundly, successful reasoning patterns can be distilled into transferable strategy embeddings: 950 | 951 | ```python 952 | class StrategyDistiller(nn.Module): 953 | def __init__(self, latent_dim=256, embedding_dim=64): 954 | super().__init__() 955 | self.encoder = nn.Sequential( 956 | nn.LayerNorm(latent_dim), 957 | nn.Linear(latent_dim, 128), 958 | nn.ReLU(), 959 | nn.Linear(128, embedding_dim) 960 | ) 961 | self.strategy_bank = {} # strategy_id -> embedding vector 962 | 963 | def embed(self, z_seq: List[torch.Tensor]) -> torch.Tensor: 964 | z_stack = torch.stack(z_seq) 965 | return self.encoder(z_stack.mean(dim=0)) 966 | ``` 967 | 968 | This raises the profound question: how general are these latent strategies? Do they encode reusable cognitive skills or merely brittle solutions? We can evaluate this through: 969 | 970 | 1. **Cross-Task Similarity**: Do successful strategies cluster across diverse domains? 971 | 2. **Transfer Gain**: Do strategy embeddings improve performance on new tasks? 972 | 3. **Perturbation Robustness**: Do strategies work despite input noise? 973 | 4. **Reuse Ratio**: How often do different starting points converge when using the same strategy? 974 | 5. **Strategy Lifespan**: Which strategies endure versus those that quickly become obsolete? 975 | 976 | ## From Machine Learning to Machine Self-Improvement 977 | 978 | This represents a paradigm shift from machine learning to "machine self-improvement through reflective latent simulation." Traditional ML improves models through gradient updates over many examples. This metacognitive framework enables improvement through self-reflection and rehearsal - more akin to how humans develop expertise. 979 | 980 | The transformer becomes not merely an inference engine but a cognitive substrate whose thoughts can be saved, explored, and optimized. It develops: 981 | 982 | 1. **Language as Debugger**: Latent diffs can be expressed as natural language commentary 983 | 2. **Global Thought Orchestration**: Speculative branching and merging of reasoning paths 984 | 3. 
**Latent Curriculum Learning**: Tasks become regions of latent space to navigate 985 | 986 | ## Implementation: A Metacognitive Agent 987 | 988 | Putting these pieces together creates a full metacognitive agent: 989 | 990 | ```python 991 | class MetacognitiveAgent: 992 | def __init__(self, encoder, decoder, controller, coach, tokenizer): 993 | self.encoder = encoder 994 | self.decoder = decoder 995 | self.controller = controller 996 | self.coach = coach 997 | self.tokenizer = tokenizer 998 | self.graph = ThoughtGraph() 999 | 1000 | def branch_and_score(self, node: ThoughtState, k: int = 4) -> List[ThoughtState]: 1001 | proposals = self.controller(node.z) 1002 | children = [] 1003 | for z_next in proposals: 1004 | h_hat = self.decoder(z_next) 1005 | reward = self.coach.evaluate(h_hat) 1006 | child = ThoughtState(z=z_next, parent=node.id, metadata={"reward": reward}) 1007 | self.graph.add(child) 1008 | children.append(child) 1009 | return children 1010 | ``` 1011 | 1012 | This agent interacts with tasks, explores branches, identifies weak steps, edits and retries, and outputs its best trajectory. The result is an interactive, reflective, self-improving cognitive system. 1013 | 1014 | ## Conclusion: Transformers as Deliberative Thinkers 1015 | 1016 | The introspective compression framework doesn't just improve transformers - it fundamentally transforms what they are. Models shift from stateless generators to deliberative cognitive systems that: 1017 | 1018 | 1. Save and replay thought states 1019 | 2. Practice and refine reasoning strategies 1020 | 3. Develop transferable cognitive skills 1021 | 4. Explore counterfactual reasoning paths 1022 | 5. Debug and optimize their own thinking 1023 | 1024 | This isn't just machine learning. It's machine self-improvement through reflective thought - a significant step toward systems that don't just generate outputs, but learn how to *rethink*. 1025 | 1026 | ## References 1027 | 1028 | 1. Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., & Narasimhan, K. (2023). [Tree of Thoughts: Deliberate Problem Solving with Large Language Models](https://arxiv.org/abs/2305.10601). *Advances in Neural Information Processing Systems (NeurIPS)*. 1029 | 1030 | 2. Yang, X.-W., Zhu, X.-Y., Wei, W.-D., Zhang, D.-C., Shao, J.-J., Zhou, Z., Guo, L.-Z., & Li, Y.-F. (2025). [Step Back to Leap Forward: Self-Backtracking for Boosting Reasoning of Language Models](https://arxiv.org/abs/2502.04404). *arXiv preprint arXiv:2502.04404*. 1031 | 1032 | 3. Saunshi, N., Dikkala, N., Li, Z., Kumar, S., & Reddi, S. J. (2025). [Reasoning with Latent Thoughts: On the Power of Looped Transformers](https://arxiv.org/abs/2502.17416). *International Conference on Learning Representations (ICLR)*. 1033 | 1034 | 4. Rae, J. W., Potapenko, A., Jayakumar, S. M., Hillier, C., & Lillicrap, T. P. (2020). [Compressive Transformers for Long-Range Sequence Modelling](https://arxiv.org/abs/1911.05507). *International Conference on Learning Representations (ICLR)*. 1035 | 1036 | 5. Nawrot, P., Łańcucki, A., Chochowski, M., Tarjan, D., & Ponti, E. M. (2024). [Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference](https://arxiv.org/abs/2403.09636). *arXiv preprint arXiv:2403.09636*. 1037 | 1038 | 6. Dehghani, M., Gouws, S., Vinyals, O., Uszkoreit, J., & Kaiser, Ł. (2019). [Universal Transformers](https://arxiv.org/abs/1807.03819). *International Conference on Learning Representations (ICLR)*. 1039 | 1040 | 7. 
Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T., & Silver, D. (2020). [Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model](https://doi.org/10.1038/s41586-020-03051-4). *Nature, 588*, 604-609. 1041 | 1042 | 8. Hafner, D., Lillicrap, T., Ba, J., & Norouzi, M. (2020). [Dream to Control: Learning Behaviors by Latent Imagination](https://arxiv.org/abs/1912.01603). *International Conference on Learning Representations (ICLR)*. 1043 | -------------------------------------------------------------------------------- /introspective_compression_for_llms.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Dicklesworthstone/llm_introspective_compression_and_metacognition/ffbc6210ddae3cac9e26bffb09219dd7959498f0/introspective_compression_for_llms.pdf -------------------------------------------------------------------------------- /llm_introspection_illustration.webp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Dicklesworthstone/llm_introspective_compression_and_metacognition/ffbc6210ddae3cac9e26bffb09219dd7959498f0/llm_introspection_illustration.webp --------------------------------------------------------------------------------