The KV Cache Is the New Bottleneck: A Practitioner's Guide to Quantizing Attention Memory
Weight quantization changed the memory profile of LLM serving. It made the fixed part of the model smaller. But long-context inference exposes a different allocation: the KV cache, which grows with every token and every concurrent sequence.
I keep hearing the same thing from engineers deploying LLMs: "We quantized the weights to 4 bits, the model fits on one GPU now, but the moment we turn on 128K context, we are out of memory again."
For long-context serving, weights are no longer the only memory problem. The KV cache is often the larger one.
This post covers the fundamentals of KV cache memory growth, why weight quantization does not address it, and a 2024-2026 wave of compression techniques that tackle it from different angles. I wrote it for my students at Columbia and for the teams I work with at IBM Research.
1. Background: attention, the KV cache, and inference phases
Each token in a transformer layer is projected through three learned weight matrices to produce a Query (Q), a Key (K), and a Value (V). The query represents what this token is looking for. The key represents what this token offers for matching. The value carries the information content to retrieve.
Attention computes similarity between the current token's Q and all previous tokens' K vectors, normalizes those scores with softmax, then takes a weighted sum of the corresponding V vectors:
Attention(Q, K, V) = softmax(Q · K^T / sqrt(d_head)) · V
Figure: Each token produces Q, K, V vectors. The new token's Q attends to all previous K vectors, takes a weighted sum of their V vectors, and produces a single output. The KV cache stores all previous K and V so they are never recomputed.
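The formula above can be sketched for a single new query in plain Python. This is a minimal illustration, not how any serving stack implements attention: the function name `attend` and the toy two-dimensional vectors are made up for the example.

```python
import math

def attend(q, cached_keys, cached_values):
    """Single-query attention: softmax(q . K^T / sqrt(d_head)) . V."""
    d_head = len(q)
    # Similarity of the current query with every cached key.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_head)
              for k in cached_keys]
    # Numerically stable softmax over the scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the cached value vectors.
    d_value = len(cached_values[0])
    return [sum(w * v[j] for w, v in zip(weights, cached_values))
            for j in range(d_value)]

# Toy cache with two previous tokens (values chosen only for illustration).
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
output = attend([4.0, 0.0], keys, values)
print(output)
```

The query lines up with the first key, so most of the output comes from the first value vector. The cached keys and values are only read, never recomputed, which is the property the KV cache exploits.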
To generate a full sequence without storing intermediate results, each decode step must re-run attention over the entire prefix. Dense self-attention over a prefix of length t costs O(t²), so generating a length-n sequence by recomputing the full prefix at every step costs roughly Σt² = O(n³). The KV cache eliminates this: previous K and V tensors are stored at write time and reused at every subsequent step, so each new token attends over O(t) cached positions. Total attention reads drop to O(n²). The cost is O(n) memory growth.
Every production serving system (vLLM, TensorRT-LLM, SGLang) uses a KV cache. Without it, autoregressive decoding beyond a few hundred tokens is impractical.
Figure: Without the cache, generating n tokens costs O(n³) total (each step re-attends over the full prefix at O(t²)). With the cache, total attention reads are O(n²) but memory grows linearly.
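To make the asymptotics concrete, the sketch below simply counts query-key score computations under the two regimes described above. The function names are hypothetical and constants such as head count and layer count are ignored.

```python
def attention_reads_without_cache(n):
    # Step t recomputes dense self-attention over the whole prefix: t * t scores.
    return sum(t * t for t in range(1, n + 1))

def attention_reads_with_cache(n):
    # Step t scores one new query against t cached keys.
    return sum(t for t in range(1, n + 1))

for n in (1_000, 2_000):
    print(n, attention_reads_without_cache(n), attention_reads_with_cache(n))
```

Doubling n multiplies the uncached count by about 8 and the cached count by about 4, which is the cubic versus quadratic behaviour in miniature.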
Prefill and decode
LLM inference has two phases with different hardware characteristics.
Prefill processes the entire input prompt in one forward pass. All prompt tokens run through the model in parallel, producing K and V for every token at every layer. These fill the KV cache. Prefill is compute-bound: GPU matrix-multiply units are fully utilized on the large token batch.
Decode generates output tokens one at a time. Each step computes Q, K, V for the single new token, appends K and V to the cache, then reads the entire cache to compute attention. Decode is memory-bandwidth-bound: the GPU spends most of its time reading cached tensors from HBM, not doing arithmetic.
Because the cache grows by one K,V pair per layer per step, generating 1,000 tokens on top of a 127K-token prompt means every decode step reads between 127K and 128K cached entries per layer, in full, to produce a single token.
Figure: Prefill processes all input tokens in parallel and fills the KV cache. Decode generates one token at a time, reading the full cache at every step. Prefill is compute-bound. Decode is memory-bandwidth-bound.
2. The memory math
The KV cache size for a single sequence is:
KV memory = 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_element
For Llama 3.1 70B in FP16:
- 80 layers, 8 KV heads (Grouped-Query Attention, GQA), head dimension 128
- At 128K context: 2 × 80 × 8 × 128 × 131,072 × 2 bytes = ~40 GiB (~43 GB) per sequence
Without grouped-query attention, that number would be 8× larger: roughly 320 GiB for one 128K sequence. Llama 3.1 70B uses GQA with 8 KV heads instead of the full 64 attention heads (64/8 = 8× reduction), which is the only reason ~40 GiB is survivable. Models that still use full multi-head attention pay the full multiplier.
That is per sequence. Serving 32 concurrent requests at that length requires 32× as much, roughly 1.25 TiB, for the KV cache alone, before model weights. With INT4 weights, a 70B model is around 35 GB, so at long context even a single sequence's KV cache exceeds the model size.
Figure: INT4 model weights stay fixed at ~35 GB. The KV cache scales linearly with both sequence length and batch size, exceeding the model size at 128K tokens.
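The arithmetic in this section is easy to reproduce. The helper below is a direct transcription of the formula, with a function name chosen for this example; it ignores paging, fragmentation, and any per-block metadata.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_element):
    # Factor 2 covers the K tensor and the V tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_element

GIB = 2**30
gqa = kv_cache_bytes(80, 8, 128, 131_072, 2)    # Llama 3.1 70B shape, FP16
mha = kv_cache_bytes(80, 64, 128, 131_072, 2)   # same model without GQA
print(f"GQA, 8 KV heads : {gqa / GIB:.0f} GiB ({gqa / 1e9:.1f} GB)")
print(f"MHA, 64 KV heads: {mha / GIB:.0f} GiB")
print(f"32 sequences    : {32 * gqa / GIB:.0f} GiB")
```

Swapping in your own layer count, KV head count, and context length is usually the quickest way to see whether the cache or the weights dominate.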
3. Why weight quantization does not fix this
Weight quantization (GPTQ, AWQ, llama.cpp GGUF) reduces model weights from FP16 to INT4 or INT8. This reduces memory bandwidth pressure during decode. But it does not touch the KV cache, which is a separate allocation that grows with every generated token.
In a typical long-context deployment:
- Model weights (INT4): ~35 GB, fixed
- KV cache (FP16): ~40 GB per sequence at 128K, growing
The more you compress the weights, the larger the KV cache becomes as a fraction of total memory. While weight quantization allows LLMs to fit on fewer GPUs, unchecked KV cache growth quickly exhausts that reclaimed memory during long-context serving.
4. Seven design patterns in KV cache compression (2024-2026)
Recent research (2024-2026) has established several distinct design patterns for KV cache compression. Earlier systems such as KIVI [10] and KVQuant demonstrated that KV cache quantization could work below 4 bits; the newer work surveyed here pushes the design space toward mixed precision, joint eviction, transform coding, and hardware-native execution.
The KV cache hurts in two ways, and different techniques target different bottlenecks:
- Capacity: it consumes HBM space, limiting context length and concurrency.
- Bandwidth: it consumes HBM bandwidth during decode, limiting tokens/sec.
Eviction reduces the number of reads. Quantization reduces bytes per read. Transform coding reduces storage and offload pressure. Hardware-native FP4 reduces both bandwidth and dequantization overhead.
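A rough way to see the bandwidth side is to divide the bytes a decode step must stream by the available HBM bandwidth. The sketch below does that for the Llama 3.1 70B shape from section 2. The bandwidth constant is an assumption chosen for illustration, and the estimate ignores weight reads, quantization scales, and kernel efficiency, so treat it as a lower bound rather than a prediction.

```python
def kv_bytes(seq_len, bits_per_element, num_layers=80, num_kv_heads=8, head_dim=128):
    # 2 tensors (K and V) per layer; defaults are the Llama 3.1 70B shape from section 2.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bits_per_element / 8

# Assumption: illustrative peak HBM bandwidth in bytes/sec, not a measured figure.
HBM_BYTES_PER_SEC = 3.35e12

def min_decode_ms(seq_len, bits_per_element):
    # Lower bound: every decode step streams the whole cache once.
    return kv_bytes(seq_len, bits_per_element) / HBM_BYTES_PER_SEC * 1e3

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit KV: {min_decode_ms(131_072, bits):.1f} ms per token (KV reads only)")
```

Halving the bits per element halves this floor, which is why quantization helps decode throughput, while eviction lowers it by shrinking `seq_len` instead.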
4.1 Uniform quantization
Quantize cached K and V tensors to lower precision (INT8, INT4, INT2) at write time, and dequantize at read time during attention.
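As a minimal sketch of the write-time and read-time steps, the code below applies asymmetric min/max quantization to one group of values. The helper names and the sample group are made up for this example, and it is not the scheme used by QServe or any particular library: real kernels pack the codes into bytes and fuse dequantization into the attention kernel.

```python
def quantize(values, bits):
    """Asymmetric uniform quantization of one group of floats to `bits`-bit codes."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0   # guard against a constant group
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def dequantize(codes, scale, zero_point):
    # Read-time step: map integer codes back to approximate floats.
    return [c * scale + zero_point for c in codes]

group = [0.12, -0.80, 0.33, 1.50, -0.05, 0.71, -1.20, 0.02]  # illustrative values
codes, scale, zero = quantize(group, bits=4)
restored = dequantize(codes, scale, zero)
max_err = max(abs(a - b) for a, b in zip(group, restored))
print(codes, f"max error {max_err:.3f}, step {scale:.3f}")
```

The worst-case error is half a quantization step, and the step is set by the group's range. That is why a single outlier in a group hurts every other value in it.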
QServe (MIT Han Lab) [3] uses W4A8KV4 (4-bit weights, 8-bit activations, 4-bit KV cache) with SmoothAttention to mitigate 4-bit KV accuracy loss. It achieves 1.2x to 3.5x higher throughput than TensorRT-LLM. Qwen1.5-72B sees 3.5x improvement on L40S GPUs. QServe is best understood as an end-to-end serving co-design result, not just a KV-cache quantizer: its reported speedups combine weight quantization, activation quantization, KV quantization, kernel design, and compute-aware layout choices.
The limitation of uniform approaches is that all attention heads receive the same bit-width. Attention heads exhibit wildly different tolerances for quantization error, as the RateQuant results below demonstrate (β ranges from 3.6 to 5.3 across heads).
4.2 Asymmetric key-value allocation
"Quantize What Counts" [2] proves that key projection matrices have systematically larger spectral and Frobenius norms than value matrices. Keys are harder to quantize than values.
A K4V2 configuration (4-bit keys, 2-bit values) retains 98.3% of the accuracy of uniform K4V4 allocation, tested across 7 models from 0.6B to 32B parameters on CoQA, EQ-Bench, and GSM8K. Note: that 98.3% is relative to the uniform K4V4 baseline within the paper's evaluation, not versus full-precision FP16 KV.
K4V2 averages 3 bits per KV element, so the ideal compression versus FP16 is 16/3 ≈ 5.3× before scale/metadata overhead. This requires no additional compute at inference time. You configure different bit-widths for K and V.
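The "before scale/metadata overhead" qualifier matters in practice. The sketch below compares the ideal ratio with one that charges each quantization group for a scale and zero point. The group size of 64 and the FP16 metadata format are assumptions for illustration, not values taken from the paper.

```python
def effective_bits(code_bits, group_size, metadata_bits=32):
    # Each group of `group_size` codes also stores a scale and a zero point
    # (assumed FP16 each, so 32 bits of metadata per group).
    return code_bits + metadata_bits / group_size

def compression_vs_fp16(key_bits, value_bits, group_size=None):
    if group_size is None:
        # Ideal case: ignore scale/metadata overhead.
        avg = (key_bits + value_bits) / 2
    else:
        avg = (effective_bits(key_bits, group_size)
               + effective_bits(value_bits, group_size)) / 2
    return 16 / avg

ideal = compression_vs_fp16(4, 2)               # 16 / 3
with_overhead = compression_vs_fp16(4, 2, 64)   # group size 64 is an assumption
print(f"ideal {ideal:.2f}x, with per-group metadata {with_overhead:.2f}x")
```

Under these assumptions the ratio drops from about 5.3× to about 4.6×, and smaller groups cost more.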
4.3 Per-head mixed-precision: RateQuant
RateQuant [6] (May 2026) applies rate-distortion theory to KV cache quantization. This is the same framework used in JPEG and video codec bit allocation.
Each attention head's quantization error follows D(b) = α·β^{−b}, where α and β vary across heads (β ranges from 3.6 to 5.3). Some heads tolerate aggressive quantization. Others do not.
RateQuant solves optimal per-head bit allocation in closed form using reverse waterfilling. On Qwen3-8B at 2.5 average bits per element, it reduces perplexity from 49.3 to 14.9 compared to KIVI's uniform allocation. That is a 70% perplexity reduction at the same memory budget, with 1.6 seconds of calibration and zero inference overhead.
Uniform bit-width assignment leaves significant quality on the table. The variance across heads is large enough that mixed precision consistently outperforms uniform quantization at the same average bit-width.
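To illustrate the kind of problem being solved, the sketch below minimises the total distortion Σ α·β^(−b) under an average-bit budget. It is not RateQuant's closed-form solution: it finds the Lagrange multiplier numerically by bisection, allows fractional bit-widths, and uses made-up α values. Only the β range comes from the text above.

```python
import math

def allocate_bits(alphas, betas, avg_bits, iters=100):
    """Continuous per-head bits minimising sum(alpha * beta**-b) at a fixed average."""
    budget = avg_bits * len(alphas)

    def bits_at(log_lam):
        # Stationarity: alpha * ln(beta) * beta**-b = lambda, clipped at zero bits.
        return [max(0.0, (math.log(alpha * math.log(beta)) - log_lam) / math.log(beta))
                for alpha, beta in zip(alphas, betas)]

    lo, hi = -100.0, 100.0   # bracket for log(lambda)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if sum(bits_at(mid)) > budget:
            lo = mid         # too many bits spent: raise the water level
        else:
            hi = mid
    return bits_at((lo + hi) / 2)

alphas = [1.0, 1.0, 4.0, 0.5]    # made-up values for illustration
betas = [3.6, 5.3, 4.0, 4.5]     # within the 3.6 to 5.3 range quoted above
bits = allocate_bits(alphas, betas, avg_bits=2.5)
uniform = sum(a * b ** -2.5 for a, b in zip(alphas, betas))
mixed = sum(a * b ** -x for a, b, x in zip(alphas, betas, bits))
print([round(x, 2) for x in bits], f"uniform {uniform:.4f} vs mixed {mixed:.4f}")
```

Because each head's distortion curve is convex in its bit-width, the allocation that equalises marginal distortion cannot do worse than the uniform one at the same budget. A deployable version would still have to round to the bit-widths the kernels support.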
4.4 Hardware-native FP4: UltraQuant on AMD CDNA4
UltraQuant [1] (AMD/UCLA/Purdue, June 2026) uses native FP4 hardware support on AMD MI355X GPUs (CDNA4 architecture). KV tensors are stored in FP4, queries in FP8, and the scaled-MFMA (Matrix Fused Multiply-Add) units consume 4-bit values directly without software dequantization.
Results for multi-turn agentic workloads (32 concurrent sessions, MiniMax-M2.5 model, TP=2):
- 3.47x P50 TTFT (Time To First Token) reduction in cache-pressured late rounds (rounds 4 to 6)
- 2.3x TTFT reduction across all rounds
- 1.63x output throughput increase over FP8 KV baseline
The benefit is strongest under cache pressure. In warm early rounds where the cache is small, FP8 is slightly faster. UltraQuant also shows accuracy regressions on AIME25, indicating that FP4's limited dynamic range degrades math-heavy tasks.
UltraQuant is not a general software-only KV quantizer. It is a hardware-aligned FP4 path whose gains depend on AMD CDNA4 scaled-MFMA support and the serving stack around it.
4.5 Unified eviction and quantization: RDKV
Most prior work treats token eviction (dropping tokens from the cache) and quantization (reducing precision) as separate problems. RDKV [7] (May 2026) treats eviction as 0-bit allocation, one endpoint of a continuous precision range.
Using reverse water-filling, RDKV assigns each cached token a precision between zero bits (eviction) and full precision. It retains 97.81% of full-cache accuracy with only 2.48% cache retention on LongBench.
At 128K context:
- 4.5x decode speedup (vs. full-cache FlashAttention-2)
- 1.9x peak memory reduction
- 9.1% average accuracy improvement over the best evaluated baseline
The joint formulation solves one optimization problem instead of two separate decisions (which tokens to evict, how to quantize the rest).
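The idea that eviction is the zero-bit end of the precision range falls out of textbook reverse water-filling for Gaussian sources, sketched below. The per-token importance scores and the water level are hypothetical, and RDKV's actual distortion model and solver are not reproduced here.

```python
import math

def reverse_waterfill(importances, water_level):
    # Textbook Gaussian reverse water-filling: a component whose variance
    # (here, importance) is below the water level receives zero bits.
    return [max(0.0, 0.5 * math.log2(w / water_level)) for w in importances]

# Hypothetical per-token importance scores for an 8-token cache.
importances = [8.0, 0.02, 0.5, 0.01, 2.0, 0.03, 0.04, 0.01]
bits = reverse_waterfill(importances, water_level=0.125)
kept = [i for i, b in enumerate(bits) if b > 0]
print(bits, f"retained {len(kept)}/{len(bits)} tokens")
```

Tokens below the water level are evicted and the rest get precision that grows with their importance, so one parameter controls both decisions. Raising the water level tightens the budget on both fronts at once.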
4.6 Transform coding: KVTC
KVTC [5] (ICLR 2026, NVIDIA/University of Warsaw) applies signal processing techniques from image and video compression to KV cache data.
Three components:
- PCA-based feature decorrelation: rotate KV vectors into a statistically independent basis (analogous to DCT in JPEG)
- Adaptive quantization: allocate more bits to high-variance components
- Entropy coding: losslessly compress the quantized symbols
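The first two components can be illustrated in two dimensions, where the PCA rotation has a closed form. The data below is synthetic and the bit rule is the generic Gaussian rate-distortion heuristic, so this is a sketch of the general transform-coding idea rather than KVTC's pipeline. Entropy coding is omitted.

```python
import math
import random

random.seed(0)
# Synthetic 2-D "KV features" with strongly correlated coordinates (illustration only).
points = []
for _ in range(2000):
    t = random.gauss(0, 1)
    points.append((t + 0.1 * random.gauss(0, 1), t + 0.1 * random.gauss(0, 1)))

def covariance(pts):
    n = len(pts)
    mx = sum(p[0] for p in pts) / n
    my = sum(p[1] for p in pts) / n
    sxx = sum((p[0] - mx) ** 2 for p in pts) / n
    syy = sum((p[1] - my) ** 2 for p in pts) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pts) / n
    return sxx, syy, sxy

sxx, syy, sxy = covariance(points)
# In 2-D, the rotation that diagonalises the covariance has a closed form.
theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
c, s = math.cos(theta), math.sin(theta)
rotated = [(c * x + s * y, -s * x + c * y) for x, y in points]
var1, var2, cross = covariance(rotated)

# Generic heuristic: each 4x gap in variance is worth one extra bit.
extra_bits = 0.5 * math.log2(var1 / var2)
print(f"var1={var1:.3f} var2={var2:.4f} cross={cross:.1e} extra bits={extra_bits:.1f}")
```

After the rotation the two components are uncorrelated and nearly all the variance sits in the first one, so a fixed bit budget is better spent there than split evenly.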
KVTC achieves up to 20x compression (40x in specific cases) while maintaining reasoning and long-context accuracy. It was tested on Llama 3, Mistral NeMo, and R1-Qwen 2.5 across eight benchmarks. It requires no changes to model parameters, only a brief calibration step.
KVTC is especially relevant when KV caches are reused, retained, offloaded, or stored across conversation turns. Its compression ratio should not be read as a direct decode-throughput multiplier in the same sense as QServe or UltraQuant.
4.7 Three-component compression: GEAR
GEAR [4] applies three separate mechanisms to three types of values in the KV cache:
- Uniform quantization for bulk entries (near-zero values)
- Low-rank matrix approximation for systematic residual error
- Sparse matrix correction for individual outlier entries
It achieves near-lossless 4-bit compression with 2.38x throughput improvement and 2.29x peak memory reduction. At 2-bit settings, throughput gains reach 5x.
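The three-part decomposition can be sketched on a toy matrix: pull out the largest entries as a sparse correction, quantize what remains, then approximate the quantization residual with a rank-1 term. Everything here is an illustrative assumption (matrix size, injected outliers, 2-bit setting, rank 1, power iteration), and it is not GEAR's implementation.

```python
import random

random.seed(0)
rows, cols = 8, 4
X = [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]
X[2][1], X[5][3] = 6.0, -4.0             # two injected outliers

# 1) Sparse part: pull the k largest-magnitude entries out of the matrix.
k = 2
flat = sorted(((abs(X[i][j]), i, j) for i in range(rows) for j in range(cols)), reverse=True)
sparse = {(i, j): X[i][j] for _, i, j in flat[:k]}
bulk = [[0.0 if (i, j) in sparse else X[i][j] for j in range(cols)] for i in range(rows)]

# 2) Uniform quantization (quantize then dequantize) of a whole matrix.
def fake_quant(M, bits):
    vals = [v for row in M for v in row]
    lo, hi = min(vals), max(vals)
    scale = (hi - lo) / ((1 << bits) - 1) or 1.0
    return [[round((v - lo) / scale) * scale + lo for v in row] for row in M]

deq = fake_quant(bulk, bits=2)
resid = [[bulk[i][j] - deq[i][j] for j in range(cols)] for i in range(rows)]

# 3) Rank-1 approximation of the residual via a few power-iteration steps.
v = [1.0] * cols
for _ in range(20):
    u = [sum(resid[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    v = [sum(resid[i][j] * u[i] for i in range(rows)) for j in range(cols)]
    norm = sum(x * x for x in v) ** 0.5 or 1.0
    v = [x / norm for x in v]
u = [sum(resid[i][j] * v[j] for j in range(cols)) for i in range(rows)]  # u = R v

def frob_err(approx):
    return sum((X[i][j] - approx[i][j]) ** 2 for i in range(rows) for j in range(cols)) ** 0.5

def reconstruct(use_low_rank, use_sparse):
    out = [[deq[i][j] + (u[i] * v[j] if use_low_rank else 0.0) for j in range(cols)]
           for i in range(rows)]
    if use_sparse:
        for (i, j), val in sparse.items():
            out[i][j] += val
    return out

err_plain = frob_err(fake_quant(X, 2))          # quantize everything, outliers included
err_sparse = frob_err(reconstruct(False, True))
err_full = frob_err(reconstruct(True, True))
print(f"plain {err_plain:.3f}  +sparse {err_sparse:.3f}  +low-rank {err_full:.3f}")
```

Removing the outliers shrinks the quantization range, which is where most of the gain comes from in this toy case. The rank-1 term then removes part of the remaining residual.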
Figure: Positions are approximate. Each paper uses different models, benchmarks, and definitions of compression. This is a conceptual trade-off map, not a normalized benchmark comparison.
5. Choosing a technique
The choice depends on context length, hardware target, and acceptable accuracy loss.
| Your constraint | Approach | Expected gain |
|---|---|---|
| General serving, want simplicity | Asymmetric K4V2 | ~5x vs FP16, <2% loss vs K4V4 baseline |
| Maximum quality at low bits | RateQuant (per-head) | 70% PPL reduction vs. uniform at same bits |
| AMD MI355X hardware | UltraQuant FP4 | 2.3 to 3.47x TTFT reduction |
| Ultra-long context (128K+) | RDKV (unified eviction+quant) | 97.8% accuracy at 2.5% cache retention |
| Maximum compression ratio | KVTC (transform coding) | 20 to 40x compression |
| Edge deployment, total budget | KV Pareto (joint weight+KV) [8] | 68 to 78% total memory reduction |
The lowest-risk starting point is asymmetric K/V allocation, because it follows a robust empirical pattern (keys are more sensitive than values), requires no calibration, and adds zero inference overhead. But I would treat K4V2 as a starting configuration, not a universal answer. The right choice depends on kernel support, calibration budget, retrieval sensitivity, and whether your bottleneck is capacity, bandwidth, or cache reuse.
These approaches are not mutually exclusive. Asymmetric K/V allocation can be combined with per-head mixed-precision, and transform coding can be combined with an eviction policy. No published system composes all of them.
6. End-to-end deployment example
Llama 3.1 70B serving a multi-turn agent workload at 128K context, 4 concurrent sequences, INT4 AWQ weights throughout:
| | Baseline (FP16 KV) | KV Pareto (INT4 KV) | RDKV (2.48% retention) |
|---|---|---|---|
| Model weights | 35 GB | 35 GB | 35 GB |
| KV cache / sequence | ~40 GB | ~10 GB | ~1 GB |
| KV cache × 4 sequences | ~160 GB | ~40 GB | ~4 GB |
| Total memory | ~195 GB | ~75 GB | ~39 GB |
| Hardware required | 3× H100 80GB | 1× H100 80GB (before runtime headroom) | 1× H100 80GB (headroom for more sessions) |
This table is capacity arithmetic, not a deployment guarantee. A real serving stack needs additional memory for runtime buffers, allocator fragmentation, paged-cache block tables, CUDA context, attention workspace, quantization scales/metadata, and scheduler overhead. Budget 10-20% headroom above these numbers.
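The same capacity arithmetic can be rerun with headroom applied. The sketch below uses the table's rounded figures and a 10% margin, the low end of the range suggested above. The helper names are made up, and the result says nothing about whether a given tensor-parallel layout is practical.

```python
import math

def total_gb(weights_gb, kv_per_seq_gb, sequences, headroom=0.0):
    # Weights plus per-sequence KV cache, inflated by a fractional safety margin.
    return (weights_gb + kv_per_seq_gb * sequences) * (1 + headroom)

def gpus_needed(total, gpu_gb=80):
    return math.ceil(total / gpu_gb)

# KV cache per sequence in GB, using the rounded numbers from the table.
configs = {"FP16 KV": 40, "INT4 KV": 10, "2.48% retention": 40 * 0.0248}
for name, kv in configs.items():
    plain = total_gb(35, kv, 4)
    padded = total_gb(35, kv, 4, headroom=0.10)
    print(f"{name:>16}: {plain:6.1f} GB -> {gpus_needed(plain)} GPU(s); "
          f"with 10% headroom {padded:6.1f} GB -> {gpus_needed(padded)} GPU(s)")
```

With 10% headroom the INT4 KV column lands at about 82.5 GB, just over a single 80 GB device, so that configuration is tighter than the raw arithmetic suggests.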
Going from 3 GPUs to 1 changes what latency targets and concurrency levels are achievable, not just cost.
A caveat on the RDKV column: 2.48% retention means 97.5% of cached tokens are evicted. For retrieval-augmented workloads where the answer depends on a specific passage buried in a long context, aggressive eviction can drop the exact tokens a RAG query needs. Test retrieval accuracy on your actual query distribution before deploying at these retention rates.
7. Caveats
Most of these results come from arXiv preprints that have not gone through formal peer review. The confirmed exceptions are KVTC (ICLR 2026) and QServe, which appeared at MLSys. I have not independently verified venue acceptance for the remaining papers.
UltraQuant's gains are specific to AMD CDNA4. QServe was benchmarked on NVIDIA A100/L40S. GEAR was tested on V100. Results do not transfer directly across hardware.
Papers report "up to" numbers under favorable conditions. UltraQuant degrades on math benchmarks. RDKV's extreme retention rates may drop retrieval-critical tokens. Test on your actual workload distribution.
No published work combines RateQuant's per-head allocation with KVTC's transform coding or RDKV's eviction. The gains from combining techniques may not be additive.
Cross-paper comparison is difficult because papers use different models, benchmarks, hardware, compression definitions, quality metrics, and serving stacks. The Pareto frontier in Fig. 2 is a conceptual map, not a normalized benchmark comparison.
Production checklist
Before deploying KV cache compression, measure on your actual traffic distribution:
- Decode tokens/sec at target context length (bandwidth sensitivity)
- Peak HBM usage under realistic concurrency (capacity)
- KV cache residency and reuse patterns across turns
- TTFT and TPOT (Time Per Output Token) separately (they respond to different optimizations)
- Long-context retrieval accuracy (especially for RAG)
- Math and code generation degradation at target bit-width
- Memory overhead of quantization scales, metadata, and runtime buffers
In a production serving stack, I would not start with the most aggressive retention number. I would first measure cache residency, decode bandwidth, and retrieval sensitivity on the actual traffic distribution. KV compression changes which tokens the model can reliably use.
8. Open questions
Can transform coding, per-head bit allocation, and eviction compose for multiplicative gains? The math is compatible. No implementation exists.
What happens to factual retrieval accuracy in RAG pipelines when the KV cache is at 2 to 4 bits? Existing benchmarks do not test this adequately.
NVIDIA Blackwell has native FP4 tensor cores. AMD CDNA4 has scaled-MFMA. If native low-precision attention becomes standard in hardware, how much of the software complexity in these techniques remains necessary?
Can we get formal error bounds on quantized attention? Calver [9] proposes runtime-certified bounded-error quantized attention, where the system guarantees that output deviation stays within a specified tolerance. If that guarantee can be made cheap enough to enforce per-token, it changes the risk calculus for aggressive quantization in production.
The gap between arXiv numbers and production metrics is real. We need deployment reports from teams running these systems at scale with real traffic distributions.
Several papers report 3x to 40x compression under their evaluated workloads, with 1-3% accuracy loss on their chosen benchmarks. What we still lack is a single system that composes per-head allocation, transform coding, and eviction into one pipeline, and production-grade retrieval benchmarks that test whether aggressive compression drops the tokens a RAG query actually needs. Those two gaps are where the next year of work will land.
References
[1] Chakrabarti, A. et al., "UltraQuant: 4-bit KV Caching for Context-Heavy Agents," arXiv:2606.20474, June 2026. (Note: this paper may not yet be indexed; verify arXiv ID before citing externally.)
[2] "Quantize What Counts: More for Keys, Less for Values," arXiv:2502.15075, February 2025.
[3] Lin, Y. et al., "QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving," arXiv:2405.04532, May 2024.
[4] Kang, Z. et al., "GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM," arXiv:2403.05527, March 2024.
[5] Staniszewski, M. and Lancucki, A., "KVTC: KV Cache Transform Coding for Efficient LLM Inference," ICLR 2026.
[6] "RateQuant: Optimal Mixed-Precision KV Cache Quantization via Rate-Distortion Theory," arXiv:2605.06675, May 2026.
[7] Zhang, Y. et al., "RDKV: Unified Token Eviction and Quantization for KV Caches via Rate-Distortion Optimization," arXiv:2605.08317, May 2026. (Note: this paper may not yet be indexed; verify arXiv ID before citing externally.)
[8] "Systems-Level Optimization of KV Cache and Model Compression for Long Context Inference," arXiv:2512.01953, December 2025.
[9] Calver, D., "Runtime-Certified Bounded-Error Quantized Attention," arXiv:2605.20868, May 2026.
[10] Liu, Z. et al., "KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache," arXiv:2402.02750, February 2024.