This post expands on our ASPLOS 2026 paper, "STARC: Selective Token Access with Remapping and Clustering for Efficient LLM Decoding on PIM Systems." I co-supervised this work with Prof. Liu Liu at Rensselaer Polytechnic Institute (RPI). The lead author is Zehao Fan, with contributions from Yunzhen Liu, Garrett Gagnon, Zhenyu Liu, Yayue Hou, and Hadjer Benmeziane — a collaboration across RPI, the University of Massachusetts Amherst, and IBM Research.
In this post: Why LLM inference is hitting a memory wall · What Processing-in-Memory gets right and where it falls short · How STARC bridges the gap · The hardware-algorithm co-design behind our key design choices · Results across accuracy and hardware efficiency · What this means for the field
The next breakthrough in large language model (LLM) inference is not going to come from bigger models. It is going to come from rethinking how models access memory. That is the thesis behind STARC, and the results back it up — up to 93% latency reduction and 92% energy reduction on the attention layer. Here is how we got there.
1. The Memory Wall Is the Real Bottleneck
In the current discourse around AI infrastructure, compute dominates the conversation — how many GPUs, how many FLOPs. But for LLM inference, the binding constraint is increasingly memory bandwidth, not compute.
To understand why, consider how an LLM generates text. There are two phases. Prefill is where the model processes your entire input prompt in parallel — this is compute-heavy and runs efficiently on GPUs. Decoding is where the model generates tokens one at a time, each one depending on all the tokens that came before. This is where things break down.
At every decoding step, the model needs to look back at its memory of all previous tokens to decide what to generate next. In technical terms, it reads the Key-Value (KV) cache — a data structure that stores the key and value vectors computed for every token the model has seen so far. Think of it as the model's short-term memory: each time it writes a new word, it consults everything it has written and read before to figure out what comes next. This consultation happens through an operation called attention, which computes similarity scores between the current token and every previous one.
The problem is that this operation has very low arithmetic intensity — it is a simple matrix-vector multiplication. The GPU has enormous compute capacity, but it spends most of its time waiting for the KV cache data to arrive from memory. The compute units sit idle [1]. The bottleneck is not processing power. It is the speed at which data can be moved.
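To make "very low arithmetic intensity" concrete, here is a rough, roofline-style estimate for the query-key score computation during decoding. The GPU peak figures are approximate public specs used only for scale, and the accounting ignores the softmax and the value-side multiply, so treat it as a sketch rather than a measurement.

```python
# Rough roofline-style estimate for the decode-time score computation
# q · K^T (one query vector against all cached keys) in FP16.
# GPU peak figures are approximate public specs, used only for scale.
context, head_dim = 32_768, 128

flops  = 2 * context * head_dim      # one multiply-add per key element
bytes_ = context * head_dim * 2      # each FP16 key element is read once
print(flops / bytes_)                # 1.0 FLOP per byte, regardless of context length

gpu_peak, gpu_bw = 990e12, 3.35e12   # ~H100 FP16 tensor peak and HBM bandwidth (approx.)
print(gpu_peak / gpu_bw)             # ~295 FLOPs per byte needed to keep compute busy
# At ~1 FLOP per byte, the operation is limited purely by how fast the keys
# stream out of memory; the compute units wait.
```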
As context lengths grow — 32K, 128K, now 1M+ tokens — the KV cache grows proportionally. The volume of data that must be read at every step becomes staggering. This is not a theoretical concern. It is the single biggest barrier to efficient long-context LLM serving today. The attention layer during decoding is fundamentally memory-bound, and adding more compute does not help. You need to fix how data moves.
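The volume side of the story is just as stark. Here is a back-of-the-envelope tally for a hypothetical 7B-class MHA model (32 layers, 32 KV heads, head dimension 128, FP16); these configuration values are illustrative assumptions, not our measured setup.

```python
# KV cache size and per-step read traffic for a hypothetical 7B-class MHA model.
# Every configuration value here is an illustrative assumption.
layers, kv_heads, head_dim, fp16_bytes = 32, 32, 128, 2

# Keys + values, across all layers and heads, for one token: 512 KiB.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * fp16_bytes

for context in (32_768, 131_072, 1_048_576):
    cache_gib = context * kv_bytes_per_token / 2**30
    # Dense attention streams the whole cache once per generated token;
    # 3.35 TB/s is roughly one H100's HBM bandwidth (ignoring whether the
    # largest cache even fits on a single device).
    ms_per_step = context * kv_bytes_per_token / 3.35e12 * 1e3
    print(f"{context:>9} tokens -> {cache_gib:6.1f} GiB cache, "
          f"~{ms_per_step:5.1f} ms of pure memory reads per token")
```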
2. Processing-in-Memory Gets You Halfway There
Processing-in-Memory (PIM) is an architectural approach that addresses the data movement problem at its root: instead of moving data from memory to the processor, you place lightweight compute units directly inside the memory chips. The data stays where it is, and computation happens right next to it.
For attention — which consists of reading KV vectors from memory and computing simple dot products — PIM is a natural fit. In our system, which builds on the AttAcc architecture [2], the workload partitions cleanly. Compute-intensive operations like Query-Key-Value (QKV) generation and Feed-Forward Networks (FFNs) stay on the GPU, where they run efficiently. The memory-bound attention layer gets offloaded to PIM units sitting inside High Bandwidth Memory (HBM) — the same memory where the KV cache already resides. No costly data transfers across the memory bus. The internal bandwidth of our HBM-PIM configuration reaches 242 TB/s — for comparison, an NVIDIA H100 GPU delivers about 3.35 TB/s of HBM bandwidth, and even the B200 peaks at around 8 TB/s. PIM operates at 30–70x the bandwidth available through the external memory interface.
The problem that had not been solved: existing PIM designs assume dense attention, meaning the model reads every single token in the KV cache at every step. That works for short contexts. For long contexts — 16K, 32K tokens and beyond — you end up activating tens of thousands of memory rows per decoding step. Each row activation costs energy. Each row switch costs time. PIM's row-level granularity — the fact that it reads data one full row at a time — is its strength for dense workloads, but becomes its weakness when most of the data in those rows is irrelevant.
PIM gives you the bandwidth. But without a way to skip irrelevant data, you are still doing far too much work.
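A quick count shows what "far too much work" means in rows. The 16-vectors-per-row figure is the layout described later in this post; the per-head, per-layer bookkeeping here is my own simplification.

```python
# Row activations implied by dense attention, assuming 16 KV vectors per PIM row.
# The tally is per attention head and per layer, and multiplies across both.
tokens_per_row = 16
context = 32_768

rows_keys_per_head = context // tokens_per_row   # 2,048 rows for one head's keys
rows_kv_per_head   = 2 * rows_keys_per_head      # keys + values: 4,096 rows
print(rows_keys_per_head, rows_kv_per_head)
# Every activation costs energy and row-switch time, and most of the data
# it pulls in is irrelevant to the current query.
```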
3. Sparsity Sounds Great — Until You Try It on PIM
Research has shown that attention is naturally sparse. In many cases, fewer than 10% of tokens contribute meaningfully to the model's output. The rest receive near-zero attention weights. The natural response is: only fetch the tokens that actually matter.
The field has explored three approaches to sparse attention, and each has a fundamental limitation when deployed on PIM hardware:
| Approach | How it works | The limitation on PIM |
|---|---|---|
| Static sparsity | Keep a fixed window of recent tokens | Misses critical long-range dependencies entirely |
| Token-wise dynamic (SparQ [4], InfiniGen [5]) | Select the most relevant tokens for each query | Best accuracy — but selected tokens scatter across hundreds of memory rows. PIM must activate every row containing at least one relevant token, processing large amounts of irrelevant data along the way. |
| Page-wise dynamic (Quest [3]) | Group tokens into fixed-size pages, select relevant pages | Aligns well with PIM rows — but pages are defined by position in the sequence, not by semantic relevance. Our analysis shows most pages contain only 1–2 important tokens out of 16. The rest are wasted computation. |
The counterintuitive part: token-wise sparsity can reduce the number of relevant tokens by over 90%, yet PIM barely benefits. The tokens you need are spread across so many memory rows that the hardware still activates nearly every row in the bank. The sparsity is real at the algorithm level but invisible at the hardware level.
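A quick sanity check makes this concrete. Assuming 16 KV vectors per row (the layout described in the next section) and selected tokens scattered roughly uniformly, which is a simplification, the expected fraction of rows a PIM bank must activate is:

```python
# Expected fraction of PIM rows activated when a fraction p of tokens is
# selected token-wise, assuming 16 KV vectors per row and roughly uniform
# scatter of the selected tokens (a simplification).
tokens_per_row = 16

for p in (0.05, 0.10, 0.20):
    rows_touched = 1 - (1 - p) ** tokens_per_row
    print(f"select {p:4.0%} of tokens -> activate ~{rows_touched:4.0%} of rows")

# select   5% of tokens -> activate ~ 56% of rows
# select  10% of tokens -> activate ~ 81% of rows
# select  20% of tokens -> activate ~ 97% of rows
```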
This mismatch between dynamic sparsity and rigid PIM data layouts is the fundamental barrier we set out to address. The question we asked was: what if you could reorganize the KV cache so that the tokens PIM needs to fetch are already sitting together in the same memory rows?
4. STARC: Making Sparsity Hardware-Visible
The core idea behind STARC is straightforward: cluster semantically similar KV pairs and physically co-locate them in contiguous memory rows. When the model identifies which tokens matter for a given query, those tokens are already packed together — and PIM can skip entire rows of irrelevant data.
Here is how it works in practice:
First, block partitioning. After the prefill phase, we divide the KV cache into non-overlapping blocks of 64 tokens. The number 64 is not arbitrary — it is derived directly from the hardware. Each PIM row holds 16 vectors, and we use 4 clusters per block: 4 × 16 = 64.
Second, local clustering. Within each 64-token block, we run K-means clustering — a standard algorithm that groups similar data points together — using cosine similarity as the distance metric. We use K=4 clusters and up to 16 iterations. Clustering is applied to the key vectors only; the corresponding value vectors inherit the same cluster assignments. The resulting clusters are stored in contiguous physical memory locations that align with PIM bank rows. One cluster maps to one row.
Third, query-guided retrieval. At each decoding step, the current query vector is compared against all cluster centroids (the representative center of each cluster). Clusters are ranked by their similarity score. The top-ranked clusters are retrieved until the KV budget is reached. Because each cluster maps to a contiguous memory region, PIM fetches exactly what it needs — no wasted row activations.
Fourth, incremental updates during decoding. Newly generated tokens remain unclustered until 64 of them accumulate. Then we cluster that batch, append it to the existing clusters, and continue. Clusters, once formed, are never updated. This append-only design avoids the costly reshuffling of data that is already laid out in memory.
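Putting the four steps together, here is a minimal NumPy sketch of the remapping and retrieval logic. It is a functional illustration of the description above, not the paper's implementation: the names, the handling of uneven cluster sizes, and the budget bookkeeping are my own simplifications, and in the real system the clustering runs inside PIM (section 6), not in Python.

```python
import numpy as np

BLOCK, K, ITERS = 64, 4, 16   # tokens per block, clusters per block, K-means iteration cap

def cosine_kmeans(keys, k=K, iters=ITERS, seed=0):
    """Spherical K-means: cosine similarity via dot products on unit-norm vectors."""
    rng = np.random.default_rng(seed)
    kn = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    centroids = kn[rng.choice(len(kn), size=k, replace=False)]
    for _ in range(iters):
        assign = (kn @ centroids.T).argmax(axis=1)       # nearest centroid for each key
        for c in range(k):
            members = kn[assign == c]
            if len(members):                             # empty clusters keep their centroid
                m = members.mean(axis=0)
                centroids[c] = m / np.linalg.norm(m)
    return assign, centroids

def cluster_block(block_keys, block_values):
    """Cluster one 64-token block on the keys; values inherit the assignments.
    Each cluster is meant to occupy contiguous memory (ideally one PIM row);
    real cluster sizes vary, so row alignment and padding are handled by the system."""
    assign, centroids = cosine_kmeans(block_keys)
    return [{"centroid": centroids[c],
             "keys": block_keys[assign == c],
             "values": block_values[assign == c]} for c in range(K)]

def build_clustered_cache(keys, values):
    """Post-prefill remapping: partition into 64-token blocks and cluster each one.
    Tokens beyond the last full block stay unclustered until 64 accumulate (append-only)."""
    clusters, n_full = [], (len(keys) // BLOCK) * BLOCK
    for s in range(0, n_full, BLOCK):
        clusters += cluster_block(keys[s:s + BLOCK], values[s:s + BLOCK])
    return clusters, keys[n_full:], values[n_full:]

def retrieve(query, clusters, budget=1024):
    """Query-guided retrieval: rank clusters by centroid similarity and fetch whole
    clusters (contiguous rows) until the token budget is reached."""
    q = query / np.linalg.norm(query)
    scores = np.array([c["centroid"] @ q for c in clusters])
    picked_k, picked_v, used = [], [], 0
    for idx in np.argsort(-scores):
        if used >= budget:
            break
        picked_k.append(clusters[idx]["keys"])
        picked_v.append(clusters[idx]["values"])
        used += len(clusters[idx]["keys"])
    return np.concatenate(picked_k), np.concatenate(picked_v)

# Illustrative usage: a 4096-token prompt, 128-dim heads.
rng = np.random.default_rng(1)
ks, vs = rng.standard_normal((4096, 128)), rng.standard_normal((4096, 128))
cache, tail_k, tail_v = build_clustered_cache(ks, vs)
sel_k, sel_v = retrieve(rng.standard_normal(128), cache, budget=1024)
print(len(cache), sel_k.shape)   # 256 clusters; ~1024 keys retrieved (whole clusters)
```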
STARC does not change the attention algorithm itself. It changes where data lives in physical memory. That single change — data layout — turns theoretical sparsity into actual hardware savings.
5. Why K=4 — The Co-Design Principle
K=4 is not a hyperparameter we tuned on a validation set. It was derived from first principles through hardware-algorithm co-design.
The arithmetic intensity of K-means clustering — the ratio of computation to memory traffic — scales linearly with K when using FP16 (half-precision floating point) data. Our HBM-PIM system, built on the AttAcc architecture with 40 HBM stacks, 1280 pseudo-channels, and 873 TFLOPs/s of peak compute against 242 TB/s of internal bandwidth, has a compute-to-memory tipping point of approximately 4 FLOPs per byte.
The design rule follows directly: set K equal to this tipping point. K=4 ensures the clustering workload operates at the hardware's exact balance point between being limited by memory bandwidth and being limited by compute throughput. A lower K wastes available compute capacity. A higher K creates a memory bandwidth bottleneck during clustering itself.
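Spelled out with the numbers quoted above (the FLOP-per-byte accounting for the assignment step is a simplification that ignores the small centroid traffic):

```python
# The balance point, from the peak figures quoted above.
peak_flops = 873e12   # PIM peak compute, FLOPs/s
bandwidth  = 242e12   # PIM internal bandwidth, bytes/s
print(f"balance point: {peak_flops / bandwidth:.1f} FLOPs per byte")   # ~3.6, i.e. roughly 4

# Assignment step of K-means in FP16: each d-dimensional key (2*d bytes streamed
# from memory) is compared against K centroids, costing about 2*K*d multiply-
# accumulate FLOPs, so arithmetic intensity = 2*K*d / (2*d) = K FLOPs per byte.
# Setting K = 4 puts clustering at the balance point: smaller K leaves compute
# idle, larger K makes the clustering itself bandwidth-limited.
```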
This is what hardware-algorithm co-design means in practice. The algorithm parameter is not chosen for algorithmic reasons — it is chosen because the hardware demands it.
6. Clustering Inside Memory — No GPU Round-Trip
A critical design decision in STARC is that all clustering happens inside PIM, not on the GPU. If you move KV vectors back to the GPU for clustering and then return the results to PIM, you introduce a round-trip across the memory interface — exactly the bottleneck PIM was designed to eliminate.
We mapped all three phases of cosine-based K-means onto existing PIM compute primitives:
- Normalization uses self dot-product operations (MAC_AB) plus a fused VNORM command — a lightweight lookup-table-based approximation of the reciprocal square-root that reuses the PIM architecture's existing scaling datapath
- Assignment compares each vector against all K centroids using MAC_AB operations, gathering similarity scores with MVSB (move-to-softmax-buffer) commands
- Update broadcasts vectors to all banks via MVGB (move-from-global-buffer) and accumulates cluster sums via MAC_AB
The only hardware addition is VNORM, and it reuses existing circuitry. No new silicon. No additional chip area. The entire clustering pipeline runs with primitives the PIM architecture already provides. This was a non-negotiable requirement — if an optimization demands new hardware, it faces years of delay before reaching production.
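For readers who want to connect the command names to the math, the following NumPy sketch models what each phase computes. It mimics the dataflow only: the command mnemonics are the ones listed above, the lookup-table reciprocal square root is a generic illustration of the idea behind VNORM rather than the actual circuit, and nothing here models banks, timing, or the real microarchitecture.

```python
import numpy as np

# Functional (software-only) model of the three clustering phases, annotated with
# the PIM primitives each phase maps to. Illustrative only.

RSQRT_LUT = 1.0 / np.sqrt(np.linspace(0.5, 256.0, 4096))   # coarse 4K-entry table

def vnorm(sq_norms):
    """Approximate 1/sqrt(x) via table lookup (stand-in for the fused VNORM command)."""
    idx = np.clip(((sq_norms - 0.5) / 255.5 * 4095).astype(int), 0, 4095)
    return RSQRT_LUT[idx]

def normalize(keys):
    """Phase 1: self dot-products (MAC_AB) + reciprocal-sqrt scaling (VNORM path)."""
    sq = np.einsum("nd,nd->n", keys, keys)
    return keys * vnorm(sq)[:, None]

def assign(keys_n, centroids):
    """Phase 2: key-vs-centroid dot products (MAC_AB); scores gathered via MVSB, then argmax."""
    return (keys_n @ centroids.T).argmax(axis=1)

def update(keys_n, assignment, k=4):
    """Phase 3: centroids broadcast to banks (MVGB); per-cluster sums accumulated (MAC_AB)."""
    centroids = np.zeros((k, keys_n.shape[1]))
    for c in range(k):
        members = keys_n[assignment == c]
        if len(members):
            s = members.sum(axis=0)
            centroids[c] = s / np.linalg.norm(s)   # renormalize for cosine K-means
    return centroids
```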
7. The Results
We evaluated STARC on three representative LLMs: LongChat-7B (which uses Multi-Head Attention, or MHA), LLaMA-3.1-8B (which uses Grouped-Query Attention, or GQA), and Mistral-7B (also GQA). Testing spanned LongBench [6] (16 diverse datasets), the RULER benchmark [7] (13 tasks at 32K context length), and PG-19 language modeling. Hardware performance was measured using the AttAcc simulator on a DGX+AttAcc platform with 8 NVIDIA H100 GPUs and 40 HBM3 stacks. The KV cache budget was set to 1024 tokens unless otherwise noted.
Accuracy — comparable to the best, well ahead of the rest
| Method | LongChat Avg. | Mistral Avg. | LLaMA-3.1 Avg. |
|---|---|---|---|
| Full KV (no sparsity) | 33.66 | 46.89 | 39.46 |
| STARC | 33.38 | 46.29 | 39.71 |
| SparQ (token-wise) | 34.55 | 47.77 | 39.76 |
| InfiniGen (token-wise) | 33.98 | 46.53 | 39.51 |
| Quest (page-wise) | 32.82 | 44.57 | 36.38 |
STARC is within a point of the best token-wise methods and significantly outperforms page-wise Quest across all models. On LLaMA-3.1, STARC actually exceeds the full KV baseline on average. On RULER at 32K context, STARC achieves 87.3% average accuracy — close to full-KV (88.1%) and SparQ (88.3%), and well ahead of Quest (78.5%).
Hardware efficiency — where the impact is most pronounced
Attention layer (vs. full KV retrieval):
- Up to 93% latency reduction
- Up to 92% energy reduction
Attention layer (vs. token-wise sparsity such as SparQ):
- Up to 78% latency reduction
- Up to 65% energy reduction
End-to-end decoding (vs. full KV):
- 25%–48% lower total decoding latency
- 34%–56% less energy consumption
The clustering overhead amounts to approximately 0.02% of total decoding time and energy. Because STARC clusters each token exactly once and never re-clusters, the cost scales linearly — not quadratically — with context length. At 32K tokens, it is negligible.
STARC approaches the hardware efficiency of page-wise sparsity — the theoretical best case for PIM row alignment — while preserving the accuracy of fine-grained token-wise methods. You do not have to choose between hardware efficiency and model quality.
8. What This Means for the Field
STARC is one paper, but it illustrates a broader principle that I believe our field needs to embrace more fully.
For years, the machine learning community and the systems community have operated in parallel. ML researchers optimize algorithms assuming uniform, unlimited memory. Systems researchers optimize hardware assuming fixed, dense workloads. Neither assumption holds anymore. The models are too large, the contexts are too long, and the gap between peak hardware capability and actual utilization has grown too wide to ignore.
Here is what I take away from this work:
Memory layout is a first-class optimization target. Most ML systems research treats memory as a black box. STARC demonstrates that controlling where data lives in physical memory — not just what data to access — can yield order-of-magnitude improvements. This remains a largely underexplored design dimension.
Co-design means the algorithm adapts to the hardware. We did not select K=4 and hope it would work well. We derived it from the hardware's arithmetic intensity tipping point. That is the level of integration this problem demands. If an algorithm ignores the hardware it runs on, it leaves an order of magnitude of performance on the table.
Sparsity without hardware alignment is a mirage. Token-wise sparsity reduces the number of relevant tokens by over 90%. On PIM, it barely changes the hardware cost because the access patterns do not align with memory row boundaries. STARC makes sparsity hardware-visible — and that is the difference between a paper result and a deployable system.
Looking ahead, as context windows push past 1M tokens, the KV cache challenge will only intensify, and the value of intelligent data placement will only grow. STARC is orthogonal to quantization and eviction-based KV cache compression — combining these techniques opens a path to genuinely efficient long-context serving. We are also exploring adaptive re-clustering during decoding, which could further improve accuracy on tasks with significant distribution shift between the prompt and the generated output.
The code is open source. We welcome the community to build on this work.
Paper: STARC: Selective Token Access with Remapping and Clustering for Efficient LLM Decoding on PIM Systems — ASPLOS 2026
Code: github.com/EPIC-RPI/STARC
Authors: Zehao Fan, Yunzhen Liu, Garrett Gagnon, Zhenyu Liu, Yayue Hou, Hadjer Benmeziane, Kaoutar El Maghraoui, Liu Liu
Acknowledgments: This work was supported in part by the RPI-IBM Future of Computing Research Collaboration and the National Science Foundation under Award Number 2442271.
References
For a complete list of references, see the full paper. Key references cited in this post:
[1] Kwon, W., et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023.
[2] Park, J., et al. (2024). AttAcc! Unleashing the Power of PIM for Batched Transformer-Based Generative Model Inference. ASPLOS 2024.
[3] Tang, J., et al. (2024). Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference. arXiv:2406.10774.
[4] Ribar, L., et al. (2023). SparQ Attention: Bandwidth-Efficient LLM Inference. arXiv:2312.04985.
[5] Lee, W., et al. (2024). InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management. OSDI 2024.
[6] Bai, Y., et al. (2024). LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding. ACL 2024.
[7] Hsieh, C.-P., et al. (2024). RULER: What's the Real Context Size of Your Long-Context Language Models? arXiv:2404.06654.