ยท5 min read

Benchmarking TurboQuant+ KV Cache Compression on Apple Silicon

Tags: llm · benchmarks · apple-silicon · quantization

I've been running local LLMs on my M4 MacBook Air (24GB) for months. The bottleneck is always the same: KV cache memory. As context windows grow, the cache eats into the budget that could hold bigger models or longer conversations. So when TheTom's TurboQuant+ paper dropped at ICLR 2026, I wanted to see if the compression actually holds up on Apple Silicon.

TurboQuant+ uses PolarQuant with Walsh-Hadamard rotation to compress the KV cache. I tested two modes: turbo4 at 4.25 bits per value (3.8x compression) and turbo3 at 3.5 bits (4.6x compression). It integrates with llama.cpp through a custom fork, so everything ran on the Metal backend.
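Those compression ratios follow directly from the bit widths against a 16-bit FP16 baseline. A quick sanity check (the per-value bit counts are from the paper; the arithmetic is mine):

```python
# Effective KV cache compression vs. an FP16 (16 bits/value) baseline.
FP16_BITS = 16.0

modes = {
    "turbo4": 4.25,  # bits per value, per the TurboQuant+ paper
    "turbo3": 3.5,
}

for name, bits in modes.items():
    ratio = FP16_BITS / bits
    print(f"{name}: {FP16_BITS}/{bits} = {ratio:.2f}x")
# turbo4: 16/4.25 ≈ 3.76x (rounds to the paper's 3.8x)
# turbo3: 16/3.5  ≈ 4.57x (rounds to the paper's 4.6x)
```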

Full eval setup and scripts: turboquant-eval on GitHub.

Test setup

  • M4 MacBook Air, 24GB RAM, 10-core GPU (roughly 19GB usable working set)
  • Qwen2.5 1.5B (Q8_0), Qwen2.5 7B (Q4_K_M), Qwen2.5 14B (Q4_K_M)
  • llama.cpp Metal backend via TheTom's TurboQuant+ fork
  • KV cache types: FP16, Q8_0, Q4_0, turbo4, turbo3
  • Prompt processing at 512 tokens (pp512), generation at 128 tokens (tg128), perplexity on wikitext-2

Speed results

Prompt processing (pp512, tok/s)

| Model | FP16 | Q8_0 | Q4_0 | turbo4 | turbo3 |
|-------|------|------|------|--------|--------|
| 1.5B  | 352.2 | 299.8 | 301.0 | 303.2 | 384.7 |
| 7B    | 37.7  | 33.6  | 38.3  | 41.8  | 41.4  |
| 14B   | 17.7  | 15.1  | 48.7  | 50.2  | 50.5  |

*Prompt processing speed comparison across 1.5B, 7B, and 14B models for each KV cache type*

Token generation (tg128, tok/s)

| Model | FP16 | Q8_0 | Q4_0 | turbo4 | turbo3 |
|-------|------|------|------|--------|--------|
| 1.5B  | 20.4 | 23.8 | 24.9 | 21.4   | 21.2   |
| 7B    | 2.7  | 3.0  | 3.1  | 3.7    | 4.9    |
| 14B   | 1.9  | 3.0  | 5.7  | 5.5    | 5.6    |

*Token generation speed comparison across 1.5B, 7B, and 14B models for each KV cache type*

Speedup heatmap

*Figure: Speedup heatmap showing turbo4 and turbo3 gains relative to FP16 baseline across all model sizes*

What the numbers mean

The speed gains scale with model size, and the gap is dramatic. On the 1.5B model, turbo4 barely moves the needle (303 vs 352 tok/s for FP16). But look at the 14B row. 50.2 vs 17.7 tok/s for prompt processing. That's 2.8x. Token generation goes from 1.9 to 5.5 tok/s. I wasn't expecting that kind of jump.
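The multipliers above come straight from the tables; here's the derivation, with the measured tok/s copied from my results:

```python
# Measured tok/s from the tables above: (pp512, tg128) per KV cache type.
results = {
    "1.5B": {"FP16": (352.2, 20.4), "turbo4": (303.2, 21.4)},
    "14B":  {"FP16": (17.7, 1.9),   "turbo4": (50.2, 5.5)},
}

for model, rows in results.items():
    pp_base, tg_base = rows["FP16"]
    pp, tg = rows["turbo4"]
    # 14B: pp512 is ~2.8x FP16; 1.5B actually loses a little ground.
    print(f"{model}: pp512 {pp / pp_base:.1f}x, tg128 {tg / tg_base:.1f}x vs FP16")
```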

Bigger models have bigger KV caches. Compressing them saves more memory bandwidth during attention. On Apple Silicon, where bandwidth is the bottleneck for generation, smaller caches mean faster inference. Simple as that.
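To put rough numbers on that: the cache stores one key and one value vector per layer per token. Using Qwen2.5-14B-style dimensions (48 layers, 8 KV heads, head dim 128 — my reading of the model config; verify against the model's own config.json before trusting these):

```python
# Rough KV cache size: 2 (K and V) x layers x kv_heads x head_dim x bytes/elem.
# Architecture numbers are an assumption for Qwen2.5-14B; check config.json.
n_layers, n_kv_heads, head_dim = 48, 8, 128

def kv_bytes_per_token(bits_per_value: float) -> float:
    return 2 * n_layers * n_kv_heads * head_dim * (bits_per_value / 8)

# Q8_0 is ~8.5 bits/value once you count the per-block scale.
for name, bits in [("FP16", 16), ("Q8_0", 8.5), ("turbo4", 4.25)]:
    per_tok = kv_bytes_per_token(bits)
    print(f"{name}: {per_tok / 1024:.0f} KiB/token, "
          f"{per_tok * 4096 / 2**30:.2f} GiB at 4096 ctx")
```

Every token generated has to stream that cache through the GPU, so a ~3.8x smaller cache is ~3.8x less memory traffic per attention pass.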

One thing worth noting: turbo3 and turbo4 match Q4_0 on speed for the 14B model. But TurboQuant+ uses PolarQuant instead of naive round-to-nearest, which should preserve quality better at the same bit width. Whether it actually does is the next question.

The quality cliff

The speed numbers had me optimistic. Then I ran perplexity.

1.5B perplexity (wikitext-2, 512 ctx, 128 chunks)

| KV Cache Type | Perplexity |
|---------------|------------|
| FP16   | 10.40   |
| Q8_0   | 10.43   |
| Q4_0   | 3,308.5 |
| turbo4 | 6,198.4 |
| turbo3 | 8,198.5 |

Q8_0 is basically identical to FP16: 10.43 vs 10.40. But everything below 8-bit is destroyed. Q4_0 hits 3,308 perplexity. Turbo4 is worse at 6,198. Turbo3 is 8,198. These aren't "slightly degraded." These are gibberish.

Why? Qwen2.5-1.5B uses Grouped Query Attention with just 2 KV heads. Both are load-bearing. When you quantize to 4 bits, the rounding errors in each head propagate through every attention computation. Models with 32 or 64 KV heads can average out those errors. With 2 heads, there's nowhere to hide.
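The raw precision gap is easy to see with a toy round-to-nearest quantizer. This mimics symmetric per-block quantization in spirit only — it is not llama.cpp's exact Q4_0/Q8_0 layout and says nothing about PolarQuant — but it shows how much error 4 bits leaves behind compared to 8:

```python
import random

def quantize_rmse(values, bits, block=32):
    """RMSE of symmetric per-block round-to-nearest quantization."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    err = 0.0
    for i in range(0, len(values), block):
        chunk = values[i:i + block]
        scale = max(abs(v) for v in chunk) / qmax or 1.0
        for v in chunk:
            q = round(v / scale)          # round-to-nearest integer level
            err += (v - q * scale) ** 2   # reconstruction error
    return (err / len(values)) ** 0.5

random.seed(0)
xs = [random.gauss(0, 1) for _ in range(4096)]
print(f"4-bit RMSE: {quantize_rmse(xs, 4):.4f}")
print(f"8-bit RMSE: {quantize_rmse(xs, 8):.4f}")
# The 4-bit error is over an order of magnitude larger -- and with only
# 2 KV heads, there's no redundancy in attention to absorb it.
```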

I didn't collect perplexity for the 7B and 14B models this round (multi-hour runtime per cache type on a laptop). The 7B has 4 KV heads, the 14B has 8, so I'd expect the degradation to be less severe. That's the next thing to test.

What I'd actually recommend

For 14B+ models on Apple Silicon where you need speed: try turbo4 or turbo3. The 2.8x prompt processing speedup is real, and you free GPU memory that the KV cache was holding.

If quality matters more than speed: Q8_0. Nearly lossless, minimal overhead. Hard to go wrong.

For small models (under 3B): don't touch sub-8-bit KV cache quantization. The KV head count is too low and the quality falls off a cliff.

If you have M5 hardware: run these benchmarks again. The M4 is using a software fallback for TurboQuant+, and native hardware support should improve the picture.

Reproducing this

Everything's in sacredvoid/turboquant-eval: benchmark scripts, plotting code, raw JSON results. The llama.cpp fork with TurboQuant+ is from TheTom/turboquant_plus.

  1. Clone the TurboQuant+ llama.cpp fork and build with Metal
  2. Download Qwen2.5 GGUFs (1.5B Q8_0, 7B Q4_K_M, 14B Q4_K_M)
  3. Run bench.sh for speed, bench_perplexity.sh for quality
  4. Run plot_results.py for charts
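If you want to slice the raw numbers yourself, llama-bench can emit JSON; the field names in this sketch (`type_k`, `avg_ts`) are an assumption based on builds I've seen — verify them against your fork's actual output before relying on this:

```python
import json

# Hypothetical llama-bench JSON shape -- check your build's output.
raw = """[
  {"type_k": "f16",    "n_prompt": 512, "n_gen": 0, "avg_ts": 17.7},
  {"type_k": "turbo4", "n_prompt": 512, "n_gen": 0, "avg_ts": 50.2}
]"""

runs = json.loads(raw)
base = next(r["avg_ts"] for r in runs if r["type_k"] == "f16")
for r in runs:
    print(f'{r["type_k"]}: {r["avg_ts"]:.1f} tok/s '
          f'({r["avg_ts"] / base:.1f}x vs FP16)')
```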

No cloud compute, no external GPU. Just a MacBook Air. That was the whole point.

What's next

The big open question: is the quality cliff I hit on the 1.5B model specific to low-KV-head architectures, or does it show up on larger models too? I need to run perplexity on the 7B and 14B to find out. I also want to test longer contexts (2048, 4096) where cache compression should matter even more. And whenever M5 hardware shows up, this whole suite gets re-run.