TurboQuant+ Meets Gemma on a Modal L40S
A few months ago I benchmarked TurboQuant+, the mixed-precision KV cache scheme from TheTom's ICLR 2026 paper, on an Apple M4. That post was about what fits in 24GB of unified memory. This one rents a Modal L40S (48GB, ~$1.95/hr), points it at three very different Gemma architectures, and finds out what breaks.
Full code, data, and plan: `turboquant-eval` on GitHub, branch `gemma-turboquant-modal`.
TL;DR
- On Gemma 3 12B, TurboQuant+ works exactly as advertised. Turbo3 (3.5 bpv) and Turbo4 (4.25 bpv) give perplexity of 10.27 and 10.24 on wikitext-2, a shade better than the Q8_0 baseline at 10.42. Speed is flat across cache types because at pp=512 on an L40S, a 12B model is compute-bound, not memory-bound.
- On Gemma 4 E4B and 26B A4B (MoE), the TurboQuant+ llama.cpp fork currently produces broken outputs. E4B perplexity lands around 65, the MoE produces perplexity in the thousands. The fork disables upstream attention rotation and applies its own kernel-level Walsh-Hadamard transform, which is transparent on Gemma 3 but not on Gemma 4's head-dim-512 attention path.
- Total Modal spend: $2.15 out of the $15 I had budgeted. L40S is fast enough that 29 benchmark configs finished in a few minutes.
Why L40S
The L40S is the cheapest Modal GPU with enough VRAM to hold a 26B-class model at reasonable quantization. It sits between a MacBook and a datacenter A100, which is the tier most people actually rent when they want to run a model for a weekend project.
Modal lists it at ~$1.95/hr. The full plan had three phases (image build, speed sweep, perplexity sweep) budgeted at ~$12.70. Actual spend came in at about a sixth of that, because an L40S tears through short-context inference far faster than a MacBook does.
The lineup
| Model | Class | VRAM @ Q4 | Base quant used |
|---|---|---|---|
| Gemma 4 E4B (instruct) | Edge, Per-Layer Embeddings | ~3 GB | Q8_0 |
| Gemma 3 12B (instruct) | Dense | ~8 GB | Q4_K_M |
| Gemma 4 26B A4B (instruct) | MoE, 4B active per token | ~15 GB | UD-Q4_K_M |
One model per architecture class, each with a different attention shape. Gemma 4 E4B uses Per-Layer Embeddings with `n_embd_head_k_all=512`. Gemma 4 26B A4B is the MoE flagship with 26B total parameters but only 4B active per token. Gemma 3 12B is the boring dense baseline that grounds the comparison.
Benchmarks use the existing llama.cpp knobs: `llama-bench -p 512 -n 128 -ngl 99 -fa 1 -r 3` for speed, `llama-perplexity -f wiki.test.raw --ctx-size 512 --chunks 128` for quality. Six KV cache types: f16, q8_0, q4_0, turbo4 (4.25 bpv), turbo3 (3.5 bpv), and turbo2 (new in the fork, never benchmarked in the M4 post).
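The sweep is just the cross product of models and cache types. A minimal sketch of how the command lines get built, assuming the fork keeps upstream llama-bench's `-ctk`/`-ctv` cache-type flags (`bench_cmd` and the model filenames are my illustrations, not the repo's actual API):

```python
import itertools

# Illustrative GGUF names; the turbo* cache types exist only in the fork.
MODELS = [
    "gemma-3-12b-it-Q4_K_M.gguf",
    "gemma-4-e4b-it-Q8_0.gguf",
    "gemma-4-26b-a4b-it-UD-Q4_K_M.gguf",
]
CACHE_TYPES = ["f16", "q8_0", "q4_0", "turbo4", "turbo3", "turbo2"]

def bench_cmd(gguf: str, cache: str) -> list[str]:
    # -ctk/-ctv set the K and V cache types independently; keep them equal here
    return ["llama-bench", "-m", gguf, "-p", "512", "-n", "128",
            "-ngl", "99", "-fa", "1", "-r", "3", "-ctk", cache, "-ctv", cache]

matrix = [bench_cmd(m, c) for m, c in itertools.product(MODELS, CACHE_TYPES)]
```

Three models times six cache types gives 18 speed configs; the perplexity runs make up the rest of the 29.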
Speed: what the L40S reveals

Gemma 3 12B is flat. PP hovers around 5900 tok/s, TG around 72 to 80 tok/s, whether the cache is FP16 or Turbo3. On the M4 the same 12B-class model was memory-bound and Q4/Turbo beat FP16 by almost 3x. On L40S, memory bandwidth (864 GB/s) is plentiful enough that a 12B's KV cache at 512 tokens never becomes the bottleneck. Compute dominates. So: cache compression matters less as you move up the GPU tier, at least at short context.
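A back-of-envelope check of that compute-bound claim. The head counts below are my assumption for a Gemma-3-12B-ish attention shape (48 layers, 8 KV heads, head dim 256), not values read out of the GGUF:

```python
def kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=256, seq=512,
                   bytes_per_val=2):
    # K and V tensors, per layer, per KV head, fp16:
    # 2 * layers * heads * dim * seq_len * 2 bytes
    return 2 * n_layers * n_kv_heads * head_dim * seq * bytes_per_val

cache = kv_cache_bytes()               # 192 MiB at fp16 under these assumptions
read_ms = cache / 864e9 * 1e3          # one full sweep at 864 GB/s
print(f"{cache / 2**20:.0f} MiB, {read_ms:.2f} ms per full cache read")
```

Even read in full on every step, that is a fraction of a millisecond of cache traffic, which disappears behind matmul time; at pp=512 the ~8 GB of Q4_K_M weights dominate the bandwidth budget instead.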
Gemma 4 E4B shows the opposite. FP16 KV runs at 7630 tok/s PP, Q4_0 at 10080 tok/s, a 32% speedup. The E4B is small enough that even on L40S the cache is a noticeable slice of active memory, and shrinking it helps. PLE models apparently still benefit from KV compression.
Gemma 4 26B MoE is the speed king. 7600 PP tok/s and 129 TG tok/s, beating both smaller models on generation. The MoE sparsity advantage (4B active per token) shows up exactly where you'd guess.

Perplexity: Gemma 3 12B is the story

| Cache | Gemma 3 12B PPL |
|---|---|
| Q8_0 | 10.42 |
| Q4_0 | 10.36 |
| Turbo4 (4.25 bpv) | 10.24 |
| Turbo3 (3.5 bpv) | 10.27 |
On wikitext-2-raw at ctx=512 / 128 chunks, Turbo4 and Turbo3 are both a touch better than Q8_0. That's the payoff TurboQuant+ was designed to produce: lossy cache compression that preserves attention structure well enough to match the baseline, while cutting cache memory by 3.7x to 4.6x relative to f16. On a card where 8GB of your 48 are eaten by model weights and another slice by activations, that difference is what turns an 8K-context run into a 32K-context one.
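The memory arithmetic behind those ratios, with f16 at 16 bits per value (a sketch; the 8K budget is illustrative, and 8.5 bpv for q8_0 assumes llama.cpp's usual block layout of 32 int8 values plus an fp16 scale):

```python
F16_BPV = 16.0  # bits per value in the uncompressed cache

for name, bpv in [("q8_0", 8.5), ("turbo4", 4.25), ("turbo3", 3.5)]:
    ratio = F16_BPV / bpv
    # same cache memory budget -> proportionally more context tokens
    print(f"{name}: {ratio:.2f}x smaller, 8K f16 budget -> {int(8192 * ratio)} tokens")
```

Turbo4 lands at 3.76x and Turbo3 at 4.57x, which is where the "8K becomes 32K" framing comes from.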

What went sideways: Gemma 4
Both Gemma 4 models come out broken under the fork:
| Cache | Gemma 4 E4B PPL | Gemma 4 26B MoE PPL |
|---|---|---|
| Q8_0 | 65.9 | 24,297 |
| Q4_0 | 64.8 | 12,396 |
| Turbo4 | 63.2 | 14,710 |
| Turbo3 | 60.9 | 34,289 |
A functional model on wikitext-2 should land in the 6 to 12 range. Gemma 3 12B at 10.4 is textbook. The Gemma 4 numbers are an order of magnitude worse on E4B and three orders of magnitude worse on the MoE. Something is fundamentally off.
Looking at the llama-perplexity startup logs, the TurboQuant+ fork prints this on every run, including Q8_0:
```
llama_kv_cache: upstream attention rotation disabled (TurboQuant uses kernel-level WHT)
llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 512
```
The fork replaces standard attention rotation with its own Walsh-Hadamard transform at the kernel level. On Gemma 3, Qwen, and presumably Llama, this substitution is transparent. On Gemma 4, something about the PLE layer (E4B) or the MoE router (26B A4B) at head dim 512 doesn't play nicely with it, even under Q8_0, where TurboQuant's own quantizer never engages.
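For intuition about what "kernel-level WHT" means: a Walsh-Hadamard transform mixes the channels of each head before quantization, spreading outliers across the head dimension. A generic in-place sketch, not the fork's CUDA kernel (a TurboQuant-style rotation would also scale by 1/sqrt(n) to make it orthonormal):

```python
def fwht(v: list[float]) -> list[float]:
    # in-place fast Walsh-Hadamard transform; O(n log n), n must be a power of two
    n = len(v)
    assert n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                x, y = v[j], v[j + h]
                v[j], v[j + h] = x + y, x - y
        h *= 2
    return v

print(fwht([1.0, 0.0, 1.0, 0.0]))  # [2.0, 2.0, 0.0, 0.0]
```

Worth noting: 512 is a power of two, so the head dimension alone shouldn't break the transform; whatever goes wrong is in how the fork's path interacts with Gemma 4's attention, not in the WHT math itself.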
I didn't burn the rest of the Modal budget debugging this. The honest read is that the TurboQuant+ fork is currently validated on pre-Gemma-4 architectures and needs upstream work before it can claim Gemma 4 support.
What the numbers do and don't say
They do say:
- Turbo3 and Turbo4 preserve quality on Gemma 3 12B. If you're running dense 7B to 14B models on a rented L40S and want 128K context to fit, this is the compression scheme worth evaluating.
- At pp=512 on an L40S, KV cache compression won't speed up a 12B-class model. To move wallclock numbers you need longer context or a narrower memory-bandwidth envelope (i.e. a MacBook).
They don't say:
- Anything about Gemma 4 under a working KV compression path. That needs either a fixed fork or a different implementation.
- Anything about long context (32K+). The whole sweep ran at ctx=512, which is the regime where KV compression matters least. The obvious follow-up is running the same sweep at 32K or 128K on Gemma 3 12B.
How this was actually run
All benchmarks ran inside a single Modal app that defines the CUDA image once, caches it, and exposes two functions for speed and perplexity. The CLI runner iterates a config matrix and accumulates spend against a $14 cap, so you can't accidentally blow the budget.
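The cap logic is simple enough to sketch. Names here are hypothetical, not `run_modal_bench.py`'s actual internals:

```python
L40S_USD_PER_HR = 1.95  # Modal's list price for the L40S

def run_under_cap(configs, run_one, cap_usd=14.0):
    """Run configs in order, refusing to start any run once over budget.

    run_one(cfg) -> (wall_seconds, result). Cost is only known after each
    run finishes, so the cap is a stop condition, not a hard guarantee.
    """
    spent, results = 0.0, []
    for cfg in configs:
        if spent >= cap_usd:
            break
        secs, res = run_one(cfg)
        spent += secs / 3600.0 * L40S_USD_PER_HR
        results.append((cfg, res))
    return spent, results
```

Because the check happens before launch, the worst overshoot is one run's cost, which at a few minutes per config is cents.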
Two things that cost me an hour each and might save you time:
- The `TheTom/turboquant_plus` repo on GitHub is the paper site, not the fork. The llama.cpp fork lives at `TheTom/llama-cpp-turboquant`. The README in `turboquant-eval` had the wrong URL, which ate one image build before I noticed.
- The `nvidia/cuda:12.4.1-devel-ubuntu22.04` image has the CUDA toolkit but not `libcuda.so.1`. Link step fails. Fix: symlink `libcuda.so` to `libcuda.so.1` under `/usr/local/cuda/lib64/stubs`, pass `-DCMAKE_EXE_LINKER_FLAGS='-Wl,-rpath-link,/usr/local/cuda/lib64/stubs'` to cmake, and set `LIBRARY_PATH=/usr/local/cuda/lib64/stubs` for the build step. Pin `-DCMAKE_CUDA_ARCHITECTURES=89` (Ada Lovelace, the L40S architecture) so cmake doesn't need a GPU at build time.
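If you're scripting the image build, that fix reduces to a small helper. A sketch under assumptions: the helper name is mine, and `GGML_CUDA=ON` assumes a current llama.cpp CMake layout:

```python
def llama_cpp_cuda_cmake(stubs: str = "/usr/local/cuda/lib64/stubs"):
    """cmake invocation + env for building llama.cpp CUDA with no GPU present."""
    cmd = [
        "cmake", "-B", "build",
        "-DGGML_CUDA=ON",
        "-DCMAKE_CUDA_ARCHITECTURES=89",  # Ada Lovelace; skips the GPU probe
        f"-DCMAKE_EXE_LINKER_FLAGS=-Wl,-rpath-link,{stubs}",
    ]
    env = {"LIBRARY_PATH": stubs}  # lets -lcuda resolve against the driver stub
    return cmd, env
```

Run the returned command with the returned env merged into `os.environ` (e.g. via `subprocess.run(cmd, env={**os.environ, **env})`) inside the image build step.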
To reproduce end to end: clone the repo, run `modal token new`, then `python3 scripts/run_modal_bench.py --phase all --cap 14.0`. The full sweep finishes in under ten minutes of wall time and costs under $3.
Loose ends
The Modal credit I didn't spend is still sitting there, which is a fine problem to have. Two obvious next posts, in rough order of how useful I think they'd be:
- Long-context sweep (16K, 32K, 128K) on Gemma 3 12B. This is where Turbo3 vs Q4_0 should actually pull ahead on speed, not just quality, and where the compression story gets interesting.
- A sanity pass with upstream llama.cpp on Gemma 4, to confirm whether the perplexity comes back to the 6 to 12 range once the TurboQuant-specific attention path is out of the picture. If it does, the fork has a clear upstream issue to file.
If either one is the post you want, tell me and I'll prioritize it.