
TurboQuant+ Meets Gemma on a Modal L40S


A few months ago I benchmarked TurboQuant+, the mixed-precision KV cache scheme from TheTom's ICLR 2026 paper, on an Apple M4. That post was about what fits in 24GB of unified memory. This one rents a Modal L40S (48GB, ~$1.95/hr), points it at three very different Gemma architectures, and finds out what breaks.

Full code, data, and plan: turboquant-eval on GitHub, branch gemma-turboquant-modal.

TL;DR

  • On Gemma 3 12B, TurboQuant+ works exactly as advertised. Turbo3 (3.5 bpv) and Turbo4 (4.25 bpv) give perplexity of 10.27 and 10.24 on wikitext-2, a shade better than the Q8_0 baseline at 10.42. Speed is flat across cache types because at pp=512 on an L40S, a 12B model is compute-bound, not memory-bound.
  • On Gemma 4 E4B and 26B A4B (MoE), the TurboQuant+ llama.cpp fork currently produces broken outputs. E4B perplexity lands around 65, the MoE produces perplexity in the thousands. The fork disables upstream attention rotation and applies its own kernel-level Walsh-Hadamard transform, which is transparent on Gemma 3 but not on Gemma 4's head-dim-512 attention path.
  • Total Modal spend: $2.15 out of the $15 I had budgeted. L40S is fast enough that 29 benchmark configs finished in a few minutes.

Why L40S

The L40S is the cheapest Modal GPU with enough VRAM to hold a 26B-class model at reasonable quantization. It sits between a MacBook and a datacenter A100, which is the tier most people actually rent when they want to run a model for a weekend project.

Modal lists it at ~$1.95/hr. The full plan wanted three phases (image build, speed sweep, perplexity sweep) budgeted at ~$12.70. Actual spend came in at a seventh of that because an L40S tears through short-context inference much faster than a MacBook does.

The lineup

| Model | Class | VRAM @ Q4 | Base quant used |
|---|---|---|---|
| Gemma 4 E4B (instruct) | Edge, Per-Layer Embeddings | ~3 GB | Q8_0 |
| Gemma 3 12B (instruct) | Dense | ~8 GB | Q4_K_M |
| Gemma 4 26B A4B (instruct) | MoE, 4B active per token | ~15 GB | UD-Q4_K_M |

One model per architecture class, each with a different attention shape. Gemma 4 E4B uses Per-Layer Embeddings with n_embd_head_k_all=512. Gemma 4 26B A4B is the MoE flagship with 26B total parameters but only 4B active per token. Gemma 3 12B is the boring dense baseline that grounds the comparison.

Benchmarks use the existing llama.cpp knobs: `llama-bench -p 512 -n 128 -ngl 99 -fa 1 -r 3` for speed, `llama-perplexity -f wiki.test.raw --ctx-size 512 --chunks 128` for quality. Six KV cache types: `f16`, `q8_0`, `q4_0`, `turbo4` (4.25 bpv), `turbo3` (3.5 bpv), and `turbo2` (new in the fork, never benchmarked in the M4 post).
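
The sweep is just the cross product of models and cache types. A minimal sketch of how I'd build that matrix — the model filenames and helper names here are my own illustration, not the repo's, though the flags mirror the commands above (`-ctk`/`-ctv` are llama-bench's KV cache type flags):

```python
from itertools import product

# Model filenames are assumed for illustration; the quants match the table above.
MODELS = {
    "gemma-4-e4b":  "gemma-4-e4b-it-Q8_0.gguf",
    "gemma-3-12b":  "gemma-3-12b-it-Q4_K_M.gguf",
    "gemma-4-26b":  "gemma-4-26b-a4b-it-UD-Q4_K_M.gguf",
}
CACHE_TYPES = ["f16", "q8_0", "q4_0", "turbo4", "turbo3", "turbo2"]

def bench_cmd(gguf: str, cache: str) -> list[str]:
    """llama-bench invocation matching the speed settings in the post."""
    return ["llama-bench", "-m", gguf,
            "-p", "512", "-n", "128", "-ngl", "99", "-fa", "1", "-r", "3",
            "-ctk", cache, "-ctv", cache]

configs = [(m, c) for m, c in product(MODELS, CACHE_TYPES)]
```

That's 18 speed configs before the perplexity side of the sweep is added on top.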

Speed: what the L40S reveals

Prompt processing speed on L40S across E4B, 12B, and 26B MoE for each KV cache type

Gemma 3 12B is flat. PP hovers around 5900 tok/s, TG around 72 to 80 tok/s, whether the cache is FP16 or Turbo3. On the M4 the same 12B-class model was memory-bound and Q4/Turbo beat FP16 by almost 3x. On L40S, memory bandwidth (864 GB/s) is plentiful enough that a 12B's KV cache at 512 tokens never becomes the bottleneck. Compute dominates. So: cache compression matters less as you move up the GPU tier, at least at short context.
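
A rough back-of-envelope makes the compute-bound claim plausible. The attention shape below is illustrative, not Gemma 3 12B's exact config; the bandwidth figure is the L40S spec quoted above:

```python
# Illustrative attention shape (not Gemma 3 12B's real config).
layers, kv_heads, head_dim, ctx = 48, 8, 256, 512
bytes_fp16 = 2  # bytes per value at f16

# K + V, full cache at 512 tokens, fp16
kv_bytes = 2 * layers * kv_heads * head_dim * ctx * bytes_fp16
kv_mb = kv_bytes / 2**20

bandwidth_gbs = 864  # L40S memory bandwidth, GB/s
t_us = kv_bytes / (bandwidth_gbs * 1e9) * 1e6  # one full cache read, µs
print(f"cache ≈ {kv_mb:.0f} MiB, one full read ≈ {t_us:.0f} µs")
```

Under these assumptions the whole cache streams through in a few hundred microseconds, so shrinking it 4x saves almost nothing next to the matmul work of a 12B forward pass at pp=512.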

Gemma 4 E4B shows the opposite. FP16 KV runs at 7630 tok/s PP, Q4_0 at 10080 tok/s, a 32% speedup. The E4B is small enough that even on L40S the cache is a noticeable slice of active memory, and shrinking it helps. PLE models apparently still benefit from KV compression.

Gemma 4 26B MoE is the speed king. 7600 PP tok/s and 129 TG tok/s, beating both smaller models on generation. The MoE sparsity advantage (4B active per token) shows up exactly where you'd guess.

Token generation speed on L40S across the same three models

Perplexity: Gemma 3 12B is the story

Perplexity on wikitext-2 across cache types for E4B, 12B, and 26B MoE on L40S
| Cache | Gemma 3 12B PPL |
|---|---|
| Q8_0 | 10.42 |
| Q4_0 | 10.36 |
| Turbo4 (4.25 bpv) | 10.24 |
| Turbo3 (3.5 bpv) | 10.27 |

On wikitext-2-raw at ctx=512 / 128 chunks, Turbo4 and Turbo3 are both a touch better than Q8_0. That's the payoff TurboQuant+ was designed to produce: lossy cache compression that preserves attention structure well enough to match the baseline, while cutting cache memory by 3.7x to 4.6x. On a card where 8GB of your 48 is eaten by model weights and another slice by activations, that difference is what turns an 8K-context run into a 32K-context one.
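
The context-scaling arithmetic behind that claim can be sketched directly. The attention shape and the 8 GiB cache budget below are illustrative assumptions, not Gemma 3 12B's real numbers; the bpv figures are the ones from the post (Q8_0 stores 8.5 bits per value once block scales are counted):

```python
# Illustrative attention shape and cache budget, not Gemma 3 12B's real config.
layers, kv_heads, head_dim = 48, 8, 256
vals_per_token = 2 * layers * kv_heads * head_dim  # K and V entries per token

budget_bytes = 8 * 2**30  # assume ~8 GiB left for cache after weights/activations
fits = {}
for name, bpv in [("f16", 16.0), ("q8_0", 8.5), ("turbo4", 4.25), ("turbo3", 3.5)]:
    bytes_per_token = vals_per_token * bpv / 8
    fits[name] = int(budget_bytes / bytes_per_token)

for name, ctx in fits.items():
    print(f"{name:7s} → ~{ctx:,} tokens of context")
```

Same budget, roughly 4.6x the context going from f16 to Turbo3 — which is the shape of the 8K-vs-32K trade described above.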

Relative speed vs FP16 baseline for each cache type, showing mostly flat 1.0x ratios on 12B

What went sideways: Gemma 4

Both Gemma 4 models come out broken under the fork:

| Cache | Gemma 4 E4B PPL | Gemma 4 26B MoE PPL |
|---|---|---|
| Q8_0 | 65.9 | 24,297 |
| Q4_0 | 64.8 | 12,396 |
| Turbo4 | 63.2 | 14,710 |
| Turbo3 | 60.9 | 34,289 |

A functional model on wikitext-2 should land in the 6 to 12 range. Gemma 3 12B at 10.4 is textbook. The Gemma 4 numbers are an order of magnitude worse on E4B and three orders of magnitude worse on the MoE. Something is fundamentally off.

Looking at the llama-perplexity startup logs, the TurboQuant+ fork prints this on every run, including Q8_0:

```
llama_kv_cache: upstream attention rotation disabled (TurboQuant uses kernel-level WHT)
llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 512
```

The fork replaces standard attention rotation with its own Walsh-Hadamard transform at the kernel level. On Gemma 3, Qwen, and presumably Llama, this substitution is transparent. On Gemma 4 something about the PLE layer (E4B) or MoE router (26B A4B) at head dim 512 doesn't play nicely with it, even when you pick Q8_0 and no actual compression is involved.
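
For intuition on why the substitution *should* be transparent: a Walsh-Hadamard transform is orthogonal up to a scale factor, so rotating before quantization and un-rotating after is mathematically a no-op when the dimensions line up. A minimal sketch of the transform itself — this is the textbook fast WHT, not the fork's kernel:

```python
import math
import random

def fwht(x):
    """Fast Walsh-Hadamard transform; length must be a power of two."""
    x = list(x)
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

v = [random.gauss(0, 1) for _ in range(512)]  # head dim 512, as in the logs
back = [y / 512 for y in fwht(fwht(v))]       # H·H = n·I, so divide by n
assert all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(back, v))
```

The round trip is exact, which is why a correct WHT path should leave even Q8_0 perplexity untouched — and why the Gemma 4 numbers point at the integration around the transform, not the transform itself.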

I didn't burn the rest of the Modal budget debugging this. The honest read is that the TurboQuant+ fork is currently validated on pre-Gemma-4 architectures and needs upstream work before it can claim Gemma 4 support.

What the numbers do and don't say

They do say:

  • Turbo3 and Turbo4 preserve quality on Gemma 3 12B. If you're running dense 7B to 14B models on a rented L40S and want 128K context to fit, this is the compression scheme worth evaluating.
  • At pp=512 on an L40S, KV cache compression won't speed up a 12B-class model. To move wallclock numbers you need longer context or a narrower memory-bandwidth envelope (i.e. a MacBook).

They don't say:

  • Anything about Gemma 4 under a working KV compression path. That needs either a fixed fork or a different implementation.
  • Anything about long context (32K+). The whole sweep ran at ctx=512, which is the regime where KV compression matters least. The obvious follow-up is running the same sweep at 32K or 128K on Gemma 3 12B.

How this was actually run

All benchmarks ran inside a single Modal app that defines the CUDA image once, caches it, and exposes two functions for speed and perplexity. The CLI runner iterates a config matrix and accumulates spend against a $14 cap, so you can't accidentally blow the budget.
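
The budget guard is the part worth copying. A sketch of the driver-side logic as I'd write it — the class and method names here are mine, not necessarily what `run_modal_bench.py` actually uses:

```python
# Hypothetical sketch of the spend-cap logic described above.
RATE_PER_HR = 1.95   # Modal's listed L40S rate
CAP_USD = 14.0       # hard cap, passed to the runner via --cap

class BudgetGuard:
    def __init__(self, cap: float = CAP_USD, rate: float = RATE_PER_HR):
        self.cap, self.rate, self.spent = cap, rate, 0.0

    def charge(self, gpu_seconds: float) -> None:
        """Record actual GPU time after a config finishes."""
        self.spent += self.rate * gpu_seconds / 3600

    def allows(self, est_gpu_seconds: float) -> bool:
        """Refuse to launch a config whose estimate could breach the cap."""
        return self.spent + self.rate * est_gpu_seconds / 3600 <= self.cap

guard = BudgetGuard()
guard.charge(600)          # a 10-minute GPU run costs ~$0.33
assert guard.allows(3600)  # another full hour still fits under $14
```

Estimating pessimistically and checking before launch, rather than after, is what makes "can't accidentally blow the budget" actually true.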

Two things that cost me an hour each and might save you time:

  1. The TheTom/turboquant_plus repo on GitHub is the paper site, not the fork. The llama.cpp fork lives at TheTom/llama-cpp-turboquant. The README in turboquant-eval had the wrong URL, which ate one image build before I noticed.
  2. The nvidia/cuda:12.4.1-devel-ubuntu22.04 image has the CUDA toolkit but not libcuda.so.1. Link step fails. Fix: symlink libcuda.so to libcuda.so.1 under /usr/local/cuda/lib64/stubs, pass -DCMAKE_EXE_LINKER_FLAGS='-Wl,-rpath-link,/usr/local/cuda/lib64/stubs' to cmake, and set LIBRARY_PATH=/usr/local/cuda/lib64/stubs for the build step. Pin -DCMAKE_CUDA_ARCHITECTURES=89 (Ada Lovelace, L40S) so cmake doesn't need a GPU at build time.
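
Assembled as build commands, the fix from item 2 looks like this. The stub paths, symlink, linker flag, and `LIBRARY_PATH` are straight from the text; the `-DGGML_CUDA=ON` flag and the rest of the cmake line are my assumptions about a standard llama.cpp CUDA build:

```python
# Sketch of the build steps from item 2; only the stub workarounds are from the post.
STUBS = "/usr/local/cuda/lib64/stubs"

build_cmds = [
    # libcuda.so.1 doesn't exist in the devel image; the stub stands in at link time
    f"ln -s {STUBS}/libcuda.so {STUBS}/libcuda.so.1",
    " ".join([
        "cmake", "-B", "build",
        "-DGGML_CUDA=ON",                 # assumed: standard llama.cpp CUDA switch
        "-DCMAKE_CUDA_ARCHITECTURES=89",  # Ada Lovelace / L40S, no GPU needed at build
        f"-DCMAKE_EXE_LINKER_FLAGS=-Wl,-rpath-link,{STUBS}",
    ]),
    f"LIBRARY_PATH={STUBS} cmake --build build -j",
]
```

In a Modal image these land as `run_commands(...)` steps, so the whole dance happens once at image build and gets cached.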

To reproduce end to end: clone the repo, run `modal token new`, then `python3 scripts/run_modal_bench.py --phase all --cap 14.0`. Full sweep finishes in under ten minutes of wall time and costs under $3.

Loose ends

The Modal credit I didn't spend is still sitting there, which is a fine problem to have. Two obvious next posts, in rough order of how useful I think they'd be:

  1. Long-context sweep (16K, 32K, 128K) on Gemma 3 12B. This is where Turbo3 vs Q4_0 should actually pull ahead on speed, not just quality, and where the compression story gets interesting.
  2. A sanity pass with upstream llama.cpp on Gemma 4, to confirm whether the perplexity comes back to the 6 to 12 range once the TurboQuant-specific attention path is out of the picture. If it does, the fork has a clear upstream issue to file.

If either one is the post you want, tell me and I'll prioritize it.