# Model Quantization in Practice: 4x Speedup Without Losing Accuracy
I quantize models almost every day at Myelin. It's become as routine as morning chai. We build models on beefy GPU machines, and then immediately the question is: "cool, now make it run on a phone." Quantization is the single biggest lever you have for that.
Let me walk through what actually works, with real numbers from models I've shipped.
## The Basics
Neural network weights are typically stored as FP32 (32-bit floating point). Quantization converts them to lower precision, usually INT8 (8-bit integer). Each weight goes from 4 bytes to 1 byte. That's 4x smaller right there. And because integer math is faster than floating point on most hardware, inference speeds up too.
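The conversion itself is simple affine arithmetic: pick a scale and zero point that map the observed float range onto [0, 255]. A minimal numpy sketch of the idea (illustrative only, not TFLite's internal implementation):

```python
import numpy as np

def quantize(w, num_bits=8):
    """Asymmetric affine quantization: float array -> uint8 + (scale, zero_point)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (w.max() - w.min()) / (qmax - qmin)
    zero_point = int(round(qmin - w.min() / scale))
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.default_rng(0).standard_normal(1000).astype(np.float32)
q, scale, zp = quantize(w)
err = np.abs(w - dequantize(q, scale, zp)).max()  # bounded by ~scale/2
```

In practice, weights are usually quantized per output channel rather than per tensor, which tightens the ranges and shrinks the round-trip error considerably.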
The real question is: does accuracy survive?
## Post-Training Quantization (PTQ)
This is the easy path. Train your model normally in FP32, then convert after the fact. No retraining needed.
```python
import tensorflow as tf

# Basic dynamic range quantization
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_dir')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
```

That's the bare minimum. But for best results, you want full integer quantization with a representative dataset:
```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Yield real samples so the quantizer can observe activation ranges
    for i in range(200):
        sample = load_calibration_sample(i)
        yield [sample.astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_dir')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS_INT8
]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
tflite_model = converter.convert()
```

The representative dataset is crucial. It tells the quantizer the expected range of activations at each layer so it can map floating-point ranges to INT8 properly. Use real data from your actual use case, not random noise.
## Real Numbers From Our Models
Here's what happened when I quantized different models at Myelin:
| Model | FP32 Size | INT8 Size | FP32 Latency | INT8 Latency | Accuracy Drop |
|---|---|---|---|---|---|
| Super-res (ESPCN) | 45MB | 11MB | 120ms | 35ms | PSNR -0.3dB |
| Anomaly detector | 12MB | 3MB | 45ms | 12ms | F1 -0.8% |
| Image classifier | 28MB | 7MB | 80ms | 22ms | Top-1 -0.5% |
The super-resolution model was the big one. 45MB to 11MB, 120ms to 35ms on a Snapdragon 730. That's the difference between unusable and smooth. And 0.3dB PSNR loss is visually imperceptible. I showed the FP32 and INT8 outputs side by side to my team and nobody could tell the difference. This same model, after quantization, is what we deployed to the Raspberry Pi for our industrial monitoring pipeline.
## Quantization-Aware Training (QAT)
Sometimes PTQ isn't enough. If your model is sensitive to quantization (common with very small models or models with large dynamic ranges), you need QAT.
QAT simulates quantization during training. The model learns to be robust to reduced precision.
```python
import tensorflow_model_optimization as tfmot

quantize_model = tfmot.quantization.keras.quantize_model
q_aware_model = quantize_model(model)

q_aware_model.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae']
)

q_aware_model.fit(
    train_data,
    epochs=10,  # usually just a few epochs of fine-tuning
    validation_data=val_data
)
```

In my experience, QAT recovers about half of the accuracy lost in PTQ. For our super-resolution model, QAT brought the PSNR drop from 0.3dB down to 0.15dB.
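One easy-to-miss detail: after QAT you still run the model through the converter, since the fake-quant nodes only simulate INT8 during training. A conversion sketch, same pattern as the PTQ path above:

```python
import tensorflow as tf

# q_aware_model from the QAT step above; the converter reads the
# learned quantization ranges from the fake-quant nodes
converter = tf.lite.TFLiteConverter.from_keras_model(q_aware_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
```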
## ONNX Quantization
Not everything is TensorFlow. If you're working with PyTorch models, the ONNX path is solid (I compare the TFLite and ONNX Runtime ecosystems in detail in a separate post):
```python
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    'model.onnx',
    'model_quantized.onnx',
    weight_type=QuantType.QInt8
)
```

Dynamic quantization in ONNX is quick but only quantizes weights, not activations. For static quantization (weights and activations), you need a calibration step similar to TFLite's.
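A hedged sketch of what that static path looks like. `CalibReader` here is a plain class matching the calibration-reader interface (in practice you'd subclass `onnxruntime.quantization.CalibrationDataReader`), and the sample list stands in for your own calibration data:

```python
import numpy as np

class CalibReader:
    """Mimics onnxruntime.quantization.CalibrationDataReader:
    get_next() returns {input_name: array} dicts, then None when exhausted."""
    def __init__(self, samples, input_name='input'):
        self._iter = iter([{input_name: s.astype(np.float32)} for s in samples])

    def get_next(self):
        return next(self._iter, None)

def quantize_static_sketch(model_in, model_out, samples):
    # Imported here so the reader above stays dependency-free
    from onnxruntime.quantization import QuantFormat, QuantType, quantize_static
    quantize_static(
        model_in,
        model_out,
        calibration_data_reader=CalibReader(samples),
        quant_format=QuantFormat.QDQ,   # Q/DQ node pairs; broad backend support
        activation_type=QuantType.QUInt8,
        weight_type=QuantType.QInt8,
    )

# quantize_static_sketch('model.onnx', 'model_quantized.onnx', samples)
```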
## The Gotchas
A few things I learned the hard way during late-night debugging sessions at the Myelin office:
Batch normalization layers can be tricky. Fold them into the preceding conv layers before quantization. TFLite does this automatically, but if you're using ONNX, check manually.
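Folding is just algebra: a BN layer `y = gamma * (x - mean) / sqrt(var + eps) + beta` applied after a conv can be absorbed into the conv's weights and bias. A numpy sketch using a 1x1 conv (i.e. a plain matmul) to show the arithmetic:

```python
import numpy as np

rng = np.random.default_rng(0)
cin, cout, n = 8, 16, 32
x = rng.standard_normal((n, cin)).astype(np.float32)

# Conv (1x1 == matmul) followed by batch norm
W = rng.standard_normal((cin, cout)).astype(np.float32)
b = rng.standard_normal(cout).astype(np.float32)
gamma = rng.standard_normal(cout).astype(np.float32)
beta = rng.standard_normal(cout).astype(np.float32)
mean = rng.standard_normal(cout).astype(np.float32)
var = rng.random(cout).astype(np.float32) + 0.1
eps = 1e-5

y_ref = gamma * ((x @ W + b) - mean) / np.sqrt(var + eps) + beta

# Fold BN into the conv: scale the weights, shift the bias
s = gamma / np.sqrt(var + eps)
W_folded = W * s                   # broadcasts over output channels
b_folded = (b - mean) * s + beta
y_folded = x @ W_folded + b_folded  # matches y_ref to float precision
```

One less layer to quantize means one less set of ranges to get wrong, which is why folding before quantization matters.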
Depthwise separable convolutions sometimes quantize poorly. MobileNet architectures can lose more accuracy than you'd expect. QAT helps a lot here.
Always validate on your actual target hardware. I once had a model that quantized perfectly on x86 but produced garbage on ARM because of a numerical edge case in the TFLite ARM kernel. Filed a bug, used a workaround, moved on.
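Part of that validation is numeric, not just visual. For super-res I track PSNR between the FP32 and INT8 outputs on a held-out set; a minimal sketch of that kind of check (the arrays here are stand-ins for real model outputs):

```python
import numpy as np

def psnr(a, b, max_val=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer outputs."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float('inf') if mse == 0 else 10 * np.log10(max_val ** 2 / mse)

# Stand-ins; in the real check these are FP32 and INT8 model outputs
# for the same held-out inputs, run on the target device
rng = np.random.default_rng(0)
fp32_out = rng.integers(0, 256, (256, 256, 3))
int8_out = np.clip(fp32_out + rng.integers(-2, 3, fp32_out.shape), 0, 255)

fidelity = psnr(fp32_out, int8_out)  # gate the release on a dB threshold
```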
## My Workflow
Every model at Myelin goes through this pipeline:
1. Train in FP32
2. Try PTQ first (takes 5 minutes)
3. Benchmark on target device
4. If accuracy is fine, ship it
5. If not, do QAT for a few epochs
6. If still not good enough, try mixed precision (keep sensitive layers in FP16)
Step 2 works for about 80% of our models. That's the beauty of quantization. Most of the time, it just works and you get a 4x speedup for free. For a broader view of all the optimization levers beyond just quantization, see my practical guide to model optimization for mobile.
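When mixed precision is the fallback, FP16 conversion is the simplest knob in TFLite: weights are stored at half size, and ops without FP16 kernels fall back to FP32. A converter config sketch (keeping individual layers in higher precision selectively takes more work, e.g. via QAT annotations):

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_dir')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]  # FP16 weights
tflite_model = converter.convert()
```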