
Model Quantization in Practice: 4x Speedup Without Losing Accuracy

Tags: quantization, optimization, tutorial

I quantize models almost every day at Myelin. It's become as routine as morning chai. We build models on beefy GPU machines, and then immediately the question is: "cool, now make it run on a phone." Quantization is the single biggest lever you have for that.

Let me walk through what actually works, with real numbers from models I've shipped.

The Basics

Neural network weights are typically stored as FP32 (32-bit floating point). Quantization converts them to lower precision, usually INT8 (8-bit integer). Each weight goes from 4 bytes to 1 byte. That's 4x smaller right there. And because integer math is faster than floating point on most hardware, inference speeds up too.
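To make the mapping concrete, here's a minimal sketch of the affine (scale and zero-point) scheme most INT8 quantizers use. The values and helper names are illustrative, not from any framework:

```python
import numpy as np

def quantize(x, scale, zero_point):
    """Map float values to INT8 via q = round(x / scale) + zero_point."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize(q, scale, zero_point):
    """Approximate recovery of the original floats."""
    return (q.astype(np.float32) - zero_point) * scale

# Derive scale and zero-point from the observed float range
x = np.array([-0.62, 0.0, 0.31, 1.45], dtype=np.float32)
scale = (x.max() - x.min()) / 255.0            # 256 INT8 levels
zero_point = int(np.round(-128 - x.min() / scale))  # aligns x.min() with -128

q = quantize(x, scale, zero_point)
x_hat = dequantize(q, scale, zero_point)
# Round-trip error stays within one quantization step (one `scale` in float units)
```

The scale is just the float range divided by the 256 available integer levels, which is why knowing the true range of each tensor matters so much later on.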

Quantization levels from full precision down to aggressive 4-bit. Each step trades precision for size and speed, with PTQ or QAT used to preserve accuracy.

The real question is: does accuracy survive?

Post-Training Quantization (PTQ)

This is the easy path. Train your model normally in FP32, then convert after the fact. No retraining needed.

import tensorflow as tf
 
# Basic dynamic range quantization
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_dir')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

That's the bare minimum. But for best results, you want full integer quantization with a representative dataset:

import numpy as np

def representative_dataset():
    # Yield a few hundred real samples so the converter can observe activation ranges
    for i in range(200):
        sample = load_calibration_sample(i)  # your own data-loading helper
        yield [sample.astype(np.float32)]
 
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_dir')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS_INT8
]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
tflite_model = converter.convert()

The representative dataset is crucial. It tells the quantizer the expected range of activations at each layer so it can map floating point ranges to INT8 properly. Use real data from your actual use case, not random noise.
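Under the hood, calibration amounts to recording min/max statistics over real samples and deriving a scale from them. A toy sketch of that bookkeeping (the names are mine, not TFLite's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for activations captured at one layer across 200 calibration samples
calibration_activations = [rng.normal(0.5, 0.2, size=64).astype(np.float32)
                           for _ in range(200)]

# Track the observed range across the whole calibration set
lo = min(float(a.min()) for a in calibration_activations)
hi = max(float(a.max()) for a in calibration_activations)

scale = (hi - lo) / 255.0                      # one INT8 step in float units
zero_point = int(np.round(-128 - lo / scale))  # aligns lo with -128

# Calibrating on random noise instead of real data would give the wrong range,
# wasting INT8 levels on values the layer never actually produces.
```

This is why random noise fails: the quantizer would allocate its 256 levels across a range your real activations never occupy.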

Real Numbers From Our Models

Here's what happened when I quantized different models at Myelin:

| Model | FP32 Size | INT8 Size | FP32 Latency | INT8 Latency | Accuracy Drop |
| --- | --- | --- | --- | --- | --- |
| Super-res (ESPCN) | 45 MB | 11 MB | 120 ms | 35 ms | PSNR -0.3 dB |
| Anomaly detector | 12 MB | 3 MB | 45 ms | 12 ms | F1 -0.8% |
| Image classifier | 28 MB | 7 MB | 80 ms | 22 ms | Top-1 -0.5% |

The super-resolution model was the big one. 45MB to 11MB, 120ms to 35ms on a Snapdragon 730. That's the difference between unusable and smooth. And 0.3dB PSNR loss is visually imperceptible. I showed the FP32 and INT8 outputs side by side to my team and nobody could tell the difference. This same model, after quantization, is what we deployed to the Raspberry Pi for our industrial monitoring pipeline.

Quantization-Aware Training (QAT)

Sometimes PTQ isn't enough. If your model is sensitive to quantization (common with very small models or models with large dynamic ranges), you need QAT.

QAT simulates quantization during training. The model learns to be robust to reduced precision.
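The core trick is a "fake quantize" op in the forward pass: round to the INT8 grid and immediately dequantize, so the model trains against the rounding error while gradients flow through as if the op were the identity (the straight-through estimator). A framework-free sketch of that forward op:

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    """Quantize-dequantize in one step, as QAT inserts into the forward pass."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = np.round(qmin - x.min() / scale)
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return ((q - zero_point) * scale).astype(np.float32)  # float again, snapped to the INT8 grid

w = np.array([-0.5, 0.1, 0.7], dtype=np.float32)
w_q = fake_quantize(w)
# w_q is close to w but carries the rounding error the model learns to absorb
```

Because the outputs stay in float, normal backprop still works; the model simply sees quantization noise during training and adapts its weights around it.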

import tensorflow_model_optimization as tfmot
 
quantize_model = tfmot.quantization.keras.quantize_model
q_aware_model = quantize_model(model)
 
q_aware_model.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae']
)
 
q_aware_model.fit(
    train_data,
    epochs=10,  # usually just a few epochs of fine-tuning
    validation_data=val_data
)

In my experience, QAT recovers about half the accuracy lost from PTQ. For our super-resolution model, QAT brought the PSNR drop from 0.3dB down to 0.15dB.

ONNX Quantization

Not everything is TensorFlow. If you're working with PyTorch models, the ONNX path is solid (I compare the TFLite and ONNX Runtime ecosystems in detail in a separate post):

from onnxruntime.quantization import quantize_dynamic, QuantType
 
quantize_dynamic(
    'model.onnx',
    'model_quantized.onnx',
    weight_type=QuantType.QInt8
)

Dynamic quantization in ONNX is quick but only quantizes weights, not activations. For static quantization (weights and activations), you need a calibration step similar to TFLite.

The Gotchas

A few things I learned the hard way during late-night debugging sessions at the Myelin office:

Batch normalization layers can be tricky. Fold them into the preceding conv layers before quantization. TFLite does this automatically, but if you're using ONNX, check manually.
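Folding is simple algebra: for a conv followed by BN with mean mu, variance var, scale gamma, and shift beta, the fused weights are w * gamma / sqrt(var + eps) and the fused bias is (b - mean) * gamma / sqrt(var + eps) + beta. A single-channel sketch (the names are mine):

```python
import numpy as np

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm layer into the preceding conv's weights and bias."""
    factor = gamma / np.sqrt(var + eps)
    return w * factor, (b - mean) * factor + beta

# One output channel's conv params plus its BN statistics
w, b = np.array([0.2, -0.4, 0.1]), 0.05
gamma, beta, mean, var = 1.5, -0.1, 0.3, 0.04

w_f, b_f = fold_bn(w, b, gamma, beta, mean, var)

# Sanity check: conv followed by BN equals the folded conv alone
x = np.array([1.0, 2.0, 3.0])
conv = np.dot(w, x) + b
bn_out = gamma * (conv - mean) / np.sqrt(var + 1e-5) + beta
folded_out = np.dot(w_f, x) + b_f
assert np.isclose(bn_out, folded_out)
```

This is worth verifying per-channel in an exported ONNX graph before quantizing, since an unfolded BN adds an extra quantization point for no benefit.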

Depthwise separable convolutions sometimes quantize poorly. MobileNet architectures can lose more accuracy than you'd expect. QAT helps a lot here.

Always validate on your actual target hardware. I once had a model that quantized perfectly on x86 but produced garbage on ARM because of a numerical edge case in the TFLite ARM kernel. Filed a bug, used a workaround, moved on.

My Workflow

Every model at Myelin goes through this pipeline:

  1. Train in FP32
  2. Try PTQ first (takes 5 minutes)
  3. Benchmark on target device
  4. If accuracy is fine, ship it
  5. If not, do QAT for a few epochs
  6. If still not good enough, try mixed precision (keep sensitive layers in FP16)
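Step 6 starts with figuring out which layers are sensitive. A quick way: quantize-dequantize one layer's weights at a time and measure the relative error; layers with outsized error stay in higher precision. A toy version over synthetic "layer weights" (the helper and the threshold are illustrative, not a real API):

```python
import numpy as np

def int8_roundtrip_error(w):
    """Relative error introduced by an INT8 quantize-dequantize of one layer."""
    scale = (w.max() - w.min()) / 255.0
    q = np.clip(np.round(w / scale), -128, 127)
    return float(np.abs(w - q * scale).mean() / np.abs(w).mean())

rng = np.random.default_rng(42)
layers = {
    "conv1": rng.normal(0, 1.0, 1000),  # well-behaved weight distribution
    "conv2": np.concatenate([[50.0], rng.normal(0, 1.0, 999)]),  # one outlier weight
}

# Layers whose error exceeds a hand-picked threshold are candidates for FP16
sensitive = [name for name, w in layers.items()
             if int8_roundtrip_error(w) > 0.02]
```

The outlier in conv2 stretches the quantization range, so most of its weights land on far fewer effective levels; that's the classic signature of a layer worth keeping in FP16.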

Step 2 works for about 80% of our models. That's the beauty of quantization. Most of the time, it just works and you get a 4x speedup for free. For a broader view of all the optimization levers beyond just quantization, see my practical guide to model optimization for mobile.