A Practical Guide to Model Optimization for Mobile
Training a model that hits good accuracy is step one. Getting that model to run at 30fps on a phone with a 3000mAh battery is a completely different game. I've been doing a lot of this recently and wanted to share what actually works.
Why Bother?
Because a model that only runs on a datacenter GPU serves a fraction of your users. Most people have mid-range Android phones, not RTX 3090s. If your model can't run on a Snapdragon 665, you're leaving out most of India (and most of the world, honestly).
Quantization
The single biggest win. I wrote a detailed guide on quantization in practice with real numbers from models I shipped at Myelin. Most models are trained in FP32 (32-bit floating point). Converting to INT8 gives you:
- 4x smaller model size
- 2-3x faster inference on mobile CPUs
- Typically < 1% accuracy loss
```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model('my_model')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
```

That's it. Post-training quantization in TFLite is literally four lines. For most models, this is all you need.
If you need more precision, quantization-aware training simulates quantization during training so the model learns to be robust to reduced precision. More work, better results.
Pruning
Remove weights that are close to zero. Structured pruning removes entire filters/channels, giving you actual speedups (not just smaller files). Unstructured pruning gives better sparsity ratios but needs hardware support to realize speed gains.
```python
import tensorflow_model_optimization as tfmot

pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.30,
    final_sparsity=0.70,
    begin_step=1000,
    end_step=3000,
)
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    model, pruning_schedule=pruning_schedule
)
```

I've seen 50-70% sparsity with under 2% accuracy drop on image classification models. Your mileage will vary.
Knowledge Distillation
Train a small "student" model to mimic a large "teacher" model. The student learns from the teacher's soft probabilities, which carry more information than hard labels.
This works surprisingly well. A MobileNetV2 student trained to mimic a ResNet-152 teacher often beats a MobileNetV2 trained directly on the data.
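A minimal sketch of the usual distillation loss, assuming both models output logits; the temperature and alpha values here are tuning knobs, not canonical:

```python
import tensorflow as tf

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.1):
    """Blend hard-label loss with soft-label loss from the teacher."""
    # Soft targets: teacher probabilities at raised temperature,
    # which spreads mass across the "wrong" classes.
    soft_teacher = tf.nn.softmax(teacher_logits / temperature)
    soft_loss = tf.keras.losses.categorical_crossentropy(
        soft_teacher, student_logits / temperature, from_logits=True)
    hard_loss = tf.keras.losses.sparse_categorical_crossentropy(
        labels, student_logits, from_logits=True)
    # Scale the soft term by T^2 so its gradients keep
    # comparable magnitude to the hard term.
    return alpha * hard_loss + (1 - alpha) * (temperature ** 2) * soft_loss
```

During training you run a forward pass through the (frozen) teacher for each batch and feed its logits in alongside the labels.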
Architecture Choice
Sometimes the best optimization is starting with a model designed for mobile:
- MobileNetV2/V3 with depthwise separable convolutions and inverted residuals
- EfficientNet-Lite, a scaled-down EfficientNet variant adapted for TFLite
- ShuffleNet with channel shuffle operations for efficiency
Don't try to cram ResNet-152 onto a phone. Start with a mobile architecture and build up from there.
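Most of these are one line away in Keras. A sketch pulling MobileNetV3-Small from keras.applications (the input size and class count here are illustrative):

```python
import tensorflow as tf

# Untrained MobileNetV3-Small; set weights='imagenet' (and classes=1000)
# if you want pretrained weights to fine-tune from.
model = tf.keras.applications.MobileNetV3Small(
    input_shape=(224, 224, 3), weights=None, classes=10)

print(f"{model.count_params() / 1e6:.1f}M parameters")
```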
Benchmarking
Always benchmark on the actual target device. I've been burned by models that look fast on my laptop but completely choke on a phone's GPU.
```shell
# TFLite benchmark tool
adb push benchmark_model /data/local/tmp
adb shell /data/local/tmp/benchmark_model \
  --graph=model.tflite \
  --num_threads=4
```

Measure latency, memory usage, and thermal throttling. Models that run fast for 10 seconds but overheat after 30 are useless in production.
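Before reaching for adb, I sometimes sanity-check latency with the Python TFLite interpreter. A sketch with a tiny self-contained model (swap in your own .tflite bytes); desktop numbers are only a smoke test, never a substitute for on-device benchmarks:

```python
import time
import numpy as np
import tensorflow as tf

# Tiny throwaway model so the sketch runs end to end;
# normally you'd pass model_path='your_model.tflite' instead.
model = tf.keras.Sequential([tf.keras.layers.Input(shape=(8,)),
                             tf.keras.layers.Dense(4)])
tflite_bytes = tf.lite.TFLiteConverter.from_keras_model(model).convert()

interpreter = tf.lite.Interpreter(model_content=tflite_bytes)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
dummy = np.zeros(inp['shape'], dtype=inp['dtype'])

latencies = []
for _ in range(50):
    interpreter.set_tensor(inp['index'], dummy)
    start = time.perf_counter()
    interpreter.invoke()
    latencies.append((time.perf_counter() - start) * 1000)

print(f"median latency: {np.median(latencies):.3f} ms")
```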
The Optimization Pipeline
My typical workflow:
- Train full model in FP32
- Apply pruning during fine-tuning
- Export and apply post-training quantization
- Convert to TFLite
- Benchmark on target device
- If too slow, try quantization-aware training or switch architecture
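Steps 3-4 of that list in code: post-training quantization with a representative dataset, so TFLite can calibrate activation ranges for full INT8. The model and data generator here are stand-ins; in practice you'd feed a few hundred real samples from your training set:

```python
import numpy as np
import tensorflow as tf

# Stand-in for the pruned, fine-tuned model from earlier steps.
model = tf.keras.Sequential([tf.keras.layers.Input(shape=(8,)),
                             tf.keras.layers.Dense(4)])

def representative_data():
    # Use real training samples here; random data is a placeholder.
    for _ in range(10):
        yield [np.random.rand(1, 8).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
tflite_model = converter.convert()  # bytes, ready to write to disk
```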
It's iterative. Budget a solid week for optimization if you've never done it before. Also budget some chai because you'll need the patience. For the browser specifically, deploying with TensorFlow.js has its own conversion quirks on top of these optimization steps. And this whole pipeline becomes even more structured when you're taking a model from PyTorch to production at scale.
Related Posts
From PyTorch to Production: The Optimization Pipeline Nobody Talks About
Research papers stop at accuracy metrics. Production starts at deployment constraints. Here's the pipeline that bridges the gap.
TFLite vs ONNX Runtime: A Practical Edge AI Comparison
I deploy models with both TFLite and ONNX Runtime. Here's an honest comparison from someone who deals with the rough edges daily.
Model Quantization in Practice: 4x Speedup Without Losing Accuracy
Our super-resolution model went from 45MB to 11MB. Here's exactly how, with code and real numbers.