A Practical Guide to Model Optimization for Mobile
Training a model that hits good accuracy is step one. Getting that model to run at 30fps on a phone with a 3000mAh battery is a completely different game. I've been doing a lot of this recently and wanted to share what actually works.
Why Bother?
Because a model that only runs on a datacenter GPU serves a fraction of your users. Most people have mid-range Android phones, not RTX 3090s. If your model can't run on a Snapdragon 665, you're leaving out most of India (and most of the world, honestly).
Quantization
The single biggest win. I wrote a detailed guide on quantization in practice with real numbers from models I shipped at Myelin. Most models are trained in FP32 (32-bit floating point). Converting to INT8 gives you:
- 4x smaller model size
- 2-3x faster inference on mobile CPUs
- Typically < 1% accuracy loss
```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model('my_model')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
```

That's it. Post-training quantization in TFLite is literally four lines. For most models, this is all you need.
If you need more precision, quantization-aware training simulates quantization during training so the model learns to be robust to reduced precision. More work, better results.
Pruning
Remove weights that are close to zero. Structured pruning removes entire filters/channels, giving you actual speedups (not just smaller files). Unstructured pruning gives better sparsity ratios but needs hardware support to realize speed gains.
```python
import tensorflow_model_optimization as tfmot

pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.30,
    final_sparsity=0.70,
    begin_step=1000,
    end_step=3000,
)
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    model, pruning_schedule=pruning_schedule
)
```

I've seen 50-70% sparsity with under 2% accuracy drop on image classification models. Your mileage will vary.
Knowledge Distillation
Train a small "student" model to mimic a large "teacher" model. The student learns from the teacher's soft probabilities, which carry more information than hard labels.
This works surprisingly well. A MobileNetV2 student trained to mimic a ResNet-152 teacher often beats a MobileNetV2 trained directly on the data.
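A minimal sketch of the usual distillation loss, assuming both models output logits; the temperature and alpha values here are tuning knobs, not canonical:

```python
import tensorflow as tf

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.1):
    """Blend hard-label loss with soft-label loss from the teacher."""
    # Soft targets: teacher probabilities at raised temperature,
    # which spreads mass across the "wrong" classes.
    soft_teacher = tf.nn.softmax(teacher_logits / temperature)
    soft_loss = tf.keras.losses.categorical_crossentropy(
        soft_teacher, student_logits / temperature, from_logits=True)
    hard_loss = tf.keras.losses.sparse_categorical_crossentropy(
        labels, student_logits, from_logits=True)
    # Scale the soft term by T^2 so its gradients keep
    # comparable magnitude to the hard term.
    return alpha * hard_loss + (1 - alpha) * (temperature ** 2) * soft_loss
```

During training you run a forward pass through the (frozen) teacher for each batch and feed its logits in alongside the labels.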
Architecture Choice
Sometimes the best optimization is starting with a model designed for mobile:
- MobileNetV2/V3 with depthwise separable convolutions and inverted residuals
- EfficientNet-Lite, a scaled-down EfficientNet variant adapted for TFLite
- ShuffleNet with channel shuffle operations for efficiency
Don't try to cram ResNet-152 onto a phone. Start with a mobile architecture and build up from there.
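Most of these are one line away in Keras. A sketch pulling MobileNetV3-Small from keras.applications (the input size and class count here are illustrative):

```python
import tensorflow as tf

# Untrained MobileNetV3-Small; set weights='imagenet' (and classes=1000)
# if you want pretrained weights to fine-tune from.
model = tf.keras.applications.MobileNetV3Small(
    input_shape=(224, 224, 3), weights=None, classes=10)

print(f"{model.count_params() / 1e6:.1f}M parameters")
```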
Benchmarking
Always benchmark on the actual target device. I've been burned by models that look fast on my laptop but completely choke on a phone's GPU.
```shell
# TFLite benchmark tool
adb push benchmark_model /data/local/tmp
adb shell /data/local/tmp/benchmark_model \
  --graph=model.tflite \
  --num_threads=4
```

Measure latency, memory usage, and thermal throttling. Models that run fast for 10 seconds but overheat after 30 are useless in production.
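Before reaching for adb, I sometimes sanity-check latency with the Python TFLite interpreter. A sketch with a tiny self-contained model (swap in your own .tflite bytes); desktop numbers are only a smoke test, never a substitute for on-device benchmarks:

```python
import time
import numpy as np
import tensorflow as tf

# Tiny throwaway model so the sketch runs end to end;
# normally you'd pass model_path='your_model.tflite' instead.
model = tf.keras.Sequential([tf.keras.layers.Input(shape=(8,)),
                             tf.keras.layers.Dense(4)])
tflite_bytes = tf.lite.TFLiteConverter.from_keras_model(model).convert()

interpreter = tf.lite.Interpreter(model_content=tflite_bytes)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
dummy = np.zeros(inp['shape'], dtype=inp['dtype'])

latencies = []
for _ in range(50):
    interpreter.set_tensor(inp['index'], dummy)
    start = time.perf_counter()
    interpreter.invoke()
    latencies.append((time.perf_counter() - start) * 1000)

print(f"median latency: {np.median(latencies):.3f} ms")
```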
The Optimization Pipeline
My typical workflow:
- Train full model in FP32
- Apply pruning during fine-tuning
- Export and apply post-training quantization
- Convert to TFLite
- Benchmark on target device
- If too slow, try quantization-aware training or switch architecture
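Steps 3-4 of that list in code: post-training quantization with a representative dataset, so TFLite can calibrate activation ranges for full INT8. The model and data generator here are stand-ins; in practice you'd feed a few hundred real samples from your training set:

```python
import numpy as np
import tensorflow as tf

# Stand-in for the pruned, fine-tuned model from earlier steps.
model = tf.keras.Sequential([tf.keras.layers.Input(shape=(8,)),
                             tf.keras.layers.Dense(4)])

def representative_data():
    # Use real training samples here; random data is a placeholder.
    for _ in range(10):
        yield [np.random.rand(1, 8).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
tflite_model = converter.convert()  # bytes, ready to write to disk
```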
It's iterative. Budget a solid week for optimization if you've never done it before. Also budget some chai because you'll need the patience. For the browser specifically, deploying with TensorFlow.js has its own conversion quirks on top of these optimization steps. And this whole pipeline becomes even more structured when you're taking a model from PyTorch to production at scale.
Related Posts
From PyTorch to Production: The Optimization Pipeline Nobody Talks About
Research papers stop at accuracy metrics. Production starts at deployment constraints. Here's the pipeline that bridges the gap.
TFLite vs ONNX Runtime: A Practical Edge AI Comparison
I deploy models with both TFLite and ONNX Runtime. Here's an honest comparison from someone who deals with the rough edges daily.
Model Quantization in Practice: 4x Speedup Without Losing Accuracy
Our super-resolution model went from 45MB to 11MB. Here's exactly how, with code and real numbers.