From Python to Production: Deploying ML Models with TensorFlow.js
Training a model in Python is the easy part. Getting it into a production web application where real users interact with it? That's where the engineering actually starts.
I spent a solid two weeks the first time I tried to ship a model to the browser, mostly because I didn't know what I didn't know. Here's everything I wish someone had told me.
The Pipeline
PyTorch/TensorFlow → SavedModel → TF.js Conversion → Optimization → Web App
Each arrow hides a bunch of pain. Let me walk through them.
Exporting Your Model
From TensorFlow/Keras
```python
model.save('my_model')
```

Then convert:

```bash
tensorflowjs_converter \
  --input_format=tf_saved_model \
  --output_format=tfjs_graph_model \
  --quantize_float16 \
  ./my_model \
  ./web_model
```

From PyTorch
PyTorch to ONNX to TF SavedModel to TF.js. Yes, it's three conversions. Yes, it's as annoying as it sounds. The full PyTorch-to-production pipeline gets even more complex when you're targeting multiple runtimes beyond just TF.js.
```python
import torch

dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=11)
```

Then use onnx-tf to convert the ONNX file to a TensorFlow SavedModel, and tensorflowjs_converter to get from there to TF.js. I've spent many hours debugging shape mismatches in this pipeline. Keep snacks nearby.
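The middle two hops can be run from the command line. A hedged sketch of the sequence, assuming `pip install onnx-tf tensorflowjs` (paths are illustrative, and onnx-tf's CLI flags have varied between versions):

```shell
# ONNX -> TensorFlow SavedModel
onnx-tf convert -i model.onnx -o saved_model

# SavedModel -> TF.js graph model
tensorflowjs_converter \
  --input_format=tf_saved_model \
  --output_format=tfjs_graph_model \
  saved_model \
  web_model
```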
Pro tip: Not every PyTorch op has an ONNX equivalent. Check the ONNX op coverage before you design your architecture, not after.
The Conversion Gotchas
Things that have bitten me:
Dynamic shapes. If your model uses dynamic input shapes, you need to specify them explicitly during conversion. TF.js graph models expect static shapes.
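A tiny sanity check I'd run on the intended input shape before converting — the helper name is mine, not part of any toolchain — that fails fast if a dimension is still dynamic:

```python
def assert_static(shape, name="input"):
    """Raise if any dimension is dynamic (None, -1, or otherwise non-concrete)."""
    bad = [i for i, d in enumerate(shape) if not isinstance(d, int) or d <= 0]
    if bad:
        raise ValueError(
            f"{name} has dynamic dims at positions {bad}: {shape}. "
            "Pin them to concrete values before converting to TF.js."
        )

assert_static([1, 224, 224, 3])       # fine
# assert_static([None, 224, 224, 3])  # raises ValueError
```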
Custom ops. Any custom layers or ops need TF.js implementations. If you used a fancy activation function from a research paper, you might need to write a custom TF.js kernel.
Batch normalization. Make sure your model is in eval mode before exporting. Training-mode BN with running statistics will give you wrong results in production. I learned this the hard way.
Optimization Checklist
Before shipping:
- Quantize to float16: halves model size with negligible accuracy loss
- Strip unused ops: conversion sometimes includes training-only ops
- Check model size: anything over 10 MB will feel slow to load
- Test on slow connections: use Chrome DevTools to throttle to 3G
- Warm up: the first inference compiles shaders, so always run a dummy input on load
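For the size check, a quick script (directory name hypothetical) that totals the converted artifacts — a TF.js graph model is a model.json plus one or more .bin weight shards:

```python
from pathlib import Path

def model_size_mb(model_dir: str) -> float:
    """Total size of model.json plus all .bin weight shards, in MB."""
    total = sum(
        p.stat().st_size
        for p in Path(model_dir).iterdir()
        if p.suffix in (".json", ".bin")
    )
    return total / (1024 * 1024)

# "web_model" is the converter's output directory; adjust to taste
if Path("web_model").is_dir():
    size = model_size_mb("web_model")
    if size > 10:
        print(f"{size:.1f} MB -- consider heavier quantization or pruning")
```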
Loading Strategy
Don't block the UI while loading the model:
```javascript
setLoading(true);
const model = await tf.loadGraphModel('/models/v2/model.json', {
  onProgress: (fraction) => updateProgressBar(fraction),
});

// Warm up: the first inference compiles WebGL shaders, so run a
// dummy input now instead of on the user's first real request
const dummy = tf.zeros(model.inputs[0].shape.map((d) => d || 1));
const warmup = model.predict(dummy);
await warmup.data();
tf.dispose([dummy, warmup]); // don't leak the warmup tensors

setLoading(false);
```

Memory Management
This is the number one source of bugs in TF.js apps. GPU memory (WebGL textures) does not get garbage collected by JavaScript.
```javascript
// WRONG: the intermediate tensor is never disposed, so GPU memory
// climbs on every call
function process(input) {
  const normalized = input.div(255); // allocates a new tensor
  return model.predict(normalized);  // `normalized` is leaked
}

// RIGHT: tf.tidy disposes every intermediate tensor created inside
// the callback and keeps only the returned one
function process(input) {
  return tf.tidy(() => {
    const normalized = input.div(255);
    return model.predict(normalized);
  });
}
```

Watch your memory usage in Chrome DevTools. If `tf.memory().numTensors` keeps climbing, you have a leak.
What I Wish Someone Told Me
The conversion is the easy part. Making the model work reliably across browsers, devices, and network conditions is the real job. If you want to understand what's actually happening when TF.js runs inference, my post on how WebGL powers browser-based ML covers the GPU layer underneath.
Test on Firefox too. WebGL behavior varies between browsers. I've had models work perfectly in Chrome and produce garbage in Firefox due to precision differences.
Users don't care about your model. They care about the experience. A slightly worse model with instant loading beats a great model behind a 15-second spinner every single time. The model optimization techniques for mobile -- quantization, pruning, distillation -- apply equally to browser deployment.
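The size math behind the float16 recommendation is easy to sanity-check: a float32 weight costs 4 bytes and a float16 weight costs 2, so quantization halves the download. A quick back-of-the-envelope calculation (parameter count is illustrative):

```python
params = 25_000_000  # roughly ResNet-50-sized, for illustration

fp32_mb = params * 4 / (1024 * 1024)  # ~95 MB
fp16_mb = params * 2 / (1024 * 1024)  # ~48 MB

print(f"float32: {fp32_mb:.1f} MB, float16: {fp16_mb:.1f} MB")
```

Even quantized, a model that size is far above a 10 MB budget, which is why pruning and distillation matter too.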
Related Posts
Demystifying WebGL for ML Engineers
You're running ML models in the browser and it's fast, but do you know why? A look at how WebGL makes GPU-accelerated inference possible on the web.
From PyTorch to Production: The Optimization Pipeline Nobody Talks About
Research papers stop at accuracy metrics. Production starts at deployment constraints. Here's the pipeline that bridges the gap.
Running Super-Resolution in the Browser with TensorFlow.js
How to take a trained super-resolution model and run it at interactive speeds in the browser, no server required.