
From Python to Production: Deploying ML Models with TensorFlow.js

Tags: tensorflow-js · deployment · machine-learning · web

Training a model in Python is the easy part. Getting it into a production web application where real users interact with it? That's where the engineering actually starts.

I spent a solid two weeks the first time I tried to ship a model to the browser, mostly because I didn't know what I didn't know. Here's everything I wish someone had told me.

The Pipeline

PyTorch/TensorFlow → SavedModel → TF.js Conversion → Optimization → Web App

Each arrow hides a bunch of pain. Let me walk through them.

The end-to-end pipeline for getting a trained Python model running in a user's browser via TensorFlow.js.

Exporting Your Model

From TensorFlow/Keras

model.save('my_model')  # TF2 SavedModel format (a directory, not a single file)

Then convert:

tensorflowjs_converter \
  --input_format=tf_saved_model \
  --output_format=tfjs_graph_model \
  --quantize_float16 \
  ./my_model \
  ./web_model
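
That --quantize_float16 flag is where most of the size win comes from: each weight drops from 4 bytes to 2. A back-of-envelope estimate (hypothetical helper; ignores the topology JSON and any ops left at full precision):

```javascript
// Rough size estimate: float32 weights are 4 bytes each, float16 are 2.
function estimateModelBytes(numParams, bytesPerWeight = 4) {
  return numParams * bytesPerWeight;
}

// A 5M-parameter model: ~20 MB at float32, ~10 MB after float16 quantization.
estimateModelBytes(5e6);    // 20000000
estimateModelBytes(5e6, 2); // 10000000
```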

From PyTorch

PyTorch to ONNX to TF SavedModel to TF.js. Yes, it's three conversions. Yes, it's as annoying as it sounds. The full PyTorch-to-production pipeline gets even more complex when you're targeting multiple runtimes beyond just TF.js.

import torch

# model is your trained torch.nn.Module
model.eval()  # export in inference mode (see the batch-norm gotcha below)
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=11)

Then use onnx-tf to convert to TensorFlow, then tensorflowjs_converter to get to TF.js. I've spent many hours debugging shape mismatches in this pipeline. Keep snacks nearby.

Pro tip: Not every PyTorch op has an ONNX equivalent. Check the ONNX op coverage before you design your architecture, not after.

The Conversion Gotchas

Things that have bitten me:

Dynamic shapes. If your model uses dynamic input shapes, you need to specify them explicitly during conversion. TF.js graph models expect static shapes.

Custom ops. Any custom layers or ops need TF.js implementations. If you used a fancy activation function from a research paper, you might need to write a custom TF.js kernel.

Batch normalization. Make sure your model is in eval mode before exporting. Training-mode BN with running statistics will give you wrong results in production. I learned this the hard way.

Optimization Checklist

Before shipping:

  • Quantize to float16: halves model size with negligible accuracy loss
  • Strip unused ops: conversion sometimes includes training-only ops
  • Check model size: anything over 10 MB will feel slow to load
  • Test on slow connections: use Chrome DevTools to simulate 3G
  • Warm up: the first inference compiles shaders, so run a dummy input on load
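
To put that 10 MB number in perspective, here's a rough load-time estimate (hypothetical helper; ignores latency, HTTP overhead, and shader compilation):

```javascript
// Rough time-to-download for a model payload; bandwidth in Mbps.
function estimateLoadSeconds(modelBytes, bandwidthMbps) {
  return (modelBytes * 8) / (bandwidthMbps * 1e6);
}

// A 10 MB model over Chrome DevTools' "Fast 3G" profile (~1.6 Mbps down)
// takes roughly 50 seconds just to download.
estimateLoadSeconds(10 * 1024 * 1024, 1.6);
```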

Loading Strategy

Don't block the UI while loading the model:

setLoading(true);
 
const model = await tf.loadGraphModel('/models/v2/model.json', {
  onProgress: (fraction) => {
    updateProgressBar(fraction);
  }
});
 
// Warm up with a dummy input: the first call compiles WebGL shaders.
// Replace any dynamic (-1/null) dimensions with 1 for the dummy batch.
const dummy = tf.zeros(model.inputs[0].shape.map(d => (d > 0 ? d : 1)));
const warmup = model.predict(dummy);
await warmup.data();
tf.dispose([dummy, warmup]);
 
setLoading(false);

Memory Management

This is the number one source of bugs in TF.js apps. GPU memory (WebGL textures) does not get garbage collected by JavaScript.

// WRONG: the input tensor is never disposed, so GPU memory leaks every call
function process(input) {
  const result = model.predict(input);
  return result;
}

// RIGHT: tf.tidy cleans up intermediate tensors; dispose the input yourself.
// The returned tensor survives tidy, so the caller must dispose it when done.
function process(input) {
  const result = tf.tidy(() => model.predict(input));
  input.dispose();
  return result;
}

Watch your memory usage in Chrome DevTools. If it keeps climbing, you have a leak.
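
A quick way to catch a climbing tensor count is to snapshot tf.memory() around an inference and diff it (sketch; tf and your inference function are assumed to be in scope):

```javascript
// Compare two tf.memory() snapshots; a positive delta after repeated
// inferences usually means a tensor isn't being disposed.
function tensorDelta(before, after) {
  return after.numTensors - before.numTensors;
}

// Usage sketch (tf and runInference assumed to exist):
// const before = tf.memory();
// runInference();
// const delta = tensorDelta(before, tf.memory());
// if (delta > 0) console.warn(`possible leak: ${delta} tensors retained`);
```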

What I Wish Someone Told Me

The conversion is the easy part. Making the model work reliably across browsers, devices, and network conditions is the real job. If you want to understand what's actually happening when TF.js runs inference, my post on how WebGL powers browser-based ML covers the GPU layer underneath.

Test on Firefox too. WebGL behavior varies between browsers. I've had models work perfectly in Chrome and produce garbage in Firefox due to precision differences.
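
When you hit precision bugs like that, the WebGL backend exposes environment flags you can flip at startup. A configuration sketch (flag names are from the TF.js WebGL backend; verify them against the version you ship):

```javascript
// Disable half-float textures so intermediate values stay in float32.
// Costs memory and some speed, but sidesteps f16 precision differences.
tf.env().set('WEBGL_FORCE_F16_TEXTURES', false);

// Log what the backend actually negotiated on this device.
console.log(tf.getBackend());
console.log(tf.env().getBool('WEBGL_RENDER_FLOAT32_ENABLED'));
```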

Users don't care about your model. They care about the experience. A slightly worse model with instant loading beats a great model behind a 15-second spinner every single time. The model optimization techniques for mobile -- quantization, pruning, distillation -- apply equally to browser deployment.