From PyTorch to Production: The Optimization Pipeline Nobody Talks About
Every ML research paper ends the same way: a table of accuracy metrics, maybe some FLOPs comparisons, and a brief mention of "future work." What they never show is the pipeline required to take that model from a PyTorch training script to something that actually runs in production. That pipeline is where I spend most of my engineering time at Honeywell, and it's where most of the real complexity lives.
The gap between research and production isn't a crack. It's a canyon. And the bridge across it is an optimization pipeline that nobody teaches in school.
The Full Pipeline
Here's what the end-to-end flow looks like for a typical vision model going from research to deployment:

1. Train the model in PyTorch
2. Export to ONNX
3. Validate the exported graph against the PyTorch model
4. Quantize with post-training quantization and a calibration dataset
5. Validate accuracy with domain-specific metrics
6. Apply quantization-aware training if accuracy degrades beyond tolerance
7. Convert to the target runtime (TensorRT, TFLite, or ONNX Runtime)
8. Fix input shapes and optimize for the runtime
9. Benchmark latency, throughput, and memory on target hardware
10. Integrate into the application and deploy
That's ten steps. The training step, the one that gets all the attention in papers and courses, is step one. The other nine are where the production engineering happens.
Where the Time Actually Goes
In my experience across both Myelin and Honeywell, the time distribution looks roughly like this:
- Training and experimentation: 15%
- ONNX export debugging: 20%
- Quantization and accuracy validation: 25%
- Runtime optimization and benchmarking: 20%
- Integration and deployment: 20%
The ONNX export step is deceptively painful. torch.onnx.export works perfectly for standard operations, but the moment your model uses dynamic control flow, custom operations, or certain PyTorch-specific patterns, you're in for a debugging session. Common issues include unsupported ops, incorrect shape inference, and silent numerical differences in the exported graph.
My workflow for ONNX export: export with opset_version=17 (or the latest stable), run onnx.checker.check_model(), then compare outputs between PyTorch and ONNX on 100 representative inputs. If the max absolute difference exceeds 1e-5, something went wrong and I trace through the graph node by node.
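The comparison at the end of that workflow boils down to a max-absolute-difference sweep over the paired outputs. A minimal numpy sketch (function and variable names are my own; in practice the two output lists come from the PyTorch model and an onnxruntime InferenceSession run on the same 100 inputs):

```python
import numpy as np

def compare_outputs(torch_outputs, onnx_outputs, tol=1e-5):
    """Per-sample max-abs-diff between PyTorch and ONNX outputs.

    Returns the worst difference seen and the indices of samples
    that exceed the tolerance, so you know which inputs to trace
    through the graph node by node.
    """
    worst = 0.0
    failing = []
    for i, (a, b) in enumerate(zip(torch_outputs, onnx_outputs)):
        diff = float(np.max(np.abs(np.asarray(a) - np.asarray(b))))
        worst = max(worst, diff)
        if diff > tol:
            failing.append(i)
    return worst, failing
```

If failing is non-empty, the divergent sample indices tell you exactly which inputs to feed through the graph while bisecting for the offending node.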
Quantization: The High-Leverage Step
Quantization is where you get the biggest bang for your engineering effort. Converting from FP32 to INT8 gives you roughly 4x model size reduction and 2-4x inference speedup, depending on hardware.
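The 4x size reduction comes straight from the dtype: INT8 stores one byte per weight where FP32 stores four. The mapping itself is an affine transform. Here's a minimal numpy sketch (per-tensor scale and zero-point for simplicity; real toolchains typically also support per-channel scales):

```python
import numpy as np

def quantize_int8(x, scale, zero_point):
    # Affine mapping: q = clip(round(x / scale) + zp, -128, 127)
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize(q, scale, zero_point):
    # Approximate reconstruction: x_hat = (q - zp) * scale
    return (q.astype(np.float32) - zero_point) * scale
```

The round-trip error is bounded by half the scale, which is why the accuracy impact depends so heavily on the dynamic range of each tensor.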
But quantization isn't free. The accuracy impact varies by model, and validating that impact requires careful benchmarking. I always quantize in two passes:
- Post-training quantization with a calibration dataset of 200-500 representative samples. If accuracy holds (within acceptable tolerance for the use case), ship it.
- Quantization-aware training if PTQ degrades accuracy beyond tolerance. This usually means fine-tuning for 5-10 epochs with fake quantization nodes inserted. It recovers most of the lost accuracy.
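The calibration pass in step one exists to pick the scale and zero-point from observed activation ranges. A simplified min-max calibrator sketch (my own illustration; production tools also offer entropy and percentile calibrators that are more robust to outliers):

```python
import numpy as np

def calibrate_minmax(calibration_batches):
    """Derive a per-tensor scale/zero-point (unsigned INT8 range)
    from the min/max observed over the calibration samples."""
    lo = min(float(np.min(b)) for b in calibration_batches)
    hi = max(float(np.max(b)) for b in calibration_batches)
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # keep 0.0 exactly representable
    scale = (hi - lo) / 255.0
    zero_point = int(round(-lo / scale)) if scale > 0 else 0
    return scale, zero_point
```

A single outlier in the calibration set widens the range and wastes quantization levels on values that almost never occur, which is one reason the choice of representative samples matters as much as the count.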
The subtlety is in defining "acceptable tolerance." For classification, a 0.5% top-1 drop might be fine. For detection, a 0.5% mAP drop might mask a significant regression on a critical class. For enhancement, perceptual quality metrics can tell a different story than PSNR. Always validate with domain-specific metrics that match your actual quality requirements.
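One way to make "acceptable tolerance" concrete is a per-metric gate that runs automatically after every quantization pass. A sketch with hypothetical metric names:

```python
def passes_quantization_gate(baseline, quantized, tolerances):
    """Check quantized metrics against per-metric drop tolerances,
    e.g. tolerances={"top1": 0.5} allows at most a 0.5-point drop.

    Returns (ok, failures) so CI can report exactly which metric
    regressed and by how much.
    """
    failures = {}
    for name, tol in tolerances.items():
        drop = baseline[name] - quantized[name]
        if drop > tol:
            failures[name] = drop
    return len(failures) == 0, failures
```

Reporting per-class mAP as separate entries rather than one aggregate is how you keep a regression on a critical class from hiding behind a healthy overall number.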
Dynamic Shapes: The Silent Killer
If your model needs to handle variable input sizes, prepare for pain. PyTorch handles dynamic shapes natively. ONNX can represent them. But many downstream runtimes prefer or require fixed shapes.
TFLite historically wanted fixed shapes. TensorRT performs best with fixed shapes (or a small set of optimization profiles). Even ONNX Runtime, which supports dynamic shapes, often runs faster with fixed shapes because it can pre-allocate memory and optimize kernel selection.
My approach: define 2-3 standard input sizes that cover your deployment scenarios, optimize for each, and route inputs to the nearest size at runtime. It's not elegant, but it's practical and avoids the performance penalty of fully dynamic shape handling.
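The routing step itself is trivial. A sketch with hypothetical profile sizes (pick yours from your real traffic distribution), choosing the profile whose area is closest to the incoming frame:

```python
def route_to_profile(h, w, profiles=((320, 320), (640, 640), (960, 960))):
    """Pick the nearest pre-optimized input size for an incoming
    frame; the frame is then resized or letterboxed to that profile
    before inference against the engine built for it."""
    return min(profiles, key=lambda p: abs(p[0] * p[1] - h * w))
```

Each profile gets its own optimized engine (or TensorRT optimization profile), so the runtime cost of "dynamic" shapes collapses to a dictionary lookup plus a resize.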
The Industry Gap
The disconnect between ML research and ML engineering is the defining challenge of applied AI right now. Research teams optimize for accuracy on benchmarks. Engineering teams optimize for latency, throughput, memory, power consumption, and reliability on specific hardware. These objectives overlap but are not the same.
The teams that ship successful ML products are the ones that build this optimization pipeline as a first-class engineering system, not as an afterthought. Version-controlled conversion scripts, automated benchmarking, regression testing for accuracy after each optimization step, CI/CD for model deployment. The same software engineering rigor that goes into application code needs to go into the model pipeline.
Having worked at a startup (Myelin) and now at an enterprise (Honeywell), I've seen both ends. The tools are maturing, but the knowledge is still tribal. Most of what I know about this pipeline, I learned by breaking things and fixing them. That needs to change, and it starts with being honest about where the complexity actually lives.