Pushing Object Detection to 98.6% mAP: Lessons from Production CV
A few weeks into my co-op at Honeywell, I was tasked with improving an object detection system that was already performing at 96% mAP. The ask was simple: get it higher. The reality of doing that in a production environment, where every change has to be validated, documented, and deployable, was anything but simple.
We ended up at 98.6% mAP. And the path there taught me more about production computer vision than any paper or course ever did.
Data Quality Beats Model Architecture
This is the lesson I keep relearning. When I started, my instinct was to reach for a bigger backbone, try a newer detection head, maybe swap in some attention modules -- the kind of transformer-vs-CNN architecture debates that dominate the literature. Standard researcher thinking.
But the biggest jumps came from data work. Cleaning mislabeled annotations. Removing ambiguous edge cases that were confusing the model. Adding targeted examples of failure modes we identified during error analysis. One round of annotation cleanup gave us a 0.8% mAP improvement. No model change required.
The ratio was roughly this: 70% of our gains came from data curation, 20% from augmentation strategy, and 10% from model tweaks. That's not what ML Twitter would have you believe, but it's what production looks like.
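To make the annotation-cleanup idea concrete, here is a minimal sketch of one audit heuristic, with hypothetical function names and thresholds (this is not the exact production tooling): flag ground-truth boxes that no prediction overlaps well, since persistent misses on a trained model often turn out to be labeling errors rather than model failures.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def suspect_labels(gt_boxes, pred_boxes, iou_thresh=0.5):
    """Ground-truth boxes that no prediction matches above iou_thresh.

    Boxes the model consistently "misses" across checkpoints are
    strong candidates for human re-review: many are mislabeled.
    """
    return [g for g in gt_boxes
            if all(iou(g, p) < iou_thresh for p in pred_boxes)]
```

Run something like this over the validation set, sort the flagged boxes by class, and send the densest clusters back to annotators first.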
Augmentation Strategies That Actually Worked
Not all augmentations are created equal. We tried the standard playbook (random flips, rotations, color jitter, mosaic augmentation) and then got more targeted.
What moved the needle:
- Mosaic augmentation with careful control over object scale distribution
- Copy-paste augmentation for underrepresented classes
- Domain-specific photometric distortions that matched real deployment conditions (lighting variation, sensor noise profiles)
What didn't help or hurt:
- Aggressive geometric transforms that created unrealistic object orientations
- CutOut/GridMask on small objects (just destroyed the signal)
- Random erasing at high ratios
The key insight is that augmentation should expand the distribution your model sees, but it shouldn't create samples that violate the physics of your deployment environment. If your objects are always upright, random 90-degree rotations add noise, not signal.
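As an illustration of the copy-paste idea for underrepresented classes, here is a stripped-down sketch in NumPy (real pipelines would also handle blending, occlusion checks, and scale jitter; this version just drops a crop in and returns its new box):

```python
import numpy as np

def copy_paste(src_img, src_box, dst_img, paste_xy):
    """Paste the object crop src_box from src_img into dst_img.

    Boxes are (x1, y1, x2, y2); images are HxWxC uint8 arrays.
    Returns the augmented image and the box of the pasted object,
    so the new label can be appended to dst_img's annotations.
    """
    x1, y1, x2, y2 = src_box
    crop = src_img[y1:y2, x1:x2]
    px, py = paste_xy
    h, w = crop.shape[:2]
    out = dst_img.copy()
    out[py:py + h, px:px + w] = crop
    return out, (px, py, px + w, py + h)
```

The same "respect the deployment physics" rule from above applies here: paste locations and scales should stay within what the deployed camera could actually see.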
The mAP Evaluation Trap
Here's something that tripped me up early. mAP@0.5 and mAP@0.5:0.95 tell very different stories. A model can look great at the loose IoU threshold of 0.5 and fall apart at stricter thresholds. For our use case, localization precision mattered as much as detection accuracy.
We tracked both metrics, plus per-class AP breakdowns and confidence calibration curves. The per-class analysis was critical because aggregate mAP can hide the fact that your model is terrible at one specific class that happens to be rare but important.
We also ran confusion matrix analysis at the class level to understand what the model was confusing, not just what it was missing. This directly informed which additional training samples to collect.
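The threshold trap is easy to demonstrate in a few lines. This toy sketch (illustrative helper names, not our evaluation harness) shows how one slightly offset prediction counts as a match at IoU 0.5 but not at 0.75:

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def match_rate(gts, preds, thresh):
    """Fraction of ground-truth boxes matched by any prediction."""
    if not gts:
        return 1.0
    matched = sum(any(iou(g, p) >= thresh for p in preds) for g in gts)
    return matched / len(gts)

gt = [(0, 0, 10, 10)]
pred = [(2, 0, 12, 10)]   # shifted 2px right: IoU = 80/120 ≈ 0.67
print(match_rate(gt, pred, 0.50))  # counts as a hit
print(match_rate(gt, pred, 0.75))  # counts as a miss
```

A model full of detections like this one looks fine at mAP@0.5 and mediocre at mAP@0.5:0.95, which is exactly why we tracked both.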
The Diminishing Returns Problem
Going from 90% to 95% mAP took standard engineering. Good data, solid augmentations, well-tuned training schedule. Going from 95% to 96% required careful analysis and targeted interventions. Going from 96% to 98.6% required obsessive attention to every remaining failure case.
At that level, you're debugging individual predictions. Why did the model miss this specific instance? Is it an annotation issue, a data distribution gap, or a genuine model limitation? Each fix might recover a handful of detections out of thousands. The effort per percentage point increases exponentially.
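A rough first-pass triage for a missed instance can be automated before a human looks at it. This is a hypothetical sketch of that bucketing (the names and thresholds are illustrative): distinguish "nothing fired anywhere near the box" (likely a data gap or label error) from "a box almost matched" (a localization problem):

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def triage_miss(gt_box, preds, near_iou=0.1, match_iou=0.5):
    """Bucket a ground-truth box by how close any prediction came."""
    best = max((iou(gt_box, p) for p in preds), default=0.0)
    if best >= match_iou:
        return "matched"            # not actually a miss
    if best >= near_iou:
        return "poor_localization"  # model saw it, boxed it badly
    return "no_detection"           # data gap or annotation issue
```

Counting these buckets per class told us where to spend effort: localization buckets point at box-regression and augmentation fixes, no-detection buckets point at data collection and label audits.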
This is the part of production CV that doesn't make it into research papers. Papers compete on COCO benchmarks where the difference between methods is 1-2 mAP points. In production, those 1-2 points can take months of engineering, and they often matter more than the paper authors realize.
The Bigger Picture
Working on this system reinforced something I've been thinking about since my time at Myelin. The gap between a working prototype and a production-grade system is where most of the real engineering lives. A research model that hits 95% accuracy in a Jupyter notebook is a proof of concept. Getting that same model to 98%+ in deployment, with all the data infrastructure, validation pipelines, and monitoring that entails, is a completely different discipline. I wrote about the full PyTorch-to-production optimization pipeline separately, and the lessons overlap directly with what I experienced here.
The industry is slowly recognizing this. Data-centric AI isn't just an Andrew Ng talking point anymore. It's how teams that ship production CV systems actually work. The model architecture is a commodity at this point. The data pipeline is the competitive advantage. And once you have the accuracy, the next challenge is quantization and optimization to actually deploy the model where it needs to run.