Pushing Object Detection to 98.6% mAP: Lessons from Production CV
A few weeks into my co-op at Honeywell, I was tasked with improving an object detection system that was already performing at 96% mAP. The ask was simple: get it higher. The reality of doing that in a production environment, where every change has to be validated, documented, and deployable, was anything but simple.
We ended up at 98.6% mAP. And the path there taught me more about production computer vision than any paper or course ever did.
Data Quality Beats Model Architecture
This is the lesson I keep relearning. When I started, my instinct was to reach for a bigger backbone, try a newer detection head, maybe swap in some attention modules -- the kind of transformer-vs-CNN architecture debates that dominate the literature. Standard researcher thinking.
But the biggest jumps came from data work. Cleaning mislabeled annotations. Removing ambiguous edge cases that were confusing the model. Adding targeted examples of failure modes we identified during error analysis. One round of annotation cleanup gave us a 0.8% mAP improvement. No model change required.
The ratio was roughly this: 70% of our gains came from data curation, 20% from augmentation strategy, and 10% from model tweaks. That's not what ML Twitter would have you believe, but it's what production looks like.
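To make the annotation-cleanup idea concrete, here is a minimal sketch of one audit heuristic, with hypothetical function names and thresholds (this is not the exact production tooling): flag ground-truth boxes that no prediction overlaps well, since persistent misses on a trained model often turn out to be labeling errors rather than model failures.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def suspect_labels(gt_boxes, pred_boxes, iou_thresh=0.5):
    """Ground-truth boxes that no prediction matches above iou_thresh.

    Boxes the model consistently "misses" across checkpoints are
    strong candidates for human re-review: many are mislabeled.
    """
    return [g for g in gt_boxes
            if all(iou(g, p) < iou_thresh for p in pred_boxes)]
```

Run something like this over the validation set, sort the flagged boxes by class, and send the densest clusters back to annotators first.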
Augmentation Strategies That Actually Worked
Not all augmentations are created equal. We tried the standard playbook (random flips, rotations, color jitter, mosaic augmentation) and then got more targeted.
What moved the needle:
- Mosaic augmentation with careful control over object scale distribution
- Copy-paste augmentation for underrepresented classes
- Domain-specific photometric distortions that matched real deployment conditions (lighting variation, sensor noise profiles)
What didn't help or hurt:
- Aggressive geometric transforms that created unrealistic object orientations
- CutOut/GridMask on small objects (just destroyed the signal)
- Random erasing at high ratios
The key insight is that augmentation should expand the distribution your model sees, but it shouldn't create samples that violate the physics of your deployment environment. If your objects are always upright, random 90-degree rotations add noise, not signal.
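As an illustration of the copy-paste idea for underrepresented classes, here is a stripped-down sketch in NumPy (real pipelines would also handle blending, occlusion checks, and scale jitter; this version just drops a crop in and returns its new box):

```python
import numpy as np

def copy_paste(src_img, src_box, dst_img, paste_xy):
    """Paste the object crop src_box from src_img into dst_img.

    Boxes are (x1, y1, x2, y2); images are HxWxC uint8 arrays.
    Returns the augmented image and the box of the pasted object,
    so the new label can be appended to dst_img's annotations.
    """
    x1, y1, x2, y2 = src_box
    crop = src_img[y1:y2, x1:x2]
    px, py = paste_xy
    h, w = crop.shape[:2]
    out = dst_img.copy()
    out[py:py + h, px:px + w] = crop
    return out, (px, py, px + w, py + h)
```

The same "respect the deployment physics" rule from above applies here: paste locations and scales should stay within what the deployed camera could actually see.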
The mAP Evaluation Trap
Here's something that tripped me up early. mAP@0.5 and mAP@0.5:0.95 tell very different stories. A model can look great at the loose IoU threshold of 0.5 and fall apart at stricter thresholds. For our use case, localization precision mattered as much as detection accuracy.
We tracked both metrics, plus per-class AP breakdowns and confidence calibration curves. The per-class analysis was critical because aggregate mAP can hide the fact that your model is terrible at one specific class that happens to be rare but important.
We also ran confusion matrix analysis at the class level to understand what the model was confusing, not just what it was missing. This directly informed which additional training samples to collect.
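The threshold trap is easy to demonstrate in a few lines. This toy sketch (illustrative helper names, not our evaluation harness) shows how one slightly offset prediction counts as a match at IoU 0.5 but not at 0.75:

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def match_rate(gts, preds, thresh):
    """Fraction of ground-truth boxes matched by any prediction."""
    if not gts:
        return 1.0
    matched = sum(any(iou(g, p) >= thresh for p in preds) for g in gts)
    return matched / len(gts)

gt = [(0, 0, 10, 10)]
pred = [(2, 0, 12, 10)]   # shifted 2px right: IoU = 80/120 ≈ 0.67
print(match_rate(gt, pred, 0.50))  # counts as a hit
print(match_rate(gt, pred, 0.75))  # counts as a miss
```

A model full of detections like this one looks fine at mAP@0.5 and mediocre at mAP@0.5:0.95, which is exactly why we tracked both.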
The Diminishing Returns Problem
Going from 90% to 95% mAP took standard engineering. Good data, solid augmentations, well-tuned training schedule. Going from 95% to 96% required careful analysis and targeted interventions. Going from 96% to 98.6% required obsessive attention to every remaining failure case.
At that level, you're debugging individual predictions. Why did the model miss this specific instance? Is it an annotation issue, a data distribution gap, or a genuine model limitation? Each fix might recover a handful of detections out of thousands. The effort per percentage point increases exponentially.
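A rough first-pass triage for a missed instance can be automated before a human looks at it. This is a hypothetical sketch of that bucketing (the names and thresholds are illustrative): distinguish "nothing fired anywhere near the box" (likely a data gap or label error) from "a box almost matched" (a localization problem):

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def triage_miss(gt_box, preds, near_iou=0.1, match_iou=0.5):
    """Bucket a ground-truth box by how close any prediction came."""
    best = max((iou(gt_box, p) for p in preds), default=0.0)
    if best >= match_iou:
        return "matched"            # not actually a miss
    if best >= near_iou:
        return "poor_localization"  # model saw it, boxed it badly
    return "no_detection"           # data gap or annotation issue
```

Counting these buckets per class told us where to spend effort: localization buckets point at box-regression and augmentation fixes, no-detection buckets point at data collection and label audits.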
This is the part of production CV that doesn't make it into research papers. Papers compete on COCO benchmarks where the difference between methods is 1-2 mAP points. In production, those 1-2 points can take months of engineering, and they often matter more than the paper authors realize.
The Bigger Picture
Working on this system reinforced something I've been thinking about since my time at Myelin. The gap between a working prototype and a production-grade system is where most of the real engineering lives. A research model that hits 95% accuracy in a Jupyter notebook is a proof of concept. Getting that same model to 98%+ in deployment, with all the data infrastructure, validation pipelines, and monitoring that entails, is a completely different discipline. I wrote about the full PyTorch-to-production optimization pipeline separately, and the lessons overlap directly with what I experienced here.
The industry is slowly recognizing this. Data-centric AI isn't just an Andrew Ng talking point anymore. It's how teams that ship production CV systems actually work. The model architecture is a commodity at this point. The data pipeline is the competitive advantage. And once you have the accuracy, the next challenge is quantization and optimization to actually deploy the model where it needs to run.