5 min read

MLOps Is Not Optional Anymore: Lessons from Production

mlops · engineering · opinion

I left Myelin Foundry about a month ago, and now that I have some distance, I keep thinking about all the times we got burned not by bad models but by bad practices around them. The model was fine. The pipeline around it was a disaster.

Honestly, if I could go back to day one, I'd set up proper MLOps infrastructure before writing a single line of model code. Here's why.

The MLOps lifecycle loop: models degrade in production without continuous monitoring, retraining, and redeployment.

The Pain Points Were Always the Same

Models degrading silently. We had a super-resolution model in production that slowly got worse over three weeks. Nobody noticed because we didn't have proper monitoring. The input distribution had shifted (the client started sending images from a different camera), and our model was producing subtly worse outputs. No alerts. No dashboards. Just a client email asking "hey, is something wrong?"

The "works on my machine" problem, but for ML. Someone would train a model, get great metrics, hand it off for deployment, and the deployed version would perform differently. Different library versions, different preprocessing, different random seeds. We once spent two days debugging a model accuracy drop that turned out to be a NumPy version mismatch between training and inference environments.

No experiment tracking early on. In the first few months, our experiment tracking was a shared Google Sheet. I'm not joking. Someone would update a row with "tried learning rate 0.001, accuracy 94.2%." But what was the exact data split? What augmentations? What commit hash? No idea. Reproducing results was basically impossible.

What I'd Do Differently

Experiment tracking from day one. Weights & Biases or MLflow: pick one and use it religiously. Log everything: hyperparameters, data versions, model artifacts, system metrics. The cost of setting this up is maybe half a day. The cost of not having it is weeks of wasted time.
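Whichever tracker you pick, the non-negotiable part is what gets logged with every run. Here's a tool-agnostic sketch of the record I'd capture; in practice MLflow or W&B store this for you, and the field names here are illustrative:

```python
import json
import subprocess
import time

def log_run(params, metrics, data_version, path="runs.jsonl"):
    """Append one experiment run with everything needed to reproduce it."""
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        commit = None  # not inside a git repo
    record = {
        "timestamp": time.time(),
        "git_commit": commit,          # exact code that produced the model
        "data_version": data_version,  # hash or revision of the training data
        "params": params,              # hyperparameters, augmentations, seeds
        "metrics": metrics,            # the numbers you'd put in the spreadsheet
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

run = log_run({"lr": 1e-3, "seed": 42}, {"accuracy": 0.942}, data_version="abc123")
```

Compare this to the Google Sheet: the spreadsheet row had the learning rate and the accuracy, and nothing else. Everything that made the result reproducible lived in someone's head.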

Version your data, not just your code. DVC (Data Version Control) exists for a reason. When someone asks "what data was this model trained on?", the answer should be a commit hash, not "I think it was the dataset from June, maybe the cleaned version?"
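DVC does this properly, but the underlying idea is just content addressing: a dataset's identity is the hash of its bytes. A minimal sketch of fingerprinting a data file so the training log can reference it exactly:

```python
import hashlib

def data_fingerprint(path, algo="sha256", chunk=1 << 20):
    """Content hash of a dataset file: same bytes, same version ID."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        while block := f.read(chunk):  # stream in 1 MiB chunks
            h.update(block)
    return h.hexdigest()
```

Log this fingerprint with every training run, and "what data was this model trained on?" becomes an exact lookup instead of "the dataset from June, maybe."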

CI/CD for models. Every model change should go through automated testing. Not just unit tests, but validation on a held-out set, performance benchmarking, and integration tests that verify the model works in the full pipeline. We eventually set this up at Myelin, and it caught issues that would have gone to production.
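The core of a model CI gate is a hard pass/fail check on a held-out set, run on every change. A minimal sketch; the thresholds and the toy stand-in model are illustrative, and in a real pipeline pytest or your CI runner would drive this:

```python
import time

def evaluate_gate(predict, examples, labels,
                  min_accuracy=0.90, max_latency_ms=50.0):
    """Fail the build unless the candidate clears accuracy and latency bars."""
    start = time.perf_counter()
    preds = [predict(x) for x in examples]
    latency_ms = (time.perf_counter() - start) * 1000 / len(examples)
    accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
    return {
        "accuracy": accuracy,
        "latency_ms": latency_ms,
        "passed": accuracy >= min_accuracy and latency_ms <= max_latency_ms,
    }

# Toy stand-in for a real model: classify by sign.
model = lambda x: int(x > 0)
report = evaluate_gate(model, [-2, -1, 1, 2], [0, 0, 1, 1])
assert report["passed"], f"model gate failed: {report}"
```

The point is that "passed" is computed by a machine on every commit, not eyeballed by whoever happens to be deploying.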

Monitor everything in production. Input distributions, output distributions, latency, error rates. Set up alerts for drift. If you're not monitoring a deployed model, you don't actually know if it works. You're just hoping.
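Input drift detection can start very simple: compare a feature's distribution in live traffic against a reference window from training. A stdlib-only sketch using the two-sample Kolmogorov-Smirnov statistic; tools like Evidently give you this plus dashboards, and the 0.2 alert threshold is an illustrative choice:

```python
def ks_statistic(reference, live):
    """Max gap between the two empirical CDFs (two-sample KS statistic)."""
    ref, cur = sorted(reference), sorted(live)
    n, m = len(ref), len(cur)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(ref[i], cur[j])
        while i < n and ref[i] == x:  # consume ties on both sides
            i += 1
        while j < m and cur[j] == x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d

def drift_alert(reference, live, threshold=0.2):
    """Fire when the live window has drifted from the training distribution."""
    return ks_statistic(reference, live) > threshold
```

This is roughly the check that would have caught our camera-switch incident in hours instead of three weeks: the brightness and noise statistics of the new camera's images would have pushed the statistic past the threshold.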

The Tools Have Caught Up

The good news is that the MLOps ecosystem has matured a lot even in the last two years. MLflow for experiment tracking and model registry. Weights & Biases for experiment visualization. DVC for data versioning. Seldon or BentoML for serving. Great Expectations for data validation. Evidently for monitoring and drift detection.

Two years ago, half of these tools either didn't exist or were rough. Now you can set up a proper ML pipeline without building everything from scratch.

The Real Lesson

The thing I keep coming back to is this: the ML part of an ML system is maybe 20% of the actual work. The rest is data pipelines, feature engineering, testing, deployment, monitoring, and all the infrastructure glue that holds it together. Google published that famous paper, "Hidden Technical Debt in Machine Learning Systems," and every word of it is true.

I'm taking a break from production ML to think about grad school, and honestly, these operational lessons are a big part of why. I want to understand the science deeper, but I also want to come back with better engineering instincts. Because building a good model is the easy part. Keeping it good in production is where the real challenge lives.