MLOps and Production ML: Your Vetting Checklist for Hiring Engineers Who Ship

The most expensive mis-hire I have ever seen happened when a client chose the more technically brilliant of two finalists. The engineer had won a major Kaggle competition the year before. Six months in, he had still not shipped a model to production. The runner-up, who had been passed over, was two years into production ML at a competitor by then. The lesson, which I have now watched repeat four or five times: competitive ML ability and production ML ability are different jobs. After years of placing ML engineers into production teams, I have developed a vetting checklist that sorts the two.

Why “Kaggle champion” and “ships to production” are different jobs

A Kaggle environment is a frozen dataset, a known metric, a fixed time horizon, and no downstream consequences. Production is a shifting dataset, a contested metric, an infinite time horizon, and direct business consequences for every choice. The skills overlap; the mindset and muscle memory do not. The engineer who wins a Kaggle competition has spent six months maximizing one number. The engineer who ships a model has spent six months navigating a dozen tradeoffs with incomplete information.

The production ML skills hiring managers routinely miss

The skills that predict production success but rarely appear in a standard ML interview:

  • Comfort with ambiguous requirements and pushing back on poorly scoped asks
  • Discipline around versioning data, code, and model artifacts together
  • Instinct for when a simpler model is the right choice
  • Familiarity with the failure modes of deployed systems, not just training loops
  • Ability to pair with software engineers and product managers without friction

If your interview loop does not probe at least four of these, you are rolling dice on the outcome.

Resume signals that actually predict production success

When I screen a resume for production ML potential, I scan for concrete artifacts: a shipped model, measurable business outcomes, team size, a tech stack that matches production work, and language that shows the candidate owns outcomes. I downweight long lists of model architectures with no context. I watch for the word “shipped” used as a specific, past-tense claim rather than an aspiration. I look for mentions of monitoring, rollback, A/B testing, or post-launch iteration. The absence of those words in a senior candidate’s resume is a red flag.

The technical screen: system design beats algorithm puzzles

For MLOps and production ML hires, a forty-five-minute system design interview will tell you more than three hours of algorithm puzzles. Ask the candidate to design an end-to-end ML system for a realistic scenario: a recommendation engine, a fraud detection pipeline, a content moderation classifier. Listen for how they handle data freshness, retraining cadence, serving latency, evaluation, monitoring, and rollback. Candidates with real production experience will reach for those topics on their own. Candidates without it will need prompting.
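If it helps to keep that round consistent across interviewers, the listening targets can be written down as a rubric. Here is a minimal, hypothetical sketch in Python; the dimension names and prompts are my own framing, not a standard, and you would adapt them to the scenario you choose.

```python
# Hypothetical rubric for a production ML system design round.
# Dimension names and prompts are illustrative, not a standard.
DESIGN_DIMENSIONS = {
    "data_freshness": "How stale can features get before predictions degrade?",
    "retraining_cadence": "What triggers retraining: a schedule, detected drift, or both?",
    "serving_latency": "What latency budget does the product impose, and where is it spent?",
    "evaluation": "How is the model judged offline and online, and do those metrics agree?",
    "monitoring": "What alerts fire, on what thresholds, and who gets paged?",
    "rollback": "How do we get back to the last good model, and how fast?",
}

def unprompted_coverage(raised: set[str]) -> float:
    """Fraction of the dimensions the candidate reached without prompting."""
    return len(raised & DESIGN_DIMENSIONS.keys()) / len(DESIGN_DIMENSIONS)
```

Even if you never compute a score, agreeing on the dimensions up front keeps three interviewers from grading three different interviews.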

Reproducibility, monitoring, and rollback questions to ask

Three questions I ask every senior production ML candidate, and what I listen for:

  1. “Walk me through how you version a model and its training data together.” Listen for specific tools and a concrete workflow, not a philosophical answer; the first sketch after this list shows the pattern a strong answer tends to describe.
  2. “Describe the monitoring you put on a model in production.” Listen for input drift, output drift, business metrics, and alerting thresholds, not just model accuracy; the second sketch after this list shows one common drift check.
  3. “Tell me about a time you rolled back a model. What was the trigger, how fast was the rollback, and what did you change after?” The absence of a real story here is the answer.
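On question 1, the specifics vary by stack (DVC, MLflow, lakeFS, plain object storage), but a concrete answer usually reduces to one pattern: pin the model artifact to hashes of the exact data and code that produced it. Here is a stripped-down, hypothetical illustration of that pattern in plain Python; the manifest fields and file layout are my own invention, not any particular tool’s format.

```python
import hashlib
import json
import pathlib
from datetime import datetime, timezone

def file_sha256(path: str) -> str:
    """Hash a file so the exact bytes can be pinned in a manifest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_model_manifest(model_path: str, data_path: str, code_git_sha: str) -> None:
    """Write a manifest next to the model tying it to its data and code.

    Hypothetical layout: models/churn.pkl gets models/churn.manifest.json.
    """
    manifest = {
        "model_artifact": model_path,
        "model_sha256": file_sha256(model_path),
        "training_data": data_path,
        "data_sha256": file_sha256(data_path),
        "code_git_sha": code_git_sha,
        "created_utc": datetime.now(timezone.utc).isoformat(),
    }
    out = pathlib.Path(model_path).with_suffix(".manifest.json")
    out.write_text(json.dumps(manifest, indent=2))
```

A candidate who describes something equivalent, whatever the tooling, passes question 1. A candidate who answers “we use Git” for everything has usually never had to reproduce a year-old model under pressure.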
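On question 2, the concrete core of a good answer is often a named drift statistic with a threshold attached. One common choice is the population stability index (PSI) between training-time and live feature distributions. A minimal sketch follows, assuming a numeric feature bucketed on baseline quantiles; the thresholds in the docstring are industry rules of thumb, not gospel.

```python
import numpy as np

def psi(baseline: np.ndarray, live: np.ndarray, bins: int = 10) -> float:
    """Population stability index between baseline data and live traffic.

    Common rule of thumb: < 0.1 stable, 0.1-0.25 watch, > 0.25 investigate.
    """
    # Bin edges come from the baseline so both samples share the same buckets.
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    base_frac = np.histogram(baseline, edges)[0] / len(baseline)
    live_frac = np.histogram(live, edges)[0] / len(live)
    # Clip to avoid log(0) on empty buckets.
    base_frac = np.clip(base_frac, 1e-6, None)
    live_frac = np.clip(live_frac, 1e-6, None)
    return float(np.sum((live_frac - base_frac) * np.log(live_frac / base_frac)))
```

The statistic itself is the easy part; listen for whether the candidate also names who gets alerted, at what threshold, and what happens next.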

Working sessions and take-homes that do not waste time

Take-homes have a bad reputation in ML hiring, and most of it is earned. A good take-home is bounded (four hours maximum), realistic (uses a dataset or problem that looks like the team’s actual work), and respected (the team discusses it in a follow-up working session rather than grading it silently). A bad take-home is open-ended, drags on for a weekend, and becomes unpaid consulting.

Reference check questions calibrated for MLOps

Generic reference questions tell you generic things. The questions that actually reveal MLOps ability:

  • “What did this engineer do the first time a model they owned caused a production incident?”
  • “How did they handle a conflict with a product manager over a feature launch timeline?”
  • “Would you hire them back today, and for what type of role?”

The third question, asked late in the conversation, has flagged more good and bad candidates for me than any other reference question in fifteen years.

A vetting checklist you can use tomorrow

A condensed checklist for your next production ML hire, in the order I recommend applying it:

  1. Resume screen: look for “shipped,” concrete outcomes, monitoring language
  2. Recruiter call: confirm production experience, team size, tech stack
  3. Hiring manager conversation: deep-dive on one production project
  4. System design interview: end-to-end production ML scenario
  5. Bounded take-home or live coding: realistic, four hours max
  6. Cross-functional round: partner with product or engineering peer
  7. Structured reference checks: calibrated questions, never skipped

Run that loop in fewer than fifteen business days, and you will hire candidates who actually have what MLOps engineers need to succeed on day one. Skip a step, and you will be back in the market in eight months wondering what went wrong.