ML in production demands process before MLOps has a name
Why companies that are serious about machine learning inevitably reach the need to version not just code, but data, models, and experiments.
Teams that move from machine learning experiments to actually using models in production sooner or later hit the same problem. The model works. Then something changes - data, hyperparameters, a library - and nobody can reproduce exactly what was working three months ago.
This is not an academic problem. It is what happens in real teams. And it is a signal that machine learning demands a different level of discipline compared to ordinary software development.
What needs to be versioned and why
In ordinary software development, code is versioned. Git handles that well. In ML, code is only one of several ingredients in the outcome.
Data. If several model variants are trained on the same data, an experiment can be reproduced. If the data has changed and the original snapshot was not preserved, it cannot be. A change to the training set is a change that needs to be recorded just as much as a code change.
Experiments. A data scientist tries different configurations: architectures, hyperparameters, feature sets. A month later it is hard to remember exactly what gave the best result and why. Experiments need to be logged systematically - not in a notebook and not from memory.
Models. A model weights file sitting on a server with no history is not versioning. You need to know: which model version is running in production now, what data and hyperparameters it was trained on, what the metrics were at training time.
Environment. Library versions, Python, dependencies. A model that performed well on Python 3.5 and scikit-learn 0.19 may behave differently on a different version. The environment is part of reproducibility.
Why this matters for business
I talk to executives for whom model versioning sounds like a technical detail. It is not a detail.
The first practical argument: audit. If a model makes decisions that affect customers - credit scoring, pricing, prioritisation - there must be the ability to reproduce exactly which model made a specific decision at a specific moment.
The second argument: recovery. If a new model version produced worse results, there must be the ability to roll back quickly. Without versioning that is either impossible or requires days of work.
The third argument: knowledge retention. Data science teams change. Knowledge about what works and what does not must live somewhere other than people's heads.
Where to start
These processes do not have a common name in 2018. In a few years this will be called MLOps. But the need for them arises before the convenient term appears.
You can start small:
- Log all experiments in a simple structure - date, parameters, result metrics.
- Store snapshots of training data with explicit version identifiers.
- Save model artefacts (weight files) linked to a code commit and data version.
- Document the environment - at minimum as a requirements.txt or Docker image.
- Introduce a process: any model change in production goes through a fixed set of steps.
This does not require specialised tooling immediately. It requires discipline and agreements within the team. Tools can be added later - they fit much better onto a ready culture.