---
# Content: CC BY-NC-SA 4.0 | Code: MIT - see /LICENSE.md
title: "MLOps pipeline"
---
## The model that has to keep being right {#sec-keep-being-right}
In Chapter 21 you deployed the churn model as a service; in Chapter 22 you made a result reproducible. This chapter confronts the case that needs both at once, and more: a model that must keep performing *as the world changes*. You can't deploy it and forget it, because it will drift (Chapter 16). You can't reproduce a single result and be done, because the data keeps moving. And you can't retrain it by hand forever, because that doesn't scale and nobody wants to be the person who remembers to do it every Monday.
MLOps is the discipline of automating the *loop*: training, deploying, monitoring, and retraining wired into a continuous cycle that keeps the model current with as little manual intervention as is safe. It is the most demanding end-to-end project in this book because it assembles nearly all of it — and it's a fitting place to finish, because the loop is only as strong as the weakest practice in it.
## The MLOps loop {#sec-mlops-loop}
The shape of MLOps is a cycle, and every arrow in it is a chapter you've already read:
> a **reproducible training pipeline** (Chapter 22) produces a versioned **model artefact** → registered in a **model registry** → **deployed** as a service (Chapters 14, 15, 21) → **monitored** in production (Chapter 16) → drift or decay **triggers retraining** → back to the start.
What makes it MLOps rather than a pile of scripts is that the cycle runs continuously and mostly automatically, driven by signals from production rather than by someone remembering. The practices don't change; they're connected into a loop.
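To make the shape concrete, here is a minimal sketch of one turn of the cycle as a plain function. Everything in it is hypothetical: `run_one_cycle`, `CycleResult`, and the three stage parameters are illustrative names, and the lambdas are toy stand-ins for the real monitoring, training, and validation stages built later in the chapter.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CycleResult:
    retrained: bool
    promoted: bool
    live_version: str


def run_one_cycle(
    live_version: str,
    drift_detected: Callable[[], bool],
    train_candidate: Callable[[], str],
    candidate_is_better: Callable[[str, str], bool],
) -> CycleResult:
    """One turn of the loop: monitor, then maybe retrain, then maybe promote."""
    if not drift_detected():
        # No signal from production: the live model stays where it is.
        return CycleResult(retrained=False, promoted=False, live_version=live_version)
    candidate = train_candidate()
    if candidate_is_better(candidate, live_version):
        return CycleResult(retrained=True, promoted=True, live_version=candidate)
    # The candidate lost the comparison, so production is left untouched.
    return CycleResult(retrained=True, promoted=False, live_version=live_version)


# Toy stand-ins for the real stages, only to show the control flow.
quiet = run_one_cycle("1.0.0", lambda: False, lambda: "1.1.0", lambda c, p: True)
drifted = run_one_cycle("1.0.0", lambda: True, lambda: "1.1.0", lambda c, p: True)
rejected = run_one_cycle("1.0.0", lambda: True, lambda: "1.1.0", lambda c, p: False)
print(quiet.live_version, drifted.live_version, rejected.live_version)  # 1.0.0 1.1.0 1.0.0
```

Passing the stages in as callables keeps the control flow separate from any particular tool: the same skeleton holds whether the stages are scheduled jobs, pipeline steps, or function calls in a notebook.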
::: {.callout-note}
## Data Science Bridge
The MLOps loop is your own experiment–iterate cycle, automated and moved into production. In exploration you train a model, evaluate it, notice a weakness, adjust, and retrain — a loop you drive by hand, with your judgement at every step. MLOps is that same loop running in production, except the trigger to go round again isn't your curiosity but a monitoring signal: not "I had an idea" but "the input distribution drifted" or "performance dropped below the floor". The cycle you already know intimately is the cycle being automated.
Where it breaks down: your exploratory loop optimises for *discovery*, and you're inside it making judgement calls on every iteration. The production loop optimises for *staying current*, and it has to run safely with you out of it most of the time — which means the judgement you'd normally apply by eye has to be encoded as automated gates: a drift threshold that decides when to retrain, a validation check that decides whether a new model is allowed to replace the old one. The loop is the same; the judgement has to be made explicit because no one is watching each turn.
:::
## A reproducible training pipeline {#sec-training-pipeline}
The training side of the loop is Chapter 22 applied to a model: a deterministic pipeline that takes versioned data, builds features, trains, and evaluates — producing not just a model but a model *with its lineage*, the version of the data and code that made it and the metric it achieved. That bundle is what gets registered.
```{python}
#| label: training-pipeline
#| echo: true
import hashlib

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score


def fingerprint(df: pd.DataFrame) -> str:
    """A stable hash of the training data, part of the model's lineage."""
    return hashlib.sha256(pd.util.hash_pandas_object(df).values).hexdigest()[:12]


def train_and_evaluate(data: pd.DataFrame, version: str) -> dict:
    """The training pipeline: fit, score, and record provenance."""
    X = data[["spend_per_active_day", "is_recent", "days_since_login"]]
    y = data["churned"]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)
    model = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    return {"model": model, "version": version,
            "data_fingerprint": fingerprint(data), "auc": round(float(auc), 3)}


def make_data(seed: int, drift: float = 0.0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    days_since_login = rng.integers(0, 90, 3_000) + drift
    return pd.DataFrame({
        "spend_per_active_day": rng.exponential(5, 3_000),
        "is_recent": (days_since_login < 30).astype(int),
        "days_since_login": days_since_login,
        "churned": rng.binomial(1, 1 / (1 + np.exp(-(days_since_login / 30 - 1)))),
    })


training_data = make_data(seed=1)
registered = train_and_evaluate(training_data, version="1.0.0")
print(f"registered model {registered['version']} "
      f"data={registered['data_fingerprint']} AUC={registered['auc']}")
```
The artefact carries everything needed to trust and trace it: a version, the fingerprint of the exact data it learned from, and its measured performance. A prediction in production can now be traced back through its `model_version` (Chapter 21) to the precise data and code that produced the model — the reproducibility of Chapter 22, attached to a living model.
## The model registry {#sec-model-registry}
Models are artefacts, and like code (Chapter 2) and data (Chapter 22) they need versioning — which is what a **model registry** provides. A registry (MLflow is a common one) stores each trained model with its version, its lineage, and its metrics, and tracks which version is in `staging` and which in `production`. This is what makes promotion and rollback concrete: you promote a specific, identified version to production, and if it misbehaves you roll back to the previous one by name, not by frantically retraining. The registry is the single source of truth for "which model is live, and what made it" — the operational counterpart to the experiment log from Chapter 2.
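As a rough illustration of what a registry keeps track of, the sketch below is a toy in-memory version. It is not MLflow's API: `ModelRecord`, `ModelRegistry`, and their methods are hypothetical names chosen for this example, and the fingerprints and AUC values are placeholders rather than results from the pipeline above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ModelRecord:
    version: str
    data_fingerprint: str
    auc: float


@dataclass
class ModelRegistry:
    """Toy in-memory registry: stores records and the history of production versions."""
    records: dict[str, ModelRecord] = field(default_factory=dict)
    production_history: list[str] = field(default_factory=list)

    def register(self, record: ModelRecord) -> None:
        # Versions are immutable: re-registering one would break traceability.
        if record.version in self.records:
            raise ValueError(f"version {record.version} is already registered")
        self.records[record.version] = record

    def promote(self, version: str) -> None:
        if version not in self.records:
            raise KeyError(f"unknown version {version}")
        self.production_history.append(version)

    def rollback(self) -> str:
        """Return to the previous production version, by name, without retraining."""
        if len(self.production_history) < 2:
            raise RuntimeError("no previous production version to roll back to")
        self.production_history.pop()
        return self.production_history[-1]

    @property
    def production(self) -> ModelRecord:
        return self.records[self.production_history[-1]]


registry = ModelRegistry()
registry.register(ModelRecord("1.0.0", "a1b2c3d4e5f6", 0.70))  # placeholder values
registry.register(ModelRecord("1.1.0", "0f9e8d7c6b5a", 0.72))  # placeholder values
registry.promote("1.0.0")
registry.promote("1.1.0")
print(registry.production.version)  # 1.1.0
registry.rollback()
print(registry.production.version)  # 1.0.0
```

The point of the sketch is the rollback: because the previous version is still registered and identified, going back is a lookup rather than a retraining job.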
## Serving, with a way back {#sec-serving-with-way-back}
The serving side is Chapters 15 and 21, now fed by the registry: the service loads the model version the registry marks as `production`, exposes it behind the typed API contract, and is rolled out gradually — a canary release watched before it takes full traffic — with the previous version retained for an instant rollback. The model artefact is loaded, never retrained, at serving time, and the response carries its version so every prediction is traceable. None of this is new; it's the deployment discipline you've already built, now pointed at a registry rather than a single saved file.
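One way a canary split might be implemented is a deterministic hash of the customer identifier, so the same customer always reaches the same model version while the rollout is in progress. This is a sketch under assumptions, not the routing used in earlier chapters: `route_version`, `BUCKETS`, and the identifier format are hypothetical, and in practice the split often lives in a load balancer or deployment platform rather than in application code.

```python
import hashlib

BUCKETS = 10_000  # resolution of the traffic split (hypothetical choice)


def route_version(customer_id: str, stable: str, canary: str, canary_share: float) -> str:
    """Pick the model version for a request; a given customer always gets the same answer."""
    if not 0.0 <= canary_share <= 1.0:
        raise ValueError("canary_share must be between 0 and 1")
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % BUCKETS
    return canary if bucket < round(canary_share * BUCKETS) else stable


customer_ids = [f"customer-{i}" for i in range(10_000)]
routed = [route_version(c, stable="1.0.0", canary="1.1.0", canary_share=0.05)
          for c in customer_ids]
print(f"canary share observed: {routed.count('1.1.0') / len(routed):.3f}")

# The response carries the version that served it, so the prediction stays traceable.
response = {
    "customer_id": "customer-42",
    "model_version": route_version("customer-42", "1.0.0", "1.1.0", 0.05),
}
print(response)
```

With this design, the rollback is a configuration change: setting `canary_share` to `0.0` sends every request back to the retained stable version, and raising it to `1.0` completes the rollout.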
## Closing the loop: monitor and retrain {#sec-closing-loop}
What turns a deployment into MLOps is the return arrow. Monitoring (Chapter 16) watches the live inputs for drift and, when ground-truth labels eventually arrive, watches performance. A drift or decay signal *triggers* the training pipeline to produce a new candidate — and here is the crucial safety gate: a new model is not promoted just because it's newer. It must be *validated against the current one* and only promoted if it's genuinely better, or rolled back if it regresses. Two automated decisions, made explicit, are what let the loop run without a human watching every turn:
```{python}
#| label: closing-the-loop
#| echo: true
from scipy import stats
# Gate 1 — should we retrain? A drift signal, not a calendar, is the trigger.
live_data = make_data(seed=2, drift=10)  # this quarter: customers have gone longer since their last login
ks_p = stats.ks_2samp(training_data["days_since_login"], live_data["days_since_login"]).pvalue
should_retrain = ks_p < 0.01
print(f"input drift p={ks_p:.1e} -> retrain triggered: {should_retrain}")
# Gate 2 — promote the candidate only if it beats production on the SAME recent data.
candidate = train_and_evaluate(live_data, version="1.1.0")
recent = make_data(seed=3, drift=10)
X_recent = recent[["spend_per_active_day", "is_recent", "days_since_login"]]
current_auc = roc_auc_score(recent["churned"], registered["model"].predict_proba(X_recent)[:, 1])
candidate_auc = roc_auc_score(recent["churned"], candidate["model"].predict_proba(X_recent)[:, 1])
promote = candidate_auc > current_auc  # strictly better; a tie keeps the production model
print(f"on recent data: current={current_auc:.3f} candidate={candidate_auc:.3f} -> promote: {promote}")
print("retrained on fresh data, validated against production, promoted only if better.")
```
Drift in the input distribution triggers a retrain; the freshly trained candidate is then judged against the production model *on the same recent data*, and promoted only if it actually wins. Encoding those two judgements — when to retrain, and whether the new model earns its place — is what makes the loop trustworthy enough to run on its own, with a human approving the promotion rather than performing every step. Close that loop and the model maintains itself: it notices when it's slipping, retrains, checks the replacement, and ships it — over and over, keeping current with a world that won't hold still.
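The cell above covers the drift trigger. The other signal mentioned earlier, performance decay once ground-truth labels arrive, can be sketched in the same spirit. This is a minimal, dependency-free illustration: `pairwise_auc` and `performance_decayed` are hypothetical helpers, the labels and scores are made-up toy values, and the `tolerance` is an assumption you would have to justify for your own model.

```python
def pairwise_auc(labels: list[int], scores: list[float]) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def performance_decayed(labels: list[int], scores: list[float],
                        registered_auc: float, tolerance: float = 0.05) -> bool:
    """True when live AUC falls below the registered AUC minus a tolerance."""
    return pairwise_auc(labels, scores) < registered_auc - tolerance


# Toy labelled outcomes that arrived after the predictions were served.
live_labels = [1, 1, 0, 0, 1, 0]
live_scores = [0.9, 0.4, 0.5, 0.2, 0.7, 0.6]

live_auc = pairwise_auc(live_labels, live_scores)
decayed = performance_decayed(live_labels, live_scores, registered_auc=0.85)
print(f"live AUC={live_auc:.3f} -> decay trigger: {decayed}")  # live AUC=0.778 -> decay trigger: True
```

Either signal can start a retrain: drift fires early, before labels exist, while the decay check fires later but measures the thing you actually care about.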
::: {.callout-tip}
## Author's Note
The MLOps loop is where every practice in this book is cashed in at once, and where it becomes impossible to skip any of them. You cannot automate a retraining loop you can't *reproduce* (Chapter 22), can't *deploy* safely (Chapters 15 and 21), can't *test* (Chapter 7), can't *monitor* (Chapter 16), or can't *roll back* (Chapter 15). The loop is only as trustworthy as its weakest link, which is why it comes last: it needs all of it.
And it's the right place to close the book, because it answers the question the whole book has been circling. The goal was never to turn you into a software engineer, or to make you abandon the exploratory instincts that make you good at finding things. It was to let your models live in the world — to keep being right, for real users, without you watching every prediction. That is what all the scaffolding has been *for*: the package, the tests, the container, the monitoring, the loop. The notebook found something true; engineering is how you keep it true after you've moved on to the next question. You won't build the full loop for every model — the judgement about how far to go is itself one of the things you've learned — but when a model genuinely matters and has to stay right, this is the shape of the thing, and now you know how to build it.
:::
## Summary {#sec-mlops-summary}
MLOps wires the whole book into a self-maintaining loop:
1. **The loop is the unit.** A reproducible training pipeline feeds a registry, which feeds a monitored deployment, whose drift signals trigger retraining — every arrow a practice from an earlier chapter, connected into a cycle.
2. **Models are versioned artefacts with lineage.** A registry tracks each model's version, the data and code that made it, and its metrics, making promotion and rollback a matter of naming a version.
3. **Retraining is triggered, and promotion is gated.** Drift triggers a retrain rather than the calendar; a new model is promoted only if it beats the current one on fair, recent data — both judgements encoded as automated gates.
4. **The loop is only as strong as its weakest practice.** Reproducibility, testing, deployment, monitoring, and rollback all have to hold for the cycle to run safely without a human at every step.
## Where to go from here {#sec-where-to-go}
That is the bridge built. You began with a notebook that worked on your machine and the four properties it lacked — reproducibility, modularity, testability, readability — and you now have the engineering practices to supply all four, and to carry a model all the way from exploration to a system that maintains itself. You don't have to apply every practice to every project; the judgement of how far to go is part of what you've gained. But when the work matters, you can now make it reliable, reproducible, and trustworthy — a data scientist whose code works far beyond their own laptop.
The companion volume, *Thinking in Uncertainty*, travels the bridge in the other direction, helping software engineers develop the statistical and probabilistic thinking that you brought to this book as your home ground. Between the two, the gap that opened this book — two disciplines sharing tools and vocabulary but thinking differently — is a little narrower, crossed from both sides.
## Exercises {#sec-mlops-exercises}
1. Sketch the MLOps loop for a model of your own: name what plays the part of each stage — the training pipeline, the registry, the deployment, the monitoring signal, and the retraining trigger. Which stage is currently missing or manual, and what would it take to automate it?
2. Implement the retraining trigger: write a check that compares a live feature distribution against the training reference and returns a decision to retrain when it drifts past a threshold you justify. What false-alarm rate would make this trigger more trouble than it's worth?
3. Implement the promotion gate: given a current model and a freshly trained candidate, evaluate both on the same recent data and decide whether to promote the candidate. Why must the comparison use the *same* evaluation data, and why a margin rather than strict improvement?
4. **Conceptual:** The Data Science Bridge compares the MLOps loop to your own experiment–iterate cycle. Give one way the analogy holds and one way it breaks down. What has to be made explicit in the production loop that you handle by judgement during exploration?
5. **Conceptual:** The Author's Note claims the loop is only as strong as its weakest practice. Pick any one practice from Parts 1–5 (reproducibility, testing, deployment, monitoring, rollback) and describe concretely how an automated retraining loop fails if that single practice is missing.