23 MLOps pipeline

23.1 The model that has to keep being right

In Chapter 21 you deployed the churn model as a service; in Chapter 22 you made a result reproducible. This chapter confronts the case that needs both at once, and more: a model that must keep performing as the world changes. You can’t deploy it and forget it, because it will drift (Chapter 16). You can’t reproduce a single result and be done, because the data keeps moving. And you can’t retrain it by hand forever, because that doesn’t scale and nobody wants to be the person who remembers to do it every Monday.

MLOps is the discipline of automating the loop: training, deploying, monitoring, and retraining wired into a continuous cycle that keeps the model current with as little manual intervention as is safe. It is the most demanding end-to-end project in this book because it assembles nearly all of it — and it’s a fitting place to finish, because the loop is only as strong as the weakest practice in it.

23.2 The MLOps loop

The shape of MLOps is a cycle, and every arrow in it is a chapter you’ve already read:

a reproducible training pipeline (Chapter 22) produces a versioned model artefact → registered in a model registry → deployed as a service (Chapter 14, Chapter 15, Chapter 21) → monitored in production (Chapter 16) → drift or decay triggers retraining → back to the start.

What makes it MLOps rather than a pile of scripts is that the cycle runs continuously and mostly automatically, driven by signals from production rather than by someone remembering. The practices don’t change; they’re connected into a loop.

Data Science Bridge

The MLOps loop is your own experiment–iterate cycle, automated and moved into production. In exploration you train a model, evaluate it, notice a weakness, adjust, and retrain — a loop you drive by hand, with your judgement at every step. MLOps is that same loop running in production, except the trigger to go round again isn’t your curiosity but a monitoring signal: not “I had an idea” but “the input distribution drifted” or “performance dropped below the floor”. The cycle you already know intimately is the cycle being automated.

Where it breaks down: your exploratory loop optimises for discovery, and you’re inside it making judgement calls on every iteration. The production loop optimises for staying current, and it has to run safely with you out of it most of the time — which means the judgement you’d normally apply by eye has to be encoded as automated gates: a drift threshold that decides when to retrain, a validation check that decides whether a new model is allowed to replace the old one. The loop is the same; the judgement has to be made explicit because no one is watching each turn.

23.3 A reproducible training pipeline

The training side of the loop is Chapter 22 applied to a model: a deterministic pipeline that takes versioned data, builds features, trains, and evaluates — producing not just a model but a model with its lineage, the version of the data and code that made it and the metric it achieved. That bundle is what gets registered.

import hashlib
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def fingerprint(df: pd.DataFrame) -> str:
    """A stable hash of the training data — part of the model's lineage."""
    return hashlib.sha256(pd.util.hash_pandas_object(df).values).hexdigest()[:12]

def train_and_evaluate(data: pd.DataFrame, version: str) -> dict:
    """The training pipeline: fit, score, and record provenance."""
    X = data[["spend_per_active_day", "is_recent", "days_since_login"]]
    y = data["churned"]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)
    model = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    return {"model": model, "version": version,
            "data_fingerprint": fingerprint(data), "auc": round(float(auc), 3)}

def make_data(seed: int, drift: float = 0.0) -> pd.DataFrame:
    """Synthetic churn data; `drift` shifts days_since_login to mimic a changed regime."""
    rng = np.random.default_rng(seed)
    days_since_login = rng.integers(0, 90, 3_000) + drift
    return pd.DataFrame({
        "spend_per_active_day": rng.exponential(5, 3_000),
        "is_recent": (days_since_login < 30).astype(int),
        "days_since_login": days_since_login,
        "churned": rng.binomial(1, 1 / (1 + np.exp(-(days_since_login / 30 - 1)))),
    })

training_data = make_data(seed=1)
registered = train_and_evaluate(training_data, version="1.0.0")
print(f"registered model {registered['version']}  "
      f"data={registered['data_fingerprint']}  AUC={registered['auc']}")

registered model 1.0.0  data=cf0f14ace53e  AUC=0.714

The artefact carries what is needed to trust and trace it: a version, the fingerprint of the exact data it learned from, and its measured performance. (A fuller pipeline would also record the code revision, for instance a commit hash; this sketch leaves it out.) A prediction in production can now be traced back through its model_version (Chapter 21) to the model and the precise data that produced it: the reproducibility of Chapter 22, attached to a living model.
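As a quick check on what the fingerprint buys you, the sketch below repeats the fingerprint helper from the listing above and applies it to a tiny hand-made frame. The rows are invented for illustration, and the check assumes pandas' default hashing key. It shows the two properties lineage relies on: identical data gives an identical fingerprint, and a single edited cell gives a different one.

```python
import hashlib

import pandas as pd


def fingerprint(df: pd.DataFrame) -> str:
    """Same helper as in the training pipeline: a short, stable hash of the rows."""
    return hashlib.sha256(pd.util.hash_pandas_object(df).values).hexdigest()[:12]


# Illustrative rows only; the values are made up for this check.
original = pd.DataFrame({"days_since_login": [3, 40, 75], "churned": [0, 1, 1]})
same_again = original.copy()
edited = original.copy()
edited.loc[1, "days_since_login"] = 41  # a single-cell change

same_fp = fingerprint(original) == fingerprint(same_again)
changed_fp = fingerprint(original) != fingerprint(edited)
print(f"identical data -> identical fingerprint: {same_fp}")
print(f"one edited cell -> different fingerprint: {changed_fp}")
```

Because hash_pandas_object includes the index by default, reordering or re-indexing the rows also changes the fingerprint, so the pipeline should load its data in a fixed order.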

23.4 The model registry

Models are artefacts, and like code (Chapter 2) and data (Chapter 22) they need versioning, which is what a model registry provides. A registry (MLflow is a common one) stores each trained model with its version, its lineage, and its metrics, and lets you attach a moving label to whichever version is currently the one you mean. In MLflow those labels are aliases: by convention champion for the version serving production and challenger for one being evaluated against it. An alias is set with client.set_registered_model_alias(name, "champion", version) and resolved by the serving code through a URI like models:/churn@champion. (Older material uses stages, with fixed names like Staging and Production; these were deprecated in MLflow 2.9 in favour of aliases, which you can name freely. Each alias points at exactly one version at a time, though a single version can carry several aliases.) This is what makes promotion and rollback concrete: promotion moves the champion alias to a specific, identified version, and rollback moves it back, a label change rather than a frantic retrain. The registry is the single source of truth for “which model is live, and what made it”, the operational counterpart to the experiment log from Chapter 2.
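To make the "label change" concrete without depending on a running registry, here is a minimal in-memory sketch. ToyRegistry and its methods are hypothetical names invented for this illustration; they are not MLflow's API, only a stand-in with the same shape: versions carry lineage, and an alias is a pointer to exactly one of them.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRegistry:
    """Hypothetical in-memory stand-in for a model registry (not MLflow's API)."""
    versions: dict = field(default_factory=dict)  # version -> lineage metadata
    aliases: dict = field(default_factory=dict)   # alias -> exactly one version

    def register(self, version: str, **lineage) -> None:
        self.versions[version] = lineage

    def set_alias(self, alias: str, version: str) -> None:
        if version not in self.versions:
            raise KeyError(f"unknown version {version!r}")
        self.aliases[alias] = version             # moving the label is the whole operation

    def resolve(self, alias: str) -> str:
        return self.aliases[alias]


registry = ToyRegistry()
registry.register("1.0.0", data_fingerprint="cf0f14ace53e", auc=0.714)
registry.register("1.1.0", note="candidate from the retraining run")

registry.set_alias("champion", "1.0.0")
registry.set_alias("champion", "1.1.0")   # promotion: the label moves forward
promoted = registry.resolve("champion")
registry.set_alias("champion", "1.0.0")   # rollback: the label moves back
rolled_back = registry.resolve("champion")
print(promoted, rolled_back)
```

Neither promotion nor rollback touches a model artefact. Both versions stay registered throughout, which is why rolling back can be immediate.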

23.5 Serving, with a way back

The serving side is Chapter 15 and Chapter 21, now fed by the registry: the service loads the model version the registry’s champion alias points at, exposes it behind the typed API contract, and is rolled out gradually — a canary release watched before it takes full traffic — with the previous version retained for an instant rollback. The model artefact is loaded, never retrained, at serving time, and the response carries its version so every prediction is traceable. None of this is new; it’s the deployment discipline you’ve already built, now pointed at a registry rather than a single saved file.
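One common way to implement the gradual rollout is to hash a stable request or customer id into a bucket and send a fixed share of buckets to the new version. The sketch below is an illustration under assumptions, not the deployment code from Chapter 15: pick_version, predict_response, and the 10% canary share are all hypothetical.

```python
import hashlib

CANARY_FRACTION = 0.10  # assumed share of traffic sent to the new version


def pick_version(request_id: str, champion: str, canary: str,
                 canary_fraction: float = CANARY_FRACTION) -> str:
    """Deterministically route a stable fraction of request ids to the canary."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return canary if bucket < canary_fraction else champion


def predict_response(request_id: str, score: float, version: str) -> dict:
    """Every response carries the version that produced it."""
    return {"request_id": request_id, "churn_probability": score,
            "model_version": version}


routed = [pick_version(f"req-{i}", champion="1.0.0", canary="1.1.0")
          for i in range(10_000)]
canary_share = routed.count("1.1.0") / len(routed)
print(f"canary share over 10,000 requests: {canary_share:.3f}")

example = predict_response("req-1", 0.42, pick_version("req-1", "1.0.0", "1.1.0"))
print(example)
```

Setting canary_fraction to 0 sends every request back to the champion, which is the instant rollback. Raising it to 1 completes the rollout.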

23.6 Closing the loop: monitor and retrain

What turns a deployment into MLOps is the return arrow. Monitoring (Chapter 16) watches the live inputs for drift and, when ground-truth labels eventually arrive, watches performance. A drift or decay signal triggers the training pipeline to produce a new candidate — and here is the crucial safety gate: a new model is not promoted just because it’s newer. It must be validated against the current one and only promoted if it’s genuinely better, or rolled back if it regresses. Two automated decisions, made explicit, are what let the loop run without a human watching every turn:

from scipy import stats

# Gate 1 — should we retrain? A drift signal, not a calendar, is the trigger.
# This quarter's traffic, inputs only: churn labels for it don't exist yet.
live_features = make_data(seed=2, drift=10).drop(columns="churned")

ks = stats.ks_2samp(training_data["days_since_login"], live_features["days_since_login"])

# Two conditions, and the first is the one that matters. At 3,000 rows a KS test
# returns a significant p-value for a shift far too small to act on, so
# significance alone is not a decision — the KS statistic is the effect size.
DRIFT_THRESHOLD = 0.10
should_retrain = ks.statistic > DRIFT_THRESHOLD and ks.pvalue < 0.01
print(f"input drift: KS={ks.statistic:.3f} (p={ks.pvalue:.1e})"
      f"  ->  retrain triggered: {should_retrain}")

# Gate 2 — retrain and promote using data old enough to be honestly labelled.
# Churn labels take a quarter to mature (Chapter 16), so both the candidate's
# training set and the evaluation set come from the drifted-but-matured window,
# not from the traffic that just arrived.
matured = make_data(seed=4, drift=10)          # drifted regime, labels now in
holdout = make_data(seed=3, drift=10)          # held out from the same window

candidate = train_and_evaluate(matured, version="1.1.0")
X_holdout = holdout[["spend_per_active_day", "is_recent", "days_since_login"]]
current_auc = roc_auc_score(holdout["churned"], registered["model"].predict_proba(X_holdout)[:, 1])
candidate_auc = roc_auc_score(holdout["churned"], candidate["model"].predict_proba(X_holdout)[:, 1])

# A margin, not a bare >=: an AUC lead smaller than the metric's own sampling
# wobble is noise, and promoting on it churns production for nothing.
PROMOTION_MARGIN = 0.01
promote = candidate_auc >= current_auc + PROMOTION_MARGIN
print(f"on matured labelled data: current={current_auc:.3f}  candidate={candidate_auc:.3f}"
      f"  (margin {PROMOTION_MARGIN})  ->  promote: {promote}")

input drift: KS=0.121 (p=1.2e-19)  ->  retrain triggered: True
on matured labelled data: current=0.703  candidate=0.712  (margin 0.01)  ->  promote: False

Three design choices are encoded in those two gates, and each one is a place where the obvious implementation is wrong.

The retrain trigger gates on the KS statistic, not its p-value. This is the significance-versus-effect-size trap in its natural habitat: feed a two-sample test enough production rows and it will report p < 0.001 for a shift of no practical consequence whatsoever, because that is what a p-value does as N grows — it answers “is the difference real?”, never “is it big enough to care about?”. A trigger built on significance alone fires constantly at production volume. The statistic answers the question you actually have, and the p-value is left in as a sanity check on small batches.
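The gap between significance and effect size is easy to reproduce. The sketch below is a separate illustration rather than part of the pipeline: it draws two large synthetic samples whose means differ by 3% of a standard deviation (the sample size and the shift are assumptions chosen for the demonstration) and applies the same two conditions as Gate 1.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 500_000                                      # assumed "production volume" per sample
reference = rng.normal(loc=0.00, scale=1.0, size=n)
live = rng.normal(loc=0.03, scale=1.0, size=n)   # shift of 3% of one standard deviation

ks = stats.ks_2samp(reference, live)
DRIFT_THRESHOLD = 0.10                           # same threshold as the gate above

significant = ks.pvalue < 0.001                  # what a significance-only trigger asks
large_enough = ks.statistic > DRIFT_THRESHOLD    # what the gate actually asks
print(f"KS={ks.statistic:.4f}  p={ks.pvalue:.1e}")
print(f"significant: {significant}   large enough to act on: {large_enough}")
```

The test is overwhelmingly significant while the statistic sits far below the threshold, so a trigger built on the p-value alone would retrain here and the effect-size gate would not.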

The promotion gate requires a margin rather than a bare >=. AUC has sampling variability of its own, and a candidate that wins by 0.002 has not demonstrated anything except that it was evaluated on a finite holdout. Set the margin from the metric’s observed run-to-run variation — bootstrap the holdout if you want to be principled about it — and the gate stops swapping the production model on noise.

The Gate 2 output above shows this working, and it’s worth sitting with the result: the candidate scores higher than the incumbent (0.712 against 0.703) and is not promoted, because its lead of about 0.009 falls just short of the 0.01 margin. Under a bare >= it would have shipped. That is the gate doing its job rather than failing: “better on this holdout” and “better in production” are not the same claim, and the margin is where you set the price of the difference. If the drift is real and persists, a later window’s candidate should clear the bar; if it wasn’t, you have avoided a pointless redeployment.
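One way to read "bootstrap the holdout" is a paired bootstrap: resample the holdout rows with replacement, score both models on the same resample, and look at how much the AUC gap moves. The sketch below is a minimal version under stated assumptions. The scores are synthetic stand-ins for two models' predictions, and auc is a hand-rolled rank-based implementation used to keep the example NumPy-only (in the chapter's pipeline you would call roc_auc_score instead). It assumes no two distinct rows share a score.

```python
import numpy as np


def auc(y_true: np.ndarray, scores: np.ndarray) -> float:
    """Rank-based AUC (Mann-Whitney form); assumes distinct rows have distinct scores."""
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return float((ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))


def bootstrap_auc_gap(y, current_scores, candidate_scores, n_boot=500, seed=0):
    """Paired bootstrap: AUC gap (candidate minus current) on each resampled holdout."""
    rng = np.random.default_rng(seed)
    gaps = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, len(y), len(y))   # same rows for both models
        gaps[b] = auc(y[idx], candidate_scores[idx]) - auc(y[idx], current_scores[idx])
    return gaps


rng = np.random.default_rng(1)
n = 1_000                                        # illustrative holdout size
y = rng.binomial(1, 0.5, n)
current_scores = y + rng.normal(0, 1.2, n)       # made-up scores for two models
candidate_scores = y + rng.normal(0, 1.1, n)

gaps = bootstrap_auc_gap(y, current_scores, candidate_scores)
low, high = np.percentile(gaps, [2.5, 97.5])
print(f"AUC gap: std={gaps.std():.3f}  95% interval=({low:.3f}, {high:.3f})")
```

The spread of the gap is the "wobble" the margin has to exceed. Choosing the margin as some multiple of that standard deviation is a judgement call, not a rule this sketch settles.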

And both the candidate’s training data and the evaluation holdout come from a matured window rather than this quarter’s traffic. This is the label-lag honesty from Chapter 16 carried into the loop: you find out who churned months after they churn, so the freshest data is exactly the data you cannot yet learn from or score against. The drift signal can use today’s inputs, because it needs no labels. Everything downstream of it has to wait. The most common way an MLOps loop quietly lies to itself is by evaluating on recent data whose labels, in production, would not exist yet — a mistake that looks like nothing in a demo and inflates every number in reality.
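In a real pipeline the matured window is selected by date rather than by a seed. A minimal sketch of that rule follows, assuming a 90-day label lag and a hypothetical observed_on field on each row; both are illustrative choices, not values from Chapter 16.

```python
from datetime import date, timedelta

LABEL_LAG = timedelta(days=90)  # assumed: churn labels take about a quarter to mature


def split_by_maturity(rows: list[dict], today: date) -> tuple[list[dict], list[dict]]:
    """Return (matured, fresh): only matured rows may be trained or scored on."""
    cutoff = today - LABEL_LAG
    matured = [r for r in rows if r["observed_on"] <= cutoff]
    fresh = [r for r in rows if r["observed_on"] > cutoff]
    return matured, fresh


today = date(2024, 7, 1)  # illustrative date
rows = [
    {"customer": "a", "observed_on": date(2024, 1, 15)},
    {"customer": "b", "observed_on": date(2024, 3, 20)},
    {"customer": "c", "observed_on": date(2024, 5, 2)},
    {"customer": "d", "observed_on": date(2024, 6, 28)},
]
matured, fresh = split_by_maturity(rows, today)
print("train and evaluate on:", [r["customer"] for r in matured])
print("drift monitoring only:", [r["customer"] for r in fresh])
```

The fresh rows are still useful: their inputs feed the drift check, which needs no labels. They become eligible for training and evaluation only once the lag has passed.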

Encoding those judgements (when to retrain, and whether the new model earns its place) is what makes the loop trustworthy enough to run on its own, with a human approving the promotion rather than performing every step. Close that loop and the model largely maintains itself: it notices when it’s slipping, retrains, checks the replacement, and, once approved, ships it, over and over, keeping current with a world that won’t hold still.
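Put together, one turn of the loop can be sketched as a single function that applies the two gates in order and then asks for sign-off. This is a simplified illustration: run_cycle, CycleResult, and the two callables are hypothetical names, and the p-value sanity check from Gate 1 is left out for brevity.

```python
from dataclasses import dataclass

DRIFT_THRESHOLD = 0.10    # Gate 1: effect size on the KS statistic
PROMOTION_MARGIN = 0.01   # Gate 2: required AUC lead


@dataclass
class CycleResult:
    retrained: bool
    promoted: bool
    reason: str


def run_cycle(ks_statistic, evaluate_candidate, approve) -> CycleResult:
    """One turn of the loop.

    evaluate_candidate() -> (current_auc, candidate_auc) on matured, labelled data;
    approve() -> bool is the human sign-off. Both are hypothetical callables.
    """
    if ks_statistic <= DRIFT_THRESHOLD:
        return CycleResult(False, False, "no actionable drift")
    current_auc, candidate_auc = evaluate_candidate()
    if candidate_auc < current_auc + PROMOTION_MARGIN:
        return CycleResult(True, False, "candidate did not clear the margin")
    if not approve():
        return CycleResult(True, False, "promotion not approved")
    return CycleResult(True, True, "candidate promoted")


# The figures from the run above: drift fires, but the candidate falls short.
result = run_cycle(0.121, lambda: (0.703, 0.712), approve=lambda: True)
print(result)
```

Every exit from the function records a reason, so a cycle that does nothing is as auditable as one that ships a new model.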

Author’s Note

The MLOps loop is where every practice in this book is cashed in at once, and where it becomes impossible to skip any of them. You cannot automate a retraining loop you can’t reproduce (Chapter 22), can’t deploy safely (Chapter 15 and Chapter 21), can’t test (Chapter 7), can’t monitor (Chapter 16), or can’t roll back (Chapter 15). The loop is only as trustworthy as its weakest link, which is why it comes last: it needs all of it.

And it’s the right place to close the book, because it answers the question the whole book has been circling. The goal was never to turn you into a software engineer, or to make you abandon the exploratory instincts that make you good at finding things. It was to let your models live in the world — to keep being right, for real users, without you watching every prediction. That is what all the scaffolding has been for: the package, the tests, the container, the monitoring, the loop. The notebook found something true; engineering is how you keep it true after you’ve moved on to the next question. You won’t build the full loop for every model — the judgement about how far to go is itself one of the things you’ve learned — but when a model genuinely matters and has to stay right, this is the shape of the thing, and now you know how to build it.

23.7 Summary

MLOps wires the whole book into a self-maintaining loop:

1. The loop is the unit. A reproducible training pipeline feeds a registry, which feeds a monitored deployment, whose drift signals trigger retraining: every arrow a practice from an earlier chapter, connected into a cycle.
2. Models are versioned artefacts with lineage. A registry tracks each model’s version, the data and code that made it, and its metrics, making promotion and rollback a matter of naming a version.
3. Retraining is triggered, and promotion is gated. Drift triggers a retrain rather than the calendar; a new model is promoted only if it beats the current one by a real margin, on data old enough to be honestly labelled. Both judgements are encoded as automated gates, and both gate on effect size rather than significance.
4. The loop is only as strong as its weakest practice. Reproducibility, testing, deployment, monitoring, and rollback all have to hold for the cycle to run safely without a human at every step.

23.8 Where to go from here

That is the bridge built. You began with a notebook that worked on your machine and the four properties it lacked — reproducibility, modularity, testability, readability — and you now have the engineering practices to supply all four, and to carry a model all the way from exploration to a system that maintains itself. You don’t have to apply every practice to every project; the judgement of how far to go is part of what you’ve gained. But when the work matters, you can now make it reliable, reproducible, and trustworthy — a data scientist whose code works far beyond their own laptop.

The companion volume, Thinking in Uncertainty, travels the bridge in the other direction, helping software engineers develop the statistical and probabilistic thinking that you brought to this book as your home ground. Between the two, the gap that opened this book — two disciplines sharing tools and vocabulary but thinking differently — is a little narrower, crossed from both sides.

23.9 Exercises

1. Sketch the MLOps loop for a model of your own: name what plays the part of each stage (the training pipeline, the registry, the deployment, the monitoring signal, and the retraining trigger). Which stage is currently missing or manual, and what would it take to automate it?
2. Implement the retraining trigger: write a check that compares a live feature distribution against the training reference and returns a decision to retrain when it drifts past a threshold you justify. What false-alarm rate would make this trigger more trouble than it’s worth?
3. Implement the promotion gate: given a current model and a freshly trained candidate, evaluate both on the same matured, labelled holdout and decide whether to promote the candidate. Why must the comparison use the same evaluation data, and why a margin rather than strict improvement?
4. Conceptual: The production loop only runs unattended because judgements you would make by eye have been written down as thresholds. Name one judgement you genuinely make while iterating on a model that you would not trust a threshold to make on your behalf. Then choose your response for a model you actually work on: encode a cruder proxy and accept what it misses, keep a human in that one step, or leave the loop open and retrain on request. Justify the choice, and say what would have to change for you to pick a different one.
5. Conceptual: The Author’s Note claims the loop is only as strong as its weakest practice. Pick any one practice from Parts 1–5 (reproducibility, testing, deployment, monitoring, rollback) and describe concretely how an automated retraining loop fails if that single practice is missing.

Content: CC BY-NC-SA 4.0 | Code: MIT - see /LICENSE.md