16 Monitoring and observability

16.1 The model that quietly got worse

The model you deployed six months ago is still running. The service is green, the logs show no errors, requests come in and predictions go out. And it has been getting steadily worse for months — its accuracy sliding as the world it was trained on drifted away from the world it now sees — and nobody noticed, because nothing was watching for that. A deployed model has a failure mode that an ordinary web service doesn’t: it can be perfectly healthy and increasingly wrong at the same time.

This is the gap monitoring closes. Everything in the previous chapter got the model running where others depend on it; this chapter is about keeping an eye on it once it’s there — both the system (is it up, fast, error-free?) and, the part data scientists uniquely have to care about, the model (are its predictions still any good?).

16.2 Observability: logs, metrics, health

Start with the operational layer, which a model shares with any service. Three things make a running system observable. Structured logging (Chapter 8) records what happened in a form you can search and reconstruct after the fact — essential when the only evidence of a problem is a complaint about “last Tuesday”. Metrics — request rate, latency, error rate — track the system’s health over time on a dashboard. And a health-check endpoint lets the platform ask “are you alive?” and restart or reroute if not:

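from fastapi import FastAPI   # assumption: the decorator style suggests a FastAPI app
app = FastAPI()

@app.get("/health")
def health() -> dict:
    return {"status": "ok"}   # the platform polls this; a non-200 triggers a restart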

Together these answer “is the system healthy?” — is it up, responding, and fast. For most services that’s the whole job. For a model, it’s only half, because all three can be green while the predictions quietly rot.
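As a rough illustration of the three metrics named above, the sketch below reduces a window of request records to a request rate, an error rate, and a 95th-percentile latency. The `RequestRecord` type, its field names, and the sample values are all hypothetical; a real service would normally hand this job to a metrics library rather than compute it by hand.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class RequestRecord:
    """One served request (hypothetical shape, for illustration only)."""
    latency_ms: float
    status_code: int


def summarise(records: list[RequestRecord], window_seconds: float) -> dict:
    """Reduce a non-empty window of requests to the three headline metrics."""
    latencies = [r.latency_ms for r in records]
    errors = sum(1 for r in records if r.status_code >= 500)
    return {
        "requests_per_second": len(records) / window_seconds,
        "error_rate": errors / len(records),
        # quantiles(n=20) returns 19 cut points; the last is the 95th percentile
        "p95_latency_ms": quantiles(latencies, n=20)[-1],
    }


# Made-up window: 100 requests in 60 seconds, two of them server errors.
window = [RequestRecord(20.0 + i, 200) for i in range(98)]
window += [RequestRecord(900.0, 500), RequestRecord(950.0, 503)]
metrics = summarise(window, window_seconds=60.0)
print(metrics)
```

A percentile is used for latency rather than a mean because two very slow requests barely move the 95th percentile here, while they would drag a mean well away from what most callers experienced.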

16.3 The model-specific problem: drift

The failure that generic monitoring misses is drift: the live data moving away from the data the model was trained on. It comes in two forms. Data drift is a change in the input distribution — the feature values coming in look different from training (a new customer segment, a changed upstream pipeline, inflation moving every price). Concept drift is a change in the relationship itself — the same inputs now map to different outcomes (fraud tactics evolve, so the patterns that meant “safe” last year don’t now).
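The two forms are easy to blur, so here is a toy simulation that separates them. Everything in it is an assumption made for illustration: a one-feature "model" frozen at a decision boundary of 50, Gaussian inputs, and an outcome that is a clean threshold on the same feature.

```python
import random

random.seed(0)


def model(x: float) -> bool:
    """A frozen model: it learned 'positive above 50' at training time."""
    return x > 50


def accuracy(inputs: list[float], true_boundary: float) -> float:
    """Share of predictions that match the outcome the world now produces."""
    hits = sum(model(x) == (x > true_boundary) for x in inputs)
    return hits / len(inputs)


# Data drift: the inputs move (mean 50 -> 65), the relationship does not.
shifted_inputs = [random.gauss(65, 10) for _ in range(5_000)]

# Concept drift: the inputs look like training, but the boundary moves 50 -> 60.
same_inputs = [random.gauss(50, 10) for _ in range(5_000)]

acc_data_drift = accuracy(shifted_inputs, true_boundary=50)
acc_concept_drift = accuracy(same_inputs, true_boundary=60)

print(f"data drift:    input mean {sum(shifted_inputs) / 5_000:.1f}, "
      f"accuracy {acc_data_drift:.2f}")
print(f"concept drift: input mean {sum(same_inputs) / 5_000:.1f}, "
      f"accuracy {acc_concept_drift:.2f}")
```

In this toy setup the data-drift case leaves accuracy untouched while the inputs visibly move, and the concept-drift case loses accuracy while the inputs look exactly like training. Real drift is rarely this clean, but the example shows why an input check can neither confirm nor rule out decay on its own.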

You can detect input drift without waiting for outcomes, by comparing this week’s feature distribution to the training reference with a two-sample test:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
reference = rng.normal(50, 10, 5_000)        # a feature's distribution at training time

live_stable = rng.normal(50, 10, 5_000)      # this week: same process
live_drifted = rng.normal(56, 12, 5_000)     # this week: shifted and more variable

# Gate on the KS statistic — the effect size — not on significance alone.
DRIFT_THRESHOLD = 0.10

for name, live in [("stable", live_stable), ("drifted", live_drifted)]:
    ks_stat, p_value = stats.ks_2samp(reference, live)
    flag = "ALERT — drift" if ks_stat > DRIFT_THRESHOLD else "ok"
    print(f"{name:8}  KS={ks_stat:.3f}  p={p_value:.1e}  -> {flag}")

stable    KS=0.021  p=2.4e-01  -> ok
drifted   KS=0.230  p=2.6e-116  -> ALERT — drift

The stable batch matches the reference and raises no alarm; the drifted batch is flagged, because the distribution has moved far enough to matter.

Notice what the alert is gated on. It would be natural to test p_value < 0.01 and be done, but at 5,000 rows per batch, and far more in a real service, a two-sample test will return a significant p-value for a shift that is real and utterly inconsequential. Significance answers “is this difference more than noise?”, which at production sample sizes is almost always yes; it never answers “is this difference big enough to act on?”. The KS statistic, the largest gap between the two cumulative distributions, is an effect size: it runs from 0 (identical samples) to 1 (no overlap), so a threshold on it means roughly the same thing whatever the batch size. That makes it the number that belongs in the threshold. The p-value is still worth printing as a sanity check on small batches, where a large statistic can be pure noise. The retraining trigger in the MLOps pipeline chapter is built on the same distinction, for the same reason.

This Kolmogorov–Smirnov test suits continuous features; for categorical ones you’d use a population stability index or a chi-squared test. The companion volume, Thinking in Uncertainty, treats the statistics of these comparisons in depth; here the point is that the check is cheap and can run on every batch.
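For a categorical feature, a population stability index can play the role the KS statistic plays above. The sketch below is one common formulation, not a definitive one: the function name, the `floor` guard against empty categories, the plan names, and the 0.25 threshold (a frequently quoted rule of thumb) are all assumptions.

```python
import math
from collections import Counter


def population_stability_index(reference: list[str], live: list[str],
                               floor: float = 1e-4) -> float:
    """PSI = sum over categories of (live% - ref%) * ln(live% / ref%)."""
    ref_counts, live_counts = Counter(reference), Counter(live)
    psi = 0.0
    for category in set(ref_counts) | set(live_counts):
        # Floor the shares so a category missing from one side does not
        # produce log(0); the floor value is an arbitrary choice.
        ref_share = max(ref_counts[category] / len(reference), floor)
        live_share = max(live_counts[category] / len(live), floor)
        psi += (live_share - ref_share) * math.log(live_share / ref_share)
    return psi


# Made-up plan mix at training time and in two live weeks.
reference = ["basic"] * 600 + ["plus"] * 300 + ["pro"] * 100
live_stable = ["basic"] * 590 + ["plus"] * 310 + ["pro"] * 100
live_drifted = ["basic"] * 350 + ["plus"] * 350 + ["pro"] * 300

PSI_THRESHOLD = 0.25  # assumption: a commonly quoted rule of thumb, not a law

for name, live in [("stable", live_stable), ("drifted", live_drifted)]:
    psi = population_stability_index(reference, live)
    flag = "ALERT" if psi > PSI_THRESHOLD else "ok"
    print(f"{name:8}  PSI={psi:.3f}  -> {flag}")
```

Like the KS statistic, PSI measures how far the distribution has moved rather than how sure you are that it moved, so the same threshold stays meaningful as batch sizes grow.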

Data Science Bridge

Monitoring a model is validation that never stops. You validated the model once, on a holdout, before shipping — but that holdout only told you the model was good on data that resembled the past. Monitoring continues that validation indefinitely against live data, on the assumption (which drift violates) that the future keeps resembling the past. The drift check itself is a tool you already own: a two-sample test or a stability index comparing the training sample to this week’s, exactly the distribution comparison you’d run to check whether two datasets came from the same place.

Where it breaks down: holdout validation has labels, so you measure accuracy directly. In production the labels usually lag — you find out who actually churned months later — or never arrive at all. So you fall back to watching the inputs and the predictions as a proxy: input drift can warn you that the model is now operating on unfamiliar data, but it cannot, on its own, confirm the model has got worse. Confirming that still needs ground truth, whenever it finally comes.
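When ground truth does arrive, confirming decay is a join between what the model said and what happened. A minimal sketch, assuming predictions were stored at serving time under a request identifier and that outcomes later arrive under the same key; the dictionaries and their contents are invented for illustration.

```python
# Hypothetical stores: predictions keyed by request id, and outcomes that
# arrive later for only some of those requests.
predictions = {
    "r1": True, "r2": False, "r3": True, "r4": False, "r5": True,
}
late_labels = {"r1": True, "r2": False, "r3": False}  # r4, r5 not known yet


def matured_accuracy(predictions: dict[str, bool],
                     labels: dict[str, bool]) -> tuple[float, float]:
    """Return (accuracy on labelled requests, share of requests labelled)."""
    matched = [rid for rid in predictions if rid in labels]
    if not matched:
        raise ValueError("no labels have arrived yet")
    correct = sum(predictions[rid] == labels[rid] for rid in matched)
    return correct / len(matched), len(matched) / len(predictions)


accuracy, coverage = matured_accuracy(predictions, late_labels)
print(f"accuracy={accuracy:.2f} on {coverage:.0%} of predictions")
```

Reporting coverage next to accuracy matters because the requests whose labels arrive first are not necessarily representative of the rest.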

16.4 The prediction log

That drift check needs live data to compare against, and the data has to come from somewhere. It comes from a prediction log: one record written at serving time for every request the model answers. Get its contents right and most of the questions you’ll be asked months later are answerable by query; get them wrong and you’ll be reconstructing evidence you never kept.

A useful record has five parts. A request identifier, so a complaint about one prediction can be traced to one row. The feature values the model actually scored — not the raw request body but the post-transformation values, because those are what the model saw and what drift is measured on. The output, both the decision and the score behind it, since a batch of predictions sitting at 0.49 tells a very different story from the same decisions made at 0.02. The model version, so that after a redeployment you can still tell which model produced which prediction. And a timestamp. With those five, “the model gave a strange answer last Tuesday” stops being an archaeology project: you find the row, and you replay those exact features through the model running now to see whether its behaviour has changed.
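The five parts translate fairly directly into a record type. The sketch below is one possible shape, not a prescribed schema: the class name, field names, feature names, the 0.5 decision threshold, and the version string are all hypothetical.

```python
import json
import uuid
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class PredictionRecord:
    """One row of the prediction log; every field name here is hypothetical."""
    request_id: str
    features: dict[str, float]   # post-transformation values the model scored
    score: float                 # the probability behind the decision
    decision: bool
    model_version: str
    timestamp: str               # ISO 8601, UTC


def make_record(features: dict[str, float], score: float,
                model_version: str, threshold: float = 0.5) -> PredictionRecord:
    """Build the record at serving time, alongside the response."""
    return PredictionRecord(
        request_id=str(uuid.uuid4()),
        features=features,
        score=score,
        decision=score >= threshold,
        model_version=model_version,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )


record = make_record({"tenure_months": 14.0, "support_tickets": 3.0},
                     score=0.49, model_version="churn-2024-03")
log_line = json.dumps(asdict(record))   # one JSON object per line
print(log_line)
```

Writing one JSON object per line keeps the log appendable and easy to query later; where those lines are stored is a separate decision that this sketch leaves open.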

The constraint on all this is that a prediction log about people is personal data. For a churn model it is a per-customer file of behavioural features with an identifier attached, accumulating indefinitely, and often living outside the storage that your organisation’s data governance actually watches. Writing it is processing of personal data like any other, and it needs the same treatment: log a pseudonymous key rather than an email address; prefer the engineered features over raw free text where the engineered version answers the same question; and, the step most often skipped, set a retention period and enforce it. The period should be long enough for labels to mature and for drift windows to be compared against each other, but not forever. The engineering instinct to log everything and decide later is the wrong instinct here, and it is much easier to argue for a narrow log at design time than to defend a wide one after it exists.
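Two of those measures are small enough to sketch: a pseudonymous key and an enforced retention period. The version below assumes a keyed hash (HMAC-SHA256) for the key and a 180-day window; the secret, the window length, and the row layout are placeholders, not recommendations.

```python
import hashlib
import hmac
from datetime import datetime, timedelta, timezone

# Assumption: the key lives in a secrets manager, never in the log store.
PSEUDONYM_KEY = b"replace-with-a-managed-secret"
RETENTION = timedelta(days=180)   # assumption: long enough for labels to mature


def pseudonymise(customer_id: str) -> str:
    """Keyed hash: stable per customer, not reversible without the key."""
    digest = hmac.new(PSEUDONYM_KEY, customer_id.encode(), hashlib.sha256)
    return digest.hexdigest()


def enforce_retention(rows: list[dict], now: datetime) -> list[dict]:
    """Keep only rows younger than the retention period."""
    cutoff = now - RETENTION
    return [row for row in rows if row["timestamp"] >= cutoff]


now = datetime(2024, 7, 1, tzinfo=timezone.utc)
rows = [
    {"customer": pseudonymise("ada@example.com"), "timestamp": now - timedelta(days=10)},
    {"customer": pseudonymise("ada@example.com"), "timestamp": now - timedelta(days=400)},
]
kept = enforce_retention(rows, now)
print(len(kept), kept[0]["customer"][:12])
```

A keyed hash is used instead of a plain one so that someone holding only the log cannot rebuild the mapping by hashing a list of known email addresses. The same customer still maps to the same key, which is what lets late labels be joined back to their predictions.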

16.5 Watching the predictions themselves

Input drift is measured feature by feature, which means a model with two hundred features hands you two hundred comparisons, two hundred thresholds, and the multiple-comparisons problem that comes with them. There is a cheaper signal, and for most teams it is the first one worth building: watch the distribution of the model’s own output.

Prediction drift has three things going for it. It is one-dimensional — a single series, whatever the width of the feature space, which makes it something you can put on a dashboard and a human can actually read. It needs no labels and no separately maintained reference dataset, only a rolling history of predictions you are already logging. And it captures interactions that per-feature checks miss: every input can sit comfortably inside its own reference range while their joint distribution moves somewhere the model has never been, and the output notices even though no single feature did. If the mean predicted churn probability drifts from 0.12 to 0.21 over a fortnight, something has changed — upstream pipeline, population, or the world — and you should go and look.

What it cannot do is tell you what moved, which is why it complements the per-feature checks rather than replacing them. And it carries the same statistical caveat as the input check earlier: judge the size of the shift, not merely whether a test calls it significant. Run a KS test over a week’s predictions and it will return an emphatically significant p-value for a change in mean of 0.003, because that is what significance testing does as N grows. So gate on the KS statistic here too, or on a shift in the mean you would be willing to defend to a stakeholder, and keep the p-value as a sanity check for small batches where it still earns its place. The MLOps pipeline chapter makes exactly this choice when it wires a drift signal into an automatic retraining trigger, for exactly this reason: a trigger built on significance alone fires constantly and teaches everyone to ignore it.
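A minimal sketch of that gate, using only the standard library so the KS statistic is visible rather than hidden behind a call. The thresholds, the function names, and the Beta-distributed scores standing in for churn probabilities are all assumptions chosen to echo the 0.12 to 0.21 example above.

```python
import random
from bisect import bisect_right
from statistics import fmean


def ks_statistic(a: list[float], b: list[float]) -> float:
    """Largest gap between the two empirical cumulative distributions."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


KS_THRESHOLD = 0.10     # assumption: same gate as the input check
MEAN_THRESHOLD = 0.05   # assumption: a shift in mean score worth defending


def prediction_drift(baseline: list[float], recent: list[float]) -> dict:
    """Gate on the size of the shift, never on significance."""
    ks = ks_statistic(baseline, recent)
    mean_shift = fmean(recent) - fmean(baseline)
    alert = ks > KS_THRESHOLD or abs(mean_shift) > MEAN_THRESHOLD
    return {"ks": ks, "mean_shift": mean_shift, "alert": alert}


random.seed(1)
# Simulated churn scores: Beta(2, 14) has mean 0.125, Beta(3, 11) about 0.214.
baseline = [random.betavariate(2, 14) for _ in range(5_000)]
same_week = [random.betavariate(2, 14) for _ in range(5_000)]
moved_week = [random.betavariate(3, 11) for _ in range(5_000)]

print(prediction_drift(baseline, same_week))
print(prediction_drift(baseline, moved_week))
```

Neither condition involves a p-value, so the check behaves the same at five thousand predictions as at five million. On small batches it would be worth adding the p-value back as the sanity check described above.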

16.6 Alerting on what matters

A metric nobody looks at is not monitoring. Observability becomes useful only when something tells you that a threshold has been crossed — error rate up, latency up, a drift statistic past its limit, the prediction distribution shifting. The hard part is the same one from the testing chapter: an alert that fires too often becomes the flaky test of operations, and a team that learns to ignore alerts will ignore the real one. Tune thresholds so an alert means action, and if you test many features for drift at once, remember the multiple-comparisons problem — check enough features and some will look “drifted” by chance.
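The arithmetic behind that warning is short enough to write out. The figures below are illustrative: two hundred features and a 1% per-feature false-alarm rate are assumed, and the "at least one" probability additionally assumes the checks are independent, which real features rarely are.

```python
N_FEATURES = 200
ALPHA = 0.01   # per-feature false-alarm rate when nothing has drifted

# Expected count needs no independence assumption (linearity of expectation).
expected_false_alarms = N_FEATURES * ALPHA

# Assumption: independent checks. Treat this as an order of magnitude.
p_at_least_one = 1 - (1 - ALPHA) ** N_FEATURES

# Bonferroni correction: tighten each test so the whole batch keeps ALPHA.
bonferroni_alpha = ALPHA / N_FEATURES

print(f"expected false alarms per run: {expected_false_alarms:.1f}")
print(f"P(at least one false alarm):   {p_at_least_one:.2f}")
print(f"Bonferroni per-feature alpha:  {bonferroni_alpha:.0e}")
```

Under these assumptions a run with no real drift still raises about two alarms on average, and the chance of a completely quiet run is only around 13%. That is why per-feature alerts need either a correction of this kind or an effect-size gate that chance alone rarely clears.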

The point of an alert is to start a loop: detect that something has shifted, investigate whether it matters, and if it does, retrain or roll back. Closing that loop automatically — drift or decay triggering a retrain — is the subject of the MLOps pipeline chapter; monitoring is what makes the loop possible by noticing in the first place.
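One way to keep the detect step from starting that loop on a single noisy batch is to require the threshold to be breached several readings in a row. This is a sketch of that one technique, not the approach of any particular alerting tool; the class name, the `patience` parameter, and the daily values are hypothetical.

```python
from collections import deque


class ConsecutiveBreachAlert:
    """Fire only after `patience` consecutive readings above `threshold`."""

    def __init__(self, threshold: float, patience: int) -> None:
        self.threshold = threshold
        self.recent = deque(maxlen=patience)

    def observe(self, value: float) -> bool:
        self.recent.append(value > self.threshold)
        # True only once the window is full and every reading breached.
        return len(self.recent) == self.recent.maxlen and all(self.recent)


# Made-up daily KS statistics: one noisy spike, then a sustained shift.
daily_ks = [0.03, 0.14, 0.04, 0.05, 0.12, 0.13, 0.15, 0.16]
alert = ConsecutiveBreachAlert(threshold=0.10, patience=3)
fired = [alert.observe(ks) for ks in daily_ks]
print(fired)
```

The isolated spike on the second day is ignored, and the alert fires only on the third consecutive day of the sustained shift. The cost is delay: a real shift is reported `patience` readings late, so the setting trades speed of detection against noise.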

Author’s Note

A notebook runs once, and you read its result immediately. A deployed model runs unattended and indefinitely, and its most dangerous failure is precisely the one you’ll never see by looking: not a crash, which announces itself, but a slow slide into wrongness while every operational dashboard stays reassuringly green. Data scientists are trained to evaluate a model at a point in time — fit, validate, report the number. Production demands evaluating it over time, which is a genuinely different discipline, and the instinct to consider the job finished at deployment is exactly the instinct that lets a model decay in silence.

The reframe is that shipping the model is the start of its working life, not the end of yours. The holdout score you were so pleased with was a claim about one moment; monitoring is how you keep knowing whether that claim is still true. The companion volume covers the statistics of detecting that it has stopped being true; the engineering responsibility is humbler and just as essential — to make sure that something, somewhere, is actually looking, and will say so when the answer changes.

16.7 Summary

A deployed model needs watching in two layers, the system and the model. The chapter comes down to four points:

1. Observe the system. Structured logs, metrics (rate, latency, errors), and a health check tell you whether the service is up, fast, and error-free: necessary, but not sufficient, for a model.
2. Watch for drift. A model can be perfectly healthy and increasingly wrong; compare live feature distributions to the training reference to catch input drift, and distinguish it from concept drift in the input–outcome relationship.
3. You usually can’t measure live accuracy directly. Labels lag or never arrive, so input and prediction drift are proxies that warn of trouble without confirming decay; ground truth, when it comes, is what confirms it.
4. Alert on what matters, and close the loop. Tune alerts so they mean action rather than noise, mind multiple comparisons, and let detection trigger investigation and retraining.

This closes Part 4. With the model built, structured, configured, deployed, and watched, Part 5 turns to the people around it — beginning with code review.

16.8 Exercises

1. Add structured logging and a /health endpoint to a service, and log each prediction together with its inputs. Then imagine a user complains that “the model gave a strange answer last Tuesday”. What exactly would you need to have logged to investigate, and is your logging capturing it?
2. Implement a data-drift check: store a reference sample from training and compare a live batch to it with a two-sample test (KS) or a population stability index, raising an alert when the statistic crosses a threshold. On a project you know, which feature do you think would drift first, and why?
3. Define an alert that would actually be useful: choose a condition (error rate, latency, or drift), set a threshold, and describe how you’d keep it from becoming noise that the team learns to ignore.
4. Conceptual: A churn model’s labels arrive ninety days late, and you have time to build only one thing this quarter: a nightly input-drift check across every feature, or a quarterly accuracy report run against the labels once they land. Choose one and justify it in terms of what each would tell you and when. Then name a specific failure the option you rejected would have caught that yours will not.
5. Conceptual: Distinguish data drift from concept drift, with a concrete example of each, and explain why detecting input drift without labels can warn you that something is wrong but cannot confirm that the model has actually become less accurate.

Content: CC BY-NC-SA 4.0 | Code: MIT - see /LICENSE.md