---
# Content: CC BY-NC-SA 4.0 | Code: MIT - see /LICENSE.md
title: "Monitoring and observability"
---
## The model that quietly got worse {#sec-quietly-worse}
The model you deployed six months ago is still running. The service is green, the logs show no errors, requests come in and predictions go out. And it has been getting steadily worse for months — its accuracy sliding as the world it was trained on drifted away from the world it now sees — and nobody noticed, because nothing was watching for *that*. A deployed model has a failure mode that an ordinary web service doesn't: it can be perfectly healthy and increasingly wrong at the same time.
This is the gap monitoring closes. Everything in the previous chapter got the model running where others depend on it; this chapter is about keeping an eye on it once it's there — both the system (is it up, fast, error-free?) and, the part data scientists uniquely have to care about, the model (are its predictions still any good?).
## Observability: logs, metrics, health {#sec-observability}
Start with the operational layer, which a model shares with any service. Three things make a running system observable. *Structured logging* (Chapter 8) records what happened in a form you can search and reconstruct after the fact — essential when the only evidence of a problem is a complaint about "last Tuesday". *Metrics* — request rate, latency, error rate — track the system's health over time on a dashboard. And a *health-check* endpoint lets the platform ask "are you alive?" and restart or reroute if not:
```python
# `app` is the service's web application object, assumed to be defined elsewhere
@app.get("/health")
def health() -> dict:
    return {"status": "ok"}  # the platform polls this; a non-200 triggers a restart
```
Together these answer "is the *system* healthy?" — is it up, responding, and fast. For most services that's the whole job. For a model, it's only half, because all three can be green while the predictions quietly rot.
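To make the structured-logging idea concrete, here is a minimal sketch using only the standard library. It writes one JSON object per prediction so that a complaint about "last Tuesday" can be traced to a specific request. The logger name, the field names (`request_id`, `model_version`, `latency_ms`), and the example inputs are all illustrative assumptions, not a prescribed schema.

```python
import io
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": record.created,       # Unix timestamp set by the logging module
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Anything passed via extra={"fields": {...}} becomes an attribute on the record.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("prediction-service")  # hypothetical logger name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def log_prediction(logger, features: dict, prediction: float,
                   model_version: str, latency_ms: float) -> str:
    """Log one prediction with enough context to reconstruct it later."""
    request_id = str(uuid.uuid4())
    logger.info("prediction", extra={"fields": {
        "request_id": request_id,
        "model_version": model_version,
        "features": features,
        "prediction": prediction,
        "latency_ms": latency_ms,
    }})
    return request_id


buffer = io.StringIO()  # stands in for stdout or a log file
logger = build_logger(buffer)
request_id = log_prediction(
    logger,
    features={"tenure_months": 14, "plan": "basic"},  # made-up inputs
    prediction=0.73,
    model_version="2024-01-example",                  # made-up version tag
    latency_ms=12.5,
)
record = json.loads(buffer.getvalue().splitlines()[-1])
print(record["event"], record["model_version"], record["prediction"])
```

Logging the inputs and the model version next to the prediction is the design choice that matters here: without them, a strange answer can be observed but not reproduced.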
## The model-specific problem: drift {#sec-drift}
The failure that generic monitoring misses is **drift**: the live data moving away from the data the model was trained on. It comes in two forms. *Data drift* is a change in the input distribution — the feature values coming in look different from training (a new customer segment, a changed upstream pipeline, inflation moving every price). *Concept drift* is a change in the relationship itself — the same inputs now map to different outcomes (fraud tactics evolve, so the patterns that meant "safe" last year don't now).
You can detect *input* drift without waiting for outcomes, by comparing this week's feature distribution to the training reference with a two-sample test:
```{python}
#| label: drift-detection
#| echo: true
import numpy as np
from scipy import stats
rng = np.random.default_rng(42)
reference = rng.normal(50, 10, 5_000) # a feature's distribution at training time
live_stable = rng.normal(50, 10, 5_000) # this week: same process
live_drifted = rng.normal(56, 12, 5_000) # this week: shifted and more variable
for name, live in [("stable", live_stable), ("drifted", live_drifted)]:
    ks_stat, p_value = stats.ks_2samp(reference, live)
    flag = "ALERT: drift" if p_value < 0.01 else "ok"
    print(f"{name:8} KS={ks_stat:.3f} p={p_value:.1e} -> {flag}")
```
The stable batch matches the reference (a high p-value, no alarm); the drifted batch is flagged, because the distribution has moved enough that the test rejects "same distribution". This Kolmogorov–Smirnov test suits continuous features; for categorical ones you'd use a population stability index or a chi-squared test. One caution: with thousands of rows per batch the test has enough power to flag shifts too small to matter, so read the KS statistic (the size of the shift) alongside the p-value. The companion volume, *Thinking in Uncertainty*, treats the statistics of these comparisons in depth; here the point is that the check is cheap and can run on every batch.
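For the categorical case, a population stability index can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the feature (`plans`), the category shares, and the smoothing constant `eps` (which keeps the logarithm finite when a category is missing from one sample) are assumptions made for the example.

```python
from collections import Counter

import numpy as np


def category_shares(values, categories, eps: float) -> np.ndarray:
    """Share of each category, floored at eps so log() stays finite."""
    counts = Counter(values)
    shares = np.array([counts.get(c, 0) for c in categories], dtype=float)
    shares = shares / shares.sum()
    shares = np.clip(shares, eps, None)  # assumption: tiny floor for unseen categories
    return shares / shares.sum()


def psi(reference, live, eps: float = 1e-6) -> float:
    """Population stability index: sum of (live - ref) * ln(live / ref) over categories."""
    categories = sorted(set(reference) | set(live))
    p_ref = category_shares(reference, categories, eps)
    p_live = category_shares(live, categories, eps)
    return float(np.sum((p_live - p_ref) * np.log(p_live / p_ref)))


rng = np.random.default_rng(0)
plans = ["basic", "plus", "pro"]  # hypothetical categorical feature
reference = rng.choice(plans, size=5_000, p=[0.60, 0.30, 0.10]).tolist()
live_stable = rng.choice(plans, size=5_000, p=[0.60, 0.30, 0.10]).tolist()
live_drifted = rng.choice(plans, size=5_000, p=[0.40, 0.35, 0.25]).tolist()

psi_stable = psi(reference, live_stable)
psi_drifted = psi(reference, live_drifted)
print(f"stable  PSI={psi_stable:.4f}")
print(f"drifted PSI={psi_drifted:.4f}")
```

PSI is zero when the two sets of shares are identical and grows as they diverge. The cut-offs often quoted for it (around 0.1 to start watching, around 0.25 to act) are conventions rather than statistical guarantees, so treat any threshold as something to tune against your own data.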
::: {.callout-note}
## Data Science Bridge
Monitoring a model is validation that never stops. You validated the model once, on a holdout, before shipping, but that holdout only told you the model was good on data that resembled the past. Its score rests on the assumption that the future keeps resembling the past, which is exactly what drift violates; monitoring continues the validation indefinitely against live data to check whether that assumption still holds. The drift check itself is a tool you already own: a two-sample test or a stability index comparing the training sample to this week's, exactly the distribution comparison you'd run to check whether two datasets came from the same place.
Where it breaks down: holdout validation has *labels*, so you measure accuracy directly. In production the labels usually lag — you find out who actually churned months later — or never arrive at all. So you fall back to watching the *inputs* and the *predictions* as a proxy: input drift can warn you that the model is now operating on unfamiliar data, but it cannot, on its own, confirm the model has got worse. Confirming that still needs ground truth, whenever it finally comes.
:::
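When ground truth does arrive, confirming decay means joining it back to the predictions you logged. The sketch below assumes each prediction was stored with a request identifier and a date, and that labels trickle in later keyed by the same identifier; the records and field layout are invented for illustration. It reports accuracy per ISO week over labelled predictions only, alongside the share of predictions that have a label so far.

```python
from collections import defaultdict
from datetime import date

# Hypothetical prediction log: (request_id, date served, predicted class).
predictions = [
    ("r1", date(2024, 1, 2), 1),
    ("r2", date(2024, 1, 3), 0),
    ("r3", date(2024, 1, 9), 1),
    ("r4", date(2024, 1, 10), 1),
    ("r5", date(2024, 1, 11), 0),
]
# Ground truth that has arrived so far; r5 has no outcome yet.
labels = {"r1": 1, "r2": 0, "r3": 0, "r4": 1}


def weekly_accuracy(predictions, labels) -> dict:
    """Accuracy per ISO week over labelled predictions, plus label coverage."""
    totals = defaultdict(lambda: {"served": 0, "labelled": 0, "correct": 0})
    for request_id, served_on, predicted in predictions:
        iso = served_on.isocalendar()
        bucket = totals[(iso[0], iso[1])]  # (ISO year, ISO week)
        bucket["served"] += 1
        if request_id in labels:
            bucket["labelled"] += 1
            bucket["correct"] += int(labels[request_id] == predicted)

    report = {}
    for week, b in sorted(totals.items()):
        # Accuracy is undefined until at least one label has arrived for the week.
        accuracy = b["correct"] / b["labelled"] if b["labelled"] else None
        report[week] = {
            "accuracy": accuracy,
            "coverage": b["labelled"] / b["served"],
        }
    return report


report = weekly_accuracy(predictions, labels)
for week, row in report.items():
    print(week, row)
```

Reporting coverage next to accuracy matters because a recent week's score is computed on whichever labels happened to arrive first, which may not be representative of the rest.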
## Alerting on what matters {#sec-alerting}
A metric nobody looks at is not monitoring. Observability becomes useful only when something *tells* you that a threshold has been crossed — error rate up, latency up, a drift statistic past its limit, the prediction distribution shifting. The hard part is the same one from the testing chapter: an alert that fires too often becomes the flaky test of operations, and a team that learns to ignore alerts will ignore the real one. Tune thresholds so an alert means action, and if you test many features for drift at once, remember the multiple-comparisons problem — check enough features and some will look "drifted" by chance.
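The multiple-comparisons point can be seen in a small simulation. The sketch below tests 50 synthetic features, only one of which has actually shifted, and compares the uncorrected threshold with a Bonferroni-corrected one (the per-test threshold divided by the number of tests). Bonferroni is used here as one simple, conservative correction chosen for illustration; the feature count, sample size, and shift are likewise assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_features, n_rows, alpha = 50, 2_000, 0.01

# Every feature comes from the same process in both samples, except feature 0.
reference = rng.normal(0, 1, size=(n_rows, n_features))
live = rng.normal(0, 1, size=(n_rows, n_features))
live[:, 0] += 0.5  # a deliberate shift in one feature

p_values = np.array([
    stats.ks_2samp(reference[:, j], live[:, j]).pvalue
    for j in range(n_features)
])

# Uncorrected: each feature is tested at alpha, so false alarms accumulate.
naive_flags = np.flatnonzero(p_values < alpha)
# Bonferroni: divide alpha by the number of tests to cap the chance of any false alarm.
bonferroni_flags = np.flatnonzero(p_values < alpha / n_features)

print("uncorrected:", naive_flags.tolist())
print("Bonferroni :", bonferroni_flags.tolist())
```

With 49 unshifted features each tested at 0.01, the uncorrected check raises about half a false alarm per run on average, which adds up quickly when it runs on every batch. The correction trades some sensitivity to small shifts for alerts that are more likely to mean something.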
The point of an alert is to start a loop: detect that something has shifted, investigate whether it matters, and if it does, retrain or roll back. Closing that loop automatically — drift or decay triggering a retrain — is the subject of the *MLOps pipeline* chapter; monitoring is what makes the loop possible by noticing in the first place.
::: {.callout-tip}
## Author's Note
A notebook runs once, and you read its result immediately. A deployed model runs unattended and indefinitely, and its most dangerous failure is precisely the one you'll never see by looking: not a crash, which announces itself, but a slow slide into wrongness while every operational dashboard stays reassuringly green. Data scientists are trained to evaluate a model *at a point in time* — fit, validate, report the number. Production demands evaluating it *over* time, which is a genuinely different discipline, and the instinct to consider the job finished at deployment is exactly the instinct that lets a model decay in silence.
The reframe is that shipping the model is the *start* of its working life, not the end of yours. The holdout score you were so pleased with was a claim about one moment; monitoring is how you keep knowing whether that claim is still true. The companion volume covers the statistics of detecting that it has stopped being true; the engineering responsibility is humbler and just as essential — to make sure that something, somewhere, is actually looking, and will say so when the answer changes.
:::
## Summary {#sec-monitoring-summary}
A deployed model needs watching in two layers, the system and the model. Four points to carry forward:
1. **Observe the system.** Structured logs, metrics (rate, latency, errors), and a health check tell you whether the service is up, fast, and error-free — necessary, but not sufficient, for a model.
2. **Watch for drift.** A model can be perfectly healthy and increasingly wrong; compare live feature distributions to the training reference to catch input drift, and distinguish it from concept drift in the input–outcome relationship.
3. **You usually can't measure live accuracy directly.** Labels lag or never arrive, so input and prediction drift are proxies that warn of trouble without confirming decay — ground truth, when it comes, is what confirms it.
4. **Alert on what matters, and close the loop.** Tune alerts so they mean action rather than noise, mind multiple comparisons, and let detection trigger investigation and retraining.
This closes Part 4. With the model built, structured, configured, deployed, and watched, Part 5 turns to the people around it — beginning with *code review*.
## Exercises {#sec-monitoring-exercises}
1. Add structured logging and a `/health` endpoint to a service, and log each prediction together with its inputs. Then imagine a user complains that "the model gave a strange answer last Tuesday" — what exactly would you need to have logged to investigate, and is your logging capturing it?
2. Implement a data-drift check: store a reference sample from training and compare a live batch to it with a two-sample test (KS) or a population stability index, raising an alert when the statistic crosses a threshold. On a project you know, which feature do you think would drift first, and why?
3. Define an alert that would actually be useful: choose a condition (error rate, latency, or drift), set a threshold, and describe how you'd keep it from becoming noise that the team learns to ignore.
4. **Conceptual:** The Data Science Bridge compares monitoring to ongoing validation. Give one way the analogy holds and one way it breaks down. Why can you usually not measure a deployed model's accuracy as directly as you measured its holdout accuracy?
5. **Conceptual:** Distinguish data drift from concept drift, with a concrete example of each, and explain why detecting input drift *without* labels can warn you that something is wrong but cannot confirm that the model has actually become less accurate.