15 Deployment

15.1 The model that has to keep running

By now the model is in good shape. It’s wrapped in an API (Chapter 12), packaged into a container (Chapter 14), and verified on every change by CI (Chapter 13). But all of that still runs when you run it. Deployment is the step where it runs without you: serving the website’s requests at three in the morning, or scoring yesterday’s data every day before anyone arrives, on a machine you’re not logged into. It is the act of putting your work somewhere it runs reliably, automatically, and where other people and systems depend on it.

This is the step that carries the most anxiety, and the least of it is technical. The word “deploy” tends to mark the boundary where a comfortable, self-contained piece of work becomes an operational responsibility — something that can be down, or slow, or wrong at an hour when you’re asleep. Most of that anxiety dissolves once you see deployment as the continuation of the practices you’ve already built, plus one new idea: a way to undo it.

15.2 Two shapes: batch and online

The first decision shapes everything else: is the model batch or online? A batch deployment runs on a schedule over a dataset and writes its results somewhere — score every customer’s churn risk each night and write the table the dashboard reads in the morning. An online deployment is a long-running service answering requests in real time — the API that the checkout page calls for a fraud score while the customer waits. This is the same distinction the Chapter 12 exercise drew between a scheduled job and a real-time API, now made an operational reality: a batch model is a scheduled job, an online model is a running service.

The two have different failure modes and different machinery. A batch job that fails can often just be re-run; an online service that fails takes the feature down with it. We’ll come back to scheduling batch work; the artefact at the centre of both, though, is the same.
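To make the two shapes concrete, here is a minimal sketch of one scoring function reached two ways. Everything in it is illustrative: `predict_proba` is a stand-in with made-up weights rather than a trained model, and `score_batch` and `handle_request` are hypothetical names for the batch entry point and the request handler.

```python
import math


def predict_proba(features: list[float]) -> float:
    """Stand-in for a trained model: a logistic score with made-up weights."""
    z = 1.0 * features[0] + 0.5 * features[1]
    return 1.0 / (1.0 + math.exp(-z))


def score_batch(rows: dict[str, list[float]]) -> dict[str, float]:
    """Batch shape: score a whole dataset in one run and return the results table."""
    return {customer_id: predict_proba(f) for customer_id, f in rows.items()}


def handle_request(payload: dict) -> dict:
    """Online shape: score a single request and answer straight away."""
    return {"score": predict_proba(payload["features"])}


# The scheduled job would call score_batch once per night ...
nightly = score_batch({"c1": [0.0, 0.0], "c2": [2.0, 0.0]})
# ... while the running service would call handle_request once per request.
response = handle_request({"features": [0.0, 0.0]})
print(nightly)
print(response)
```

The model code is shared; only the wrapper around it, and therefore the way it fails and recovers, differs between the two shapes.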

15.3 Packaging the model artefact

Whichever shape you choose, you don’t retrain the model on the production machine: you train it once, save the result as a portable artefact, and load that artefact wherever it runs. The artefact is the trained model serialised to a file (with joblib or pickle), versioned, and stored somewhere durable (a model registry or object store), then loaded at startup. The one property you must be able to trust is that the loaded model predicts identically to the one you saved. In practice that also means loading it with the same library versions it was saved with, which is one more reason to ship it inside the container from Chapter 14.

import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
model = LogisticRegression().fit(X, y)

# Save the trained artefact, then load it as the deployed service would.
artefact = Path(tempfile.mkdtemp()) / "model.joblib"
joblib.dump(model, artefact)
loaded = joblib.load(artefact)

# Deployment depends on this: the loaded model must predict identically.
new_data = rng.normal(size=(20, 3))
identical = np.array_equal(model.predict_proba(new_data),
                           loaded.predict_proba(new_data))
print(f"artefact size: {artefact.stat().st_size} bytes")
print(f"loaded model predicts identically: {identical}")

artefact size: 879 bytes
loaded model predicts identically: True

The round-trip is faithful — the deployed model is the model you trained, not an approximation of it — which is exactly the serialisation guarantee the testing chapter said to assert. This artefact is generated, not authored, so (per Chapter 6) it lives in a registry rather than in Git; the deployment’s job is to fetch the right version and load it.
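“Fetch the right version and load it” can be sketched with a directory standing in for the registry. This is an assumption-laden illustration, not a real registry client: `publish` and `fetch` are hypothetical helpers, the “model” is a plain dictionary so the example needs only the standard library, and a checksum in a manifest file is just one possible way to confirm that the bytes you load are the bytes you stored.

```python
import hashlib
import json
import pickle
import tempfile
from pathlib import Path


def publish(registry: Path, version: str, model: object) -> Path:
    """Write the serialised model plus a manifest recording its SHA-256."""
    target = registry / version
    target.mkdir(parents=True, exist_ok=False)  # refuse to overwrite a version
    payload = pickle.dumps(model)
    (target / "model.pkl").write_bytes(payload)
    manifest = {"version": version, "sha256": hashlib.sha256(payload).hexdigest()}
    (target / "manifest.json").write_text(json.dumps(manifest))
    return target


def fetch(registry: Path, version: str) -> object:
    """Load one specific version, checking the bytes against the manifest first."""
    target = registry / version
    manifest = json.loads((target / "manifest.json").read_text())
    payload = (target / "model.pkl").read_bytes()
    if hashlib.sha256(payload).hexdigest() != manifest["sha256"]:
        raise ValueError(f"checksum mismatch for version {version}")
    return pickle.loads(payload)


# A temporary directory plays the part of the registry (hypothetical layout).
registry = Path(tempfile.mkdtemp())
publish(registry, "1.0.0", {"coef": [1.0, 0.5, 0.0], "intercept": 0.0})
publish(registry, "1.1.0", {"coef": [0.9, 0.6, 0.1], "intercept": 0.1})

# The deployment names the version it wants; it never loads "whatever is newest".
deployed = fetch(registry, "1.0.0")
print(deployed)
```

Loading a pickle (or joblib) file runs code embedded in it, so only load artefacts from a store you control. Keeping old versions in place, rather than overwriting them, is also what makes switching back to one possible later.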

Data Science Bridge

Deployment is the moment your work crosses from “runs for me” to “others depend on it”, and a staging environment is a holdout set for that crossing. You would never judge a model on the data it trained on; you hold out data to see how it behaves on cases it didn’t see. Staging is the same instinct applied to the whole system: a production-like environment where the deployment runs against realistic conditions before any real user meets it, so that the surprises happen somewhere safe.

Where it breaks down: a holdout is a fixed sample you score once and read off a number. Staging is an environment you run continuously, and “passing” it is a judgement about operational behaviour — latency, error rate, resource use under load — not a single accuracy figure. The shared idea is “try it somewhere safe before it counts”; what you measure, and how you decide it passed, is different.
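One way to make “passing staging” less of a mood and more of a decision is to write the criteria down as code before the release is tested. The sketch below assumes two hypothetical ceilings (a 95th-percentile latency of 200 ms and an error rate of 1%) and a list of per-request latencies collected from a staging run; the names and thresholds are illustrative, not recommendations.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class StagingCriteria:
    """Hypothetical thresholds, agreed before the release reaches staging."""
    p95_latency_ms: float = 200.0
    max_error_rate: float = 0.01


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(95 * len(ordered) / 100)
    return ordered[rank - 1]


def passes_staging(latencies_ms: list[float], errors: int,
                   criteria: StagingCriteria = StagingCriteria()) -> bool:
    """Apply both ceilings to one staging run (one latency per request)."""
    error_rate = errors / len(latencies_ms)
    return (p95(latencies_ms) <= criteria.p95_latency_ms
            and error_rate <= criteria.max_error_rate)


healthy = [50.0] * 95 + [400.0] * 5   # a few slow requests, within the ceiling
slow = [50.0] * 90 + [400.0] * 10     # too many slow requests
print(passes_staging(healthy, errors=0))   # True
print(passes_staging(slow, errors=0))      # False: p95 latency is 400 ms
print(passes_staging(healthy, errors=2))   # False: error rate is 2%
```

Unlike a holdout score, the result is a gate over several operational measurements, and it only means something if the thresholds were fixed before you looked at the run.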

15.4 Promotion and rollback

Mature deployment moves one artefact through a sequence of environments — development, then staging, then production — promoting the same image with environment-specific configuration and secrets (Chapter 11), never rebuilding it along the way. Building once and promoting the identical artefact is what makes staging meaningful: if you rebuilt for production, you’d be testing something other than what you’ll run.
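“The same image with environment-specific configuration” usually comes down to the code reading its settings from the environment at startup. A minimal sketch, assuming three hypothetical variable names (`APP_ENV`, `MODEL_VERSION`, `DATABASE_URL`) and placeholder URLs; the Chapter 11 approach may use different names or a settings library.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    environment: str
    model_version: str
    database_url: str


REQUIRED = ("APP_ENV", "MODEL_VERSION", "DATABASE_URL")  # hypothetical names


def load_settings(env=os.environ) -> Settings:
    """Read settings from environment variables; fail at startup if any is missing."""
    missing = [key for key in REQUIRED if key not in env]
    if missing:
        raise RuntimeError(f"missing configuration: {', '.join(missing)}")
    return Settings(env["APP_ENV"], env["MODEL_VERSION"], env["DATABASE_URL"])


# The same code and the same artefact version, configured for two environments.
staging = load_settings({
    "APP_ENV": "staging",
    "MODEL_VERSION": "1.1.0",
    "DATABASE_URL": "postgresql://staging-db.example.com/scores",
})
production = load_settings({
    "APP_ENV": "production",
    "MODEL_VERSION": "1.1.0",
    "DATABASE_URL": "postgresql://prod-db.example.com/scores",
})
print(staging)
print(production)
```

Nothing in the image changes between the two calls; only the values handed to it do, which is what lets a pass in staging say something about production.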

The single most important safety practice is the one that most reduces the fear: a rollback. Keep the previous known-good version available so that, if a release misbehaves, you can switch back to it in one step rather than scrambling to fix forward under pressure. Gradual rollout strategies build on this — blue-green keeps two environments and flips traffic between them, and canary routes a small fraction of traffic to the new version, watches it, and ramps up only if it behaves. A continuous-deployment pipeline often automates the release on a tagged commit:

# .github/workflows/deploy.yml (sketch) — build and push the image on a version tag
on:
  push:
    tags: ["v*"]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # (registry login omitted from this sketch; the push below needs it)
      - run: docker build -t registry.example.com/customer-value:${{ github.ref_name }} .
      - run: docker push registry.example.com/customer-value:${{ github.ref_name }}
      # a deploy step then points the platform at the new tag
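The canary and rollback ideas described before the workflow can be sketched as a small piece of routing state. On a real platform this is normally done by the load balancer or deployment tool rather than by your own code; the `Release` class, its method names, and the version strings below are all hypothetical and exist only to show the logic.

```python
import hashlib


class Release:
    """Tracks stable, previous and canary versions so rollback is one step."""

    def __init__(self, stable: str):
        self.stable = stable
        self.previous = None
        self.canary = None
        self.canary_percent = 0

    def start_canary(self, version: str, percent: int) -> None:
        self.canary, self.canary_percent = version, percent

    def promote(self) -> None:
        """Make the canary the stable version, remembering the old one."""
        self.previous, self.stable = self.stable, self.canary
        self.canary, self.canary_percent = None, 0

    def rollback(self) -> None:
        if self.canary is not None:        # abort an in-progress canary
            self.canary, self.canary_percent = None, 0
        elif self.previous is not None:    # revert a completed promotion
            self.stable, self.previous = self.previous, None

    def route(self, request_id: str) -> str:
        """Send a stable fraction of requests to the canary, by hashed id."""
        bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
        if self.canary is not None and bucket < self.canary_percent:
            return self.canary
        return self.stable


release = Release("v1.0.0")
release.start_canary("v1.1.0", percent=10)
routed = [release.route(f"req-{i}") for i in range(1000)]
canary_share = routed.count("v1.1.0") / len(routed)   # roughly a tenth
print(f"canary share: {canary_share:.1%}")

release.promote()
after_promote = release.route("req-1")    # everything now goes to v1.1.0
release.rollback()                        # one step back to the known-good version
print(after_promote, "->", release.route("req-1"))
```

Hashing the request id keeps each caller on the same version for the whole canary, and `rollback` works only because the previous version was kept rather than discarded.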

15.5 Scheduling batch work

For a batch model, “deployment” largely means scheduling. The simplest form is cron, which runs a command on a timetable:

# Score yesterday's customers every morning at 06:00
0 6 * * *  /app/.venv/bin/python -m customer_value.score_batch

For anything with multiple dependent steps, retries, or its own monitoring, you reach for the orchestrators from Chapter 10 (Airflow, Prefect, Dagster) rather than a bare cron line. Either way, the scheduled job should exit non-zero on failure (Chapter 4) so the scheduler can detect the failure and alert, rather than silently producing no output.
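A sketch of what “exit non-zero on failure” can look like inside the job itself. The scoring step here is a deliberately trivial stand-in (it doubles each input number), and the file names are hypothetical; the point is the shape of `main`, which turns any exception into a non-zero exit code instead of letting the job finish quietly with nothing written.

```python
import sys
import tempfile
from pathlib import Path


def score_batch(input_path: Path, output_path: Path) -> int:
    """Stand-in batch step: read one number per line, write one score per line."""
    rows = [float(value) for value in input_path.read_text().split()]
    if not rows:
        raise ValueError(f"no input rows in {input_path}")
    output_path.write_text("\n".join(f"{2 * r:.1f}" for r in rows) + "\n")
    return len(rows)


def main(input_path: Path, output_path: Path) -> int:
    """Return a process exit code: 0 on success, 1 on any failure."""
    try:
        n = score_batch(input_path, output_path)
    except Exception as exc:  # report the failure and signal it to the scheduler
        print(f"batch scoring failed: {exc}", file=sys.stderr)
        return 1
    print(f"scored {n} rows")
    return 0


workdir = Path(tempfile.mkdtemp())
good = workdir / "good.txt"
good.write_text("1.0\n2.5\n")
empty = workdir / "empty.txt"
empty.write_text("")

ok_code = main(good, workdir / "scores.txt")            # succeeds
bad_code = main(empty, workdir / "scores_empty.txt")    # fails loudly
print(ok_code, bad_code)
# A real entry point would end with: sys.exit(main(input_path, output_path))
```

Treating empty input as a failure is a design choice made for this sketch: a job that scores zero rows and exits 0 looks identical, from the scheduler’s side, to a healthy one.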

Author’s Note

“Deploy” carries an outsized dread for data scientists, and it’s worth being honest that the dread is reasonable: the territory is unfamiliar, it feels like someone else’s domain, and the failure modes are scarier than a wrong number in a notebook because they happen in public and at bad hours. None of that is irrational.

But deployment is not the cliff it looks like from the notebook. It’s the same incremental discipline you’ve been assembling: a versioned artefact (Chapter 14), run from configuration (Chapter 11), gated by CI (Chapter 13) — put somewhere it runs on its own. And the single practice that does most to dissolve the fear is the rollback. Deployment stops being terrifying the moment you know you can undo it in one step, because the worst case is no longer “I’ve broken production and must fix it live while it bleeds” but “I switch back to the version that worked and debug at my leisure”. You don’t have to become a site-reliability engineer. You need a repeatable way to put a known-good artefact somewhere, and a known-fast way to take it back.

15.6 Summary

Deployment puts your work where it runs without you, reliably and reversibly:

1. Choose the shape: batch or online. A batch model is a scheduled job that writes results; an online model is a long-running service answering requests. The choice sets the machinery.
2. Deploy a packaged artefact, not a retrained model. Serialise the trained model, version it, store it, and load it at runtime; test that the round-trip is faithful so that you can trust it.
3. Promote one artefact and keep a way back. Move the same image through dev, staging, and production with environment-specific config; keep a rollback, and roll out gradually with blue-green or canary.
4. Batch deployment is scheduling. Use cron for the simple case and an orchestrator for dependencies and retries, and fail loudly so the scheduler can detect the failure and alert.

A deployed model that runs is not the same as a model that still works. The next chapter is about knowing the difference: monitoring and observability.

15.7 Exercises

1. Take the containerised service from Chapter 14 and deploy it somewhere it runs without you (a platform-as-a-service, a container host, or a long-running container on a server), then call it from a different machine. What did you have to provide the platform that you’d previously taken for granted on your laptop?
2. Set up a staging environment that runs the same artefact as production with different configuration, and decide, in advance, what “passing staging” means (a latency ceiling, an error-rate ceiling, a smoke test that must pass) before you promote.
3. Schedule a batch scoring job (with cron or an orchestrator) that reads input, writes predictions, and exits non-zero on failure. Confirm that a deliberately failed run is detectable rather than silently producing nothing.
4. Conceptual: The Data Science Bridge compares a staging environment to a holdout set. Give one way the analogy holds and one way it breaks down. What does “passing” staging mean that “passing” a holdout does not?
5. Conceptual: Describe one model best delivered as a scheduled batch job and one best delivered as an always-on service, and sketch what a rollback would look like for each.

Content: CC BY-NC-SA 4.0 | Code: MIT - see /LICENSE.md