12 API design

12.1 “Now serve the model”

The model works. It scores well, the notebook runs, and then comes the sentence that ends the comfortable phase of every data science project: now we need it to serve predictions to the website. Another system, one you didn’t write and may never see, needs an answer from your model, on demand, hundreds of times a minute. You cannot email it a notebook. The model has to become a service: something other software can send a question to and get an answer back from.

An API (application programming interface) is how software asks other software for something. A web API does it over HTTP: a client sends a request, your code runs, and a response goes back. Turning a model into an API is the concrete form of the “production gap” from Chapter 1 — the step where a thing that worked on your laptop becomes a thing the rest of the system can rely on. It is more approachable than the silence after “now serve the model” suggests, and the tooling has made the common case genuinely simple.

12.2 An endpoint is a function over HTTP

Strip away the web vocabulary and a web API is a function call across a network. The client names an operation by its HTTP method and path (POST /predict), sends arguments as a JSON body, and receives a return value as a JSON response with a status code (200 for success, 4xx for a bad request, 5xx for a server error). A /predict endpoint is your model.predict() with an HTTP wrapper around it.

FastAPI makes the wrapper almost disappear: you write a normal Python function, decorate it with the method and path, and it handles the HTTP machinery. We’ll see the whole thing run in a moment; the shape is just a decorated function that takes features and returns a prediction.

Data Science Bridge

An endpoint is model.predict() exposed over HTTP. You already call predict with a feature vector and get a result back; an API is the same call, except the caller is on another machine and the conversation happens in JSON. The request schema is the contract for the feature vector going in; the response schema is the contract for the prediction coming out. Seen this way, serving a model isn’t a new skill so much as a wrapper around one you use daily.

Where the analogy breaks down: when you call predict in a notebook, you trust the input completely, because you built the array yourself a cell earlier. An endpoint receives its input from strangers — other teams, other systems, the open internet — who will send nulls, strings where you expect numbers, missing fields, and values far outside any range you trained on. So an endpoint has to do things a notebook predict never does: validate its input, handle errors gracefully, and stay stable as the model behind it changes. Those concerns, not the prediction itself, are most of what API design is about.

12.3 The contract: request and response schemas

The heart of a well-designed API is its contract — an explicit statement of what a valid request looks like and what the response will contain. With FastAPI you declare the contract as pydantic models (the same typed-and-validated objects from the previous chapter), and the framework enforces it automatically: a request that doesn’t match is rejected with a clear 422 before your code runs, and the contract is published as interactive documentation without your writing any.

Because FastAPI’s TestClient can call an app in-process, we can define a small service and exercise it here, with no server to start:

import numpy as np
from fastapi import FastAPI
from fastapi.testclient import TestClient
from pydantic import BaseModel, Field
from sklearn.linear_model import LogisticRegression

# A tiny model, fit once at startup — it stands in for a loaded artefact.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
model = LogisticRegression().fit(X, y)

class CustomerFeatures(BaseModel):       # the request contract
    recency: float = Field(ge=0)
    frequency: float = Field(ge=0)

class Prediction(BaseModel):             # the response contract
    churn_probability: float

app = FastAPI()

@app.post("/predict", response_model=Prediction)
def predict(features: CustomerFeatures) -> Prediction:
    proba = model.predict_proba([[features.recency, features.frequency]])[0, 1]
    return Prediction(churn_probability=round(float(proba), 3))

client = TestClient(app)

ok = client.post("/predict", json={"recency": 1.2, "frequency": 0.5})
print(f"valid request   -> {ok.status_code}  {ok.json()}")

bad = client.post("/predict", json={"recency": "yesterday", "frequency": 0.5})
print(f"invalid request -> {bad.status_code}  ({bad.json()['detail'][0]['msg']})")

valid request   -> 200  {'churn_probability': 1.0}
invalid request -> 422  (Input should be a valid number, unable to parse string as a number)

The valid request returns 200 and a typed prediction; the malformed one — a string where a number belongs — is rejected with 422 and a message naming the problem, and our predict function never even runs. We wrote no validation logic and no error handling: declaring the schema was enough. That same schema also generates an interactive documentation page (served at /docs) that callers can read and try out, kept in sync with the code automatically because it is the code.
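The claim that the schema is the contract can be checked directly. The sketch below assumes pydantic v2 and restates CustomerFeatures on its own, without FastAPI. It asks the model for its JSON Schema, which is the description the documentation page is built from, and then confirms that the range constraint and the required fields are enforced with no extra code. The helper error_types is ours, introduced only for illustration.

```python
from pydantic import BaseModel, Field, ValidationError


class CustomerFeatures(BaseModel):       # same request contract as above
    recency: float = Field(ge=0)
    frequency: float = Field(ge=0)


# The contract as data: the documentation page is generated from this schema.
schema = CustomerFeatures.model_json_schema()
print(schema["required"])                 # names of the mandatory fields
print(schema["properties"]["recency"])    # type and lower bound for recency


def error_types(payload: dict) -> list:
    """Return the pydantic error types raised by a payload ([] if valid)."""
    try:
        CustomerFeatures.model_validate(payload)
    except ValidationError as exc:
        return [err["type"] for err in exc.errors()]
    return []


print(error_types({"recency": 1.2, "frequency": 0.5}))   # valid payload
print(error_types({"recency": -3, "frequency": 0.5}))    # violates ge=0
print(error_types({"recency": 1.2}))                     # frequency is missing
```

Each of the failing payloads would reach a caller as a 422, because FastAPI runs this same validation before the endpoint function is called.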

12.4 When the model fails, not the request

Schema validation handles the caller’s mistakes. It says nothing about yours, and a model service has two failure modes that a well-formed request can still walk straight into.

The first is the model that won’t load. The artefact is missing from the image, the pickle was written by a different scikit-learn version, the file is there but truncated. The instinct is to catch this and carry on, so the service at least starts; the better move is the opposite. Load the model once, at startup, outside any request handler, and let a failure there stop the process. A container that refuses to start is a loud, obvious deployment failure that a platform will notice and roll back (Chapter 15); a container that starts happily and returns an error to every caller is a quiet one that looks healthy from the outside. Loading at startup also keeps the first user from paying the deserialisation cost, which on a large model is the difference between a fast endpoint and a mysteriously slow one.
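A minimal sketch of the fail-fast load, assuming the artefact is a pickle file. The function name load_model and the path model.pkl are hypothetical, and a dictionary stands in for a fitted model so the example runs with the standard library alone. The point is what the function does not do: it has no try/except, so a missing or damaged file raises at startup.

```python
import pickle
import tempfile
from pathlib import Path


def load_model(path: Path):
    """Load a pickled artefact; any failure propagates to the caller."""
    with path.open("rb") as fh:
        return pickle.load(fh)


# In a real service this call would sit at module level, next to the app object:
#     model = load_model(Path("model.pkl"))   # hypothetical path
# A missing or corrupt file then raises during import, and the process exits
# before it can accept a single request.

# Demonstration with a stand-in artefact written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "model.pkl"
    good.write_bytes(pickle.dumps({"coef": [1.0, 0.5]}))
    model = load_model(good)
    print("loaded:", model)

    truncated = Path(tmp) / "truncated.pkl"
    truncated.write_bytes(good.read_bytes()[:10])   # simulate a cut-off file
    try:
        load_model(truncated)
    except Exception as exc:
        # Caught here only to show the error; a service would let it propagate.
        print("refused to load:", type(exc).__name__)
```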

The second is the model that loads but cannot answer this request. The values are the right types and pass every range check, and predict still raises — an unseen category the encoder was never fitted on, a feature combination that produces a division by zero, a NaN arriving through a field where null was legal. What matters here is the distinction the status code draws: 4xx says the caller sent something they can fix, 5xx says the fault is yours and retrying the same request is reasonable. An unknown category is the caller’s to fix, so it deserves a 400 naming the offending field and value; an unexpected exception in your feature code is yours, and should be a 500 with a generic message and the real traceback in the logs, not in the response. The one option not on the table is returning 200 with a null or a default probability. A caller has no way to tell that apart from a real prediction, and a silent failure becomes someone else’s mysterious data quality problem weeks later.

Latency belongs in the contract too, even though it never appears in the schema. Every caller sets a timeout, and if you don’t state a target they will pick one and be disappointed by it. A synchronous /predict should be fast enough to sit inside a web page’s own budget — tens of milliseconds for a linear model, and the number worth watching is not the average but the tail, since a p99 of two seconds means one request in a hundred is a spinning cursor. When the work genuinely cannot fit — a large ensemble, or features that have to be fetched from a database first — the answer is to change the shape of the interaction rather than to hope: accept the request, return 202 with a job identifier, and let the caller collect the result later.

That same question of shape decides how the endpoint handles volume. A /predict that takes one customer and a /predict that takes a list of them are different contracts, and moving from the first to the second later breaks every existing caller. Decide early, and if there is any prospect of scoring in bulk, make the request a list from the start — a batch of one is a perfectly good single prediction, and the round trip you save on a thousand rows is substantial. A batch endpoint then owes its callers one more decision: if row 400 of 1,000 is unprocessable, returning a 400 for the whole batch throws away 999 good predictions. Return 200 with a result per item, each carrying either a prediction or an error, and let the caller decide what to do about the failures.

12.5 Designing for callers you’ll never meet

A production endpoint needs a few things a toy one doesn’t, all flowing from the fact that its callers are unknown. Input validation rejects garbage at the door with a helpful message rather than letting it reach the model and produce nonsense. Error handling ensures a failure returns a meaningful status and message, not a stack trace that leaks your internals. Versioning — serving the endpoint at /v1/predict — lets you change or retrain the model behind a new version without breaking the callers depending on the old one. And the auto-generated docs serve as the contract those callers read instead of asking you.

Running the service in production is a matter of pointing a server at the app with uvicorn main:app. uvicorn is the program that actually listens on a port and hands incoming HTTP requests to your FastAPI code: FastAPI defines what the endpoints do, and uvicorn does the serving. The next part of the book picks up from there, packaging the service into a container and deploying it. The design work, though, is done here: a clear contract, validated inputs, sensible errors, and a version.

Author’s Note

A notebook has exactly one user — you — and that single fact explains why serving a model feels so unexpectedly involved. You know what every input means, you trust the values because you made them, and when something breaks you see the error yourself and fix it on the spot. An API inverts all three. Its callers are strangers who will send the inputs you never thought to guard against; they can’t be trusted, because they don’t know your assumptions; and when something fails, they experience a cryptic error with none of the context you’d have in front of you.

The shift, then, is from trusting your input to defending against it, and from a result you read yourself to a contract other systems build on. That reframing is uncomfortable because it feels like a lot of ceremony around a one-line predict call — but the ceremony is the product. The model is the easy part, already built; the contract around it — the schema that rejects bad input, the version that protects existing callers, the docs that let others integrate without you — is the thing that turns a model into a service other people can actually depend on.

12.6 Summary

An API turns a model into a service other software can rely on:

1. An endpoint is a function over HTTP. A POST /predict is model.predict() wrapped so that another machine can call it in JSON; FastAPI makes the wrapper a decorated function.
2. The contract is the design. Declare request and response schemas as pydantic models; FastAPI validates input automatically, returns a clear 422 on bad requests, and generates interactive docs from the schema.
3. Design for unknown callers. Validate input, handle errors without leaking internals, and version the endpoint so you can change the model without breaking the systems that depend on it.
4. The contract is the deliverable, not the model. The prediction is the easy part; the schema, versioning, and docs around it are what make the model dependable as a service.

This completes Part 3. Part 4 takes the service the rest of the way to production — running its tests automatically, packaging it, deploying it, and watching it — beginning with continuous integration.

12.7 Exercises

1. Wrap a model (one of your own, or a trivial one) in a FastAPI /predict endpoint with a pydantic request schema. Run it locally with uvicorn and call it, either with curl or through the interactive /docs page. What did you have to decide about the request format that a notebook predict let you ignore?
2. Add validation to the endpoint by constraining the request fields (required fields, numeric ranges), then send a malformed request and confirm it returns a clear 422 rather than a 500 or a confidently wrong answer.
3. Add a response schema and open the auto-generated documentation at /docs. Change a field in the code and reload the page. How does the documentation stay in sync with the implementation, and why does that matter for the people calling your API?
4. Conceptual: You retrain the churn model and it now needs an extra feature, tenure_months. In a notebook you would pass the extra column and carry on. Choose between adding tenure_months as a required field on the existing /v1/predict and publishing a /v2/predict alongside it, then justify your choice from the position of a team that integrated against your schema last month and hasn’t spoken to you since.
5. Conceptual: Not every model needs a real-time API. Describe a situation where a scheduled batch job scoring a file is the right delivery mechanism, and one where a real-time API is genuinely necessary. What property of the use case decides between them?

Content: CC BY-NC-SA 4.0 | Code: MIT - see /LICENSE.md