11 Configuration and secrets

11.1 The password in the notebook

Two lines from a real notebook, both committed to a shared repository. The first: THRESHOLD = 0.73, a value tuned by hand one afternoon, sitting in the middle of a cell with no explanation. The second, a few cells down: conn = connect("postgresql://admin:hunter2@prod-db:5432/customers"), a production database password in plain text. Each is a different kind of mistake, and both are endemic to data science code.

The first is a configuration problem: a value that ought to be adjustable is baked into the logic, so changing it means editing (and re-reviewing) code, and there’s no single place that records what the run’s settings actually were. The second is a secrets problem, and a more serious one: a credential that should never have entered the repository is now in its history permanently — recall from Chapter 2 that deleting the file doesn’t remove it from past commits. This chapter is about getting both kinds of value out of your code: configuration into files you can change without touching logic, and secrets out of the repository entirely.

11.2 Configuration is not code

The principle is to separate the what from the how — the values a run depends on (file paths, thresholds, hyperparameters, table names, model versions) from the logic that uses them. The values live in a configuration file, loaded at runtime; the code reads them by name. A small config.yaml might hold:

data:
  raw_path: data/raw/customers.csv
  min_active_days: 1
model:
  test_size: 0.2
  random_state: 42
  n_estimators: 300

This buys three things. The same code runs in development and in production with different config rather than different code. A threshold can change without anyone editing a line of logic, so the change is low-risk and doesn’t need a code review of the algorithm. And the config file is a readable, version-controllable record of the settings a given run used — far better than reconstructing them from scattered literals.
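As a minimal sketch of "loaded at runtime, read by name", the snippet below writes a settings file to a temporary directory and loads it back. It makes two assumptions that are not in the text above: JSON stands in for YAML so the example needs only the standard library (with PyYAML installed, `yaml.safe_load` would fill the same role), and the helper names `load_config` and `describe_run` are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Same settings as the config.yaml above, written as JSON (an assumption
# made so that this sketch runs with the standard library alone).
CONFIG_TEXT = """
{
  "data": {"raw_path": "data/raw/customers.csv", "min_active_days": 1},
  "model": {"test_size": 0.2, "random_state": 42, "n_estimators": 300}
}
"""


def load_config(path: Path) -> dict:
    """Read the settings file once, at startup."""
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def describe_run(config: dict) -> str:
    """Logic refers to settings by name; no literal values are baked in."""
    model = config["model"]
    return f"{model['n_estimators']} trees, test_size={model['test_size']}"


# Write the file to a temporary directory so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    config_path.write_text(CONFIG_TEXT, encoding="utf-8")
    config = load_config(config_path)

summary = describe_run(config)
print(summary)
```

Changing `n_estimators` now means editing the settings file; `describe_run` is untouched.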

11.3 Typed, validated configuration

Loading config into a bare dictionary trades one problem for another: a typo in a key (raw_paht) gives you a silent None from a .get() lookup, or a KeyError deep in the run at 2am, rather than a clear error at startup. The fix is to load configuration into a typed object that validates itself at startup, so a missing key, a wrong type, or an out-of-range value fails immediately, with a message that says what’s wrong. pydantic is a widely used tool for this.

from pydantic import BaseModel, Field, ValidationError

class ModelConfig(BaseModel):
    test_size: float = Field(gt=0, lt=1)     # must be a proportion
    random_state: int
    n_estimators: int = Field(gt=0)

# A valid config (as if parsed from the YAML above) loads cleanly.
config = ModelConfig(test_size=0.2, random_state=42, n_estimators=300)
print(f"loaded: test_size={config.test_size}, n_estimators={config.n_estimators}")

# An invalid one fails at startup with a precise message — not a silent None.
try:
    ModelConfig(test_size=1.5, random_state=42, n_estimators=300)
except ValidationError as exc:
    first = exc.errors()[0]
    print(f"rejected: {first['loc'][0]}: {first['msg']}")
loaded: test_size=0.2, n_estimators=300
rejected: test_size: Input should be less than 1

The valid config loads into an object whose fields are typed and autocomplete in your editor; the invalid one — a test_size of 1.5, which is not a proportion — is rejected the moment it’s loaded, naming the offending field. Catching a bad configuration at startup, rather than three stages into a pipeline, is the same boundary-validation idea as the pipeline gates of the previous chapter, applied to the run’s settings.
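For contrast, the bare-dictionary failure mode that this section opened with can be reproduced in a few lines. This is an illustrative sketch only; the dictionary contents and variable names are hypothetical.

```python
# Hypothetical parsed config with a typo in one key ("raw_paht").
raw_config = {"raw_paht": "data/raw/customers.csv", "min_active_days": 1}

# .get() hides the typo: the lookup succeeds and quietly returns None.
raw_path = raw_config.get("raw_path")
print(f"raw_path = {raw_path!r}")

# Square-bracket access does raise, but only when this line finally runs,
# which may be several stages into a pipeline rather than at startup.
try:
    raw_config["raw_path"]
    key_error_raised = False
except KeyError as exc:
    key_error_raised = True
    print(f"KeyError: {exc}")
```

Neither outcome names the real problem, which is that the settings file contains a misspelled key. A schema checked at load time is what turns the typo into an immediate, specific error.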

Note: Data Science Bridge

You already externalise configuration, informally. The dictionary of hyperparameters you pass to a model, or the block of PARAMS at the top of a notebook that everything below reads from, is exactly this instinct: pull the knobs out of the logic so they live in one place you can adjust. Moving them into a validated config file is that same move made robust — one place for the settings, checked for sanity before the run starts.

Where it breaks down: a hyperparameter dictionary is consumed once, in one process, to fit one model. Application configuration often selects behaviour across environments — which database to connect to, which storage bucket to write to, whether to run in debug mode — a concern a hyperparameter sweep never has. So config grows a dimension the hyperparameter dict doesn’t: the same code, the same config schema, but different values in dev, staging, and production.
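One way to picture that extra dimension is a lookup keyed by environment name. The sketch below is an assumption-laden illustration: the variable name `APP_ENV`, the table names, and the `select_config` helper are all hypothetical, and in practice each block would more likely be its own file sharing one schema.

```python
# Hypothetical per-environment settings: same keys everywhere,
# different values in each environment.
CONFIGS = {
    "dev": {"table": "customers_sample", "debug": True},
    "staging": {"table": "customers_staging", "debug": False},
    "production": {"table": "customers", "debug": False},
}


def select_config(environ: dict) -> dict:
    """Pick settings by environment name (APP_ENV is an assumed variable)."""
    env_name = environ.get("APP_ENV", "dev")
    if env_name not in CONFIGS:
        # Fail loudly on an unknown name instead of guessing a default.
        raise ValueError(f"unknown environment: {env_name!r}")
    return CONFIGS[env_name]


dev_config = select_config({})
prod_config = select_config({"APP_ENV": "production"})
print(dev_config["table"], prod_config["table"])
```

In a real application the argument would be `os.environ`; plain dictionaries are passed here so the example behaves the same on any machine.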

11.4 Secrets never live in the repository

A secret — a database password, an API key, an access token — is configuration that must never be committed. The rule is absolute because the cost is asymmetric: the convenience of pasting a key inline is small, and the consequence of it entering the repository’s permanent history is large. Secrets are loaded from the environment instead (the environment variables from Chapter 4), kept locally in a .env file that is git-ignored, and injected by a secret manager in production.

import os

# In production the platform sets this; locally it comes from an
# untracked .env file. Either way, the value is never in the code.
os.environ.setdefault("DATABASE_URL", "postgresql://localhost/dev")  # demo default

database_url = os.environ["DATABASE_URL"]
print(f"connecting to: {database_url}")
print("The code names the variable; the value lives outside the repository.")
connecting to: postgresql://localhost/dev
The code names the variable; the value lives outside the repository.
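A bare `os.environ[...]` lookup raises a KeyError when the variable is missing, which is correct but terse. A small wrapper can make the failure say what to do next. The helper name `require_env` and the variable name used in the demonstration are hypothetical.

```python
import os


def require_env(name: str) -> str:
    """Return the variable's value, or fail with a message that names it."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing required environment variable: {name}")
    return value


# Demonstrate the failure with a name that should not be set anywhere.
message = ""
try:
    require_env("EXAMPLE_UNSET_SECRET_FOR_DEMO")
except RuntimeError as exc:
    message = str(exc)
    print(message)
```

The error message names the variable and never its value, so it is safe to appear in logs.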

The pattern that keeps this honest is to commit a .env.example listing the names of the variables the application needs, with placeholder values, while the real .env stays untracked:

# .env.example  (committed — a template, no real values)
DATABASE_URL=postgresql://user:password@host:5432/dbname
MODEL_API_KEY=your-key-here

A new colleague copies .env.example to .env, fills in the real values, and is running — without any secret ever touching the repository. (And if a secret was ever committed, removing it from the code is not enough: it must also be rotated, because it lives in the history.)
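Because .env.example lists every variable the application needs, it can also act as a startup checklist. The sketch below parses the template's names and reports which are unset. It is a simplified illustration that assumes plain NAME=value lines; the helper names are hypothetical, and a dictionary stands in for `os.environ` so the result is the same on any machine.

```python
TEMPLATE = """\
# .env.example  (committed: a template, no real values)
DATABASE_URL=postgresql://user:password@host:5432/dbname
MODEL_API_KEY=your-key-here
"""


def variable_names(template_text: str) -> list[str]:
    """Collect the NAME part of each NAME=value line, skipping comments."""
    names = []
    for line in template_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        names.append(line.split("=", 1)[0].strip())
    return names


def missing_variables(template_text: str, environ: dict) -> list[str]:
    """Names the template requires that are unset or empty in environ."""
    return [n for n in variable_names(template_text) if not environ.get(n)]


# Stand-in environment with only one of the two variables filled in.
fake_environ = {"DATABASE_URL": "postgresql://localhost/dev"}
missing = missing_variables(TEMPLATE, fake_environ)
print(f"missing: {missing}")
```

Run at startup against the real environment, a check like this tells a new colleague which values they still need to fill in, without revealing any of them.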

11.5 Configuration for experiments

There’s a bonus for data science specifically. Once hyperparameters live in config files rather than inline, each experiment’s settings become a version-controlled artefact — a record of exactly what you ran, which is the experiment-log discipline from Chapter 2 made concrete. A config file per experiment, committed alongside the code, lets you reproduce or compare runs precisely, and tools like Hydra extend this to composing and sweeping over configurations. Configuration stops being plumbing and becomes part of how your results are made reproducible.
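To make "reproduce or compare runs precisely" concrete, here is a minimal standard-library sketch that fingerprints a configuration and lists the settings that differ between two runs. It does not show how Hydra works; the function names and the example configurations are hypothetical.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Short, stable identifier for a config, independent of key order."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def changed_settings(old: dict, new: dict) -> dict:
    """Map each key whose value differs to its (old, new) pair."""
    return {k: (old[k], new[k]) for k in old if k in new and old[k] != new[k]}


baseline = {"n_estimators": 300, "test_size": 0.2, "random_state": 42}
reordered = {"random_state": 42, "test_size": 0.2, "n_estimators": 300}
variant = {"n_estimators": 500, "test_size": 0.2, "random_state": 42}

same_run = config_fingerprint(baseline) == config_fingerprint(reordered)
different_run = config_fingerprint(baseline) != config_fingerprint(variant)
changes = changed_settings(baseline, variant)
print(same_run, different_run, changes)
```

Storing the fingerprint next to each result ties a metric to the exact settings that produced it, and the diff answers "what changed between these two runs?" without rereading code.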

Tip: Author’s Note

Hard-coding is genuinely faster in the moment, and pretending otherwise would be dishonest. Typing 0.73 directly into the cell beats stopping to set up a config file when you’re iterating, and during exploration that trade is often correct — the value is changing every few minutes and isn’t worth externalising yet. The cost is invisible right up until the work has to run somewhere else, or again: the value you tuned by hand is now buried in logic with no record of why it’s 0.73, and anyone running the code elsewhere has to find and change it in place.

The reframe is that externalising configuration is the point at which your code stops being tied to one machine and one run. Pull the values out when the code graduates from scratch to kept, the same threshold as every other practice in this part. Secrets are the one exception to the “later is fine” rule: pull those out immediately, because a hard-coded threshold is a small cleanup deferred, but a committed credential is an expensive mistake that the version history makes permanent.

11.6 Summary

Getting values out of code makes it portable, reproducible, and safe to share:

  1. Configuration is not code. Lift the values a run depends on — paths, thresholds, hyperparameters — into a file loaded at runtime, so the same code runs anywhere and changing a setting doesn’t mean editing logic.

  2. Load config into a typed, validated object. A pydantic model catches a typo, a wrong type, or an out-of-range value at startup, instead of failing silently deep in a run.

  3. Secrets never enter the repository. Load them from the environment, keep a git-ignored .env locally and a committed .env.example template, and use a secret manager in production — a committed secret is permanent and must be rotated.

  4. Configured experiments are reproducible experiments. Hyperparameters in version-controlled config files become a precise record of what each run used.

This closes Part 3. With code that is structured, pipelined, and configured, Part 4 takes it to production, beginning with the safety net that makes frequent change safe: continuous integration.

11.7 Exercises

  1. Take a script or notebook of your own with hard-coded values — paths, thresholds, table names — and lift them into a config file loaded at runtime. Which of those values turned out to differ between your machine and where the code actually needs to run?

  2. Find a secret in your code or, worse, in your repository’s history (a password, key, or token) and move it to an environment variable loaded at runtime; add .env to .gitignore and commit a .env.example template instead. If the secret was ever committed, note why moving it is not sufficient on its own.

  3. Load your configuration into a typed pydantic object rather than a bare dictionary, and add at least one constraint (a numeric range, or an allowed set of values). Feed it a deliberately invalid configuration and confirm it fails at startup with a message that names the problem.

  4. Conceptual: The Data Science Bridge compares externalised configuration to a hyperparameter dictionary. Give one way the analogy holds and one way it breaks down. What does application configuration have to handle that a hyperparameter dictionary never does?

  5. Conceptual: Hard-coding is not always wrong. Distinguish a value that should be configuration from one that’s perfectly fine as a named constant in the code, and state the signal that tells you a hard-coded value has become a liability.