11 Configuration and secrets

11.1 The password in the notebook

Two lines from a real notebook, both committed to a shared repository. The first: THRESHOLD = 0.73, a value tuned by hand one afternoon, sitting in the middle of a cell with no explanation. The second, a few cells down: conn = connect("postgresql://admin:hunter2@prod-db:5432/customers"), a production database password in plain text. Each is a different kind of mistake, and both are endemic to data science code.

The first is a configuration problem: a value that ought to be adjustable is baked into the logic, so changing it means editing (and re-reviewing) code, and there’s no single place that records what the run’s settings actually were. The second is a secrets problem, and a more serious one: a credential that should never have entered the repository is now in its history permanently — recall from Chapter 2 that deleting the file doesn’t remove it from past commits. This chapter is about getting both kinds of value out of your code: configuration into files you can change without touching logic, and secrets out of the repository entirely.

11.2 Configuration is not code

The principle is to separate the what from the how — the values a run depends on (file paths, thresholds, hyperparameters, table names, model versions) from the logic that uses them. The values live in a configuration file, loaded at runtime; the code reads them by name. A small config.yaml might hold:

data:
  raw_path: data/raw/customers.csv
  min_active_days: 1
model:
  test_size: 0.2
  random_state: 42
  n_estimators: 300

This buys three things. The same code runs in development and in production with different config rather than different code. A threshold can change without anyone editing a line of logic, so the change is low-risk and doesn’t need a code review of the algorithm. And the config file is a readable, version-controllable record of the settings a given run used — far better than reconstructing them from scattered literals.
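A minimal sketch of what "loaded at runtime, read by name" can look like. To stay inside the standard library it parses JSON; for the config.yaml above, PyYAML's yaml.safe_load would take the place of json.loads, assuming that package is installed. The load_config helper and the temporary file are illustrative only.

```python
import json
import tempfile
from pathlib import Path


def load_config(path: Path) -> dict:
    """Parse a config file into a plain nested dictionary."""
    return json.loads(path.read_text(encoding="utf-8"))


# Stand-in for a config file on disk, with the same values as config.yaml above.
CONFIG_TEXT = """
{
  "data": {"raw_path": "data/raw/customers.csv", "min_active_days": 1},
  "model": {"test_size": 0.2, "random_state": 42, "n_estimators": 300}
}
"""

with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    config_path.write_text(CONFIG_TEXT, encoding="utf-8")
    config = load_config(config_path)

# The logic refers to settings by name; no literal values appear here.
test_size = config["model"]["test_size"]
raw_path = config["data"]["raw_path"]
print(f"splitting {raw_path} with test_size={test_size}")
```

Changing test_size now means editing the config file; the lines that use it stay untouched.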

11.3 Typed, validated configuration

Loading config into a bare dictionary trades one problem for another: a typo in a key (raw_paht) isn’t noticed when the file is loaded, only when the code reads the value, and then it surfaces as a silent None from .get() or a bare KeyError at 2am rather than a clear error at startup. The fix is to load configuration into a typed object that validates itself at startup, so a missing key, a wrong type, or an out-of-range value fails immediately, with a message that says what’s wrong. pydantic is a widely used tool for this.

from pydantic import BaseModel, Field, ValidationError

class ModelConfig(BaseModel):
    test_size: float = Field(gt=0, lt=1)     # must be a proportion
    random_state: int
    n_estimators: int = Field(gt=0)

# A valid config (as if parsed from the YAML above) loads cleanly.
config = ModelConfig(test_size=0.2, random_state=42, n_estimators=300)
print(f"loaded: test_size={config.test_size}, n_estimators={config.n_estimators}")

# An invalid one fails at startup with a precise message — not a silent None.
try:
    ModelConfig(test_size=1.5, random_state=42, n_estimators=300)
except ValidationError as exc:
    print(f"rejected: {exc.errors()[0]['loc'][0]} — {exc.errors()[0]['msg']}")

loaded: test_size=0.2, n_estimators=300
rejected: test_size — Input should be less than 1

The valid config loads into an object whose fields are typed and autocomplete in your editor; the invalid one — a test_size of 1.5, which is not a proportion — is rejected the moment it’s loaded, naming the offending field. Catching a bad configuration at startup, rather than three stages into a pipeline, is the same boundary-validation idea as the pipeline gates of the previous chapter, applied to the run’s settings.
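One caveat is worth sketching: by default a pydantic model ignores keys it doesn't recognise, so a typo in an optional key quietly falls back to the default. Setting extra="forbid" turns that typo into a startup error. The two DataConfig classes below are illustrative, modelled on the data block of the YAML above, and the sketch assumes pydantic version 2.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class LenientDataConfig(BaseModel):
    raw_path: str
    min_active_days: int = 1  # optional, with a default


class StrictDataConfig(BaseModel):
    # Reject any key that is not a declared field.
    model_config = ConfigDict(extra="forbid")

    raw_path: str
    min_active_days: int = 1


# Note the misspelt key: min_active_dyas.
raw = {"raw_path": "data/raw/customers.csv", "min_active_dyas": 30}

# The lenient model drops the unknown key and uses the default instead.
lenient = LenientDataConfig(**raw)
print(f"lenient: min_active_days={lenient.min_active_days}")

# The strict model refuses to load and names the unknown key.
try:
    StrictDataConfig(**raw)
    rejected_keys = []
except ValidationError as exc:
    rejected_keys = [err["loc"][0] for err in exc.errors()]
    print(f"strict: rejected unknown keys {rejected_keys}")
```

The lenient model runs with a 1-day threshold the author never intended; the strict one stops before the run begins.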

Data Science Bridge

You already externalise configuration, informally. The dictionary of hyperparameters you pass to a model, or the block of PARAMS at the top of a notebook that everything below reads from, is exactly this instinct: pull the knobs out of the logic so they live in one place you can adjust. Moving them into a validated config file is that same move made robust — one place for the settings, checked for sanity before the run starts.

Where it breaks down: a hyperparameter dictionary is consumed once, in one process, to fit one model. Application configuration often selects behaviour across environments — which database to connect to, which storage bucket to write to, whether to run in debug mode — a concern a hyperparameter sweep never has. So config grows a dimension the hyperparameter dict doesn’t: the same code, the same config schema, but different values in dev, staging, and production.
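A sketch of that extra dimension, assuming a hypothetical APP_ENV variable that names the current environment. The shared settings and the per-environment values are invented for illustration; in practice each block would likely live in its own file.

```python
import os

# Settings that should be identical everywhere (hypothetical values).
SHARED = {"test_size": 0.2, "random_state": 42, "n_estimators": 300}

# Settings expected to differ per environment (hypothetical values).
# Only non-secret resource names belong here, never credentials.
PER_ENVIRONMENT = {
    "dev": {"bucket": "scratch-bucket", "debug": True},
    "staging": {"bucket": "staging-bucket", "debug": False},
    "production": {"bucket": "prod-bucket", "debug": False},
}


def build_config(env_name: str) -> dict:
    """Combine the shared settings with one environment's values."""
    if env_name not in PER_ENVIRONMENT:
        known = ", ".join(sorted(PER_ENVIRONMENT))
        raise ValueError(f"unknown environment {env_name!r}; expected one of: {known}")
    return {**SHARED, **PER_ENVIRONMENT[env_name]}


# APP_ENV is an assumed variable name; fall back to dev when it is unset.
config = build_config(os.environ.get("APP_ENV", "dev"))
print(config)
```

The model settings are written once and shared, so production cannot drift from what was tested in dev, while the bucket and debug flag vary as they must.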

11.4 Secrets never live in the repository

A secret — a database password, an API key, an access token — is configuration that must never be committed. The rule is absolute because the cost is asymmetric: the convenience of pasting a key inline is small, and the consequence of it entering the repository’s permanent history is large. Secrets are loaded from the environment instead (the environment variables from Chapter 4), kept locally in a .env file that is git-ignored, and injected by a secret manager in production — a dedicated service (AWS Secrets Manager, Google Secret Manager, HashiCorp Vault) that stores credentials encrypted, controls who and what may read each one, records every access, and hands the value to your process at startup so it never rests on disk in plain text.

import os

from dotenv import load_dotenv

# Reads .env into the process environment if the file exists, and does
# nothing if it doesn't. Deployed environments have no .env — the platform
# sets the variables directly — so the same call is correct in both places.
load_dotenv()

os.environ.setdefault("DATABASE_URL", "postgresql://localhost/dev")  # demo default

database_url = os.environ["DATABASE_URL"]
print(f"connecting to: {database_url}")
print("The code names the variable; the value lives outside the repository.")

connecting to: postgresql://localhost/dev
The code names the variable; the value lives outside the repository.
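One caution about the print above: the demo URL carries no password, but a real DATABASE_URL usually does, and printing it would copy the secret into logs. Below is a hedged sketch of masking the password first, using only the standard library. redact_url is a hypothetical helper, and it does not cover every URL form (IPv6 hosts, for instance).

```python
from urllib.parse import urlsplit, urlunsplit


def redact_url(url: str) -> str:
    """Return the URL with any password replaced by asterisks."""
    parts = urlsplit(url)
    if parts.password is None:
        return url  # nothing to hide
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    netloc = f"{parts.username or ''}:***@{host}"
    return urlunsplit(parts._replace(netloc=netloc))


print(redact_url("postgresql://admin:hunter2@prod-db:5432/customers"))
print(redact_url("postgresql://localhost/dev"))
```

The first call prints postgresql://admin:***@prod-db:5432/customers; the second URL has no password and is returned unchanged.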

The pattern that keeps this honest is to commit a .env.example listing the names of the variables the application needs, with placeholder values, while the real .env stays untracked:

# .env.example  (committed — a template, no real values)
DATABASE_URL=postgresql://user:password@host:5432/dbname
MODEL_API_KEY=your-key-here

The load_dotenv() call is what connects the two halves, and it’s the step most often left out: a .env file is not magic, it’s just a file, and nothing reads it unless something asks. Call it once, as early as possible (at the top of the entry point, not inside the module that happens to need a key first) so that every later os.environ lookup sees the same populated environment regardless of import order. Without it, the flow below ends in a KeyError on a variable the user can plainly see they set in .env.

A new colleague copies .env.example to .env, fills in the real values, and is running — without any secret ever touching the repository. (And if a secret was ever committed, removing it from the code is not enough: it must also be rotated, because it lives in the history.)
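A bare os.environ lookup reports only the first missing variable it meets. The sketch below fails at startup with every missing name at once. read_required_env is a hypothetical helper, and the variable names are those from the .env.example above.

```python
import os
from collections.abc import Mapping


def read_required_env(
    names: list[str], environ: Mapping[str, str] = os.environ
) -> dict[str, str]:
    """Return the named variables, or raise one error listing all that are missing."""
    # An empty string counts as missing: a blank line in .env is not a value.
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(
            "missing environment variables: " + ", ".join(missing)
            + " (copy .env.example to .env and fill them in)"
        )
    return {name: environ[name] for name in names}


# Demonstrated against a stand-in mapping rather than the real environment.
fake_env = {"DATABASE_URL": "postgresql://localhost/dev"}
try:
    read_required_env(["DATABASE_URL", "MODEL_API_KEY"], environ=fake_env)
except RuntimeError as exc:
    print(exc)
```

Called right after load_dotenv() at the entry point, a check like this tells a new colleague exactly which lines of their .env still need filling in.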

11.5 Configuration for experiments

There’s a bonus for data science specifically. Once hyperparameters live in config files rather than inline, each experiment’s settings become a version-controlled artefact — a record of exactly what you ran, which is the experiment-log discipline from Chapter 2 made concrete. A config file per experiment, committed alongside the code, lets you reproduce or compare runs precisely, and tools like Hydra extend this to composing and sweeping over configurations. Configuration stops being plumbing and becomes part of how your results are made reproducible.
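One way to make "a record of exactly what you ran" concrete is to save the config next to each run's outputs, under an identifier derived from its contents. This is a standard-library sketch, not how Hydra does it; config_fingerprint, record_run, and the directory layout are all hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def config_fingerprint(config: dict) -> str:
    """Short, stable identifier derived from the config's contents."""
    # Sorting keys makes the result independent of key order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def record_run(config: dict, runs_dir: Path) -> Path:
    """Write the exact config a run used into its own directory."""
    run_dir = runs_dir / config_fingerprint(config)
    run_dir.mkdir(parents=True, exist_ok=True)
    text = json.dumps(config, indent=2, sort_keys=True)
    (run_dir / "config.json").write_text(text, encoding="utf-8")
    return run_dir


baseline = {"model": {"test_size": 0.2, "random_state": 42, "n_estimators": 300}}
variant = {"model": {"test_size": 0.2, "random_state": 42, "n_estimators": 500}}

with tempfile.TemporaryDirectory() as tmp:
    run_a = record_run(baseline, Path(tmp))
    run_b = record_run(variant, Path(tmp))
    saved = json.loads((run_a / "config.json").read_text(encoding="utf-8"))
    print(f"baseline -> {run_a.name}, variant -> {run_b.name}")
```

Two runs with different settings land in different directories, and comparing them is a diff of two small files.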

Author’s Note

Hard-coding is genuinely faster in the moment, and pretending otherwise would be dishonest. Typing 0.73 directly into the cell beats stopping to set up a config file when you’re iterating, and during exploration that trade is often correct — the value is changing every few minutes and isn’t worth externalising yet. The cost is invisible right up until the work has to run somewhere else, or again: the value you tuned by hand is now buried in logic with no record of why it’s 0.73, and anyone running the code elsewhere has to find and change it in place.

The reframe is that externalising configuration is the point at which your code stops being tied to one machine and one run. Pull the values out when the code graduates from scratch to kept, the same threshold as every other practice in this part. Secrets are the one exception to the “later is fine” rule: pull those out immediately, because a hard-coded threshold is a small cleanup deferred, but a committed credential is an expensive mistake that the version history makes permanent.

11.6 Summary

Getting values out of code makes it portable, reproducible, and safe to share:

1. Configuration is not code. Lift the values a run depends on (paths, thresholds, hyperparameters) into a file loaded at runtime, so the same code runs anywhere and changing a setting doesn’t mean editing logic.
2. Load config into a typed, validated object. A pydantic model catches a typo, a wrong type, or an out-of-range value at startup, instead of failing silently deep in a run.
3. Secrets never enter the repository. Load them from the environment, keep a git-ignored .env locally and a committed .env.example template, and use a secret manager in production; a committed secret is permanent and must be rotated.
4. Configured experiments are reproducible experiments. Hyperparameters in version-controlled config files become a precise record of what each run used.

With code that is structured, pipelined, and configured, one piece of Part 3 remains: the interface through which everything else reaches your model. The next chapter closes the part with API design.

11.7 Exercises

1. Take a script or notebook of your own with hard-coded values (paths, thresholds, table names) and lift them into a config file loaded at runtime. Which of those values turned out to differ between your machine and where the code actually needs to run?
2. Find a secret in your code or, worse, in your repository’s history (a password, key, or token) and move it to an environment variable loaded at runtime; add .env to .gitignore and commit a .env.example template instead. If the secret was ever committed, note why moving it is not sufficient on its own.
3. Load your configuration into a typed pydantic object rather than a bare dictionary, and add at least one constraint (a numeric range, or an allowed set of values). Feed it a deliberately invalid configuration and confirm it fails at startup with a message that names the problem.
4. Conceptual: A colleague reasoning from the hyperparameter-dictionary analogy keeps one config file per run and hand-copies it when the pipeline moves to production. Predict what goes wrong the first time production needs the same model settings but a different database, and say which parts of a configuration should be identical across environments and which must differ.
5. Conceptual: Hard-coding is not always wrong. Distinguish a value that should be configuration from one that’s perfectly fine as a named constant in the code, and state the signal that tells you a hard-coded value has become a liability.

Content: CC BY-NC-SA 4.0 | Code: MIT - see /LICENSE.md