22 Reproducible research pipeline

22.1 The result no one can reproduce

Six months ago you produced a figure that went into a board deck — a chart showing that high-value customers churn at half the rate of the rest, with a headline number attached. Now someone asks the obvious follow-up: rerun it on this quarter’s data. You open the project and your heart sinks. The notebook has been edited since; there are three CSVs and you’re not certain which one fed the chart; the package versions have moved on; and the figure was exported by hand and pasted into the slides. You can produce a number. You cannot reliably reproduce that number, or update it with confidence that nothing else changed underneath.

This is the mirror image of the previous chapter’s problem: there, the deliverable was a running service; here, the deliverable is a result (a figure, a table, a number in a report), and a result you can’t regenerate is a result you can’t defend, update, or trust. This chapter builds a pipeline where a single command rebuilds every figure and every number from the raw data, the same way every time, so that “rerun it on new data” is one command and “how did you get that?” has an answer.

22.2 What reproducibility actually requires

A result is the output of feeding inputs through code. To reproduce it exactly, you have to pin every input, because any one of them drifting will move the result. There are four, and you’ve met all but one already: the code (version control, Chapter 2), the environment (a lockfile, Chapter 3), the randomness (fixed seeds, Chapters 1 and 7), and the data — the specific version of it that produced this result. Pin three and leave the fourth floating and the result can still change out from under you. Reproducibility is the discipline of nailing down all four together.
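One way to make "pin all four" concrete is to write a small manifest next to every result, with one identifier per input. The sketch below is a minimal illustration, not a prescribed format: the function names, the manifest keys, and the idea of passing the commit hash in as a string are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(commit: str, lockfile: Path, seed: int, data: Path) -> dict:
    """Record one identifier per input: code, environment, randomness, data."""
    return {
        "code_commit": commit,                   # e.g. the output of `git rev-parse HEAD`
        "lockfile_sha256": sha256_of(lockfile),  # pins the environment
        "seed": seed,                            # pins the randomness
        "data_sha256": sha256_of(data),          # pins the data version
    }


def drifted(old: dict, new: dict) -> list:
    """Name the inputs that differ between two manifests (empty list: none moved)."""
    return [key for key in old if old[key] != new.get(key)]
```

If `drifted` returns an empty list for two runs, any difference in the result has to come from somewhere other than these four inputs, which is itself worth knowing.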

22.3 Versioning the data

The new piece is the data. Code belongs in Git and data does not (Chapters 2 and 9) — but a result depends on a specific version of the data, so it must be versioned too, just separately. Data-versioning tools, of which DVC is the most common, solve this neatly: the large file lives in external storage, while a tiny text pointer to it lives in Git alongside your code.

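dvc add data/raw/customers.csv     # caches the data, writes customers.csv.dvc and a .gitignore entry
git add data/raw/customers.csv.dvc data/raw/.gitignore   # the small pointer is what Git tracks
git commit -m "Add Q3 customer snapshot"
dvc push                           # uploads the cached data to the configured remote storage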

Because the pointer is committed with the code, checking out a past commit also selects the exact data that went with it. Running git checkout on the commit behind that board figure, followed by dvc checkout, restores the code, the lockfile, and the dataset that produced it. Reinstall from that lockfile to rebuild the environment and, since the seeds live in the code, all four inputs travel together as one versioned whole.
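The pointer idea is easier to trust once you have seen how little it involves. The sketch below is a toy illustration of content-addressed storage, not DVC's actual implementation or file format: the `track` and `restore` functions, the `.pointer.json` suffix, and the cache layout are all invented for the example.

```python
import hashlib
import json
import shutil
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of the file's bytes; the content itself becomes the version ID."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy the data into a content-addressed cache and write a small pointer file."""
    digest = file_digest(data_file)
    cache_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(data_file, cache_dir / digest)    # the large file, stored under its hash
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"path": data_file.name, "sha256": digest}, indent=2))
    return pointer                                    # this small file is what Git would track


def restore(pointer: Path, cache_dir: Path) -> Path:
    """Rebuild the data file named by a pointer, verifying its content hash."""
    meta = json.loads(pointer.read_text())
    target = pointer.with_name(meta["path"])
    shutil.copyfile(cache_dir / meta["sha256"], target)
    if file_digest(target) != meta["sha256"]:
        raise ValueError(f"cache entry for {target.name} is corrupted")
    return target
```

Because the pointer holds a hash rather than a filename or a date, an old pointer can only ever resolve to the bytes it was created from. That is the property that lets a past commit bring back its own data.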

Data Science Bridge

Versioning the data is the experiment-reproducibility you already value, taken to completion. You fix a random seed so a stochastic step repeats; you freeze a holdout set so this week’s model is comparable to last week’s. Versioning the data applies the same instinct to the input itself — pin the exact dataset behind a result so that “rerun the analysis” means rerun it on the same data, not on whatever the table happens to hold today. A reproducible pipeline is, in this sense, a controlled experiment: hold every input fixed, and the only thing that can change the result is a change you made on purpose. (The companion volume, Thinking in Uncertainty, examines what makes such a comparison statistically valid in the first place.)

Where it breaks down: a seed is a single integer you can drop into the code, but data is large and lives outside it, so it needs its own storage and its own pointer rather than a line in a script. DVC exists precisely to give the dataset a seed-like handle — a small, committable reference to a large, external thing.

22.4 The pipeline as a graph of stages

The analysis itself is a directed graph of stages — raw data to cleaned, cleaned to features, features to the analysis, analysis to figures and tables — exactly the pipeline shape from Chapter 10, now in service of a result rather than a model. Each stage is a function with declared inputs and outputs, and the whole thing is deterministic when every stage that touches randomness is seeded. The crucial property, the one the entire chapter rests on, is that running it twice gives the same answer:

import numpy as np

def run_analysis(seed=None):
    """The pipeline in miniature: generate, bootstrap, summarise — one result."""
    rng = np.random.default_rng(seed)
    spend = rng.exponential(50, 1_000)                       # stage 1: the data
    boot_means = [rng.choice(spend, spend.size, replace=True).mean()
                  for _ in range(200)]                       # stage 2: bootstrap
    return round(float(np.mean(boot_means)), 4)              # stage 3: the headline number

# Without a fixed seed, the "same" analysis produces a different number each run:
print(f"unseeded:  {run_analysis()}  then  {run_analysis()}")

# With a fixed seed, the result is identical every time — reproducible:
print(f"seeded:    {run_analysis(42)}  then  {run_analysis(42)}")
assert run_analysis(42) == run_analysis(42)
print("the seeded pipeline is reproducible to the last digit")

unseeded:  48.7318  then  50.5423
seeded:    50.6909  then  50.6909
the seeded pipeline is reproducible to the last digit

The unseeded runs disagree; the seeded runs are identical to the last digit. That determinism is what makes the rest possible — without it, “rebuild the result” would produce a different result, and reproducibility would be meaningless. A task runner (make, or a research-oriented one like Snakemake) wires the stages into a graph, rebuilding only what changed and running them in dependency order, just as in Chapter 10.
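What a task runner does with the graph can be sketched in a few lines using the standard library's graphlib module (Python 3.9+). This is a simplified illustration, not how make or Snakemake are implemented. The stage names are hypothetical, and real runners decide staleness from file timestamps or hashes rather than from a named change.

```python
from graphlib import TopologicalSorter

# Hypothetical stage names; each maps to the set of stages it depends on.
STAGES = {
    "clean": {"raw"},
    "features": {"clean"},
    "analysis": {"features"},
    "figures": {"analysis"},
    "tables": {"analysis"},
}


def build_order(stages):
    """Return every node in an order where dependencies always come first."""
    return list(TopologicalSorter(stages).static_order())


def stale_after(changed, stages):
    """Stages that must rerun when `changed` changes: everything downstream of it."""
    stale = {changed}
    order = build_order(stages)
    for stage in order:                        # dependencies are visited before dependents
        if stages.get(stage, set()) & stale:   # any stale dependency makes the stage stale
            stale.add(stage)
    return [stage for stage in order if stage in stale and stage != changed]
```

Changing `features` leaves `clean` untouched and reruns only `analysis`, `figures`, and `tables`. Changing `raw` reruns everything.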

22.5 One command, from raw data to result

The payoff is that the entire result regenerates from one command. A Makefile declares the chain from raw data to rendered report, and make walks it:

data/processed/clean.csv: data/raw/customers.csv src/research/clean.py
	python -m research.clean

results/figures/churn_by_value.png: data/processed/clean.csv src/research/analyse.py
	python -m research.analyse

report.html: results/figures/churn_by_value.png report.qmd
	quarto render report.qmd

The report itself is generated, not assembled by hand: written in a tool like Quarto (the one this book uses), it runs the analysis and embeds the figures and numbers directly, so a number in the prose is computed from the data rather than pasted. That single discipline — never paste a result a human could fail to update — is what keeps the report honest, because a changed input flows through to every figure and sentence automatically, and a stale number cannot survive a rebuild.
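The "never paste a result" rule can be shown without any reporting tool. In the minimal sketch below, the sentence is a function of the data, so it cannot disagree with it. The rows are made up purely for illustration, and `churn_rate` and `render_sentence` are hypothetical names; in a Quarto report the same values would be computed in a code cell and embedded in the text.

```python
# Made-up rows purely for illustration: (customer segment, churned?)
ROWS = [
    ("high", False), ("high", False), ("high", False), ("high", True),
    ("rest", True), ("rest", False), ("rest", True), ("rest", False),
]


def churn_rate(rows, segment):
    """Share of customers in `segment` who churned."""
    flags = [churned for seg, churned in rows if seg == segment]
    return sum(flags) / len(flags)


def render_sentence(rows):
    """Fill the prose from computed values, so no number is typed by hand."""
    high = churn_rate(rows, "high")
    rest = churn_rate(rows, "rest")
    return f"High-value customers churn at {high:.0%}, against {rest:.0%} for the rest."


print(render_sentence(ROWS))
```

If next quarter's rows give different rates, the sentence changes with them. A pasted "25%" would have stayed put and quietly become wrong.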

22.6 Reproducibility as a test

The strongest possible guarantee is to make reproducibility something a machine checks. Continuous integration (Chapter 13) can rebuild the entire result from scratch, in a clean environment, on every change — fetching the versioned data, installing the locked dependencies, running the pipeline end to end, and failing if it can’t, or if the result changed when nothing should have. This turns “is it reproducible?” from a hope into a gate that runs automatically, and it catches the most insidious bug in research code: the silent dependency on something that exists only on your machine — a file in your home directory, a package you pip install-ed once and forgot, an environment variable you set months ago. If the clean rebuild succeeds, the result is genuinely reproducible by someone who is not you, on a machine that is not yours.
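The check a CI job runs can be as small as this: rebuild, fingerprint the results, and compare against a fingerprint committed alongside the code. This is a sketch under stated assumptions. `run_pipeline` is a stand-in built on the standard library's random module rather than the chapter's NumPy example, and `fingerprint` and `check_reproducible` are hypothetical names.

```python
import hashlib
import json
import random
import statistics
from typing import Optional


def run_pipeline(seed: int) -> dict:
    """Stand-in pipeline: simulate spend, bootstrap the mean, return the results."""
    rng = random.Random(seed)
    spend = [rng.expovariate(1 / 50) for _ in range(1_000)]
    boot_means = [statistics.fmean(rng.choices(spend, k=len(spend))) for _ in range(200)]
    return {"mean_spend": round(statistics.fmean(boot_means), 4), "n": len(spend)}


def fingerprint(results: dict) -> str:
    """Hash a canonical JSON encoding so a change in any reported number is detected."""
    canonical = json.dumps(results, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()


def check_reproducible(seed: int, expected: Optional[str] = None) -> str:
    """Run twice; fail if the runs differ or drift from the committed fingerprint."""
    first = fingerprint(run_pipeline(seed))
    second = fingerprint(run_pipeline(seed))
    if first != second:
        raise AssertionError("pipeline is not deterministic for a fixed seed")
    if expected is not None and first != expected:
        raise AssertionError("result changed although no input was meant to")
    return first
```

An exact-match fingerprint is deliberately strict. Floating-point results can differ slightly across platforms or library versions, so a real gate may need to round the reported numbers, as here, or compare within a tolerance.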

Author’s Note

Reproducibility is easy to dismiss as an academic virtue — something you owe a journal or a regulator, not yourself — which is exactly why it’s the first thing dropped when a deadline looms. The reframe that makes it urgent is to notice who actually needs to reproduce your result most often: you, three months from now, when a stakeholder questions the number or asks for the updated version, and “I can’t quite remember which CSV made that figure” is a genuinely frightening sentence to be forming in front of them. A reproducible pipeline is not ceremony for reviewers; it’s the difference between a result you can stand behind and one you can only hope was right.

The cost is real but front-loaded. Building the pipeline — versioning the data, seeding the randomness, wiring the stages, generating the report — is more work than running cells by hand once. But it is paid once, and every rerun afterwards is a single command instead of a day of archaeology. The exploratory notebook earns its place at the discovery stage; the reproducible pipeline is what you build the moment the result is something anyone, including future-you, will have to defend.

22.7 Summary

When the deliverable is a result, reproducibility is the product:

1. Pin all four inputs. Code, environment, randomness, and data each determine the result; leave any one floating and the result can drift.
2. Version the data, separately from the code. A tool like DVC keeps the large data external and a small pointer in Git, so checking out a commit restores the exact dataset behind a result.
3. Build the result as a pipeline, deterministically. Stages with declared inputs and outputs, seeded wherever randomness enters, regenerated from raw data by one command, with the report generated, never hand-pasted.
4. Make reproducibility a test. CI that rebuilds the whole result from scratch in a clean environment turns reproducibility from a hope into an automatically enforced guarantee.

The final chapter combines the live-service discipline of Chapter 21 with the pipeline discipline of this one into the most demanding end-to-end project: the MLOps pipeline.

22.8 Exercises

1. Take an analysis of your own and make it reproducible from one command: wire its steps into a Makefile or Snakemake file that goes from raw data to the final figures, and confirm that a single command rebuilds everything. Which step turned out to have a hidden dependency on something only your machine had?
2. Version a dataset with DVC (or, at minimum, record an immutable, dated copy of the raw data and a checksum). Then reconstruct the exact data that produced an old result, and confirm the result regenerates. What would have made this impossible before?
3. Find a number or figure in one of your reports that was pasted in by hand, and replace it with one generated from the data at render time (in Quarto, a notebook export, or similar). Why does the hand-pasted version eventually become wrong, and the generated one not?
4. Conceptual: The Data Science Bridge compares versioning the data to fixing a random seed. Give one way the analogy holds and one way it breaks down. Why does data need a different mechanism (DVC) than the single integer a seed requires?
5. Conceptual: A colleague says their analysis is reproducible “because it’s all in one notebook”. Identify two of the four inputs (code, environment, randomness, data) that a single notebook does not pin by itself, and describe what could change each one out from under them.

Content: CC BY-NC-SA 4.0 | Code: MIT - see /LICENSE.md