1 From notebook to system

1.1 The notebook that works

Every data scientist has one. A notebook that started as a quick exploration — “let me just check this distribution” — and grew. New cells got inserted between old ones. A variable defined in cell 14 gets used in cell 3 (which you re-ran after lunch). There’s a cell near the top that says # DON'T DELETE — this sets up the connection and another near the bottom that’s commented out but you’re afraid to remove because it might still be needed.

The notebook works. You can run it top-to-bottom — if you restart the kernel first, and remember to skip cell 22, and make sure the CSV is in the right folder. It produces the right numbers. You’ve checked them.

Now someone else needs to run it.

# A condensed version of what real notebooks look like.
# In the wild, the data would come from a file:
#   raw = pd.read_csv("data/customers_q3.csv")
# We use synthetic data here so the code runs anywhere.

import pandas as pd
import numpy as np

rng = np.random.default_rng(42)

# === Cell 1 (originally cell 5, moved up later) ===
raw = pd.DataFrame({
    "user_id": np.arange(1, 101),
    "spend": rng.exponential(50, 100),
    "active_days": rng.integers(0, 365, 100),
    "signup_year": rng.choice([2021, 2022, 2023], 100),
})

# === Cell 2 (the "cleaning" cell — re-run this if you changed the threshold) ===
SPEND_THRESHOLD = 200  # was 150, changed during review meeting
high_value = raw[raw["spend"] > SPEND_THRESHOLD].copy()
high_value["spend_per_day"] = high_value["spend"] / high_value["active_days"].replace(0, np.nan)

# === Cell 3 (depends on cell 2 but also on a variable from cell 7) ===
# segment = high_value.groupby("signup_year")["spend_per_day"].mean()
# ^ commented out — using median now, see cell 12

# === ... 20 more cells of this ...
print(f"High-value customers: {len(high_value)} ({len(high_value)/len(raw):.0%})")
High-value customers: 1 (1%)

This code is doing real work. The logic is sound. But it has a property that would alarm any software engineer: it is not a system. It is a trace of an exploration, frozen in time. It encodes decisions (the threshold change, the switch from mean to median) without recording why. It depends on execution order that isn’t enforced. And it cannot be run by anyone who wasn’t in the room when it was written.

1.2 What makes a system

Software engineers don’t write better code because they’re smarter. They write more maintainable code because they’ve been burned — repeatedly — by the consequences of not doing so. Over decades, the profession has converged on a set of practices that exist for one reason: code that works once is less valuable than code that works reliably.

A system, as distinct from a script, has several properties. The first is reproducibility: given the same inputs, it produces the same outputs — not “usually” or “if you run it right”, but reliably across machines and over time. Perfect bit-for-bit determinism is sometimes unachievable (floating-point arithmetic can vary across hardware, and GPU operations may introduce non-determinism), but the goal is to eliminate every source of variation you can control. For a data scientist, this means pinning your dependencies, fixing your random seeds, and versioning your data, not just your code.
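As a minimal sketch of two of these habits, the block below seeds a random generator explicitly and fingerprints a dataset so that silent changes become visible. It uses only the standard library so it runs anywhere; the same idea applies to the seeded NumPy generator used earlier. The function names `fingerprint` and `sample_spend` are hypothetical, chosen for illustration.

```python
import hashlib
import random


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest identifying the exact bytes of a dataset."""
    return hashlib.sha256(data).hexdigest()


def sample_spend(seed: int, n: int = 5) -> list[float]:
    """Draw n synthetic spend values from an explicitly seeded generator."""
    rng = random.Random(seed)  # local generator: no hidden global state
    return [rng.expovariate(1 / 50) for _ in range(n)]


# Same seed, same numbers, on every run.
assert sample_spend(seed=42) == sample_spend(seed=42)

# A one-character change in the data produces a different fingerprint,
# so a stored digest reveals when "the same file" is no longer the same.
original = b"user_id,spend\n1,120.50\n2,80.00\n"
edited = b"user_id,spend\n1,120.50\n2,80.01\n"
assert fingerprint(original) != fingerprint(edited)
```

Storing the digest alongside the results is one lightweight way to version data; dedicated data-versioning tools go further, but the principle is the same.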

The second is modularity. The logic is broken into pieces that can be understood, tested, and changed independently. A 400-line notebook cell that loads data, cleans it, engineers features, trains a model, and produces a plot is doing five things. If the cleaning logic has a bug, you have to re-read the entire cell to find it.
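To make that concrete, here is a rough sketch of the same kind of work split into small named steps. It uses plain dictionaries instead of a DataFrame so it has no dependencies, and every function name (`load_rows`, `drop_inactive`, `add_spend_per_day`, `summarise`) is hypothetical.

```python
from statistics import median

Row = dict[str, float]


def load_rows() -> list[Row]:
    """Stand-in for reading a file; returns a small hard-coded sample."""
    return [
        {"spend": 250.0, "active_days": 10},
        {"spend": 40.0, "active_days": 0},
        {"spend": 300.0, "active_days": 30},
    ]


def drop_inactive(rows: list[Row]) -> list[Row]:
    """Cleaning: remove customers with no active days."""
    return [r for r in rows if r["active_days"] > 0]


def add_spend_per_day(rows: list[Row]) -> list[Row]:
    """Feature engineering: return new rows with spend_per_day added."""
    return [{**r, "spend_per_day": r["spend"] / r["active_days"]} for r in rows]


def summarise(rows: list[Row]) -> float:
    """Reporting: median spend per day across the rows given."""
    return median(r["spend_per_day"] for r in rows)


# Each step is a separate, named piece that can be read and checked alone.
rows = add_spend_per_day(drop_inactive(load_rows()))
print(summarise(rows))  # 17.5
```

If the cleaning rule turns out to be wrong, the only code to re-read is `drop_inactive`.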

The third is testability: a way to verify that each piece works correctly, automatically, without a human squinting at output. This is the practice data scientists most often push back on — we’ll spend a whole chapter on it — but it’s also the one with the highest return on investment.
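A small sketch of what "automatically" can look like: the checks below are written as functions whose names start with `test_`, the shape that pytest discovers and runs, but they are called by hand here so the block needs no test runner. The function `spend_per_day` is a hypothetical example, not part of any library.

```python
import math


def spend_per_day(spend: float, active_days: int) -> float:
    """Average daily spend; NaN when the customer has no active days."""
    if active_days == 0:
        return math.nan
    return spend / active_days


def test_typical_customer() -> None:
    assert spend_per_day(100.0, 4) == 25.0


def test_zero_active_days_gives_nan() -> None:
    # The edge case a human rarely remembers to check by eye.
    assert math.isnan(spend_per_day(100.0, 0))


# pytest would discover and run the test_ functions automatically;
# calling them by hand keeps this sketch free of extra dependencies.
test_typical_customer()
test_zero_active_days_gives_nan()
```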

The fourth is readability. Someone who didn’t write the code can understand it. This includes you, in six months. Variable names like df2, tmp, and final_final are a gift to no one.

Note: Data Science Bridge

If you’ve ever validated a model using a holdout set, you already understand the principle behind testing. A holdout set answers: “does this model work on data it hasn’t seen?” A test suite answers: “does this code work in situations the developer didn’t manually check?” Both catch problems that looked fine during development. But the expectations differ in one critical way. Model validation is probabilistic: you expect some degradation on unseen data and judge whether it’s acceptable. Software testing is deterministic: a test either passes or fails, and any failure is a defect to be fixed, not a trade-off to be weighed. This distinction matters because the “good enough” mindset that serves model validation well can undermine testing if carried across unchecked.

None of this requires abandoning notebooks. Notebooks are excellent for exploration — that’s what they’re designed for. The problem is using them as the final deliverable. The engineering mindset doesn’t replace exploration; it’s what happens after exploration, when you’ve found something worth keeping.

1.3 The cost of “it works on my machine”

The gap between a notebook that works and a system that works has real consequences. They’re familiar to every data scientist, even if the vocabulary is different.

The most immediate is the reproduction problem. Your colleague pulls your notebook and gets different results. Maybe they have a different version of pandas. Maybe they ran the cells in a different order. Maybe the data file changed since yesterday. You spend two hours on a video call saying “but it works for me” before discovering that scikit-learn 1.3 changed a default parameter.

Then there’s the handover problem. You’re moving to a new project, and someone else needs to take over your analysis. You give them the notebook and a five-minute explanation. Two weeks later they message you asking what MAGIC_NUMBER = 0.73 means (it was the threshold you tuned by hand during the July sprint).

Finally, there’s the production gap. The model works beautifully in a notebook. Now it needs to run every night on fresh data, serve predictions via an API, and alert someone when accuracy drops. The notebook can’t do any of that without being rewritten, and the rewrite introduces bugs because the implicit state and execution order aren’t preserved.

# The notebook way: global state, implicit dependencies
# (fragile — depends on prior cell execution)
THRESHOLD = 200
high_value = raw[raw["spend"] > THRESHOLD]

# The engineered way: explicit inputs, explicit outputs
# (robust — works regardless of what ran before)
def filter_high_value(customers: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Select customers whose spend exceeds the given threshold."""
    # .copy() prevents the caller's DataFrame from being modified —
    # another form of implicit shared state that pandas makes easy.
    return customers[customers["spend"] > threshold].copy()

# Same logic, but now it's testable, reusable, and self-documenting
result = filter_high_value(raw, threshold=200)
assert len(result) == len(high_value), "Refactored function should match original logic"
print(f"Both approaches select {len(result)} customers")
Both approaches select 1 customers

The refactored version does exactly the same thing. It’s a few lines longer. But it’s gained three properties the original lacked: it has a name that describes what it does, it declares its inputs explicitly, and it can be called from anywhere (a test, a pipeline, a different notebook) without depending on global state. The assert here is a rough sanity check, not a real test (we’ll introduce proper testing with pytest in “Testing stochastic code”), but it already demonstrates the principle: state your expectations explicitly so the computer can verify them.

Tip: Author’s Note

The resistance to this kind of refactoring is real, and it’s not laziness. Data science is fundamentally exploratory. You’re trying things, checking distributions, iterating fast. Wrapping every operation in a function feels like premature optimisation — and sometimes it is. There’s also a practical reality: in many organisations, time spent on engineering practices is invisible work that doesn’t show up in sprint demos or model performance metrics. The key insight is that engineering practices aren’t meant for the exploration phase. They’re for the moment you decide something is worth keeping. The notebook is a lab bench; the system is the published result. You don’t need to keep your lab bench tidy while you’re experimenting, but you do need to clean up before someone else relies on what you found.

1.4 State: the invisible dependency

The most treacherous property of notebooks is implicit state. In a Python script or module, the execution order is the file order — top to bottom. In a notebook, the execution order is whatever sequence you happened to run cells in. The kernel remembers everything, and the cell numbers in the margin are a record of when you ran each cell, not a guarantee of the order they should be run in.

# This code runs top-to-bottom here, but imagine re-executing
# Cell 1 AFTER Cell 2 has already computed its result — the
# kind of thing that happens constantly during exploration.

# Cell 1: set the training sample size
train_size = 1000

# Cell 2: compute the test ratio based on train_size
test_ratio = 200 / train_size  # 0.2 — 20% held out for testing

# Later, you decide to use a smaller dataset and re-run Cell 1:
train_size = 500

# But test_ratio is still 0.2, not the 0.4 you'd expect.
# The kernel remembers the old calculation.
print(f"train_size = {train_size}, test_ratio = {test_ratio}")
print(f"Expected test_ratio = {200 / train_size} if cells ran fresh, but got {test_ratio}")
print("This mismatch is the state problem in miniature.")
train_size = 500, test_ratio = 0.2
Expected test_ratio = 0.4 if cells ran fresh, but got 0.2
This mismatch is the state problem in miniature.

The output confirms the mismatch: train_size is 500, but test_ratio is still 0.2 (computed when train_size was 1000) rather than the 0.4 you’d expect. Every data scientist has been caught by this. You re-run a cell further up the notebook, which redefines a variable, but the cells below still hold the old value. The notebook looks correct — the cells are in a logical order — but the kernel state tells a different story.

Software engineers have a name for this class of problem: shared mutable state. In concurrent programming, it manifests as race conditions — two threads reading and writing the same variable. In notebooks, the mechanism is different but the root cause is identical: one piece of code silently changes what another piece depends on, and there’s nothing in the system to prevent it or flag the inconsistency. The engineering solution is always the same: make dependencies explicit and eliminate implicit shared state.
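Applied to the train_size example above, one possible fix is to derive the ratio inside a function, so the value is computed from its arguments every time it is needed. This is a sketch; the name `compute_test_ratio` and the default of 200 test rows are illustrative assumptions taken from that example.

```python
def compute_test_ratio(train_size: int, test_size: int = 200) -> float:
    """Ratio of test rows to training rows, computed from its arguments."""
    return test_size / train_size


# The ratio is derived at the point of use, so it cannot go stale
# when train_size changes.
assert compute_test_ratio(1000) == 0.2
assert compute_test_ratio(500) == 0.4
```

There is no stored `test_ratio` variable left for a re-run cell to invalidate.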

Note: Data Science Bridge

If you’ve ever dealt with data leakage in a machine learning pipeline — where information from the test set accidentally influences training — you already understand why implicit state is dangerous. The mechanisms differ: data leakage is about information flowing across a boundary that should be impermeable, while notebook state bugs are about execution order violating the reader’s assumptions. But the root cause is the same — implicit, invisible dependencies that let one part of the system corrupt another without anyone noticing until the results are wrong.

In practice, the most effective defence against state bugs is to prefer functions over global variables — if a piece of logic needs data, pass it as an argument rather than relying on whatever happens to be in scope. As a minimum bar, every notebook should produce correct results when you restart the kernel and run every cell in order; if it doesn’t, it has a state bug. And once you’ve settled on a cleaning step or feature engineering function, move it out of the notebook into a module and import it. The notebook becomes a thin orchestration layer, and the logic it depends on gains the benefits of explicit interfaces: type hints that document what goes in and out, docstrings that explain the intent, and a structure that testing tools can work with.

We’ll cover each of these practices in depth in later chapters. For now, the principle is what matters: visible structure beats invisible state.

1.5 The engineering contract

There is a contract that software engineers honour, usually without thinking about it. It goes something like this:

My code will run on your machine the same way it runs on mine. If something changes, the tests will catch it. If the tests don’t catch it, that’s a bug in the tests, not an excuse.

This is aspirational — no real codebase achieves it perfectly — but the aspiration matters. Data scientists don’t typically make this contract. It’s not because they don’t care about correctness — they do, deeply. It’s because the tools and workflows of data science were built for exploration, not for guarantees. Notebooks are designed for trying things. pip install grabs whatever version is newest today. File paths are hard-coded to wherever the data landed on your laptop.
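Two of those habits have simple counterweights, sketched below with the standard library only. The first function records which package versions are actually installed, which is the information a pinned requirements file would capture. The second builds data paths relative to a project root instead of one laptop's folder layout. The helper names (`installed_version`, `environment_report`, `data_path`) and the `data/` folder convention are assumptions for illustration.

```python
import sys
from importlib import metadata
from pathlib import Path


def installed_version(package: str) -> str:
    """Return the installed version of a package, or 'not installed'."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


def environment_report(packages: list[str]) -> dict[str, str]:
    """Collect the Python version and the versions of the named packages."""
    report = {"python": sys.version.split()[0]}
    for name in packages:
        report[name] = installed_version(name)
    return report


def data_path(project_root: Path, filename: str) -> Path:
    """Build a data file path relative to the project, not to one laptop."""
    return project_root / "data" / filename


# The output depends on the machine, which is exactly the point:
# saving it next to your results documents the environment that made them.
print(environment_report(["pandas", "numpy"]))
print(data_path(Path("."), "customers_q3.csv"))
```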

This book is about learning to make that contract — not by abandoning the exploratory tools that make data science productive, but by adding the engineering scaffolding that makes the results trustworthy. The notebook is where you discover that the median is a better metric than the mean. Version control is how you record that decision. A test is how you ensure it’s still true next month. A reproducible environment is how your colleague can verify it without spending a day debugging import errors. The companion volume, Thinking in Uncertainty, covers the opposite journey, helping software engineers develop the exploratory and statistical thinking that data scientists bring naturally.

The contract isn’t all-or-nothing. Not every analysis needs a continuous integration pipeline. Not every notebook needs to become a Python package. The skill is knowing when the investment pays off, and it almost always pays off sooner than data scientists expect.

1.6 Summary

The gap between a notebook that works and a system that works comes down to four ideas:

  1. Reproducibility requires deliberate effort — pinned dependencies, fixed seeds, versioned data. “It works on my machine” is not a guarantee; it’s a warning.

  2. Implicit state is the root of most notebook bugs — cells that depend on execution order, global variables that get silently redefined. Make dependencies explicit.

  3. Engineering practices complement exploration — they’re not a replacement for notebooks, but what happens after you’ve found something worth keeping.

  4. The investment pays off early — a function with a name is easier to debug than a cell with a comment. A test that runs automatically catches bugs that manual checking misses. The payoff isn’t theoretical; it’s saving you time next week.

In the next chapter, we’ll start with the most fundamental engineering practice: version control.

1.7 Exercises

  1. Take a notebook from one of your own projects and run “Restart Kernel and Run All.” Does it complete without errors? If not, identify the cells that fail. Common culprits: variables used before they’re defined in top-to-bottom order, cells that depend on data loaded or transformed in a non-adjacent cell, or imports that were added mid-notebook after the cell that needs them.

  2. Choose a single data transformation from one of your notebooks (a cleaning step, a feature engineering function, or a filtering operation). Extract it into a standalone function with explicit inputs and outputs. Write a simple assert statement that verifies it produces the expected result on a small test input.

  3. The chapter describes four properties that distinguish a system from a script: reproducibility, modularity, testability, and readability. Pick a notebook you’ve written recently and score it (honestly) on each property from 1 to 5, where 1 means the property is entirely absent (e.g., for testability: no way to verify output without manual inspection) and 5 means it’s fully achieved (e.g., every transformation has an automated check). Which property scores lowest? Why?

  4. Conceptual: The “Data Science Bridge” callout in this chapter compares a test suite to a holdout set. Identify two ways this analogy holds and two ways it breaks down. What does each breakdown tell you about the differences between validating a model and validating code?

  5. Find a notebook written by a colleague (with their permission) and attempt to run it from scratch, without any verbal explanation. Document every assumption you had to discover — file paths, environment setup, execution order, magic numbers — and note which of the chapter’s four system properties would have made the process easier.