6 Functions, modules, packages

6.1 The function that lives in three notebooks

You wrote a good cleaning function once. It parses the awkward date column, fixes the known data-entry quirk, and returns a tidy frame. It worked, so when the next analysis needed the same cleaning, you copied it across. And the one after that. Now it lives in three notebooks, and last week you found a bug in it — a timezone it mishandles — and fixed it in the notebook you happened to have open. The other two still have the bug. You won’t find out until their results quietly disagree.

This is duplication, and it’s one of the most reliable sources of error in data science work, precisely because copy-paste is so frictionless. The engineering answer is a single source of truth: write the logic once, in one place, and import it everywhere it’s needed, so that a fix applied once is a fix applied everywhere. Getting there is a progression — from functions, to modules, to packages — and each step buys a little more reusability for a little more structure.

6.2 Functions, properly

We’ve leaned on functions since Chapter 1, but it’s worth stating what makes a function good for reuse, because not every function qualifies. The key property is being pure: its output depends only on its inputs, it has no hidden dependencies on global state, and it doesn’t quietly change things elsewhere. A pure function is a contract — give it the same arguments and it returns the same result, anywhere, regardless of what ran before it.

The opposite is the function that reaches out to a global variable, which is the state bug from Chapter 1 wearing a different hat:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
customers = pd.DataFrame({"spend": rng.exponential(50, 200)})

# Fragile: depends on a global that something else might change.
THRESHOLD = 200
def count_high_value_fragile():
    return int((customers["spend"] > THRESHOLD).sum())

# Pure: everything it needs comes in as an argument, so it gives the
# same answer anywhere and can be tested in isolation.
def count_high_value(customers: pd.DataFrame, threshold: float) -> int:
    return int((customers["spend"] > threshold).sum())

before = count_high_value_fragile()
THRESHOLD = 100                       # something elsewhere changes the global...
after = count_high_value_fragile()    # ...and the "same" call now answers differently

print(f"Fragile function, threshold silently changed: {before} then {after}")
print(f"Pure function, explicit threshold:            {count_high_value(customers, 200)}")

Fragile function, threshold silently changed: 2 then 27
Pure function, explicit threshold:            2

The fragile function returns two different answers from identical calls, because its real input — the global THRESHOLD — is invisible at the call site. The pure version can’t do that: everything it depends on is named in its signature. This is what makes a function safe to reuse and, as we’ll see in the next chapter, possible to test at all.
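The example above covers hidden inputs. The other half of the definition, not quietly changing things elsewhere, is just as easy to break with pandas. Here is a minimal sketch of the difference, assuming two hypothetical function names (cap_spend_inplace and cap_spend) and made-up data that are not part of the chapter’s project:

```python
import pandas as pd


def cap_spend_inplace(customers: pd.DataFrame, cap: float) -> pd.DataFrame:
    # Impure: overwrites a column on the caller's frame as a side effect.
    customers["spend"] = customers["spend"].clip(upper=cap)
    return customers


def cap_spend(customers: pd.DataFrame, cap: float) -> pd.DataFrame:
    # Pure: assign() builds a new frame; the caller's frame is left untouched.
    return customers.assign(spend=customers["spend"].clip(upper=cap))


raw = pd.DataFrame({"spend": [120.0, 300.0, 950.0]})

capped = cap_spend(raw, cap=500.0)
max_after_pure = raw["spend"].max()      # raw still holds the 950.0 outlier

cap_spend_inplace(raw, cap=500.0)
max_after_inplace = raw["spend"].max()   # raw itself has now been changed

print(f"Max spend in raw after pure call:     {max_after_pure}")
print(f"Max spend in raw after in-place call: {max_after_inplace}")
```

Both functions return the same capped values. The difference is what happens to the frame the caller passed in, which matters as soon as a later cell assumes raw is still raw.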

6.3 From cells to modules

Once a function is pure and worth keeping, it should leave the notebook. A module is just a .py file containing Python definitions, and importing one is how you reuse its contents. Moving your kept logic into modules realises the “thin notebook, thick module” pattern from Chapter 1: the notebook becomes a thin orchestration layer that imports functions and strings them together, while the logic those functions contain lives in version-controlled, reviewable, testable .py files.

The mechanics are familiar, because you do it with other people’s code constantly. A file features.py containing your feature functions, saved in the same folder as the notebook, is imported exactly as any library is:

# In the notebook — the logic lives elsewhere, the notebook just uses it
from features import add_spend_per_day, filter_high_value

high_value = filter_high_value(customers, threshold=200)
enriched = add_spend_per_day(high_value)

Related functions group naturally into modules by responsibility — data.py for loading and cleaning, features.py for feature engineering, models.py for training and evaluation. The grouping is itself documentation: someone new can guess where the cleaning logic lives without reading every file.
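The chapter does not show what features.py contains, so here is one hedged guess. Only the two function names come from the import above. The function bodies and the days_active column are assumptions made for illustration.

```python
# features.py (hypothetical contents)
import pandas as pd


def filter_high_value(customers: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Return only the customers whose spend exceeds `threshold`."""
    return customers[customers["spend"] > threshold].copy()


def add_spend_per_day(customers: pd.DataFrame) -> pd.DataFrame:
    """Return a new frame with a `spend_per_day` column added.

    Assumes a `days_active` column, which is not in the chapter's data.
    """
    return customers.assign(
        spend_per_day=customers["spend"] / customers["days_active"]
    )


if __name__ == "__main__":
    # Quick smoke check on made-up rows; a notebook would import instead.
    demo = pd.DataFrame(
        {"spend": [50.0, 250.0, 400.0], "days_active": [10, 50, 100]}
    )
    print(add_spend_per_day(filter_high_value(demo, threshold=200)))
```

Both functions take everything they need as arguments and return new frames, so they meet the purity bar from the previous section and can move between notebooks without surprises.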

Data Science Bridge

A module of feature functions is a reusable feature library, and importing it is exactly the move you make every day with from sklearn.preprocessing import StandardScaler. The only difference is that now you are the library author: your from features import add_spend_per_day is the same mechanism, pointed at code you wrote. Once you see your own modules this way, the value is obvious — scikit-learn doesn’t ask you to copy its source into every notebook, and neither should your own feature code.

Where the analogy breaks down: scikit-learn is a public library with a stable API, versioning, and deprecation warnings, because thousands of people depend on it not changing under them. Your internal module has exactly one user — you — until it doesn’t, so you can refactor it freely. The library-author disciplines (stable interfaces, careful deprecation) only start to matter the moment a colleague imports your code, which is worth remembering both ways: don’t over-engineer a module only you use, but do tighten up the moment someone else depends on it.
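To make “careful deprecation” concrete: one common approach, sketched here rather than prescribed, is to keep the old name as a thin wrapper that warns before delegating to the new one. The old name get_big_spenders is hypothetical, and plain lists stand in for DataFrames to keep the example dependency-free.

```python
import warnings


def filter_high_value(spends: list[float], threshold: float) -> list[float]:
    """Current name: keep the values above `threshold`."""
    return [s for s in spends if s > threshold]


def get_big_spenders(spends: list[float], threshold: float) -> list[float]:
    """Old name (hypothetical), kept so a colleague's imports don't break."""
    warnings.warn(
        "get_big_spenders is deprecated; use filter_high_value instead",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller's line, not this one
    )
    return filter_high_value(spends, threshold)


# A colleague's old call still works, but now announces that it should change.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = get_big_spenders([50.0, 250.0, 400.0], threshold=200)

print(result)
print(caught[0].category.__name__, "-", caught[0].message)
```

For a module only you use, this is overhead you can skip by simply renaming the function. It earns its keep once someone else’s notebook imports the old name.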

6.4 From modules to packages

A handful of modules in one folder works until you try to import them from somewhere else — another notebook, a test file, a script on a server — and hit ModuleNotFoundError, or paper over it with sys.path hacks that work on your machine and nowhere else. A package solves this properly. A package is a directory of modules made installable, so that your code can be imported by name from anywhere in the environment, just like any third-party library.
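The failure and the workaround described above are easy to reproduce with the standard library alone. This sketch writes a throwaway module (the name features_demo_ch6 is made up) into a temporary folder that stands in for your project directory, then tries to import it from outside that folder:

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical module name, chosen to be unlikely to clash with anything installed.
MODULE_NAME = "features_demo_ch6"

with tempfile.TemporaryDirectory() as project_dir:
    # A module sitting in a project folder that Python doesn't know about.
    Path(project_dir, f"{MODULE_NAME}.py").write_text(
        "def double(x):\n    return 2 * x\n"
    )

    # 1. From "somewhere else" the import fails: the folder isn't on sys.path.
    try:
        importlib.import_module(MODULE_NAME)
        found_without_hack = True
    except ModuleNotFoundError:
        found_without_hack = False

    # 2. The hack: push the folder's location into sys.path by hand.
    sys.path.insert(0, project_dir)
    importlib.invalidate_caches()
    module = importlib.import_module(MODULE_NAME)
    result = module.double(21)

    # Clean up so the demo leaves no trace in this interpreter.
    sys.path.remove(project_dir)
    sys.modules.pop(MODULE_NAME, None)

print(f"Importable without the hack: {found_without_hack}")
print(f"Importable with the hack:    double(21) = {result}")
```

The second import succeeds only because this process was handed the exact folder. A colleague’s laptop, a CI runner, or a server has a different layout, which is why the hack does not travel.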

The modern layout puts your package under a src/ directory and describes it with a pyproject.toml:

customer-value/
├── pyproject.toml          # describes the package and its dependencies
├── src/
│   └── customer_value/
│       ├── __init__.py     # marks this directory as a package
│       ├── data.py
│       └── features.py
├── notebooks/
│   └── exploration.ipynb
└── tests/
    └── test_features.py

A minimal pyproject.toml needs only a name, a version, and your dependencies, plus a [build-system] table that tells pip which tool builds the package:

[project]
name = "customer-value"
version = "0.1.0"
dependencies = ["pandas>=2.2", "numpy>=1.26"]

[build-system]
requires = ["setuptools>=64"]
build-backend = "setuptools.build_meta"

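You then install it into your environment in editable mode, which links the package to your source directory rather than copying it, so edits to the source are picked up the next time the package is imported, with no reinstall (a notebook that has already imported it needs a kernel restart, or IPython’s autoreload extension, to see the change):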

pip install -e .

From that point, from customer_value.features import add_spend_per_day works in every notebook, script, and test in the environment, with no path manipulation. Your analysis code has become a proper library — the same status as the tools you import without a second thought. (This is also the structure the rest of the book assumes; we return to the full project layout in Project structure.)
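A quick way to confirm that the install worked, and that Python is picking up your source tree rather than a stale copy, is to ask where a package would be imported from. Here is a small sketch using the standard library’s importlib.util.find_spec. The helper name import_origin is an assumption for illustration, and customer_value will report as not importable unless you have actually created and installed the package above.

```python
from importlib.util import find_spec
from typing import Optional


def import_origin(package_name: str) -> Optional[str]:
    """Return the file a top-level package would load from, or None if absent."""
    spec = find_spec(package_name)
    return None if spec is None else spec.origin


for name in ("json", "customer_value"):
    origin = import_origin(name)
    print(f"{name}: {origin if origin else 'not importable in this environment'}")
```

After an editable install, the path reported for customer_value should point into your project’s src/ directory. If it points somewhere else, or nowhere, the notebook’s kernel is probably running in a different environment from the one you installed into.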

Author’s Note

Notebooks quietly discourage this entire progression, and it’s worth being honest about why, because the friction is real rather than imagined. In a notebook there is no natural home for a function — only cells — and the moment you decide to move logic into a .py file, you have to leave the notebook for an editor, breaking the tight write-run-inspect loop that makes exploration feel fast. So the function stays in the cell, then gets copied to the next notebook, and the duplication compounds.

The reframe is to see the package as the place where logic goes to be trusted. Code in a cell is provisional — it works here, now, for you. Code imported from a package is something you and others can rely on: a bug fixed once is fixed everywhere it’s used, the function can be tested (the subject of the next chapter), and its interface is explicit enough that a colleague can use it without reading the implementation. The editor friction is a one-time cost paid when the code graduates; the payoff — no more hunting down which of three copies has the fix — compounds for as long as the code lives.

6.5 Summary

Reuse is a progression from copy-paste to a single source of truth:

1. Duplication is a bug factory. Copy-pasted logic drifts out of sync the first time you fix one copy and not the others. The fix is to write it once and import it.
2. Pure functions are the reusable unit. A function whose output depends only on its inputs gives the same answer anywhere and, unlike one that leans on global state, can be tested in isolation.
3. Kept logic belongs in modules. Move it out of cells into .py files grouped by responsibility; the notebook becomes a thin layer that imports and orchestrates.
4. Packages make your code importable anywhere. A src/ layout with a pyproject.toml, installed with pip install -e ., turns your analysis code into a library you can import without path hacks.

With code that is readable and reusable, the next chapter tackles the practice that pure, importable functions finally make possible — and that data scientists resist most: testing stochastic code.

6.6 Exercises

1. Find a function you’ve copy-pasted across notebooks or projects. Extract it into a single .py module and import it everywhere that used a copy. Now introduce a small change (a new edge case it should handle). How many places did you have to change it, compared with before?
2. Take a piece of notebook code that relies on a global variable and convert it into a pure function that receives its inputs as arguments and returns a result. Confirm it gives the same answer regardless of what cells ran before it.
3. Take a project with a few modules and make it installable: write a minimal pyproject.toml, run pip install -e ., and then import your package from a fresh notebook or script with no sys.path manipulation. What error were you previously working around?
4. Conceptual: The Data Science Bridge says importing your own module is like from sklearn import ...: you’re now the library author. Name one discipline library authors follow that you do not need for code only you use, and one you should adopt the moment a colleague depends on it.
5. Conceptual: Not everything belongs in a package. Describe a piece of code that should stay in the notebook and one that has earned a place in a module, and state the signal that tells you logic has crossed from the first category to the second.

Content: CC BY-NC-SA 4.0 | Code: MIT - see /LICENSE.md