9 Project structure

9.1 Forty notebooks in a folder

Picture the project folder a few months into a piece of work. analysis.ipynb, analysis_2024_03_14.ipynb, analysis_final.ipynb, Untitled7.ipynb. Alongside them: data.csv, data_clean.csv, data_clean_v2.csv, a model.pkl, three PNGs, a scratch.py, and a requirements.txt that may or may not be current. Everything lives at the top level, in the order it was created, with nothing to say what depends on what or which file is the one that matters.

The folder works, in the sense that you can still — just about — find things. But a newcomer can’t, and neither can you after a holiday. There’s no signal about where the data loading lives, which notebook produces the model, whether data_clean_v2.csv is an input or an output, or how to run the thing end to end. Structure is the cheapest fix for all of this: a small number of conventions about where each kind of thing belongs, so that the layout itself answers those questions before anyone has to ask.

This is the same move we made inside the code in Part 2 — names and modules so a reader can navigate the logic — applied one level up, to the project as a whole.

9.2 A place for everything

The layout below is widely used for data science projects; the specifics matter less than the principle behind them:

customer-value/
├── README.md               # what this is, how to set up, how to run
├── pyproject.toml          # the installable package (Chapter 6)
├── requirements.txt        # the locked environment (Chapter 3)
├── data/
│   ├── raw/                # immutable inputs — never edited in place
│   ├── interim/            # intermediate, regenerable artefacts
│   └── processed/          # final, model-ready data
├── src/
│   └── customer_value/     # the importable package: data.py, features.py, …
├── notebooks/              # exploration that imports from src/
├── tests/                  # the test suite (Chapter 7)
├── models/                 # trained model artefacts
└── configs/                # configuration files (Chapter 11)

The organising principle is to separate things by kind and lifecycle. Code is authored and version-controlled; it lives in src/ and tests/. Data is received or generated and stays out of Git (Chapter 2); it lives under data/. Outputs like trained models are generated and regenerable; they live in models/. Configuration is authored; it lives in configs/. Each kind has one home, so “where does the cleaning logic live?” has an answer you can guess without being told.
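The layout can be scaffolded in a few lines. What follows is a minimal sketch rather than a standard tool: the folder names come from the tree above, while the `scaffold` helper, the `LAYOUT` list, and the `package` argument are hypothetical names introduced for illustration.

```python
from pathlib import Path
import tempfile

# Folder names taken from the layout above; the package folder is added separately.
LAYOUT = [
    "data/raw",
    "data/interim",
    "data/processed",
    "notebooks",
    "tests",
    "models",
    "configs",
]


def scaffold(root: Path, package: str) -> list[Path]:
    """Create the standard folders under `root` and return them (hypothetical helper)."""
    created = []
    for relative in [*LAYOUT, f"src/{package}"]:
        folder = root / relative
        # parents=True builds intermediate folders; exist_ok=True makes reruns safe.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    # An empty __init__.py marks src/<package>/ as an importable package.
    (root / "src" / package / "__init__.py").touch()
    return created


if __name__ == "__main__":
    # Demonstrated in a throwaway directory so nothing real is touched.
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp) / "customer-value"
        for folder in scaffold(project, "customer_value"):
            print(folder.relative_to(project))
```

Because `mkdir` is called with `exist_ok=True`, running the sketch against an existing project only fills in the folders that are missing; it never moves or deletes anything.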

Note: Data Science Bridge

A project layout is a schema for your work. You already know that data without a schema is unusable — tidy data gives each variable a column and each observation a row precisely so that anyone (and any tool) can navigate it without a guided tour. A project structure does the same thing for files: a known place for each kind of thing means the structure itself tells you where to look, exactly as column names tell you where a variable lives.

Where the analogy breaks down: a DataFrame’s schema is enforced — the columns are there, with those types, or the code fails. A directory convention is enforced by nothing but discipline (and, at best, a project template or linter). It will drift the moment someone drops a stray CSV at the top level, which is why the convention has to be maintained rather than assumed. The structure is a schema you have to keep honest yourself.
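One way to keep the convention honest is a small check that flags files sitting at the top level when they probably belong in a subfolder. The sketch below is an illustration under stated assumptions, not an existing linter: `find_stray_files` and the `STRAY_SUFFIXES` set are hypothetical, and the choice of suffixes is a guess at what counts as data or generated output in a given project.

```python
from pathlib import Path
import tempfile

# Assumption: files with these suffixes belong under data/, models/ or notebooks/,
# never directly in the project root. Adjust the set to suit the project.
STRAY_SUFFIXES = {".csv", ".parquet", ".pkl", ".png", ".ipynb"}


def find_stray_files(root: Path) -> list[Path]:
    """Return top-level files whose suffix suggests they belong in a subfolder."""
    return sorted(
        entry
        for entry in root.iterdir()
        if entry.is_file() and entry.suffix.lower() in STRAY_SUFFIXES
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "README.md").touch()          # allowed at the top level
        (root / "data").mkdir()               # a folder, so it is ignored
        (root / "data_clean_v2.csv").touch()  # the stray CSV from the text
        for stray in find_stray_files(root):
            print(f"misplaced: {stray.name}")
```

In this demonstration only `data_clean_v2.csv` is reported. Run from a test or a pre-commit step, a check like this turns the convention from something assumed into something verified.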

9.3 Raw data is sacred

The subdivision of data/ encodes a rule worth stating on its own: raw data is immutable. The files in data/raw/ are the inputs your work begins from, and you never edit them in place — not to fix a typo, not to drop a bad row, not to rename a column. Every cleaning and transformation step reads from raw and writes a new file under interim/ or processed/, leaving the original untouched.

This is the same instinct as not mutating a function’s arguments (Chapter 6), scaled up to files, and it buys the same thing: when a result looks wrong, you can always go back to the unaltered source and replay the transformations to find where it diverged. If you’d edited the raw file, that ground truth would be gone. It also makes the whole pipeline reproducible: interim/ and processed/ are derived artefacts that can be regenerated from raw at any time, which is exactly why they, like the installed environment, don’t need to live in version control.
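As a rough sketch of the rule in code: a cleaning step opens the raw file for reading only and writes its result to a new file elsewhere. The `clean_customers` function, the `customer_id` and `spend` columns, and the sample rows are all hypothetical, invented for this illustration; only the read-from-raw, write-to-processed shape comes from the text.

```python
import csv
import stat
import tempfile
from pathlib import Path


def clean_customers(raw_file: Path, processed_file: Path) -> int:
    """Read raw rows, drop those with a blank customer_id, write a NEW file.

    Returns the number of rows kept. The raw file is only ever opened for reading.
    """
    with raw_file.open(newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames or []
        kept = [row for row in reader if row["customer_id"].strip()]

    processed_file.parent.mkdir(parents=True, exist_ok=True)
    with processed_file.open("w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(kept)
    return len(kept)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "data" / "raw" / "customers.csv"
        raw.parent.mkdir(parents=True)
        raw.write_text("customer_id,spend\n1,10\n,99\n2,20\n")
        # Remove write permission so an accidental in-place edit fails loudly.
        raw.chmod(stat.S_IREAD | stat.S_IRGRP | stat.S_IROTH)

        processed = Path(tmp) / "data" / "processed" / "customers.csv"
        print(clean_customers(raw, processed), "rows kept")

        # Restore write permission so the temporary folder can be removed everywhere.
        raw.chmod(stat.S_IREAD | stat.S_IWRITE)
```

The bad row is dropped in the processed copy while the raw file still contains it, so the step can be replayed or corrected later. Marking the raw file read-only is a guard against accidents, not a security measure.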

9.4 The thin notebook, revisited

Good structure is what finally makes the “thin notebook, thick module” pattern from Chapters 1 and 6 real. Notebooks live in notebooks/ and import their logic from the src/ package; they orchestrate and explore, while the functions they call are version-controlled, tested, and reusable. The README in the root orients a newcomer in minutes — what the project does, how to build the environment, how to run the pipeline — turning a week of archaeology into an afternoon.

One small practice ties the layout together: derive every file location from a single project root rather than hard-coding paths. A data.csv referenced as /Users/you/Desktop/project/data/raw/data.csv works only on your machine; the same path built relative to the project root works everywhere.

from pathlib import Path

# One source of truth for where things live — usually a small `paths.py`
# in the package. Every other module imports these instead of hard-coding.
PROJECT_ROOT = Path("/work/customer-value")   # fixed here for illustration; in practice resolve it once, e.g. from __file__
DATA_RAW = PROJECT_ROOT / "data" / "raw"
DATA_PROCESSED = PROJECT_ROOT / "data" / "processed"
MODELS = PROJECT_ROOT / "models"

# Code refers to these names, never to a literal path string.
customers_file = DATA_RAW / "customers.csv"
model_file = MODELS / "churn_model.pkl"

print(customers_file)
print(model_file)
print("Same layout on any machine; only PROJECT_ROOT changes.")

/work/customer-value/data/raw/customers.csv
/work/customer-value/models/churn_model.pkl
Same layout on any machine; only PROJECT_ROOT changes.

Because every path descends from PROJECT_ROOT, moving the project — to a colleague’s laptop, a server, a container — changes exactly one line, and nothing in the codebase contains a path that only exists on your machine. The structure and the paths reinforce each other: a known layout, addressed from a single root.
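One way to resolve the root without writing any machine-specific path is to walk upward from the current file until a marker file appears. This is a sketch under assumptions: `find_project_root` is a hypothetical helper, and using `pyproject.toml` as the marker assumes it sits in the project root, as it does in the layout shown earlier.

```python
from pathlib import Path
import tempfile


def find_project_root(start: Path, marker: str = "pyproject.toml") -> Path:
    """Walk upward from `start` until a folder containing `marker` is found."""
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No {marker} found at or above {start}")


# In a real paths.py this would read:
#     PROJECT_ROOT = find_project_root(Path(__file__))

if __name__ == "__main__":
    # Build a throwaway project so the search can be demonstrated anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp) / "customer-value"
        package = project / "src" / "customer_value"
        package.mkdir(parents=True)
        (project / "pyproject.toml").touch()

        root = find_project_root(package / "paths.py")
        print(root.name)
        print((root / "data" / "raw").relative_to(root))
```

With the root found this way, not even `PROJECT_ROOT` needs editing when the project moves. The trade-off is that the search depends on the marker file staying where it is.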

Tip: Author’s Note

Notebooks sprawl because nothing pushes back. Creating another one costs nothing, naming it carefully costs a moment you don’t have mid-analysis, and so the folder accretes — _v2, _final, _actual — until the current version is whichever one you most recently had open. The mess isn’t a personal failing; it’s the default outcome of a workflow with no structural pressure toward order.

The reframe is to see structure as a message to whoever opens the project next — a colleague, or you in six months with no memory of any of it. A repository someone can open and immediately understand — data here, code there, run it like this — lets them contribute in an hour instead of a week, and lets you return to the work without re-deriving how it fits together. The specific convention barely matters; a team that agrees on any sensible layout is far better off than one where every project is arranged differently. What matters is the predictability: that the answer to “where is the X?” is the same every time.

9.5 Summary

Structure turns a folder of files into a navigable project:

  1. A flat folder doesn’t scale. Without conventions, no one — including you — can tell inputs from outputs or find the current version.

  2. Separate by kind and lifecycle. Code in src/ and tests/, data under data/, generated artefacts in models/, config in configs/ — each kind with one home, so the layout answers “where is the X?”.

  3. Raw data is immutable. Never edit inputs in place; derive interim/ and processed/ from them, so you can always replay transformations from an untouched source.

  4. Address everything from one root. Derive file paths from a single project root so the project runs anywhere, and let notebooks stay thin by importing logic from the package.

With a place for everything, the next chapter turns the monolithic analysis cell into something that runs in that structure: data pipelines.

9.6 Exercises

  1. Take a flat project of your own and reorganise it into the standard layout — src/, data/raw and data/processed, notebooks/, tests/. Move the raw inputs into data/raw and make them read-only. In the process, did you find anything that was being edited in place, and where should its output have gone instead?

  2. Write a README that orients a newcomer: what the project does, how to set up the environment, how to run the pipeline end to end, and where the data comes from. Hand it to a colleague and note every question they still have to ask — each is a gap in either the README or the structure.

  3. Introduce a single source of truth for file locations: a paths.py (or a config) that derives every path from one project root, and replace at least one hard-coded absolute path with a reference to it. What would have broken if a colleague had run the original on their machine?

  4. Conceptual: The Data Science Bridge compares a project layout to tidy data or a schema. Give one way the analogy holds and one way it breaks down. What enforces a DataFrame’s schema that does not enforce a directory convention, and what follows from that difference?

  5. Conceptual: A full project structure can be overkill. Describe a piece of work that should stay a single notebook in a single folder, and name the specific signal that tells you it has earned the scaffolding of src/, tests/, and a structured data/.