9 Project structure

9.1 Forty notebooks in a folder

Picture the project folder a few months into a piece of work. analysis.ipynb, analysis_2024_03_14.ipynb, analysis_final.ipynb, Untitled7.ipynb. Alongside them: data.csv, data_clean.csv, data_clean_v2.csv, a model.pkl, three PNGs, a scratch.py, and a requirements.txt that may or may not be current. Everything lives at the top level, in the order it was created, with nothing to say what depends on what or which file is the one that matters.

The folder works, in the sense that you can still — just about — find things. But a newcomer can’t, and neither can you after a holiday. There’s no signal about where the data loading lives, which notebook produces the model, whether data_clean_v2.csv is an input or an output, or how to run the thing end to end. Structure is the cheapest fix for all of this: a small number of conventions about where each kind of thing belongs, so that the layout itself answers those questions before anyone has to ask.

This is the same move we made inside the code in Part 2 — names and modules so a reader can navigate the logic — applied one level up, to the project as a whole.

9.2 A place for everything

A widely used layout for a data science project looks like this, and the specifics matter less than the principle behind them:

customer-value/
├── README.md               # what this is, how to set up, how to run
├── pyproject.toml          # the installable package (Chapter 6)
├── requirements.txt        # the locked environment (Chapter 3)
├── data/
│   ├── raw/                # immutable inputs — never edited in place
│   ├── interim/            # intermediate, regenerable artefacts
│   └── processed/          # final, model-ready data
├── src/
│   └── customer_value/     # the importable package: data.py, features.py, …
├── notebooks/              # exploration that imports from src/
├── tests/                  # the test suite (Chapter 7)
├── models/                 # trained model artefacts
└── configs/                # configuration files (Chapter 11)

The organising principle is to separate things by kind and lifecycle. Code is authored and version-controlled; it lives in src/ and tests/. Data is received or generated and stays out of Git (Chapter 2); it lives under data/. Outputs like trained models are generated and regenerable; they live in models/. Configuration is authored; it lives in configs/. Each kind has one home, so “where does the cleaning logic live?” has an answer you can guess without being told.
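As a rough illustration, the directories in the tree above can be created with a few lines of pathlib. This is only a sketch: the function name `scaffold` and the `LAYOUT` list are assumptions made for the example, not part of any tool, and the demonstration runs in a temporary directory so nothing real is touched.

```python
import tempfile
from pathlib import Path

# Directory names taken from the tree above; the list itself is illustrative.
LAYOUT = [
    "data/raw",
    "data/interim",
    "data/processed",
    "src/customer_value",
    "notebooks",
    "tests",
    "models",
    "configs",
]


def scaffold(root: Path) -> list[Path]:
    """Create the layout's directories under `root` and return them."""
    created = []
    for relative in LAYOUT:
        directory = root / relative
        # parents=True builds data/ before data/raw; exist_ok makes reruns safe.
        directory.mkdir(parents=True, exist_ok=True)
        created.append(directory)
    return created


# Demonstrate in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "customer-value"
    for directory in scaffold(project):
        print(directory.relative_to(project).as_posix())
```

Running it twice is harmless, which matters if the same script is used to top up an existing project with a directory it was missing.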

Data Science Bridge

A project layout is a schema for your work. You already know that data without a schema is unusable — tidy data gives each variable a column and each observation a row precisely so that anyone (and any tool) can navigate it without a guided tour. A project structure does the same thing for files: a known place for each kind of thing means the structure itself tells you where to look, exactly as column names tell you where a variable lives.

Where the analogy breaks down: a DataFrame’s schema is enforced — the columns are there, with those types, or the code fails. A directory convention is enforced by nothing but discipline (and, at best, a project template or linter). It will drift the moment someone drops a stray CSV at the top level, which is why the convention has to be maintained rather than assumed. The structure is a schema you have to keep honest yourself.
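One way to keep that schema honest is a small check that lists anything at the top level the convention does not name. The sketch below assumes an allow-list matching the layout shown earlier; the names `ALLOWED_TOP_LEVEL` and `stray_entries` are hypothetical, and a real project would adjust the list to its own tree.

```python
import tempfile
from pathlib import Path

# Top-level entries the convention allows; an illustrative allow-list.
ALLOWED_TOP_LEVEL = {
    "README.md", "pyproject.toml", "requirements.txt",
    "data", "src", "notebooks", "tests", "models", "configs",
}


def stray_entries(root: Path) -> list[str]:
    """Return top-level names that the convention does not list."""
    return sorted(
        entry.name
        for entry in root.iterdir()
        # Hidden entries such as .git or .gitignore are ignored here.
        if entry.name not in ALLOWED_TOP_LEVEL and not entry.name.startswith(".")
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "data").mkdir()
    (root / "README.md").write_text("# customer-value\n")
    (root / "data_clean_v2.csv").write_text("id,value\n")  # the stray CSV
    print(stray_entries(root))  # → ['data_clean_v2.csv']
```

A check like this could run in a test or a pre-commit hook, so drift is reported when it happens rather than discovered months later.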

9.3 The layout bends to the shape of the project

That layout is a starting point rather than a standard, and it is worth being explicit about what pulls it out of shape, because most projects are one of three kinds and each weights the directories differently.

An analysis delivers a finding. The deliverable is a number, a chart, or a report, and the question a reader will ask is how did you get this. Here notebooks/ and data/ carry the weight, src/ holds only the handful of functions the notebooks import, and the project earns an extra reports/ directory for the rendered output. Because the sequence is part of the answer, numbering the notebooks (01-explore.ipynb, 02-model.ipynb) does real work: it records the order in which the argument was built.

A library delivers the code itself, to be imported by projects you may never see. Now src/ and tests/ dominate, data/ shrinks to a few small fixtures living inside tests/, and a docs/ directory appears because the users are developers reading about functions rather than analysts reading about customers. The distinction that matters most is which parts of the package are public — the names other people may import and that you therefore cannot rename freely — and which are internal.

A service delivers a running process. Its src/ is joined by a Dockerfile and a configs/ directory with one file per environment (Chapter 11 and Chapter 14), and the trained model inverts its role entirely: it is no longer an output the project produces but an input the project loads, often built by a separate training repository and pulled in as a versioned artefact.
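As a minimal sketch of "one file per environment", a service might map an environment name to a file under configs/. The environment names, the `.toml` extension, and the function name `config_file_for` are all assumptions for illustration; in practice the name would typically come from an environment variable set at deployment.

```python
from pathlib import Path

CONFIGS = Path("configs")  # relative to the project root

# Hypothetical environment names; a real service defines its own.
KNOWN_ENVIRONMENTS = {"dev", "staging", "prod"}


def config_file_for(environment: str) -> Path:
    """Map an environment name to its file, e.g. 'staging' -> configs/staging.toml."""
    if environment not in KNOWN_ENVIRONMENTS:
        # Failing early beats silently loading the wrong settings.
        raise ValueError(f"Unknown environment: {environment!r}")
    return CONFIGS / f"{environment}.toml"


print(config_file_for("staging").as_posix())  # → configs/staging.toml
```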

Most data science work is a mixture, and the useful discipline is not picking a pure form but naming which one the deliverable is. A project that is really an analysis but has been dressed as a service accumulates scaffolding nobody uses; a service that is still arranged as an analysis has its production logic sitting in a notebook.

The same reasoning applies to project templates. Tools like cookiecutter and copier generate a whole layout from a template, so consistency across a team stops depending on anyone remembering the convention. The value is less the particular directories a template gives you than the fact that every project on the team opens the same way. Treat the generated tree as a draft, though — delete the directories this project will never fill, because an empty models/ in an analysis that trains nothing is noise pretending to be structure.

9.4 Raw data is sacred

The subdivision of data/ encodes a rule worth stating on its own: raw data is immutable. The files in data/raw/ are the inputs your work begins from, and you never edit them in place — not to fix a typo, not to drop a bad row, not to rename a column. Every cleaning and transformation step reads from raw and writes a new file under interim/ or processed/, leaving the original untouched.

This is the same instinct as not mutating a function’s arguments (Chapter 6), scaled up to files, and it buys the same thing: when a result looks wrong, you can always go back to the unaltered source and replay the transformations to find where it diverged. If you’d edited the raw file, that ground truth would be gone. It also makes the whole pipeline reproducible — interim/ and processed/ are derived artefacts that can be regenerated from raw at any time, which is exactly why they, like the environment, don’t need to live in version control.
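The rule can be sketched in code: a cleaning step takes a raw file and a destination, and only ever writes to the destination. The example below is an illustration under assumptions, with a made-up two-column CSV and hypothetical function names; clearing the write bits is one optional way to make accidental edits fail loudly.

```python
import csv
import stat
import tempfile
from pathlib import Path


def make_read_only(path: Path) -> None:
    """Clear the write bits so accidental in-place edits fail loudly."""
    mode = path.stat().st_mode
    path.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


def clean_customers(raw_file: Path, processed_file: Path) -> int:
    """Read from raw, drop rows with an empty `value`, write a new file."""
    with raw_file.open(newline="") as source:
        rows = [row for row in csv.DictReader(source) if row["value"].strip()]
    processed_file.parent.mkdir(parents=True, exist_ok=True)
    with processed_file.open("w", newline="") as target:
        writer = csv.DictWriter(target, fieldnames=["id", "value"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


with tempfile.TemporaryDirectory() as tmp:
    original = "id,value\n1,10\n2,\n3,30\n"  # made-up data; row 2 is incomplete
    raw = Path(tmp) / "data" / "raw" / "customers.csv"
    raw.parent.mkdir(parents=True)
    raw.write_text(original)
    make_read_only(raw)

    kept = clean_customers(raw, Path(tmp) / "data" / "processed" / "customers.csv")
    print(kept)                          # → 2
    print(raw.read_text() == original)   # → True: the raw file is untouched

    # Restore the write bit so the temporary directory cleans up on every platform.
    raw.chmod(raw.stat().st_mode | stat.S_IWUSR)
```

Because the function never opens the raw file for writing, rerunning it always starts from the same input and produces the same output.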

9.5 The thin notebook, revisited

Good structure is what finally makes the “thin notebook, thick module” pattern from Chapter 1 and Chapter 6 real. Notebooks live in notebooks/ and import their logic from the src/ package; they orchestrate and explore, while the functions they call are version-controlled, tested, and reusable. The README in the root orients a newcomer in minutes — what the project does, how to build the environment, how to run the pipeline — turning a week of archaeology into an afternoon.

One small practice ties the layout together: derive every file location from a single project root rather than hard-coding paths. A data.csv referenced as /Users/you/Desktop/project/data/raw/data.csv works only on your machine; the same path built relative to the project root works everywhere.

from pathlib import Path

# One source of truth for where things live — usually a small `paths.py`
# in the package. Every other module imports these instead of hard-coding.
PROJECT_ROOT = Path("/work/customer-value")   # resolved once, e.g. from __file__
DATA_RAW = PROJECT_ROOT / "data" / "raw"
DATA_PROCESSED = PROJECT_ROOT / "data" / "processed"
MODELS = PROJECT_ROOT / "models"

# Code refers to these names, never to a literal path string.
customers_file = DATA_RAW / "customers.csv"
model_file = MODELS / "churn_model.pkl"

print(customers_file)
print(model_file)
print("Same layout on any machine; only PROJECT_ROOT changes.")

/work/customer-value/data/raw/customers.csv
/work/customer-value/models/churn_model.pkl
Same layout on any machine; only PROJECT_ROOT changes.

Because every path descends from PROJECT_ROOT, moving the project (to a colleague’s laptop, a server, a container) changes at most one line, and none at all if the root is resolved from __file__. Nothing in the codebase contains a path that only exists on your machine. The structure and the paths reinforce each other: a known layout, addressed from a single root.
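The example hard-codes the root so that its output is predictable. One common way to resolve it instead is to walk upwards from the current file until a marker file appears. The sketch below assumes pyproject.toml as the marker, since the layout above has one at the root; the function name `find_project_root` is hypothetical.

```python
import tempfile
from pathlib import Path


def find_project_root(start: Path, marker: str = "pyproject.toml") -> Path:
    """Walk upwards from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No {marker} found at or above {start}")


# In a real paths.py this would be:
#     PROJECT_ROOT = find_project_root(Path(__file__))
# Here a temporary tree stands in for the project.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve() / "customer-value"
    package = root / "src" / "customer_value"
    package.mkdir(parents=True)
    (root / "pyproject.toml").write_text("[project]\nname = 'customer-value'\n")

    found = find_project_root(package / "paths.py")
    print(found == root)  # → True
```

With the root found this way, no line needs to change when the project moves, and the search fails with a clear error if the code is run outside any project.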

Author’s Note

Notebooks sprawl because nothing pushes back. Creating another one costs nothing, naming it carefully costs a moment you don’t have mid-analysis, and so the folder accretes — _v2, _final, _actual — until the current version is whichever one you most recently had open. The mess isn’t a personal failing; it’s the default outcome of a workflow with no structural pressure toward order.

The reframe is to see structure as a message to whoever opens the project next — a colleague, or you in six months with no memory of any of it. A repository someone can open and immediately understand — data here, code there, run it like this — lets them contribute in an hour instead of a week, and lets you return to the work without re-deriving how it fits together. The specific convention barely matters; a team that agrees on any sensible layout is far better off than one where every project is arranged differently. What matters is the predictability: that the answer to “where is the X?” is the same every time.

9.6 Summary

Structure turns a folder of files into a navigable project:

1. A flat folder doesn’t scale. Without conventions, no one — including you — can tell inputs from outputs or find the current version.
2. Separate by kind and lifecycle. Code in src/ and tests/, data under data/, generated artefacts in models/, config in configs/ — each kind with one home, so the layout answers “where is the X?”.
3. Raw data is immutable. Never edit inputs in place; derive interim/ and processed/ from them, so you can always replay transformations from an untouched source.
4. Address everything from one root. Derive file paths from a single project root so the project runs anywhere, and let notebooks stay thin by importing logic from the package.

With a place for everything, the next chapter turns the monolithic analysis cell into something that runs in that structure: data pipelines.

9.7 Exercises

1. Take a flat project of your own and reorganise it into the standard layout — src/, data/raw and data/processed, notebooks/, tests/. Move the raw inputs into data/raw and make them read-only. In the process, did you find anything that was being edited in place, and where should its output have gone instead?
2. Write a README that orients a newcomer: what the project does, how to set up the environment, how to run the pipeline end to end, and where the data comes from. Hand it to a colleague and note every question they still have to ask — each is a gap in either the README or the structure.
3. Introduce a single source of truth for file locations: a paths.py (or a config) that derives every path from one project root, and replace at least one hard-coded absolute path with a reference to it. What would have broken if a colleague had run the original on their machine?
4. Conceptual: Tidy data has a rule sharp enough to settle arguments: one variable per column, one observation per row. Write the equivalent one-sentence rule for a project you actually work on — sharp enough that a colleague could use it to decide where a new file belongs without asking you. Then find a file in that project your rule doesn’t cleanly place, and decide whether the file is in the wrong place or the rule is too crude.
5. Conceptual: A full project structure can be overkill. Describe a piece of work that should stay a single notebook in a single folder, and name the specific signal that tells you it has earned the scaffolding of src/, tests/, and a structured data/.

Content: CC BY-NC-SA 4.0 | Code: MIT - see /LICENSE.md