2 Version control

2.1 The folder full of versions

Open the folder where last quarter’s analysis lives. There’s a fair chance it looks something like this: churn_model.ipynb, churn_model_v2.ipynb, churn_model_v2_fixed.ipynb, churn_model_final.ipynb, and — inevitably — churn_model_final_ACTUAL.ipynb. Somewhere there’s a notes.txt meant to record which one produced the figures in the board deck. It says “use the final one (not final2)”.

This is version control. It’s just version control done by hand, and badly. You already understand the need: some instinct told you that overwriting the previous version was dangerous, so you kept it, with a name that seemed meaningful at 7pm and means nothing now. The instinct is sound. The implementation is the problem — filenames don’t record why you made a change, they don’t tell you which files changed together, and they collapse the moment two people edit in parallel.

Git replaces the whole _final_ACTUAL ritual with something that actually answers the questions you were trying to answer: what changed, when, why, and how do I get the old version back. It has a reputation for being arcane, and the command line doesn’t help. But the mental model underneath is simple, and it maps onto something you already do every day.

2.2 What Git actually does

Strip away the intimidating surface and Git does one thing: it takes snapshots of your project and remembers the order you took them in. Each snapshot is a commit. A commit records the complete state of your tracked files at a moment you chose, along with who made it, when, and a message saying what changed and why. The project’s history is just the ordered chain of those commits, and because every snapshot is kept, you can return to any point, compare any two, and recover anything you didn’t mean to lose.

There’s one wrinkle worth learning early, because it trips up newcomers. Git doesn’t snapshot your files automatically. You choose what goes into each commit by first moving changes into a staging area — think of it as deciding what to include in the photo before you take it. The everyday rhythm is three steps: change files in your working directory, stage the changes you want to keep together, then commit them with a message.

# The everyday loop
git status                      # what have I changed?
git add src/features.py         # stage the changes I want to commit together
git commit -m "Use median spend per day, not mean — robust to outliers"
git log --oneline               # the history of decisions, newest first
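
To make the three-step rhythm concrete, here is a toy model in Python. It is only a sketch under simplifying assumptions: ToyRepo, stage, and commit are hypothetical names invented for illustration, and real Git stores snapshots as content-addressed objects rather than copied dictionaries. What it shows is the one behaviour that surprises newcomers: a changed file that was never staged is left out of the commit.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRepo:
    """A toy stand-in for Git's three areas (hypothetical, not Git's real design)."""
    working: dict = field(default_factory=dict)   # files as they are on disk
    staged: dict = field(default_factory=dict)    # changes chosen for the next commit
    history: list = field(default_factory=list)   # ordered (message, snapshot) pairs

    def stage(self, path):
        # Like `git add path`: copy the current content into the staging area.
        self.staged[path] = self.working[path]

    def commit(self, message):
        # Like `git commit -m message`: the new snapshot is the previous
        # snapshot plus whatever was staged. Unstaged edits are left out.
        previous = self.history[-1][1] if self.history else {}
        snapshot = {**previous, **self.staged}
        self.history.append((message, snapshot))
        self.staged = {}


repo = ToyRepo()
repo.working["src/features.py"] = "spend = df.spend.median()"
repo.working["scratch.py"] = "print('just poking around')"

repo.stage("src/features.py")            # only this file goes in the photo
repo.commit("Use median spend per day, not mean")

message, snapshot = repo.history[-1]
print(message)
print(sorted(snapshot))                  # scratch.py is absent: it was never staged
```

The design point is the separation: the working directory can be as messy as exploration requires, while each commit contains only what you deliberately chose.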

The other concept you’ll use constantly is the branch. A branch is an independent line of commits — a parallel version of the project you can develop without disturbing the main one. You make a branch to try something, commit freely on it, and then either merge it back if it worked or abandon it if it didn’t. This is the part that directly replaces the _v2_experimental filenames: instead of copying the whole notebook to try a new feature set, you branch, experiment, and keep main in a known-good state the whole time.
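
A branch sounds heavier than it is. In Git a branch is a named pointer to a commit, and the toy model below sketches that idea. The class ToyHistory and the branch name try-new-features are hypothetical, the integer commit ids stand in for Git’s content hashes, and the only kind of merge shown is the simplest one, where main has not moved since the branch was created.

```python
class ToyHistory:
    """Toy model: commits know their parent, branches are just named pointers."""

    def __init__(self):
        self.commits = {}                 # commit id -> (parent id, message)
        self.branches = {"main": None}    # branch name -> id of its latest commit
        self.current = "main"

    def commit(self, message):
        commit_id = len(self.commits) + 1            # real Git uses a content hash
        self.commits[commit_id] = (self.branches[self.current], message)
        self.branches[self.current] = commit_id      # the branch pointer moves forward

    def branch(self, name):
        # A new branch starts at the current commit and becomes the active branch.
        self.branches[name] = self.branches[self.current]
        self.current = name

    def log(self, name):
        # Walk parent links from the branch tip back to the first commit.
        messages, commit_id = [], self.branches[name]
        while commit_id is not None:
            parent, message = self.commits[commit_id]
            messages.append(message)
            commit_id = parent
        return messages


repo = ToyHistory()
repo.commit("Baseline churn model")
repo.branch("try-new-features")
repo.commit("Add tenure-bucket features")

main_before = repo.log("main")
print(main_before)                     # main is untouched by the experiment
print(repo.log("try-new-features"))    # the branch has the extra commit, newest first

# Merging a branch that is strictly ahead of main just moves main's pointer
# (Git calls this a fast-forward); abandoning it would mean deleting the name.
repo.branches["main"] = repo.branches["try-new-features"]
print(repo.log("main"))
```

Because a branch is only a pointer, creating one costs almost nothing, which is why branching is a reasonable default for any experiment rather than a ceremony reserved for big changes.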

Data Science Bridge

A Git history is the lab notebook you always meant to keep. Every experimentalist knows the value of a dated record of what you tried and why, so that a result is traceable back to the conditions that produced it. A commit is a dated, attributed entry in that record, and git log is the notebook read back. The analogy holds well for code and decisions: which cleaning rule changed, when you switched metrics, why you dropped a feature.

Where it breaks down: Git versions files, not runs. It won’t record that this version of the code scored an AUC of 0.83 on Tuesday’s data with learning_rate=0.05. That’s the job of an experiment tracker (MLflow, Weights & Biases, DVC’s experiment tools), which logs metrics, parameters, and artefacts per run. The two are complementary — Git answers “what is the code?”, the tracker answers “what did this run produce?” — and mature projects use both.
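
One lightweight way to connect the two halves is to stamp every run record with the commit it came from. The sketch below uses only the standard library; log_run, the runs.jsonl file, and the record layout are hypothetical, and a real experiment tracker does this bookkeeping for you. The parameter and metric values are the illustrative ones from the paragraph above.

```python
import json
import subprocess
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def current_commit():
    """Return the current commit hash, or "unknown" outside a Git repository."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def log_run(log_path, params, metrics):
    """Append one run record (hypothetical format) as a line of JSON."""
    record = {
        "commit": current_commit(),                       # what was the code?
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "params": params,                                 # how was it run?
        "metrics": metrics,                               # what did it produce?
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


log_path = Path(tempfile.mkdtemp()) / "runs.jsonl"        # hypothetical log file
record = log_run(log_path, params={"learning_rate": 0.05}, metrics={"auc": 0.83})
print(record["commit"], record["metrics"])
```

The commit field is what makes a metric traceable: given a surprising number, you can check out exactly the code that produced it.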

2.3 Why notebooks fight back

If Git is so useful, why do so many data scientists bounce off it? Part of the answer is that the notebook — the format you live in — is genuinely hostile to version control, and it’s worth understanding why rather than concluding the tool is broken.

A .ipynb file is not the code you see in Jupyter. It’s a JSON document that wraps your code together with its cached outputs, the execution count of every cell, and a pile of metadata. When you re-run a notebook without changing a line of logic, the execution counts change, and outputs and embedded image data often change with them. Git, which compares files line by line, sees all of that as “changes”, so a one-line edit drowns in noise, and merging two people’s notebooks produces conflicts in machine-generated JSON that are tedious and error-prone to resolve by hand.

import json
import difflib

# The same one-line logic change, represented two ways.

# As a plain .py module, version control sees exactly what changed:
script_before = "THRESHOLD = 150\n"
script_after = "THRESHOLD = 200\n"

# As a notebook, that line is buried in JSON next to an execution
# count and cached output that change every time the cell is run.
def notebook_cell(source, execution_count, output):
    return json.dumps({
        "cells": [{
            "cell_type": "code",
            "execution_count": execution_count,
            "outputs": [{"output_type": "stream", "text": output}],
            "source": [source],
        }],
        "metadata": {}, "nbformat": 4, "nbformat_minor": 5,
    }, indent=1)

nb_before = notebook_cell("THRESHOLD = 150", execution_count=7, output="150\n")
nb_after = notebook_cell("THRESHOLD = 200", execution_count=12, output="200\n")

def changed_lines(a, b):
    """Count the +/- lines a line-based diff (like Git's) would report."""
    diff = difflib.unified_diff(a.splitlines(), b.splitlines())
    return sum(1 for line in diff
               if line[:1] in "+-" and not line.startswith(("+++", "---")))

print(f"Lines changed, as a .py script:   {changed_lines(script_before, script_after)}")
print(f"Lines changed, as a .ipynb file:  {changed_lines(nb_before, nb_after)}")

Lines changed, as a .py script:   2
Lines changed, as a .ipynb file:  6

The same one-line change to a threshold shows up as a clean two-line diff in a script and a scattered, noisier six-line diff in the notebook, and that’s with a single trivial cell. In a real notebook with dozens of cells and rich outputs, the signal disappears entirely.

There are three good fixes, and you’ll likely use a combination. Strip outputs before committing, so only code and markdown are versioned — nbstripout can do this automatically on every commit. Use a notebook-aware diff tool such as nbdime, which understands the JSON structure and shows you cell-level changes instead of raw text. Or pair each notebook with a plain-text script representation using jupytext, and commit the script: it diffs cleanly, reviews like normal code, and regenerates the notebook on demand. None of these asks you to stop using notebooks. They just stop the notebook format from undermining the history you’re trying to keep.
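
The first fix is easy to picture in code. The sketch below is a simplified illustration of the idea behind output stripping, not nbstripout’s actual implementation: make_notebook and strip_outputs are hypothetical helpers, and the one-cell notebook is a minimal stand-in for a real file. It clears outputs and execution counts from both versions and then counts the changed lines again.

```python
import copy
import difflib
import json


def make_notebook(source, execution_count, output):
    """Build a minimal one-cell notebook as a Python dict."""
    return {
        "cells": [{
            "cell_type": "code",
            "execution_count": execution_count,
            "metadata": {},
            "outputs": [{"output_type": "stream", "name": "stdout", "text": [output]}],
            "source": [source],
        }],
        "metadata": {}, "nbformat": 4, "nbformat_minor": 5,
    }


def strip_outputs(notebook):
    """Return a copy with outputs and execution counts cleared from code cells."""
    stripped = copy.deepcopy(notebook)
    for cell in stripped["cells"]:
        if cell["cell_type"] == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


def changed_lines(a, b):
    """Count the +/- lines in a line-based diff of two notebooks' JSON text."""
    a_text = json.dumps(a, indent=1).splitlines()
    b_text = json.dumps(b, indent=1).splitlines()
    diff = difflib.unified_diff(a_text, b_text, lineterm="")
    return sum(1 for line in diff
               if line.startswith(("+", "-")) and not line.startswith(("+++", "---")))


before = make_notebook("THRESHOLD = 150", execution_count=7, output="150\n")
after = make_notebook("THRESHOLD = 200", execution_count=12, output="200\n")

print("Raw notebooks:    ", changed_lines(before, after))
print("Outputs stripped: ", changed_lines(strip_outputs(before), strip_outputs(after)))
```

With the machine-generated fields cleared, the notebook diff shrinks to the same two lines the script produced: the old source line and the new one.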

2.4 What belongs in version control

Git is built for source: code, configuration, documentation, and small fixtures — text files that change deliberately and that you want to track line by line. A surprising amount of what sits in a data science project does not belong in it, and committing the wrong things is one of the fastest ways to make a repository painful to work with.

Data is the big one. Datasets are often large, change frequently, and sometimes contain information that must never end up in a shared history. Git keeps every committed version of every file, so committing a 2 GB CSV (or worse, ten revisions of it) bloats the repository permanently, and a credential committed once lives in the history even after you delete the file. Model artefacts (.pkl, .joblib, saved weights) are the same: large, binary, and regenerated rather than authored. These need their own kind of versioning, which we’ll come to in the data and reproducibility chapters; the tool of choice, DVC, deliberately stores large files outside Git while keeping a small text pointer inside it.
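
Because a mistaken commit is hard to take back, it can help to check a project before staging anything. The following is a minimal sketch, assuming a policy of your own choosing: the 5 MB limit, the suffix list, and the function name risky_files are all hypothetical, not rules that Git enforces.

```python
import tempfile
from pathlib import Path

# Hypothetical policy: these values are illustrative, not a Git rule.
MAX_BYTES = 5 * 1024 * 1024
DATA_SUFFIXES = {".csv", ".parquet", ".pkl", ".joblib"}


def risky_files(root, max_bytes=MAX_BYTES):
    """List files under root that look like data or artefacts rather than source."""
    flagged = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or ".git" in path.parts:
            continue
        name = path.relative_to(root).as_posix()
        if path.suffix in DATA_SUFFIXES:
            flagged.append((name, "data or model file"))
        elif path.stat().st_size > max_bytes:
            flagged.append((name, "larger than the size limit"))
    return flagged


# Demonstrate on a throwaway project directory.
project = Path(tempfile.mkdtemp())
(project / "src").mkdir()
(project / "src" / "features.py").write_text("THRESHOLD = 200\n")
(project / "data").mkdir()
(project / "data" / "customers.csv").write_text("id,spend\n1,210\n")
(project / "model.pkl").write_bytes(b"\x00" * 16)

for name, reason in risky_files(project):
    print(f"{name}: {reason}")
```

A check like this only flags candidates; deciding where each flagged file should live instead is still a judgement call.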

You tell Git what to ignore with a .gitignore file. A reasonable starting point for a data science project:

# Data and artefacts — versioned separately, not in Git
data/
*.csv
*.parquet
models/
*.pkl
*.joblib

# Secrets — never commit these (see "Configuration and secrets")
.env
*.key

# Notebook and Python noise
.ipynb_checkpoints/
__pycache__/
*.pyc

# Local environments
.venv/
venv/

The principle is to commit the things you author and ignore the things you generate or receive. Code, configuration, and the requirements.txt that pins your environment are authored — they belong in Git. Data, outputs, caches, and secrets are not, and each has a better home: data versioning for datasets, a secrets manager or local .env for credentials (the subject of Configuration and secrets), and a registry or artefact store for models.
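
To see how the patterns above sort a project into “commit” and “ignore”, here is a deliberately simplified matcher. It is a sketch, not Git’s algorithm: is_ignored is a hypothetical helper that handles only the two pattern shapes used in the example file (a directory name ending in a slash, and a file-name glob), while real .gitignore matching has further rules such as negation with ! and anchoring with a leading slash. The authoritative check in a real repository is git check-ignore.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

PATTERNS = ["data/", "*.csv", "*.parquet", "models/", "*.pkl", "*.joblib",
            ".env", "*.key", ".ipynb_checkpoints/", "__pycache__/", "*.pyc",
            ".venv/", "venv/"]


def is_ignored(path, patterns=PATTERNS):
    """Simplified check: handles only the two pattern shapes listed above."""
    parts = PurePosixPath(path).parts
    for pattern in patterns:
        if pattern.endswith("/"):
            # Directory pattern: matches if any parent directory has that name.
            if pattern.rstrip("/") in parts[:-1]:
                return True
        elif any(fnmatchcase(part, pattern) for part in parts):
            # Name pattern: matched against each component of the path.
            return True
    return False


for path in ["src/features.py", "requirements.txt", "notebooks/eda.ipynb",
             "data/raw/customers.csv", ".env",
             "notebooks/.ipynb_checkpoints/eda-checkpoint.ipynb"]:
    print(f"{'ignore' if is_ignored(path) else 'commit':7} {path}")
```

The authored files (code, the requirements file, the notebook itself) come out as “commit”, and the generated or received ones come out as “ignore”, which is the split the principle describes.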

2.5 Commits that tell a story

The mechanics of committing are easy. The discipline that makes version control valuable is choosing what a commit contains and what its message says — and this is where the lab-notebook analogy earns its keep.

A good commit is small and focused: it makes one coherent change, so that its message can describe that change in a sentence and so that, if it later turns out to be wrong, you can undo it without unpicking four unrelated edits tangled in the same snapshot. The message should record the why, not the what — the diff already shows what changed. Recall the customer-spend notebook from the previous chapter, with its SPEND_THRESHOLD = 200 # was 150, changed during review meeting. That comment is doing the job a commit message should do, and doing it poorly, because it will be deleted the next time someone tidies the code. A commit message — “Raise spend threshold to 200 after Q3 review; 150 was admitting too many low-value accounts” — puts that reasoning somewhere permanent, attached to the exact change it explains, where git log and git blame will surface it for whoever asks “why is this 200?” in eighteen months.

Branches are how you keep that history clean while still moving fast. When you want to try a different feature set or a new cleaning rule, make a branch, commit your experiment there, and leave main in a state that always runs. If the experiment works, merge it; if it doesn’t, delete the branch and the dead end vanishes without cluttering the main line. This is the engineered replacement for copying model.ipynb to model_v2.ipynb — you get the same freedom to try things, but the versions are tracked, comparable, and recoverable rather than scattered across filenames.

Author’s Note

The resistance to Git among data scientists is usually pinned on the notebook-diffing problem, and that’s real — but there’s a deeper reason, and naming it helps. Exploration feels too fluid to commit. When you’re trying things cell by cell, the idea of stopping to stage and write a message for every change seems absurd, and it would be. The misunderstanding is about what a commit is for. The unit of a commit isn’t a keystroke or a cell re-run; it’s a decision worth recording. You don’t commit while you’re poking at a distribution. You commit when you’ve decided something: that the median beats the mean, that a feature leaks, that these are the columns you’re keeping. Those are precisely the decisions your future self and your colleagues will need to reconstruct, and they’re the ones that vanish first from memory and from notebook comments. Seen that way, version control isn’t bureaucracy imposed on exploration; it’s the thing that lets exploration leave a trace worth keeping.

2.6 Summary

Version control replaces a fragile, manual habit with a reliable one. The shift comes down to four ideas:

1. You already version your work; Git just does it properly. Filenames record neither why a change was made nor which files changed together; commits record both, and let you recover any past state.
2. Notebooks fight version control, and that’s fixable. The .ipynb JSON format buries code changes in execution counts and outputs. Strip outputs, use a notebook-aware diff (nbdime), or commit a paired script (jupytext). Don’t abandon notebooks; just stop them from polluting the history.
3. Commit what you author, ignore what you generate. Code and configuration belong in Git; data, models, secrets, and caches do not. Each of those has a better-suited home.
4. A commit is a decision worth recording. Keep commits small and focused, write messages that capture the why, and use branches to experiment without putting main at risk.

In the next chapter, we turn to the other half of “it works on my machine”: environments and dependencies.

2.7 Exercises

1. Take one of your own existing projects and put it under version control: run git init, write a .gitignore appropriate for a data science project, and make an initial commit of the code and configuration only. List what you deliberately excluded, and for each excluded item name where it should be versioned or stored instead.
2. Configure clean diffs for a notebook. Either set up nbstripout to strip outputs automatically on commit, or pair the notebook with a script using jupytext. Make a one-line logic change and confirm that the tracked diff now shows only the change you made, not a wall of metadata.
3. Look back at a recent analysis and identify three decisions you made: a threshold, a metric, a dropped feature. For each, write the one-sentence commit message you wish your past self had left. What information is in your message that a filename or an inline comment could not have captured?
4. Use a branch to run an experiment. Create a branch, make a change you’re genuinely unsure about (a different feature set, a new imputation rule), commit it, and then decide whether to merge it into main or discard it. How does this compare to how you currently try things out, and what did keeping main untouched buy you?
5. Conceptual: The “Data Science Bridge” in this chapter compares a Git history to a lab notebook and contrasts it with an experiment tracker. Name one thing Git version control does that an experiment tracker such as MLflow does not, and one thing an experiment tracker does that Git does not. What does the division of labour between them tell you about the difference between versioning your code and tracking your results?

Content: CC BY-NC-SA 4.0 | Code: MIT - see /LICENSE.md