4 The command line

4.1 Beyond the Run button

Almost everything you’ve run as a data scientist, you’ve run by clicking. Run this cell, run all cells, run the script in your IDE. It works beautifully — until the day the work has to happen somewhere there’s no one to click. Someone gives you SSH access to a GPU server with no desktop environment. A pipeline needs to run at 2am whether or not you’re awake. A colleague asks you to “just run it in CI”. Each of these hits the same wall: the click-to-run workflow assumes a human, a screen, and a mouse, and production assumes none of them.

The command line is how code runs when nobody is there to click. It has a deserved reputation for being terse and unwelcoming, and the goal of this chapter isn’t to make you love it — it’s to give you enough fluency that a headless server or a scheduled job stops being a barrier. The good news is that the core idea is one you already use every day, just in a different room of the house.

4.2 The shell is a REPL for your computer

You’re already fluent in one interactive prompt: the Python REPL, or IPython, or a Jupyter cell. You type an expression, it evaluates, you inspect the result, you type the next thing building on the last. The shell is exactly that, but the “expressions” are programs and the “values” are files and streams of text. ls lists, cd moves, cat prints a file — and, crucially, you compose them.

Data Science Bridge

A shell pipeline is method chaining for your operating system. When you write df.dropna().groupby("region")["spend"].mean(), you’re composing small operations into a larger one, each step doing a single job and handing its output to the next. The shell’s pipe does the same: cat sales.csv | grep "2026" | wc -l reads a file, keeps the matching lines, and counts them — three small tools composed into one answer. The mental model of “small operations, composed left to right” transfers directly.

Where it breaks down: pandas passes a richly typed object — a DataFrame, with columns and dtypes — from one method to the next inside a single process. Shell pipes pass untyped text, line by line, between separate programs. That’s what makes the shell so general (every tool speaks text) and also what makes it the wrong choice once your data has real structure: the moment you’re parsing CSV fields by counting commas in awk, you’ve found the boundary, and it’s time to reach for Python.
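To make that boundary concrete, here is a minimal sketch of what goes wrong when a CSV line is treated as plain text. The sample row and its values are invented for illustration; the only library call is the standard library's csv.reader.

```python
import csv
import io

# Hypothetical row: the customer name contains a comma inside quotes.
line = '1042,"Smith, Jane",2026-03-14,19.99\n'

# Text-only view, as a naive split on commas would see it.
naive_fields = line.strip().split(",")

# Structure-aware view: the csv module understands quoting.
parsed_fields = next(csv.reader(io.StringIO(line)))

print(len(naive_fields), naive_fields)
print(len(parsed_fields), parsed_fields)
```

The naive split sees five fields where the file has four, so every column after the name shifts by one. A text tool has no way to know that, which is the point at which a parser earns its keep.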

The handful of constructs worth knowing early are the pipe (|, send one command’s output into the next), redirection (> to write output to a file, >> to append, < to read from one), globbing (*.csv to match many files at once), and the exit code — every command returns a number when it finishes, zero for success and non-zero for failure. That last one is invisible day to day but is the foundation everything in Continuous integration is built on.
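If the shell syntax is still unfamiliar, it may help to see the same constructs spelled out in Python. The sketch below is only an analogy, not how the shell is implemented: the file names and contents are made up, and the grep and count helpers are hypothetical stand-ins for the real tools (the real grep matches regular expressions, not plain substrings).

```python
import tempfile
from pathlib import Path

def grep(pattern, lines):
    """Yield only the lines containing the pattern, one at a time."""
    return (line for line in lines if pattern in line)

def count(lines):
    """Count lines as they stream past, like wc -l."""
    return sum(1 for _ in lines)

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "jan.csv").write_text("id,date\n1,2026-01-05\n2,2025-12-31\n")
    (root / "feb.csv").write_text("id,date\n3,2026-02-01\n")
    (root / "notes.txt").write_text("2026 planning\n")

    csv_files = sorted(root.glob("*.csv"))            # globbing: *.csv
    matches = 0
    for path in csv_files:
        with path.open() as lines:                    # cat file |
            matches += count(grep("2026", lines))     # grep "2026" | wc -l

    (root / "count.txt").write_text(f"{matches}\n")   # redirection: > count.txt
    saved = (root / "count.txt").read_text()

print([p.name for p in csv_files], matches)
```

Each stage consumes lines from the previous one as they arrive, which mirrors how a pipe streams text between programs rather than building the whole result in memory first.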

4.3 Running your code as a program

A notebook is run by a human, interactively. A program is run by anything — a person, a scheduler, a CI server — with no interaction at all. Turning your analysis into something runnable from the command line is mostly the move you already made in the previous chapters: get the logic out of cells and into a module, then give it an entry point.

The entry point is the if __name__ == "__main__": guard. Code under that guard runs when the file is executed as a script (python train.py) but not when it’s imported by something else — so the same file can be both an importable module and a runnable program. Arguments come in through sys.argv, though in practice you’d use argparse from the standard library or typer for anything beyond a single flag.
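A minimal sketch of that pattern follows, run the same way as the other examples in this chapter: as a real subprocess. The script body, the --epochs flag, and the train function are hypothetical, chosen only to show the guard and argparse working together.

```python
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical train.py: one importable function plus a guarded entry point.
SCRIPT = textwrap.dedent('''
    import argparse

    def train(epochs):
        return f"trained for {epochs} epochs"

    def main(argv=None):
        parser = argparse.ArgumentParser(description="Toy training entry point")
        parser.add_argument("--epochs", type=int, default=1)
        args = parser.parse_args(argv)
        print(train(args.epochs))
        return 0  # becomes the process exit code

    if __name__ == "__main__":
        raise SystemExit(main())
''')

with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "train.py"
    script.write_text(SCRIPT)

    # Executed as a script: the guard is true, so main() runs.
    run = subprocess.run(
        [sys.executable, str(script), "--epochs", "3"],
        capture_output=True, text=True,
    )

    # Imported as a module: the guard is false, so only the explicit call prints.
    imported = subprocess.run(
        [sys.executable, "-c", "import train; print(train.train(2))"],
        capture_output=True, text=True, cwd=tmp,
    )

print(run.stdout.strip(), "| exit code:", run.returncode)
print(imported.stdout.strip())
```

Returning an integer from main and passing it to SystemExit is one common convention, not the only one. It keeps the exit code explicit and leaves main easy to call from a test.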

The part that feels alien coming from notebooks is the exit code. A notebook cell that raises an exception just shows red text and you move on; a program that fails needs to say so in a way the shell can detect, by exiting with a non-zero code. That single convention is what lets you chain steps safely and what lets an automated system know your job failed.

import subprocess
import sys

# A tiny "validation script" that succeeds, run as a real subprocess.
ok = subprocess.run(
    [sys.executable, "-c", "rows = 100; assert rows > 0; print('validation passed')"],
    capture_output=True, text=True,
)
print(f"stdout: {ok.stdout.strip()!r}   exit code: {ok.returncode}")

# The same, but the check fails — the assertion raises, Python exits non-zero.
bad = subprocess.run(
    [sys.executable, "-c", "rows = 0; assert rows > 0, 'no data!'"],
    capture_output=True, text=True,
)
print(f"stderr: {bad.stderr.strip().splitlines()[-1]!r}   exit code: {bad.returncode}")

print("\nZero means success; non-zero means failure — that is the whole contract.")

stdout: 'validation passed'   exit code: 0
stderr: 'AssertionError: no data!'   exit code: 1

Zero means success; non-zero means failure — that is the whole contract.

The successful run exits 0; the failing one exits 1 and writes its error to the standard error stream. This is exactly how the shell and a CI server tell whether your code worked. It’s also what makes chaining meaningful: python validate.py && python train.py runs the training step only if validation exited zero, so a failed check stops the pipeline instead of feeding bad data downstream.
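The short-circuit behaviour of && can be sketched in Python as a loop that stops at the first non-zero exit code. This is an illustration of the idea, not of how the shell implements it; run_chain and the inline step commands are hypothetical.

```python
import subprocess
import sys

def run_chain(steps):
    """Run each command in order; stop at the first non-zero exit, like `a && b`."""
    completed = []
    for name, command in steps:
        result = subprocess.run(command, capture_output=True, text=True)
        completed.append((name, result.returncode))
        if result.returncode != 0:
            break  # later steps never start
    return completed

# sys.exit with a string prints it to stderr and exits with code 1.
validate_fails = [sys.executable, "-c", "import sys; sys.exit('validation failed: zero rows')"]
validate_ok = [sys.executable, "-c", "print('validation passed')"]
train = [sys.executable, "-c", "print('training')"]

failed = run_chain([("validate", validate_fails), ("train", train)])
passed = run_chain([("validate", validate_ok), ("train", train)])

print(failed)  # [('validate', 1)]
print(passed)  # [('validate', 0), ('train', 0)]
```

In the failing chain the training step is never launched, which is exactly the protection the exit code buys you.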

4.4 Working on machines that aren’t yours

Sooner or later the compute you need isn’t your laptop — it’s a server with more memory, a GPU, or simply somewhere a job can run unattended. You reach it with ssh user@host, which drops you into a shell on that machine, and you move files with scp or, better, rsync (which copies only what changed and can resume). Configuration and credentials on these machines usually come through environment variables rather than hard-coded values — export API_KEY=..., read in Python with os.environ — which is the thread we pick up properly in Configuration and secrets.
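As a small sketch of the reading side, the helper below fails fast with a clear message when a required variable is absent. The require_env name is hypothetical, and a plain dictionary stands in for the real environment so the example runs anywhere; in a real script you would omit the second argument and let it read os.environ.

```python
import os

def require_env(name, environ=None):
    """Return the variable's value, or exit non-zero with a clear message if unset or empty."""
    environ = os.environ if environ is None else environ
    value = environ.get(name, "")
    if not value:
        # SystemExit with a string prints it to stderr and exits with code 1.
        raise SystemExit(f"missing required environment variable: {name}")
    return value

# Stand-in for the real environment (an assumption for illustration only).
fake_env = {"API_KEY": "dummy-key-for-illustration"}

api_key = require_env("API_KEY", fake_env)
print(f"API key loaded ({len(api_key)} characters)")
```

Failing at start-up with the variable's name in the message is usually kinder than a KeyError buried in a traceback three hours into a job.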

The detail that catches everyone out the first time is what happens to a job when the connection drops. A command you start in a plain SSH session is tied to that session: close your laptop or lose the wifi, and the job dies with the disconnect. A terminal multiplexer — tmux or screen — solves this by running your shell in a session that lives on the server independently of your connection. You start tmux, launch the long training run inside it, detach, and close your laptop; the job keeps running, and you reattach later to find it finished. For any job measured in hours, this is not optional.

4.5 Automating the workflow

Once your steps are runnable programs, the natural next move is to stop typing them by hand. A shell script is the simplest option — a file of commands run top to bottom — and it captures a sequence so you (and others) run it the same way every time. But for the multi-step workflows typical of data science, a task runner earns its place, and the venerable choice is make.

A Makefile declares named targets, each with the commands that build it. It reads almost like documentation of your pipeline:

# Recipe lines must be indented with a tab character, not spaces.
data/raw/customers.csv: src/download_data.py
	python -m src.download_data

data/features.parquet: data/raw/customers.csv src/build_features.py
	python -m src.build_features

models/model.pkl: data/features.parquet src/train_model.py
	python -m src.train_model

train: models/model.pkl        # a convenient name for "build the model"

clean:
	rm -rf data/features.parquet models/model.pkl

.PHONY: train clean

Now make train runs the whole chain in order and, more usefully, runs only what’s required. The mechanism is the one thing worth understanding about make: each target is a file, and each prerequisite is a file that target depends on. make works through the chain of prerequisites and rebuilds a target only if it is missing or older than one of them. If nothing has changed since the last run, make does nothing. Edit src/build_features.py and it rebuilds the features and the model, but doesn’t re-download the raw data. That is the whole value proposition, and it only works if the names on the left are genuinely the files the recipe writes and the names on the right are genuinely the files it reads.
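The timestamp rule is simple enough to sketch in a few lines of Python. This is a simplified, single-level illustration and not make's actual implementation, which also walks the prerequisites recursively. The needs_rebuild and touch helpers are hypothetical, and the modification times are set by hand so the result does not depend on the clock.

```python
import os
import tempfile
from pathlib import Path

def needs_rebuild(target, prerequisites):
    """True if the target is missing or older than any prerequisite (one level only)."""
    target = Path(target)
    if not target.exists():
        return True
    built_at = target.stat().st_mtime
    return any(Path(p).stat().st_mtime > built_at for p in prerequisites)

def touch(path, mtime):
    """Create the file if needed and set its modification time explicitly."""
    path.touch()
    os.utime(path, (mtime, mtime))

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    raw = root / "customers.csv"
    script = root / "build_features.py"
    features = root / "features.parquet"

    touch(raw, 1_700_000_000)
    touch(script, 1_700_000_000)
    missing = needs_rebuild(features, [raw, script])      # target does not exist yet

    touch(features, 1_700_000_100)
    up_to_date = needs_rebuild(features, [raw, script])   # target newer than both inputs

    touch(script, 1_700_000_200)
    stale = needs_rebuild(features, [raw, script])        # script edited after the build

print(missing, up_to_date, stale)  # True False True
```

Seen this way, it is clear why the file names in a rule have to be honest: the comparison is between files on disk, and a target that is never actually written always looks missing.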

Which is why .PHONY appears only on train and clean. A phony target is one that names an action rather than a file, and make runs its recipe whenever the target is requested, without checking timestamps. That’s correct for clean, and for train as an alias. Marking a real output like models/model.pkl phony, though, would switch off timestamp checking for that target and retrain the model on every invocation, even when nothing had changed. The most common way to end up with a Makefile that quietly does no incremental work at all is to declare every target phony because it made an error message go away.

A Makefile serves double duty: it automates the workflow and documents the steps, so a newcomer can read it and see how the project fits together. It’s a deliberately simple tool, and for complex, branching data pipelines you’ll later meet purpose-built orchestrators in Data pipelines. But for capturing “the three commands you always run in this order”, nothing is faster to adopt.

Author’s Note

There’s a ceiling built into the graphical, click-to-run way of working, and most data scientists meet it without quite naming it. Pointing and clicking is wonderfully immediate for exploration — you see the result instantly, you poke at it, you adjust. But the same immediacy becomes the bottleneck the moment the work has to happen repeatedly, unattended, or somewhere without a screen. You cannot click a button on a server you’ve disconnected from, and you cannot schedule a mouse.

The resistance to the command line is reasonable: it’s terse, it’s unforgiving, and its error messages assume you already know what went wrong. But what it offers in return is leverage. Anything you can type, you can put in a script; anything you can script, you can schedule, repeat, and hand to a machine to run a thousand times without you. The shell is the point at which your work stops needing you in the room — and for code that’s meant to run in production, needing you in the room is the thing you’re trying to engineer away.

4.6 Summary

The command line is how code runs without a human to drive it:

1. The shell is a REPL for your computer. Small tools composed with pipes are the same idea as chaining pandas operations, until the data gains structure, at which point you switch back to Python.
2. Make your code a program, not just a notebook. An entry point and a meaningful exit code turn an analysis into something a scheduler or CI server can run and check.
3. You’ll work on machines that aren’t yours. ssh and rsync get you there and move data; tmux keeps long jobs alive when your connection doesn’t.
4. Automate the sequence. A shell script or a Makefile captures the steps you always run, so they run the same way every time, and documents them in the process.

With version control, reproducible environments, and command-line fluency in place, Part 2 turns from running code to writing code that others — and future-you — can read, starting with readable code.

4.7 Exercises

1. Take a multi-step workflow you currently run by hand (clicking through cells, or running scripts one after another) and capture it as a shell script or a Makefile with one target per step. Run the whole thing from a single command, and note any step that turned out to depend on you remembering to do something first.
2. Answer a question about a data file using only the shell: with a pipeline of cat, head, grep, sort, uniq, and wc, count how many rows match a condition, or how many distinct values appear in a column, without opening Python. Where did this feel natural, and where did you wish you had a DataFrame?
3. Write a small data-validation script that exits non-zero when a check fails (for example, if a required column is missing or a row count is zero), and chain it with && so a downstream step runs only when validation passes. Explain why this exit-code behaviour matters for automation and continuous integration.
4. If you have access to a remote machine, start a long-running command inside tmux, detach, disconnect, then reconnect and confirm it survived. If you don’t, write down what would happen to a job started in a plain SSH session when the connection drops, and explain why tmux changes that.
5. Conceptual: A colleague counts last year’s orders with grep "2026" sales.csv | wc -l and reports the result as the number of orders placed in 2026. Predict at least two distinct ways that figure could be wrong, and say what len(df[df["order_date"].dt.year == 2026]) would have done differently in each case. Which property of the pipe itself (not of grep) is responsible?

Content: CC BY-NC-SA 4.0 | Code: MIT - see /LICENSE.md