8 Debugging and profiling

8.1 From print to a process

Two scenes, both familiar. In the first, something in your notebook is producing wrong numbers, so you scatter print(df.shape) and print(df.head()) and print("got here") between the cells, re-running and squinting, slowly triangulating where it goes wrong. In the second, a pipeline that should take seconds takes forty minutes, and you have no idea which part is slow, so you start commenting things out to see what helps.

Both are diagnosis problems, and in both, data science folklore reaches for the same blunt instrument: print. It works — that’s why everyone does it — but it doesn’t scale to code that runs unattended, fails only sometimes, or is slow rather than wrong. Software engineering has sharper tools for each of these questions, and none of them is harder to use than scattering print statements; they’re just unfamiliar. This chapter is about trading the blunt instrument for the right one.

8.2 Reading the traceback

Before any tool, there’s a skill: actually reading the traceback. Many “mysterious” errors aren’t mysterious at all — the answer is in the traceback, which most people skim in a panic instead of reading. A traceback is a map of how execution reached the error, and it reads bottom to top. The last line names the exception and its message — what actually went wrong. The lines above it are the chain of calls that led there, deepest last, and somewhere in that chain is the deepest frame in your own code, which is almost always where to look first.

Traceback (most recent call last):
  File "pipeline.py", line 42, in <module>
    result = build_features(raw)
  File "features.py", line 18, in build_features
    df["spend_per_day"] = df["spend"] / df["active_days"]
  File ".../pandas/core/frame.py", line 4102, in __getitem__
    indexer = self.columns.get_loc(key)
  ...
KeyError: 'active_days'

The bottom line tells you what (a column that isn’t there); the frame in features.py tells you where (line 18); together they tell you what to check (whether the upstream step renamed or dropped active_days). Note also which frames to skip: the deepest frames here are inside pandas itself, and they’re rarely where the bug lives. The deepest frame in your code — features.py, line 18 — is the one to read first. Reading the traceback properly answers a large fraction of bugs before you reach for anything else.

The harder lesson is what this example nearly was. A missing column raises; a zero in active_days does not. Divide a pandas Series by zero and you get inf back, quietly, with no traceback at all — and inf propagates through means and sums into a report without anything going red. Failures that announce themselves are the easy ones, precisely because a traceback is a starting point handed to you. We return to that in Technical debt, where a missing zero-guard is the running example of a shortcut that stays invisible until the data changes.

8.3 A real debugger

When the traceback isn’t enough — the code doesn’t crash, it just produces the wrong answer — the print approach means re-running with ever more print statements. A debugger replaces all of that by pausing the program at the point you care about and letting you inspect the live state directly. Drop breakpoint() into the code and execution stops there, handing you a prompt where you can print any variable, evaluate expressions, step line by line, and try out a fix against the real data.

def build_features(df):
    breakpoint()                      # execution pauses here, in the live state
    df["spend_per_day"] = df["spend"] / df["active_days"]
    return df

# At the (Pdb) prompt:
#   p df["active_days"].min()     inspect: is there a zero?
#   n                             run the next line
#   c                             continue to the end (or next breakpoint)
#   q                             quit

This is the difference between asking “what was in this variable when I last printed it?” and “what is the entire state at the moment things go wrong?”. Editors such as VS Code wrap the same machinery in a visual interface — click a line to set a breakpoint, hover to inspect variables — but the command-line pdb works everywhere, including on the remote servers from Chapter 4 where there’s no GUI at all.

There’s one more entry point worth knowing, and it’s the one that fits the notebook workflow best. When a cell has already raised, you don’t need to re-run it with a breakpoint() inserted: type %debug in the next cell and IPython drops you into the debugger at the exact frame that failed, with the state that caused the failure still intact.

# A cell just raised KeyError: 'active_days'. In the next cell:
%debug

# At the (ipdb) prompt, in the failing frame:
#   u / d                    move up and down the call stack
#   p df.columns.tolist()    what columns did it actually have?
#   p raw.shape              inspect any frame's locals

This is pdb.post_mortem() under a shorter name, and it changes the economics of the traceback. The frames the traceback merely names become frames you can interrogate — you stop guessing what df contained and simply ask it. For the read-a-traceback-then-form-a-hypothesis loop this chapter is built around, %debug is the fastest path from the second step to the answer, and it costs nothing to try before you reach for anything else. The one caveat is that it inspects the most recent exception, so run it before executing another cell that raises.

Data Science Bridge

Debugging is the scientific method you already practise on models, applied to code. When a model won’t converge or scores worse than it should, you don’t randomly flip settings; you form a hypothesis (“the target is leaking”, “the features aren’t scaled”), isolate the relevant part, and check it against evidence. A bug is identical: you hypothesise (“the join is dropping rows”, “this column has unexpected zeros”), isolate the step, and inspect the state to confirm or reject it. A debugger is simply the instrument that lets you observe the internal state, the way a confusion matrix or a plot of intermediate activations lets you observe what a model is doing.

Where it breaks down: a model’s failure is usually statistical and graded — it’s 8% worse than you hoped — whereas a code bug is typically discrete and reproducible, and once isolated it’s either present or it isn’t. Debugging rewards the same disciplined hypothesis-testing, but it terminates in a definite “found it”, which model diagnosis rarely does.

8.4 Logging instead of print

For understanding code while it runs — especially code that runs unattended — logging is what print is reaching toward but never quite becomes. A log message carries a severity level (DEBUG, INFO, WARNING, ERROR), a timestamp, and its source, and a single setting controls how much you see, so the same statements that help you develop become the diagnostics you keep in production, turned down to INFO or up to DEBUG without touching the code.

import logging
import io

# Send formatted log records to a buffer so we can display them here.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(asctime)s  %(levelname)-7s  %(message)s",
                                        datefmt="%H:%M:%S"))
logger = logging.getLogger("pipeline")
logger.handlers, logger.level = [handler], logging.INFO   # one switch sets verbosity

logger.debug("loaded 200,000 rows")                   # below INFO — suppressed
logger.info("dropped 1,284 rows with a missing target")
logger.warning("12 rows had negative spend; clipped to zero")

print(buffer.getvalue(), end="")

07:11:46  INFO     dropped 1,284 rows with a missing target
07:11:46  WARNING  12 rows had negative spend; clipped to zero

The DEBUG line is suppressed because the level is set to INFO; the INFO and WARNING lines appear, each timestamped and labelled. None of this is possible with print, which has no levels, no structure, and no off switch short of deleting the lines. This is also the on-ramp to production observability — the same logging becomes the basis for the monitoring we cover in Monitoring and observability.

8.5 Finding the slow part

Performance is the other diagnosis problem, and the cardinal rule is: measure, don’t guess. Programmers are notoriously bad at predicting where time goes, and data scientists doubly so, because the expensive operation is rarely the one that looks expensive. A profiler tells you the truth: cProfile attributes time to each function and line_profiler goes line by line. For memory rather than time, memray tracks every allocation and will point at the line that made the copy you didn’t know about (Linux and macOS only), while scalene covers CPU and memory together and runs on Windows too. You may still see memory_profiler recommended in older material; it has been unmaintained since 2022, and the two above have replaced it. Almost always, one hot spot dominates, and once a profiler points at it, the fix is frequently the same — replace a Python-level loop over a DataFrame with a vectorised operation.

import numpy as np
import pandas as pd
import time

rng = np.random.default_rng(0)
df = pd.DataFrame({"spend": rng.exponential(50, 200_000),
                   "active_days": rng.integers(1, 365, 200_000)})

def spend_per_day_loop(df):           # the notebook reflex: iterate rows
    return [row.spend / row.active_days for row in df.itertuples()]

def spend_per_day_vectorised(df):     # the same computation, all at once
    return df["spend"] / df["active_days"]

t0 = time.perf_counter(); spend_per_day_loop(df);       loop = time.perf_counter() - t0
t0 = time.perf_counter(); spend_per_day_vectorised(df); vec = time.perf_counter() - t0

print(f"Row loop:    {loop * 1000:8.1f} ms")
print(f"Vectorised:  {vec * 1000:8.1f} ms")
print(f"Speed-up:    {loop / vec:8.0f}x")

Row loop:        75.7 ms
Vectorised:       0.7 ms
Speed-up:         107x

The two functions compute identical results; the vectorised version is faster by a wide margin, because it pushes the work into NumPy’s compiled code instead of looping in Python. The exact ratio depends on your hardware, but it’s routinely hundreds of times — the difference between a pipeline that takes a coffee break and one that takes an afternoon. The discipline, though, is to profile first: optimise the hot spot the profiler identifies, not the part you assume is slow, and don’t optimise at all until something is actually too slow to live with.

Author’s Note

Print-debugging is genuinely effective in a notebook, and that’s exactly why data scientists never move past it. The notebook keeps the kernel state live, so a quick print shows you what’s in a variable right now, immediately — there’s no friction and no reason to learn anything else. The trap is that this works right up until it doesn’t: until the code runs on a server you’re not watching, fails one time in fifty, or is slow rather than wrong, and then print has no answer.

The reframe is that “debugging” is really four different questions, and print only answers the first. What’s in this variable right now? — a print, fine. What is the whole state at the moment it breaks? — a debugger. What happened during a run I wasn’t watching? — logging. Where did the time go? — a profiler. Reaching for print regardless is like using a holdout score to answer every question about a model: it’s the right tool for one job and a poor stand-in for the others. The investment isn’t large — each tool is a few minutes to learn — and it converts “I can’t reproduce it” and “it’s slow for some reason” from afternoons into minutes.

8.6 Summary

Diagnosis has better tools than print, one per question:

Read the traceback first. It reads bottom-up: the last line is what went wrong, and the deepest frame in your own code is where to look. Most errors are answered here.
Use a debugger for wrong answers. breakpoint() pauses the program in its live state so you can inspect everything at the moment of failure, rather than re-running with more prints.
Log, don’t print, for code that runs unattended. Levels, timestamps, and a single verbosity switch turn development diagnostics into production observability.
Profile before you optimise. Measure where the time actually goes; fix the one hot spot that dominates, usually by vectorising — and don’t optimise code that’s fast enough.

This closes Part 2. With code that is readable, reusable, tested, and diagnosable, Part 3 steps back from individual files to the shape of a whole project, beginning with project structure.

8.7 Exercises

Find a recent traceback from your own work (or trigger one deliberately) and read it bottom-up. Identify the exception and message, the deepest frame in your own code, and the one fact you’d check first. How much of the bug did the traceback alone explain?
Take a bug you’d normally chase with print statements and use a debugger instead: drop in breakpoint() (or set one in your editor), inspect the relevant variables at the point of failure, and find the cause without re-running. What did seeing the full live state show you that a print would have missed?
Replace the print statements in one of your scripts with logging, using levels appropriately (INFO for milestones, WARNING for recoverable oddities, DEBUG for detail). Then change one setting to switch between a quiet and a verbose run without editing any of the log statements.
Take a slow piece of your own code and profile it with cProfile (or time the suspect parts directly). Was the hot spot where you expected? Fix the dominant one — vectorising a loop, caching a repeated computation — and measure the improvement.
Conceptual: The Author’s Note frames debugging as four questions, each with its own tool. For each of the four — a variable’s current value, the full state at a failure, what happened in an unwatched run, and where the time went — name the tool, and explain why print is an inadequate substitute for three of them.

--- # Content: CC BY-NC-SA 4.0 | Code: MIT - see /LICENSE.md --- # Debugging and profiling {#sec-debugging} ## From print to a process {#sec-print-to-process} Two scenes, both familiar. In the first, something in your notebook is producing wrong numbers, so you scatter `print(df.shape)` and `print(df.head())` and `print("got here")` between the cells, re-running and squinting, slowly triangulating where it goes wrong. In the second, a pipeline that should take seconds takes forty minutes, and you have no idea which part is slow, so you start commenting things out to see what helps. Both are diagnosis problems, and in both, data science folklore reaches for the same blunt instrument: `print`. It works — that's why everyone does it — but it doesn't scale to code that runs unattended, fails only sometimes, or is slow rather than wrong. Software engineering has sharper tools for each of these questions, and none of them is harder to use than scattering print statements; they're just unfamiliar. This chapter is about trading the blunt instrument for the right one. ## Reading the traceback {#sec-tracebacks} Before any tool, there's a skill: actually reading the traceback. Many "mysterious" errors aren't mysterious at all — the answer is in the traceback, which most people skim in a panic instead of reading. A traceback is a map of how execution reached the error, and it reads **bottom to top**. The last line names the exception and its message — what actually went wrong. The lines above it are the chain of calls that led there, deepest last, and somewhere in that chain is the deepest frame *in your own code*, which is almost always where to look first. ```text Traceback (most recent call last): File "pipeline.py", line 42, in <module> result = build_features(raw) File "features.py", line 18, in build_features df["spend_per_day"] = df["spend"] / df["active_days"] File ".../pandas/core/frame.py", line 4102, in __getitem__ indexer = self.columns.get_loc(key) ... KeyError: 'active_days' ``` The bottom line tells you *what* (a column that isn't there); the frame in `features.py` tells you *where* (line 18); together they tell you *what to check* (whether the upstream step renamed or dropped `active_days`). Note also which frames to skip: the deepest frames here are inside pandas itself, and they're rarely where the bug lives. The deepest frame in *your* code — `features.py`, line 18 — is the one to read first. Reading the traceback properly answers a large fraction of bugs before you reach for anything else. The harder lesson is what this example nearly was. A missing column raises; a zero in `active_days` does not. Divide a pandas Series by zero and you get `inf` back, quietly, with no traceback at all — and `inf` propagates through means and sums into a report without anything going red. Failures that announce themselves are the easy ones, precisely because a traceback is a starting point handed to you. We return to that in *Technical debt*, where a missing zero-guard is the running example of a shortcut that stays invisible until the data changes. ## A real debugger {#sec-debugger} When the traceback isn't enough — the code doesn't crash, it just produces the wrong answer — the print approach means re-running with ever more print statements. A debugger replaces all of that by pausing the program at the point you care about and letting you inspect the live state directly. Drop `breakpoint()` into the code and execution stops there, handing you a prompt where you can print any variable, evaluate expressions, step line by line, and try out a fix against the real data. ```python def build_features(df): breakpoint() # execution pauses here, in the live state df["spend_per_day"] = df["spend"] / df["active_days"] return df # At the (Pdb) prompt: # p df["active_days"].min() inspect: is there a zero? # n run the next line # c continue to the end (or next breakpoint) # q quit ``` This is the difference between asking "what was in this variable when I last printed it?" and "what is the *entire* state at the moment things go wrong?". Editors such as VS Code wrap the same machinery in a visual interface — click a line to set a breakpoint, hover to inspect variables — but the command-line `pdb` works everywhere, including on the remote servers from @sec-command-line where there's no GUI at all. There's one more entry point worth knowing, and it's the one that fits the notebook workflow best. When a cell has *already* raised, you don't need to re-run it with a `breakpoint()` inserted: type `%debug` in the next cell and IPython drops you into the debugger at the exact frame that failed, with the state that caused the failure still intact. ```python # A cell just raised KeyError: 'active_days'. In the next cell: %debug # At the (ipdb) prompt, in the failing frame: # u / d move up and down the call stack # p df.columns.tolist() what columns did it actually have? # p raw.shape inspect any frame's locals ``` This is `pdb.post_mortem()` under a shorter name, and it changes the economics of the traceback. The frames the traceback merely *names* become frames you can interrogate — you stop guessing what `df` contained and simply ask it. For the read-a-traceback-then-form-a-hypothesis loop this chapter is built around, `%debug` is the fastest path from the second step to the answer, and it costs nothing to try before you reach for anything else. The one caveat is that it inspects the most recent exception, so run it before executing another cell that raises. ::: {.callout-note} ## Data Science Bridge Debugging is the scientific method you already practise on models, applied to code. When a model won't converge or scores worse than it should, you don't randomly flip settings; you form a hypothesis ("the target is leaking", "the features aren't scaled"), isolate the relevant part, and check it against evidence. A bug is identical: you hypothesise ("the join is dropping rows", "this column has unexpected zeros"), isolate the step, and inspect the state to confirm or reject it. A debugger is simply the instrument that lets you observe the internal state, the way a confusion matrix or a plot of intermediate activations lets you observe what a model is doing. Where it breaks down: a model's failure is usually statistical and graded — it's 8% worse than you hoped — whereas a code bug is typically discrete and reproducible, and once isolated it's either present or it isn't. Debugging rewards the same disciplined hypothesis-testing, but it terminates in a definite "found it", which model diagnosis rarely does. ::: ## Logging instead of print {#sec-logging} For understanding code while it runs — especially code that runs unattended — `logging` is what `print` is reaching toward but never quite becomes. A log message carries a severity level (`DEBUG`, `INFO`, `WARNING`, `ERROR`), a timestamp, and its source, and a single setting controls how much you see, so the same statements that help you develop become the diagnostics you keep in production, turned down to `INFO` or up to `DEBUG` without touching the code. ```{python} #| label: logging-vs-print #| echo: true import logging import io # Send formatted log records to a buffer so we can display them here. buffer = io.StringIO() handler = logging.StreamHandler(buffer) handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)-7s %(message)s", datefmt="%H:%M:%S")) logger = logging.getLogger("pipeline") logger.handlers, logger.level = [handler], logging.INFO # one switch sets verbosity logger.debug("loaded 200,000 rows") # below INFO — suppressed logger.info("dropped 1,284 rows with a missing target") logger.warning("12 rows had negative spend; clipped to zero") print(buffer.getvalue(), end="") ``` The `DEBUG` line is suppressed because the level is set to `INFO`; the `INFO` and `WARNING` lines appear, each timestamped and labelled. None of this is possible with `print`, which has no levels, no structure, and no off switch short of deleting the lines. This is also the on-ramp to production observability — the same logging becomes the basis for the monitoring we cover in *Monitoring and observability*. ## Finding the slow part {#sec-profiling} Performance is the other diagnosis problem, and the cardinal rule is: **measure, don't guess.** Programmers are notoriously bad at predicting where time goes, and data scientists doubly so, because the expensive operation is rarely the one that looks expensive. A profiler tells you the truth: `cProfile` attributes time to each function and `line_profiler` goes line by line. For memory rather than time, `memray` tracks every allocation and will point at the line that made the copy you didn't know about (Linux and macOS only), while `scalene` covers CPU and memory together and runs on Windows too. You may still see `memory_profiler` recommended in older material; it has been unmaintained since 2022, and the two above have replaced it. Almost always, one hot spot dominates, and once a profiler points at it, the fix is frequently the same — replace a Python-level loop over a DataFrame with a vectorised operation. ```{python} #| label: profile-and-fix #| echo: true import numpy as np import pandas as pd import time rng = np.random.default_rng(0) df = pd.DataFrame({"spend": rng.exponential(50, 200_000), "active_days": rng.integers(1, 365, 200_000)}) def spend_per_day_loop(df): # the notebook reflex: iterate rows return [row.spend / row.active_days for row in df.itertuples()] def spend_per_day_vectorised(df): # the same computation, all at once return df["spend"] / df["active_days"] t0 = time.perf_counter(); spend_per_day_loop(df); loop = time.perf_counter() - t0 t0 = time.perf_counter(); spend_per_day_vectorised(df); vec = time.perf_counter() - t0 print(f"Row loop: {loop * 1000:8.1f} ms") print(f"Vectorised: {vec * 1000:8.1f} ms") print(f"Speed-up: {loop / vec:8.0f}x") ``` The two functions compute identical results; the vectorised version is faster by a wide margin, because it pushes the work into NumPy's compiled code instead of looping in Python. The exact ratio depends on your hardware, but it's routinely hundreds of times — the difference between a pipeline that takes a coffee break and one that takes an afternoon. The discipline, though, is to profile *first*: optimise the hot spot the profiler identifies, not the part you assume is slow, and don't optimise at all until something is actually too slow to live with. ::: {.callout-tip} ## Author's Note Print-debugging is genuinely effective in a notebook, and that's exactly why data scientists never move past it. The notebook keeps the kernel state live, so a quick `print` shows you what's in a variable right now, immediately — there's no friction and no reason to learn anything else. The trap is that this works right up until it doesn't: until the code runs on a server you're not watching, fails one time in fifty, or is slow rather than wrong, and then `print` has no answer. The reframe is that "debugging" is really four different questions, and `print` only answers the first. *What's in this variable right now?* — a print, fine. *What is the whole state at the moment it breaks?* — a debugger. *What happened during a run I wasn't watching?* — logging. *Where did the time go?* — a profiler. Reaching for `print` regardless is like using a holdout score to answer every question about a model: it's the right tool for one job and a poor stand-in for the others. The investment isn't large — each tool is a few minutes to learn — and it converts "I can't reproduce it" and "it's slow for some reason" from afternoons into minutes. ::: ## Summary {#sec-debugging-summary} Diagnosis has better tools than `print`, one per question: 1. **Read the traceback first.** It reads bottom-up: the last line is what went wrong, and the deepest frame in your own code is where to look. Most errors are answered here. 2. **Use a debugger for wrong answers.** `breakpoint()` pauses the program in its live state so you can inspect everything at the moment of failure, rather than re-running with more prints. 3. **Log, don't print, for code that runs unattended.** Levels, timestamps, and a single verbosity switch turn development diagnostics into production observability. 4. **Profile before you optimise.** Measure where the time actually goes; fix the one hot spot that dominates, usually by vectorising — and don't optimise code that's fast enough. This closes Part 2. With code that is readable, reusable, tested, and diagnosable, Part 3 steps back from individual files to the shape of a whole project, beginning with *project structure*. ## Exercises {#sec-debugging-exercises} 1. Find a recent traceback from your own work (or trigger one deliberately) and read it bottom-up. Identify the exception and message, the deepest frame in your own code, and the one fact you'd check first. How much of the bug did the traceback alone explain? 2. Take a bug you'd normally chase with `print` statements and use a debugger instead: drop in `breakpoint()` (or set one in your editor), inspect the relevant variables at the point of failure, and find the cause without re-running. What did seeing the full live state show you that a print would have missed? 3. Replace the `print` statements in one of your scripts with `logging`, using levels appropriately (`INFO` for milestones, `WARNING` for recoverable oddities, `DEBUG` for detail). Then change one setting to switch between a quiet and a verbose run without editing any of the log statements. 4. Take a slow piece of your own code and profile it with `cProfile` (or time the suspect parts directly). Was the hot spot where you expected? Fix the dominant one — vectorising a loop, caching a repeated computation — and measure the improvement. 5. **Conceptual:** The Author's Note frames debugging as four questions, each with its own tool. For each of the four — a variable's current value, the full state at a failure, what happened in an unwatched run, and where the time went — name the tool, and explain why `print` is an inadequate substitute for three of them.