8  Debugging and profiling

8.1 From print to a process

Two scenes, both familiar. In the first, something in your notebook is producing wrong numbers, so you scatter print(df.shape) and print(df.head()) and print("got here") between the cells, re-running and squinting, slowly triangulating where it goes wrong. In the second, a pipeline that should take seconds takes forty minutes, and you have no idea which part is slow, so you start commenting things out to see what helps.

Both are diagnosis problems, and in both, data science folklore reaches for the same blunt instrument: print. It works — that’s why everyone does it — but it doesn’t scale to code that runs unattended, fails only sometimes, or is slow rather than wrong. Software engineering has sharper tools for each of these questions, and none of them is harder to use than scattering print statements; they’re just unfamiliar. This chapter is about trading the blunt instrument for the right one.

8.2 Reading the traceback

Before any tool, there’s a skill: actually reading the traceback. Many “mysterious” errors aren’t mysterious at all — the answer is in the traceback, which most people skim in a panic instead of reading. A traceback is a map of how execution reached the error, and it reads bottom to top. The last line names the exception and its message — what actually went wrong. The lines above it are the chain of calls that led there, deepest last, and somewhere in that chain is the deepest frame in your own code, which is almost always where to look first.

Traceback (most recent call last):
  File "pipeline.py", line 42, in <module>
    result = build_features(raw)
  File "features.py", line 18, in build_features
    df["spend_per_day"] = df["spend"] / df["active_days"]
  ...
ZeroDivisionError: division by zero

The bottom line tells you what (a division by zero); the frame in features.py tells you where (line 18, dividing by active_days); together they tell you what to check (an active_days of zero). Reading the traceback properly answers a large fraction of bugs before you reach for anything else.

8.3 A real debugger

When the traceback isn’t enough — the code doesn’t crash, it just produces the wrong answer — the print approach means re-running with ever more print statements. A debugger replaces all of that by pausing the program at the point you care about and letting you inspect the live state directly. Drop breakpoint() into the code and execution stops there, handing you a prompt where you can print any variable, evaluate expressions, step line by line, and try out a fix against the real data.

def build_features(df):
    breakpoint()                      # execution pauses here, in the live state
    df["spend_per_day"] = df["spend"] / df["active_days"]
    return df

# At the (Pdb) prompt:
#   p df["active_days"].min()     inspect: is there a zero?
#   n                             run the next line
#   c                             continue to the end (or next breakpoint)
#   q                             quit

This is the difference between asking “what was in this variable when I last printed it?” and “what is the entire state at the moment things go wrong?”. Editors such as VS Code wrap the same machinery in a visual interface — click a line to set a breakpoint, hover to inspect variables — but the command-line pdb works everywhere, including on the remote servers from Chapter 4 where there’s no GUI at all.

NoteData Science Bridge

Debugging is the scientific method you already practise on models, applied to code. When a model won’t converge or scores worse than it should, you don’t randomly flip settings; you form a hypothesis (“the target is leaking”, “the features aren’t scaled”), isolate the relevant part, and check it against evidence. A bug is identical: you hypothesise (“the join is dropping rows”, “this column has unexpected zeros”), isolate the step, and inspect the state to confirm or reject it. A debugger is simply the instrument that lets you observe the internal state, the way a confusion matrix or a plot of intermediate activations lets you observe what a model is doing.

Where it breaks down: a model’s failure is usually statistical and graded — it’s 8% worse than you hoped — whereas a code bug is typically discrete and reproducible, and once isolated it’s either present or it isn’t. Debugging rewards the same disciplined hypothesis-testing, but it terminates in a definite “found it”, which model diagnosis rarely does.

8.4 Logging instead of print

For understanding code while it runs — especially code that runs unattended — logging is what print is reaching toward but never quite becomes. A log message carries a severity level (DEBUG, INFO, WARNING, ERROR), a timestamp, and its source, and a single setting controls how much you see, so the same statements that help you develop become the diagnostics you keep in production, turned down to INFO or up to DEBUG without touching the code.

import logging
import io

# Send formatted log records to a buffer so we can display them here.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(asctime)s  %(levelname)-7s  %(message)s",
                                        datefmt="%H:%M:%S"))
logger = logging.getLogger("pipeline")
logger.handlers, logger.level = [handler], logging.INFO   # one switch sets verbosity

logger.debug("loaded 200,000 rows")                   # below INFO — suppressed
logger.info("dropped 1,284 rows with a missing target")
logger.warning("12 rows had negative spend; clipped to zero")

print(buffer.getvalue(), end="")
22:03:08  INFO     dropped 1,284 rows with a missing target
22:03:08  WARNING  12 rows had negative spend; clipped to zero

The DEBUG line is suppressed because the level is set to INFO; the INFO and WARNING lines appear, each timestamped and labelled. None of this is possible with print, which has no levels, no structure, and no off switch short of deleting the lines. This is also the on-ramp to production observability — the same logging becomes the basis for the monitoring we cover in Monitoring and observability.

8.5 Finding the slow part

Performance is the other diagnosis problem, and the cardinal rule is: measure, don’t guess. Programmers are notoriously bad at predicting where time goes, and data scientists doubly so, because the expensive operation is rarely the one that looks expensive. A profiler tells you the truth: cProfile attributes time to each function, line_profiler goes line by line, and memory_profiler does the same for memory. Almost always, one hot spot dominates, and once a profiler points at it, the fix is frequently the same — replace a Python-level loop over a DataFrame with a vectorised operation.

import numpy as np
import pandas as pd
import time

rng = np.random.default_rng(0)
df = pd.DataFrame({"spend": rng.exponential(50, 200_000),
                   "active_days": rng.integers(1, 365, 200_000)})

def spend_per_day_loop(df):           # the notebook reflex: iterate rows
    return [row.spend / row.active_days for row in df.itertuples()]

def spend_per_day_vectorised(df):     # the same computation, all at once
    return df["spend"] / df["active_days"]

t0 = time.perf_counter(); spend_per_day_loop(df);       loop = time.perf_counter() - t0
t0 = time.perf_counter(); spend_per_day_vectorised(df); vec = time.perf_counter() - t0

print(f"Row loop:    {loop * 1000:8.1f} ms")
print(f"Vectorised:  {vec * 1000:8.1f} ms")
print(f"Speed-up:    {loop / vec:8.0f}x")
Row loop:       127.5 ms
Vectorised:       0.8 ms
Speed-up:         151x

The two functions compute identical results; the vectorised version is faster by a wide margin, because it pushes the work into NumPy’s compiled code instead of looping in Python. The exact ratio depends on your hardware, but it’s routinely hundreds of times — the difference between a pipeline that takes a coffee break and one that takes an afternoon. The discipline, though, is to profile first: optimise the hot spot the profiler identifies, not the part you assume is slow, and don’t optimise at all until something is actually too slow to live with.

TipAuthor’s Note

Print-debugging is genuinely effective in a notebook, and that’s exactly why data scientists never move past it. The notebook keeps the kernel state live, so a quick print shows you what’s in a variable right now, immediately — there’s no friction and no reason to learn anything else. The trap is that this works right up until it doesn’t: until the code runs on a server you’re not watching, fails one time in fifty, or is slow rather than wrong, and then print has no answer.

The reframe is that “debugging” is really four different questions, and print only answers the first. What’s in this variable right now? — a print, fine. What is the whole state at the moment it breaks? — a debugger. What happened during a run I wasn’t watching? — logging. Where did the time go? — a profiler. Reaching for print regardless is like using a holdout score to answer every question about a model: it’s the right tool for one job and a poor stand-in for the others. The investment isn’t large — each tool is a few minutes to learn — and it converts “I can’t reproduce it” and “it’s slow for some reason” from afternoons into minutes.

8.6 Summary

Diagnosis has better tools than print, one per question:

  1. Read the traceback first. It reads bottom-up: the last line is what went wrong, and the deepest frame in your own code is where to look. Most errors are answered here.

  2. Use a debugger for wrong answers. breakpoint() pauses the program in its live state so you can inspect everything at the moment of failure, rather than re-running with more prints.

  3. Log, don’t print, for code that runs unattended. Levels, timestamps, and a single verbosity switch turn development diagnostics into production observability.

  4. Profile before you optimise. Measure where the time actually goes; fix the one hot spot that dominates, usually by vectorising — and don’t optimise code that’s fast enough.

This closes Part 2. With code that is readable, reusable, tested, and diagnosable, Part 3 steps back from individual files to the shape of a whole project, beginning with project structure.

8.7 Exercises

  1. Find a recent traceback from your own work (or trigger one deliberately) and read it bottom-up. Identify the exception and message, the deepest frame in your own code, and the one fact you’d check first. How much of the bug did the traceback alone explain?

  2. Take a bug you’d normally chase with print statements and use a debugger instead: drop in breakpoint() (or set one in your editor), inspect the relevant variables at the point of failure, and find the cause without re-running. What did seeing the full live state show you that a print would have missed?

  3. Replace the print statements in one of your scripts with logging, using levels appropriately (INFO for milestones, WARNING for recoverable oddities, DEBUG for detail). Then change one setting to switch between a quiet and a verbose run without editing any of the log statements.

  4. Take a slow piece of your own code and profile it with cProfile (or time the suspect parts directly). Was the hot spot where you expected? Fix the dominant one — vectorising a loop, caching a repeated computation — and measure the improvement.

  5. Conceptual: The Author’s Note frames debugging as four questions, each with its own tool. For each of the four — a variable’s current value, the full state at a failure, what happened in an unwatched run, and where the time went — name the tool, and explain why print is an inadequate substitute for three of them.