Building with Certainty

Transpose: Data Science to Software Engineering

Author

Matthew Gibbons

Published

20 July 2026

Preface

Who this book is for

Your model works. It scores well on the holdout set, the stakeholders like the dashboard, and the notebook runs top-to-bottom — most of the time. But then someone asks you to run it on new data, and you can’t quite remember which cell to skip. Or a colleague tries to reproduce your results and gets different numbers. Or someone says the word “deploy” and the room goes quiet.

You’re a data scientist. You can wrangle data, build models, and extract insight from noise. But nobody taught you how to build software that other people can run, test, maintain, and trust — and increasingly, that’s what’s being asked of you.

Most software engineering resources assume you either have a computer science degree or are starting from scratch. The first kind buries you in design patterns and type theory that feel disconnected from your work. The second kind explains what a variable is. This book does neither. It starts from what you already know — Python fluency, Jupyter notebooks, the iterative rhythm of exploratory analysis — and builds the engineering practices you need to make your work reliable, reproducible, and production-ready.

This is not a book about becoming a software engineer. It’s a book about becoming a data scientist whose code works beyond their own laptop.

What this book covers

The book is organised into six parts, each building on the last.

Part 1: Foundations starts with the uncomfortable truth that a notebook is not a system. It introduces the engineering mindset — version control, reproducible environments, and command-line fluency — as the foundation everything else rests on. If your current workflow involves emailing .ipynb files or a folder called model_final_v3_ACTUAL.py, this is where things start to change.

Part 2: Code Quality is about writing code that lasts longer than a single analysis. Readable code, modular design, testing (yes, even when outputs are stochastic), and systematic debugging. These aren’t bureaucratic hoops; they’re the practices that let you trust your own work six months from now.

Part 3: Architecture zooms out from individual files to the shape of a whole project. How to structure a data science project so others can navigate it. How to build data pipelines that are composable and testable. How to manage configuration without hard-coding paths. How to expose a model through an API.

Part 4: Operations takes your code from “it runs on my machine” to “it runs reliably, automatically, and at scale.” Continuous integration, containerisation, deployment strategies, and monitoring. This is where data science meets the real world — and where a notebook, however good the work inside it, stops being enough on its own.

Part 5: Collaboration addresses the human side. Code review, documentation that people actually read, technical debt (every data scientist has a notebook graveyard), and working effectively with software engineers who speak a slightly different language.

Part 6: Applied puts everything together in three end-to-end projects: taking a notebook to a production API, building a reproducible research pipeline, and constructing an MLOps pipeline for real-time predictions. Each project uses the practices from Parts 1–5 in context, showing how they fit together.

Four appendices provide supporting material: a tooling reference for the libraries and tools used throughout the book, a concept bridge mapping data science terms to software engineering terms, a reading list, and worked answers to all chapter exercises.

A note on scope. This is not a systems programming book; there are no memory models or compiler internals. It is not a “learn Python” book; Python fluency is assumed. And the treatment is practical and applied — aimed at the working data scientist who needs to ship reliable code, not at someone pursuing a software architecture certification.

How this book works

Every concept follows the same rhythm: start with a situation you recognise, show what goes wrong without engineering practices, then introduce the practice and demonstrate the improvement. Code examples are executable and use realistic data science contexts — you’ll test feature engineering functions, not calculators.

Throughout, two types of callout bridge the gap between data science and engineering thinking:

Data Science Bridge

These connect a software engineering concept to something you already understand from data science. For example, a test suite is like a holdout set for your code — it catches problems you didn’t anticipate by evaluating against cases the development process didn’t see. Where the analogy holds, we push it. Where it breaks down, we say so.

Author’s Note

These are observations about the gap between data science and software engineering practice: why certain engineering habits feel unnatural to data scientists, why that resistance is often reasonable, and what changes when you adopt the practice anyway. They focus on the conceptual tension, not prescriptive rules.

All code in this book is executable Python, and the source that produced each example is shown alongside it. Engineering is a practice, not a theory, and the code shows you what that practice looks like in a data science context.

What you’ll need

Python fluency is assumed. You know pandas, numpy, and scikit-learn. You can write a class, use a context manager, and read a stack trace.

Data science experience — you’ve built models, cleaned data, and presented results. You know what a training/test split is and why you do it. The statistical and ML concepts are your home ground; this book won’t explain them.

No prior software engineering training. That is what this book provides, from first principles.

Running the code

All code examples are executable Python. You can install the dependencies with:

pip install -r requirements.txt

Getting started

The book is designed to be read in order. Each part builds on the last, and later chapters assume practices introduced earlier. That said, if you want a feel for where this is heading, glance at Part 4 or the end-to-end projects in Part 6 to see what the payoff looks like. Then start at Part 1, 1 From notebook to system: the gap between a notebook that works and a system you can trust is where everything begins.

--- number-sections: false --- # Preface {.unnumbered} ## Who this book is for Your model works. It scores well on the holdout set, the stakeholders like the dashboard, and the notebook runs top-to-bottom — most of the time. But then someone asks you to run it on new data, and you can't quite remember which cell to skip. Or a colleague tries to reproduce your results and gets different numbers. Or someone says the word "deploy" and the room goes quiet. You're a data scientist. You can wrangle data, build models, and extract insight from noise. But nobody taught you how to build software that other people can run, test, maintain, and trust — and increasingly, that's what's being asked of you. Most software engineering resources assume you either have a computer science degree or are starting from scratch. The first kind buries you in design patterns and type theory that feel disconnected from your work. The second kind explains what a variable is. This book does neither. It starts from what you already know — Python fluency, Jupyter notebooks, the iterative rhythm of exploratory analysis — and builds the engineering practices you need to make your work reliable, reproducible, and production-ready. This is not a book about becoming a software engineer. It's a book about becoming a data scientist whose code works beyond their own laptop. ## What this book covers The book is organised into six parts, each building on the last. **Part 1: Foundations** starts with the uncomfortable truth that a notebook is not a system. It introduces the engineering mindset — version control, reproducible environments, and command-line fluency — as the foundation everything else rests on. If your current workflow involves emailing `.ipynb` files or a folder called `model_final_v3_ACTUAL.py`, this is where things start to change. **Part 2: Code Quality** is about writing code that lasts longer than a single analysis. Readable code, modular design, testing (yes, even when outputs are stochastic), and systematic debugging. These aren't bureaucratic hoops; they're the practices that let you trust your own work six months from now. **Part 3: Architecture** zooms out from individual files to the shape of a whole project. How to structure a data science project so others can navigate it. How to build data pipelines that are composable and testable. How to manage configuration without hard-coding paths. How to expose a model through an API. **Part 4: Operations** takes your code from "it runs on my machine" to "it runs reliably, automatically, and at scale." Continuous integration, containerisation, deployment strategies, and monitoring. This is where data science meets the real world — and where a notebook, however good the work inside it, stops being enough on its own. **Part 5: Collaboration** addresses the human side. Code review, documentation that people actually read, technical debt (every data scientist has a notebook graveyard), and working effectively with software engineers who speak a slightly different language. **Part 6: Applied** puts everything together in three end-to-end projects: taking a notebook to a production API, building a reproducible research pipeline, and constructing an MLOps pipeline for real-time predictions. Each project uses the practices from Parts 1–5 in context, showing how they fit together. Four appendices provide supporting material: a tooling reference for the libraries and tools used throughout the book, a concept bridge mapping data science terms to software engineering terms, a reading list, and worked answers to all chapter exercises. A note on scope. This is not a systems programming book; there are no memory models or compiler internals. It is not a "learn Python" book; Python fluency is assumed. And the treatment is practical and applied — aimed at the working data scientist who needs to ship reliable code, not at someone pursuing a software architecture certification. ## How this book works Every concept follows the same rhythm: start with a situation you recognise, show what goes wrong without engineering practices, then introduce the practice and demonstrate the improvement. Code examples are executable and use realistic data science contexts — you'll test feature engineering functions, not calculators. Throughout, two types of callout bridge the gap between data science and engineering thinking: ::: {.callout-note} ## Data Science Bridge These connect a software engineering concept to something you already understand from data science. For example, a test suite is like a holdout set for your code — it catches problems you didn't anticipate by evaluating against cases the development process didn't see. Where the analogy holds, we push it. Where it breaks down, we say so. ::: ::: {.callout-tip} ## Author's Note These are observations about the gap between data science and software engineering practice: why certain engineering habits feel unnatural to data scientists, why that resistance is often reasonable, and what changes when you adopt the practice anyway. They focus on the conceptual tension, not prescriptive rules. ::: All code in this book is executable Python, and the source that produced each example is shown alongside it. Engineering is a practice, not a theory, and the code shows you what that practice looks like in a data science context. ## What you'll need **Python fluency** is assumed. You know pandas, numpy, and scikit-learn. You can write a class, use a context manager, and read a stack trace. **Data science experience** — you've built models, cleaned data, and presented results. You know what a training/test split is and why you do it. The statistical and ML concepts are your home ground; this book won't explain them. **No prior software engineering training.** That is what this book provides, from first principles. ## Running the code All code examples are executable Python. You can install the dependencies with: ```bash pip install -r requirements.txt ```  ## Getting started The book is designed to be read in order. Each part builds on the last, and later chapters assume practices introduced earlier. That said, if you want a feel for where this is heading, glance at Part 4 or the end-to-end projects in Part 6 to see what the payoff looks like. Then start at Part 1, @sec-notebook-to-system: the gap between a notebook that works and a system you can trust is where everything begins.