Appendix C — Reading list

If a chapter left you wanting more depth, more rigour, or a different voice on the same idea, the resources below are where to look. They are organised by purpose rather than by chapter, and each entry includes enough context for you to decide whether it is worth your time. The list is deliberately short: these are the books and resources that repay a data scientist’s attention most directly, not an exhaustive software engineering syllabus.

C.1 The engineering craft

These books are about how to write code that lasts. They are pitched at someone who can already program and wants the judgement that experience usually buys.

The Pragmatic Programmer: Your Journey to Mastery — Andrew Hunt and David Thomas (Addison-Wesley, 20th Anniversary edition, 2019). The closest thing the profession has to a book of first principles for working programmers. It is a collection of short, practical essays — on duplication, on coupling, on automation, on taking responsibility for your work — that together describe the mindset behind most of this book. Nothing in it is data-science-specific, which is the point: it is the engineering sensibility, distilled.

A Philosophy of Software Design — John Ousterhout (2nd edition, 2021). A short, opinionated book built around a single idea: that the central challenge of software is managing complexity, and that good design is whatever reduces it. Its treatment of what makes a module “deep” rather than shallow is the best short articulation of the instinct behind Chapter 5 and Chapter 6. Read it when you want to understand why some interfaces feel clean and others fight you.

Refactoring: Improving the Design of Existing Code — Martin Fowler (Addison-Wesley, 2nd edition, 2018). The definitive treatment of changing code safely without changing what it does — the discipline behind paying down the technical debt of Chapter 19. Its catalogue of named refactorings matters less than its central argument: that you keep code malleable by improving it continuously, in small steps, backed by tests. (Robert C. Martin’s Clean Code, Prentice Hall, 2008, covers adjacent ground and is widely read; treat its more absolute rules as prompts for thought rather than law.)
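
To make "small steps, backed by tests" concrete, here is a minimal sketch of one behaviour-preserving step: extracting a helper from a tangled function and checking that the old and new versions agree. The function names and sample inputs are hypothetical and do not come from Fowler's catalogue or from this book.

```python
def summarise_before(values):
    # Original version: cleaning and averaging are tangled together.
    cleaned = []
    for v in values:
        if v is not None and v >= 0:
            cleaned.append(v)
    if not cleaned:
        return None
    return sum(cleaned) / len(cleaned)


def drop_invalid(values):
    """Extracted step: keep only non-missing, non-negative values."""
    return [v for v in values if v is not None and v >= 0]


def summarise_after(values):
    # Refactored version: same behaviour, with the cleaning step named.
    cleaned = drop_invalid(values)
    return sum(cleaned) / len(cleaned) if cleaned else None


# Check that the refactoring did not change behaviour on these inputs
# (hypothetical cases, chosen to cover empty, missing, and negative values).
cases = [[], [None], [1, 2, 3], [4, None, -1, 6]]
for case in cases:
    assert summarise_before(case) == summarise_after(case)
```

Once the check passes, the old function can be deleted and the next small step taken from a known-good state.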

C.2 Python, more deeply

You arrive fluent in Python; these take you from fluent to idiomatic, and into using the language to structure real systems.

Fluent Python — Luciano Ramalho (O’Reilly, 2nd edition, 2022). The book that turns competent Python into expert Python. It explains why the language works as it does — data models, protocols, descriptors, concurrency — so that the engineering patterns in this book sit on a solid understanding of the tool you are using. Dip into the chapters relevant to a problem you have rather than reading cover to cover.

Architecture Patterns with Python — Harry Percival and Bob Gregory (O’Reilly, 2020, free online). Sometimes called “Cosmic Python”, this applies established software architecture patterns — dependency inversion, repositories, service layers — in idiomatic Python, with tests throughout. It is the natural next step after Part 3 for anyone whose data science work is growing into a real application that others depend on.
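
As a taste of the patterns named above, the sketch below shows one possible shape for a repository combined with dependency inversion, using only the standard library. It is not code from the book: Prediction, PredictionRepository, InMemoryPredictionRepository, and mean_score are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Prediction:
    model_name: str
    score: float


class PredictionRepository(Protocol):
    """The abstraction that higher-level code depends on."""

    def add(self, prediction: Prediction) -> None: ...

    def list_for(self, model_name: str) -> list[Prediction]: ...


class InMemoryPredictionRepository:
    """One storage detail; a database-backed class could replace it."""

    def __init__(self) -> None:
        self._items: list[Prediction] = []

    def add(self, prediction: Prediction) -> None:
        self._items.append(prediction)

    def list_for(self, model_name: str) -> list[Prediction]:
        return [p for p in self._items if p.model_name == model_name]


def mean_score(repo: PredictionRepository, model_name: str) -> float:
    """Service-layer function: it knows the interface, not the storage."""
    predictions = repo.list_for(model_name)
    if not predictions:
        raise ValueError(f"no predictions stored for {model_name!r}")
    return sum(p.score for p in predictions) / len(predictions)


repo = InMemoryPredictionRepository()
repo.add(Prediction("churn", 0.25))
repo.add(Prediction("churn", 0.75))
repo.add(Prediction("fraud", 0.9))
print(mean_score(repo, "churn"))
```

Because mean_score depends only on the protocol, a test can pass the in-memory class while production code passes a database-backed one, and the function itself does not change.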

C.3 Testing

Python Testing with pytest — Brian Okken (Pragmatic Bookshelf, 2nd edition, 2022). A focused, practical guide to the test runner this book uses, going well beyond Chapter 7: fixtures, parametrisation, plugins, and how to structure a growing test suite. The reference to keep beside you as your testing gets more ambitious. Pair it with the documentation for Hypothesis (hypothesis.readthedocs.io) for property-based testing.

C.4 Reproducibility and scientific computing

Good Enough Practices in Scientific Computing — Greg Wilson, Jennifer Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt, and Tracy Teal (PLOS Computational Biology, 2017). A short, generous paper that meets researchers where they are, recommending the minimum practices that make computational work reproducible — sensible project organisation, data management, and version control — without demanding a full engineering apparatus. The single best starting point if Parts 1 and 6 felt like a lot at once.

The Turing Way — The Turing Way Community (online handbook, the-turing-way.netlify.app). A continually updated, community-written handbook on reproducible, ethical, and collaborative data science. Broad rather than deep, it is an excellent map of the territory — reproducibility, research data management, collaboration, ethics — with pointers onward to depth on each topic.

C.5 Machine learning engineering and MLOps

“Hidden Technical Debt in Machine Learning Systems” — D. Sculley et al. (NeurIPS, 2015). The paper behind Chapter 19’s observation that the model is the small part of a production ML system. It catalogues the debts specific to machine learning — data dependencies, configuration sprawl, feedback loops, pipeline jungles — and is essential reading for anyone moving a model toward production. Short, sobering, and frequently cited for good reason.

Designing Machine Learning Systems — Chip Huyen (O’Reilly, 2022). The most directly relevant book for a data scientist who has finished Parts 4 and 6 and wants to go deeper: feature stores, train–serve skew, data and model management, deployment patterns, and monitoring, all treated from an engineering perspective. Where Sculley et al. diagnose the problems, Huyen works through the solutions.

C.6 Operations and delivery

Continuous Delivery — Jez Humble and David Farley (Addison-Wesley, 2010). The foundational text on getting software into production reliably and repeatably through automation — the thinking behind Chapter 13 and Chapter 15. Parts are dated in their specific tooling, but the principles (automate everything, keep everything in version control, build once and promote) are exactly the ones the operations chapters apply to models.

Software Engineering at Google — Titus Winters, Tom Manshreck, and Hyrum Wright (O’Reilly, 2020, free online). A look at how engineering practices — testing, code review, version control, deprecation — work at very large scale and over long time horizons. Useful less as a manual than as a perspective on why these disciplines exist and what they buy a team, which is the through-line of Part 5.

The Twelve-Factor App — Adam Wiggins (12factor.net, 2011). A concise web essay listing twelve principles for building software that deploys and scales cleanly — configuration in the environment, explicit dependencies, disposability. Several map directly onto Chapter 3, Chapter 11, and Chapter 14, and the whole thing takes twenty minutes to read.
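
The "configuration in the environment" principle can be sketched in a few lines of standard-library Python. The variable names (APP_MODEL_PATH, APP_LOG_LEVEL, APP_BATCH_SIZE), the Settings class, and the defaults are all hypothetical. This is one way to apply the idea, not the essay's prescribed implementation.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    model_path: str
    log_level: str
    batch_size: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from environment variables (hypothetical names).

    Required values fail loudly when missing; optional ones fall back
    to a default.
    """
    try:
        model_path = env["APP_MODEL_PATH"]
    except KeyError:
        raise RuntimeError("APP_MODEL_PATH must be set") from None
    return Settings(
        model_path=model_path,
        log_level=env.get("APP_LOG_LEVEL", "INFO"),
        batch_size=int(env.get("APP_BATCH_SIZE", "32")),
    )


# Passing a plain dict stands in for the real environment in this example.
settings = load_settings(
    {"APP_MODEL_PATH": "/models/churn.pkl", "APP_BATCH_SIZE": "64"}
)
print(settings)
```

Accepting the environment as a parameter keeps the function testable without touching the real process environment, and the same code then runs unchanged on a laptop, in CI, and in a container.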

C.7 Documentation

Diátaxis — Daniele Procida (diataxis.fr). The documentation framework introduced in Chapter 18, and the structure this book itself follows. The site is brief and clarifying: it explains why documentation so often fails (by trying to do several jobs at once) and how separating tutorials, how-to guides, reference, and explanation fixes it. Read it before you write your next README.

C.8 Foundational references

For the tools themselves, the official documentation is usually the best and most current source; Appendix A lists each tool with its starting point. Two references are worth singling out. Pro Git by Scott Chacon and Ben Straub (Apress, 2nd edition, 2014, free at git-scm.com/book) is the comprehensive guide to Git when you need more than Chapter 2. The scikit-learn, FastAPI, pytest, and Docker documentation sites are all unusually well written and worth reading directly rather than through secondary summaries.

Content: CC BY-NC-SA 4.0 | Code: MIT - see /LICENSE.md