For a good analysis of risks and failures associated with complex systems, I recommend "Normal Accidents: Living with High-Risk Technologies" by Charles Perrow (1999).
If you are looking for other books, I’d also recommend “Normal Accidents” by Charles Perrow, which focuses on the kinds of systems that lend themselves to catastrophic malfunctions, from bakeries to nuclear reactors. It puts a huge focus on how unimportant (and predictable) “human error” is as a cause of systemic problems and catastrophes.
Once you start seeing systems in terms of how coupled and complex they are, and how much catastrophic potential they have, you gain new insight into how things work and function (at least I did).
This is fascinating. For anyone interested in a slightly odd but unique and in-depth view of systems design and failure, I would like to recommend "The Systems Bible: The Beginner's Guide to Systems Large and Small" by John Gall.
Yes, I started reading into the literature on failures recently because of that exact essay. It's been a great supplement to my reading on systems thinking.
There’s a book by John Gall called the Systems Bible [0] that goes into how big systems form and fall apart. Mostly anecdotes and no real solution, but a decent read that isn’t full of the usual BS that’s required because usually only large systems can afford high speaker fees.
Gerald Weinberg has several books on the topic [0, 1]. Dorner [2] has a short, sharp, thoughtful book on our biases in analyzing complex systems.
[0] 'An Introduction to General Systems Thinking', Gerald Weinberg
[1] 'General Principles of Systems Design', Gerald Weinberg
[2] 'The Logic Of Failure: Recognizing And Avoiding Error In Complex Situations', Dietrich Dorner
Okay, here's some I haven't seen mentioned yet, although these may be pretty far afield for you:
Drift into Failure, by Sidney Dekker. Studies failure analysis in complex systems, and basically argues that our classic reductionist/scientific method approach is the wrong way to study complex engineering failures.
How Buildings Learn, by Stewart Brand. This isn't about programming. It's about architecture, in the build-a-building sense. It studies what happens to buildings over the course of their lives, as opposed to just when they're first built.
Enterprise Integration Patterns, by Gregor Hohpe and Bobby Woolf. Learn how to use message queues and service busses correctly. Honestly, just read the first couple of chapters (65 pages or so), and the rest is reference, to look up as needed, so it's not as imposing as it sounds.
Advanced Programming in the UNIX Environment, by W. Richard Stevens. This book was my bible back in the olden days before http and ssh and stuff (I'm olde). Knowing how sockets really work can be an absolute lifesaver, even in this modern world of giant protocol stacks. Especially in this modern world.
The Art of Computer Programming, vols 1-3, by Donald Knuth. Only a madman would actually read them all, but they're good to have to remind you that there are mountains you can't even begin to climb.
A Deepness in the Sky, by Vernor Vinge. A science fiction novel that is really about hacking, set thousands of years in the future, when Moore's Law is long defeated and programmers are basically archeologists.
Design Patterns (aka Gang of Four), by Gamma/Helm/Johnson/Vlissides. There are lots of good books on design patterns, but you should really read the one that started it all. (For extra credit, read A Pattern Language, by Christopher Alexander - a book about urban architecture that inspired it.)
Continuous Delivery, by Jez Humble and David Farley. Stop thinking about your program in isolation, and learn how to deploy effectively!
Influence: The Psychology of Persuasion, by Robert Cialdini. This is DHH's favorite book. Learn how people think, and how to use that to design better products.
How to Win Friends and Influence People, by Dale Carnegie. Not creepy at all, despite how the title sounds in today's language. This book is the bible of how to get along with others. It's been in continuous print since before WWII, for good reason.
The Lean Startup, by Eric Ries. The best work you do is the work you find you don't need to do. Learn how to fail fast and save time on projects and product development, by building what customers want rather than what you think they need.
To build on this discussion, what are some highlights in that book that you found useful?
I ask because I've read some systems thinking books (e.g. Systemantics) that were difficult to apply in real life. I come from the perspective of someone with a systems/theory builder personality. The only systems thinking book that I found remotely practical was The Fifth Discipline by Peter Senge.
The most useful piece of short writing on systems thinking that I've come across is "How Complex Systems Fail" [1, 2], which talks about designing systems for resiliency, and not for rigid notions of reliability.
I've read the first SRE book, but having worked on large-scale systems, I find it is impossible to relate to the book or internalise the advice and processes outlined in it unless you've been burned by scale.
Does anyone have a similar compendium specifically for software engineering disasters?
Not of nasty bugs like the F-22 -- those are fun stories, but they don't really illustrate the systemic failures that led to the bug being deployed in the first place. Much more interested in systemic cultural/practice/process factors that led to a disaster.
Can you recommend books and other materials on systems in general? I read "Thinking in Systems" [1], which I liked; "Systemantics" [2], which is rather humorous; and Richard Cook's talks [3], e.g. "How Complex Systems Fail" (a very good one). Anything more you would add to the list?
Nancy Leveson's _Engineering a Safer World_ (available for free here: https://direct.mit.edu/books/book/2908/Engineering-a-Safer-W...) is required reading, in my opinion. It talks about applying systems theory to reducing industrial accidents, using various case studies such as the Bhopal gas tragedy, which maps automatically to reducing outage severity and incident severity for tech workers.
The key insight it applies is that root-cause analysis is a limited tool for fixing and improving situations, because the choice of root cause is somewhat arbitrary: it really depends on where you stop probing further. Engineers should not rely on root-cause analysis to identify failures in _the system_, only failures of components. Nancy proposes alternative techniques to identify and improve broken systems.
The detail, quality and depth of her treatment make it immediately practical and useful. My one complaint is that it is a bit long, and not all of it is easily condensed into short treatments, but I guess verbosity is the tradeoff for quality?
Seconded, Normal Accidents is one of the best books I've ever read. In addition to the maritime accidents, there are chapters on nuclear plants, chemical plants, and dams. It is a great discussion of how modern industrial accidents rarely have a single cause but instead are cascade failures of systems whose complexity has evolved beyond what we can handle, and it presaged many of the messes we've got into with hyper-scale software projects before those ever existed.
Is there a "sequel" to Normal Accidents about software? Because I'd buy that in a heartbeat.
If you like that and it really resonated with you, you might also find a lot of value in Drift Into Failure and Resilience Engineering: Concepts and Precepts.
I've found that the further I go down the rabbit hole on these particular topics, the faster I'm able to spot problems at a system-wide scale in real time and (try to) course-correct.
I'd also advise reading http://www.draftymanor.com/bart/systems1.htm and the five following chapters.
I have the book by Gall and it's both an amusing and informative read (especially if one has some background in complex systems).
This failure/flaw perspective is very useful for finding problems to develop products for.
Charles Perrow, "Normal Accidents". Nominally about safety in large-scale complex engineered systems, and it's very good on that topic, but in doing so it also touches on how scale and social factors interact with engineering details, which affects a lot more than safety.
https://www.amazon.com/Inviting-Disaster-Lessons-Edge-Techno...