Major incidents aside, I've always thought cache-related bugs are among the most likely to go undetected: if you don't test for them end-to-end, they're really not easy to spot or diagnose.
An article sticking around too long on the home page. Semi-stale data creeping into your pipeline. Someone's security token being accepted post-revocation. All really hard to spot unless (1) you're explicitly looking, or (2) manure hits the fan.
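To make the token case concrete, here's a minimal hypothetical sketch (names and TTL are invented) of how that failure tends to happen: a TTL cache sits in front of token validation, and revocation only updates the backing store.

```python
import time

# Hypothetical sketch: a TTL cache in front of token validation. If revocation
# only touches the backing store, a revoked token keeps being accepted until
# its cache entry expires.

TOKEN_TTL_SECONDS = 300

_valid_tokens = {"abc123"}   # stand-in for the auth backend
_validation_cache = {}       # token -> (is_valid, cached_at)

def is_token_valid(token: str) -> bool:
    entry = _validation_cache.get(token)
    if entry and time.time() - entry[1] < TOKEN_TTL_SECONDS:
        return entry[0]              # cache hit: revocation never consulted
    result = token in _valid_tokens  # slow path: ask the auth backend
    _validation_cache[token] = (result, time.time())
    return result

def revoke(token: str) -> None:
    _valid_tokens.discard(token)
    # Bug: without the line below, the token is honored for up to
    # TOKEN_TTL_SECONDS after revocation.
    # _validation_cache.pop(token, None)
```

Until the cached entry expires, is_token_valid keeps returning True for the revoked token, and nothing in the logs looks unusual.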
Microsoft has a serious problem with token caching. I changed jobs last month, and for two or three weeks I could log into my old work account for a split second before being thrown out (I kept visiting the page out of habit). I could see the news feed and mails, but not for long enough to tell whether they were stale.
I categorize these as bugs caused by data inconsistency due to data duplication. That includes:
- Using asynchronous database replication and reading data from database slaves
- Duplicating the same data across multiple database tables (possibly for performance reasons)
- Having an additional system that duplicates some data. For example: being in the middle of rewriting a legacy system - a process split into phases, so functionality between the new and old systems overlaps for some period of time.
Based on my experience I always assume that inconsistency is unavoidable when the same information is stored in more than one place.
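For the replica case in particular, here's a toy sketch of the read-your-writes problem (illustrative only, not modeling any real database or driver):

```python
import time

# Toy model of async replication: the replica applies a write only after a
# replication delay, so a read issued in between sees stale (missing) data.

REPLICATION_LAG = 0.5  # seconds; purely illustrative

primary = {}
replica = {}
pending = []  # (apply_at, key, value) queued for the replica

def write(key, value):
    primary[key] = value
    pending.append((time.time() + REPLICATION_LAG, key, value))

def read_from_replica(key):
    now = time.time()
    while pending and pending[0][0] <= now:  # apply whatever has "arrived"
        _, k, v = pending.pop(0)
        replica[k] = v
    return replica.get(key)

write("profile:42", "new bio")
print(read_from_replica("profile:42"))  # None -> the write hasn't replicated yet
time.sleep(REPLICATION_LAG)
print(read_from_replica("profile:42"))  # "new bio" once the lag has passed
```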
What I find most interesting in this is the pseudo-detective story of hunting down disappearing post-mortem and "lessons learned" documentation. Optimistically, we'd hope that the older systems no longer reflect the existing ones in any meaningful way (perhaps as the org structures and/or software stacks shift and change) and the documents are simply no longer relevant.
I'd imagine most lost knowledge is not the result of an explicit decision, however, which means such historical scenarios / documentation / ... are simply lost in the course of doing business. Lost knowledge is the default for companies.
Twitter is likely better than most given their documentation is all digital and there exist explicit processes to catalogue such incidents. I'd also be curious to see how much of this knowledge has been implicitly exported to their open source codebases.
What you've said is, in my opinion, likely to be a difference between the technology companies that become tomorrow's infrastructure and the ones that disappear (even if it takes decades).
As you say, the default tendency in many companies when failures occur is information loss. That can be attributed to using too many communication tools, cultural expectations that problems should be hidden, siloed or disparate documentation stores, or lack of process.
Intentional, open, thorough and replicated note-taking with cross-references before, during and after incidents can create radically different environments which allow for querying, recovery and improvement regardless of failure mode(s). Kudos to Dan for moving in that direction with these writeups (and to you for raising the subtext).
I remember reading that Facebook's caches had a dedicated standby set of "gutter" servers, otherwise inactive and unused, that would take over quickly after a failure. That was an interesting mitigation for some failure scenarios.
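As I understand the idea (the details below are invented for illustration, not Facebook's actual implementation), the fallback logic is roughly: if the primary cache shard is unreachable, serve from a small standby pool and write to it with a deliberately short TTL, so stale entries age out quickly once the failed shard comes back.

```python
import time

# Rough sketch of a "gutter" fallback. primary_get and load_from_db are
# caller-supplied callables standing in for the real cache client and
# database; the TTL value is made up.

GUTTER_TTL = 10  # seconds; deliberately short

gutter = {}  # key -> (value, expires_at)

def get_with_gutter(key, primary_get, load_from_db):
    try:
        return primary_get(key)               # normal path
    except ConnectionError:
        entry = gutter.get(key)
        if entry and entry[1] > time.time():  # fresh gutter hit
            return entry[0]
        value = load_from_db(key)             # miss: go to the database
        gutter[key] = (value, time.time() + GUTTER_TTL)
        return value
```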
Required reading for all of the "I could code up Twitter in a weekend" types.
The long listen queue -> multiple queued-up retries feedback loop is a classic: https://datatracker.ietf.org/doc/html/rfc896 TCP/IP "congestion collapse" and the 1986 Internet meltdown [various sources]
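The usual client-side mitigation for that loop is to stop retrying in lockstep, e.g. capped exponential backoff with full jitter. A rough sketch (parameter values and the exception type are just placeholders):

```python
import random
import time

# Capped exponential backoff with full jitter: synchronized clients stop
# hammering an already-backlogged server in lockstep. Sketch only; tune the
# numbers and the retried exception for your system.

def call_with_backoff(do_request, max_attempts=5, base=0.1, cap=10.0):
    for attempt in range(max_attempts):
        try:
            return do_request()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise
            # full jitter: sleep a uniformly random amount, up to an
            # exponentially growing (but capped) bound
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```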
It ultimately depends on the scale. To be fair, I think most people are talking about building a clone of what Twitter was back in the mid-2000s. You could build a Twitter clone that could handle a few hundred users in a few weeks with modern tech stacks.
The fact that there are at least three Twitter clones that are less well put together, yet handle a decent number of users, proves that it is possible.
“A few weeks” sounds a lot longer than a weekend, and I’d also consider the history: Twitter itself was built quickly using a modern stack. Rails was highly productive, the problem is that the concept of the service makes scaling non-trivial. We have more RAM and SSDs now so you could get further but those aren’t magic.
Twitter was down all the time, for hours, after the launch. I don't think I'd have an issue coding something in a weekend, with the functionality Twitter had when it launched, that goes down when it gets overloaded. Most of the work on that kind of project goes into interpreting the specs/your business colleagues and fixing the mishaps; here you don't have that.
Don’t forget that you’re talking 2006, so you need to be doing a lot more infrastructure work: no containers, shared hosting environments are less stable but bare metal costs a fair amount to get started, you’re using something like cfengine instead of Chef/Ansible if you aren’t setting everything up by hand, you have 10% of the RAM and no SSDs, CDNs are an expensive premium service, etc. Then think about what that means browser-wise: you can do a bit on the client side but server side rendering is a necessity and you’re still going to be burning time on browser compatibility to an extent which can be hard to remember now. HTML5 hasn’t happened yet so you’re building more stuff yourself, too.
I’m not saying there’s nothing they could have done better, just that there’s an awful lot they couldn’t have avoided at least without building a very different app.
Yes, I agree with that. On containers, though, as far as some of the advantages go: we have been using chroots for deployment since the early 2000s, which is not the same, but deployment/compatibility-wise it was pretty good. It lets you have the same small Linux image and deployment everywhere, and you could move most zipped images from machine to machine with vastly different kernels. I still use chroots now on my Pandora handheld, which has an ancient kernel, but I run modern software on it in a chroot. No overhead either.
I think what they could have done better is Rails; it was not good enough then. PHP would've been far less hassle. But hey, they made it!
I used chroots, too, and it was useful but much harder to maintain than a container. Automation wasn't impossible, of course, but that was also complicated by concerns about bloating each chroot with copies of all of the system libraries & config files.
I used PHP in that era. It could be faster but then you're in the classic developer productivity tradeoff between, say, hand-coded SQL calls versus using an ORM, etc. The PHP frameworks which were comparable productivity-wise to Rails were also a lot closer to Rails performance-wise since they also had heavy abstractions, and they tended to have even more creative ways to create security holes. (I am feeling very old remembering arguing against enabling register_globals circa 1998)
Hahah yeah; I did a lot of hosting then and everyone insisted on register_globals. I still run stuff for clients that depends on it to this day.
And maybe Rails' perf was comparable to the larger PHP frameworks in some cases, but PHP was really much easier to scale in our experience. We hosted millions of sites, and the Python/Ruby ones were generally dramas when they got serious (for the time) traffic.
Deploying mod_php sure was easy, though. People could still write bad code (thinking here of someone who processed a database join in a foreach loop rather than learning how to use a WHERE constraint), but I do miss that level of install simplicity, at least until I remember what it was like dealing with incompatible versions or reconciling configuration in multiple places.
Those remarks are always made at launch, not later on. Dropbox and Twitter, both of which people said this about, were rather trivial at launch, especially with modern tooling. They also had growing pains, Twitter especially. Twitter definitely prioritised "move fast and break things".
Obviously you cannot copy decades of improvements and scaling lessons unless someone has made a product out of those parts that you can use.
These big incidents involving 'big cache' are fun to read about. Years ago I had to deal with a bunch of cache issues over a short time, but they were all minor incidents with minor uses of cache (simple memoization, storing stuff in maps on attributes of java singletons, browser local storage). Still, I made a checklist of questions to ask thenceforth on any proposal or implementation of a cache in a doc or code review. Most of them are focused on actually paying attention to what your keys are made of and how invalidation works (or if you even can invalidate, or if it's even needed). I think for 'big cache' questions I should just refer to this blog post and ask "what's the risk of these issues?"
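For the "what are your keys made of" question specifically, the classic bug looks something like this (hypothetical example; names are made up):

```python
import functools

# Memoizing on only part of the input: the key ignores an argument that
# affects the result. The fix is keying on everything that matters (and
# deciding how, or whether, you invalidate).

def lookup_price(sku, currency):
    # stand-in for a call to a real pricing service
    return {"USD": 10.0, "EUR": 9.0}[currency]

_price_cache = {}

def quote_price(sku, currency):
    if sku in _price_cache:           # bug: key ignores `currency`,
        return _price_cache[sku]      # so EUR callers get cached USD quotes
    price = lookup_price(sku, currency)
    _price_cache[sku] = price
    return price

# Safer: let the cache key cover all of the arguments.
@functools.lru_cache(maxsize=1024)
def quote_price_fixed(sku, currency):
    return lookup_price(sku, currency)
```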