It's probably part of how they keep things simple. But you seem to imply that storing lots of data is hard. It doesn't have to be. I've worked with search teams that operate at a very large scale. You need lots of hardware but not necessarily lots of people.
The scale of Twitter is large but not enormous: a few trillion messages overall, most of them small, all of them immutable (well, so far), plus lots of images and videos with a CDN in front of them. It's a lot of bytes but not a whole lot of complexity. And a couple hundred million users and user profiles. I can think of quite a few ways of dealing with all that.
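A quick back-of-envelope on the text corpus alone, using round numbers I'm assuming rather than Twitter's real figures:

    # Back-of-envelope for the text corpus only (images/videos sit behind a CDN).
    # Every number here is an assumption for illustration, not Twitter's actual figure.
    TWEETS_TOTAL = 2e12        # "a few trillion" messages
    BYTES_PER_TWEET = 300      # text plus metadata, assumed average
    REPLICATION = 3            # assumed replication factor

    raw_bytes = TWEETS_TOTAL * BYTES_PER_TWEET
    total_bytes = raw_bytes * REPLICATION
    print(f"raw:        {raw_bytes / 1e15:.1f} PB")   # ~0.6 PB
    print(f"replicated: {total_bytes / 1e15:.1f} PB") # ~1.8 PB

A couple of petabytes of mostly-immutable rows is a hardware problem, not a people problem.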
What's hard is things like algorithms and good search. But that's a space where a small, focused team with access to lots of hardware can achieve a lot. And obviously this is also an area where Twitter has been somewhat underwhelming and struggling: user growth and engagement stagnated years ago, it was not a healthy social network, and it's not like they were nailing this. Throwing more people at the problem wasn't working.
But a fulltext index done right, including per-hashtag, at most doubles the storage requirements. And calling up an arbitrary tweet from history doesn't need to be as instantaneous as fetching recent tweets (it's OK to wait five seconds for a tweet from five years ago).
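As a minimal sketch of that tiering idea (the cutoff, the store names, and the types are placeholders I made up):

    from datetime import datetime, timedelta, timezone

    # Hypothetical stores standing in for a fast tier and a slow archive tier.
    HOT_STORE: dict[int, str] = {}      # e.g. in-memory / SSD-backed, recent tweets
    ARCHIVE_STORE: dict[int, str] = {}  # e.g. blob storage; reads may take seconds

    HOT_WINDOW = timedelta(days=30)     # assumed cutoff between "recent" and "archive"

    def fetch_tweet(tweet_id: int, created_at: datetime) -> str | None:
        """Route reads by age: recent tweets hit the fast tier,
        anything older falls back to the slower archive."""
        if datetime.now(timezone.utc) - created_at < HOT_WINDOW:
            return HOT_STORE.get(tweet_id)
        return ARCHIVE_STORE.get(tweet_id)

The point is just that the latency budget for the archive tier is measured in seconds, which buys you much cheaper storage.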
> The fanout problem turns sharding into a lumpy problem because again, there is a power law distribution that can overwhelm any single machine.
Not really, in the case of Twitter (where you only trace edges, not edges-of-edges and edges-of-edges-of-edges like Facebook and LinkedIn do). In the simplified model I described, the web frontends all pull from the backend; but for the hot accounts with ten million followers -- of which there aren't that many -- you can just push their tweets to the webservers, so the frontends don't even need to query.
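A minimal sketch of that hybrid push/pull split, with the threshold and data structures as my own assumptions rather than Twitter's actual design:

    from collections import defaultdict

    HOT_THRESHOLD = 1_000_000      # assumed follower count above which we stop fanning out

    inboxes = defaultdict(list)    # per-follower timeline, written at tweet time ("push")
    hot_feeds = defaultdict(list)  # per-hot-author feed, merged at read time ("pull")

    def publish(author: str, tweet: str, followers: list[str]) -> None:
        if len(followers) >= HOT_THRESHOLD:
            # Too many followers to write into every inbox: store once, merge on read.
            hot_feeds[author].append(tweet)
        else:
            for f in followers:
                inboxes[f].append(tweet)

    def timeline(user: str, hot_follows: list[str]) -> list[str]:
        # Pushed tweets plus the handful of hot accounts this user follows.
        merged = list(inboxes[user])
        for author in hot_follows:
            merged.extend(hot_feeds[author])
        return merged

Since there are only a handful of accounts above the threshold, the read-time merge stays small, and no single shard has to absorb a ten-million-row fanout.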
> Robustness at that scale is more than buying a pair of SSDs and setting up RAID.
I most definitely agree. But it is cheaply doable when everything is as perfectly shardable as it is at Twitter.
> Then there's the fun part that Twitter is so reliable these days that when something else on the internet breaks, we check Twitter for updates.
That's possibly an illusion. Yes, Twitter is mostly reliable -- but do you have any latency stats? For example, would you know if, on a daily basis, 10% of tweets took 60 seconds to appear in a viewer's refresh? The "hot" accounts would be cached and updated immediately, but the long tail might see delays of minutes and almost no one would notice.
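Measuring it wouldn't be hard if two timestamps were logged per delivery; here's a sketch with made-up sample data:

    # Delivery latency: for a sample of (posted_at, first_visible_at) pairs,
    # how long until a tweet shows up in a follower's refresh? Data is made up.
    samples = [(0.0, 0.8), (0.0, 1.2), (0.0, 64.0), (0.0, 0.5), (0.0, 2.1)]

    latencies = sorted(seen - posted for posted, seen in samples)

    def percentile(values: list[float], p: float) -> float:
        """Nearest-rank percentile of an already-sorted list."""
        idx = min(len(values) - 1, round(p / 100 * (len(values) - 1)))
        return values[idx]

    print("p50:", percentile(latencies, 50), "s")
    print("p90:", percentile(latencies, 90), "s")  # a slow long tail only shows up here

The median looks great either way; the long tail is where a quietly degraded fanout would hide.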
> But often, when you inquire, there is a solid non-obvious reason for non-trivial diversions from the simplest thing that could possibly work.
> I always find it useful to give the benefit of the doubt.
I do not assume other engineers are morons. In my experience as a consultant, though, those solid non-obvious reasons for non-trivial diversions are much more often than not "historical, do not apply any more", "we didn't have time or resources to do it the right way", "the guy who did the initial design did it wrong for whatever reason, and now we are stuck with it", "there's a legal reason that's not obvious why we can't do it that way", and a few others.
Twitter had the resources to do it better from very early on, and they didn't (I think it was 2012 or so before Twitter became reliable). For all I know, their system now could be the most efficient beast ever and run on a ZX81 with 1K of RAM with ultimate reliability, way better than anything I could ever hope to build.
I was just pointing out that the user-facing side of Twitter, technically speaking, is not very impressive. I've been making that point for a while on various technical forums, and not once has anyone offered a reason why it's much harder than it seems -- most of the responses were along the lines of "but MySQL/PostgreSQL/Oracle can't take the firehose load, I tried!". Which is correct, but irrelevant.
I interviewed with a Twitter search engineer and he told me that's because their in-memory search index isn't big enough to hold all tweets ever posted. On top of that, their efforts to scale up the index are counteracted by the ever-increasing volume of tweets posted per day.
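A rough sanity check with round numbers (all of them my assumptions, not Twitter's real figures):

    # Rough growth of an in-memory inverted index. Every number is an assumption.
    TWEETS_PER_DAY = 500e6
    POSTINGS_PER_TWEET = 15      # assumed distinct indexed tokens per tweet
    BYTES_PER_POSTING = 8        # assumed size of one posting-list entry

    daily_growth = TWEETS_PER_DAY * POSTINGS_PER_TWEET * BYTES_PER_POSTING
    print(f"index growth: ~{daily_growth / 1e9:.0f} GB of RAM per day")

At tens of gigabytes of new index per day, keeping "all tweets ever" in RAM means adding memory faster than the corpus grows, which is exactly the treadmill he described.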
Clearly, you've never worked in distributed systems: Twitter (and the other systems that are large and "centralized" by your definition) is actually a huge, loosely coupled, decentralized system.
The real difference is whether that decentralized, loosely coupled system is run and owned by a single corporation or not.
It's the design of the system itself that determines latency, not the centralized vs. decentralized argument.
You're speaking as if twitter.com is a single giant machine on a single IP address. In fact, it's thousands of loosely coupled machines and services all working in tandem across nearly the entire globe. Sounds like a typical decentralized system to me.
Twitter is no harder to partition/shard than Google is (your search runs against the entire web, not just one part of it). Batching has nothing to do with this issue.
I'm guessing Twitter is a highly distributed system, so of course it's doing a bunch of poorly batched RPCs. That's more or less the definition of a highly distributed system, where all the data can't be stored in 1 place.
Large scale distributed systems aren't that simple. I don't know any internal details of Twitter, but from what I've gathered they are trying to hit the kind of QoS you are dismissing. Imagine some celebrity with millions of followers says something controversial: it would be confusing to a lot of people if they start getting retweets or discussions of that tweet significantly before they see that tweet themselves. And that's just one scenario.
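As a toy illustration of that ordering constraint -- a sketch of one possible approach, not how Twitter actually does it:

    # Toy causal-delivery buffer for one viewer: hold a retweet back until the
    # tweet it references has been delivered. Field names are assumptions.
    delivered: set[int] = set()          # tweet ids already shown to this viewer
    pending: dict[int, list[dict]] = {}  # parent tweet id -> retweets waiting on it

    def deliver(event: dict) -> list[dict]:
        """Return the events that may be shown to the viewer right now."""
        parent = event.get("retweet_of")
        if parent is not None and parent not in delivered:
            pending.setdefault(parent, []).append(event)  # wait for the original
            return []
        ready = [event]
        delivered.add(event["id"])
        # Anything that was waiting on this event can now go out too.
        for held in pending.pop(event["id"], []):
            delivered.add(held["id"])
            ready.append(held)
        return ready

Doing even something this simple consistently across millions of viewers and thousands of machines is where the engineering effort goes.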
I don't know about the 500 vs. 1000 employee question, but Twitter is a big engineering challenge.
Because Lucene wasn't good at near-realtime in 2009 or so, Twitter's original search (acquired via Summize) was built on MySQL. It might even have been a row for every token; not quite sure.
IIRC, when we moved to a highly-customized-Lucene-based system in 2011, we dropped the server count on the cluster from around 400 nodes to around 15.
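For anyone wondering what "a row for every token" looks like, here's a toy version of that pattern (SQLite via Python, purely an illustration -- not the actual Summize schema):

    import sqlite3

    # Toy "row per token" inverted index, the general pattern the MySQL-era
    # search may have resembled. Not the real schema.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE postings (token TEXT, tweet_id INTEGER)")
    db.execute("CREATE INDEX idx_token ON postings (token)")

    def index_tweet(tweet_id: int, text: str) -> None:
        rows = [(tok.lower(), tweet_id) for tok in text.split()]
        db.executemany("INSERT INTO postings VALUES (?, ?)", rows)

    def search(token: str) -> list[int]:
        cur = db.execute("SELECT tweet_id FROM postings WHERE token = ?", (token.lower(),))
        return [row[0] for row in cur.fetchall()]

    index_tweet(1, "the quick brown fox")
    index_tweet(2, "quick brown dogs")
    print(search("quick"))  # [1, 2]

It works, but the row count explodes with the corpus, which fits with needing ~400 nodes before the Lucene-based rewrite.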
One of the key features of Twitter, the global reach of hashtags, would be impossible. Twitter relies very heavily on being centralised. Anyone claiming to build a decentralised Twitter needs a careful, numbers-based argument for what the bandwidth consumption of being a popular user or hashtag would be.
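Here's the kind of numbers-based argument I mean, with every figure below an assumption chosen purely for illustration:

    # Rough egress for one trending hashtag in a naive federated model where
    # every participating server receives every matching post. All assumptions.
    POSTS_PER_MINUTE = 20_000     # a big global event
    BYTES_PER_POST = 2_000        # post body + signatures + protocol overhead
    SUBSCRIBED_SERVERS = 50_000   # servers with at least one user following the tag

    egress_per_minute = POSTS_PER_MINUTE * BYTES_PER_POST * SUBSCRIBED_SERVERS
    print(f"~{egress_per_minute / 60 / 1e9:.1f} GB/s just for this one hashtag")

A centralised service answers the same interest from shared caches and a CDN; a federation has to ship the posts to every interested server.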
I think you're dramatically underestimating the scope and scale of the problems Twitter had to solve. See the thread below for examples of the sort of solutions they had to build because open source or cloud-hosted solutions weren't available or viable.
Twitter is basically a real-time database where everything is interconnected. It's one of the harder things to scale because it doesn't allow for easy segmentation.
Already pointed out by others, but I think you might be underestimating just how much of the issue comes from Twitter (et al.) being a platform: giant, centralized, closed, US-based...
It's not indexed in real time (tweets from 20 minutes ago are still not searchable). It has no interactivity either, which makes the problem a few orders of magnitude easier. And they don't care about subscriptions.
Implementing search: just chuck the last hour of the firehose into S3 and update the n-grams.
Implementing a real-time-ish view: OK, so we need to distribute and replicate the data so that timeline queries can happen in parallel and get fresh info depending on subscriptions.
Implementing the number of likes ticking up as you watch the tweet: basically alien technology. (I assume it's optimised somehow, because a naive implementation would turn a single tweet view into 1 + (seconds spent on the page / 2) requests.)
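Spelling out that last bit of arithmetic as a sketch (the two-second poll and the dwell time are my assumptions):

    # Cost of a naive "poll the like count every 2 seconds while the tweet is
    # on screen" implementation. Dwell time and poll interval are assumptions.
    POLL_INTERVAL_S = 2
    AVG_DWELL_S = 30              # assumed average time a viewer stays on a tweet

    requests_per_view = 1 + AVG_DWELL_S / POLL_INTERVAL_S
    print(f"{requests_per_view:.0f} requests per tweet view instead of 1")

A ~16x multiplier on read load is why you'd expect the real system to push deltas or batch counter reads rather than poll naively.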
The problem with a centralized service like Twitter is that there is absolutely no guarantee that you will be able to access it through any particular means.
For example, my only means of accessing Twitter are either through the bloated Web UI, which is unusable on most of my devices, or through Nitter, which is read-only and sometimes doesn't work due to Twitter's API limitations.
Sorry, I don't want to be pedantic, but this is a constraint on distributed systems. Like Google Search, Twitter doesn't run on a single machine, but it is a very cohesive and optimized system (e.g., high-speed networking, storage, and carefully tuned performance).
The web part of it is pretty easy to scale. You simply add more web servers. The problem is in the storage/db layer. I imagine the feed was the primary challenge scaling Twitter.
Other components, such as search, were probably also quite tricky. One thing I've never figured out is how Facebook handles search on your timeline. That's a seriously complex problem.
LinkedIn recently published a bunch of papers about their feed tech: