Building Twitter/Mastodon *not at scale* isn't that hard and certainly doesn't take 200 person-years. Building it *at scale* is a completely different story. Remember the fail-whale? That was years of Twitter struggling to scale their product.
That said, as we described in the post, our implementation of Mastodon is less code than Mastodon's official implementation. So not only is Rama orders of magnitude more efficient for building applications at scale, it's also much faster for building first versions of an application.
I hate to be that person, but I've seen complicated dynamic applications push much higher bandwidth and serve millions of concurrent users with similar, if not smaller, hardware requirements.
It would be interesting to see Twitter's complete backend, and while Mastodon might not be an apples-to-apples comparison, I'd also be interested in a cost-per-user infrastructure analysis.
Twitter is a really simple application to scale. The fact that it took Twitter so long to get it somewhat right is remarkable, but I'm not sure I'd be so keen to look at them as an example of how to design scalable architectures.
It scaled for a long time for Twitter. With few exceptions, we're not going to work at anything close to the scale of Twitter at the time, so that's plenty scalable for most use cases. Now, if you want to be more compute-efficient, that's a different question...
Check out twitter-scale-mastodon, a from-scratch implementation of Mastodon's backend that runs at Twitter scale. It's more than 40% less code than Mastodon's backend and 100x less code than what Twitter wrote to build the equivalent.
I'd be surprised if anyone thought building Twitter with similar scalability would be easy. Sure, while you could fit it into one MySQL DB it would be easy to get the basic features down with a much simpler UI and no API. Past that, though, they would be vastly underestimating the work.
My understanding is that the particular scaling problem of Twitter is high fanout through subscriptions. Receiving 7k messages per second and storing them in a database is actually fairly straightforward.
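To make the fanout point concrete, here is a back-of-the-envelope sketch. The 7k/s ingest rate is the figure above; the average follower count is an assumed number purely for illustration.

```python
# Back-of-the-envelope fanout arithmetic. The ingest rate comes from the comment
# above; the average follower count is an assumption for illustration only.
incoming_tweets_per_sec = 7_000
avg_followers_per_author = 200        # assumed; the real distribution is heavily skewed

# Storing the raw tweets: one write per incoming message.
raw_writes_per_sec = incoming_tweets_per_sec

# Delivering each tweet to every follower's timeline multiplies the write load.
timeline_writes_per_sec = incoming_tweets_per_sec * avg_followers_per_author

print(f"raw writes/s:      {raw_writes_per_sec:,}")       # 7,000: easy for one database
print(f"timeline writes/s: {timeline_writes_per_sec:,}")  # 1,400,000: the hard part
```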
The title pretty much covers it. Twitter is more like an IM network or email list service than like a typical database-backed webapp; a relational database is just the wrong platform for large-scale messaging.
I don't get why Twitter doesn't scale. It's just webmail, but with smaller messages and a simpler UI. Here's how Twitter should work: every user should have a list of users following them. When they tweet, each follower gets a copy of that message in their personal inbox. A copy is also attached to the tweeter's account, so new followers can suck that copy in when they start following them.
That's it. Now, sending a message takes O(n) (n = followers) time, which is really cheap. On my machine, it takes about a second to create and sync 40,000 files (there's not much data, so replicating this via NFS wouldn't be that expensive either). With that out of the way, all you have to do is ls your "twitter directory" to see all of your friends' messages. This is another incredibly cheap operation. It's easy to distribute, and there's no locking.
Anyway, just look at the mail handling systems at huge universities and corporations. They scale fine, and they're much more complicated than Twitter. Twitter is just a subset of e-mail, so it should be implemented that way, not as a "SELECT * FROM tweets WHERE user IN (list, of, followers) ORDER BY date". That is the wrong approach because it makes reads (very common) expensive and writes (very uncommon) cheap. That's why Twitter doesn't scale.
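A minimal sketch of the "inbox per user" design described above, using in-memory dicts as a stand-in for the mail-spool-style storage. All names here (post_tweet, read_timeline, etc.) are invented for illustration and are not from any real Twitter or Mastodon code.

```python
from collections import defaultdict, deque

# In-memory stand-ins for the per-user mailboxes from the email analogy above.
INBOX_SIZE = 800                                           # keep only a recent window
inboxes = defaultdict(lambda: deque(maxlen=INBOX_SIZE))    # user_id -> recent tweets
followers = defaultdict(set)                               # author_id -> follower ids
authored = defaultdict(list)                               # author's own copy, for backfill

def follow(follower_id, author_id):
    """New follower: record the edge and backfill from the author's own copy."""
    followers[author_id].add(follower_id)
    for text in authored[author_id]:
        inboxes[follower_id].append((author_id, text))

def post_tweet(author_id, text):
    """Write path: O(number of followers) copies, one per follower inbox."""
    authored[author_id].append(text)
    inboxes[author_id].append((author_id, text))
    for follower_id in followers[author_id]:
        inboxes[follower_id].append((author_id, text))

def read_timeline(user_id):
    """Read path: a single cheap lookup, with no joins and no fan-in query."""
    return list(inboxes[user_id])

# Contrast with the SQL approach quoted above:
#   SELECT * FROM tweets WHERE user IN (list, of, followers) ORDER BY date
# which does the fan-in work on every read; here it is done once, at write time.
follow("bob", "alice")
post_tweet("alice", "hello")
assert read_timeline("bob") == [("alice", "hello")]
```

Since reads vastly outnumber writes, paying O(followers) at write time in exchange for constant-cost reads is usually the right trade, with the usual caveat (not handled in this sketch) that accounts with millions of followers need a different strategy.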
'... Six Apart didn't hire Brad -- they acquired LiveJournal from him and named him "Chief Architect" or something ...'
Same result.
'... their developer does not seem to be amazingly qualified to do what he's doing ...'
The thing that strikes me is that the system is not layered enough. The APIs the app developers should be calling would shield them from having to deal with these types of problems. nostrodemons [0] summarized Flickr's approach to optimisation. [1] So is it the lack of a scaling infrastructure where Twitter is failing?
'... how many are dynamic and how many are cachable/static ...'
One thing I notice with Twitter is the update interval on the system: every 2 minutes.
For most users, 5-10 minutes would probably be ample. I often wonder why they don't say "right, you want real-time (RT)? Well, here's the monthly subscription".
As for the dynamic and cacheable split, the main hits appear to be reads of the public timeline RSS. [2] RT generation allows little or no caching since the RSS would be built on the fly. Couple that with Rails' inability to talk to multiple DBs [3] and you get bottlenecks. Makes you wonder why they don't switch certain layers to Perl?
[2] Google Groups, Twitter Development Talk, Alex Payne: "we don't guarantee that you'll be able to collect contiguous sets of data from the public timeline API method. It's our most-requested method, so right now it's optimized for performance, not archival"
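As a rough illustration of that caching trade-off, here is a sketch of serving the public-timeline RSS from a short-lived cache. The function names and the 5-minute TTL are assumptions for illustration, not anything Twitter actually did.

```python
import time

CACHE_TTL_SECONDS = 300            # the ~5-minute freshness suggested above
_cache = {"body": None, "at": 0.0}

def build_public_timeline_rss():
    # Stand-in for the expensive on-the-fly generation (DB reads + RSS templating).
    return "<rss>...recent public tweets...</rss>"

def get_public_timeline_rss():
    """Serve a cached copy unless it has gone stale.

    Promising strict real-time delivery forces the TTL toward zero and pushes
    every request back onto the database, which is the "no or little caching"
    problem described above.
    """
    now = time.time()
    if _cache["body"] is None or now - _cache["at"] > CACHE_TTL_SECONDS:
        _cache["body"] = build_public_timeline_rss()
        _cache["at"] = now
    return _cache["body"]
```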
The bottleneck is Twitter's API limits; data-wise and HTTP-connection-wise I have plenty of headroom.
Simple HTTP connections to parse/spider the follower records from public pages are a no-go, since Twitter then blocks the IP, and scaling this out will eventually not end well.