The title pretty much covers it. Twitter is more like an IM network or email list service than like a typical database-backed webapp; a relational database is just the wrong platform for large-scale messaging.
Many folks would say the database is where scaling problems belong. Lots and lots of programmer hours have already been expended in figuring out how to scale databases; no sense duplicating that effort in scaling your web framework.
I'd say Twitter's problem is that they're a multicast messaging app with a database backend, a combination that tends not to work so well. As an intern in college, I worked on a financial J2EE app that tried to do everything through JMS messaging queues backed by Oracle. It had similar scalability and performance problems.
In Twitter's defense, they didn't know they'd be a multicast messaging app when they started, and a database is a logical choice for what they did know they'd be (a website). They'll figure things out; they just need time and resources to rearchitect.
"Our new architecture will move our reliance to a simple, elegant filesystem-based approach, rather than a collection of database."
Finally! Using a database for Twitter was a huge mistake. Anyone developing an IM system should not use a database as general message storage. And especially not for a system that's multicasting at such a high rate.
It's not about doing stuff in platform A that can't be done in platform B. It's about delivering things faster than your competition, something that can't be done if you use the very same tools they use. You can, of course, be wrong and go with the wrong tool, but remember - it's a bet on a technology differential.
The Twitter problems stem not from the web framework they used but from flaws in the design of the data structures under it. When doing something like Twitter, one should never even consider using relational databases for more than a prototype.
You don't need to have Twitter's userbase to have Twitter's data problem. Look at someone like FlightCaster -- they're using a NoSQL database to handle the huge amount of data needed to be useful to even their first user.
With the advent of cheap, reliable, available commodity hardware and network access, previously difficult data problems are solvable. SQL doesn't make sense for all of those problems.
Twitter has been mostly not Rails for a very long time. You can't build a service like Twitter on top of a traditional relational database stack, so much of their infrastructure is based around custom message queue stuff.
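To make "custom message queue stuff" a bit more concrete, here's a minimal toy sketch (plain Python, nothing resembling Twitter's actual code; the follow graph and function names are made up) of the general pattern: the write path only enqueues the message, and a separate worker does the fan-out to followers asynchronously instead of inside a database transaction.

```python
import queue
import threading

# Toy only: no persistence, no locking beyond the queue itself.
incoming = queue.Queue()                    # the "message queue" between web tier and workers
inboxes = {}                                # user -> list of delivered (author, text) pairs
followers = {"alice": ["bob", "carol"]}     # made-up follow graph

def post(author, text):
    """Web tier: enqueue and return immediately -- no joins, no fan-out here."""
    incoming.put((author, text))

def delivery_worker():
    """Background daemon: pop messages and push them into each follower's inbox."""
    while True:
        author, text = incoming.get()
        for user in followers.get(author, []):
            inboxes.setdefault(user, []).append((author, text))
        incoming.task_done()

threading.Thread(target=delivery_worker, daemon=True).start()

post("alice", "hello world")
incoming.join()                             # wait for delivery in this toy example
print(inboxes["bob"])                       # [('alice', 'hello world')]
```

The point is the decoupling, not the data structures: in a real deployment the queue and the inboxes would be persistent, distributed services rather than in-process objects.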
I couldn't care less about Twitter's high-level abstractions. They were never renowned for those. Their database schemas and infrastructure, on the other hand...
Just wondering, is there a reason why Twitter doesn't use one of the many distributed in-memory database solutions? It seems like they had to write a lot of custom layering on top of it just to scale.
I'd be surprised if anyone thought building Twitter with similar scalability would be easy. Sure, while you could fit it into one MySQL DB it would be easy to get the basic features down with a much simpler UI and no API. Past that, though, they would be vastly underestimating the work.
"Such a platform can yield very satisfactory performance for tens or hundreds of thousands of active users"
There are 253 million Internet users in China alone. What happens when your site needs to scale from 0.001% of them using it simultaneously (roughly 2,500 concurrent users) to 1% of them (about 2.5 million)? Within a month?
"Of course if you index poorly or create some horrendous joins"
Which in the Twitter and Facebook cases is exactly what they have to do on many of their requests. As I've personally found out, relying on a database to do a join across a social network graph is a recipe for disaster. One day you'll be woken up because your database's query planner decided to switch from using a hash join to doing a full table scan against 10s of millions of users on every request. Then, you'll be left either trying to tweak the query plan back to working order, or actually doing what you should've done in the first place: architect around message queues and custom daemons more suited to your query load.
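To illustrate the difference (purely a sketch; the schema and table names here are invented, not Twitter's): the first query rebuilds the timeline by joining the follow graph against the tweets table on every read, which is exactly the query whose plan can fall over as the tables grow, while the second precomputes a per-user timeline at write time so a read is a single keyed lookup.

```python
import sqlite3

# Hypothetical schema, purely for illustration -- not Twitter's actual tables.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE follows  (follower TEXT, followee TEXT);
    CREATE TABLE tweets   (author TEXT, body TEXT, created_at INTEGER);
    CREATE TABLE timeline (user TEXT, author TEXT, body TEXT, created_at INTEGER);
""")

# Join-at-read-time: the timeline is recomputed on every request by joining
# the follow graph against the tweets table. Fine at small scale; this is the
# query whose plan can flip to a full scan once the tables get big.
JOIN_TIMELINE = """
    SELECT t.author, t.body FROM tweets t
    JOIN follows f ON f.followee = t.author
    WHERE f.follower = ?
    ORDER BY t.created_at DESC LIMIT 20
"""

# Fan-out-on-write: when a tweet is posted, copy it into each follower's
# precomputed timeline, so a timeline read becomes a single keyed lookup.
def post_tweet(author, body, ts):
    db.execute("INSERT INTO tweets VALUES (?, ?, ?)", (author, body, ts))
    followers = db.execute(
        "SELECT follower FROM follows WHERE followee = ?", (author,)).fetchall()
    for (follower,) in followers:
        db.execute("INSERT INTO timeline VALUES (?, ?, ?, ?)",
                   (follower, author, body, ts))

READ_TIMELINE = """
    SELECT author, body FROM timeline
    WHERE user = ? ORDER BY created_at DESC LIMIT 20
"""

db.execute("INSERT INTO follows VALUES ('bob', 'alice')")   # bob follows alice
post_tweet("alice", "hello world", 1)
print(db.execute(JOIN_TIMELINE, ("bob",)).fetchall())       # join on every read
print(db.execute(READ_TIMELINE, ("bob",)).fetchall())       # precomputed read
```

Either way the data is the same; the difference is that the read path stops depending on the query planner making the right call on a graph-shaped join.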
"Even with billions upon billions of help tickets."
At 50 million tweets a day, Twitter would hit 18 billion tweets within a year. Good luck architecting a database system to handle that kind of load. That is, one in which the database system is serving all of the requests (including Twitter streams) and isn't just being used for data warehousing.
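Back-of-the-envelope version of that, assuming a steady 50 million tweets a day:

```python
tweets_per_day = 50_000_000
per_year = tweets_per_day * 365        # 18,250,000,000 -- roughly 18 billion rows a year
per_second = tweets_per_day / 86_400   # ~580 sustained inserts/sec, before any read traffic
print(f"{per_year:,} tweets/year, ~{per_second:.0f} writes/sec on average")
```

And that average hides the peaks; the reads of those rows (every timeline fetch) come on top of the write rate.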
"Such a solution — even on a stodgy old RDBMS — is scalable far beyond any real world need"
The disconnect here is that this guy's needs are not the needs of a lot of us who are actively looking at alternatives. He is simply not familiar with the problem domain.
Scaling an application is easy. You just stick a load balancer in front of more and more webheads. The hard part is scaling the database!
Twitter is still running a single-master database system and - from what they've put out - only three database servers. They might also have some pretty but wasteful queries that do a lot of joins. Joins are not good for high-hit applications.
Microblogging is more like IM and email, so people who are talking about caching and databases and scaling Twitter like a traditional web site all miss the point. This is a rare exception; too bad the author abandoned his own attempt at doing it right. I think there's still a chance that somebody other than Twitter will be first to general adoption.
Which is sorta ridiculous considering that it's definitely not among the most complex of the Web apps out there. Even late competitors like www.shoutem.com are much more complex as they allow for a bunch of Twitters to be created on the same platform.
The only complex thing about Twitter is its size, and I bet their developers are working round the clock just to keep it from falling apart.