> A key feature of these environments is that state is not generally persisted from one request to the next. That means we can’t use standard client-side database connection pooling.
So we introduced so many optimizations just because we can't persist state. I can't help but think this is penny-wise, pound-foolish; the problems this article is solving wouldn't have been problems in the first place if you'd chosen a boring architecture.
> You’re already talking to stateful systems to do anything meaningful.
Yeah, so? They don’t have to be talking to the same system, and in fact that is literally what you called bullshit on originally.
> If you’re having trouble with that, you’ve got bigger issues.
That does absolutely nothing to change the fact that a SPoF is still an anti-pattern that should be avoided.
For that matter…
> A in-memory cache on top of session retrieval is so trivial and adds so few microseconds that it’s imperceptible even at large volumes of traffic.
Also does absolutely nothing to change that fact. You have done nothing to actually elaborate on why it’s totally not a horrendous idea to have everything communicate with the same database. Just because there’s a caching layer does not mean fresh data would still be available if the SPoF goes down, which, once again, is the whole point here.
>But database optimization has become less important for typical applications. <..> As much as I love tuning SQL queries, it's becoming a dying art for most application developers.
We thought so, too, but as our business started to grow, we had to spend months, if not years, rewriting and fine-tuning most of our queries, because every day there were reports of query timeouts in large clients' accounts. Some clients left because they were disappointed with performance.
Another issue is growing the development team. We made the application stateless so we can spin up additional app instances at no cost, or move them around between nodes, to make sure the load is evenly distributed across all nodes/CPUs (often a node simply dies for some reason). Since they are stateless, if an app instance crashes or becomes unstable, nothing happens and no data is lost; it's just restarted or moved to a less busy node. DB instances are now managed by the SRE team, which consists of a few very experienced devs, while the app itself (microservices) is written by several teams of varying experience, and you worry less about the app bringing down the whole production environment because microservice instances are ephemeral and can be quickly killed/restarted/moved around.
Simple solutions are attractive, but I'd rather invest in a more complex solution from the very beginning: if you plan for your business to grow, moving away from SQLite to something like Postgres can be costlier than investing some time up front in setting up 3-tier, and otherwise you can eventually end up reinventing 3-tier, but with SQLite. But that's just my experience; maybe I'm too used to our architecture.
> Transactions are inherently useless in a web service, for example, because you can’t open the transaction on the client
(1) There is no reason you couldn’t, in a stateful, connected web app (or a stateful web service), open backend DB transactions controlled by activity on the client. You probably don’t want to, because, “ew, stateful”, and would prefer to use compensation strategies to deal with business transactions that span beyond a single database transaction, but you could.
(2) The conclusion “transactions are inherently useless in a web service” does not follow from the premise “you can’t open the transaction on the client”. They are just two completely unrelated things that you’ve glued together with “because”. I write web services. The backends very often need a consistent view of state and to be able to assure that a sequence of mutating operations against the DB are applied all or nothing; transactions do both of those things. The fact that transactions are opened from the backend doesn’t make them useless.
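To make that concrete, here's a minimal sketch of the kind of backend-opened transaction I mean, assuming psycopg2 and a hypothetical orders/inventory schema:

```python
# Minimal sketch of a backend-opened transaction (psycopg2; schema is hypothetical).
# The client never sees the transaction: the handler opens it, applies a sequence of
# mutations, and commits or rolls back as a unit.
import psycopg2

def place_order(conn, customer_id: int, item_id: int, qty: int) -> None:
    with conn:                       # commits on success, rolls back on any exception
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE inventory SET stock = stock - %s "
                "WHERE item_id = %s AND stock >= %s",
                (qty, item_id, qty),
            )
            if cur.rowcount != 1:
                raise ValueError("insufficient stock")   # triggers the rollback
            cur.execute(
                "INSERT INTO orders (customer_id, item_id, qty) VALUES (%s, %s, %s)",
                (customer_id, item_id, qty),
            )

# conn = psycopg2.connect("dbname=shop")   # connection string is illustrative
```

Both rows change or neither does, and no client-side transaction handle is ever involved.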
> the part of the system where the vast majority of consistency issues happen (the client <-> server communication) will always be outside the transaction boundary.
“Consistency issues” in the ACID sense do not (cannot, in fact, since “consistency” is a property of database state) happen anywhere other than inside the database. Client <-> server communications have all kinds of issues, but ACID's C is not one of them.
> ACID fanboys love to talk about how all the big name internet companies are built on RDBMSes
Nah, actually, as an “ACID fanboy”, I’ll say that most big name internet companies were not built on ACID systems, and that if you are running a big internet company you have a much greater than usual chance of (1) having a case with a tradeoff that really does call for a non-ACID system, or (2) having to invent revolutionary new technology if it turns out you actually do need an ACID system, because off-the-shelf RDBMSes don’t scale the way you need.
But for everyone else, you probably aren’t Google.
> If MongoDB had come first and SQL/ACID RDBMSes had come after,
While they weren’t MongoDB, specifically, non-relational, non-ACID key-value stores where the “value” could be arbitrary data did exist before RDBMSes. OTOH, users of MongoDB and other similar NoSQL systems have often discovered that, oh yeah, they do want the things RDBMSes provide, which is why the pendulum swung somewhat back in the RDBMS direction after peak NoSQL hype.
NoSQL has its place, too, but the relational model and ACID guarantees actually do solve real problems.
> not ideal for long running/cache/database style services.
Well, one question to ask yourself when considering going down this route is whether it makes more sense to move all the statefulness into managed services, like Aurora, BigTable, S3, etc.
That drastically simplifies life. Now the only infrastructure you directly manage is stateless workloads that can easily be self-healed, rolled back, scaled up/down, etc. Managed DBs are more expensive than running your own DB, but most likely the cost savings of moving the rest of the infrastructure to spot/preemptible instances outweigh this difference.
> In the three-tier frontend-backend-database architecture, try to have as thin a middle layer as possible. Or even eliminate the backend layer altogether by directly exposing the database to the frontend.
Wat...
This article reads like a jumbled mess of ideas. But, the above is a total wat...
And it also assumes that your backend doesn't have to consume and tie together several resources, or need a caching layer, among so many other naive assumptions - and I'm not even addressing the security implications.
Take this article with a heavy cup of salt.
To me it just reads like something someone would write because they want full control and don't like working with and/or being dependent on other people/teams.
> When you want distrubted-db, yeah you need to design
But one shouldn't be designing for implementation details. Usually, we start technologies out with leaky abstractions, and gradually get better at them. A good example is game development, where it used to be that you always used the drawing method of the display and the clock speed of the CPU to your advantage. Nowadays, we've moved past that, because it was working on horrible abstractions, and because the technology underneath improved.
I'm not saying I have a solution, and I agree that this problem rears its head the most when you start bringing in distributed storage. But my point stands: these databases run on a highly leaky abstraction, and that's a big problem going forward.
> My point is instead of doing 100k transactions in your web app, you should look at how to gather them into batches.
This sounds odd to me. Assuming 100k independent web requests, a reasonable web app ought to be 100k transactions, and the database is the bottleneck. Suggesting that the web app should rework all this into one bulk transaction is ignoring the supposed value-add of the database abstraction and re-implementing it in the web app layer.
And most attempts are probably going to do it poorly. They're going to fail at maintaining the coherence and reliability of the naive solution that leaves this work to the database.
Of course, one ought to design appropriate bulk operations in a web app where there is really one client trying to accomplish batches of things. Then, you can also push the batch semantics up the stack, potentially all the way to the UX. But that's a lot different than going through heroics to merge independent client request streams into batches while telling yourself the database is not actually the bottleneck...
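To be clear about what appropriate bulk operations look like when one client really does submit a batch, here's a rough sketch (psycopg2's execute_values; the events table is hypothetical):

```python
# Sketch: one bulk insert in one transaction, instead of N single-row transactions,
# for the case where a single client genuinely submits a batch. Table is hypothetical.
import psycopg2
from psycopg2.extras import execute_values

def record_events(conn, events):
    # events: list of (user_id, kind, payload) tuples submitted together by one client
    with conn:
        with conn.cursor() as cur:
            execute_values(
                cur,
                "INSERT INTO events (user_id, kind, payload) VALUES %s",
                events,
            )
```

That's very different from trying to glue 100k independent requests into one transaction in the app layer.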
> But in a real production application with heavy scale, you need a lot of low-level control over precisely how state is managed.
Very much this. "State is managed automatically!" can very quickly become "this frequent operation is automatically doing a huge amount of DB work which should actually be deferred until later"
> These days it's pretty trivial to have a cloud managed component e.g. Redis that is maintained, upgraded and supported.
It's still another moving part. Things should be as simple as they can be, but no simpler.
> trying to use a database as a poor man's queue.
Of course, it's not really a "poor man's" queue-- it's got some superior capabilities. It just loses on top-end performance. (Of course, using those capabilities is dangerous, because it creates some degree of lock-in, so go into it open-eyed).
For as much as you accuse others of looking down their nose / not willing to seriously consider other technologies... you seem to be inclined that way yourself.
Redis is great. But if you have Postgres already, and modest to moderate queuing requirements, why add another piece to your stack? Postgres-by-default is not a bad technology sourcing strategy.
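For what it's worth, the pattern people usually mean by a Postgres-backed queue is FOR UPDATE SKIP LOCKED; a rough sketch, with a hypothetical jobs table:

```python
# Sketch of the usual Postgres-as-queue pattern: SKIP LOCKED lets several workers pull
# jobs concurrently without grabbing the same row. The jobs table is hypothetical.
import psycopg2

def claim_job(conn):
    with conn:                                 # one transaction per claim
        with conn.cursor() as cur:
            cur.execute(
                "SELECT id, payload FROM jobs "
                "WHERE status = 'pending' "
                "ORDER BY id "
                "LIMIT 1 "
                "FOR UPDATE SKIP LOCKED"
            )
            row = cur.fetchone()
            if row is None:
                return None                    # queue empty, or all rows already claimed
            job_id, payload = row
            cur.execute("UPDATE jobs SET status = 'running' WHERE id = %s", (job_id,))
            return job_id, payload
```

And because it's just a table, enqueuing a job can happen in the same transaction as the rest of your writes, which is exactly the kind of superior capability I meant.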
> There is zero reason for a language runtime to be opinionated about the database.
The API is extremely simple. It's little more than a hash map. That's hardly opinionated.
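Roughly this shape, for anyone who hasn't looked at it (a hypothetical sketch of the surface area, not the runtime's actual API):

```python
# Hypothetical sketch of "little more than a hash map" -- just the shape of the surface
# being argued about, not any particular runtime's real API.
from typing import Any, Iterator, Optional, Tuple

class KV:
    def get(self, key: str) -> Optional[Any]: ...
    def set(self, key: str, value: Any) -> None: ...
    def delete(self, key: str) -> None: ...
    def list(self, prefix: str) -> Iterator[Tuple[str, Any]]: ...
```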
> All the good programming advice will tell you such a coupling is a bad idea.
That advice is outdated. Virtually any modern application will want a database, and this API can serve as a foundation for it – a foundation that conveniently already has many of the features (esp. global replication and consistency) that you will want and that are incredibly difficult to get right if you build them yourself. Think of this as the application's "file system". PaaS today usually doesn't provide direct disk access, so this is the low-level abstraction for persistent storage that is available. An abstraction that, in many cases, will in fact be all your application ever needs.
> I think the only really scalable part here is the database
That sums it up pretty well.
The great irony here is that those thin, stateless web servers that are the only things to easily autoscale are pretty much guaranteed to never be your performance bottleneck, unless you're doing something really strange.
> If you end up with enough traffic that it actually becomes a problem, it would have been a problem anyway because you'd be running a lot of servers with persistent connections in a more traditional model.
I think this is the part I disagree with. DB connection pools are much, much smaller than the total # of functions that touch a database in any reasonably complex application.
Yes, scale is always an issue, but it seems to me that in this serverless world where you have 1 connection per function, you run into scale issues a lot (an order of magnitude?) faster than the "traditional" way.
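For comparison, the "traditional" model I mean is a handful of long-lived processes, each sharing one small, fixed pool across every code path that touches the DB; a sketch with psycopg2's built-in pool (sizes and DSN are illustrative):

```python
# One process, one small fixed pool shared by every code path that touches the database.
# Pool size and DSN are illustrative.
from psycopg2.pool import ThreadedConnectionPool

pool = ThreadedConnectionPool(minconn=2, maxconn=10, dsn="dbname=app")

def run_query(sql, params=()):
    conn = pool.getconn()
    try:
        with conn:
            with conn.cursor() as cur:
                cur.execute(sql, params)
                return cur.fetchall()
    finally:
        pool.putconn(conn)
```

Ten connections can serve hundreds of code paths; one connection per function instance can't.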
> a good abstraction means only one or maybe two functions actually talks to the database
In a serverless world, does this mean you would run a handful of functions with DB connections, and other functions would proxy db requests through them? I can see that working ok I suppose.
> The down side is that you aren't going to be getting the upsides you'd get if you used a new technology
True, of course. I've found that "premature abstraction" is almost a requirement when using persistence. I'm not talking about an ORM or anything, but put a service in front of it if you have multiple things needing it, or put an in-code abstraction if you don't. And make each abstraction's/service's operation very specific to its caller. A simple "fetchMarketData(timeRange, tickerSymbols) -> StreamOfData" will save you so much in the future.
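Something like this is all I mean (names, schema and storage are hypothetical; psycopg2 assumed). The caller never sees SQL, so the storage behind it can change later:

```python
# A narrow, caller-specific abstraction over persistence. Names and schema are
# hypothetical; the point is that callers depend on this signature, not on the storage.
from datetime import datetime
from typing import Iterator, Sequence, Tuple

Row = Tuple[str, datetime, float]   # (symbol, timestamp, price)

def fetch_market_data(conn,
                      time_range: Tuple[datetime, datetime],
                      ticker_symbols: Sequence[str]) -> Iterator[Row]:
    start, end = time_range
    with conn.cursor() as cur:
        cur.execute(
            "SELECT symbol, ts, price FROM market_ticks "
            "WHERE ts BETWEEN %s AND %s AND symbol = ANY(%s) "
            "ORDER BY ts",
            (start, end, list(ticker_symbols)),
        )
        yield from cur
```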
> monolithic large relational databases are hard to scale
DB2 on z/OS was able handle billions of queries per day.
In 1999.
Some greybeards took great delight in telling me this sometime around 2010 when I was visiting a development lab.
> When you have one large database with tons of interdependencies, it makes migrating data, and making schema changes much harder.
Another way to say this is that when you have a tool ferociously and consistently protecting the integrity of all your data against a very wide range of mistakes, you have to sometimes do boring things like fix your mistakes before proceeding.
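To make that concrete (hypothetical schema, psycopg2 assumed):

```python
# The "ferocious protection" in practice: a foreign key refuses the mistake up front
# instead of leaving orphaned rows to be discovered months later. Schema is hypothetical.
import psycopg2

def delete_customer(conn, customer_id: int) -> None:
    with conn:
        with conn.cursor() as cur:
            # If orders still reference this customer, Postgres raises
            # psycopg2.errors.ForeignKeyViolation here, so you have to deal with
            # those orders first rather than proceeding with bad data.
            cur.execute("DELETE FROM customers WHERE id = %s", (customer_id,))
```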
> In theory better application design would have separate upstream data services fetch the resources they are responsible for.
A join in the application is still a join. Except it is slower, harder to write, more likely to be wrong and mathematically guaranteed to run into transaction anomalies.
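To spell that out (hypothetical schema, and I'm collapsing the "upstream data services" into one connection for brevity):

```python
# The same join, done by hand in the application vs. declaratively in the database.
# The app-side version is two round trips with no guarantee both reads saw a
# consistent snapshot. Schema is hypothetical.

def orders_with_customers_app_side(conn):
    with conn.cursor() as cur:
        cur.execute("SELECT id, customer_id, total FROM orders")
        orders = cur.fetchall()
        cur.execute("SELECT id, name FROM customers")
        names = {cid: name for cid, name in cur.fetchall()}
    # the "join", reimplemented by hand in application code
    return [(oid, names.get(cid), total) for oid, cid, total in orders]

def orders_with_customers_db_side(conn):
    with conn.cursor() as cur:
        cur.execute(
            "SELECT o.id, c.name, o.total "
            "FROM orders o JOIN customers c ON c.id = o.customer_id"
        )
        return cur.fetchall()
```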
I think non-relational datastores have their place. Really. There are certain kinds of traffic patterns in which it makes sense to accept the tradeoffs.
But they are few. We ought to demand substantial, demonstrable business value, far outweighing the risks, before being prepared to surrender the kinds of guarantees that a RDBMS is able to provide.
> I feel like the author is making a huge omission by not talking about the properties of multi-raft sharded DBs coupled with the TrueTime-like APIs from cloud providers in 2018.
I mean, this is all fine and dandy if it works (plus, how do you know that it actually works in all the edge cases?), but that's a HUGE amount of complexity. IMO you really need to have a very good reason to even consider anything this complex.
> When you can't query your DB, you hit your resident memory cache. Redis, Memcached, etc.
You would query those first, surely, before loading the DB? Regardless, I've been using Redis and Memcache for years from PHP, so it's a moot point.
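For reference, the read path I mean is the usual cache-aside one (redis-py and psycopg2 assumed; key format, TTL and schema are illustrative):

```python
# Cache-aside read: check Redis first, fall back to the DB on a miss, then populate
# the cache. Key format, TTL and schema are illustrative.
import json
import redis

cache = redis.Redis(host="localhost", port=6379)

def get_user(conn, user_id):
    key = "user:%d" % user_id
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    with conn.cursor() as cur:
        cur.execute("SELECT id, name, email FROM users WHERE id = %s", (user_id,))
        row = cur.fetchone()
    if row is None:
        return None
    user = {"id": row[0], "name": row[1], "email": row[2]}
    cache.setex(key, 300, json.dumps(user))    # 5-minute TTL, purely illustrative
    return user
```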
> I use multiple threads in a single request flow frequently.
So how do you track those? And for how long do they live after the parent request has been processed, or do they block the parent? Gets complicated quickly.
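Even the simple version of this needs explicit joining and timeouts just so nothing outlives the parent request; a sketch with Python's concurrent.futures (the two loaders are stand-ins):

```python
# Fan out, then join the futures before responding, so child threads never outlive the
# parent request and their exceptions surface in the request flow. Loaders are stand-ins.
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=8)    # shared per process, size illustrative

def load_profile(user_id):        # stand-in for a real I/O call
    return {"id": user_id}

def load_recent_orders(user_id):  # stand-in for a real I/O call
    return []

def handle_request(user_id):
    f_profile = executor.submit(load_profile, user_id)
    f_orders = executor.submit(load_recent_orders, user_id)
    # .result() blocks the parent until its children finish (or time out),
    # so the threads are accounted for before the response goes out.
    return {
        "profile": f_profile.result(timeout=2),
        "orders": f_orders.result(timeout=2),
    }
```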
> It's ugly and inconsistent, takes time to memorize, and leads to errors.
Rasmus will admit the same, but he will also admit he does not care (I've seen him say this during a talk). Nobody bought your product because it had beautiful, consistent code.
> This limits you to writing basic CRUD. And so many other languages offer this.