
> why not use different databases? They cost nothing and provide perfect separation.

I understand the sentiment, but this is a pretty simplistic take that I very much doubt will hold true for meaningful traffic. Many databases have licensing considerations that aren't amenable to this. Beyond that, you get into density and resource problems as basic as I/O, processes, threads, etc. But most of all, there's the time and effort burden of supporting migrations, schema updates, and so on.

Yes, layered logical separation is a really good idea. It's also really expensive once you start dealing with organic growth and a meaningful number of discrete customers.

Disclaimer: Principal at AWS who has helped build and run services with both multi-tenant and single-tenant architectures.




> Use One Big Database.

I emphatically disagree.

I've seen this evolve into tightly coupled microservices that could be deployed independently in theory, but required exquisite coordination to work.

If you want them to be on a single server, that's fine, but having multiple databases or schemas will help enforce separation.

And, if you need one single place for analytics, push changes to that space asynchronously.
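
As a sketch of that asynchronous push (assuming Postgres with psycopg2; the outbox table and the orders/analytics schema names are invented for illustration), a small worker can copy service-side change events into the analytics space so nothing reads the service tables directly:

    import psycopg2

    def drain_outbox(dsn):
        # Copy new change events from the service's outbox into the analytics
        # schema; analytics consumers never read the service tables directly.
        conn = psycopg2.connect(dsn)
        try:
            with conn:  # one transaction: the copy and the mark-as-processed commit together
                with conn.cursor() as cur:
                    cur.execute(
                        "SELECT id, payload FROM orders.outbox "
                        "WHERE processed_at IS NULL ORDER BY id LIMIT 500"
                    )
                    for event_id, payload in cur.fetchall():
                        cur.execute(
                            "INSERT INTO analytics.order_events (source_id, payload) VALUES (%s, %s)",
                            (event_id, payload),
                        )
                        cur.execute(
                            "UPDATE orders.outbox SET processed_at = now() WHERE id = %s",
                            (event_id,),
                        )
        finally:
            conn.close()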

Having said that, I've seen silly optimizations employed that make sense when you are Twitter, and to nobody else. Slice services up only to the point where they still do something meaningful in terms of the solution, and avoid going any further.


>Probably an unpopular opinion, but I think having a central database that directly interfaces with multiple applications is an enormous source of technical debt and other risks, and unnecessary for most organizations. Read-only users are fine for exploratory/analytical stuff, but multiple independent writers/cooks is a recipe for disaster.

In my org I've felt the pain of having centralized DBs (with many writers and many readers). A lot of our woes come from legacy debt: some of these databases are quite old (a number date back to the mid 90s), so over time they've ballooned considerably.

The architecture I've found that makes things less painful is to split the centralized database into two databases.

On Database A you keep the legacy schemas etc. and restrict access to the DB writers only (in our case we have A2A messaging queues as well as some compiled binaries that write directly to the DB). Then you have data replicated from Database A into Database B. Database B is where the data consumers (BI tools, reporting, etc.) interface with the data.

You can exercise greater control over the schema on B, which is exposed to the data consumers, without needing to mass-recompile binaries, which can continue writing to Database A.
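
One possible way to wire up that A-to-B replication, sketched here with Postgres logical replication via psycopg2 (the engine above isn't stated, and the publication, table, and connection details are hypothetical):

    import psycopg2

    # On Database A: publish the legacy tables the consumers care about.
    conn_a = psycopg2.connect("dbname=database_a")
    cur_a = conn_a.cursor()
    cur_a.execute("CREATE PUBLICATION legacy_pub FOR TABLE billing.invoices, billing.payments")
    conn_a.commit()
    conn_a.close()

    # On Database B: subscribe, which copies the data and then streams changes.
    # CREATE SUBSCRIPTION cannot run inside a transaction block, hence autocommit.
    conn_b = psycopg2.connect("dbname=database_b")
    conn_b.autocommit = True
    cur_b = conn_b.cursor()
    cur_b.execute(
        "CREATE SUBSCRIPTION legacy_sub "
        "CONNECTION 'host=db-a dbname=database_a user=replicator' "
        "PUBLICATION legacy_pub"
    )
    conn_b.close()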

I'm not sure how "proper" DBAs feel about this split, but it works for my use case and has helped rein in the ballooning legacy databases somewhat.


> Of course you run into issues with transactions occurring across multiple databases, but these problems are hard but solvable.

The only thing you need to do to fix this is run all the services on the same DB.

> This sounds crazy. I don't know any large companies that have successfully implemented it. This is basically arguing for a giant central database across the entire company. Good luck getting the 300 people necessary into a room and agreeing on a schema.

You don't need every service to use the same schema. You only need transactions that span all services. They can use any data schema they want. A single DB is only used for the ACID guarantees.
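
A minimal sketch of that idea, assuming Postgres with psycopg2 and two hypothetical service-owned schemas; the point is only that the two writes commit or roll back together because they share one instance:

    import psycopg2

    def place_order(dsn, user_id, amount):
        # Each service owns its own schema (orders, billing), but because both
        # live in one instance the two writes commit or roll back together.
        conn = psycopg2.connect(dsn)
        try:
            with conn:  # a single ACID transaction spanning both schemas
                with conn.cursor() as cur:
                    cur.execute(
                        "INSERT INTO orders.orders (user_id, amount) VALUES (%s, %s) RETURNING id",
                        (user_id, amount),
                    )
                    order_id = cur.fetchone()[0]
                    cur.execute(
                        "UPDATE billing.accounts SET balance = balance - %s WHERE user_id = %s",
                        (amount, user_id),
                    )
            return order_id
        finally:
            conn.close()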


> that at 100k DAU vertical

That's chump change even for a medium EC2/RDS instance, which should be capable of tens of millions of queries a day without the CPU or disk complaining at you (unless all your queries are table scans or unindexed).
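
Rough arithmetic behind that claim; the queries-per-user figure is an assumption, not something stated above:

    dau = 100_000                                   # from the quote above
    queries_per_user_per_day = 100                  # assumed, purely illustrative
    total_per_day = dau * queries_per_user_per_day  # 10,000,000 queries/day
    avg_qps = total_per_day / 86_400                # ~116 queries/second on average
    peak_qps = avg_qps * 10                         # ~1,160 qps assuming a spiky 10x peak
    print(round(avg_qps), round(peak_qps))          # 116 1157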

> my db doesn’t contain just “Bob ordered 5 widgets to Foo Lane” data

It doesn't matter; it's still just bytes. What will matter is your query pattern relative to the database's query planner efficacy, and how updates/deletes impact this.

> makes it very easy to manage data across realms

You can just as easily do this at first with separate databases/schemas on the same physical server, with different users and permissions to prevent cross-database/schema joins, so that when you need to move them to different machines it's an easier process.
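
A hedged sketch of that permission setup on one Postgres server, assuming psycopg2; the naming scheme is invented, and `tenant` is assumed to be a validated internal identifier rather than user input:

    import psycopg2

    def provision_tenant(admin_dsn, tenant):
        # `tenant` must be a trusted, validated identifier: it is interpolated
        # into DDL, which cannot take ordinary query parameters.
        conn = psycopg2.connect(admin_dsn)
        try:
            with conn, conn.cursor() as cur:
                cur.execute(f"CREATE SCHEMA tenant_{tenant}")
                cur.execute(f"CREATE ROLE app_{tenant} LOGIN")
                # The role can only use its own schema...
                cur.execute(f"GRANT USAGE, CREATE ON SCHEMA tenant_{tenant} TO app_{tenant}")
                # ...and resolves unqualified table names there, so an accidental
                # cross-tenant join fails with a permission error instead of leaking data.
                cur.execute(f"ALTER ROLE app_{tenant} SET search_path = tenant_{tenant}")
        finally:
            conn.close()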

Everyone I know who has tried isolated multi-tenancy that wasn't driven by legal requirements ended up abandoning the approach and consolidating onto as little hardware as possible. Heap Analytics had a blog post a few years ago about this, but I can't seem to find it.

Regardless, I hope your endeavor succeeds and that you come back in a few months to prove us all wrong.


>However, I am a strong believer in multi-tenant architectures as they allow you to scale while pooling resources (I personally think single-tenant at scale is nonsense in terms of ecology), and we will invest in maintaining a multi-tenant architecture.

Our products use multi-tenant architecture in the form of 1 database file per company, and a single database server for everyone (by default). It's great for data isolation, as we can't accidentally leak sensitive corporate data from one company's account to another (say, a missing WHERE). It's also great for indexing, as DB queries only touch small subsets of data. And it works well for most businesses (10-100 employees). For large companies (not that many of them), if we detect a lot of activity which stresses the main database server, we have infrastructure in place to migrate them to dedicated servers, transparently to users. It's worked pretty well so far.
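
A toy sketch of that kind of per-company routing, shown with SQLite purely for illustration (the parent's actual engine and file layout aren't specified; all names here are made up):

    import re
    import sqlite3

    DATA_DIR = "/var/lib/app/tenants"  # hypothetical layout

    def connect_for_company(company_id: str) -> sqlite3.Connection:
        # Company ids are restricted to plain slugs so they can't escape DATA_DIR.
        if not re.fullmatch(r"[a-z0-9_]+", company_id):
            raise ValueError("invalid company id")
        # One file per company: a forgotten WHERE clause can only ever expose
        # rows belonging to that same company.
        return sqlite3.connect(f"{DATA_DIR}/{company_id}.db")

    # Usage (assuming the file and its tables already exist):
    #   conn = connect_for_company("acme")
    #   conn.execute("SELECT count(*) FROM invoices")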


> The resource mismatch currently leaves a lot of compute sitting idle in practice.

When storage is separate, multi-tenant, and paired with elastic compute, you can really squeeze out more efficiency. You can keep utilization extremely high because storage is effectively spread across the whole fleet. We really see this now, running 10K databases on a single storage deployment (one per region, really).

Compute is also elastic, so again you can drive utilization up. So my argument is that separation of storage and compute is not necessarily for performance, but rather for utilization, convenience of operation, and additional capabilities (branching, serverless).


> For example, this Hacker News comment stated that using Postgres this way is “hacky” and the commenter received no pushback. I found the comment to be a load of BS and straw man arguments. This thinking seems to be “the prevailing wisdom” of the industry – if you want to talk about queue technology in public, it had better not be a relational database.

I don't think there's anything wrong with using a database as a queue; however, I think there probably were better ways to get the idea across than just dismissing an honest opinion as BS. I don't necessarily agree with all of what was said there, but at the same time I can see why those arguments would be reasonable: https://news.ycombinator.com/item?id=20022572

For example:

> Because it is hacky from the perspective of a distributed system architecture. It's coupling 2 components that probably ought not be coupled because it's perceived as "convenient" to do so. The idea that your system's control and data planes are tightly coupled is a dangerous one if your system grows quickly.

To me, this makes perfect sense if you're using the same database instance for the typical RDBMS use case AND also for the queue. Then again, that could be avoided by having separate database instances/clusters and treating those as separate services: prod-app-database and prod-queue-database.

That said, using something like RabbitMQ or another specialized queue solution might also have the additional benefit of plenty of tutorials, libraries, and other resources, which is pretty much always the case when comparing a well-known technology with a more niche one, even when the latter might be better in some ways! After all, there is a reason why many would use Sidekiq, resque, rq, Hangfire, asynq, and the other libraries that were mentioned, which already have lots of content around them.

Though whether the inherent complexity of the system or the complexity of your code that's needed to integrate with it is more important, is probably highly situational.


> [meme]You can't simply use a second machine/instance.

I'm assuming [meme] is shorthand for “overly broad assertion”? It's nice if your database has horizontal scaling built in, but it's not like we don't have an entire generation of successful companies who had application-level sharding logic, either out of necessity or because they found the control it offered valuable compared to the built-in generic logic.
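
For what it's worth, that application-level sharding logic is often not much code. A toy sketch, where the shard count, the DSNs, and the choice of hashing are all assumptions (real systems frequently use a lookup table instead so customers can be moved):

    import hashlib

    # DSNs are placeholders; real systems often keep a customer -> shard lookup
    # table instead of pure hashing so individual customers can be rebalanced.
    SHARD_DSNS = [
        "host=db-shard-0 dbname=app",
        "host=db-shard-1 dbname=app",
        "host=db-shard-2 dbname=app",
        "host=db-shard-3 dbname=app",
    ]

    def dsn_for_customer(customer_id: str) -> str:
        # Stable hash so the same customer always lands on the same shard.
        digest = hashlib.sha1(customer_id.encode("utf-8")).digest()
        return SHARD_DSNS[int.from_bytes(digest[:4], "big") % len(SHARD_DSNS)]

    # Usage with any Postgres driver, e.g.:
    #   psycopg2.connect(dsn_for_customer("cust_42"))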

> While it's true that most companies will be fine with 256GB of RAM, we were talking about sharding, which it doesn't have.

You still haven't supported the assertion that it's common for places to have massive, heavily queried databases like this that would not be better split along natural application-level boundaries. This is particularly relevant when discussing AWS, since some of the common reasons for keeping everything in one database are sweet spots for other services (e.g. migrating data for long-term historical reporting over to Redshift).

Again, I'm not questioning that integrated sharding would have its uses – only your sweeping assertion that this is a likely problem for most people and that it's a dead-end (“you're stuck”) rather than merely one of many growing pains which you'll deal with on a successful product. In particular, it's unlikely that everyone will have the same right answer at that scale since access patterns vary widely.


> These days it's pretty trivial to have a cloud managed component e.g. Redis that is maintained, upgraded and supported.

It's still another moving part. Things should be as simple as they can be, but no simpler.

> trying to use a database as a poor man's queue.

Of course, it's not really a "poor man's" queue: it's got some superior capabilities. It just loses on top-end performance. (Of course, using those capabilities is dangerous because it creates some degree of lock-in, so go into it with your eyes open.)
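
A minimal sketch of one of those capabilities, assuming Postgres with psycopg2 and a hypothetical jobs table: enqueueing can share a transaction with the business write, and workers claim jobs with FOR UPDATE SKIP LOCKED so they don't block each other.

    import psycopg2

    def claim_next_job(conn):
        # Claim and mark one pending job atomically; concurrent workers skip
        # rows that are already locked instead of queueing behind them.
        with conn:
            with conn.cursor() as cur:
                cur.execute(
                    "SELECT id, payload FROM jobs "
                    "WHERE status = 'pending' "
                    "ORDER BY id LIMIT 1 "
                    "FOR UPDATE SKIP LOCKED"
                )
                row = cur.fetchone()
                if row is None:
                    return None
                job_id, payload = row
                cur.execute("UPDATE jobs SET status = 'running' WHERE id = %s", (job_id,))
                return job_id, payload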

For as much as you accuse others of looking down their noses / not being willing to seriously consider other technologies... you seem to be inclined that way yourself.

Redis is great. But if you have Postgres already, and modest to moderate queuing requirements, why add another piece to your stack? Postgres-by-default is not a bad technology sourcing strategy.


> Is a shared database in microservices actually an anti-pattern?

No, unless you don't want (or care about having) a single source of truth. If state is distributed, it's harder to back up/restore/rewind or even query atomically and reproducibly.

You could still make that single database sharded and replicated, but in most cases it's still one shared database. Even storing files outside of the DB is just sharding; the important thing is that the DB refers to the file and remains the single source of truth.
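
A tiny sketch of that "DB refers to the file" idea, assuming psycopg2 and an external object store; the table and column names are hypothetical:

    import psycopg2

    def register_document(conn, doc_id, object_key, sha256):
        # The row is authoritative: if it doesn't exist, the object in the
        # external store is just orphaned bytes awaiting garbage collection.
        with conn:
            with conn.cursor() as cur:
                cur.execute(
                    "INSERT INTO documents (id, object_key, sha256) VALUES (%s, %s, %s)",
                    (doc_id, object_key, sha256),
                )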

And when you want vertical segmentation, look at designs like multi-tenant, but that's still not a database per microservice: quite the opposite.

When you don't want a single source of truth, you can of course do without it.


> One of the selling points, which is now understood to be garbage, is that you can use different databases.

It was a major selling point back in the day. You can say that it's legacy now, but it was definitely a thing pre-cloud/SaaS.

Lots of software used ORMs to offer multi-database support, which was required when you sold a license and users installed it on-premises. Some organizations strictly allowed only a certain brand of database.

You couldn't just spin up a random database flavor in AWS, Azure, or GCP. There were in-house DBAs, and you were stuck with what they supported.


> Not sure what you mean about scaling the storage layer separately from the compute layer. It's how it's usually done currently, with the disk being the storage layer.

Yeah, that's what I meant. DynamoDB and Aurora Serverless are examples where the two are decoupled and can be scaled independently. So I wanted to know about more products like them, but ones that scale with load really well.

> AWS now has Aurora Serverless, though I don't have any real-life experience with its latency yet (the VPC requirement is a downer for now).

I have tried Aurora Serverless and I like it, but it's also plagued by provisioning problems for now. The lack of options to customize database parameters was kind of a downer. Also, for some weird reason, they only support MySQL on t-type instances.

> (VPC requirement is a downer for now)

BTW, do you have a use case where you need to use the DB from outside the VPC and can't do peering?


> Why not just use the right tool for the job ? (i.e. hosted DB or compute).

Because sometimes you want a database for 10,000 rows, or you have 5 logins a month and don't want a $30/month database running in the cloud.

There's a market out there for a truly "serverless" database that charges per row stored, is priced per read/write operation, and is reasonably priced for 10,000 rows a month, without you having to calculate how many 4KB blocks you are going to read or write.


> monolithic large relational databases are hard to scale

DB2 on z/OS was able to handle billions of queries per day.

In 1999.

Some greybeards took great delight in telling me this sometime around 2010 when I was visiting a development lab.

> When you have one large database with tons of interdependencies, it makes migrating data, and making schema changes much harder.

Another way to say this is that when you have a tool ferociously and consistently protecting the integrity of all your data against a very wide range of mistakes, you have to sometimes do boring things like fix your mistakes before proceeding.

> In theory better application design would have separate upstream data services fetch the resources they are responsible for.

A join in the application is still a join. Except it is slower, harder to write, more likely to be wrong and mathematically guaranteed to run into transaction anomalies.
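
A small sketch of that point, assuming psycopg2 and two hypothetical tables (orders, users): the hand-rolled version re-implements a hash join in application code, with extra round trips and no single snapshot across the two reads unless you ask for one.

    import psycopg2

    def orders_with_user_names(conn):
        # Application-side "join": two queries plus a dict lookup.
        with conn.cursor() as cur:
            cur.execute("SELECT id, user_id, amount FROM orders")
            orders = cur.fetchall()
            cur.execute("SELECT id, name FROM users")
            names = dict(cur.fetchall())
        # Under the default READ COMMITTED level each statement sees its own
        # snapshot, so the two reads can disagree unless you use REPEATABLE READ.
        return [(oid, names.get(uid), amount) for oid, uid, amount in orders]

    # What the database would do in one planned, consistent statement:
    #   SELECT o.id, u.name, o.amount
    #   FROM orders o JOIN users u ON u.id = o.user_id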

I think non-relational datastores have their place. Really. There are certain kinds of traffic patterns in which it makes sense to accept the tradeoffs.

But they are few. We ought to demand substantial, demonstrable business value, far outweighing the risks, before being prepared to surrender the kinds of guarantees that an RDBMS is able to provide.


>Interesting choice of words. Performance wise, sure. Money wise? I'm still waiting for a SQL database with pay-per-request pricing. The cost difference is enormous, particularly when you remember that you don't need to spend manpower managing the underlying hardware.

I assume you're saying DynamoDB is less expensive than SQL because of pay-per-request.

Working on applications with a modest amount of data (a few TB over a few years), pay-per-request has been incredibly expensive even with scaled provisioning. I would much rather have an SQL database and pay for the server(s). Then I could afford a few more developers!


> This is a case where the db server should use the entire resources of a single server

They have thousands of clusters. They didn't design/architect anything.

They're likely just trying to consolidate databases because they are heavily underutilized and no one knows WTF they are running. And the organization will keep growing like that, adding new databases every day.


> not ideal for long running/cache/database style services.

Well, one question to ask yourself when considering going down this route is whether it makes more sense to move all the statefulness into managed services, like Aurora, BigTable, S3, etc.

That drastically simplifies life. Now the only infrastructure you manage directly is stateless workloads that can easily be self-healed, rolled back, and scaled up/down. Managed DBs are more expensive than running your own DB, but the cost savings of moving the rest of the infrastructure to spot/preemptible instances most likely outweigh the difference.


> My impression from the article is that this is a single SQL database being discussed.

Even if it's a single database initially, it's a bad idea to assume that it will stay that way forever and that you are never going to use third-party providers in the future.

How well does ON UPDATE CASCADE work if there are millions of existing relations to that entity?


> we opted for having one DB per tenant which also makes sharding, scaling, balancing, and handling data-residency challenges easier

When you say 1 DB, I suspect you mean you have a single DB server and multiple DBs on that server. Then I don't think this really solves the data-residency problem, as the client's data is just in a different DB but still on the same instance. It creates other problems for you as well: for example, you now have 2 DBs to run maintenance, upgrades, and data migrations on. My current company uses a similar model for multiple types of systems, and it makes upgrading the software very difficult.
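
To make that maintenance cost concrete, here is a hedged sketch (assuming Postgres with psycopg2; the DSN list and DDL are invented) of what every schema change turns into once there is one DB per tenant:

    import psycopg2

    TENANT_DSNS = [
        "host=db-1 dbname=tenant_acme",
        "host=db-1 dbname=tenant_globex",
        # ...hundreds more in a real deployment
    ]

    MIGRATION = "ALTER TABLE invoices ADD COLUMN IF NOT EXISTS currency text DEFAULT 'USD'"

    def migrate_all():
        failures = []
        for dsn in TENANT_DSNS:
            try:
                conn = psycopg2.connect(dsn)
                with conn, conn.cursor() as cur:
                    cur.execute(MIGRATION)
                conn.close()
            except psycopg2.Error as exc:
                # A partial failure leaves tenants on mixed schema versions,
                # which is exactly the operational pain described above.
                failures.append((dsn, str(exc)))
        return failures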

It also makes scaling more difficult: instead of having a single DB cluster that you can tweak for everyone, you'll need to tweak each cluster individually depending on the tenants that are on it. You also have a practical limit to how many DBs you can have on any physical instance, so your load balancing will become very tricky.

It also causes other problems, such as with federation, which enterprise customers often want.
