
> but still having to combine data (the comment I'm responding to).

The GP frames the problem as merely avoiding round trips. You introduced the idea of eliminating redundant database calls on the back end. In that scenario, you absolutely have the right idea.




> Fewer points of coordination is always better.

The distributed database needs a coordination system anyway, so it's not an additional point.

> In general you shouldn't need to make a roundtrip to produce an ID.

Did you forget the context over the last week? We're already talking about reserving big chunks to remove the need to make a roundtrip to produce an ID. There would instead be something like one roundtrip per million IDs.
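Concretely, the block-reservation scheme looks something like this toy sketch (all names are made up, and an in-memory counter stands in for the database's `UPDATE ... RETURNING` on a sequence table):

```python
import threading

class BlockIdAllocator:
    """Hand out IDs locally; only when the current block is exhausted
    does it make the one 'roundtrip' to reserve another block."""

    def __init__(self, reserve_block, block_size):
        self._reserve_block = reserve_block  # the single roundtrip per block
        self._block_size = block_size
        self._next = 0
        self._limit = 0  # exclusive end of the currently reserved block
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            if self._next >= self._limit:
                start = self._reserve_block(self._block_size)
                self._next, self._limit = start, start + self._block_size
            self._next += 1
            return self._next - 1

# Stand-in for the database side; `calls` counts actual roundtrips.
calls = {"n": 0}
_hi = {"v": 0}

def reserve_block(n):
    calls["n"] += 1
    start = _hi["v"]
    _hi["v"] += n
    return start

alloc = BlockIdAllocator(reserve_block, block_size=1000)
ids = [alloc.next_id() for _ in range(2500)]  # 2500 IDs, 3 roundtrips
```

With a block size of a million, the roundtrip cost amortizes to effectively nothing; the tradeoff is a gap in the sequence whenever a process dies holding an unused partial block.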


> You just have to deal with the very mild inconvenience of keeping your database synchronized across devices.

For the HN crowd that is likely easy. (I also use that solution.)


>Really the only downside I ever found — at the scale I was operating at — was that it bloats the database backups.

Easily solved by having two separate databases. One for dynamic content, one for static files, which is probably good practice regardless.


> Of course you run into issues with transactions occurring across multiple databases, but these problems are hard but solvable.

The only thing you need to do to fix this is run all the services on the same DB.

> This sounds crazy. I don't know any large companies that have successfully implemented it. This is basically arguing for a giant central database across the entire company. Good luck getting the 300 people necessary into a room and agreeing on a schema.

You don't need every service to use the same schema. You only need transactions that span all services. They can use any data schema they want. A single DB is only used for the ACID guarantees.
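To make that concrete, here's a toy sketch (SQLite standing in for the shared database, hypothetical table names): two "services" own their own tables, but because both live in one database, a single ACID transaction can span them, and a failure rolls back both writes together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Each service defines its own schema; they only share the one database.
conn.execute("CREATE TABLE orders_svc (id INTEGER PRIMARY KEY, total REAL)")
conn.execute("CREATE TABLE billing_svc (order_id INTEGER, charged REAL)")

with conn:  # one transaction across both services' tables
    conn.execute("INSERT INTO orders_svc VALUES (1, 9.99)")
    conn.execute("INSERT INTO billing_svc VALUES (1, 9.99)")

try:
    with conn:
        conn.execute("INSERT INTO orders_svc VALUES (2, 5.00)")
        conn.execute("INSERT INTO broken")  # fails mid-transaction
except sqlite3.Error:
    pass  # both writes in this transaction are rolled back together

orders = conn.execute("SELECT COUNT(*) FROM orders_svc").fetchone()[0]
billing = conn.execute("SELECT COUNT(*) FROM billing_svc").fetchone()[0]
```

The point is the atomicity, not the schema sharing: neither service needs to know the other's tables exist.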


>Few real world scenarios have clients sending queries back-to-back without any gaps in between.

I would have thought so too until I saw our app in production run 1000s of queries in a single domain operation.

To say the least, it's not the highest quality codebase I've ever worked on.


> The distributed database needs a coordination system anyway, so it's not an additional point.

Nope! Distributed databases do not necessarily need a "coordination system" in this sense. Most wide-scale distributed databases actually cannot rely on this kind of coordination.

> Did you forget the context over the last week? We're already talking about reserving big chunks to remove the need to make a roundtrip to produce an ID. There would instead be something like one roundtrip per million IDs.

OK, it's very clear that you're speaking from a context which is a very narrow subset of distributed systems as a whole. That's fine, just please understand your experience isn't broadly representative.


> Use One Big Database.

> Seriously. If you are a backend engineer, nothing is worse than breaking up your data into self contained service databases, where everything is passed over Rest/RPC. Your product asks will consistently want to combine these data sources (they don't know how your distributed databases look, and oftentimes they really do not care).

This works until it doesn't, and then you land in the position my company finds itself in, where our databases can't handle the load we generate. We can't get bigger or faster hardware because we are already using the biggest and fastest hardware you can buy.

Distributed systems suck, sure, and they make querying across systems a nightmare. However, by giving those aspects up, what you gain is the ability to add new services, features, etc. without running into Scotty yelling "She can't take much more of it!"

Once you get to that point, it becomes SUPER hard to start splitting things out. All of a sudden you have 10,000 "just a one off" queries against several domains that break when you try to carve a domain out to a single owner.


> Instead, you need to make separate parallel requests for those datapoints and let the dataloader be in charge of merging them into larger batches of requests for the db to fulfill.

> This can result in some additional latency on a request, but ultimately provides the best way to be able to scale things out.

I promise you, this is not the best way to scale things out.
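For readers unfamiliar with what's being debated, the dataloader pattern looks roughly like this toy sketch (hypothetical names; a dict stands in for the real `SELECT ... WHERE id IN (...)` roundtrip):

```python
class DataLoader:
    """Callers ask for individual keys; the loader coalesces them into
    one batched fetch instead of one query per key."""

    def __init__(self, batch_fetch):
        self._batch_fetch = batch_fetch  # list of keys -> {key: value}
        self._pending = []
        self._cache = {}

    def load(self, key):
        if key not in self._cache and key not in self._pending:
            self._pending.append(key)

    def dispatch(self):
        # One batched roundtrip resolves every queued key at once.
        if self._pending:
            self._cache.update(self._batch_fetch(self._pending))
            self._pending = []

    def get(self, key):
        return self._cache[key]

batches = []

def batch_fetch(keys):
    batches.append(list(keys))  # record each simulated db roundtrip
    return {k: f"user:{k}" for k in keys}

loader = DataLoader(batch_fetch)
for k in (1, 2, 3, 2):  # four load calls, three distinct keys
    loader.load(k)
loader.dispatch()  # a single batched "query"
```

The latency cost the quote mentions comes from holding requests back until `dispatch`; whether that tradeoff scales better than letting the database do the join is exactly the disagreement here.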


> Seems challenging to implement without losing significant performance to the abstraction layer.

Agreed. It seems the tradeoff would only make this worthwhile when you need to optimize for throughput and a lot of the workload is ad hoc. A lot of distributed DBMSs just serve point queries or range scans, where I don't think something like this would be useful.

Troubleshooting production issues here seems challenging too (though that's always the case in distributed systems, so this would be less of a problem).

Still interesting, but it would be more interesting to see the idea get some trial by fire.


> key upside to complex logic in the application layer vs the database is that the application layer is often much easier to scale out than the db

I think the GP's point is that all of these application instances are still connected to the DB, and their suboptimal data fetches tax the database in multiples.


> Use One Big Database.

I emphatically disagree.

I've seen this evolve into tightly coupled microservices that could be deployed independently in theory, but required exquisite coordination to work.

If you want them to be on a single server, that's fine, but having multiple databases or schemas will help enforce separation.

And, if you need one single place for analytics, push changes to that space asynchronously.

Having said that, I've seen silly optimizations being employed that make sense when you are Twitter, and to nobody else. Slice services up to the point they still do something meaningful in terms of the solution and avoid going any further.


> If anything, I think many folks would say “whatever, make the two requests in parallel and zip them together in the application”

That was my first reaction.

Keep in mind that high-load applications have a limited number of database connections. Running queries in parallel can be an anti-pattern in such applications.
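A toy sketch of why (hypothetical names, no real database): each parallel query holds a pool slot, so fanning one request out into N queries multiplies pressure on a fixed-size pool.

```python
import threading

class ConnectionPool:
    """Toy pool capped at max_conns simultaneous checkouts."""

    def __init__(self, max_conns):
        self._sem = threading.BoundedSemaphore(max_conns)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0  # highest number of simultaneously held connections

    def run_query(self, fn):
        self._sem.acquire()  # blocks when the pool is exhausted
        try:
            with self._lock:
                self._active += 1
                self.peak = max(self.peak, self._active)
            return fn()
        finally:
            with self._lock:
                self._active -= 1
            self._sem.release()

pool = ConnectionPool(max_conns=4)

# Eight "parallel" queries from one request contend for four slots;
# the extras simply wait, adding latency instead of throughput.
threads = [
    threading.Thread(target=pool.run_query, args=(lambda: None,))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

One batched query would have held a single slot for the same data, which is the argument for zipping on the database side instead.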


> I was referring to your (seemingly snarky, and I apologize if I mistook your intent) comment that you hoped I wasn't suggesting to a shared database between services.

Apologies, I did write a snarky comment and I'm very sorry that I did that, rather than asking for clarification and/or explaining my position a bit more clearly.

I take the view that sharing an SQS queue (or a topic, or a stream...) between two services is OK. It creates a contract between those two services, of course, and part of that contract might be that there is a uniqueId field that the receiving service is going to have to persist in its own database in order to only action each request once, but it doesn't add an excessive amount of complexity or coupling between the services, not for me at least.

I'm still not OK with having two services share the same database, though, even if you're using it for a narrow purpose. For me, the danger is that you end up introducing more coupling than just agreeing on a message format and messaging semantics. I've been burned by that once or twice and don't want to get burned again, so I'm glad it's widely seen as an anti-pattern these days, and I personally avoid it dogmatically. I do take your point that the database is going to end up being a SPOF/bottleneck, though, and I don't really have a great answer for that in the general case.
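To sketch the uniqueId part of that contract (toy example; SQLite stands in for the receiving service's own database, and the names are made up): the receiver inserts the message ID under a unique constraint inside the same transaction as the work, so a redelivered message becomes a no-op.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed (msg_id TEXT PRIMARY KEY)")

handled = []

def handle(msg_id, body):
    """Action each queue message at most once: the PRIMARY KEY on
    msg_id turns duplicate deliveries into IntegrityErrors."""
    try:
        with conn:
            conn.execute("INSERT INTO processed VALUES (?)", (msg_id,))
            handled.append(body)  # the real work, in the same transaction
    except sqlite3.IntegrityError:
        pass  # duplicate delivery: already actioned, skip

handle("m-1", "charge card")
handle("m-1", "charge card")  # at-least-once redelivery, ignored
handle("m-2", "send email")
```

That's the whole coupling surface: one ID field and the agreement to persist it, which is far narrower than two services reaching into one schema.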


> And by the love of god, don't split your data.

I think people feel that if they introduce a new data store, they won’t have to deal with the existing nearly unusable massive data store.

Of course that just exacerbates the problem.


> My point is instead of doing 100k transactions in your web app, you should look at how to gather them into batches.

This sounds odd to me. Assuming 100k independent web requests, a reasonable web app ought to issue 100k transactions, and the database is the bottleneck. Suggesting that the web app should rework all this into one bulk transaction is ignoring the supposed value-add of the database abstraction and re-implementing it in the web app layer.

And most attempts are probably going to do it poorly. They're going to fail at maintaining the coherence and reliability of the naive solution that leaves this work to the database.

Of course, one ought to design appropriate bulk operations in a web app where there is really one client trying to accomplish batches of things. Then, you can also push the batch semantics up the stack, potentially all the way to the UX. But that's a lot different than going through heroics to merge independent client request streams into batches while telling yourself the database is not actually the bottleneck...
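To illustrate the difference being argued about (toy sketch, SQLite standing in for the real database): per-request commits versus one gathered batch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")

rows = [(i, f"event-{i}") for i in range(1000)]

# Per-request style, what 100k independent web requests naturally
# produce -- one transaction (and one commit/fsync) per row:
#
#   for row in rows:
#       with conn:
#           conn.execute("INSERT INTO events VALUES (?, ?)", row)

# Gathered style -- one transaction, one batched statement:
with conn:
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

The batched form is dramatically cheaper, but as argued above it only falls out naturally when one client genuinely has a batch; merging independent request streams to get here is where the coherence problems start.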


> It makes zero sense to pull a bunch of records back from the db using multiple network calls to join and then filter. Let the db do its job.

But then it wouldn't be distributed processing! :)

Seriously, though, consider it for a moment: this pattern has similar features to something like Hadoop. The data comes from storage nodes (database servers and, hopefully, their read replicas), goes to processing nodes (app servers) to have the work done, and then the new data is written back out over the network to storage nodes and replicated across the network (to the replica/slave database servers).

If the data volume is particularly low or the compute load (CPU and/or RAM) is particularly high, the distributed method would make intuitive sense. I haven't seen it yet, however.


> It doesn't sound like they're really sharing data with each other; it looks like your logic is well linearizable and data-localized, and you can't implement access to some global hashmap that way, for example.

Yes, because data can have thread affinity. Data doesn't need to be shared by _all_ connections, just by a few hundred or thousand. This enables connections to be scheduled to run on the same thread so that they can share data without synchronization.
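A toy sketch of that affinity idea (all names hypothetical): route every connection group to a fixed worker, so each worker's state is only ever touched by its own thread and needs no locks.

```python
import queue
import threading

NUM_WORKERS = 4
queues = [queue.Queue() for _ in range(NUM_WORKERS)]
# Per-worker state: touched only by its own thread, so no locks needed.
state = [{"count": 0} for _ in range(NUM_WORKERS)]

def submit(group_id, task):
    # Same group -> same worker, so tasks within a group share state
    # without any synchronization.
    queues[group_id % NUM_WORKERS].put(task)

def worker(i):
    while True:
        task = queues[i].get()
        if task is None:  # shutdown sentinel
            break
        task(state[i])

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()

# Groups 0, 1, 0, 5, 1 map to workers 0, 1, 0, 1, 1.
for group in (0, 1, 0, 5, 1):
    submit(group, lambda s: s.__setitem__("count", s["count"] + 1))

for q in queues:
    q.put(None)
for t in threads:
    t.join()

total = sum(s["count"] for s in state)
```

Real event-loop runtimes do this with sharded reactors rather than queues, but the invariant is the same: one writer per piece of state.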

> I run this (10k threads blocked by DB access) in prod and it works fine for my needs. There are lots of statements on the internet about overhead, but not many benchmarks of how large this overhead is.

The underlying problem is old and well researched: https://en.wikipedia.org/wiki/C10k_problem


> Yes, we CAN do that, but why do that when it's 15 lines of SQL and 5 seconds of compute, instead of a new microservice or whatever and some minutes of compute?

This works until your database falls over in production. Recently someone started appending to a JSON field over and over in our production database. Then, on some queries, Postgres crashed due to lack of memory. The fix was to remove the field and the code that constantly appended to it, and do something else.

No, the database should not be the answer to all your data problems. Yes, a microservice may sometimes be the best answer. But for structured data and queries that can run within normal amounts of memory, the standard SQL DB is fine.


> why not use different databases? They cost nothing and provide perfect separation.

I understand the sentiment, but this is a pretty simplistic take that I very much doubt will hold true for meaningful traffic. Many databases have licensing considerations that aren't amenable to it. Beyond that, you get into density and resource problems as simple as IO, processes, threads, etc. But most of all, there's the time and effort burden of supporting migrations, schema updates, etc.

Yes, layered logical separation is a really good idea. It's also really expensive once you start dealing with organic growth and a meaningful number of discrete customers.

Disclaimer: Principal at AWS who has helped build and run services with both multi-tenant and single-tenant architectures.

