
> A better example is frequency capping. Ever watch something on Hulu and see the same ad 4 times in a twenty-minute commercial? Or, even worse, back to back?

Yeah, but when that happens I usually don't think, oh hey, they're lacking an optimal in-memory distributed database solution.

I think, well... their engineers suck. Or they don't care. Pick one.

edit: His point is vague, so there is nothing technical to respond to. I am very much interested in a good technical example - but the things mentioned so far are by all appearances relatively straightforward and linear, hence lack of effort or bad engineering are the only reasonable assumptions left.
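
To illustrate why I call this straightforward: a frequency cap is basically a counter with a time window, keyed on (viewer, ad). Here is a minimal single-process sketch in Python - every name and the 4-impressions-per-20-minutes cap are invented, and a real ad stack would keep these counters in a shared store so every serving node sees the same counts:

    import time
    from collections import defaultdict

    class FrequencyCapper:
        """Counts impressions per (viewer, ad) within a rolling window."""

        def __init__(self, max_impressions=4, window_seconds=20 * 60):
            self.max_impressions = max_impressions
            self.window_seconds = window_seconds
            # (viewer_id, ad_id) -> timestamps of recent impressions
            self._impressions = defaultdict(list)

        def can_show(self, viewer_id, ad_id, now=None):
            now = time.time() if now is None else now
            key = (viewer_id, ad_id)
            # Drop impressions that have aged out of the window.
            self._impressions[key] = [t for t in self._impressions[key]
                                      if now - t < self.window_seconds]
            return len(self._impressions[key]) < self.max_impressions

        def record(self, viewer_id, ad_id, now=None):
            now = time.time() if now is None else now
            self._impressions[(viewer_id, ad_id)].append(now)

    capper = FrequencyCapper()
    for slot in range(6):
        if capper.can_show("viewer-1", "ad-42"):
            capper.record("viewer-1", "ad-42")
            print(f"slot {slot}: show ad-42")
        else:
            print(f"slot {slot}: capped, rotate to a different ad")

The distributed part is just where those counters live, not some exotic algorithm.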




> The filesystem is a shitty database. Thought we’d all learned that by now.

How dare you :) "The" filesystem is a great database... for certain applications. Big binary blobs of video store, and even index, particularly well!

I think what the collective "we" haven't learned is to stop reasoning about scale intuitively (rather than "doing the math") and to stop extrapolating from the trivial scenario.

It's why "latency numbers every programmer should know" is still a thing.


> When you want distributed-db, yeah you need to design

But one shouldn't be designing for implementation details. Usually, we start technologies out with leaky abstractions and gradually get better at them. A good example is game development, where it used to be that you always used the drawing method of the display and the clock speed of the CPU to your advantage. Nowadays we've moved past that, because it relied on horrible abstractions, and because the technology underneath improved.

I'm not saying I have a solution, and I agree that this problem rears its head the most when you start bringing in distributed storage. But my point stands: these databases run on a highly leaky abstraction, and that's a big problem going forward.


>>What exactly would cause them to never deliver

Concurrency issues, latency issues, poor database schemas, scaling issues - there are a lot of reasons internet-scale things can grind to a halt.


> As a result, primary databases (e.g. MySQL, Mongo etc.) almost never work

I mean, they do. As far as I'm aware, Facebook's ad platform is mostly backed by hundreds of thousands of MySQL instances.

But more importantly, this post really doesn't describe issues of scale.

Sure, it lists the stages of recommendation, which may or may not be correct, but it doesn't describe how all of those processes are scheduled, coordinated, and communicate.

Stuff at scale is normally a result of tradeoffs: sure, you can use an ML model to increase a retention metric by 5%, but it costs an extra 350ms to generate and will quadruple the load on the backend during certain events.

What about the message passing? Like, is that one monolith making the recommendation (cuts down on latency, kids!) or microservices? What happens if the message doesn't arrive - do you have a retry? What have you done to stop retry storms?

Did you bound your queue properly?
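
To make the retry-storm and bounded-queue points concrete, a minimal Python sketch - the transport, limits, and timings are all hypothetical:

    import queue
    import random
    import time

    # A bounded queue: producers that outrun consumers get back-pressure
    # (queue.Full) instead of silently growing memory without limit.
    outbox = queue.Queue(maxsize=1000)

    def send_with_retry(send, message, max_attempts=5, base_delay=0.1, cap=5.0):
        """Retry with exponential backoff plus full jitter.

        The jitter is what prevents retry storms: failed clients spread
        their retries out instead of hammering the service in lockstep.
        """
        for attempt in range(max_attempts):
            try:
                return send(message)
            except ConnectionError:
                if attempt == max_attempts - 1:
                    raise
                time.sleep(random.uniform(0, min(cap, base_delay * 2 ** attempt)))

    def flaky_send(message):
        # Stand-in for a real transport that sometimes drops messages.
        if random.random() < 0.5:
            raise ConnectionError("message did not arrive")
        return f"delivered: {message}"

    try:
        outbox.put("recommendation-event", block=False)
    except queue.Full:
        pass  # shed load (or block) instead of queueing without bound

    try:
        print(send_with_retry(flaky_send, outbox.get()))
    except ConnectionError:
        print("gave up after retries; the caller decides what happens next")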

None of this is covered, and, my friends, that is 90% of the "architecture at scale" that matters.

Normally stuff at scale is "no clever shit," followed by "fine, you can have that clever shit, just document it clearly - oh, you've left," which descends into "god, this is scary and exotic," finally leading to "let's spend half a billion making a new one with all the same mistakes."


> Putting your filesystem on a ramdisk is a good idea but it's hardly innovative.

I am way late to the party, but here goes: In 2007 I started working for a company in Norway that had developed their own database, from the hardware up. It included parallel processing (a few thousand processors when I joined) and everything hosted in memory. It was fast. Boot-up took a while, though, since it had to load everything from disk into memory.

When I joined they were already on the third iteration; the first version went live back in 1992.

We have since retired the concept, as off-the-shelf hardware caught up in speed at lower cost, and it didn't scale very well anyway.


> assuming some beastly server with terabytes of ram, hundreds of fast cores, and an exotic io subsystem capable of ridiculous amounts of low latency iops, I'd guess the perf issue with that example is not sql server struggling with load but rather lock contention from the users table being heavily updated.

You'd guess wrong. The example above is not the only query our server runs. It's an example of some of the queries that can be run. We have a VERY complex relationship graph, far more complex than what you'll typically find. This is finance, after all.

I used the user example for something relatable without getting into the weeds of the domain.

We are particularly read-heavy and write-light. The issue is quite literally that we have too many applications doing too many reads. We are literally running into problems where our tempdb can't keep up with the requests because there are too many of them doing work that is too complex.

You are assuming we can just partition a table here or there and everything will work swimmingly; that's simply not the case. Our tables do not partition so easily. (Perhaps our users table would, but again, that was for illustrative purposes and is by no means the most complex example.)

Do you think that such a simple solution hasn't been explored by a team of 50 DBAs? Or that this sort of obvious problem wouldn't have been immediately fixed?


> I’m in the data space and a company like Snowflake has a model where you pay for compute by the second and storage by the byte. Very simple, transparent and everyone is aligned.

Not sure everyone is aligned. It sounds like Snowflake has no incentive to optimize queries - if anything, the incentive runs the other way. To keep earning the same amount every month, they'd have to leave their infrastructure as-is, with no optimization of compute time or storage.
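
To put rough numbers on that misalignment - the rate, cadence, and runtimes below are invented for illustration, not Snowflake's actual pricing:

    # Hypothetical pay-per-second compute pricing.
    PRICE_PER_COMPUTE_SECOND = 0.01   # invented rate, in dollars
    RUNS_PER_MONTH = 30 * 24 * 4      # an invented every-15-minutes job

    def monthly_bill(seconds_per_run):
        return seconds_per_run * RUNS_PER_MONTH * PRICE_PER_COMPUTE_SECOND

    before = monthly_bill(seconds_per_run=40)   # unoptimized query
    after = monthly_bill(seconds_per_run=8)     # same query after tuning

    print(f"customer bill before: ${before:,.2f}/month")
    print(f"customer bill after:  ${after:,.2f}/month")
    print(f"vendor revenue lost by helping optimize: ${before - after:,.2f}/month")

Every second the engine shaves off a query is a second nobody gets billed for.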


> 90% of our bill is database related currently.

Huh. With the context of the rest of the comment, I realize (it's a very obvious comparison) that a database engine designed to shard across many thousands of small workers could potentially be a very attractive future development path.

Iff the current trends in cloud computing (workers, lambda, etc.) continue and some other fundamental shift doesn't come along and overtake them.

Which is probably (part of) the reason why this doesn't exist, since I think I've basically just described the P=NP of storage engineering :)


> maybe the problems I've been working on aren't as hard as they seem.

Most likely this. When you have a horizontally sharded MySQL database that spans multiple datacenters, managing replication, failover, and scaling out is not something you would expect most developers to be able to handle.

Let alone fine-tuning various kernel parameters, in-depth monitoring, and other performance tuning.

EDIT: Sure, most experienced developers would have a high-level, abstract idea about the various things that may be outta whack - but it takes someone with expertise in the field to go beyond guesses.
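
For a flavor of what even the "easy" part of that involves, a minimal sketch of read/write routing with replica fallback - the connection factory, hostnames, and health check here are placeholders, not any real MySQL tooling:

    import random

    class ReadWriteRouter:
        """Sends writes to the primary, spreads reads across healthy replicas.

        Illustrative only: real setups also need replication-lag checks,
        automated failover of the primary itself, and connection pooling.
        """

        def __init__(self, connect, primary, replicas):
            self._connect = connect      # callable: host -> connection
            self._primary = primary
            self._replicas = list(replicas)

        def for_write(self):
            return self._connect(self._primary)

        def for_read(self):
            candidates = self._replicas[:]
            random.shuffle(candidates)
            for host in candidates:
                try:
                    return self._connect(host)   # "health check" = try to connect
                except ConnectionError:
                    continue                     # dead replica: try the next one
            return self._connect(self._primary)  # last resort: read from the primary

    def fake_connect(host):
        # Hypothetical stand-in for a real MySQL driver.
        if host == "replica-2.dc1":
            raise ConnectionError("replica down")
        return f"connection to {host}"

    router = ReadWriteRouter(fake_connect,
                             primary="primary.dc1",
                             replicas=["replica-1.dc1", "replica-2.dc1", "replica-3.dc2"])
    print(router.for_read())
    print(router.for_write())

And that still ignores replication lag, failover of the primary itself, and resharding - the parts that actually need an expert.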


>>IBM and Oracle for years and I can tell you, that kind of behavior is exponentially worse

These dynamics are common in any system where you have a pyramid structure and resources are limited.


> I also realized that storage's performance had very little to do with the benchmark's scores as in order to game the benchmark one would simply add more memory.

Reminds me of the time [1] Fusion-IO showed how good their drives were for SQL Server by running benchmarks with just 16GB of RAM enabled, even though the servers had 384GB RAM.

Under direct questioning, they admitted that if you just used the rest of the RAM in the server, their drives didn't have any effect, and that's why they limited the server so badly for the presentation. (sigh)

[1] https://ozar.me/2014/06/fact-checking-24-hours-of-pass/


>What is the actual bottleneck?

As mentioned in the OP, it is MySQL.

>you can't run a massive database on crappy hardware and expect it to work smoothly.

Absolutely... but we do wish to milk the VPS for every cent it's worth, and we're not sure we've got there yet.

Thanks for your input.


>> I am building an application that has 2 million rows of data.

It would also be helpful to know what kind of hardware you are using for the database. How much RAM does the machine have? Does the whole dataset fit in RAM? If not, at least the indexes should fit in RAM.

Disk type: HDD/SSD?

Are you using indexes?

Did you check "explain analyze"[1] to check your query plans to ensure that the indexes are being used or not?

There are many things that affect database performance.
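
If it helps, here is a runnable toy version of that plan-checking workflow. SQLite is used only because it ships with Python, and the table and query are invented - in MySQL or PostgreSQL you would use EXPLAIN / EXPLAIN ANALYZE the same way:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, payload TEXT)")
    conn.executemany("INSERT INTO events (user_id, payload) VALUES (?, ?)",
                     [(i % 1000, "x") for i in range(10_000)])

    def show_plan(query):
        for row in conn.execute("EXPLAIN QUERY PLAN " + query):
            print(row[-1])   # the last column is the human-readable plan step

    show_plan("SELECT * FROM events WHERE user_id = 42")   # full table SCAN

    conn.execute("CREATE INDEX idx_events_user_id ON events (user_id)")
    show_plan("SELECT * FROM events WHERE user_id = 42")   # SEARCH ... USING INDEX

The whole point is that one missing index turns a scan of every row into a direct lookup.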

>> I want to build something that involves billions of rows of data. I want to know how to speed it up.

The key to this is a "shared-nothing architecture" along with a sharded database. I have tried to explain this architecture here [2]. To understand database sharding, I would recommend this DigitalOcean article [3] (and see the small routing sketch after the links below).

To learn more about highly scalable architectures, I would suggest reading the real-world architectures section [4] of the High Scalability blog.

[1] https://thoughtbot.com/blog/reading-an-explain-analyze-query...

[2] https://mobisoftinfotech.com/resources/mguide/shared-nothing...

[3] https://www.digitalocean.com/community/tutorials/understandi...

[4] http://highscalability.com/blog/category/example
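
And a minimal sketch of the routing idea behind sharding mentioned above - hash-based, with placeholder shard addresses and key; real systems add consistent hashing or directory lookups, rebalancing, and cross-shard queries on top:

    import hashlib

    # Hypothetical shard addresses; in a shared-nothing setup each shard is an
    # independent database owning a disjoint slice of the rows.
    SHARDS = [
        "mysql://db-shard-0.internal/app",
        "mysql://db-shard-1.internal/app",
        "mysql://db-shard-2.internal/app",
        "mysql://db-shard-3.internal/app",
    ]

    def shard_for(user_id: str) -> str:
        """Deterministically map a sharding key to one shard.

        Hashing (rather than taking user_id modulo the shard count directly)
        keeps the distribution even when keys are not uniform integers.
        """
        digest = hashlib.sha1(user_id.encode()).hexdigest()
        return SHARDS[int(digest, 16) % len(SHARDS)]

    for uid in ["alice", "bob", "carol", "dave"]:
        print(uid, "->", shard_for(uid))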


> To me, spinning up multiple copies of the database is cheating.

What if the database was designed to be run that way?

> You're comparing a box of Apples to a single apple.

Precisely. Dragonfly is a box of apples. Redis is a single apple that can be put in a box with other apples. If you run a "benchmark" comparing your box of apples against a sole apple, you're being either stupid or dishonest.


> But for many people databases have become commodity applications that they install and then forget.

It seems unlikely to me that this group of people would reach benchmark posts.


> how often would this kind of malprogramming actually lead to deal-breaking performance issues in a project?

Depends very much on your projects. Web developers generally have it easy, as it's cheap to add more servers and free to burn CPU in the browser. Even so, you can get into trouble with anything that's O(N) in the number of users. I do wonder how much use of NoSQL is from people who think SQL is slow because they've not got their indexes set up properly.

Game developers, of course, live and breathe performance. As do embedded and similar low-level environments. Or people working with this trendy "big" data.


> you are just asking for a hard time debugging, troubleshooting, and increasing the chance that you will fuck up the most valuable part of your system, your data.

This sounds like a tooling problem. One could imagine a database that doesn't have these issues.

> It also means you have a single point of failure, no read-replicas or redundancy.

What? Why would any of this be the case?


> This is a case where the db server should use the entire resources of a single server

They have thousands of clusters. They didn't design/architect anything.

They're likely just trying to consolidate databases because they are heavily underutilized and no one knows WTF they are running. And the organization will keep growing like that, adding new databases every day.


> Instead, we should be using game-style "data oriented programming" a.k.a. "column databases" for a much higher performance.

This makes logical sense, but I don’t buy it in practice. Most of the heavy data reads are handled by databases, which do optimize for this stuff. I just doubt that, in most software, a significant number of performance issues are the result of poor memory alignment of data structures.
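
For what "data oriented" means in practice, a toy illustration of row versus column layout in pure Python - it only shows the shape of the idea, since the measurable wins come from cache behavior and vectorization in lower-level languages or inside the database engine itself:

    import time

    N = 500_000

    # Row-oriented: one record object per entity (typical application code).
    rows = [{"user_id": i, "age": i % 90, "score": i * 0.5} for i in range(N)]

    # Column-oriented / data-oriented: one array per field.
    ages = [i % 90 for i in range(N)]

    def average_age_rows():
        return sum(r["age"] for r in rows) / N

    def average_age_column():
        return sum(ages) / N

    for name, fn in [("row-oriented", average_age_rows),
                     ("column-oriented", average_age_column)]:
        start = time.perf_counter()
        fn()
        print(f"{name}: {time.perf_counter() - start:.4f}s")

Which is part of why I doubt it: the places where that layout matters are mostly already inside the database engine, not in application code.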

