Even stranger is their conclusion that key/value is certainly slower than the full-fledged SQL database. It just can't be. Somebody doesn't understand something basic, and if it's me, please make me see.
>I mean, surely a relational db is ill-suited to something like that, right?
Why would one assume that? Relational dbs have been optimized for 30+ years to be as fast as possible for exactly that, even with hundreds of millions of rows -- 8 million rows is nothing.
If you don't need complex JOINs and have indexes, they can do all that faster than the NoSQL flavor of the day (and even with JOINs they tend to be on par).
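To make that concrete, here is a minimal sketch in PostgreSQL syntax (the articles table and its columns are made up): a primary key already comes with a B-tree index, so a point lookup is a single index probe.

    -- hypothetical table; the primary key is backed by a B-tree index
    CREATE TABLE articles (
        id    bigint PRIMARY KEY,
        title text NOT NULL
    );

    -- a point lookup is one index probe, microseconds once the cache is warm
    SELECT 1 FROM articles WHERE id = 123456;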
> but the worry here is with how sparse the id's might be, and the fact it might have to occasionally retry 20 times -- unpredictable performance characteristics like that aren't ideal.
Is this really a big deal? For a SQL database, checking the existence of a few IDs is just about the simplest possible set operation, measured in fractions of a millisecond. Wikipedia's ID space is not that sparse, and looking up a few candidate IDs per call would easily bound the worst case at <10 calls and still just a millisecond or two.
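For concreteness, a sketch of that lookup in PostgreSQL (reusing the hypothetical articles table from above): you hand the database a batch of candidate IDs and it tells you which ones exist, all in one round trip.

    -- one round trip: only the candidate IDs that actually exist come back
    SELECT id
    FROM articles
    WHERE id = ANY (ARRAY[17, 4242, 90210, 123456, 7777777]);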
> It's definitely doable, but you'll need to heavily fine-tune your queries
Misrepresentation; then you're not actually querying over 120 million rows. You're basically encoding which subset to actually search into the query, rather than building a proper overall schema that trivializes the queries.
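What "building it into the schema" could look like, sketched in PostgreSQL with made-up table and column names: a partial index declares the hot subset once, instead of every query having to spell it out and hope it stays in sync.

    -- declare the frequently-searched subset once, in the schema
    CREATE INDEX articles_active_idx
        ON articles (created_at)
        WHERE status = 'active';

    -- queries over that subset can now be served by the small partial index
    -- without any per-query tuning
    SELECT id
    FROM articles
    WHERE status = 'active'
      AND created_at > now() - interval '7 days';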
> the main table at work is over 500 columns wide...
This is horrifying, but at least not as horrifying as the public sector database I once had to work with that predated proper support for foreign keys, so there was a relationship table that was about 4 columns wide and must have had more rows than the rest of the database combined.
Even the database they had moved this schema to struggled with that many join operations.
> A key/value database doesn't really want to know about an application's type
Relational DBs have for decades.
> Because that complicates storage algorithms because sorting is now much more complex
Leveraging the types to build better indices is huge. Different data admits different total orders / partial orders / lattices / other mathematical structure. If you generalize the math to infinitely large tables, this is not merely a constant-factor difference; it is the difference between certain queries existing or not existing. One might say the limit of algorithmic complexity is realizability: https://ncatlab.org/nlab/show/realizability
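A concrete illustration (PostgreSQL, with hypothetical events and bookings tables): because the database knows the column types, it can pick index structures that exploit their ordering.

    -- a B-tree exploits the total order of timestamps, so range scans are cheap
    CREATE INDEX events_at_idx ON events (occurred_at);
    SELECT count(*) FROM events
    WHERE occurred_at BETWEEN '2024-01-01' AND '2024-02-01';

    -- a GiST index exploits the overlap/containment structure of range types,
    -- something a type-blind key/value store has no way to offer
    CREATE INDEX bookings_during_idx ON bookings USING gist (during);
    SELECT * FROM bookings
    WHERE during && tsrange('2024-01-01', '2024-01-02');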
> There's nothing you can do in an index for a table with 350 rows
I meant if you have 100s of indices -- it might be slow even on small tables.
I actually had production issues from using a lot of indices, but it's not apples-to-apples with the current discussion, because the table sizes and update rates were much larger and the DBMS was different; it was fixed by removing indices and splitting the table into multiple tables.
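If anyone ends up in a similar spot: PostgreSQL at least tracks how often each index is actually used, which makes finding the dead weight fairly mechanical (other DBMSs have their own equivalents).

    -- indexes that are never scanned are pure write overhead;
    -- they're the first candidates for removal
    SELECT schemaname, relname, indexrelname, idx_scan
    FROM pg_stat_user_indexes
    ORDER BY idx_scan ASC;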
"Now, retrieving 48 individual records one by one is sort of silly, becase you could easily construct a single where Id in (1,2,3..,47,48) query that would grab all 48 posts in one go. But even if we did it in this naive way, the total execution time is still a very reasonable (48 * 3 ms) + 260 ms = 404 ms. That is half the time of the standard select-star SQL emitted by LINQ to SQL!"
His "naive way" is the 48 individual SELECTs, and he's claiming that that's faster than one SELECT (star).
His point about SQL vs. LINQ may or may not be valid --- I don't know since I don't use LINQ --- but his argument and measurements are certainly flawed.
The nice thing about software such as a database is that, if you spend enough time, it's not some magical black box. If you read some books and papers, or go to school, or read some source code, then you can actually understand what goes on "under the hood" and you don't have to resort to voodoo measurements. So, if you know something about RDBMSs, you know that once a query retrieves some rows from disk they're cached, so any queries issued shortly after will be served from memory. This is probably what happened here, since 3ms is less than the disks' seek time, not counting networking and processing overhead...
If you're absolutely unwilling to learn something about how a piece of software works and feel compelled to measure it instead, then at least produce meaningful measurements, including averages, medians, deviations, plots...
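For example, PostgreSQL's plan output will tell you outright whether the rows came from the buffer cache or from disk (the table name and IDs here just mirror the quoted example; SQL Server has SET STATISTICS IO for the same purpose).

    -- "shared hit" blocks came from the buffer cache, "read" blocks hit the disk;
    -- run it twice and watch the reads turn into hits
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM posts WHERE id IN (1, 2, 3, 47, 48);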
As a final note, as a developer whose startup is writing a soon-to-be-released high-performance replicated key-value store: if their site is such that performance matters, then these are exactly the kind of queries that could be offloaded to a key-value store or even memcached and recomputed asynchronously in the background every minute.
>The author effectively wastes many words trying to prove a non-existent performance difference and then concludes "there is not much performance difference between the two types".
They then also show that there is in fact a significant performance difference when you need to migrate your schema to accommodate a change in the length of the strings being stored. Altering a table to change a column from varchar(300) to varchar(200) needs to rewrite every single row, whereas updating the constraint on a text column is essentially free: just a full table scan to ensure that the existing values satisfy the new constraint.
FTA:
>So, as you can see, the text type with CHECK constraint allows you to evolve the schema easily compared to character varying or varchar(n) when you have length checks.
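The difference in migration cost is easy to see side by side (PostgreSQL, with a hypothetical users table and constraint name):

    -- varchar(n): changing the limit means re-checking (and possibly rewriting)
    -- every row while holding a lock on the table
    ALTER TABLE users ALTER COLUMN name TYPE varchar(200);

    -- text + CHECK: the data is never touched; you just swap the constraint,
    -- and the new one only needs a validating scan
    ALTER TABLE users DROP CONSTRAINT users_name_len;
    ALTER TABLE users ADD CONSTRAINT users_name_len CHECK (length(name) <= 200);

And if even the validating scan is too disruptive, the new CHECK constraint can be added NOT VALID and validated later with VALIDATE CONSTRAINT, an option the varchar route doesn't give you.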
> -- 8 bytes gives a collision p = .5 after 5.1 x 10^9 values
So yeah, a fifty-fifty chance of a collision after only five billion values. You’re at 10% chance before even two billion, and 1% after 609 million. I wouldn’t care to play this random game with even a million keys, the 64-bit key space is just not large enough to pick IDs at random. UUIDs are 128 bits; that’s large enough that you can reasonably assume in most fields that no collisions will occur.
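Those numbers fall straight out of the standard birthday approximation, p ≈ 1 - exp(-n^2 / 2N) with N = 2^64; you can sanity-check them from a PostgreSQL prompt:

    -- birthday approximation for random 64-bit IDs: p = 1 - exp(-n^2 / (2 * 2^64))
    SELECT n::bigint AS ids_generated,
           round(1 - exp(-(n * n) / (2 * 18446744073709551616.0)), 3) AS collision_p
    FROM (VALUES (609e6), (1.97e9), (5.1e9)) AS t(n);
    -- roughly 0.01, 0.10 and 0.50 respectively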
Storing a string is also inefficient, wasting more than three, and probably eight or more bytes (I’m not certain of the implementation details in PostgreSQL), growing index sizes and making comparisons slower. It’s more efficient to store it as a number and convert to and from the string form only when needed.
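A quick way to see the overhead, assuming PostgreSQL (pg_column_size reports the bytes used to store a value; the exact header overhead varies, hence the hedging above):

    -- the same 64-bit value stored natively vs. as its decimal string:
    -- bigint is a fixed 8 bytes, the text form is 19 digits plus a length header
    SELECT pg_column_size(1234567890123456789::bigint) AS as_bigint,
           pg_column_size('1234567890123456789'::text) AS as_text;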
If the post was “the incredible difficulty of inserting 120k rows in database” you’d have a point. But it isn’t.