Even stranger is their conclusion that key/value is certainly slower than the full-fledged SQL database. It just can't be. Somebody doesn't understand something basic, and if it's me, please make me see.
>I mean, surely a relational db is ill-suited to something like that, right?
Why would one assume that? Relational dbs have been optimized for 30+ years to be as fast as possible for exactly that, even with hundreds of millions of rows -- 8 million rows is nothing.
If you don't need complex JOINs and have indexes, they can do all that faster than the NoSQL flavor of the day (and even with JOINs they tend to be on par).
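To make that concrete, here is a minimal sketch in PostgreSQL syntax (the articles table and its columns are made up): a primary key already comes with a B-tree index, so a point lookup is a single index probe.

    -- hypothetical table; the primary key is backed by a B-tree index
    CREATE TABLE articles (
        id    bigint PRIMARY KEY,
        title text NOT NULL
    );

    -- a point lookup is one index probe, microseconds once the cache is warm
    SELECT 1 FROM articles WHERE id = 123456;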
> but the worry here is with how sparse the id's might be, and the fact it might have to occasionally retry 20 times -- unpredictable performance characteristics like that aren't ideal.
Is this really a big deal? For a SQL database, checking the existence of a few IDs is just about the simplest possible set operation, measured in fractions of a millisecond. Wikipedia's ID space is not that sparse, and looking up a few candidate IDs per call would easily bound the worst case at <10 calls and still just a millisecond or two.
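For concreteness, a sketch of that lookup in PostgreSQL (reusing the hypothetical articles table from above): you hand the database a batch of candidate IDs and it tells you which ones exist, all in one round trip.

    -- one round trip: only the candidate IDs that actually exist come back
    SELECT id
    FROM articles
    WHERE id = ANY (ARRAY[17, 4242, 90210, 123456, 7777777]);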
> It's definitely doable, but you'll need to heavily fine-tune your queries
Misrepresentation; then you're not actually querying over 120 million rows. You're basically encoding which subset to actually search into the query, rather than building a proper overall schema that trivializes the queries.
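What "building it into the schema" could look like, sketched in PostgreSQL with made-up table and column names: a partial index declares the hot subset once, instead of every query having to spell it out and hope it stays in sync.

    -- declare the frequently-searched subset once, in the schema
    CREATE INDEX articles_active_idx
        ON articles (created_at)
        WHERE status = 'active';

    -- queries over that subset can now be served by the small partial index
    -- without any per-query tuning
    SELECT id
    FROM articles
    WHERE status = 'active'
      AND created_at > now() - interval '7 days';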
> the main table at work is over 500 columns wide...
This is horrifying, but at least not as horrifying as the public sector database I once had to work with that predated proper support for foreign keys, so there was a relationship table that was about 4 columns wide and must have had more rows than the rest of the database combined.
Even the database they had moved this schema to struggled with that many join operations.
> A key/value database doesn't really want to know about an application's type
Relational DBs have for decades.
> Because that complicates storage algorithms because sorting is now much more complex
Leveraging the types to build better indices is huge. Different data admits different total orders / partial orders / lattices / other mathematical structure. If you generalize the math to infinitely large tables, this is not merely a constant-factor difference; it is the difference between certain queries existing or not existing. One might say the limit of algorithmic complexity is realizability: https://ncatlab.org/nlab/show/realizability
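A concrete illustration (PostgreSQL, with hypothetical events and bookings tables): because the database knows the column types, it can pick index structures that exploit their ordering.

    -- a B-tree exploits the total order of timestamps, so range scans are cheap
    CREATE INDEX events_at_idx ON events (occurred_at);
    SELECT count(*) FROM events
    WHERE occurred_at BETWEEN '2024-01-01' AND '2024-02-01';

    -- a GiST index exploits the overlap/containment structure of range types,
    -- something a type-blind key/value store has no way to offer
    CREATE INDEX bookings_during_idx ON bookings USING gist (during);
    SELECT * FROM bookings
    WHERE during && tsrange('2024-01-01', '2024-01-02');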
> There's nothing you can do in an index for a table with 350 rows
I meant if you have 100s of indices -- it might be slow even on small tables.
I actually had production issues from using a lot of indices, but it's not apples-to-apples with the current discussion, because the table sizes and update rates were much larger and the DBMS was different; it was fixed by removing indices and splitting the table into multiple tables.
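If anyone ends up in a similar spot: PostgreSQL at least tracks how often each index is actually used, which makes finding the dead weight fairly mechanical (other DBMSs have their own equivalents).

    -- indexes that are never scanned are pure write overhead;
    -- they're the first candidates for removal
    SELECT schemaname, relname, indexrelname, idx_scan
    FROM pg_stat_user_indexes
    ORDER BY idx_scan ASC;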
"Now, retrieving 48 individual records one by one is sort of silly, becase you could easily construct a single where Id in (1,2,3..,47,48) query that would grab all 48 posts in one go. But even if we did it in this naive way, the total execution time is still a very reasonable (48 * 3 ms) + 260 ms = 404 ms. That is half the time of the standard select-star SQL emitted by LINQ to SQL!"
His "naive way" is the 48 individual SELECTs, and he's claiming that that's faster than one SELECT (star).
His point about SQL vs. LINQ may or may not be valid --- I don't know since I don't use LINQ --- but his argument and measurements are certainly flawed.
The nice thing about software such as a database is that, if you spend enough time, it's not some magical black box. If you read some books and papers, or go to school, or read some source code, then you can actually understand what goes on "under the hood" and you don't have to resort to voodoo measurements. So, if you know something about RDBMSs, you know that once a query retrieves some rows from disk they're cached, so any queries issued shortly after will be served from memory. This is probably what happened here, since 3ms is less than the disks' seek time, not counting networking and processing overhead...
If you're absolutely unwilling to learn something about how a piece of software works and feel compelled to measure it instead, then at least produce meaningful measurements, including averages, medians, deviations, plots...
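For example, PostgreSQL's plan output will tell you outright whether the rows came from the buffer cache or from disk (the table name and IDs here just mirror the quoted example; SQL Server has SET STATISTICS IO for the same purpose).

    -- "shared hit" blocks came from the buffer cache, "read" blocks hit the disk;
    -- run it twice and watch the reads turn into hits
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM posts WHERE id IN (1, 2, 3, 47, 48);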
As a final note, as a developer whose startup is writing a soon-to-be-released high-performance replicated key-value store: if their site is such that performance matters, then these are exactly the kind of queries that could be offloaded to a key-value store or even memcached and recomputed asynchronously in the background every minute.
>The author effectively wastes many words trying to prove a non-existent performance difference and then concludes "there is not much performance difference between the two types".
They then also show that there is in fact a significant performance difference when you need to migrate your schema to accommodate a change in the length of the strings being stored. Altering a table to change a column from varchar(300) to varchar(200) needs to rewrite every single row, whereas updating the constraint on a text column is essentially free: just a full table scan to ensure that the existing values satisfy the new constraint.
FTA:
>So, as you can see, the text type with CHECK constraint allows you to evolve the schema easily compared to character varying or varchar(n) when you have length checks.
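The difference in migration cost is easy to see side by side (PostgreSQL, with a hypothetical users table and constraint name):

    -- varchar(n): changing the limit means re-checking (and possibly rewriting)
    -- every row while holding a lock on the table
    ALTER TABLE users ALTER COLUMN name TYPE varchar(200);

    -- text + CHECK: the data is never touched; you just swap the constraint,
    -- and the new one only needs a validating scan
    ALTER TABLE users DROP CONSTRAINT users_name_len;
    ALTER TABLE users ADD CONSTRAINT users_name_len CHECK (length(name) <= 200);

And if even the validating scan is too disruptive, the new CHECK constraint can be added NOT VALID and validated later with VALIDATE CONSTRAINT, an option the varchar route doesn't give you.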
> -- 8 bytes gives a collision p = .5 after 5.1 x 10^9 values
So yeah, a fifty-fifty chance of a collision after only five billion values. You’re at 10% chance before even two billion, and 1% after 609 million. I wouldn’t care to play this random game with even a million keys, the 64-bit key space is just not large enough to pick IDs at random. UUIDs are 128 bits; that’s large enough that you can reasonably assume in most fields that no collisions will occur.
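Those numbers fall straight out of the standard birthday approximation, p ≈ 1 - exp(-n^2 / 2N) with N = 2^64; you can sanity-check them from a PostgreSQL prompt:

    -- birthday approximation for random 64-bit IDs: p = 1 - exp(-n^2 / (2 * 2^64))
    SELECT n::bigint AS ids_generated,
           round(1 - exp(-(n * n) / (2 * 18446744073709551616.0)), 3) AS collision_p
    FROM (VALUES (609e6), (1.97e9), (5.1e9)) AS t(n);
    -- roughly 0.01, 0.10 and 0.50 respectively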
Storing a string is also inefficient, wasting more than three, and probably eight or more bytes (I’m not certain of the implementation details in PostgreSQL), growing index sizes and making comparisons slower. It’s more efficient to store it as a number and convert to and from the string form only when needed.
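A quick way to see the overhead, assuming PostgreSQL (pg_column_size reports the bytes used to store a value; the exact header overhead varies, hence the hedging above):

    -- the same 64-bit value stored natively vs. as its decimal string:
    -- bigint is a fixed 8 bytes, the text form is 19 digits plus a length header
    SELECT pg_column_size(1234567890123456789::bigint) AS as_bigint,
           pg_column_size('1234567890123456789'::text) AS as_text;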
If the post was “the incredible difficulty of inserting 120k rows in database” you’d have a point. But it isn’t.