Updates to the H2O.ai db-benchmark (duckdb.org)
192 points by vgt | 2023-11-06 | 92 comments




Fantastic job by the DuckDB team. I’ve been using it for the past year to query 100s of GBs of Parquet files with complex analytic queries involving multiple levels of aggregations, joins and window functions and it all works and works fast. And I do all this from Jupyter Notebook.

It’s actually faster than AWS Athena for me.


That’s great! Where are the Parquet files stored, and where is running DuckDB?

S3 and SageMaker. Same region.

Then you have low latency between the SageMaker notebook running DuckDB and the S3 buckets storing the Parquet files: neat!

Yeah, I've started using it with my search engine as well. It's fantastic how versatile it is for all manner of data munging.

Just the other day I used it to transform an unordered 60 GB CSV file with links and texts into a 3 GB parquet file that's so fast I can create a projection for the relevant data of each partition in like a minute (which then fits in memory).

It has some minor stability issues so I'm not sure I'd build a full blown application on top of it, but for data transformation tasks it's amazing.


Are you not getting OutOfMemory errors?

We are in the same scenario (querying lots of Parquet files in S3) and we noticed that DuckDB quickly crashes with OOMs in environments with a few gigs of RAM.

Setting the memory limit or configuring the disk spill file has not helped.


I encountered the same issue. Polars' memory usage was much lower for the datasets I tried.

A problem I encountered with Polars was that the immutability meant copies of the dataframe were generated and not cleaned up fast enough.

Not since 0.9.0. There’s been a lot of work on out of core stuff especially with big joins. It’s in the release notes. Also if you set a temp folder it will spill to disk (not set by default).

I used to have to chunk my data to avoid OOMs but I haven’t had to do that.

That said there are a few more out of core strategies on the roadmap that have not yet been implemented. If you still get OOMs, chunking your data will help.

Also consider that a few gigs of RAM might not be large enough for your workload. Out-of-core strategies can only do so much.
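For reference, a minimal sketch of the settings mentioned above (the limit value and spill path are placeholders to adapt to your environment):

```sql
-- Cap DuckDB's memory usage; beyond this it will try to spill to disk
SET memory_limit = '4GB';

-- Not set by default: give DuckDB a place to spill temporary data
SET temp_directory = '/tmp/duckdb_spill';
```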

https://duckdb.org/2023/09/26/announcing-duckdb-090.html


> If you still get OOMs, chunking your data will help.

What does chunking data mean here?


Breaking it up. Instead of running a query for the entire year, run it month by month and stitch the final results back together.

Or if you have a unique string ID, calculate an integer hash using hash(ID) % 50 to get 50 chunks which you can process separately without OOMing.

A basic assumption is that all the chunks are independent of each other. Chunking is essentially temporary partitioning to fit your processing limitations.
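A stdlib-only sketch of the hash-chunking idea described above (the function names and chunk count are illustrative, not from any library):

```python
import hashlib

NUM_CHUNKS = 50

def chunk_of(record_id: str) -> int:
    # Stable hash: Python's built-in hash() is salted per process,
    # so use a digest to get the same chunk assignment across runs.
    digest = hashlib.sha256(record_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_CHUNKS

def iter_chunks(ids):
    """Group IDs into NUM_CHUNKS independent buckets for separate processing."""
    buckets = {i: [] for i in range(NUM_CHUNKS)}
    for record_id in ids:
        buckets[chunk_of(record_id)].append(record_id)
    return buckets

# Each bucket can now be queried/processed on its own without OOMing,
# and partial results stitched back together at the end.
```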


I had the same idea in mind, but I'm struggling to understand an efficient implementation.

Say, I have large table A, which I want to group by. I can chunk it into small tables, for example:

for i in range(64): con.execute(f"create table chunk_{i} as select * from A where hash(id) % 64 = {i}")

but this would mean scanning the source table 64 times, which is likely not efficient. I am wondering if DuckDB has some functionality that can do more efficient chunking?
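One way to avoid the 64 scans is DuckDB's partitioned Parquet write, which buckets the data in a single pass; each bucket directory can then be processed on its own. A sketch (table and output path as above/placeholders):

```sql
-- Single scan of A: write one Parquet partition per hash bucket
COPY (
    SELECT *, hash(id) % 64 AS bucket
    FROM A
) TO 'chunks' (FORMAT PARQUET, PARTITION_BY (bucket));

-- Later, process one bucket at a time
SELECT count(*) FROM read_parquet('chunks/bucket=0/*.parquet');
```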


It's not doing a full table scan of the entire source table 64 times -- it's only scanning the id column (DuckDB + Parquet is a columnar store) and retrieving the corresponding rows from other columns. It's pretty fast.

DuckDB 0.9 has fixed most of OOM scenarios by both using far less RAM and by spilling to disk better.

(Head of Produck at MotherDuck and a huge fan of DuckDB)


This part of the doc is confusing to me. I assume you're using the httpfs (S3) extension and perhaps scanning the Parquet files (which I think is actually streamed, e.g. querying for specific column values in a series of Parquet files). We have a huge dataset of Hive-partitioned Parquet files in S3 (e.g. /customerid/year/month/<series of parquet files>). Can I just scan these files using a glob pattern to retrieve data like I can with Athena? The extension doc seems to indicate that I can (from the doc: SELECT * FROM read_parquet('s3://bucket/*/file.parquet', HIVE_PARTITIONING = 1) where year=2013;). Or do I need to know which Parquet files I'm looking for in S3 and bring them down to work on locally? If it's the former, then it seems equivalent to Athena.

No you can definitely use globs in DuckDB.

And no you don’t have to know the exact parquet file. You would treat the Hive partitioned data as a single dataset and DuckDB will scan it automatically. (Partition elimination, predicate pushdown etc all done automatically)

https://duckdb.org/docs/data/partitioning/hive_partitioning
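A sketch for a layout like the one described above (bucket and prefix names are placeholders; note that hive_partitioning expects key=value directory names, e.g. customerid=acme/year=2023/month=11):

```sql
-- DuckDB prunes partitions from the WHERE clause, so only matching
-- customerid/year/month directories are actually read from S3.
SELECT count(*)
FROM read_parquet(
    's3://my-bucket/*/*/*/*.parquet',
    hive_partitioning = 1
)
WHERE customerid = 'acme' AND year = 2023 AND month = 11;
```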


OK, thanks, I'll try it out. I can think of a few use cases we have where this might be a good alternative to Athena.

Exactly the same experience here. I am thinking about getting rid of everything else in our data infra as soon as it hits 1.0.

Would be curious how the performance compares to DataFusion[0], one of the top contenders to DuckDB in this area (albeit different in a lot of parts, I find it one of the closest compared to all the others).

ClickBench (from ClickHouse) has some benchmarks[1] where it can be compared, but I'm not super sure how up to date they are. At least a while back they were majorly out of date, and I haven't looked too closely at whether they are keeping it fair for everyone else :)

[0]: https://github.com/apache/arrow-datafusion

[1]: https://benchmark.clickhouse.com


Looks like a recent PR bumped benchmark.clickhouse.com to DuckDB v0.9 on the 3rd.

https://github.com/ClickHouse/ClickBench/pull/141


A paper on DataFusion is in progress[0].

The draft[1] includes a comparison to DuckDB and preliminary benchmark results.

[0]: https://github.com/apache/arrow-datafusion/issues/6782

[1]: https://www.overleaf.com/read/qjhrxqhgksvr


Why do you run benchmarks on such small datasets? It is very hard to judge performance.

Looks like DataFusion is included in most of the results in the article?

You are right! Seems like it is not text-searchable, which is why my Ctrl+F searches failed.

question about Arrow: the format seems to be not very space-efficient.

I tried converting one of my data lake files from Parquet to Arrow and the size difference is staggering: 20 MB Parquet -> 700 MB Arrow.

doesn't seem like a fit for a data lake at all


Do you have compression enabled? At least from Pandas, Parquet defaults to compressed and Arrow/Feather default to uncompressed. When I enable zstd compression, I get similar file sizes, and sometimes Arrow is smaller.

I was just trying pandas' native .to_parquet and .to_feather() without any extra config knobs

The following paper describes some of the tradeoffs between different formats

Deep Dive into Common Open Formats for Analytical DBMSs https://www.vldb.org/pvldb/vol16/p3044-liu.pdf


Arrow format is not intended for storage, it's for in-memory data exchange between different libraries and languages.

Arrow is not really designed for storage though. See the "Parquet vs Arrow" section of this post (https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encod...):

> Parquet and Arrow are complementary technologies, and they make some different design tradeoffs. In particular, Parquet is a storage format designed for maximum space efficiency, whereas Arrow is an in-memory format intended for operation by vectorized computational kernels.

> The major distinction is that Arrow provides O(1) random access lookups to any array index, whilst Parquet does not. In particular, Parquet uses dremel record shredding, variable length encoding schemes, and block compression to drastically reduce the data size, but these techniques come at the loss of performant random access lookups.


the O(1) random access doesn't look like an advantage at all, honestly.

if I load my Parquet into memory, I will have O(1) random access to any row just as well.

plus, considering that Arrow recommends working in chunks of 1000 rows per file, I am curious to learn the exact tasks Arrow is optimizing for.

the only use case I can think of is transferring data between systems written in different languages/runtimes and doing zero serialization/deserialization, just send/receive memory buffers that are nicely mapped to dataframes.


Into what format in memory? You can't just mmap a Parquet file and access whole records; it's a columnar format.

Arrow is absolutely designed for Interop between languages - often devs want to develop their core in something like Python, and the platform devs want to work on Scala. Arrow lets you write all the distributed system and data shuffling in Scala, but then users can access the same records in their user-defined code with minimal overhead


Seems likely a memory management issue in the Arrow interface for the language you’re using.

Reminder that you can use Fugue as a unified API and swap out back-ends, including DuckDB: https://fugue-tutorials.readthedocs.io/tutorials/integration...

From the Fugue docs:

> Fugue provides an easier interface to using distributed compute effectively and accelerates big data projects. It does this by minimizing the amount of code you need to write, in addition to taking care of tricks and optimizations that lead to more efficient execution on distributed compute. Fugue ports Python, Pandas, and SQL code to Spark, Dask, and Ray.


I really love DuckDB for one-offs and analytics and I wonder if anybody here has experience using with medium size data.

I still run into the workflow problem where the data has to be in proximity to the compute for things to work well.

If I need to run joins on a 5-10 GB parquet / table, unless I have that sitting locally, the performance bottleneck is not the database.

I still find myself reaching to Databricks / Spark for most tasks for this reason.

I suppose this is what Motherduck is trying to solve? But it just doesn't feel like it's quite there yet for me. Anybody who is better at this stuff than me have thoughts?


I really like MotherDuck's hybrid execution model. My DS colleagues love abusing our data warehouse - letting them bring the data down locally to pound on is a win-win

Hi, head of Produck at MotherDuck here.

Yes, we are indeed a good use case for this. For one, we built a fully-fledged managed storage system on top of DuckDB, with better performance and caching and the like. Two, we're going to be pretty good at reading from S3 because we've optimized that path. Three, our storage has sharing/IAM and is about to have things like zero-copy clone and time travel.

Happy to answer any Qs.


It would be interesting to see some more benchmarks of e.g. querying multiple files from S3, and how that evolved across versions.

When I checked at 0.7.1, when working with ~90 S3 parquet objects (x0000 rows each, so not too many) it was 25-50% faster to first download them in Go and then query them, rather than using the DuckDB S3 extension with those objects directly (the whole execution ran on the order of a couple hundred milliseconds).


You definitely pay a performance penalty on S3 (S3 is high throughput but high latency storage) so not optimal for (any) database use cases. Local disk will always be faster if you can swing that.

It’s not a DuckDB specific issue (although there’s headroom for improvement — I don’t think DuckDB’s S3 connector is highly optimized). It’s S3.


I might’ve been unclear, so to clarify:

The overhead of fetching from S3 via a naive Go implementation (goroutine per object) to disk and then running duckdb on that was lower than using duckdb end-to-end.

I was measuring the S3 overhead in both cases.


No I got you.

Like I said S3 is a high throughput high latency storage. When you fetch the S3 object to disk, that’s a high throughput operation and S3 excels at that. Once on disk DuckDB can operate at low latency.

If you run DuckDB end to end as a database engine on S3, it has to do partial reads on parquet on S3 etc. and has to deal with S3 latencies and it can end up being slower than what you described above.

For long running operations where I can chunk the data, I often copy chunks to local disk before running DuckDB. It’s a lot faster than running DuckDB directly on S3.

The downside is I need enough disk space.


Theoretically reading directly from s3 should be faster. Downloading all the data from s3 and then running the query locally is basically an extreme form of pre-fetching. DuckDB could be written to pre-fetch data concurrently using some heuristics and provide similar or better performance.
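The prefetching idea can be sketched with a thread pool; here `fetch` is a stand-in for a real S3 GET, not any actual client API:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(key: str) -> bytes:
    # Stand-in for an S3 GET: high latency per call, high throughput overall.
    return f"contents of {key}".encode()

def prefetch_all(keys, max_workers=16):
    """Issue many high-latency fetches concurrently so total wall time
    approaches the slowest single fetch rather than the sum of all fetches."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(keys, pool.map(fetch, keys)))

# The local copies can then be queried at local-disk latency.
```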

Makes sense — every S3 call is high latency so the fewer you make the better.

Prefetching would help reduce the number of those high latency calls which databases naturally make.

We often think of S3 as a file system but it isn’t one — it differs in fundamental ways from one. (Also treating it as a filesystem isn’t performant at all — I tried s3fs and mountpoint but both were slow)


Oh, yeah, that does make sense, and I suppose it might also be a better approach if the objects were large (especially if you can avoid reading some parts). In that case prefetching could be wasteful (or result in OOMs).

I didn’t think of that - thanks for the explanation!


Interestingly, Dask runs out of memory on many of the tasks of the benchmark.

I had similar issues with the Dask scheduler a few months ago. The docs say it encourages depth first behavior in the computation graph but in my case it kept running out of memory on a large ETL task by first trying to load all the input files into memory before moving on to the next stage.

I am impressed that Polars is close to DuckDB near the top. It's surprising that a Python library would often outperform everything but DuckDB. DuckDB is very impressive, but DataFrames and Python are too useful to give up on.

Polars is written in Rust!

DuckDB interoperates with polars dataframes easily. I see DuckDB as a SQL engine for dataframes.

Any DuckDB result is easily converted to Pandas (by appending .df()) or Polars (by appending .pl()).

The conversion to polars is instantaneous because it’s zero copy because it all goes through Arrow in-memory format.

So I usually write complex queries in DuckDB SQL but if I need to manipulate it in polars I just convert it in my workflow midstream (only takes milliseconds) and then continue working with that in DuckDB. It’s seamless due to Apache Arrow.

https://duckdb.org/docs/guides/python/polars.html


Wow, what a cool workflow. It looks like the interop promise of Apache Arrow is real. It's a great thing when your computer works as fast as you think, as opposed to sitting around waiting for queries to finish.

I mean, Polars is great, but there's nothing fundamentally impossible about Polars providing similar performance to DuckDB: Polars is written in Rust, and really a lazy dataframe just provides an alternative frontend (SQL being another frontend).

There's nothing in the architecture that would make it so that performance in one OLAP engine is fundamentally impossible to achieve in another.


I didn't know that Polars was implemented in Rust. It's very neat that Rust can interop with Python so cleanly, but that shouldn't be surprising since Python is basically a wrapper around C libraries.

But I still think it's surprising how much legs Python's model of wrapping C/C++/Rust libraries has. I would have assumed that if you have Python calling the libraries, you can't do lazy evaluation, and thus you hit a wall as with Pandas.

But we've seen with PyTorch compilation and Polars that you can have your cake and eat it too: keep the ease of use of Python while getting performance, with enough engineering.


Kinda odd that they have a max 50 GB dataset and run it on a machine with 160 GB of RAM, so there's no testing of out-of-memory capabilities.

I was/am a fan of duckdb, but I recently discovered a bug in 0.9.1 where a fairly innocuous query was silently returning wrong results (issue 9399 on github). That made me much less confident about duckdb and how well tested it is. Maybe it was a one off, but with postgresql for example I don't think I personally encountered cases of simply incorrect query results.

Just had a look (https://github.com/duckdb/duckdb/issues/9399). Yeah it's worrying that such a trivial query returned incorrect results - but credit to the Devs for getting it fixed quickly.

To my knowledge the only databases that can be described as "military-grade" in terms of testing are SQLite and Postgres.


Apparently DuckDB requires your real-life name to file an online bug report, bucking every norm of online handles for communication, as well as enabling doxxers and stalkers to find and trace people in real life.

It's the same with a lot of open-source contributions; those need real names for legal and copyright reasons.

If you're afraid of doxxing and / or stalking though, at least you have the choice to not contribute. You can still post somewhere else and ask someone else to make the report for you if need be.


I'm not aware of any other open source project that required a real name just to file a bug

Yes, that was a surprising requirement when submitting a bug report. (I understand patches may need to be like that due to copyright issues)

This is stupid, and the given reasoning (https://berthub.eu/articles/posts/anonymous-help/) makes no sense. If you post open source software and open issues for it, and I post a properly formatted, reproducible, clear issue, there is absolutely no reason for me to disclose my identity for the sake of your ego.

I looked at the bug report form and for context it links to a post “Dear anonymous internet user asking for help” https://berthub.eu/articles/posts/anonymous-help/

It seems to be about wanting to know who you're talking to when providing free support for an open-source project, and whether the person submitting an issue is using the project for personal use or within an organization.

> If I don’t know who you are, am I enabling you to build the new Turkish censorship infrastructure, or helping you implement [Russian internet blocking] more efficiently? These are two examples that actually happened by the way.


What about Oracle and MSSQL? I'd imagine that especially the former would be "military-grade" (whatever that entails)

Even Postgres has its share of the bugs, e.g. simple search shows https://www.postgresql.org/message-id/CAGckUK2GLF%3Dd9J5ErEW...

Oracle might be military grade because they have the entire web page on how to report wrong results bugs: https://support.oracle.com/knowledge/Oracle%20Cloud/150895_1... https://support.oracle.com/knowledge/Oracle%20Database%20Pro... etc.

Query engines are (un)surprisingly complex software products. Add to that the constant (and, due to the competition in the field, aggressive) evolution and addition of new features that can interact with every existing feature in every existing context, and you have a perfect environment for bugs to appear.


I'm not sure if the fix is reassuring or not: https://github.com/duckdb/duckdb/pull/9411/files

I certainly liked that they added the problematic query to the list of tests, which I think is a healthy sign.

Almost(!) every single query engine has bugs like this. The fact is that DuckDB is version 0.9.1 (pre 1.0) so bugs are kind of expected. I have found bugs like this in commercial engines.

Why do so many of the 0.5GB clickhouse benchmarks fail?

If you have some data in PostgreSQL and want to query it with DuckDB (really fast), you can try extracting the data to a Parquet file; this file can then be queried from DuckDB with incredible speed. I've written a small program in Python that reads from PostgreSQL and exports to Parquet, for anybody that wants to try it: https://github.com/spapas/pg-parquet-py#why

There's also the option to use the DuckDB PostgreSQL scanner

https://duckdb.org/docs/archive/0.9.1/extensions/postgres_sc...
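For reference, the scanner route looks roughly like this (a sketch; the connection string and table name are placeholders):

```sql
INSTALL postgres_scanner;
LOAD postgres_scanner;

-- Expose the tables in the Postgres public schema as views in DuckDB
CALL postgres_attach('dbname=mydb host=localhost user=me');

-- Query Postgres tables directly from DuckDB, or dump one to Parquet
COPY (SELECT * FROM my_table) TO 'my_table.parquet' (FORMAT PARQUET);
```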


Really interesting to compare to Clickhouse's benchmark, when you filter out the non-comparable results. The TLDR is that their benchmark shows DuckDB winning a lot of the races:

https://benchmark.clickhouse.com/#eyJzeXN0ZW0iOnsiQXRoZW5hIC...


I do think it was important for duckdb to put out a new version of the results, as the earlier version of that benchmark [1] went dormant with a very old version of DuckDB that had very bad performance, especially against Polars.

[1] https://h2oai.github.io/db-benchmark/


I think these benchmarks are great, but also quite misleading and should be updated:

* the 1 billion row benchmarks are run on a single, uncompressed 50 GB CSV file. 50 GB should be stored in multiple files.

* the benchmarks only show the query runtime once the data has been persisted in memory. They should also show how long it takes to persist the data in memory. If query_engine_A takes 5 mins to persist in memory & 10 seconds to run the query and query_engine_B takes 2 mins to persist in memory & 20 seconds to run the query, then the amount of time to persist the data is highly relevant.

* benchmarks should also show results when the data isn't persisted in memory.

* Using a Parquet file with column pruning would make a lot more sense than a huge CSV file. The groupby dataset has 9 columns and some of the queries only require 3 columns. Needlessly persisting 6 columns in memory is really misleading for some engines.

* Seems like some of the engines have queries that are more optimized than others. Some have explicitly casted columns as int32 and presumably others are int64. The queries should be apples:apples across engines.

* Some engines are parallel and lazy. "Running" some of these queries is hard because lazy engines don't want to do work unless they have to. The authors have forced some of these queries to run by persisting in memory, which is another step, so that should be investigated.

* There are obvious missing query types like filtering and "compound queries" like filter, join, then aggregate.

I like these benchmarks a lot and use the h2o datasets locally all the time, but the methodology really needs to be modernized. At the bottom you can see "Benchmark run took around 105.3 hours." This is way too slow, and there are some obvious fixes that'll make the results more useful for the data community.


Why should 50GB be stored in multiple files?

Modern query engines are designed to read data in parallel because it's so much faster. The data could be stored in 50 different one-gig files that were read in parallel.

None of this requires multiple files, either in theory or in practice. It may be a limitation of a particular implementation but that is a design choice. There are functional advantages to not splitting up files unnecessarily when processing queries.

Spec compliant CSV parsing cannot be done in parallel iirc. Most CSV parsers are not spec compliant, but splitting up CSVs is one way to read them in parallel while ensuring you can still use a spec compliant parser.

> the 1 billion row benchmarks are run on a single, uncompressed 50 GB CSV file. 50 GB should be stored in multiple files.

For a generic OLAP db, maybe. In this case, though, a single file fits one of DuckDB's use cases: analytics on the data consumed by or produced by a data scientist. In such scenario, it's not uncommon for a multi-GB input or for dumping GBs of a dataframe into a single CSV file.


I love the increased focus on benchmark testing recently, but I always find it a little weird to read for stuff like spark or dask.

Those are written to offer scale over large data so have very different overheads and limits compared to something like duckdb. Seems odd to have them on the same chart.

Also, side note but I'd love to see the performance impact of pandas/dask with pyarrow schemas.


We at MotherDuck are working closely with the DuckDB folks to deliver a serverless analytics service powered by DuckDB [0]. Currently in open Beta and driving towards GA. Our users are certainly recognizing how fast DuckDB is in the cloud.

One of the reasons we exist is because DuckDB is meant to be a single-player database. MotherDuck is doing tons of heavy-lifting to turn it into a true multi-player data warehouse, so things like IAM/sharing, persistence, time travel, administration, the ecosystem and so forth.

What's magical about MotherDuck is that virtually any DuckDB instance in the wild can connect to MotherDuck by simply running '.open motherduck:' [1], and suddenly you get all these aforementioned benefits.

(head of produck at MotherDuck)

[0] https://motherduck.com/

[1] https://motherduck.com/docs/getting-started/connect-query-fr...


Nice job DuckDB team, those are great performance improvements compared to a couple years ago. It’s neat that solutions like this are being offered and developed. At Hasura we’ve been working on a data-connector that’ll wrap around DuckDB. Curious what others would think about having a GraphQL layer on JSON/Parquet/CSV via DuckDB?

It's nice that they put the TL;DR at the top; I'm not sure why, but a lot of places, for some weird reason, put it at the end.

Recent and related:

DuckDB 0.9 - https://news.ycombinator.com/item?id=37657736 - Sept 2023 (59 comments)


Moderation note: the actual title of this article is "Updates to the H2O.ai db-benchmark!".

It's not about duckdb's performance improvements per se, it's about the change to the aws instance type they're using to get fairer benchmark results by avoiding block storage and preventing noisy neighbors.


From the article itself:

"The team at DuckDB Labs has been hard at work improving the performance of the out-of-core hash aggregates and joins."

Aside from that, since April DuckDB has vastly improved performance of a number of queries, including aggregates, joins, window functions. [0]

[0] https://duckdb.org/2023/09/26/announcing-duckdb-090.html#cor...


From: https://news.ycombinator.com/newsguidelines.html

> please use the original title, unless it is misleading or linkbait; don't editorialize.

Just because they mention X in the article, doesn't mean you should change the title to X, even if X is nice, welcome news.


That sounds great and i will do that from now on. That said, the substance of my reply holds...

Changed now. Thanks!

(Submitted title was "DuckDB performance improvements with the latest release")

