Fantastic job by the DuckDB team. I’ve been using it for the past year to query 100s of GBs of Parquet files with complex analytic queries involving multiple levels of aggregations, joins and window functions and it all works and works fast. And I do all this from Jupyter Notebook.
Yeah I've started using it with my search engine as well. It's fantastic how versatile it is for all manner of data munging.
Just the other day I used it to transform an unordered 60 GB CSV file of links and texts into a 3 GB Parquet file that's so fast I can create a projection of the relevant data for each partition in about a minute (which then fits in memory).
It has some minor stability issues so I'm not sure I'd build a full blown application on top of it, but for data transformation tasks it's amazing.
We are in the same scenario (querying lots of Parquet files in S3) and we noticed that DuckDB quickly crashes with OOMs in environments with a few gigs of RAM.
Setting the memory limit or the disk swap file has not worked.
Not since 0.9.0. There’s been a lot of work on out of core stuff especially with big joins. It’s in the release notes. Also if you set a temp folder it will spill to disk (not set by default).
I used to have to chunk my data to avoid OOMs but I haven’t had to do that.
That said there are a few more out of core strategies on the roadmap that have not yet been implemented. If you still get OOMs, chunking your data will help.
Also consider that a few gigs of RAM might simply not be enough for your workload. Out-of-core strategies can only do so much.
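A minimal sketch of those two settings from Python (the 4 GB limit and the temp path are placeholder values, not recommendations):

    import duckdb

    con = duckdb.connect()
    # Cap how much memory DuckDB may use for query processing.
    con.execute("SET memory_limit = '4GB';")
    # Give DuckDB a temp folder so large joins/aggregations can spill to disk.
    con.execute("SET temp_directory = '/tmp/duckdb_spill';")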
Breaking it up. Instead of running a query for the entire year, run it month by month and stitch the final results back together.
Or if you have a unique string ID, calculate an integer hash using hash(ID) % 50 to get 50 chunks which you can process separately without OOMing.
A basic assumption is that all the chunks are independent of each other. Chunking is essentially temporary partitioning to fit your processing limitations.
I had the same idea in mind, but I'm struggling to understand an efficient implementation.
Say, I have large table A, which I want to group by.
I can chunk it into small tables, for example:
for i in range(64): execute(f'create table A_chunk_{i} as select * from A where hash(id) % 64 = {i}')
but this would mean scanning the source table 64 times, which is likely not efficient. I am wondering if DuckDB has some functionality that can do the chunking more efficiently?
It's not doing a full table scan of the entire source table 64 times -- it's only scanning the id column (DuckDB + Parquet is a columnar store) and retrieving the corresponding rows from other columns. It's pretty fast.
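If you want to avoid repeated scans altogether, here's a sketch of a single-pass alternative using COPY ... PARTITION_BY, assuming the table A and id column from above (the group-by key is a placeholder): one scan writes all 64 hash buckets as hive-partitioned Parquet, and each bucket is then processed on its own.

    import duckdb

    con = duckdb.connect()
    # One scan over A: write 64 hash buckets as hive-partitioned Parquet files.
    con.execute("""
        COPY (SELECT *, hash(id) % 64 AS bucket FROM A)
        TO 'chunks' (FORMAT PARQUET, PARTITION_BY (bucket));
    """)
    # Each bucket now fits your processing limits and can be handled separately.
    for i in range(64):
        con.execute(
            f"SELECT some_key, count(*) "
            f"FROM read_parquet('chunks/bucket={i}/*.parquet') GROUP BY some_key"
        ).fetchall()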
this part is confusing to me in the doc.. I assume that you're using the httpfs (S3) extension and perhaps scanning the parquet files (which I think is actually streamed, e.g. querying for specific column values in a series of parquet files). We have a huge data set of hive-partitioned parquet files in S3 (e.g. /customerid/year/month/<series of parquet files>). Can I just scan these files using a glob pattern to retrieve data like I can with Athena? The extension doc seems to indicate that I can (from the doc: SELECT * FROM read_parquet('s3://bucket/*/file.parquet', HIVE_PARTITIONING = 1) WHERE year=2013;). Or do I need to know which parquet files I'm looking for in S3 and bring them down to work on locally? If it's the former, then it seems equivalent to Athena.
And no you don’t have to know the exact parquet file. You would treat the Hive partitioned data as a single dataset and DuckDB will scan it automatically. (Partition elimination, predicate pushdown etc all done automatically)
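Something like this sketch, assuming the directories are in key=value form (e.g. customerid=42/year=2013/month=06/...), which is what hive_partitioning expects; the bucket name and filter values are made up:

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs; LOAD httpfs;")
    # (S3 credentials/region go through the usual s3_access_key_id /
    #  s3_secret_access_key / s3_region settings.)
    df = con.execute("""
        SELECT *
        FROM read_parquet('s3://my-bucket/*/*/*/*.parquet', hive_partitioning = 1)
        WHERE customerid = 42 AND year = 2013
    """).df()
    # Only files under matching partitions are actually read (partition
    # elimination), and filters are pushed down into the Parquet scan.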
Would be curious how the performance compares to DataFusion[0], one of the top contenders to DuckDB in this area (albeit they differ in a lot of parts, I find it one of the closest compared to all the others).
ClickBench (from ClickHouse) has some benchmarks[1] where they can be compared, but I'm not super sure how up to date they are. At least a while back, they were majorly out of date, and I haven't looked too closely at whether they're keeping it fair for everyone else :)
Do you have compression enabled? At least from Pandas, Parquet defaults to compressed and Arrow/Feather default to uncompressed. When I enable zstd compression, I get similar file sizes, and sometimes Arrow is smaller.
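For example (a quick sketch; df stands in for whatever frame you're writing):

    import pandas as pd
    import pyarrow.feather as feather

    df = pd.DataFrame({"a": range(1_000_000), "b": ["x"] * 1_000_000})

    # Parquet is compressed by default from pandas; make it explicit with zstd.
    df.to_parquet("data.parquet", compression="zstd")

    # Compress the Arrow/Feather file too, for an apples-to-apples size comparison.
    feather.write_feather(df, "data.feather", compression="zstd")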
> Parquet and Arrow are complementary technologies, and they make some different design tradeoffs. In particular, Parquet is a storage format designed for maximum space efficiency, whereas Arrow is an in-memory format intended for operation by vectorized computational kernels.
> The major distinction is that Arrow provides O(1) random access lookups to any array index, whilst Parquet does not. In particular, Parquet uses dremel record shredding, variable length encoding schemes, and block compression to drastically reduce the data size, but these techniques come at the loss of performant random access lookups.
the O(1) random access doesn't look like an advantage at all, honestly.
if I load my parquet into memory - I will have O(1) random access to any row just as well.
plus, considering that Arrow recommends working in chunks of 1000 rows per file, I am curious to learn which exact tasks Arrow is optimizing for.
the only use case I can think of is transferring data between systems written in different languages/runtimes and doing zero serialization/deserialization, just send/receive memory buffers that are nicely mapped to dataframes.
Into what format in memory? You can't just mmap into a parquet file and access whole records, it's a columnar format.
Arrow is absolutely designed for interop between languages - often devs want to develop their core in something like Python, while the platform devs want to work in Scala. Arrow lets you write all the distributed-system and data-shuffling code in Scala, but then users can access the same records in their user-defined code with minimal overhead.
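A small Python-only sketch of what that buys you: serialize once to the Arrow IPC file format, then memory-map it and jump to arbitrary rows without parsing or copying anything (any other Arrow implementation, e.g. the Java/Scala or Rust ones, can open the same file the same way):

    import pyarrow as pa
    import pyarrow.ipc as ipc

    table = pa.table({"id": list(range(1_000_000)),
                      "value": [float(i) for i in range(1_000_000)]})

    # Write the table once as an Arrow IPC (Feather v2) file.
    with pa.OSFile("data.arrow", "wb") as sink:
        with ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)

    # Memory-map it back: no deserialization step, O(1) slicing to any row.
    with pa.memory_map("data.arrow", "r") as source:
        mapped = ipc.open_file(source).read_all()
    row = mapped.slice(123_456, 1).to_pydict()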
> Fugue provides an easier interface to using distributed compute effectively and accelerates big data projects. It does this by minimizing the amount of code you need to write, in addition to taking care of tricks and optimizations that lead to more efficient execution on distributed compute. Fugue ports Python, Pandas, and SQL code to Spark, Dask, and Ray.
I really love DuckDB for one-offs and analytics, and I wonder if anybody here has experience using it with medium-size data.
I still seem to run into the workflow problem where data has to be in proximity to compute in order to function.
If I need to run joins on a 5-10 GB parquet / table, unless I have that sitting locally, the performance bottleneck is not the database.
I still find myself reaching to Databricks / Spark for most tasks for this reason.
I suppose this is what Motherduck is trying to solve? But it just doesn't feel like it's quite there yet for me. Anybody who is better at this stuff than me have thoughts?
I really like Motherduck's hybrid execution model. My DS colleagues love abusing our data warehouse - being able to bring the data down locally to pound on is a win-win.
Yes, we are indeed a good use case for this. For one, we built a fully-fledged managed storage system on top of DuckDB, with better performance and caching and the like. Two, we're going to be pretty good at reading from S3 because we've optimized that path. Three, our storage has sharing/IAM and is about to have things like zero-copy clone and time travel.
It would be interesting to see some more benchmarks of e.g. querying multiple files from S3, and how that evolved across versions.
When I checked at 0.7.1, when working with ~90 S3 parquet objects (x0000 rows each, so not too many) it was 25-50% faster to first download them in Go and then query them, rather than using the DuckDB S3 extension with those objects directly (the whole execution ran on the order of a couple hundred milliseconds).
You definitely pay a performance penalty on S3 (S3 is high throughput but high latency storage) so not optimal for (any) database use cases. Local disk will always be faster if you can swing that.
It’s not a DuckDB specific issue (although there’s headroom for improvement — I don’t think DuckDB’s S3 connector is highly optimized). It’s S3.
The overhead of fetching from S3 via a naive Go implementation (goroutine per object) to disk and then running duckdb on that was lower than using duckdb end-to-end.
Like I said, S3 is high-throughput, high-latency storage. When you fetch an S3 object to disk, that's a high-throughput operation and S3 excels at it. Once the data is on disk, DuckDB can operate at low latency.
If you run DuckDB end to end as a database engine on S3, it has to do partial reads on the Parquet files in S3, deal with S3 latencies, and so on, and it can end up being slower than what you described above.
For long running operations where I can chunk the data, I often copy chunks to local disk before running DuckDB. It’s a lot faster than running DuckDB directly on S3.
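A rough Python version of that fetch-then-query pattern (the parent used Go; bucket and key names here are placeholders):

    import duckdb
    import boto3
    from concurrent.futures import ThreadPoolExecutor

    s3 = boto3.client("s3")
    keys = ["chunk_000.parquet", "chunk_001.parquet"]  # whichever objects you need

    # High-throughput bulk fetch: pull the objects down in parallel...
    with ThreadPoolExecutor(max_workers=16) as pool:
        list(pool.map(lambda k: s3.download_file("my-bucket", k, f"/tmp/{k}"), keys))

    # ...then DuckDB runs against local disk at low latency.
    rows = duckdb.sql("SELECT count(*) FROM read_parquet('/tmp/chunk_*.parquet')").fetchall()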
Theoretically reading directly from s3 should be faster. Downloading all the data from s3 and then running the query locally is basically an extreme form of pre-fetching. DuckDB could be written to pre-fetch data concurrently using some heuristics and provide similar or better performance.
Makes sense — every S3 call is high latency so the fewer you make the better.
Prefetching would help reduce the number of those high latency calls which databases naturally make.
We often think of S3 as a file system but it isn’t one — it differs in fundamental ways from one. (Also treating it as a filesystem isn’t performant at all — I tried s3fs and mountpoint but both were slow)
Oh, yeah, that does make sense, and I suppose it might also be a better approach if the objects were large (especially if you can avoid reading some parts). In that case prefetching could be wasteful (or result in OOMs).
I didn’t think of that - thanks for the explanation!
I had similar issues with the Dask scheduler a few months ago. The docs say it encourages depth first behavior in the computation graph but in my case it kept running out of memory on a large ETL task by first trying to load all the input files into memory before moving on to the next stage.
I am impressed that Polars is close to DuckDB near the top. It's surprising that a Python library would often outperform everything but DuckDB. DuckDB is very impressive, but DataFrames and Python are too useful to give up on.
DuckDB interoperates with polars dataframes easily. I see DuckDB as a SQL engine for dataframes.
Any DuckDB result is easily converted to Pandas (by appending .df()) or Polars (by appending .pl()).
The conversion to polars is instantaneous because it’s zero copy because it all goes through Arrow in-memory format.
So I usually write complex queries in DuckDB SQL but if I need to manipulate it in polars I just convert it in my workflow midstream (only takes milliseconds) and then continue working with that in DuckDB. It’s seamless due to Apache Arrow.
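A small sketch of that round trip (table and column names are made up):

    import duckdb
    import polars as pl

    # Complex query in DuckDB SQL, handed to Polars zero-copy via Arrow.
    orders = duckdb.sql("""
        SELECT customer_id, sum(amount) AS total
        FROM read_parquet('orders/*.parquet')
        GROUP BY customer_id
    """).pl()

    # Manipulate midstream in Polars...
    top = orders.filter(pl.col("total") > 1_000)

    # ...then query the Polars frame straight back from DuckDB by name.
    duckdb.sql("SELECT count(*) FROM top").show()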
Wow, what a cool workflow. It looks like the interop promise of Apache Arrow is real. It's a great thing when your computer works as fast as you think, as opposed to sitting around waiting for queries to finish.
I mean, polars is great, but there's nothing fundamentally impossible about polars providing similar performance to DuckDB. Polars is written in Rust, and a lazy dataframe really just provides an alternative frontend (SQL being another frontend).
There's nothing in the architecture that would make it so that performance in one OLAP engine is fundamentally impossible to achieve in another.
I didn't know that Polars was implemented in Rust. In fact, it's very neat that Rust can interop with Python so cleanly, but that shouldn't be surprising since Python is basically a wrapper around C libraries.
But I still think it's surprising how much legs Python's model of wrapping C/C++/Rust libraries has. I would have assumed that with Python calling the libraries you can't do lazy evaluation, and thus you hit a wall the way Pandas does.
But we've seen with PyTorch compilation and Polars that you can have your cake and eat it too: keep the ease of use of Python while getting the performance, given enough engineering.
I was/am a fan of duckdb, but I recently discovered a bug in 0.9.1 where a fairly innocuous query was silently returning wrong results (issue 9399 on github). That made me much less confident about duckdb and how well tested it is.
Maybe it was a one off, but with postgresql for example I don't think I personally encountered cases of simply incorrect query results.
Just had a look (https://github.com/duckdb/duckdb/issues/9399). Yeah it's worrying that such a trivial query returned incorrect results - but credit to the Devs for getting it fixed quickly.
To my knowledge the only databases that can be described as "military-grade" in terms of testing are SQLite and Postgres.
Apparently DuckDB requires your real-life name to file an online bug report, bucking every norm of online handles for communication, as well as enabling doxxers and stalkers to find and trace people in real life.
It's the same with a lot of open source contributions; those need to do so for legal and copyright reasons.
If you're afraid of doxxing and / or stalking though, at least you have the choice to not contribute. You can still post somewhere else and ask someone else to make the report for you if need be.
This is stupid and [the given reasoning](https://berthub.eu/articles/posts/anonymous-help/) makes no sense. If you post open source software and open issues for it, and I post a properly formatted, reproducible, clear issue, there is absolutely no reason for me to disclose my identity for the sake of your ego.
It seems to be about wanting to know who you're talking to when providing free support for an open-source project, and whether the person submitting an issue is using the project for personal use or within an organization.
> If I don’t know who you are, am I enabling you to build the new Turkish censorship infrastructure, or helping you implement [Russian internet blocking] more efficiently? These are two examples that actually happened by the way.
Query engines are (not) surprisingly complex software products. Add to that the constant (and, due to the competition in the field, aggressive) evolution and addition of new features that can interact with every existing feature in any existing context, and you have a perfect environment for bugs to appear.
Almost(!) every single query engine has bugs like this. The fact is that DuckDB is version 0.9.1 (pre 1.0) so bugs are kind of expected. I have found bugs like this in commercial engines.
If you have some data in postgresql and want to query it with duckdb (really fast), you can try extracting the data to a parquet file; this file can then be queried from duckdb with incredible speed. I've written a small program in Python that reads from postgresql and exports to parquet for anybody who wants to try it: https://github.com/spapas/pg-parquet-py#why
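The core of the idea is roughly this (a minimal sketch with a made-up connection string and table name):

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:pass@localhost/mydb")

    # Pull the table (or any query result) out of Postgres...
    df = pd.read_sql("SELECT * FROM my_table", engine)

    # ...and write it as compressed Parquet for DuckDB to query.
    df.to_parquet("my_table.parquet", compression="zstd")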
Really interesting to compare to Clickhouse's benchmark, when you filter out the non-comparable results. The TLDR is that their benchmark shows DuckDB winning a lot of the races:
I do think it was important for duckdb to put out a new version of the results as the earlier version of that benchmark [1] went dormant with a very old version of duckdb with very bad performance, especially against polars.
I think these benchmarks are great, but also quite misleading and should be updated:
* the 1 billion row benchmarks are run on a single, uncompressed 50 GB CSV file. 50 GB should be stored in multiple files.
* the benchmarks only show the query runtime once the data has been persisted in memory. They should also show how long it takes to persist the data in memory. If query_engine_A takes 5 mins to persist in memory & 10 seconds to run the query and query_engine_B takes 2 mins to persist in memory & 20 seconds to run the query, then the amount of time to persist the data is highly relevant.
* benchmarks should also show results when the data isn't persisted in memory.
* Using a Parquet file with column pruning would make a lot more sense than a huge CSV file. The groupby dataset has 9 columns and some of the queries only require 3 columns. Needlessly persisting 6 columns in memory is really misleading for some engines.
* Seems like some of the engines have queries that are more optimized than others. Some have explicitly cast columns as int32 while others presumably use int64. The queries should be apples:apples across engines.
* Some engines are parallel and lazy. "Running" some of these queries is hard because lazy engines don't want to do work unless they have to. The authors have forced some of these queries to run by persisting in memory, which is another step, so that should be investigated.
* There are obvious missing query types like filtering and "compound queries" like filter, join, then aggregate.
I like these benchmarks a lot and use the h2o datasets locally all the time, but the methodology really needs to be modernized. At the bottom you can see "Benchmark run took around 105.3 hours." This is way too slow, and there are some obvious fixes that would make the results more useful for the data community.
Modern query engines are designed to read data in parallel because it's so much faster. The data could be stored in 50 different one-gig files that were read in parallel.
None of this requires multiple files, either in theory or in practice. It may be a limitation of a particular implementation but that is a design choice. There are functional advantages to not splitting up files unnecessarily when processing queries.
Spec compliant CSV parsing cannot be done in parallel iirc. Most CSV parsers are not spec compliant, but splitting up CSVs is one way to read them in parallel while ensuring you can still use a spec compliant parser.
> the 1 billion row benchmarks are run on a single, uncompressed 50 GB CSV file. 50 GB should be stored in multiple files.
For a generic OLAP db, maybe. In this case, though, a single file fits one of DuckDB's use cases: analytics on the data consumed or produced by a data scientist. In such a scenario, a multi-GB input or dumping GBs of a dataframe into a single CSV file is not uncommon.
I love the increased focus on benchmark testing recently, but I always find it a little weird to read for stuff like spark or dask.
Those are written to offer scale over large data, so they have very different overheads and limits compared to something like DuckDB. It seems odd to have them on the same chart.
Also, side note but I'd love to see the performance impact of pandas/dask with pyarrow schemas.
We at MotherDuck are working closely with the DuckDB folks to deliver a serverless analytics service powered by DuckDB [0]. Currently in open Beta and driving towards GA. Our users are certainly recognizing how fast DuckDB is in the cloud.
One of the reasons we exist is because DuckDB is meant to be a single-player database. MotherDuck is doing tons of heavy-lifting to turn it into a true multi-player data warehouse, so things like IAM/sharing, persistence, time travel, administration, the ecosystem and so forth.
What's magical about MotherDuck is that virtually any DuckDB instance in the wild can connect to MotherDuck by simply running '.open motherduck:' [1], and suddenly you get all these aforementioned benefits.
Nice job DuckDB team, those are great performance improvements compared to a couple years ago. It’s neat that solutions like this are being offered and developed. At Hasura we’ve been working on a data-connector that’ll wrap around DuckDB. Curious what others would think about having a GraphQL layer on JSON/Parquet/CSV via DuckDB?
Moderation note: the actual title of this article is "Updates to the H2O.ai db-benchmark!".
It's not about duckdb's performance improvements per se, it's about the change to the aws instance type they're using to get fairer benchmark results by avoiding block storage and preventing noisy neighbors.
It’s actually faster than AWS Athena for me.