
⚡ dplyr, data.table, pandas, Polars or DuckDB? This benchmark has the answer with real data.
The DuckDB Labs database-like operations benchmark compares the most popular data manipulation tools in open-source data science. Tests include:
- groupby → aggregations by group (the most common operation in data analysis)
- join → joining datasets at different scales
The tests are run on datasets of different sizes (0.5GB, 5GB, 50GB) to show how each tool scales.
Which tools are included? pandas, dplyr, data.table, DuckDB, Polars, Spark, ClickHouse, and more.
Most interesting: the benchmark doesn't just show timings, it also shows the exact syntax being timed, so you can see whether the comparison applies to your specific use case.
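To make the two query types concrete, here is a minimal sketch of a timed groupby and a timed join. It is not the benchmark's own code: it uses Python's built-in sqlite3 module purely as a stand-in engine, and the table names (`x`, `small`), column names (`id1`, `v1`, `label`) and the `timed` helper are hypothetical choices for illustration.

```python
import sqlite3
import time


def timed(conn, sql):
    """Run a query, fetch all rows, and return (rows, elapsed_seconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    return rows, time.perf_counter() - start


# Tiny in-memory tables; the real benchmark uses far larger datasets.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE x (id1 TEXT, v1 INTEGER)")
conn.execute("CREATE TABLE small (id1 TEXT, label TEXT)")
conn.executemany(
    "INSERT INTO x VALUES (?, ?)",
    [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("c", 5)],
)
conn.executemany(
    "INSERT INTO small VALUES (?, ?)",
    [("a", "alpha"), ("b", "beta")],
)

# groupby: one aggregated value per group
groupby_rows, groupby_secs = timed(
    conn, "SELECT id1, SUM(v1) FROM x GROUP BY id1 ORDER BY id1"
)

# join: match rows from both tables on a shared key (inner join)
join_rows, join_secs = timed(
    conn,
    "SELECT x.id1, x.v1, small.label FROM x "
    "JOIN small ON x.id1 = small.id1 ORDER BY x.id1, x.v1",
)

print(f"groupby: {groupby_rows} in {groupby_secs:.6f}s")
print(f"join: {len(join_rows)} rows in {join_secs:.6f}s")
```

On five rows the timings mean nothing; the point is only the shape of the two operations. Differences between tools show up at the dataset sizes the benchmark actually uses.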
This project was originally started by H2O.ai and is now maintained by DuckDB Labs. It runs automatically when a PR is opened in the repository.
💡 Explanation in a nutshell
Not all data tools are equal under real load. DuckDB and Polars consistently appear among the fastest for in-memory analytical operations, while tools like pandas show limitations at larger scale. This benchmark is the most honest reference available for choosing your data stack.
More information at the link 👇
