Database-like ops benchmark: Which data tool is fastest?

⚡ dplyr, data.table, pandas, Polars or DuckDB? This benchmark has the answer with real data.

The DuckDB Labs database-like operations benchmark compares the most popular data manipulation tools in open-source data science. Tests include:

groupby → aggregations by group (the most common operation in data analysis)
join → joining datasets at different scales
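To make the two operation types concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than any of the benchmarked tools. The table and column names (x, small, id1, v1, v2) and the toy rows are assumptions for illustration, not the benchmark's actual datasets or queries.

```python
import sqlite3

# Toy in-memory tables; names and values are illustrative assumptions.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE x (id1 TEXT, v1 INTEGER)")
con.execute("CREATE TABLE small (id1 TEXT, v2 INTEGER)")
con.executemany(
    "INSERT INTO x VALUES (?, ?)",
    [("a", 1), ("b", 2), ("a", 3), ("c", 4)],
)
con.executemany("INSERT INTO small VALUES (?, ?)", [("a", 10), ("b", 20)])

# groupby: collapse rows into one aggregated row per group
groupby_rows = con.execute(
    "SELECT id1, SUM(v1) FROM x GROUP BY id1 ORDER BY id1"
).fetchall()

# join: match rows from both tables on a shared key (inner join)
join_rows = con.execute(
    "SELECT x.id1, x.v1, small.v2 "
    "FROM x JOIN small ON x.id1 = small.id1 "
    "ORDER BY x.id1, x.v1"
).fetchall()

print(groupby_rows)  # [('a', 4), ('b', 2), ('c', 4)]
print(join_rows)     # [('a', 1, 10), ('a', 3, 10), ('b', 2, 20)]
```

Each benchmarked tool expresses the same two ideas in its own syntax, which is why seeing the exact code behind a timing matters.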

Each test runs on datasets of three sizes (0.5GB, 5GB, 50GB) to show how each tool scales.

Which tools are included? pandas, dplyr, data.table, DuckDB, Polars, Spark, ClickHouse, and more.

Most interesting: the benchmark doesn’t just show timings. It also shows the exact syntax being timed, so you can see whether the comparison applies to your specific use case.
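The idea of keeping a timing next to the code it measures can be sketched in plain Python. This is not the benchmark's own harness: the helper names `timed` and `sum_by_key`, the toy data, and the choice to keep the best of three repeats are all assumptions made for illustration.

```python
import time
from collections import defaultdict

def timed(label, fn, repeats=3):
    """Run fn `repeats` times; return (label, best elapsed seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        best = min(best, time.perf_counter() - start)
    return label, best, result

def sum_by_key(rows):
    """A hand-written groupby: total the values for each key."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return dict(totals)

# Hypothetical toy data: 100,000 rows spread over 100 keys.
rows = [(i % 100, i) for i in range(100_000)]

label, seconds, totals = timed("sum value by key", lambda: sum_by_key(rows))
print(f"{label}: {seconds:.4f}s over {len(rows)} rows")
```

Reporting the label together with the number is the point: a timing without the statement behind it tells you little about whether it matches your workload.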

The project was started by H2O.ai and is now maintained by DuckDB Labs. It runs automatically when a PR is opened in the repository.

💡 Explanation in a nutshell

Not all data tools perform equally under real load. DuckDB and Polars consistently rank among the fastest for in-memory analytical operations, while tools like pandas show their limits as the data grows. Because it publishes both the timings and the code being timed, this benchmark is one of the most transparent references for choosing your data stack.

Database-like ops benchmark

duckdblabs.github.io ↗

Also published on LinkedIn.

Author

Juan Pedro Bretti Mandarano
