DuckDB

🦆 What is DuckDB and why is it revolutionizing data analysis from Python?

If you work with data in CSV or Parquet files, or in databases, and you're tired of loading everything into a traditional database or waiting minutes for Pandas to finish a groupby… 💡 you should get to know DuckDB.

🔍 What is DuckDB?

It’s an embedded SQL database engine (like SQLite), but optimized for analytical queries (OLAP).

It runs in the same process as your program and doesn’t require server installation or complex configuration.

🐍 How do you use it from Python?

Very simple:

import duckdb
import pandas as pd

df = pd.read_csv("ventas.csv")

# Run SQL directly on the DataFrame: DuckDB resolves "df" from the Python variable of that name
resultado = duckdb.query("SELECT categoria, SUM(monto) AS total FROM df GROUP BY categoria").to_df()
print(resultado)

You can also read Parquet or CSV files directly without loading them into Pandas:

duckdb.query("SELECT * FROM 'datos.parquet' WHERE fecha > '2024-01-01'").to_df()

✅ What does DuckDB solve?

🚀 Speed: Its columnar, vectorized, multi-threaded engine often runs aggregations and joins over large volumes of data faster than Pandas.
🔌 Frictionless integration with Python, Pandas and local files.
📁 No need for servers or remote connections to use SQL.
🧠 Ideal for notebooks, ETL, prototyping and exploratory analysis.

🎯 When to use it?

If your datasets are large enough that Pandas becomes slow or runs short of memory.
If you want to use SQL directly on your files or DataFrames.
If you work in notebooks and want powerful queries without spinning up an external database.

🔗 Bonus: DuckDB also integrates with Polars, Arrow and other modern tools.

DuckDB is an in-process SQL OLAP database management system. Simple, feature-rich, fast & open source.

duckdb.org

Also published on LinkedIn.

Author

Juan Pedro Bretti Mandarano