
DuckDB

🦆 What is DuckDB and why is it revolutionizing data analysis in Python?

If you work with data in CSV files, Parquet files, or databases, and you get frustrated having to load it into a traditional database or wait minutes for Pandas to finish a groupby… 💡 you should get to know DuckDB.

What is DuckDB?

It's an embedded SQL database engine, like SQLite, but optimized for analytical (OLAP) queries rather than transactional workloads.

It runs inside the same process as your program, so there is no server to install and no complex configuration.

🐍 How do you use it from Python?

Very simple:

import duckdb
import pandas as pd

df = pd.read_csv("ventas.csv")

# Run SQL directly on the DataFrame: DuckDB finds `df` by its variable name
resultado = duckdb.query("SELECT categoria, SUM(monto) AS total FROM df GROUP BY categoria").to_df()
print(resultado)

You can also read Parquet or CSV files directly without loading them into Pandas:

duckdb.query("SELECT * FROM 'datos.parquet' WHERE fecha > '2024-01-01'").to_df()

✅ What does DuckDB solve?

  • 🚀 Speed: its columnar, vectorized, multi-threaded engine handles large aggregations and joins faster than Pandas in many cases.
  • 🔌 Frictionless integration with Python, Pandas and local files.
  • No servers or remote connections needed to use SQL.
  • 🧠 Ideal for notebooks, ETL, prototyping and exploratory analysis.

🎯 When to use it?

  • If your datasets are large enough that Pandas becomes slow or runs short of memory.
  • If you want to use SQL directly on your files or DataFrames.
  • If you work in notebooks and want powerful queries without spinning up an external database.

🔗 Bonus: DuckDB also integrates with Polars, Arrow and other modern tools.

More information at the link 👇

Also published on LinkedIn.
Juan Pedro Bretti Mandarano
Author