CSV vs. Parquet vs. Arrow: Storage Formats Explained

📊 CSV vs. Parquet vs. Arrow: Which one should you use?

If you work with tabular data in Python, you’ll eventually encounter these three formats. Do you know when to use each one?

CSV: Simple and universal

  • Plain text, one row per line
  • Compatible with Excel, pandas, databases, any tool
  • ❌ No explicit schema: types must be inferred on every load
  • ❌ Slow with large datasets: parsing text is CPU-intensive
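To make the "no explicit schema" point concrete, here is a minimal sketch using only Python's standard library. The sample data and column names are made up for illustration.

```python
import csv
import io

# Hypothetical sample data; the column names are invented for this example.
raw = "name,age,city\nAna,34,Lima\nLuis,28,Quito\n"

rows = list(csv.DictReader(io.StringIO(raw)))

# Every value comes back as a string: the file carries no type information.
print(rows[0])  # {'name': 'Ana', 'age': '34', 'city': 'Lima'}

# The reader (or a library such as pandas) has to decide the types on each load.
ages = [int(row["age"]) for row in rows]
print(sum(ages) / len(ages))  # 31.0
```

Libraries like pandas hide this step by guessing types for you, but the guess is repeated each time the file is parsed.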

Parquet: For large-scale analytics

  • Binary columnar format: stores all values of each column together
  • ✅ Excellent compression (Snappy, Gzip): files are often 5-10x smaller than the equivalent CSV, depending on the data
  • ✅ Reads only the columns you need (column pruning)
  • ✅ Schema stored: types preserved without inference
  • Ideal for batch processing and millions/billions of rows

Arrow: The in-memory format

  • Columnar in RAM: designed for ultra-fast operations
  • ✅ Zero-copy reads: a memory-mapped Arrow file can be used directly, without deserializing or copying it into memory first
  • ✅ Vectorized operations on columns
  • ✅ Interoperable with pandas, Polars, Spark, PyArrow
  • In Hugging Face Datasets, every dataset is backed by an Arrow table

πŸ” Explanation in a nutshell

“Columnar” means that instead of storing row by row (name1, age1, city1 / name2, age2, city2), all names are stored together, all ages together. This greatly speeds up queries that only need a few columns and allows similar data to compress better.
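The difference between the two layouts can be illustrated in plain Python. This is a toy model with made-up data, not how Parquet or Arrow are implemented internally; the run-length encoding at the end stands in for the general idea that similar adjacent values compress well.

```python
from itertools import groupby

# Row-oriented layout: each record keeps its fields together.
rows = [
    ("Ana", 34, "Lima"),
    ("Eva", 41, "Lima"),
    ("Luis", 28, "Quito"),
]

# Column-oriented layout: all values of each field are stored together.
columns = {
    "name": [r[0] for r in rows],
    "age": [r[1] for r in rows],
    "city": [r[2] for r in rows],
}

# A query that needs only ages touches a single list in the columnar layout...
avg_columnar = sum(columns["age"]) / len(columns["age"])

# ...but has to walk every full record in the row layout.
avg_rows = sum(r[1] for r in rows) / len(rows)

# Similar values sit next to each other in a column, so even a simple
# run-length encoding shrinks the repeated "Lima" entries.
city_runs = [(city, len(list(group))) for city, group in groupby(columns["city"])]
print(city_runs)  # [('Lima', 2), ('Quito', 1)]
```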

📌 Practical rule: CSV for quick experiments → Parquet for storing large tables → Arrow for fast in-memory training.

Also published on LinkedIn.
Juan Pedro Bretti Mandarano
Author