Text Classification with Python 3.14's zstd Module


🗜️ Classify text… with compression? Python 3.14 makes this technique practical using only the standard library.

Python 3.14 added the compression.zstd module, a binding for Zstandard (the compression algorithm developed at Facebook), to the standard library. This opens an elegant and surprising way to classify text without traditional ML models.

🧠 The core idea: If you compress a text together with a corpus from a category, the output grows less the more the text resembles that category. The intuition comes from Kolmogorov complexity: compressed size approximates how much information the data contains, so a text that shares content with the corpus adds few extra bytes.

💡 The practical trick with zstd:

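from compression.zstd import ZstdCompressor, ZstdDict

# For each class, build a raw-content "dictionary" from its corpus.
# tacos_corpus and new_text must be bytes (e.g. some_str.encode("utf-8")).
zd_tacos = ZstdDict(tacos_corpus, is_raw=True)
comp = ZstdCompressor(zstd_dict=zd_tacos)

# Repeat for every class: the class whose compressor yields the
# shortest output wins. FLUSH_FRAME makes compress() return the complete
# frame; the default mode may buffer data and return only part of it.
len(comp.compress(new_text, mode=ZstdCompressor.FLUSH_FRAME))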

✨ Advantages:

  • Zero external dependencies (Python 3.14 stdlib)
  • Works in online/streaming mode: new examples are appended to a class's corpus, so no full retraining is needed
  • Very fast: rebuilding the compressor takes microseconds

⚠️ Limitations:

  • Less accurate than modern models like BERT
  • Best for low-latency or resource-constrained use cases

💡 Explanation in a nutshell

The idea is simple: to know whether a text is about “tacos” or “padel,” compress it alongside texts from each category. The text will compress better with the texts it resembles most. It’s a way to measure similarity using compression math, without training any model.

More information at the link 👇

Also published on LinkedIn.
Juan Pedro Bretti Mandarano
Author