Mimesis: Anonymize Production Data for Data Science with Python

🔒 The Problem of Using Real Data in Data Science — and the Python Solution

Production data often contains PII (Personally Identifiable Information) that is subject to privacy regulations. Mimesis is an open-source Python library that generates fake but realistic data to replace those sensitive fields. 🛡️

🎯 What Does Mimesis Do?

It generates realistic synthetic data (names, emails, phone numbers, addresses, dates, and more) locally, without sending anything to the cloud.

💻 Practical Example

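from mimesis import Person
from mimesis.locales import Locale

# df is assumed to be an existing pandas DataFrame with
# 'real_name', 'email' and 'phone' columns
person = Person(locale=Locale.EN, seed=42)

# Replace sensitive columns with synthetic values, one per row
df['real_name'] = [person.full_name() for _ in range(len(df))]
df['email'] = [person.email() for _ in range(len(df))]
df['phone'] = [person.telephone() for _ in range(len(df))]

# Rename the overwritten column so it is clear the names are synthetic
df = df.rename(columns={'real_name': 'anon_name'})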

📋 Result

user_id  anon_name          email                  phone        subscription_tier
101      Anthony Reilly     [email protected]  +13312271333  Premium
102      Kai Day            [email protected]  ...           Basic

Sensitive fields change; subscription_tier remains intact. ✅

🏆 Best Practices

  • Use seed for reproducibility across runs
  • Consider saving to a separate DataFrame to avoid losing original data
  • Generated values are plain strings, so name, email, and phone columns keep their original text type

💡 Explanation in a nutshell

Mimesis solves a common data science problem: we need real data to develop and test models, but that data contains sensitive information that GDPR and similar regulations restrict us from using freely. The approach here is anonymization: replacing PII fields (names, emails, phone numbers) with synthetic but realistic values. Mimesis does this locally in Python, with a clean API and support for many locales, which makes it a good fit for creating safe development datasets from production snapshots.

More information at the link 👇

Also published on LinkedIn.
Juan Pedro Bretti Mandarano
Author