Building Modern EDA Pipelines with Pingouin

📊 More Rigorous EDA Pipelines with Pingouin

Are your exploratory analyses just histograms and scatter plots? Time to level up. 📈

🔍 What is Pingouin?

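Pingouin is an open-source Python statistics library built on top of pandas, NumPy, and SciPy. It accepts pandas objects as input and returns most results as DataFrames, which makes statistically rigorous EDA pipelines easier to build.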

🧪 What You Can Validate with Pingouin

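✅ Univariate normality: Shapiro-Wilk test via pg.normality()
✅ Homoscedasticity: Levene test via pg.homoscedasticity()
✅ Advanced correlations: pg.corr() reports p-values and supports robust methods such as percentage bend and biweight midcorrelation
✅ Outliers and statistical tests: pg.madmedianrule() flags outliers, and tests such as pg.ttest() or pg.anova() return a full results table from a single call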

💻 Quick Example

import pingouin as pg
import pandas as pd

df = pd.read_csv("wine-quality.csv")

# Normality test
normality = pg.normality(df[['pH', 'alcohol', 'fixed acidity']])
print(normality)

🚨 The Golden Rule: Garbage In, Garbage Out (GIGO)

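Feeding a model data that violates its mathematical assumptions (for example, normally distributed values or equal variances across groups) is a reliable recipe for misleading estimates and poorly performing models. Pingouin helps you detect these issues before modeling.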

💡 Explanation in a nutshell

Pingouin is a Python statistics library that extends SciPy with a pandas-like API, making statistical validation in EDA pipelines straightforward. It lets you check normality, variance homogeneity, and correlations with p-values in just a few lines, which is essential validation before training any ML model.