Building Modern EDA Pipelines with Pingouin

📊 More Rigorous EDA Pipelines with Pingouin

Are your exploratory analyses just histograms and scatter plots? Time to level up. 📈

πŸ” What is Pingouin?
#

Pingouin is a Python library that bridges SciPy and pandas, enabling statistically rigorous EDA pipelines.

🧪 What You Can Validate with Pingouin

  • ✅ Univariate normality: Shapiro-Wilk test via pg.normality()
  • ✅ Homoscedasticity: Levene test via pg.homoscedasticity()
  • ✅ Advanced correlations: with p-values and robust statistics
  • ✅ Outliers and statistical tests: each typically a single function call

💻 Quick Example

import pingouin as pg
import pandas as pd

df = pd.read_csv("wine-quality.csv")

# Shapiro-Wilk normality test (Pingouin's default) on three numeric columns
normality = pg.normality(df[['pH', 'alcohol', 'fixed acidity']])
print(normality)

🚨 The Golden Rule: Garbage In, Garbage Out (GIGO)

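Feeding a model data that violates its statistical assumptions (for example, skewed inputs where normality is expected, or unequal variances across groups) is a reliable recipe for misleading estimates and poor predictions. Pingouin helps you detect these violations before modeling.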

💡 Explanation in a nutshell

Pingouin is a Python statistics library that builds on SciPy and exposes a pandas-friendly API, making statistical validation in EDA pipelines straightforward. It lets you check normality, homogeneity of variance, and correlations with p-values in just a few lines, which is a valuable safeguard before training any model that relies on those assumptions.

More information at the link 👇

Also published on LinkedIn.
Juan Pedro Bretti Mandarano
Author