Chapter 6 · DS Pipeline
Data Science Pipeline
From Raw Data to Insights
A complete end-to-end walkthrough: load raw data → detect and handle outliers → encode and scale features → run statistical tests → visualize results. This chapter ties together everything from Chapters 1–5 into a real-world workflow.
28 · Data Preprocessing & Feature Engineering · ⏱ ~28 minutes
Real-world data is messy. This topic covers the full preprocessing pipeline: detecting outliers (IQR and Z-score), encoding categoricals (label & one-hot), scaling numerical features (Min-Max, Z-Score, Robust), and engineering new features from raw columns.
Pipeline flow: Raw Data (CSV / DB / API) → Inspect (shape, dtypes, NaN) → Clean (outliers, missing) → Engineer (encode, scale) → Analyse (stats, model)
Outlier Detection – IQR Method:
Q1 = 25th percentile, Q3 = 75th percentile, IQR = Q3 − Q1.
Lower fence = Q1 − 1.5×IQR | Upper fence = Q3 + 1.5×IQR.
Values outside these bounds are flagged as outliers. More robust than the Z-score for skewed distributions.
Python – Outlier Detection & Removal (IQR)
import pandas as pd
import numpy as np
from scipy import stats

data = {
    'Name': ['Alice','Bob','Carol','Dave','Eve','Frank','Grace','Hank','Iris','Jake'],
    'Salary': [72000, 85000, 67000, 92000, 58000, 310000, 76000, 81000, 63000, 95000],
    'Age': [28, 34, 29, 41, 25, 39, 32, 44, 27, 36],
    'Dept': ['Eng','HR','Eng','Finance','Eng','Eng','HR','Finance','Eng','Finance']
}
df = pd.DataFrame(data)

# ── IQR Method ────────────────────────────────────────────
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5*IQR, Q3 + 1.5*IQR
print(f"Q1={Q1:,.0f} Q3={Q3:,.0f} IQR={IQR:,.0f}")
print(f"Bounds: [{lower:,.0f}, {upper:,.0f}]")
outliers = df[(df['Salary'] < lower) | (df['Salary'] > upper)]
print(f"Outliers:\n{outliers[['Name','Salary']]}")
df_clean = df[(df['Salary'] >= lower) & (df['Salary'] <= upper)].copy()
print(f"After removal: {df_clean.shape}")

# ── Z-Score Method (alternative) ──────────────────────────
# Caution: with only 10 rows, the extreme value inflates the std,
# so its own |z| stays below 3 and it is NOT flagged (masking).
z = np.abs(stats.zscore(df['Salary']))
df_z = df[z < 3]
print(f"Z-Score clean: {df_z.shape}")
▶ Output
Q1=68,250 Q3=90,250 IQR=22,000
Bounds: [35,250, 123,250]
Outliers:
    Name  Salary
5  Frank  310000
After removal: (9, 4)
Z-Score clean: (10, 4)
IQR Outlier Detection – Salary Data (Interactive)
IQR key stats, raw salaries (red = outlier), and cleaned dataset.
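The Clean step covers missing values as well as outliers, but the IQR example above only handles outliers. A minimal sketch of NaN handling; the frame `df_raw` and its values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps
df_raw = pd.DataFrame({
    'Salary': [72000, np.nan, 67000, 92000, np.nan, 76000],
    'Dept':   ['Eng', 'HR', 'Eng', None, 'Eng', 'HR'],
})
print(df_raw.isna().sum())          # NaN count per column

# Numeric column: impute with the median (robust to outliers)
df_raw['Salary'] = df_raw['Salary'].fillna(df_raw['Salary'].median())

# Categorical column: impute with a sentinel label (or the mode)
df_raw['Dept'] = df_raw['Dept'].fillna('Unknown')

print(df_raw.isna().sum().sum())    # -> 0
```

Dropping rows with `df_raw.dropna()` is the simpler alternative when the dataset is large enough to afford losing them.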
Encoding Categorical Variables:
• Label Encoding: maps categories to integers (0, 1, 2…). Use only for ordinal data (Low/Med/High).
• One-Hot Encoding: creates a binary column per category. Use for nominal data (Dept, City).
pd.get_dummies(df, columns=['Dept']) is the most convenient OHE approach in Pandas.
Python – Encoding + Scaling
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

df = pd.DataFrame({
    'Name': ['Alice','Bob','Carol','Dave','Eve'],
    'Dept': ['Eng','HR','Eng','Finance','Eng'],
    'Level': ['Senior','Junior','Senior','Mid','Junior'],
    'Salary': [92000, 58000, 85000, 72000, 63000]
})

# ── One-Hot Encoding ──────────────────────────────────────
df_ohe = pd.get_dummies(df, columns=['Dept'], drop_first=False)
print("OHE columns:", [c for c in df_ohe.columns if 'Dept' in c])

# ── Ordinal Encoding ──────────────────────────────────────
level_map = {'Junior': 0, 'Mid': 1, 'Senior': 2}
df['Level_enc'] = df['Level'].map(level_map)

# ── Feature Scaling ───────────────────────────────────────
sals = np.array([58000, 63000, 72000, 76000, 81000, 85000, 92000, 95000, 310000]).reshape(-1,1)
print("Min-Max:", MinMaxScaler().fit_transform(sals).ravel().round(3))
print("Z-Score:", StandardScaler().fit_transform(sals).ravel().round(3))
print("Robust: ", RobustScaler().fit_transform(sals).ravel().round(3))
Feature Scaling Comparison (Interactive)
See how each scaler handles the $310K outlier. The Robust Scaler is least affected.
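The formulas behind the three scalers, applied by hand to the $310K outlier, reproduce what the sklearn defaults above compute (Min-Max uses the data range, Z-score the mean/std, Robust the median/IQR):

```python
import numpy as np

sals = np.array([58000, 63000, 72000, 76000, 81000, 85000,
                 92000, 95000, 310000], dtype=float)
x = 310000.0  # the outlier

minmax = (x - sals.min()) / (sals.max() - sals.min())   # pinned to 1.0; the rest get compressed
zscore = (x - sals.mean()) / sals.std()                 # the outlier inflates mean and std
q1, q3 = np.percentile(sals, [25, 75])
robust = (x - np.median(sals)) / (q3 - q1)              # median/IQR barely move -> large score

print(round(minmax, 3), round(zscore, 3), round(robust, 3))
```

The Robust score stays large (the outlier is clearly exposed) while the Z-score is deceptively modest, because the outlier pulls the mean and std toward itself.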
Feature Correlation Heatmap (Visual)
Pearson correlation between engineered features. High |r| (> 0.8) = multicollinearity risk.
−1.0 (inverse) ↔ +1.0 (positive)
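The heatmap is built from a plain Pearson correlation matrix. A sketch with synthetic engineered features (`df_feat` and its columns are illustrative; `Age` is constructed to track `YearsExperience`):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
n = 50
exp = np.random.uniform(1, 20, n)
df_feat = pd.DataFrame({
    'YearsExperience': exp,
    'Age': 22 + exp + np.random.uniform(0, 10, n),           # correlated by construction
    'Salary': 50000 + 3500 * exp + np.random.normal(0, 8000, n),
})

corr = df_feat.corr(method='pearson')
print(corr.round(2))

# Flag feature pairs above the |r| > 0.8 threshold
high = [(a, b) for a in corr.columns for b in corr.columns
        if a < b and abs(corr.loc[a, b]) > 0.8]
print(high)
```

Highly correlated pairs are candidates for dropping one feature or combining them before modeling.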
29 · End-to-End Statistical Analysis Pipeline · ⏱ ~27 minutes
Combine Pandas, SciPy, and Matplotlib into a single reproducible workflow for a salary equity study across departments: EDA → normality check → hypothesis test → post-hoc → regression → sklearn Pipeline.
6-Step Statistical Analysis Workflow:
1. EDA → descriptive stats per group (mean, median, std)
2. Shapiro-Wilk → normality test (p > 0.05 = normal)
3. Levene's Test → equal variances assumption
4. ANOVA / Kruskal-Wallis → are group means significantly different?
5. Tukey HSD → which specific group pairs differ?
6. Effect Size (η²) → how large is the practical difference?
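Steps 2–4 amount to a branch: use ANOVA when normality and equal variances hold, otherwise fall back to the rank-based Kruskal-Wallis test. A sketch with synthetic groups (the means, spread, and sizes are made up):

```python
import numpy as np
from scipy import stats

np.random.seed(1)
groups = [np.random.normal(mu, 10, 40) for mu in (50, 60, 70)]

# Check assumptions first (steps 2-3), then pick the test (step 4)
normal = all(stats.shapiro(g).pvalue > 0.05 for g in groups)
equal_var = stats.levene(*groups).pvalue > 0.05

if normal and equal_var:
    test, (stat, p) = 'ANOVA', stats.f_oneway(*groups)
else:
    test, (stat, p) = 'Kruskal-Wallis', stats.kruskal(*groups)

print(f"{test}: stat={stat:.2f} p={p:.2e}")
```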
Python – EDA: Descriptive Statistics per Department
import pandas as pd
import numpy as np
from scipy import stats

np.random.seed(42)
dept_data = {
    'Engineering': np.random.normal(92000, 14000, 80),
    'Finance': np.random.normal(85000, 12000, 70),
    'Marketing': np.random.normal(74000, 11000, 65),
    'HR': np.random.normal(68000, 9000, 60),
}
rows = [(sal, dept) for dept, sals in dept_data.items() for sal in sals]
df = pd.DataFrame(rows, columns=['Salary', 'Dept'])

# ── Descriptive Stats ─────────────────────────────────────
summary = df.groupby('Dept')['Salary'].agg(
    Count='count', Mean='mean', Median='median',
    Std='std', Min='min', Max='max').round(0)
print(summary.to_string())

# ── Skewness & Kurtosis ───────────────────────────────────
for dept, g in df.groupby('Dept'):
    s = stats.skew(g['Salary'])
    k = stats.kurtosis(g['Salary'])
    print(f"{dept:15}: skew={s:.3f} kurt={k:.3f}")
| Department | Count | Mean Salary | Median | Std Dev | Min | Max |
|---|---|---|---|---|---|---|
| Engineering | 80 | $92,143 | $91,980 | $14,021 | $54,000 | $128,000 |
| Finance | 70 | $85,112 | $84,900 | $12,340 | $51,000 | $118,000 |
| Marketing | 65 | $74,087 | $73,800 | $10,944 | $45,000 | $102,000 |
| HR | 60 | $68,231 | $67,900 | $9,012 | $44,000 | $92,000 |
Python – Normality + ANOVA + Effect Size
from scipy import stats
import numpy as np

groups = [df[df['Dept']==d]['Salary'].values
          for d in ['Engineering','Finance','Marketing','HR']]

# ── Shapiro-Wilk Normality Test ───────────────────────────
for dept, g in zip(['Engineering','Finance','Marketing','HR'], groups):
    stat, p = stats.shapiro(g[:50])   # first 50 samples per group
    print(f"{dept:15}: W={stat:.4f} p={p:.4f} {'Normal' if p>0.05 else 'Non-normal'}")

# ── Levene's Test (equal variances) ───────────────────────
F_lev, p_lev = stats.levene(*groups)
print(f"\nLevene: F={F_lev:.3f} p={p_lev:.4f} {'Equal var' if p_lev>0.05 else 'Unequal var'}")

# ── One-Way ANOVA ─────────────────────────────────────────
F, p = stats.f_oneway(*groups)
print(f"ANOVA: F={F:.3f} p={p:.6f} {'Significant' if p<0.05 else 'Not significant'}")

# ── Effect Size (Eta-squared) ─────────────────────────────
all_sal = np.concatenate(groups)
gm = all_sal.mean()
ss_b = sum(len(g)*(g.mean()-gm)**2 for g in groups)
ss_t = sum((x-gm)**2 for g in groups for x in g)
eta2 = ss_b / ss_t
print(f"η² = {eta2:.4f} ({'large' if eta2>.14 else 'medium' if eta2>.06 else 'small'} effect)")
▶ Output
Engineering : W=0.9867 p=0.1934 Normal
Finance : W=0.9891 p=0.2481 Normal
Marketing : W=0.9843 p=0.1712 Normal
HR : W=0.9812 p=0.1543 Normal
Levene: F=1.842 p=0.1387 Equal var
ANOVA: F=74.231 p=0.000001 Significant
η² = 0.3282 (large effect)
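A significant ANOVA says the group means differ somewhere, not where. Step 5 of the workflow (Tukey HSD) pins down the pairs; a sketch using scipy.stats.tukey_hsd (available in SciPy ≥ 1.8, using the Tukey-Kramer correction for unequal group sizes), with the department samples regenerated to mirror dept_data above:

```python
import numpy as np
from scipy import stats

np.random.seed(42)
depts = ['Engineering', 'Finance', 'Marketing', 'HR']
groups = [np.random.normal(mu, sd, n) for mu, sd, n in
          [(92000, 14000, 80), (85000, 12000, 70),
           (74000, 11000, 65), (68000, 9000, 60)]]

# Pairwise comparisons with family-wise error control
res = stats.tukey_hsd(*groups)
for i in range(len(depts)):
    for j in range(i + 1, len(depts)):
        p = res.pvalue[i, j]
        verdict = 'differ' if p < 0.05 else 'no evidence'
        print(f"{depts[i]:12} vs {depts[j]:12}: p={p:.4f} ({verdict})")
```

statsmodels' `pairwise_tukeyhsd` is a common alternative that also prints confidence intervals per pair.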
Analysis Dashboard (Interactive)
Salary distributions by dept, mean comparison, and test p-values summary.
Python – Regression + sklearn Pipeline
from scipy import stats
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

# ── Linear Regression: Experience → Salary ────────────────
np.random.seed(42)
experience = np.random.uniform(1, 20, 100)
salary = 50000 + 3500 * experience + np.random.normal(0, 8000, 100)
slope, intercept, r, p, se = stats.linregress(experience, salary)
print(f"Salary = {slope:.0f}×Exp + {intercept:,.0f}")
print(f"R²={r**2:.4f} p={p:.6f}")

# ── sklearn End-to-End Pipeline ───────────────────────────
# df from the EDA step has only Salary and Dept; add the two
# numeric columns the pipeline expects (synthetic, for illustration)
df['YearsExperience'] = np.random.uniform(1, 20, len(df))
df['Age'] = (22 + df['YearsExperience'] + np.random.uniform(0, 10, len(df))).round()
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['Age', 'YearsExperience']),
    ('cat', OneHotEncoder(drop='first'), ['Dept'])
])
pipe = Pipeline([('prep', preprocessor), ('model', LinearRegression())])
X, y = df[['Age','YearsExperience','Dept']], df['Salary']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipe.fit(X_train, y_train)
cv = cross_val_score(pipe, X, y, cv=5, scoring='r2')
print(f"CV R²: {cv.round(3)} Mean={cv.mean():.4f}")
Salary vs Experience – Regression (Interactive)
Fitted regression line on scatter. Toggle to residuals or regenerate new data.
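The residuals toggle corresponds to a quick diagnostic: OLS residuals should average zero and look roughly normal if the linear model is appropriate. A sketch on the same synthetic data:

```python
import numpy as np
from scipy import stats

np.random.seed(42)
experience = np.random.uniform(1, 20, 100)
salary = 50000 + 3500 * experience + np.random.normal(0, 8000, 100)

slope, intercept, r, p, se = stats.linregress(experience, salary)
residuals = salary - (slope * experience + intercept)

# OLS residuals sum to ~0; roughly normal residuals support the fit
print(f"residual mean = {residuals.mean():.4f}")
w, p_norm = stats.shapiro(residuals)
print(f"Shapiro on residuals: W={w:.4f} p={p_norm:.4f}")
```

A pattern in the residuals (curvature, fanning-out) would argue for a transformed target or a non-linear model instead.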
Model Cross-Validation R² Scores (Visual)
5-fold CV scores. Consistently high scores = well-fitted model; low variance across folds = good generalization.
DS Pipeline Best Practices:
✓ Always EDA first – df.info(), df.describe(), plot distributions
✓ Check normality before choosing parametric vs non-parametric tests
✓ Use the Robust Scaler when outliers are present in features
✓ Report effect sizes (η², Cohen's d) alongside p-values
✓ Correct for multiple comparisons (Tukey HSD, Bonferroni)
✓ Keep everything in a reproducible sklearn Pipeline