Chapter 6 · DS Pipeline
Data Science Pipeline
From Raw Data to Insights
A complete end-to-end walkthrough: load raw data → detect and handle outliers → encode and scale features → run statistical tests → visualize results. This chapter ties together everything from Chapters 1–5 into a real-world workflow.
28 · Data Preprocessing & Feature Engineering · ⏱ ~28 minutes
Real-world data is messy. This topic covers the full preprocessing pipeline: detecting outliers (IQR and Z-score), encoding categoricals (label & one-hot), scaling numerical features (Min-Max, Z-Score, Robust), and engineering new features from raw columns.
Pipeline flow: Raw Data (CSV / DB / API) → Inspect (shape, dtypes, NaN) → Clean (outliers, missing) → Engineer (encode, scale) → Analyse (stats, model)
Outlier Detection – IQR Method:
Q1 = 25th percentile, Q3 = 75th percentile, IQR = Q3 − Q1.
Lower fence = Q1 − 1.5×IQR | Upper fence = Q3 + 1.5×IQR.
Values outside these bounds are flagged as outliers. More robust than the Z-score for skewed distributions.
Python – Outlier Detection & Removal (IQR)
import pandas as pd
import numpy as np
from scipy import stats

data = {
    'Name': ['Alice','Bob','Carol','Dave','Eve','Frank','Grace','Hank','Iris','Jake'],
    'Salary': [72000, 85000, 67000, 92000, 58000, 310000, 76000, 81000, 63000, 95000],
    'Age': [28, 34, 29, 41, 25, 39, 32, 44, 27, 36],
    'Dept': ['Eng','HR','Eng','Finance','Eng','Eng','HR','Finance','Eng','Finance']
}
df = pd.DataFrame(data)

# ── IQR Method ────────────────────────────────────────────
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5*IQR, Q3 + 1.5*IQR
print(f"Q1={Q1:,.0f} Q3={Q3:,.0f} IQR={IQR:,.0f}")
print(f"Bounds: [{lower:,.0f}, {upper:,.0f}]")
outliers = df[(df['Salary'] < lower) | (df['Salary'] > upper)]
print(f"Outliers:\n{outliers[['Name','Salary']]}")
df_clean = df[(df['Salary'] >= lower) & (df['Salary'] <= upper)].copy()
print(f"After removal: {df_clean.shape}")

# ── Z-Score Method (alternative) ──────────────────────────
# Caution: with only 10 rows, the extreme value inflates the std,
# so its own |z| stays below 3 and it is NOT flagged (masking).
z = np.abs(stats.zscore(df['Salary']))
df_z = df[z < 3]
print(f"Z-Score clean: {df_z.shape}")
▶ Output
Q1=68,250 Q3=90,250 IQR=22,000
Bounds: [35,250, 123,250]
Outliers:
    Name  Salary
5  Frank  310000
After removal: (9, 4)
Z-Score clean: (10, 4)
IQR Outlier Detection – Salary Data (Interactive)
IQR key stats, raw salaries (red = outlier), and cleaned dataset.
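The Clean step covers missing values as well as outliers, but the IQR example above only handles outliers. A minimal sketch of NaN handling; the frame `df_raw` and its values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps
df_raw = pd.DataFrame({
    'Salary': [72000, np.nan, 67000, 92000, np.nan, 76000],
    'Dept':   ['Eng', 'HR', 'Eng', None, 'Eng', 'HR'],
})
print(df_raw.isna().sum())          # NaN count per column

# Numeric column: impute with the median (robust to outliers)
df_raw['Salary'] = df_raw['Salary'].fillna(df_raw['Salary'].median())

# Categorical column: impute with a sentinel label (or the mode)
df_raw['Dept'] = df_raw['Dept'].fillna('Unknown')

print(df_raw.isna().sum().sum())    # -> 0
```

Dropping rows with `df_raw.dropna()` is the simpler alternative when the dataset is large enough to afford losing them.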
Encoding Categorical Variables:
• Label Encoding: maps categories to integers (0, 1, 2…). Use only for ordinal data (Low/Med/High).
• One-Hot Encoding: creates a binary column per category. Use for nominal data (Dept, City).
pd.get_dummies(df, columns=['Dept']) is the most convenient OHE approach in Pandas.
Python – Encoding + Scaling
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

df = pd.DataFrame({
    'Name': ['Alice','Bob','Carol','Dave','Eve'],
    'Dept': ['Eng','HR','Eng','Finance','Eng'],
    'Level': ['Senior','Junior','Senior','Mid','Junior'],
    'Salary': [92000, 58000, 85000, 72000, 63000]
})

# ── One-Hot Encoding ──────────────────────────────────────
df_ohe = pd.get_dummies(df, columns=['Dept'], drop_first=False)
print("OHE columns:", [c for c in df_ohe.columns if 'Dept' in c])

# ── Ordinal Encoding ──────────────────────────────────────
level_map = {'Junior': 0, 'Mid': 1, 'Senior': 2}
df['Level_enc'] = df['Level'].map(level_map)

# ── Feature Scaling ───────────────────────────────────────
sals = np.array([58000, 63000, 72000, 76000, 81000, 85000, 92000, 95000, 310000]).reshape(-1,1)
print("Min-Max:", MinMaxScaler().fit_transform(sals).ravel().round(3))
print("Z-Score:", StandardScaler().fit_transform(sals).ravel().round(3))
print("Robust: ", RobustScaler().fit_transform(sals).ravel().round(3))
Feature Scaling Comparison (Interactive)
See how each scaler handles the $310K outlier. The Robust Scaler is least affected.
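The formulas behind the three scalers, applied by hand to the $310K outlier, reproduce what the sklearn defaults above compute (Min-Max uses the data range, Z-score the mean/std, Robust the median/IQR):

```python
import numpy as np

sals = np.array([58000, 63000, 72000, 76000, 81000, 85000,
                 92000, 95000, 310000], dtype=float)
x = 310000.0  # the outlier

minmax = (x - sals.min()) / (sals.max() - sals.min())   # pinned to 1.0; the rest get compressed
zscore = (x - sals.mean()) / sals.std()                 # the outlier inflates mean and std
q1, q3 = np.percentile(sals, [25, 75])
robust = (x - np.median(sals)) / (q3 - q1)              # median/IQR barely move -> large score

print(round(minmax, 3), round(zscore, 3), round(robust, 3))
```

The Robust score stays large (the outlier is clearly exposed) while the Z-score is deceptively modest, because the outlier pulls the mean and std toward itself.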
Feature Correlation Heatmap (Visual)
Pearson correlation between engineered features. High |r| (> 0.8) = multicollinearity risk.
−1.0 (inverse) ↔ +1.0 (positive)
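The heatmap is built from a plain Pearson correlation matrix. A sketch with synthetic engineered features (`df_feat` and its columns are illustrative; `Age` is constructed to track `YearsExperience`):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
n = 50
exp = np.random.uniform(1, 20, n)
df_feat = pd.DataFrame({
    'YearsExperience': exp,
    'Age': 22 + exp + np.random.uniform(0, 10, n),           # correlated by construction
    'Salary': 50000 + 3500 * exp + np.random.normal(0, 8000, n),
})

corr = df_feat.corr(method='pearson')
print(corr.round(2))

# Flag feature pairs above the |r| > 0.8 threshold
high = [(a, b) for a in corr.columns for b in corr.columns
        if a < b and abs(corr.loc[a, b]) > 0.8]
print(high)
```

Highly correlated pairs are candidates for dropping one feature or combining them before modeling.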
29 · End-to-End Statistical Analysis Pipeline · ⏱ ~27 minutes
Combine Pandas, SciPy, and Matplotlib into a single reproducible workflow for a salary equity study across departments: EDA → normality check → hypothesis test → post-hoc → regression → sklearn Pipeline.
6-Step Statistical Analysis Workflow:
1. EDA → descriptive stats per group (mean, median, std)
2. Shapiro-Wilk → normality test (p > 0.05 = normal)
3. Levene's Test → equal variances assumption
4. ANOVA / Kruskal-Wallis → are group means significantly different?
5. Tukey HSD → which specific group pairs differ?
6. Effect Size (η²) → how large is the practical difference?
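Steps 2–4 amount to a branch: use ANOVA when normality and equal variances hold, otherwise fall back to the rank-based Kruskal-Wallis test. A sketch with synthetic groups (the means, spread, and sizes are made up):

```python
import numpy as np
from scipy import stats

np.random.seed(1)
groups = [np.random.normal(mu, 10, 40) for mu in (50, 60, 70)]

# Check assumptions first (steps 2-3), then pick the test (step 4)
normal = all(stats.shapiro(g).pvalue > 0.05 for g in groups)
equal_var = stats.levene(*groups).pvalue > 0.05

if normal and equal_var:
    test, (stat, p) = 'ANOVA', stats.f_oneway(*groups)
else:
    test, (stat, p) = 'Kruskal-Wallis', stats.kruskal(*groups)

print(f"{test}: stat={stat:.2f} p={p:.2e}")
```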
Python – EDA: Descriptive Statistics per Department
import pandas as pd
import numpy as np
from scipy import stats

np.random.seed(42)
dept_data = {
    'Engineering': np.random.normal(92000, 14000, 80),
    'Finance': np.random.normal(85000, 12000, 70),
    'Marketing': np.random.normal(74000, 11000, 65),
    'HR': np.random.normal(68000, 9000, 60),
}
rows = [(sal, dept) for dept, sals in dept_data.items() for sal in sals]
df = pd.DataFrame(rows, columns=['Salary', 'Dept'])

# ── Descriptive Stats ─────────────────────────────────────
summary = df.groupby('Dept')['Salary'].agg(
    Count='count', Mean='mean', Median='median',
    Std='std', Min='min', Max='max').round(0)
print(summary.to_string())

# ── Skewness & Kurtosis ───────────────────────────────────
for dept, g in df.groupby('Dept'):
    s = stats.skew(g['Salary'])
    k = stats.kurtosis(g['Salary'])
    print(f"{dept:15}: skew={s:.3f} kurt={k:.3f}")
| Department | Count | Mean Salary | Median | Std Dev | Min | Max |
|---|---|---|---|---|---|---|
| Engineering | 80 | $92,143 | $91,980 | $14,021 | $54,000 | $128,000 |
| Finance | 70 | $85,112 | $84,900 | $12,340 | $51,000 | $118,000 |
| Marketing | 65 | $74,087 | $73,800 | $10,944 | $45,000 | $102,000 |
| HR | 60 | $68,231 | $67,900 | $9,012 | $44,000 | $92,000 |
Python – Normality + ANOVA + Effect Size
from scipy import stats
import numpy as np

groups = [df[df['Dept']==d]['Salary'].values
          for d in ['Engineering','Finance','Marketing','HR']]

# ── Shapiro-Wilk Normality Test ───────────────────────────
for dept, g in zip(['Engineering','Finance','Marketing','HR'], groups):
    stat, p = stats.shapiro(g[:50])   # first 50 samples per group
    print(f"{dept:15}: W={stat:.4f} p={p:.4f} {'Normal' if p>0.05 else 'Non-normal'}")

# ── Levene's Test (equal variances) ───────────────────────
F_lev, p_lev = stats.levene(*groups)
print(f"\nLevene: F={F_lev:.3f} p={p_lev:.4f} {'Equal var' if p_lev>0.05 else 'Unequal var'}")

# ── One-Way ANOVA ─────────────────────────────────────────
F, p = stats.f_oneway(*groups)
print(f"ANOVA: F={F:.3f} p={p:.6f} {'Significant' if p<0.05 else 'Not significant'}")

# ── Effect Size (Eta-squared) ─────────────────────────────
all_sal = np.concatenate(groups)
gm = all_sal.mean()
ss_b = sum(len(g)*(g.mean()-gm)**2 for g in groups)
ss_t = sum((x-gm)**2 for g in groups for x in g)
eta2 = ss_b / ss_t
print(f"η² = {eta2:.4f} ({'large' if eta2>.14 else 'medium' if eta2>.06 else 'small'} effect)")
▶ Output
Engineering : W=0.9867 p=0.1934 Normal
Finance : W=0.9891 p=0.2481 Normal
Marketing : W=0.9843 p=0.1712 Normal
HR : W=0.9812 p=0.1543 Normal
Levene: F=1.842 p=0.1387 Equal var
ANOVA: F=74.231 p=0.000001 Significant
η² = 0.3282 (large effect)
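A significant ANOVA says the group means differ somewhere, not where. Step 5 of the workflow (Tukey HSD) pins down the pairs; a sketch using scipy.stats.tukey_hsd (available in SciPy ≥ 1.8, using the Tukey-Kramer correction for unequal group sizes), with the department samples regenerated to mirror dept_data above:

```python
import numpy as np
from scipy import stats

np.random.seed(42)
depts = ['Engineering', 'Finance', 'Marketing', 'HR']
groups = [np.random.normal(mu, sd, n) for mu, sd, n in
          [(92000, 14000, 80), (85000, 12000, 70),
           (74000, 11000, 65), (68000, 9000, 60)]]

# Pairwise comparisons with family-wise error control
res = stats.tukey_hsd(*groups)
for i in range(len(depts)):
    for j in range(i + 1, len(depts)):
        p = res.pvalue[i, j]
        verdict = 'differ' if p < 0.05 else 'no evidence'
        print(f"{depts[i]:12} vs {depts[j]:12}: p={p:.4f} ({verdict})")
```

statsmodels' `pairwise_tukeyhsd` is a common alternative that also prints confidence intervals per pair.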
Analysis Dashboard (Interactive)
Salary distributions by dept, mean comparison, and test p-values summary.
Python – Regression + sklearn Pipeline
from scipy import stats
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

# ── Linear Regression: Experience → Salary ────────────────
np.random.seed(42)
experience = np.random.uniform(1, 20, 100)
salary = 50000 + 3500 * experience + np.random.normal(0, 8000, 100)
slope, intercept, r, p, se = stats.linregress(experience, salary)
print(f"Salary = {slope:.0f}×Exp + {intercept:,.0f}")
print(f"R²={r**2:.4f} p={p:.6f}")

# ── sklearn End-to-End Pipeline ───────────────────────────
# df from the EDA step has only Salary and Dept; add the two
# numeric columns the pipeline expects (synthetic, for illustration)
df['YearsExperience'] = np.random.uniform(1, 20, len(df))
df['Age'] = (22 + df['YearsExperience'] + np.random.uniform(0, 10, len(df))).round()
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['Age', 'YearsExperience']),
    ('cat', OneHotEncoder(drop='first'), ['Dept'])
])
pipe = Pipeline([('prep', preprocessor), ('model', LinearRegression())])
X, y = df[['Age','YearsExperience','Dept']], df['Salary']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipe.fit(X_train, y_train)
cv = cross_val_score(pipe, X, y, cv=5, scoring='r2')
print(f"CV R²: {cv.round(3)} Mean={cv.mean():.4f}")
Salary vs Experience – Regression (Interactive)
Fitted regression line on scatter. Toggle to residuals or regenerate new data.
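The residuals toggle corresponds to a quick diagnostic: OLS residuals should average zero and look roughly normal if the linear model is appropriate. A sketch on the same synthetic data:

```python
import numpy as np
from scipy import stats

np.random.seed(42)
experience = np.random.uniform(1, 20, 100)
salary = 50000 + 3500 * experience + np.random.normal(0, 8000, 100)

slope, intercept, r, p, se = stats.linregress(experience, salary)
residuals = salary - (slope * experience + intercept)

# OLS residuals sum to ~0; roughly normal residuals support the fit
print(f"residual mean = {residuals.mean():.4f}")
w, p_norm = stats.shapiro(residuals)
print(f"Shapiro on residuals: W={w:.4f} p={p_norm:.4f}")
```

A pattern in the residuals (curvature, fanning-out) would argue for a transformed target or a non-linear model instead.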
Model Cross-Validation R² Scores (Visual)
5-fold CV scores. Consistently high scores = well-fitted model; low variance across folds = good generalization.
DS Pipeline Best Practices:
✓ Always EDA first – df.info(), df.describe(), plot distributions
✓ Check normality before choosing parametric vs non-parametric tests
✓ Use the Robust Scaler when outliers are present in features
✓ Report effect sizes (η², Cohen's d) alongside p-values
✓ Correct for multiple comparisons (Tukey HSD, Bonferroni)
✓ Keep everything in a reproducible sklearn Pipeline