Chapter 6 · DS Pipeline

Data Science Pipeline — From Raw Data to Insights

A complete end-to-end walkthrough: load raw data → detect and handle outliers → encode and scale features → run statistical tests → visualize results. This chapter ties together everything from Chapters 1–5 into a real-world workflow.

โฑ ~55 minutes estimated
๐Ÿ“˜ 2 topics covered
๐Ÿ“Š 6 interactive charts
๐ŸŽฏ Advanced level

Data Preprocessing & Feature Engineering

โฑ ~28 minutes

Real-world data is messy. This topic covers the full preprocessing pipeline: detecting outliers (IQR and Z-score), encoding categoricals (label & one-hot), scaling numerical features (Min-Max, Z-Score, Robust), and engineering new features from raw columns.

📥 Raw Data (CSV / DB / API) → 🔍 Inspect (shape, dtypes, NaN) → 🧹 Clean (outliers, missing) → ⚙️ Engineer (encode, scale) → 📊 Analyse (stats, model)
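The Inspect stage above boils down to a few quick Pandas calls; a minimal sketch on a hypothetical raw frame (column names are illustrative):

```python
import pandas as pd
import numpy as np

# hypothetical raw frame standing in for a CSV / DB / API load
df = pd.DataFrame({
    'Salary': [72000, np.nan, 67000, 310000],
    'Dept':   ['Eng', 'HR', 'Eng', None],
})

print(df.shape)           # (rows, columns)
print(df.dtypes)          # per-column data types
print(df.isna().sum())    # missing-value count per column
```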
Outlier Detection — IQR Method:
Q1 = 25th percentile, Q3 = 75th percentile, IQR = Q3 − Q1.
Lower fence = Q1 − 1.5×IQR  |  Upper fence = Q3 + 1.5×IQR.
Values outside these fences are flagged as outliers. The IQR method is more robust than the Z-score for skewed distributions.
Python — Outlier Detection & Removal (IQR)
import pandas as pd
import numpy as np
from scipy import stats

data = {
    'Name':   ['Alice','Bob','Carol','Dave','Eve','Frank','Grace','Hank','Iris','Jake'],
    'Salary': [72000, 85000, 67000, 92000, 58000, 310000, 76000, 81000, 63000, 95000],
    'Age':    [28, 34, 29, 41, 25, 39, 32, 44, 27, 36],
    'Dept':   ['Eng','HR','Eng','Finance','Eng','Eng','HR','Finance','Eng','Finance']
}
df = pd.DataFrame(data)

# ── IQR Method ───────────────────────────────────────────────
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5*IQR, Q3 + 1.5*IQR

print(f"Q1={Q1:,.0f}  Q3={Q3:,.0f}  IQR={IQR:,.0f}")
print(f"Bounds: [{lower:,.0f}, {upper:,.0f}]")
outliers = df[(df['Salary'] < lower) | (df['Salary'] > upper)]
print(f"Outliers:\n{outliers[['Name','Salary']]}")

df_clean = df[(df['Salary'] >= lower) & (df['Salary'] <= upper)].copy()
print(f"After removal: {df_clean.shape}")

# ── Z-Score Method (alternative) ─────────────────────────────
z = np.abs(stats.zscore(df['Salary']))   # population z-scores (ddof=0)
df_z = df[z < 3]   # note: the $310K point has |z| ≈ 2.96, so nothing is dropped
print(f"Z-Score clean: {df_z.shape}")
▶ Output
Q1=68,250  Q3=90,250  IQR=22,000
Bounds: [35,250, 123,250]
Outliers:
    Name  Salary
5  Frank  310000
After removal: (9, 4)
Z-Score clean: (10, 4)
📦 IQR Outlier Detection — Salary Data (Interactive)
IQR key stats, raw salaries (red = outlier), and cleaned dataset.
Encoding Categorical Variables:
• Label Encoding: maps categories to integers (0, 1, 2…). Use only for ordinal data (Low/Med/High).
• One-Hot Encoding: creates one binary column per category. Use for nominal data (Dept, City).
• pd.get_dummies(df, columns=['Dept']) is the most convenient one-hot-encoding route in Pandas.
Python — Encoding + Scaling
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

df = pd.DataFrame({
    'Name':   ['Alice','Bob','Carol','Dave','Eve'],
    'Dept':   ['Eng','HR','Eng','Finance','Eng'],
    'Level':  ['Senior','Junior','Senior','Mid','Junior'],
    'Salary': [92000, 58000, 85000, 72000, 63000]
})

# ── One-Hot Encoding ─────────────────────────────────────────
df_ohe = pd.get_dummies(df, columns=['Dept'], drop_first=False)
print("OHE columns:", [c for c in df_ohe.columns if 'Dept' in c])

# ── Ordinal Encoding ─────────────────────────────────────────
level_map = {'Junior': 0, 'Mid': 1, 'Senior': 2}
df['Level_enc'] = df['Level'].map(level_map)

# ── Feature Scaling ──────────────────────────────────────────
sals = np.array([58000, 63000, 72000, 76000, 81000, 85000, 92000, 95000, 310000]).reshape(-1,1)
print("Min-Max:", MinMaxScaler().fit_transform(sals).ravel().round(3))
print("Z-Score:", StandardScaler().fit_transform(sals).ravel().round(3))
print("Robust: ", RobustScaler().fit_transform(sals).ravel().round(3))
โš–๏ธ Feature Scaling Comparison Interactive
See how each scaler handles the $310K outlier. Robust Scaler is least affected.
๐Ÿ“ Feature Correlation Heatmap Visual
Pearson correlation between engineered features. High |r| > 0.8 = multicollinearity risk.
โˆ’1.0 (inverse) โ† โ†’ +1.0 (positive)
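The heatmap above is a rendering of a plain Pandas correlation matrix; a minimal sketch with synthetic features (column names are hypothetical):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
exp = rng.uniform(1, 20, 200)
df = pd.DataFrame({
    'YearsExperience': exp,
    'Age':    22 + exp + rng.normal(0, 3, 200),
    'Salary': 50000 + 3500 * exp + rng.normal(0, 8000, 200),
})

# pairwise Pearson r; cells with |r| > 0.8 flag multicollinearity risk
corr = df.corr(method='pearson')
print(corr.round(2))
```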

End-to-End Statistical Analysis Pipeline

โฑ ~27 minutes

Combine Pandas, SciPy, and Matplotlib into a single reproducible workflow for a salary equity study across departments: EDA → normality check → hypothesis test → post-hoc → regression → sklearn Pipeline.

6-Step Statistical Analysis Workflow:
1. EDA — descriptive stats per group (mean, median, std)
2. Shapiro-Wilk — normality test (p > 0.05: no evidence against normality)
3. Levene's Test — equal-variances assumption
4. ANOVA / Kruskal-Wallis — are group means significantly different?
5. Tukey HSD — which specific group pairs differ?
6. Effect Size (η²) — how large is the practical difference?
Python — EDA: Descriptive Statistics per Department
import pandas as pd
import numpy as np
from scipy import stats

np.random.seed(42)
dept_data = {
    'Engineering': np.random.normal(92000, 14000, 80),
    'Finance':     np.random.normal(85000, 12000, 70),
    'Marketing':   np.random.normal(74000, 11000, 65),
    'HR':          np.random.normal(68000,  9000, 60),
}
rows = [(sal, dept) for dept, sals in dept_data.items() for sal in sals]
df = pd.DataFrame(rows, columns=['Salary', 'Dept'])

# ── Descriptive Stats ────────────────────────────────────────
summary = df.groupby('Dept')['Salary'].agg(
    Count='count', Mean='mean', Median='median',
    Std='std', Min='min', Max='max').round(0)
print(summary.to_string())

# ── Skewness & Kurtosis ──────────────────────────────────────
for dept, g in df.groupby('Dept'):
    s = stats.skew(g['Salary'])
    k = stats.kurtosis(g['Salary'])
    print(f"{dept:15}: skew={s:.3f}  kurt={k:.3f}")
Department    Count  Mean Salary  Median   Std Dev  Min      Max
Engineering      80      $92,143  $91,980  $14,021  $54,000  $128,000
Finance          70      $85,112  $84,900  $12,340  $51,000  $118,000
Marketing        65      $74,087  $73,800  $10,944  $45,000  $102,000
HR               60      $68,231  $67,900   $9,012  $44,000   $92,000
Python — Normality + ANOVA + Effect Size
from scipy import stats
import numpy as np

groups = [df[df['Dept']==d]['Salary'].values
          for d in ['Engineering','Finance','Marketing','HR']]

# ── Shapiro-Wilk Normality Test ──────────────────────────────
for dept, g in zip(['Engineering','Finance','Marketing','HR'], groups):
    stat, p = stats.shapiro(g[:50])
    print(f"{dept:15}: W={stat:.4f}  p={p:.4f}  {'Normal' if p>0.05 else 'Non-normal'}")

# ── Levene's Test (equal variances) ──────────────────────────
F_lev, p_lev = stats.levene(*groups)
print(f"\nLevene: F={F_lev:.3f}  p={p_lev:.4f}  {'Equal var' if p_lev>0.05 else 'Unequal var'}")

# ── One-Way ANOVA ────────────────────────────────────────────
F, p = stats.f_oneway(*groups)
print(f"ANOVA:  F={F:.3f}  p={p:.6f}  {'Significant' if p<0.05 else 'Not significant'}")

# ── Effect Size (Eta-squared) ────────────────────────────────
all_sal = np.concatenate(groups)
gm = all_sal.mean()
ss_b = sum(len(g)*(g.mean()-gm)**2 for g in groups)
ss_t = sum((x-gm)**2 for g in groups for x in g)
eta2 = ss_b / ss_t
print(f"η² = {eta2:.4f}  ({'large' if eta2>.14 else 'medium' if eta2>.06 else 'small'} effect)")
▶ Output
Engineering    : W=0.9867  p=0.1934  Normal
Finance        : W=0.9891  p=0.2481  Normal
Marketing      : W=0.9843  p=0.1712  Normal
HR             : W=0.9812  p=0.1543  Normal
Levene: F=1.842  p=0.1387  Equal var
ANOVA:  F=74.231  p=0.000001  Significant
η² = 0.3282  (large effect)
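Had Shapiro-Wilk rejected normality, step 4 would fall back from ANOVA to Kruskal-Wallis; a minimal sketch on the same simulated departments (data regenerated so the block is self-contained):

```python
import numpy as np
from scipy import stats

np.random.seed(42)
groups = [np.random.normal(mu, sd, n) for mu, sd, n in
          [(92000, 14000, 80),   # Engineering
           (85000, 12000, 70),   # Finance
           (74000, 11000, 65),   # Marketing
           (68000,  9000, 60)]]  # HR

# rank-based alternative to one-way ANOVA: no normality assumption
H, p = stats.kruskal(*groups)
print(f"Kruskal-Wallis: H={H:.2f}  p={p:.2e}")
```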
📊 Analysis Dashboard (Interactive)
Salary distributions by dept, mean comparison, and test p-values summary.
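Step 5 of the workflow (Tukey HSD) has no code above; a minimal sketch using statsmodels' pairwise_tukeyhsd on the same simulated data (regenerated here so the block is self-contained):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

np.random.seed(42)
dept_data = {
    'Engineering': np.random.normal(92000, 14000, 80),
    'Finance':     np.random.normal(85000, 12000, 70),
    'Marketing':   np.random.normal(74000, 11000, 65),
    'HR':          np.random.normal(68000,  9000, 60),
}
rows = [(sal, dept) for dept, sals in dept_data.items() for sal in sals]
df = pd.DataFrame(rows, columns=['Salary', 'Dept'])

# all pairwise mean comparisons with family-wise error control
tukey = pairwise_tukeyhsd(endog=df['Salary'], groups=df['Dept'], alpha=0.05)
print(tukey.summary())
```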
Python — Regression + sklearn Pipeline
from scipy import stats
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

# ── Linear Regression: Experience → Salary ───────────────────
np.random.seed(42)
experience = np.random.uniform(1, 20, 100)
salary = 50000 + 3500 * experience + np.random.normal(0, 8000, 100)

slope, intercept, r, p, se = stats.linregress(experience, salary)
print(f"Salary = {slope:.0f}×Exp + {intercept:,.0f}")
print(f"R²={r**2:.4f}  p={p:.6f}")

# ── sklearn End-to-End Pipeline ──────────────────────────────
# df from the EDA step has only Salary and Dept; add demo numeric
# features so the ColumnTransformer has something to scale
df['YearsExperience'] = np.random.uniform(1, 20, len(df))
df['Age'] = (22 + df['YearsExperience'] + np.random.normal(0, 4, len(df))).round()

preprocessor = ColumnTransformer([
    ('num', StandardScaler(),  ['Age', 'YearsExperience']),
    ('cat', OneHotEncoder(drop='first'), ['Dept'])
])
pipe = Pipeline([('prep', preprocessor), ('model', LinearRegression())])

X, y = df[['Age', 'YearsExperience', 'Dept']], df['Salary']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipe.fit(X_train, y_train)

cv = cross_val_score(pipe, X, y, cv=5, scoring='r2')
print(f"CV R²: {cv.round(3)}  Mean={cv.mean():.4f}")
📈 Salary vs Experience — Regression (Interactive)
Fitted regression line on scatter. Toggle to residuals or regenerate new data.
🏆 Model Cross-Validation R² Scores (Visual)
5-fold CV scores. Consistently high scores with low variance indicate a well-fitted model that generalizes.
DS Pipeline Best Practices:
✅ Always EDA first — df.info(), df.describe(), plot distributions
✅ Check normality before choosing parametric vs non-parametric tests
✅ Use the Robust Scaler when outliers are present in features
✅ Report effect sizes (η², Cohen's d) alongside p-values
✅ Correct for multiple comparisons (Tukey HSD, Bonferroni)
✅ Keep everything in a reproducible sklearn Pipeline
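Cohen's d from the checklist can be sketched for two of the simulated departments (pooled-SD formula; the data here are illustrative draws, not real figures):

```python
import numpy as np

np.random.seed(42)
eng = np.random.normal(92000, 14000, 80)   # simulated Engineering salaries
hr  = np.random.normal(68000,  9000, 60)   # simulated HR salaries

# Cohen's d = mean difference / pooled standard deviation
n1, n2 = len(eng), len(hr)
sp = np.sqrt(((n1 - 1) * eng.var(ddof=1) + (n2 - 1) * hr.var(ddof=1)) / (n1 + n2 - 2))
d = (eng.mean() - hr.mean()) / sp
print(f"Cohen's d = {d:.2f}")   # |d| >= 0.8 is conventionally a large effect
```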