
Building a Machine Learning Pipeline in Python: A Step-by-Step Guide

5 min read · May 8, 2025


Building a machine learning pipeline is an exciting endeavor; the flexibility of looping and stacking multiple models efficiently is fascinating!

This blog covers a baseline ML pipeline, demonstrating a detailed practical example using the Kaggle “Airline Passenger Satisfaction” dataset and an ensemble method (Random Forest). Let’s get started!

Book: Building ML Pipelines

Importing Libraries

# Import Essential Libraries
# Data Manipulation Tools
import pandas as pd
import numpy as np

# visuals
import matplotlib.pyplot as plt
import seaborn as sns

# Feature Engineering Techniques
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE, RFECV

# ML Tools
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, GridSearchCV, StratifiedKFold, KFold, RepeatedStratifiedKFold, RepeatedKFold
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier, GradientBoostingClassifier, ExtraTreesClassifier, VotingClassifier, StackingClassifier

import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostClassifier # CatBoostRegressor

from sklearn.neural_network import MLPClassifier

# Tuning & Cross-validation
from sklearn.model_selection import RandomizedSearchCV

# ML Performance Metrics
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, f1_score

Loading and Understanding the Raw Data

I saved the dataset in my workspace; feel free to download it from here.

# Import raw data
df = pd.read_csv('../data/airline/train.csv')
test = pd.read_csv('../data/airline/test.csv')
df.head()

I like to retrieve the data dictionary to help me understand the variables and observations; check the next image.

Kaggle: Airline Passenger Satisfaction Predictions

Next, I separated the features from the target response upfront:

# separate features from target response
X = df.drop('satisfaction', axis=1)
y = df['satisfaction'].copy()

# create a working copy of X
df = X.copy()

print('DataFrame Features:', X.columns)
print('\n')
print('Target Response:', y.name)

Data Understanding

  • Unnamed: 0 will be dropped because it duplicates the DataFrame index; Arrival Delay in Minutes will also be dropped. Ensure you apply the same to the test subset.
  • Object columns will be converted to strings.
  • Gender and Type of Travel will be transformed using the One-Hot Encoding (OHE) method.
  • Customer Type, Class and Satisfaction will be transformed using the Ordinal Encoding method.
  • Since many columns are binary, high dimensionality could hurt model performance. Dimensionality reduction techniques such as PCA() are recommended (see the sketch after the numeric pipeline below)!

Next, I wanted to check the number of unique values in the categorical data so I can treat them properly in the feature engineering stage.

# checking the unique values
for i in df.columns:
    if df[i].dtype == 'object':
        print("The unique features of each variable '{}':\n{}\n".format(df[i].name, df[i].value_counts().reset_index()))

Kindly cover the basics of data cleaning: check for missing values, duplicates and outliers, then change the dtype of the categorical variables.
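A minimal sketch of those checks (they only surface issues; nothing here modifies the data):

# quick data-cleaning checks
print(df.isna().sum())         # missing values per column
print(df.duplicated().sum())   # number of fully duplicated rows
print(df.describe())           # summary statistics to eyeball potential outliers

With those basics reviewed, convert the object columns to strings: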

# change the "object" columns to "string" dtype
for i in df.columns:
    if df[i].dtype == 'object':
        df[i] = df[i].astype('string')

df['id'] = df['id'].astype('string')

Note that I converted “id” to a string because I don’t want my machine learning model to treat it as a numeric variable and use it in predictions.

Then I dropped unnecessary variables as you can see here:

# drop unnecessary columns (apply the same drop to the test subset)
df.drop(['Unnamed: 0','id','Arrival Delay in Minutes'], axis=1, inplace=True)

Feature Transformation Using a Pipeline

I separated the numeric variables from the categorical variables:

# split data into numerical & categorical features
num_features = df.select_dtypes(exclude=['object','string']).columns
print('Numerical features:', num_features,'\n')
cat_features = df.select_dtypes(include=['string']).columns
print('Categorical features:', cat_features)

Splitting the data using train_test_split() is vital because, as you will see below, I handle each subset differently and scale the features accordingly.

# split data into training & test sets 
X_train, X_test, y_train, y_test = train_test_split(df,y,test_size=0.2,random_state=42)

Next comes imputation, scaling and the other feature engineering techniques. If you’re comfortable with these parts, jump to the next sections; if you’re a tad confused, check out the next source first.

To ensure robust model generalization, we first impute missing values in the numerical features (e.g., using mean or median imputation), then apply scaling (e.g., MinMaxScaler or StandardScaler) to bring the features onto a comparable range. Handling missing values is a required step before training the Random Forest model, and keeping the scaling step in the pipeline makes it reusable for models that are sensitive to feature ranges.

# impute missing values in numerical features
X_train_num = X_train[num_features]
X_test_num = X_test[num_features]
imputer = SimpleImputer(strategy='mean')
X_train_num = imputer.fit_transform(X_train_num)
X_test_num = imputer.transform(X_test_num)

# scale numerical features
scaler = MinMaxScaler()
X_train_num = scaler.fit_transform(X_train_num)
X_test_num = scaler.transform(X_test_num)

Preprocessing Pipeline

  1. Numeric Pipeline: Imputing + Scaling
# define pipeline for numerical features
num_pipeline = make_pipeline(SimpleImputer(strategy="mean"), MinMaxScaler())

# show the pipeline diagram
num_pipeline
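As promised in the data-understanding notes, dimensionality reduction such as PCA can simply be appended to this numeric pipeline after scaling. A minimal sketch, where n_components=0.95 is an illustrative choice rather than a tuned value:

# numeric pipeline with an optional PCA step appended after scaling
num_pipeline_pca = make_pipeline(
    SimpleImputer(strategy='mean'),
    MinMaxScaler(),
    PCA(n_components=0.95)  # retain enough components to explain ~95% of the variance
)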

2. Categorical Pipeline:

  • One Hot Encoding → Gender, Type of Travel
  • Ordinal Encoding → Customer Type, Class
# define which categorical features get which encoding
onehot_cols = ['Gender', 'Type of Travel']
ordinal_cols = ['Customer Type', 'Class']

Using the Pipeline() from sklearn

onehot_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(sparse_output=False, handle_unknown='ignore'))
])

Define the ordered category values inside the ordinal features:

# define the ordered categories
ordinal_categories = [
    ['Loyal Customer', 'disloyal Customer'],  # Customer Type
    ['Eco', 'Eco Plus', 'Business'],          # Class
]

ordinal_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OrdinalEncoder(categories=ordinal_categories))  # one list of categories per column
])

Creating the preprocessing pipeline after merging all techniques

preprocessing = ColumnTransformer([
    ('num', num_pipeline, num_features),
    ('onehot', onehot_pipeline, onehot_cols),    # apply the one-hot pipeline to onehot_cols
    ('ordinal', ordinal_pipeline, ordinal_cols)  # apply the ordinal pipeline to ordinal_cols
])
preprocessing

We’ve now built the pipeline’s core component: the preprocessing pipeline.

# apply fit_transform() on the training set
X_train_transformed = preprocessing.fit_transform(X_train)

# fit_transform returns a NumPy array with no column names,
# so convert it back to a pandas DataFrame
X_train_transformed = pd.DataFrame(
    X_train_transformed,
    columns=preprocessing.get_feature_names_out(),
    index=X_train.index)

print('Features after transformation:', X_train_transformed.head())
print('Features shape after transformation:', X_train_transformed.shape)
# instantiate a random forest classifier (already imported above)
model_forest = RandomForestClassifier(random_state=42)

# fit the model (pass the .values attribute to supply NumPy arrays instead of pandas objects)
model_forest.fit(X_train_transformed.values, y_train.values)
# we use accuracy as the evaluation metric
# don't forget to apply the pipeline to the test data
X_test_transformed = preprocessing.transform(X_test)

# make predictions and evaluate the model
y_pred_forest = model_forest.predict(X_test_transformed)
accuracy_forest = accuracy_score(y_test, y_pred_forest)
print('Accuracy:', accuracy_forest)

The model achieves 96% accuracy, which is a strong result!
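Accuracy alone can hide class-level behaviour, and we already imported precision_score, f1_score and confusion_matrix, so here is a minimal sketch of a broader evaluation (the weighted average is an assumption made to avoid picking a positive label):

# additional metrics beyond accuracy
print('Precision:', precision_score(y_test, y_pred_forest, average='weighted'))
print('F1 score:', f1_score(y_test, y_pred_forest, average='weighted'))
print('Confusion matrix:\n', confusion_matrix(y_test, y_pred_forest))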

Next Steps to Consider

  • Hyperparameter Tuning: Use GridSearchCV to optimize max_depth and n_estimators.
  • Feature Importance: Analyze which features drive the predictions (a sketch of both steps follows below).
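Here is a minimal sketch of both ideas, chaining the preprocessing ColumnTransformer and the forest into one Pipeline so the grid search tunes everything together; the grid values are illustrative assumptions, not recommendations:

# wrap preprocessing + model into a single pipeline so tuning covers both
full_pipeline = Pipeline([
    ('preprocessing', preprocessing),
    ('model', RandomForestClassifier(random_state=42))
])

# small illustrative grid over n_estimators and max_depth
param_grid = {
    'model__n_estimators': [100, 300],
    'model__max_depth': [None, 10, 20],
}

grid_search = GridSearchCV(full_pipeline, param_grid, cv=3, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)
print('Best parameters:', grid_search.best_params_)

# feature importances from the best forest, mapped back to the transformed feature names
best_model = grid_search.best_estimator_
importances = pd.Series(
    best_model.named_steps['model'].feature_importances_,
    index=best_model.named_steps['preprocessing'].get_feature_names_out()
).sort_values(ascending=False)
print(importances.head(10))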

Conclusion

This pipeline demonstrates how to efficiently preprocess data and train a high-accuracy model.

Thanks for reading! Share your thoughts below, and follow for more data science content.

Let’s connect on LinkedIn, YouTube, TikTok!
