Building a Machine Learning Pipeline in Python: A Step-by-Step Guide
Building a machine learning pipeline is an exciting endeavor: the flexibility to chain preprocessing steps and stack multiple models efficiently is fascinating!
This blog covers a baseline ML pipeline, walking through a detailed practical example using the Kaggle "Airline Passenger Satisfaction" dataset and an ensemble method (Random Forest). Let's get started!
Library Importing
# Import Essential Libraries
# Data Manipulation Tools
import pandas as pd
import numpy as np
# Visualization Tools
import matplotlib.pyplot as plt
import seaborn as sns
# Feature Engineering Techniques
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE, RFECV
# ML Tools
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, GridSearchCV, StratifiedKFold, KFold, RepeatedStratifiedKFold, RepeatedKFold
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier, GradientBoostingClassifier, ExtraTreesClassifier, VotingClassifier, StackingClassifier
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostClassifier # CatBoostRegressor
from sklearn.neural_network import MLPClassifier
# Tuning & Cross-validation
from sklearn.model_selection import RandomizedSearchCV
# ML Performance Metrics
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, f1_score
Loading and Understanding the Raw Data
I saved the dataset in my workspace; feel free to download it from here.
# Import raw data
df = pd.read_csv('../data/airline/train.csv')
test = pd.read_csv('../data/airline/test.csv')
df.head()
I like to retrieve the data dictionary to help me understand the data variables and observations. Check the next image.
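Alongside the data dictionary, a quick structural overview helps; this is just the standard pandas summary, nothing dataset-specific:
# quick structural overview of the raw training data
df.info()       # column dtypes and non-null counts
df.describe()   # summary statistics for the numerical columns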
Next, I separated the features from the target response upfront:
# separate features from target response
X = df.drop('satisfaction',axis=1)
y = df['satisfaction'].copy()
# Create a copy of X
df = X.copy()
print('DataFrame Features:', X.columns)
print('\n')
print('Target Response:', y.name)
Data Understanding
- Unnamed: 0 and Arrival Delay in Minutes will be dropped (Unnamed: 0 simply duplicates the DataFrame index). Ensure you apply the same to the test subset.
- Object columns will be converted to strings.
- Gender and Type of Travel will be transformed using the One Hot Encoding (OHE) method.
- Customer Type, Class and Satisfaction will be transformed using the Ordinal Encoding method.
- Since many columns are binary, high dimensionality could hurt model performance. Dimensionality reduction techniques such as PCA() are recommended (a sketch follows this list).
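If you do want to try dimensionality reduction, here is a minimal, self-contained sketch of how PCA could be appended to a numeric preprocessing pipeline; the 0.95 variance threshold is purely illustrative and not tuned for this dataset:
# Sketch: appending PCA to a numeric preprocessing pipeline
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
# keep enough components to explain ~95% of the variance (illustrative choice)
num_pipeline_pca = make_pipeline(SimpleImputer(strategy='mean'), MinMaxScaler(), PCA(n_components=0.95))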
Next, I wanted to check the number of unique values in the categorical data so I can treat them properly in the feature engineering stage.
# checking the unique values
for i in df.columns:
    if df[i].dtype == 'object':
        print("The unique features of each variable '{}':\n{}\n".format(df[i].name, df[i].value_counts().reset_index()))
Kindly ensure the basics of data cleaning: check for missing values, duplicates and outliers, and change the dtype of the categorical variables as follows (a quick sanity-check sketch appears a little further below):
# change the "object" to "string" variables
for i in df.columns:
if (df[i].dtype) == 'object':
df[i] = df[i].astype('string')
df['id'] = df['id'].astype('string')
Note that I converted "id" to string because I don't want my machine learning model to treat it as a numeric variable and use it in predictions.
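Here is a minimal sketch of the quick checks mentioned above; the duplicate and outlier checks are just rough sanity passes, not a full cleaning routine:
# quick sanity checks: missing values, duplicates and a rough outlier scan
print(df.isna().sum())                   # missing values per column
print('Duplicated rows:', df.duplicated().sum())
print(df.describe().T[['min', 'max']])   # eyeball value ranges for obvious outliers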
Then I dropped unnecessary variables as you can see here:
# drop unnecessary variables
df.drop(['Unnamed: 0','id','Arrival Delay in Minutes'],axis=1,inplace=True)
Feature Transformation Using Pipeline
I have separated the numeric variables from the categorical variables:
# split data into numerical & categorical features
num_features = df.select_dtypes(exclude=['object','string']).columns
print('Numerical features:', num_features,'\n')
cat_features = df.select_dtypes(include=['string']).columns
print('Categorical features:', cat_features)
Splitting the data using train_test_split() is vital: as you will see below, I handle each subset differently and scale the features accordingly, fitting the scaler on the training set only.
# split data into training & test sets
X_train, X_test, y_train, y_test = train_test_split(df,y,test_size=0.2,random_state=42)
Next come imputation, scaling and other feature engineering techniques. If you're comfortable with these parts, jump to the next sections; if you're a tad confused, check out the next source first.
To ensure robust model generalization, we first impute missing values in the numerical features (e.g., using mean or median imputation), then apply scaling (e.g., MinMaxScaler or StandardScaler) to mitigate the influence of extreme values. Imputation is needed before fitting, since most scikit-learn estimators reject NaN inputs; scaling matters less for tree-based models like Random Forest, but it keeps the preprocessing reusable for scale-sensitive models later on.
# impute missing values in numerical features
X_train_num = X_train[num_features]
X_test_num = X_test[num_features]
imputer = SimpleImputer(strategy='mean')
X_train_num = imputer.fit_transform(X_train_num)
X_test_num = imputer.transform(X_test_num)
# scale numerical features
scaler = MinMaxScaler()
X_train_num = scaler.fit_transform(X_train_num)
X_test_num = scaler.transform(X_test_num)
Preprocessing Pipeline
- Numeric Pipeline: Imputing + Scaling
# define pipeline for numerical features
num_pipeline = make_pipeline(SimpleImputer(strategy="mean"), MinMaxScaler())
# show the pipeline diagram
num_pipeline
2. Categorical Pipeline:
- One Hot Encoding → Gender, Type of Travel
- Ordinal Encoding → Customer Type, Class
# define categorical features with the right feature engineering method
# (ColumnTransformer expects lists of column names, not DataFrames)
onehot_cols = ['Gender', 'Type of Travel']
ordinal_cols = ['Customer Type', 'Class']
Using the Pipeline() class from sklearn:
onehot_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(sparse_output=False, handle_unknown='ignore'))
])
Define the ordered categories inside the ordinal features:
# define the ordered categories
ordinal_categories = [
    ['Loyal Customer', 'disloyal Customer'],  # Customer Type
    ['Eco', 'Eco Plus', 'Business'],          # Class
]
# ordinal_categories is already a list of lists, so pass it directly
ordinal_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OrdinalEncoder(categories=ordinal_categories))
])
Creating the preprocessing pipeline by merging all the techniques:
preprocessing = ColumnTransformer([
    ('num', num_pipeline, num_features),
    ('onehot', onehot_pipeline, onehot_cols),    # apply the OneHot pipeline to onehot_cols
    ('ordinal', ordinal_pipeline, ordinal_cols)  # apply the Ordinal pipeline to ordinal_cols
])
preprocessing
We’ve now built the pipeline’s core component: the preprocessing pipeline.
# apply the fit_transform()
X_train_transformed = preprocessing.fit_transform(X_train)
# the transformer returns a NumPy array with no column names, so convert it back to a pandas DataFrame
X_train_transformed = pd.DataFrame(
X_train_transformed, columns=preprocessing.get_feature_names_out(),
index=X_train.index)
print('Features after transformation:', X_train_transformed.head())
print('Features shape after transformation:', X_train_transformed.shape)
# instantiate a random forest classifier
from sklearn.ensemble import RandomForestClassifier
model_forest = RandomForestClassifier(random_state=42)
# fit the model (use the values attribute to pass a numpy array instead of a pandas dataframe)
model_forest.fit(X_train_transformed.values, y_train.values)
# we use accuracy as the evaluation metric
# don't forget to apply the pipeline to the test data
X_test_transformed = preprocessing.transform(X_test)
# make predictions and evaluate the model
y_pred_forest = model_forest.predict(X_test_transformed)
accuracy_forest = accuracy_score(y_test, y_pred_forest)
print('Accuracy:', accuracy_forest)
The model achieves 96% accuracy, which is a strong result!
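As a side note, the preprocessing and the classifier can also be chained into a single Pipeline object, so one fit/predict call handles everything. Here is a minimal sketch reusing the preprocessing ColumnTransformer defined above; it should reproduce essentially the same result as the manual steps:
# Sketch: chain preprocessing and the classifier into one end-to-end pipeline
full_pipeline = Pipeline([
    ('preprocessing', preprocessing),
    ('classifier', RandomForestClassifier(random_state=42))
])
full_pipeline.fit(X_train, y_train)   # the preprocessing steps are fit on the training data only
print('Accuracy:', full_pipeline.score(X_test, y_test))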
Next Steps to Consider
- Hyperparameter Tuning: Use GridSearchCV to optimize max_depth and n_estimators (see the sketch below).
- Feature Importance: Analyze which features drive predictions.
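As a starting point for both ideas, here is a minimal sketch; the parameter grid values are illustrative, not tuned for this dataset:
# Sketch: hyperparameter tuning with GridSearchCV (illustrative grid values)
param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [None, 10, 20],
}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3, scoring='accuracy', n_jobs=-1)
grid.fit(X_train_transformed, y_train)
print('Best parameters:', grid.best_params_)
# Sketch: inspect which features drive the tuned model's predictions
importances = pd.Series(grid.best_estimator_.feature_importances_,
                        index=preprocessing.get_feature_names_out()).sort_values(ascending=False)
print(importances.head(10))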
Conclusion
This pipeline demonstrates how to efficiently preprocess data and train a high-accuracy model.
Thanks for reading! Share your thoughts below, and follow for more data science content.