ML Workflow with tidymodels

1 Libraries

In this article we use the tidymodels ecosystem to illustrate a complete machine learning workflow in a tidy, coherent, and reproducible manner.
Throughout the article, we revisit the fundamental components of the ML workflow:

  1. Data split
  2. Feature engineering
  3. Model building
  4. Model fitting
  5. Model prediction
  6. Model evaluation

We begin with the most basic workflow and later extend it using the recipes framework for more advanced preprocessing.

The material in this article is based on the case study provided at: https://www.tidymodels.org/start/case-study/. All exercises use the dataset included in that example.

The following packages are required:

library(tidyverse)
theme_set(theme_bw())

library(tidymodels)  # parsnip + other tidymodels packages

# Helper packages
library(readr)       # for importing data
library(broom.mixed) # for tidying Bayesian model output
library(dotwhisker)  # for visualizing regression results
library(skimr)       # for variable summaries

2 Basic Model Building and Fitting

In this first exercise, we focus on three steps:

  1. Exploring the data – Understanding the structure, distributions, and relationships within the dataset to inform subsequent modeling decisions.
  2. Specifying a model – Selecting an appropriate algorithm and defining its parameters based on the task and data characteristics.
  3. Fitting the model to the data – Training the model using the available data to estimate its parameters and capture underlying patterns.

2.1 Data

In this exercise we explore the relationship between:

  • the outcome variable: sea urchin width
  • feature 1: initial volume (continuous)
  • feature 2: food regime (categorical)

A basic model for these variables can be written as:

width ~ initial_volume * food_regime
## Data import ----------
df_Urchins <-
  read_csv("https://tidymodels.org/start/models/urchins.csv") |>
  setNames(c("food_regime", "initial_volume", "width")) |>
  mutate(food_regime = factor(food_regime, levels = c("Initial", "Low", "High")))

## Explore the relationships ----------
ggplot(df_Urchins,
       aes(x = initial_volume,
           y = width,
           group = food_regime,
           col = food_regime)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  scale_color_viridis_d(option = "plasma", end = .7)

2.2 Building the Model

We start with a simple linear regression model using linear_reg(). The tidymodels ecosystem separates model specification (what model we want) from the computational engine (how it is estimated). The parsnip package handles this abstraction.

Available engines for linear_reg() are listed here: https://parsnip.tidymodels.org/reference/linear_reg.html

  1. Choose the model type:
linear_reg()
Linear Regression Model Specification (regression)

Computational engine: lm 
  2. Choose the engine:
linear_reg() |>
  set_engine("lm")
Linear Regression Model Specification (regression)

Computational engine: lm 
  3. Save the model specification:
mdl_Linear <- linear_reg()
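
Because the model specification is separate from the engine, the same model can be estimated with a different backend without changing the rest of the workflow. The following is a minimal sketch of a Bayesian fit of the same specification, assuming the rstanarm package is installed (broom.mixed, loaded above, is designed to tidy this kind of output); the student-t prior is purely illustrative.

# Sketch: the same linear model specification with a Bayesian engine
# (requires the rstanarm package; the prior below is an illustrative choice)
library(rstanarm)

prior_dist <- rstanarm::student_t(df = 1)

mdl_Bayes <-
  linear_reg() |>
  set_engine(
    "stan",
    prior_intercept = prior_dist,  # prior for the intercept
    prior = prior_dist             # prior for the coefficients
  )

# Fitting works exactly as before: mdl_Bayes |> fit(width ~ initial_volume * food_regime, data = df_Urchins)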

2.3 Training (Fitting) the Model

To train the model, we use the fit() function, which requires:

  1. the model specification
  2. a formula
  3. the training dataset
fit_Linear <-
  mdl_Linear |>
  fit(width ~ initial_volume * food_regime, data = df_Urchins)

fit_Linear
parsnip model object


Call:
stats::lm(formula = width ~ initial_volume * food_regime, data = data)

Coefficients:
                   (Intercept)                  initial_volume  
                     0.0331216                       0.0015546  
                food_regimeLow                 food_regimeHigh  
                     0.0197824                       0.0214111  
 initial_volume:food_regimeLow  initial_volume:food_regimeHigh  
                    -0.0012594                       0.0005254  

We can inspect the fitted model coefficients:

tidy(fit_Linear)
# A tibble: 6 × 5
  term                            estimate std.error statistic  p.value
  <chr>                              <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)                     0.0331    0.00962      3.44  0.00100 
2 initial_volume                  0.00155   0.000398     3.91  0.000222
3 food_regimeLow                  0.0198    0.0130       1.52  0.133   
4 food_regimeHigh                 0.0214    0.0145       1.47  0.145   
5 initial_volume:food_regimeLow  -0.00126   0.000510    -2.47  0.0162  
6 initial_volume:food_regimeHigh  0.000525  0.000702     0.748 0.457   
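
The dotwhisker package was loaded above for visualizing regression results; a minimal sketch of a dot-and-whisker plot of the tidied coefficients follows (the styling arguments are illustrative, not required):

# Dot-and-whisker plot of the tidied coefficients (illustrative styling)
tidy(fit_Linear) |>
  dotwhisker::dwplot(
    dot_args = list(size = 2, color = "black"),
    whisker_args = list(color = "black"),
    vline = geom_vline(xintercept = 0, colour = "grey50", linetype = 2)
  )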

2.4 Predicting with the Fitted Model

Once a model is fitted, we can use it for prediction. To make predictions we need:

  1. a dataset containing the feature variables (with matching names)
  2. the fitted model object

By default, predict() returns the mean prediction:

df_NewData <- expand.grid(
  initial_volume = 20,
  food_regime = c("Initial", "Low", "High")
)

pred_Mean <- predict(fit_Linear, new_data = df_NewData)
pred_Mean
# A tibble: 3 × 1
   .pred
   <dbl>
1 0.0642
2 0.0588
3 0.0961

We can also request confidence intervals using type = "conf_int":

pred_Conf <- predict(
  fit_Linear,
  new_data = df_NewData,
  type = "conf_int"
)
pred_Conf
# A tibble: 3 × 2
  .pred_lower .pred_upper
        <dbl>       <dbl>
1      0.0555      0.0729
2      0.0499      0.0678
3      0.0870      0.105 
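
As a sketch of how these outputs are typically used together, the mean predictions and interval bounds can be bound back onto df_NewData and visualized; the column names (.pred, .pred_lower, .pred_upper) follow the predict() output shown above.

# Combine predictions with the new data and visualize (a minimal sketch)
df_Plot <-
  df_NewData |>
  bind_cols(pred_Mean) |>
  bind_cols(pred_Conf)

ggplot(df_Plot, aes(x = food_regime, y = .pred)) +
  geom_point() +
  geom_errorbar(aes(ymin = .pred_lower, ymax = .pred_upper), width = 0.2) +
  labs(y = "Predicted urchin width")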

3 Data Processing with recipes

The recipes package provides a structured and extensible framework for data preprocessing and feature engineering within the tidymodels ecosystem.
It allows us to define all preprocessing steps in a reproducible, model-agnostic workflow before model fitting.

In this exercise, we will focus on the following steps:

  1. Data splitting – Dividing the dataset into training, validation, and testing sets.
  2. Recipe defining
    2.1 Basic definition – Specifying the target variable, predictors, and initial preprocessing steps.
    2.2 Feature engineering – Applying transformations, scaling, encoding, and other preprocessing techniques to enhance model performance.
  3. Model building – Selecting and configuring the appropriate algorithm for the task.
  4. Combining recipe and model into a workflow – Integrating preprocessing steps with the model into a unified pipeline.
  5. Model fitting – Training the model on the training data using the defined workflow.
  6. Prediction – Generating predictions on new or unseen data.
  7. Evaluation – Assessing model performance using suitable metrics and interpreting the results to inform potential improvements.

3.1 Data

In this example we work with flight data from the nycflights13 package, which records all flights that departed New York City in 2013.
Our goal is to predict whether an arriving flight is late (arrival delay ≥ 30 minutes) or on time.

The modeling relationship can be summarized as:

arr_delay ~ (date, airport, distance, ...)
library(nycflights13)
set.seed(123)

df_Flight <- 
  flights |> 
  mutate(
    arr_delay = ifelse(arr_delay >= 30, "late", "on_time"),
    arr_delay = factor(arr_delay),
    date = lubridate::as_date(time_hour)
  ) |> 
  inner_join(weather, by = c("origin", "time_hour")) |> 
  select(dep_time, flight, origin, dest, air_time, distance,
         carrier, date, arr_delay, time_hour) |> 
  na.omit() |> 
  mutate_if(is.character, as.factor)

df_Flight |> 
  skimr::skim(dest, carrier)
Data summary
Name df_Flight
Number of rows 325819
Number of columns 10
_______________________
Column type frequency:
factor 2
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
dest 0 1 FALSE 104 ATL: 16771, ORD: 16507, LAX: 15942, BOS: 14948
carrier 0 1 FALSE 16 UA: 57489, B6: 53715, EV: 50868, DL: 47465

3.2 Data Split

As usual, we divide the dataset into training and validation sets. This allows us to fit the model on one portion of the data and evaluate performance on unseen observations.

Under the tidymodels framework:

  1. initial_split() performs the random split.
  2. training() and testing() extract the corresponding sets.
set.seed(222)
split_Flight <- initial_split(df_Flight, prop = 3/4)

df_Train <- training(split_Flight)
df_Validate <- testing(split_Flight)

3.3 Creating a Recipe

3.3.1 Basic Recipe

A recipe in tidymodels is a structured, pre-defined plan for data preprocessing and feature engineering.
It allows us to declare, in advance, all the steps needed to transform raw data into model-ready features.

When creating a recipe, two fundamental components must be specified:

  1. Formula (~):
    The variable on the left-hand side is the target/outcome, and the variables on the right-hand side are the predictors/features.

  2. Data:
    The dataset is used to identify variable names and types. At this stage, no transformations are applied.

# Create a basic recipe
rcp_Flight <- recipe(arr_delay ~ ., data = df_Train)

Using . on the right-hand side indicates that all remaining variables in the dataset should be used as predictors.

3.3.2 Roles

In addition to the standard roles (outcome and predictor), recipes allows defining custom roles using update_role().

Common roles include:

  • predictor: variables used to predict the outcome
  • outcome: target variable
  • ID: identifiers not used for modeling
  • case_weights: observation weights
  • datetime, latitude, longitude: for temporal or spatial feature engineering
rcp_Flight <- 
  recipe(arr_delay ~ ., data = df_Train) |> 
  update_role(flight, time_hour, new_role = "ID")

summary(rcp_Flight)
# A tibble: 10 × 4
   variable  type      role      source  
   <chr>     <list>    <chr>     <chr>   
 1 dep_time  <chr [2]> predictor original
 2 flight    <chr [2]> ID        original
 3 origin    <chr [3]> predictor original
 4 dest      <chr [3]> predictor original
 5 air_time  <chr [2]> predictor original
 6 distance  <chr [2]> predictor original
 7 carrier   <chr [3]> predictor original
 8 date      <chr [1]> predictor original
 9 time_hour <chr [1]> ID        original
10 arr_delay <chr [3]> outcome   original

3.4 Feature Engineering with step_*

3.4.1 Date Features

Feature engineering transforms raw data into model-ready variables. recipes expresses each transformation as a step_*() function.

Below, we extract:

  • day of week (dow)
  • month
  • holiday indicators (using U.S. holidays)
rcp_Flight <- 
  recipe(arr_delay ~ ., data = df_Train) |> 
  update_role(flight, time_hour, new_role = "ID") |> 
  step_date(date, features = c("dow", "month")) |>               
  step_holiday(
    date,
    holidays = timeDate::listHolidays("US"),
    keep_original_cols = FALSE
  )

Because keep_original_cols = FALSE is set in step_holiday(), the original date column is dropped once the engineered calendar features have been created.
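
To check what the recipe will actually produce, it can be estimated with prep() and applied with bake(). The sketch below previews the engineered calendar columns; selecting them via starts_with("date_") is just an illustration of their naming convention.

# Preview the engineered calendar features (a minimal sketch)
rcp_Flight |>
  prep(training = df_Train) |>
  bake(new_data = NULL) |>       # returns the processed training set
  select(starts_with("date_"))   # engineered dow/month/holiday columns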

3.4.2 Categorical Variables (Dummy Encoding)

Most models require converting categorical variables into dummy/one-hot encoded predictors. recipes provides:

  • step_novel(): handles unseen factor levels at prediction time
  • step_dummy(): converts categorical predictors into numerical dummy variables
rcp_Flight <- 
  recipe(arr_delay ~ ., data = df_Train) |> 
  update_role(flight, time_hour, new_role = "ID") |> 
  step_date(date, features = c("dow", "month")) |>               
  step_holiday(
    date, 
    holidays = timeDate::listHolidays("US"), 
    keep_original_cols = FALSE
  ) |> 
  step_novel(all_nominal_predictors()) |>  
  step_dummy(all_nominal_predictors())

3.4.3 Removing Zero-Variance Predictors

Predictors with constant values provide no information and may cause issues in some algorithms. We remove them using step_zv():

rcp_Flight <- 
  recipe(arr_delay ~ ., data = df_Train) |> 
  update_role(flight, time_hour, new_role = "ID") |> 
  step_date(date, features = c("dow", "month")) |>               
  step_holiday(
    date, 
    holidays = timeDate::listHolidays("US"), 
    keep_original_cols = FALSE
  ) |> 
  step_novel(all_nominal_predictors()) |>  
  step_dummy(all_nominal_predictors()) |> 
  step_zv(all_predictors())

3.5 Combining the Recipe with a Model

3.5.1 Workflow Definition

In this section, we formalize the integration of data preprocessing and model specification by constructing a workflow. This approach ensures that all preprocessing operations defined in the recipe are applied consistently during model training and prediction, thereby maintaining a coherent pipeline.

A workflow allows bundling:

  1. the recipe (preprocessing)
  2. the model (statistical/ML algorithm)
mdl_LogReg <- 
  logistic_reg() |> 
  set_engine("glm")

wflow_Flight <- 
  workflow() |> 
  add_model(mdl_LogReg) |> 
  add_recipe(rcp_Flight)

3.6 Model Fitting

We proceed by fitting the complete preprocessing and modeling pipeline to the training data using the fit() function.

fit_Flight <- 
  wflow_Flight |> 
  fit(data = df_Train)

Inspect model coefficients:

fit_Flight |> 
  extract_fit_parsnip() |> 
  tidy()
# A tibble: 157 × 5
   term                         estimate std.error statistic  p.value
   <chr>                           <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)                   7.28    2.73           2.67 7.64e- 3
 2 dep_time                     -0.00166 0.0000141   -118.   0       
 3 air_time                     -0.0440  0.000563     -78.2  0       
 4 distance                      0.00507 0.00150        3.38 7.32e- 4
 5 date_USChristmasDay           1.33    0.177          7.49 6.93e-14
 6 date_USColumbusDay            0.724   0.170          4.25 2.13e- 5
 7 date_USCPulaskisBirthday      0.807   0.139          5.80 6.57e- 9
 8 date_USDecorationMemorialDay  0.585   0.117          4.98 6.32e- 7
 9 date_USElectionDay            0.948   0.190          4.98 6.25e- 7
10 date_USGoodFriday             1.25    0.167          7.45 9.40e-14
# ℹ 147 more rows

3.7 Prediction

Once fitted, the workflow can be used to generate predictions on the validation dataset.

pred_Flight <- predict(fit_Flight, df_Validate)

For evaluation we need both class predictions and class probabilities:

pred_Flight <- 
  predict(fit_Flight, df_Validate, type = "prob") |> 
  bind_cols(predict(fit_Flight, df_Validate, type = "class")) |> 
  bind_cols(df_Validate |> select(arr_delay))

3.8 Evaluation

The predictive performance of the model is evaluated using appropriate metrics, including ROC AUC and accuracy. Additionally, visualization of the ROC curve facilitates an assessment of the classifier’s discriminative capability.

pred_Flight |> 
  roc_auc(truth = arr_delay, .pred_late)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.764
pred_Flight |> 
  accuracy(truth = arr_delay, .pred_class)
# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.849
aug_Flight <- augment(fit_Flight, df_Validate)

aug_Flight |> 
  roc_curve(truth = arr_delay, .pred_late) |> 
  autoplot()
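
As an additional check (not part of the original case study), a confusion matrix summarizes the class-level errors behind the accuracy figure:

# Confusion matrix of predicted vs. observed classes (additional check)
pred_Flight |>
  conf_mat(truth = arr_delay, estimate = .pred_class)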

3.9 Clean Version of the Code

set.seed(222)  # Set seed for reproducibility

# Split the data into 75% training and 25% validation
split_Flight <- initial_split(df_Flight, prop = 3/4)
df_Train <- training(split_Flight)      # Extract training set
df_Validate <- testing(split_Flight)    # Extract validation set

# -------------------------------
# Create a preprocessing recipe
# -------------------------------
rcp_Flight <- 
  recipe(arr_delay ~ ., data = df_Train) |>    # Define target and predictors
  update_role(flight, time_hour, new_role = "ID") |>  # Assign ID role to identifier columns
  step_date(date, features = c("dow", "month")) |>    # Extract day of week & month from date
  step_holiday(
    date, 
    holidays = timeDate::listHolidays("US"), 
    keep_original_cols = FALSE
  ) |>                                             # Add holiday indicator features
  step_novel(all_nominal_predictors()) |>         # Handle new factor levels in prediction
  step_dummy(all_nominal_predictors()) |>         # Convert categorical predictors to dummy variables
  step_zv(all_predictors())                       # Remove predictors with zero variance

# -------------------------------
# Define the logistic regression model
# -------------------------------
mdl_LogReg <- 
  logistic_reg() |> 
  set_engine("glm")  # Use generalized linear model as backend

# -------------------------------
# Combine recipe and model into a workflow
# -------------------------------
wflow_Flight <- 
  workflow() |> 
  add_model(mdl_LogReg) |> 
  add_recipe(rcp_Flight)

# -------------------------------
# Fit the workflow on the training data
# -------------------------------
fit_Flight <- 
  wflow_Flight |> 
  fit(data = df_Train)

# -------------------------------
# Predict on the validation set (class labels)
# -------------------------------
pred_Flight <- predict(fit_Flight, df_Validate)

# -------------------------------
# Predict on validation set (probabilities and class labels)
# -------------------------------
pred_Flight <- 
  predict(fit_Flight, df_Validate, type = "prob") |>   # Predicted probabilities
  bind_cols(predict(fit_Flight, df_Validate, type = "class")) |>  # Predicted classes
  bind_cols(df_Validate |> select(arr_delay))          # Add true labels for evaluation

# -------------------------------
# Evaluate model performance
# -------------------------------
pred_Flight |> 
  roc_auc(truth = arr_delay, .pred_late)   # ROC AUC for "late" class

pred_Flight |> 
  accuracy(truth = arr_delay, .pred_class) # Classification accuracy

# -------------------------------
# Visualize ROC curve
# -------------------------------
aug_Flight <- augment(fit_Flight, df_Validate)  # Add predicted classes and class probabilities to the data

aug_Flight |> 
  roc_curve(truth = arr_delay, .pred_late) |>  # Compute ROC curve
  autoplot()                                   # Plot ROC curve

4 Modelling with Resampling

In contrast to the previous exercise, this section introduces model evaluation using resampling techniques, specifically cross-validation. This approach incorporates an additional step in which the training dataset is repeatedly partitioned into multiple training–validation splits. By fitting and evaluating the model across these resampled subsets, cross-validation provides a more reliable and robust estimate of the model’s generalization performance than a single train–validate split.

4.1 Evaluating Training and Validation Sets

A fitted model can be evaluated on both the training dataset and the validation (test) dataset.
After generating predictions, we can compute performance metrics such as roc_auc and accuracy.
For both metrics, higher values indicate better predictive performance.

library(modeldata) 

# ----------------------------------
# Dataset
# ----------------------------------

## Split the data
set.seed(123)
split_Cell <- initial_split(
  modeldata::cells |> select(-case), 
  strata = class
)

df_Train_Cell <- training(split_Cell)
df_Validate_Cell  <- testing(split_Cell)

# ----------------------------------
# Model
# ----------------------------------

## Specify a random forest classifier
mdl_RandomForest <- 
  rand_forest(trees = 1000) |> 
  set_engine("ranger") |> 
  set_mode("classification")

## Fit the model
set.seed(234)
fit_RandomForest <- 
  mdl_RandomForest |> 
  fit(class ~ ., data = df_Train_Cell)
fit_RandomForest
parsnip model object

Ranger result

Call:
 ranger::ranger(x = maybe_data_frame(x), y = y, num.trees = ~1000,      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1), probability = TRUE) 

Type:                             Probability estimation 
Number of trees:                  1000 
Sample size:                      1514 
Number of independent variables:  56 
Mtry:                             7 
Target node size:                 10 
Variable importance mode:         none 
Splitrule:                        gini 
OOB prediction error (Brier s.):  0.1187479 
# ----------------------------------
# Predictions
# ----------------------------------

## Training set predictions
pred_Train_Cell <- 
  predict(fit_RandomForest, df_Train_Cell) |> 
  bind_cols(predict(fit_RandomForest, df_Train_Cell, type = "prob")) |> 
  bind_cols(df_Train_Cell |> select(class))

## Validation set predictions
pred_Validate_Cell <- 
  predict(fit_RandomForest, df_Validate_Cell) |> 
  bind_cols(predict(fit_RandomForest, df_Validate_Cell, type = "prob")) |> 
  bind_cols(df_Validate_Cell |> select(class))

# ----------------------------------
# Performance
# ----------------------------------

## Training
pred_Train_Cell |> roc_auc(truth = class, .pred_PS)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         1.000
pred_Train_Cell |> accuracy(truth = class, .pred_class)
# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.990
## Validation
pred_Validate_Cell |> roc_auc(truth = class, .pred_PS)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.891
pred_Validate_Cell |> accuracy(truth = class, .pred_class)
# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.814

The results typically show a large gap between training and validation performance. This is normal: the model tends to overfit the training data, leading to reduced predictive ability on new, unseen data. This is precisely why a simple train/test split is often insufficient. To obtain more reliable and stable performance estimates, we use resampling methods.

4.2 Resampling (Cross-Validation)

Cross-validation improves model evaluation by repeatedly splitting the training data into several folds. Each fold is used once as an assessment (test) set, while the remaining folds form the analysis (training) set.

This process reduces the variance in performance estimation and helps detect overfitting.

set.seed(345)
folds_Cell <- vfold_cv(df_Train_Cell, v = 10)

wflow_CrossValidate <- 
  workflow() |>
  add_model(mdl_RandomForest) |>
  add_formula(class ~ .)

set.seed(456)
fit_CrossValidate <- 
  wflow_CrossValidate |> 
  fit_resamples(folds_Cell)

collect_metrics(fit_CrossValidate)
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.833    10 0.00876 Preprocessor1_Model1
2 brier_class binary     0.120    10 0.00406 Preprocessor1_Model1
3 roc_auc     binary     0.905    10 0.00627 Preprocessor1_Model1

Compared with ordinary model fitting, resampling uses the specialized function:

  • fit_resamples(), which fits the model on each fold and aggregates the performance results.

Cross-validation therefore provides a more robust estimate of the model’s generalization ability and reduces the risk of overfitting, compared with a single train/test split.
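
Two refinements can be useful here, sketched below: collect_metrics(summarize = FALSE) reports fold-level results instead of the average, and control_resamples(save_pred = TRUE) retains the out-of-sample predictions so that metrics such as ROC AUC can be recomputed directly from them.

# Per-fold metrics instead of the aggregated summary
collect_metrics(fit_CrossValidate, summarize = FALSE)

# Keep the out-of-sample predictions from each fold (a minimal sketch)
set.seed(456)
fit_CrossValidate_Pred <-
  wflow_CrossValidate |>
  fit_resamples(
    folds_Cell,
    control = control_resamples(save_pred = TRUE)
  )

collect_predictions(fit_CrossValidate_Pred) |>
  roc_auc(truth = class, .pred_PS)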