1 Introduction

Model selection is the process of choosing the most relevant features from a set of candidate variables. It matters because a well-chosen model is accurate and interpretable, computationally efficient, and less prone to overfitting. Stepwise regression algorithms iteratively add or remove features according to a chosen criterion, such as a significance level (P-value) or an information criterion like AIC or BIC. The process continues until no further improvement is possible under that criterion. The result of the stepwise procedure is a final model comprising the selected features and their coefficients.
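
As a rough illustration of the add-one-feature loop described above, the following base-R sketch runs forward selection by AIC on the built-in mtcars data. It is not StepReg's implementation: the variable names (candidates, selected, current_aic) are chosen here for illustration, and only the forward direction is covered.

```r
# Forward selection by AIC using base R only (illustrative sketch,
# not StepReg's internal algorithm).
data(mtcars)

response   <- "mpg"
candidates <- setdiff(names(mtcars), response)
selected   <- character(0)

# Start from the intercept-only model.
current_aic <- AIC(lm(mpg ~ 1, data = mtcars))

repeat {
  remaining <- setdiff(candidates, selected)
  if (length(remaining) == 0) break

  # Score every one-variable extension of the current model.
  trial_aic <- vapply(remaining, function(v) {
    f <- reformulate(c(selected, v), response = response)
    AIC(lm(f, data = mtcars))
  }, numeric(1))

  best <- names(trial_aic)[which.min(trial_aic)]

  # Stop when the best addition no longer lowers AIC.
  if (trial_aic[[best]] >= current_aic) break

  selected    <- c(selected, best)
  current_aic <- trial_aic[[best]]
}

print(selected)
print(current_aic)
```

A backward or bidirectional variant would additionally score the removal of each selected feature at every step and apply the same stopping rule.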

StepReg simplifies model selection tasks by providing a unified programming interface. It currently supports model building for five distinct response variable types (section 3.1), four model selection strategies (section 3.2) including the best subsets algorithm, and a variety of selection metrics (section 3.3). Moreover, StepReg detects and addresses multicollinearity issues when they are present (section 3.4). The output of StepReg includes multiple tables summarizing the final model and the variable selection procedure. Additionally, StepReg offers a plot function to visualize the selection steps (section 4). For demonstration, the vignettes include four use cases covering distinct regression scenarios (section 5). Non-programmers can access the tool through the interactive Shiny app detailed in section 6.

2 Quick demo

The following example selects an optimal linear regression model with the mtcars dataset.


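library(StepReg)

# Regress mpg on every other column of mtcars
formula <- mpg ~ .
res <- stepwise(formula = formula,
                data = mtcars,
                type = "linear",           # linear regression
                include = c("qsec"),       # always keep qsec in the model
                strategy = "bidirection",  # bidirectional stepwise selection
                metric = c("AIC"))         # compare candidate models by AIC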

Breakdown of the parameters:

  • formula: specifies the dependent and independent variables
  • type: specifies the regression category; depending on your data, choose from “linear”, “logit”, “cox”, etc.
  • include: specifies the variables that must be in the final model
  • strategy: specifies the model selection strategy; choose from “forward”, “backward”, “bidirection”, “subset”
  • metric: specifies the model fit evaluation metric; choose one or more from “AIC”, “AICc”, “BIC”, “SL”, etc.
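
To show how these parameters combine, here is a hedged variation on the quick demo call above that switches to forward selection and BIC. It uses only argument names and values listed in this section. The object name res_forward is an illustrative choice, and the requireNamespace guard is an assumption added so that the snippet does nothing when StepReg is not installed.

```r
# Variation on the quick demo: forward selection scored by BIC.
# Guarded so the snippet is a no-op when StepReg is unavailable.
if (requireNamespace("StepReg", quietly = TRUE)) {
  library(StepReg)

  res_forward <- stepwise(formula = mpg ~ .,
                          data = mtcars,
                          type = "linear",
                          include = c("qsec"),   # qsec stays in the model
                          strategy = "forward",  # only add variables
                          metric = c("BIC"))     # score models by BIC

  print(res_forward)
}
```

Because BIC penalizes additional parameters differently from AIC, the selected model may differ from the one in the quick demo.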

The output consists of multiple tables, which can be viewed by printing res:

Table 1. Summary of arguments for model selection
                       Parameter        Value
               included variable         qsec 
                        strategy  bidirection 
                          metric          AIC 
  tolerance of multicollinearity        1e-07 
      multicollinearity variable         NULL 
                       intercept            1 

Table 2. Summary of variables in dataset      
  Variable_type  Variable_name  Variable_class
      Dependent            mpg         numeric 
    Independent            cyl         numeric 
    Independent           disp         numeric 
    Independent             hp         numeric 
    Independent           drat         numeric 
    Independent             wt         numeric 
    Independent           qsec         numeric 
    Independent             vs         numeric 
    Independent             am         numeric 
    Independent           gear         numeric 
    Independent           carb         numeric 

Table 3. Summary of selection process under bidirection with AIC
  Step  EffectEntered  EffectRemoved  NumberParams         AIC
     1              1                            1   149.94345 
     2           qsec                            2  145.776054 
     3             wt                            3    97.90843 
     4             am                            4   95.307305 

Table 4. Summary of coefficients for the selected model with mpg under bidirection and AIC 
     Variable   Estimate  Std. Error    t value  Pr(>|t|)
  (Intercept)   9.617781    6.959593   1.381946  0.177915 
         qsec   1.225886     0.28867   4.246676  0.000216 
           wt  -3.916504    0.711202  -5.506882     7e-06 
           am   2.935837    1.410905   2.080819  0.046716 
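
As a sanity check on Table 4, the selected model can be refit with base R's lm() and its coefficient table compared with the values above. This is a minimal sketch that assumes the selected predictors are exactly qsec, wt, and am, as reported in Table 3. The object name fit is illustrative.

```r
# Refit the selected model with base R and inspect its coefficients.
fit <- lm(mpg ~ qsec + wt + am, data = mtcars)

# Columns: Estimate, Std. Error, t value, Pr(>|t|)
print(round(summary(fit)$coefficients, 6))
```

The estimates should agree with Table 4 up to rounding, since both describe an ordinary least squares fit of the same model.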

You can also visualize the variable selection procedures with: