To “train a model” involves three components: the training data, a model specification (typically an R formula), and model-fitting software such as `lm()` and `glm()`. In *Lessons in Statistical Thinking* and the corresponding {LSTbook} package, we almost always use `model_train()`. Once the model object has been constructed, you can plot the model, create summaries such as regression reports or ANOVA reports, evaluate the model for new inputs, and so on.
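For instance, a minimal sketch of that workflow, using functions and data that appear later in this document (it assumes the `Galton` data is available once {LSTbook} is loaded, as the examples below suggest):

```r
library(LSTbook)

# Train a model, then plot it and summarize it with a regression report.
mod <- Galton |> model_train(height ~ mother)
mod |> model_plot()
mod |> conf_interval()
```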
`model_train()` is a wrapper around some commonly used model-fitting functions from the {stats} package, particularly `lm()` and `glm()`. It’s worth explaining the motivation for introducing a new model-fitting function:
- `model_train()` is pipeline ready. Example: `Galton |> model_train(height ~ mother)`.
- `model_train()` has internal logic to figure out automatically which type of model (e.g. linear, binomial, poisson) to fit. (You can also specify this with the `family=` argument.) This automatic behavior means, for example, that you can use it with neophyte students for logistic regression without having to introduce a new function.
- `model_train()` saves a copy of the training data as an attribute of the model object being produced. This is helpful in plotting the model, cross-validation, etc., particularly when the model specification involves nonlinear explanatory terms (e.g., `splines::ns(mother, 3)`), as in the sketch following this list.
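A sketch of that last point (the spline term is taken from the example above; the exact plot appearance depends on the package defaults):

```r
# Because the training data ride along with the model object, model_plot()
# can display the fitted spline against the data without re-supplying Galton.
Galton |>
  model_train(height ~ splines::ns(mother, 3)) |>
  model_plot()
```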
As examples, consider these two models:

1. Explaining the `height` of a (fully grown) child by the `sex` of the child and the mother’s and father’s heights. Linear regression is an appropriate technique here.
2. Predicting whether a voter voted in the 2006 primary election (`primary2006`) given the household size (`hhsize`), `yearofbirth`, and whether the voter voted in a previous primary election (`primary2004`). Since having voted is a yes-or-no proposition, logistic regression is an appropriate technique.
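In code, the first model looks like this (a reconstruction: the formula is implied by the coefficient names in the `conf_interval()` report shown later):

```r
# height ~ sex + mother + father matches the terms (Intercept), sexM,
# mother, and father in the regression report below.
height_model <- Galton |> model_train(height ~ sex + mother + father)
```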
And the second:

```r
vote_model <-
  Go_vote |>
    model_train(zero_one(primary2006, one = "voted") ~
                  yearofbirth * primary2004 * hhsize)
```
Note that `zero_one()` marks the response variable as a candidate for logistic regression.
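To see the recoding directly, you might look at it outside the model formula (a sketch; the {dplyr} verbs are an assumption here, not part of {LSTbook}):

```r
library(dplyr)

# The level named in `one=` is recoded to 1; other responses become 0.
Go_vote |>
  mutate(voted_2006 = zero_one(primary2006, one = "voted")) |>
  select(primary2006, voted_2006) |>
  head()
```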
The output of `model_train()` is in the format of whichever {stats} package function has been used, e.g. `lm()` or `glm()`. (The training data is stored as an “attribute,” meaning that it is invisible.) Consequently, you can use the model object as an input to whatever model-plotting or summarizing function you like.
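For instance, the usual base-R reporting tools apply directly (`height_model` is the linear model defined above):

```r
# An ordinary lm object, so base-R summaries work as usual.
height_model |> summary()
height_model |> anova()
```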
In *Lessons in Statistical Thinking* we use {LSTbook} functions for plotting and summarizing: `model_plot()`, `R2()`, `conf_interval()`, `regression_summary()`, and `anova_summary()`.
Let’s apply some of these to the modeling examples introduced above.
```r
height_model |> conf_interval()
#> # A tibble: 4 × 4
#>   term         .lwr .coef  .upr
#>   <chr>       <dbl> <dbl> <dbl>
#> 1 (Intercept) 9.95  15.3  20.7
#> 2 sexM        4.94   5.23  5.51
#> 3 mother      0.260  0.321 0.383
#> 4 father      0.349  0.406 0.463
```
```r
vote_model |> model_plot()
```

```r
vote_model |> R2()
#>        n k   Rsquared       F      adjR2 p df.num df.denom
#> 1 305866 7 0.03799898 1725.91 0.03797696 0      7   305858
```
The `model_eval()` function from this package allows you to provide inputs and receive the model output, with a prediction interval by default. (For logistic regression, only a confidence interval is available.)
```r
vote_model |> model_eval(yearofbirth = c(1960, 1980), primary2004 = "voted", hhsize = 4)
#> Warning in model_eval(vote_model, yearofbirth = c(1960, 1980), primary2004 =
#> "voted", : No prediction interval available, since the response variable is
#> effectively categorical, not quantitative.
#>   yearofbirth primary2004 hhsize   .output
#> 1        1960       voted      4 0.4418285
#> 2        1980       voted      4 0.3128150
```
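By contrast, evaluating a model with a quantitative response returns a prediction interval by default. A sketch (output not shown; the input values are illustrative):

```r
# For the linear height model, model_eval() attaches a prediction interval
# to each .output value by default.
height_model |> model_eval(sex = "F", mother = 63, father = 68)
```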