Introduction to glm.predict

After estimating a non linear model, it is often not straight forward to interpret the model output. An easy way of interpretation is to use predicted probabilities/values as well as discrete changes (the difference between two of the former). We usually want confidence intervals with those values to have an idea how exact they are and if the are significant or not. There are basically two approaches to estimate confidence intervals: simulation and bootstrap. Both approaches have in common that they draw 1000 (or more) coefficients (\(\beta\)s). Those can be used to calculate the predicted value (\(\hat{y}\)) given a specific set a x-values we are interested in. Depending on the model we need an inverse link function to get the resulting probability/value. With the resulting 1000 (or more) values we are then able to calculate not only the mean but also the confidence interval using quantiles.

Simulation and Bootstrap explained

Simulation

Monte Carlo simulation uses the property that the coefficients are asymptotically multivariate normal distributed. Asymptotically means that the number of cases should be high.

In R this is straight forward: The package MASS contains a function mvrnorm(). The parameter n is the number of draws (e.g. 1000), mu are the coefficients and Sigma the variance-covariance matrix.

A quick example:

We want to estimate the predicted probability to participate in the Swiss national election 2015 for a 30 year old woman. As younger women participate as often as young men or even more but older women participate less than older men, we include an interaction between age and gender in our logistic regression.

df_selects = selects2015
logit_model = glm(participation ~ age * gender, family = binomial, data = df_selects)
summary(logit_model)
#> 
#> Call:
#> glm(formula = participation ~ age * gender, family = binomial, 
#>     data = df_selects)
#> 
#> Coefficients:
#>                   Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)      -0.656152   0.126185  -5.200 1.99e-07 ***
#> age               0.036530   0.002728  13.390  < 2e-16 ***
#> genderfemale      0.265886   0.171467   1.551 0.120983    
#> age:genderfemale -0.012976   0.003600  -3.604 0.000313 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 6332.8  on 5197  degrees of freedom
#> Residual deviance: 6007.8  on 5194  degrees of freedom
#>   (139 Beobachtungen als fehlend gelöscht)
#> AIC: 6015.8
#> 
#> Number of Fisher Scoring iterations: 4

Next we simulate 1000 coefficients.

betas = coef(logit_model)
vcov = vcov(logit_model)
betas_simulated = MASS::mvrnorm(1000, betas, vcov)

With the simulated coefficients we can now estimate \(\hat{y}\). x we set to 1 (for the intercept), 30 (for age), 1 (for genderfemale) and 30 (\(1 (female) \cdot30(age)\) for age:genderfemale).

x = c(1, 30, 1, 30)
yhat = betas_simulated %*% x

As the \(\hat{y}\) is on the logit scale, we have to use the inverse link function for logit:

\[ \pi = \frac{e^{\hat{y}}}{1 + e^{\hat{y}}} \]

predProb = exp(yhat) / (1 + exp(yhat))

Now we can calculate the mean and the 95%-confidence interval.

mean(predProb)
#> [1] 0.5780147
quantile(predProb, probs = c(0.025, 0.975))
#>      2.5%     97.5% 
#> 0.5525980 0.6050431

We see that a 30-year old woman has a probability to participate in the national election in Switzerland of about 58% with a confidence interval from 55% to 61%.

Bootstrap

Bootstrap on the other hand estimates the model multiple times with a slightly different dataset each time. The data is sampled from the real dataset with the property that one case can be sampled multiple time. At the end there will be datasets were the an outlier is multiple times in the the data and another where it is not in the data at all. Doing this 1000 times (or more) gives 1000 (or more) coefficients. The benefit of this approach is that it also works with smaller datasets. The downside is that it takes longer as re-estimating the model 1000 times takes time.

We do the same example again but this time with bootstrap. We use again the same model. This time we draw the 1000 coefficients with estimating the model 1000 times.

boot = function(x, model){
  data = model.frame(model)
  data_sample = data[sample(seq_len(nrow(data)), replace = TRUE), ]
  coef(update(model, data = data_sample))
}
betas_boot = do.call(rbind, lapply(1:1000, boot, logit_model))

The rest is identical.

x = c(1, 30, 1, 30)
yhat = betas_boot %*% x
predProb = exp(yhat) / (1 + exp(yhat))
mean(predProb)
#> [1] 0.5782755
quantile(predProb, probs = c(0.025, 0.975))
#>      2.5%     97.5% 
#> 0.5496272 0.6060803

Discrete Changes

For discrete changes the idea is exactly the same. But here we calculate with the drawn coefficients predProb1 and predProb2 with each another x. The 1000 discrete changes are the difference between predProb1 and predProb2. With those we can again calculate mean and confidence intervals.

Using the package

After we covered the idea behind the two approaches, we see how the package helps us doing the job. The two S3-functions basepredict() and dc() do basically exactly the same as the examples above (model is the model, values the x, sim.count how many draws we want (default: 1000), conf.int the confidence interval (default: 0.95), simga allows us to add a corrected variance-covariance matrix, set.seed allows to set a seed and type gives the possibility to choose between simulation and bootstrap). Much more convenient however is the function predicts() which allows to calculate multiple values at once.

The function predicts() needs the model and the wanted values as a character. Other than with the base functions basepredict() and dc() we specify the values for each variable and not for each coefficient. What we can specify depends on the variable type.

Factors / Character

If you want all values of a factor/character you simply can write “all”. If you only want to have certain values, you can tell which one you want with the position of the levels. Lets suppose we have factor country with 26 countries, but we only want the 3rd, the 7th and the 21st country, then we would write “3,7,21”. We can also just use the mode value with “mode”. Another option is “median” where it assumes that the variable is ordinal scaled in the right order. Otherwise the result won’t have a useful meaning.

Numeric

If we have a numeric variable, we have a lot more possibilities. We can again use the mode by writing “mode”, but also the “median” or the “mean”, we can use “min” or “max”, if we want quartile we can write “Q4”, for quintiles “Q5” or in general “Q#” where # stand for any whole number (e.g. “Q2” would add the min, median and max-value). We can also write just any number like “-34.43”, but also two numbers separated with a minus-sign to get all number between the two number (e.g. “1-100” would give 1, 2, 3, …, 100). If you want a different step then 1 you can add the step after a comma (e.g. “-3.1-9.7,0.2” would give -3.1, -2.9, -2.7, …, 9.7). You can also just add multiple number separated by comma (e.g. “3,7.2,9.3”). The last four possibilities we can also surround by a “log()” to include the log of those numbers (e.g. “log(100-1000,100)”). “all” also works here and takes all unique values if the variable.

The parameter position is for discrete changes. If it is null the function return predicted probabilities/values. If we want discrete changes we have to tell for which variable (position). Lets suppose we want to have discrete changes for the 2nd variable, then we would write position=2. The other parameters of the functions are sim.count to change the number of draws, conf.int to change the confidence interval from a 95% to for example a 90% (you would write conf.int = 0.9), simga for a corrected variance-covariance matrix (for example when you correct it for heteroscedasticity), set.seed to set a seed for replication, doPar if you want to run it parallel (multinom() is always sequential) and type to choose between simulation and bootstrap. If not set the choice depends on the number of cases in the dataset with the cut point at 500.

Example

We estimate ordinal logistic regression with the opinion if Switzerland should be part of the European Union (opinion_eu_membership) as dependent variable. As independent variable we take the interaction between elected party (vote_choice) and left-right self position (lr_self) and as control variables age and gender.

df_selects = selects2015
library(MASS)
ologit_model = polr(opinion_eu_membership ~ vote_choice * lr_self + age + gender, 
                    data = df_selects, Hess = TRUE)
summary(ologit_model)
#> Call:
#> polr(formula = opinion_eu_membership ~ vote_choice * lr_self + 
#>     age + gender, data = df_selects, Hess = TRUE)
#> 
#> Coefficients:
#>                             Value Std. Error t value
#> vote_choiceFDP           -0.89529   0.460035 -1.9461
#> vote_choiceBDP           -2.01165   0.685942 -2.9327
#> vote_choiceCVP           -2.00358   0.480561 -4.1693
#> vote_choiceGLP           -1.35016   0.530441 -2.5454
#> vote_choiceSP            -2.02427   0.354073 -5.7171
#> vote_choiceGPS           -2.36429   0.393877 -6.0026
#> vote_choiceother         -1.97721   0.402018 -4.9182
#> lr_self                   0.19294   0.043391  4.4466
#> age                      -0.01926   0.001933 -9.9644
#> genderfemale              0.09259   0.067575  1.3703
#> vote_choiceFDP:lr_self   -0.04832   0.063849 -0.7567
#> vote_choiceBDP:lr_self    0.15831   0.109466  1.4462
#> vote_choiceCVP:lr_self    0.14511   0.071865  2.0192
#> vote_choiceGLP:lr_self   -0.03126   0.100826 -0.3100
#> vote_choiceSP:lr_self     0.10193   0.057970  1.7583
#> vote_choiceGPS:lr_self    0.23998   0.078816  3.0448
#> vote_choiceother:lr_self  0.28552   0.059949  4.7627
#> 
#> Intercepts:
#>                                                             Value    Std. Error
#> Strongly for EU entry|Rather for EU entry                    -4.8574   0.3665  
#> Rather for EU entry|Neither nor                              -2.8368   0.3535  
#> Neither nor|Rather in favour to stay out                     -1.7165   0.3505  
#> Rather in favour to stay out|Strongly in favour to stay out   0.0902   0.3485  
#>                                                             t value 
#> Strongly for EU entry|Rather for EU entry                   -13.2527
#> Rather for EU entry|Neither nor                              -8.0240
#> Neither nor|Rather in favour to stay out                     -4.8978
#> Rather in favour to stay out|Strongly in favour to stay out   0.2589
#> 
#> Residual Deviance: 8018.673 
#> AIC: 8060.673 
#> (2151 Beobachtungen als fehlend gelöscht)

Next we estimate predicted probabilities for the left right positions 0, 5 and 10 for all parties. For age we take the mean, for gender the mode.

df_pred = predicts(ologit_model, "all;0,5,10;mean;mode", set.seed = 1848)
#> Type not specified: Using simulation as n >= 500
head(df_pred)
#>         mean      lower      upper vote_choice lr_self     age gender
#> 1 0.02152058 0.01034651 0.04079810         SVP       0 51.2103   male
#> 2 0.11923718 0.06332958 0.19679025         SVP       0 51.2103   male
#> 3 0.18803061 0.12269316 0.24953023         SVP       0 51.2103   male
#> 4 0.41159766 0.36275411 0.43740129         SVP       0 51.2103   male
#> 5 0.25961397 0.15221299 0.39787545         SVP       0 51.2103   male
#> 6 0.04996451 0.02552328 0.08856728         FDP       0 51.2103   male
#>                            level
#> 1          Strongly for EU entry
#> 2            Rather for EU entry
#> 3                    Neither nor
#> 4   Rather in favour to stay out
#> 5 Strongly in favour to stay out
#> 6          Strongly for EU entry

We get back a data.frame. Next we estimate the discrete change between a SVP and a SP voter for all values of left-right.

df_dc = predicts(ologit_model, "1,6;0-10;mean;mode", position = 1, set.seed = 1848)
#> Type not specified: Using simulation as n >= 500
head(df_dc)
#>    val1_mean val1_lower val1_upper  val2_mean val2_lower val2_upper     dc_mean
#> 1 0.02152058 0.01034651 0.04079810 0.13595763 0.10558222 0.17314105 -0.11443705
#> 2 0.11923718 0.06332958 0.19679025 0.40573368 0.36262648 0.44548660 -0.28649650
#> 3 0.18803061 0.12269316 0.24953023 0.24079835 0.21673367 0.26623913 -0.05276773
#> 4 0.41159766 0.36275411 0.43740129 0.17375975 0.14209936 0.20777842  0.23783791
#> 5 0.25961397 0.15221299 0.39787545 0.04375059 0.03329462 0.05591078  0.21586338
#> 6 0.01760145 0.00919952 0.03127154 0.10492054 0.08351340 0.13076670 -0.08731909
#>     dc_lower    dc_upper vote_choice_val1 vote_choice_val2 lr_self     age
#> 1 -0.1495893 -0.08267517              SVP               SP       0 51.2103
#> 2 -0.3547376 -0.20440982              SVP               SP       0 51.2103
#> 3 -0.1231427  0.01314485              SVP               SP       0 51.2103
#> 4  0.1812114  0.27738706              SVP               SP       0 51.2103
#> 5  0.1051668  0.35372950              SVP               SP       0 51.2103
#> 6 -0.1114845 -0.06549807              SVP               SP       1 51.2103
#>   gender                          level
#> 1   male          Strongly for EU entry
#> 2   male            Rather for EU entry
#> 3   male                    Neither nor
#> 4   male   Rather in favour to stay out
#> 5   male Strongly in favour to stay out
#> 6   male          Strongly for EU entry

We can now plot the discrete change using ggplot2.

library(ggplot2)
# put the levels in the right order
df_dc$level = factor(df_dc$level, levels = levels(df_selects$opinion_eu_membership))
ggplot(df_dc, aes(x = lr_self, y = dc_mean, ymin = dc_lower, ymax = dc_upper)) +
  geom_ribbon(fill=NA, color = "black", linetype = "dashed") +
  geom_line() + facet_wrap(~level) + theme_minimal() + 
  geom_hline(yintercept = 0, col = "red", linetype = "dashed") +
  ylab("discrete change between SP and SVP") + xlab("left-right position") +
  ggtitle("Opinion if Switzerland should join the EU")