Risk scores are sparse linear models that map an integer linear combination of covariates to the probability of an outcome occurring. Unlike regression models, risk score models consist of integer coefficients for often dichotomous variables. This allows risk score predictions to be easily computed by adding or subtracting a few small numbers.

Risk scores developed heuristically by altering logistic regression
models have decreased performance, as there is a fundamental trade-off
between the model’s simplicity and its predictive accuracy. In contrast,
this package presents an optimization approach to learning risk scores,
where the constraints unique to risk score models are integrated into
the model-fitting process, rather than implemented afterward. This
vignette demonstrates how to use the `riskscores`

package to
build a risk score model to predict breast cancer diagnosis.

The `riskscores`

package uses a cyclical coordinate
descent algorithm to solve the following optimization problem.

\[\begin{equation} \begin{aligned} \min_{\alpha,\beta} \quad & \frac{1}{n} \sum_{i=1}^{n} (\gamma y_i x_i^T \beta - log(1 + exp(\gamma x_i^T \beta))) + \lambda_0 \sum_{j=1}^{p} 1(\beta_{j} \neq 0)\\ \textrm{s.t.} \quad & l \le \beta_j \le u \; \; \; \forall j = 1,2,...,p\\ &\beta_j \in \mathbb{Z} \; \; \; \forall j = 1,2,...,p \\ &\beta_0, \gamma \in \mathbb{R} \\ \end{aligned} \end{equation}\]

These constraints ensure that the model will be sparse and include only integer coefficients.

First we’ll load in an example dataset. In this example, we want to
develop a risk score model that predicts whether a breast tissue sample
is benign using features recorded during a biopsy. The
`breastcancer`

dataset was originally accessed from the UCI
Repository and can be loaded into your environment from the
`riskscores`

package as so:

This dataset contains 683 observations and 9 features. Our goal is to
develop a risk score model that predicts whether a breast tissue sample
is benign using 9 (or fewer) features recorded during a biopsy:

- Clump thickness

- Uniformity of cell size

- Uniformity of cell shape

- Marginal adhesion

- Single epithelial cell size

- Bare nuclei

- Bland chromatin

- Normal nucleoli

- Mitoses

Before building a risk score model, data often need to be preprocessed. Specifically, the dataset needs to have a binary outcome with all other variables containing either binary or integer values.

The `breastcancer`

dataset is mostly ready to go. We’ll
still need to split out our data into a matrix with all covariates
(`X`

) and a vector with the outcome data (`y`

). In
this case, the first column in our dataset contains the outcome
variable.

The penalty coefficient \(\lambda_0\) controls the sparsity of the model – a larger value of \(\lambda_0\) will result in fewer non-zero coefficients. We can use cross validation to find the optimal \(\lambda_0\) value that creates a sufficiently sparse model without sacrificing performance.

Ideally, each cross-validation fold should contain an approximately
equal proportion of cases. The `riskscores`

package contains
the function `stratify_folds()`

that creates fold IDs with an
equal proportion of cases in each fold. These fold IDs can be entered
into the `cv_risk_mod()`

function under the
`foldids`

parameter. Otherwise, `cv_risk_mod()`

will set random fold IDs.

The `cv_risk_mod()`

function runs cross validation for a
grid of possible \(\lambda_0\) values.
If the user does not specify the vector of \(\lambda_0\) values to test, the program
constructs this \(\lambda_0\) sequence.
The maximum \(\lambda_0\) in this
sequence is the smallest value such that all coefficients in the
logistic regression model are zero. The minimum \(\lambda_0\) in the sequence is calculated
using the user-defined `lambda_ratio`

argument. The \(\lambda_0\) grid is created by generating
`nlambda`

values linear on the log scale from the minimum
\(\lambda_0\) to the maximum \(\lambda_0\). We’ve set `nlambda`

to 25, so the program will construct an appropriate sequence of 25 \(\lambda_0\) values to test using cross
validation.

Running `plot()`

on a `cv_risk_mod`

object
creates a plot of mean deviance for each \(\lambda_0\) value in the grid. The number
of nonzero coefficients that are produced by each \(\lambda_0\) value when fit on the full data
are listed at the top of the plot. The \(\lambda_0\) value with the lowest mean
deviance (“lambda_min”) is indicated in red, and its standard deviation
is marked with a red dashed line. Its precise value can be accessed by
calling `cv_results$lambda_min`

. If we want a sparser model,
we could increase \(\lambda_0\) to
“lambda_1se”, the largest value whose mean deviance is within one
standard error of “lambda_min”. This value can be accessed by calling
`cv_results$lambda_1se`

. In our example, “lambda_min” creates
a model with 8 non-zero coefficients and “lambda_1se” creates a model
with 3 non-zero coefficients.

To view a dataframe with the full cross-validation results (including
both deviance and accuracy metrics), run
`cv_results$results`

.

```
tail(cv_results$results)
#> # A tibble: 6 × 6
#> lambda0 mean_dev sd_dev mean_acc sd_acc nonzero
#> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 0.0576 33.1 10.8 0.956 0.0103 3
#> 2 0.0845 50.2 20.2 0.934 0.0313 3
#> 3 0.124 47.8 18.4 0.930 0.0239 2
#> 4 0.182 53.7 19.7 0.930 0.0285 2
#> 5 0.267 49.2 23.1 0.931 0.0210 1
#> 6 0.392 81.4 37.9 0.884 0.0650 1
```

We’ll fit a model on the full data using the function
`risk_mod()`

. We’ll use the “lambda_1se” value determined by
cross-validation as our \(\lambda_0\)
parameter.

The integer risk score model can be viewed by calling
`mod$model_card`

. An individual’s risk score can be
calculated by multiplying each covariate response by its respective
number of points and then adding all points together. In our example
below, a patient with a ClumpThickness value of 1, a
UniformityOfCellShape value of 10, and a BareNuclei value of 5 would
receive a score of \(10(1) + 8(10) + 7(5) =
125\).

Points | |
---|---|

ClumpThickness | 10 |

UniformityOfCellShape | 8 |

BareNuclei | 7 |

Each score can then be mapped to a risk probability. The
`mod$score_map`

dataframe maps an integer range of scores to
their associated risk. For this example dataset,
`mod$score_map`

includes a range of integer scores from 25 to
200, which are the minimum and maximum scores predicted from the
training data. The table below shows a sample of these scores mapped to
their associated risk. We can see that a patient who received a score of
125 would have a 77.65% risk of their tissue sample being malignant.

Score | Risk |
---|---|

25 | 0.0014 |

50 | 0.0096 |

75 | 0.0644 |

100 | 0.3284 |

125 | 0.7765 |

150 | 0.9611 |

175 | 0.9943 |

200 | 0.9992 |

The function `get_risk()`

can be used to calculate the
risk from a given score (or a vector of scores). Likewise, the function
`get_score()`

calculates the score associated with a given
risk (or vector of risk probabilities).

We can evaluate the model’s performance under different
classification thresholds using the `get_metrics()`

function.

```
get_metrics(mod, threshold = seq(0.1, 0.9, 0.1))
#> threshold_risk threshold_score accuracy sensitivity specificity
#> 1 0.1 81.1 0.9546120 0.9958159 0.9324324
#> 2 0.2 91.4 0.9648609 0.9748954 0.9594595
#> 3 0.3 98.3 0.9677892 0.9581590 0.9729730
#> 4 0.4 103.9 0.9677892 0.9539749 0.9752252
#> 5 0.5 109.1 0.9648609 0.9372385 0.9797297
#> 6 0.6 114.3 0.9619327 0.9288703 0.9797297
#> 7 0.7 119.9 0.9590044 0.9121339 0.9842342
#> 8 0.8 126.8 0.9458272 0.8744770 0.9842342
#> 9 0.9 137.1 0.9326501 0.8326360 0.9864865
```

Running `summary()`

on our model will return the
intercept, the scores of each nonzero coefficient, the \(\gamma\) multiplier value, the \(\lambda_0\) regularizer value, the
deviance, and the AIC.

A vector containing the risk score model intercept and integer
coefficients can be accessed by calling `coef()`

on the
`risk_mod`

object. This vector is also saved as
`$beta`

within the `risk_mod`

object.

```
coef(mod) # equivalently: mod$beta
#> Intercept ClumpThickness UniformityOfCellSize
#> -109.1192 10.0000 0.0000
#> UniformityOfCellShape MarginalAdhesion SingleEpithelialCellSize
#> 8.0000 0.0000 0.0000
#> BareNuclei BlandChromatin NormalNucleoli
#> 7.0000 0.0000 0.0000
#> Mitoses
#> 0.0000
```

We can map our integer score model to an equivalent logistic
regression model by multiplying the integer and coefficients by \(\gamma\) (saved as `$gamma`

in
the `risk_mod`

object).

```
coef(mod) * mod$gamma
#> Intercept ClumpThickness UniformityOfCellSize
#> -8.5584971 0.7843255 0.0000000
#> UniformityOfCellShape MarginalAdhesion SingleEpithelialCellSize
#> 0.6274604 0.0000000 0.0000000
#> BareNuclei BlandChromatin NormalNucleoli
#> 0.5490279 0.0000000 0.0000000
#> Mitoses
#> 0.0000000
```

The `risk_mod`

object stores a `glm`

object of
this non-integer logistic regression model as `$glm_mod`

.

```
coef(mod$glm_mod)
#> Intercept ClumpThickness UniformityOfCellSize
#> -8.5584971 0.7843255 0.0000000
#> UniformityOfCellShape MarginalAdhesion SingleEpithelialCellSize
#> 0.6274604 0.0000000 0.0000000
#> BareNuclei BlandChromatin NormalNucleoli
#> 0.5490279 0.0000000 0.0000000
#> Mitoses
#> 0.0000000
```

Running `predict()`

on a `risk_mod`

object
allows for three types of prediction, as the `type`

parameter
can be set to either `'link'`

, `'response'`

, or
`'score'`

. These first two options are the same as when
`predict()`

is run on a logistic `glm`

object. The
added `'score'`

option returns each subject’s score, as
calculated from the integer coefficients in the risk score model.

The table below compares the three possible prediction types for five example subjects. The first three columns contain data for clump thickness, uniformity of cell shape, and bare nuclei, respectively.

CT | UCS | BN | ‘score’ | ‘link’ | ‘response’ |
---|---|---|---|---|---|

5 | 1 | 1 | 65 | -3.46 | 0.030 |

5 | 4 | 10 | 152 | 3.36 | 0.967 |

3 | 1 | 2 | 52 | -4.48 | 0.011 |

6 | 8 | 4 | 152 | 3.36 | 0.967 |

4 | 1 | 1 | 55 | -4.24 | 0.014 |

The ‘score’ is a linear combination of the covariates and their integer coefficients:

- \(\text{score} = 10(\text{CT}) + 8(\text{UCS}) + 7(\text{BN})\)

The ‘link’ is a linear combination of the covariates using the full logistic regression equation:

- \(\text{link} = -8.56 + 0.78(\text{CT}) + 0.63(\text{UCS}) + 0.55(\text{BN})\)

The ‘response’ converts these link values to probabilities:

- \(\text{response} = e^{\text{link}}/(1+e^{\text{link}})\)