SPCR regresses a dependent variable onto a few supervised principal
components computed from a large set of predictors. The
**steps** followed by SPCR are the following:

- Regress the dependent variable onto each column of a data set of *p* possible predictors via *p* simple linear regressions. This results in *p* bivariate association measures.
- Define a subset of the original *p* variables by discarding all variables whose bivariate association measure is less than a chosen threshold.
- Use the subset of the original data to estimate *q* PCs.
- Regress the dependent variable onto the *q* PCs.
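The four steps above can be sketched in a few lines of base R. This is an illustration of the logic only, not the package's implementation; the simulated data, variable names, and threshold value are arbitrary.

```r
# Minimal base-R sketch of the four SPCR steps (illustration only)
set.seed(1)
n <- 100; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - X[, 2] + rnorm(n)

# Step 1: p simple linear regressions -> p bivariate association measures
assoc <- apply(X, 2, function(x) abs(coef(summary(lm(y ~ x)))["x", "t value"]))

# Step 2: discard variables whose association is below a chosen threshold
X_sub <- X[, assoc >= 2, drop = FALSE]

# Step 3: estimate q PCs on the retained subset
q <- 2
pcs <- prcomp(X_sub, scale. = TRUE)$x[, 1:q, drop = FALSE]

# Step 4: regress the dependent variable onto the q PCs
fit <- lm(y ~ pcs)
```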

A key aspect of the method is that both the number of PCs and the
threshold value used in step 2 can be determined by cross-validation.
GSPCR **extends** SPCR by allowing the dependent variable
to be of any measurement level (i.e., ratio, interval, ordinal, nominal)
by introducing likelihood-based association measures (or threshold
types) in step 1. Furthermore, GSPCR allows the predictors to be of any
type by combining the PCAmix framework (Kiers, 1991; Chavent et al.,
2017) with SPCR in step 3.
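The step-1 generalization can be illustrated with a toy example: for a binary dependent variable, each predictor's association can be measured by the log-likelihood of a simple logistic regression rather than by a linear-regression statistic. This is a sketch of the idea, not the package's internals.

```r
# Likelihood-based screening for a binary outcome (sketch of the idea)
set.seed(1)
n <- 100
X <- matrix(rnorm(n * 10), n, 10)
y_bin <- rbinom(n, 1, plogis(X[, 1]))

# One simple logistic regression per predictor;
# a larger log-likelihood indicates a stronger bivariate association
ll <- apply(X, 2, function(x) as.numeric(logLik(glm(y_bin ~ x, family = binomial))))
```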

The `gspcr` R package allows you to:

- **tune** the threshold value and the number of PCs in a GSPCR model;
- **plot** the cross-validation trends used to tune the threshold value and the number of PCs to compute;
- **estimate** the GSPCR model on a data set;
- **predict** observations on both the training data and new, previously unseen, data.

Before we do anything else, let us **load the packages** we will need for this vignette. If you don’t have these packages, please install them using `install.packages()`.

We start this vignette by estimating a GSPCR model in a very simple scenario with a continuous dependent variable and a set of continuous predictors. First, we store the **example dataset** `GSPCRexdata` (see the helpfile `?GSPCRexdata` for details) in two separate objects:
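A setup chunk along the following lines does this; the exact components of `GSPCRexdata` accessed below are an assumption, so check `?GSPCRexdata` for the data set's actual structure.

```r
# Load the package and the example data
library(gspcr)

# Store predictors and the continuous dependent variable separately
# (component names are assumed; see ?GSPCRexdata)
X <- GSPCRexdata$X
y <- GSPCRexdata$y$cont
```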

Then, we randomly select a **subset of the data** to use
as a training set. We use 90% of the data as training data.

```
# Set a seed
set.seed(20230415)
# Sample a subset of the data
train <- sample(x = 1:nrow(X), size = nrow(X) * .9)
```

Now we are ready to **use the** `cv_gspcr()` **function** to cross-validate the threshold value and the number of PCs to be used.
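A call might look like the following; the argument names `dv` and `ivs` are assumptions here, so check `?cv_gspcr` for the actual interface and its tuning arguments (e.g., the grid of thresholds and numbers of PCs considered).

```r
# Cross-validate the threshold value and the number of PCs
# (argument names are assumed; see ?cv_gspcr)
trn_res <- cv_gspcr(
  dv  = y[train],
  ivs = X[train, ]
)
```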

We can then **extract** the cross-validated
**solutions** from the resulting object.
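For example, assuming the cross-validation results are stored in an object called `trn_res`, the extraction could look like the line below; `sol_table` is a hypothetical component name, so inspect `str(trn_res)` for the object's actual structure.

```r
# Cross-validated solutions (standard and one-standard-error rules);
# `sol_table` is a hypothetical component name: check str(trn_res)
trn_res$sol_table
```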

```
thr_value thr_number Q
standard -1268.236 2 1
oneSE -1268.236 2 1
```

We can visually **examine the solution paths** produced by the cross-validation procedure by using the `plot()` function.
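Assuming the cross-validation results are stored in an object called `trn_res`, the call can be as simple as the S3 `plot()` method on that object:

```r
# Plot the cross-validation solution paths
plot(trn_res)
```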

In this figure, the Y-axis reports the out-of-sample fit measure obtained with a given threshold value and a given number of principal components. The X-axis reports the threshold values considered, and its title reports the type of threshold used, in this case the simple regression model likelihoods. Each line corresponds to a different number of components considered, with the number of PCs labeled on the line.

Because the fit measure used by default for a continuous dependent variable is the F-statistic, we should look for the highest point on the Y-axis of this plot. This point represents the best K-fold cross-validation fit. As you see, the standard solution reported above matches the one presented in this plot.

Once the cross-validation procedure has identified the values of the threshold and the number of PCs that should be used, we can **estimate the GSPCR model** on the whole training data with the function `est_gspcr()`.
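Assuming the cross-validation results object can be passed on directly (an assumption; see `?est_gspcr` for the actual arguments), the call might look like this:

```r
# Estimate the GSPCR model on the training data
# (passing the cross-validation object directly is an assumption; see ?est_gspcr)
gspcr_est <- est_gspcr(trn_res)
```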

We can now **obtain predictions** for new, unseen data using the `predict()` function:

```
# Predict new data
y_hat <- predict(
  object = gspcr_est,
  newdata = X[-train, ]
)
# Look at the first six predictions
head(y_hat)
```

```
8 23 26 64 70 75
0.16692084 -1.10013695 1.11181527 0.73243982 0.04385471 0.73221417
```
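As a quick sanity check on these predictions, we can compute the out-of-sample mean squared error on the held-out 10% of observations (assuming `y` stores the continuous dependent variable, as set up earlier):

```r
# Out-of-sample prediction error on the held-out observations
mean((y[-train] - y_hat)^2)
```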

Bair, E., Hastie, T., Paul, D., & Tibshirani, R. (2006). Prediction by supervised principal components. Journal of the American Statistical Association, 101(473), 119-137.

Chavent, M., Kuentz-Simonet, V., Labenne, A., & Saracco, J. (2014). Multivariate analysis of mixed data: The R package PCAmixdata. arXiv preprint arXiv:1411.4911.

Kiers, H. A. (1991). Simple structure in component analysis techniques for mixtures of qualitative and quantitative variables. Psychometrika, 56(2), 197-212.