refundBayes::sofr_bayes

2026-03-03
This vignette provides a detailed guide to the
sofr_bayes() function in the refundBayes
package, which fits Bayesian Scalar-on-Function Regression (SoFR) models
using Stan. The function is designed with a syntax similar to
mgcv::gam, making it accessible to users familiar with the
frequentist approach while providing full Bayesian posterior
inference.
The methodology follows the tutorial by Jiang, Crainiceanu, and Cui (2025), Tutorial on Bayesian Functional Regression Using Stan, published in Statistics in Medicine.
The refundBayes Package

The refundBayes package can be installed from
GitHub:
Scalar-on-Function Regression (SoFR) models the relationship between a scalar outcome and one or more functional predictors (curves or trajectories observed over a continuum), along with optional scalar covariates.
For subject \(i = 1, \ldots, n\), let \(Y_i\) be the scalar outcome, \(\mathbf{Z}_i\) be a \(p \times 1\) vector of scalar predictors, and \(\{W_i(t_m), t_m \in \mathcal{T}\}\) for \(m = 1, \ldots, M\) be a functional predictor observed at \(M\) time points over a domain \(\mathcal{T}\). The SoFR model assumes that the distribution of \(Y_i\) belongs to an exponential family with mean \(\mu_i\), and the linear predictor \(\eta_i = g(\mu_i)\) has the following structure:
\[\eta_i = \eta_0 + \int_{\mathcal{T}} W_i(t)\beta(t)\,dt + \mathbf{Z}_i^t \boldsymbol{\gamma}\]
where \(\eta_0\) is the intercept, \(\beta(t)\) is the functional coefficient quantifying the association between \(W_i(t)\) and the outcome, \(\boldsymbol{\gamma}\) is the \(p \times 1\) vector of scalar coefficients, and \(g(\cdot)\) is the link function.
The integral \(\int_{\mathcal{T}} W_i(t)\beta(t)\,dt\) is approximated using a Riemann sum over the observed time points.
The functional coefficient \(\beta(t)\) is represented nonparametrically using a set of \(K\) pre-specified spline basis functions \(\psi_1(t), \ldots, \psi_K(t)\):
\[\beta(t) = \sum_{k=1}^K b_k \psi_k(t)\]
With this expansion, the linear predictor becomes:
\[\eta_i = \eta_0 + \mathbf{X}_i^t \mathbf{b} + \mathbf{Z}_i^t \boldsymbol{\gamma}\]
where \(\mathbf{X}_i = (X_{i1}, \ldots, X_{iK})^t\) is a \(K \times 1\) vector with entries:
\[X_{ik} = \sum_{m=1}^M L_m W_i(t_m) \psi_k(t_m)\]
Here \(L_m = t_{m+1} - t_m\) are the Riemann sum integration weights and \(t_m\) are the time points at which the functional predictor is observed.
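The design construction \(X_{ik} = \sum_{m=1}^M L_m W_i(t_m)\psi_k(t_m)\) can be sketched in a few lines of R. This is a minimal illustration, not package code: the B-spline basis from the base splines package stands in for the mgcv basis that sofr_bayes() uses internally, and all variable names here are hypothetical.

```r
## Sketch of X_{ik} = sum_m L_m W_i(t_m) psi_k(t_m) for an equally spaced grid.
library(splines)

set.seed(1)
n <- 5; M <- 100; K <- 8
tgrid <- seq(0, 1, length.out = M)             # observation times t_m
W <- matrix(rnorm(n * M), n, M)                # functional predictor values W_i(t_m)
L <- rep(diff(tgrid)[1], M)                    # Riemann weights L_m (constant here)
Psi <- bs(tgrid, df = K)                       # M x K basis matrix psi_k(t_m)

## n x K design matrix: weight each W_i(t_m) by L_m, then project onto the basis
X <- (W * matrix(L, n, M, byrow = TRUE)) %*% Psi
dim(X)  # n x K
```

Each row of X then plays the role of \(\mathbf{X}_i\) in the linear predictor \(\eta_i = \eta_0 + \mathbf{X}_i^t \mathbf{b} + \mathbf{Z}_i^t \boldsymbol{\gamma}\).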
tmat, lmat, and wmat

The Riemann sum approximation \(\sum_{m=1}^M L_m W_i(t_m)\psi_k(t_m)\) to the integral \(\int_{\mathcal{T}} W_i(t)\psi_k(t)\,dt\) is constructed directly from three user-supplied matrices in the data argument:
- tmat (an \(n \times M\) matrix): contains the time points \(t_m\) at which the functional predictor is observed; the \((i, m)\)-th entry equals \(t_m\). The range of values in tmat determines the domain of integration \(\mathcal{T}\). For example, if the functional predictor is observed at minutes \(1, 2, \ldots, 1440\) within a day, then tmat has entries ranging from \(1\) to \(1440\) and \(\mathcal{T} = [1, 1440]\). There is no requirement that the domain be rescaled to \([0, 1]\).
- lmat (an \(n \times M\) matrix): contains the integration weights \(L_m = t_{m+1} - t_m\) for the Riemann sum approximation; the \((i, m)\)-th entry equals \(L_m\). For equally spaced time points with spacing \(\Delta t\), every entry of lmat equals \(\Delta t\); for unevenly spaced time points, lmat reflects the varying widths of the integration intervals. These weights, together with tmat, fully specify how the numerical integration is performed and over what domain.
- wmat (an \(n \times M\) matrix): contains the functional predictor values; the \(i\)-th row holds the \(M\) observed values \(W_i(t_1), \ldots, W_i(t_M)\) for subject \(i\).
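Constructing tmat and lmat by hand is straightforward even for an uneven grid. The sketch below is illustrative only (not package code); in particular, the convention of reusing the last interval width for the final weight \(L_M\) is an assumption made here for the illustration.

```r
## Sketch: building tmat and lmat for an unevenly spaced time grid.
n <- 4
tgrid <- c(0, 0.1, 0.3, 0.6, 1.0)              # uneven observation times
M <- length(tgrid)
Lm <- c(diff(tgrid), diff(tgrid)[M - 1])       # L_m = t_{m+1} - t_m; last width reused
tmat <- matrix(tgrid, n, M, byrow = TRUE)      # every row identical across subjects
lmat <- matrix(Lm,    n, M, byrow = TRUE)

tmat[1, ]          # the shared time grid
rowSums(lmat)[1]   # total width covered by the integration weights
```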
In the formula
s(tmat, by = lmat * wmat, bs = "cc", k = 10), the
mgcv infrastructure uses tmat to construct the
spline basis \(\psi_k(t_m)\) at the
observed time points, and the by = lmat * wmat argument
provides the element-wise product \(L_m \cdot
W_i(t_m)\) that enters the Riemann sum. This means the basis
functions are evaluated on the scale of tmat, and the
integration weights in lmat ensure that the discrete sum
correctly approximates the integral over the actual domain \(\mathcal{T}\) — regardless of whether it is
\([0, 1]\), \([1, 1440]\), or any other interval.
Note that for all subjects, the time points are assumed to be
identical so that \(t_{im} = t_m\) for
all \(i = 1, \ldots, n\). Thus every
row of tmat is the same, and every row of lmat
is the same. The matrices are replicated across rows to match the
mgcv syntax, which expects all terms in the formula to have
the same dimensions.
To induce smoothness on \(\beta(t)\), a quadratic penalty on the spline coefficients is applied. The penalty is based on the integrated squared second derivative of \(\beta(t)\):
\[\int \{\beta''(t)\}^2\,dt = \mathbf{b}^t \mathbf{S} \mathbf{b}\]
where \(\mathbf{S} = \int \boldsymbol{\psi}''(t)\{\boldsymbol{\psi}''(t)\}^t\,dt\) is the penalty matrix. In the Bayesian framework, this penalty is equivalent to placing a multivariate normal prior on the spline coefficients:
\[p(\mathbf{b}) \propto \exp\left(-\frac{\mathbf{b}^t \mathbf{S} \mathbf{b}}{\sigma_b^2}\right)\]
where \(\sigma_b^2\) is the smoothing parameter that controls the smoothness of \(\beta(t)\), and is estimated from the data.
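The penalty matrix \(\mathbf{S}\) for a given basis can be inspected directly with mgcv's smoothCon() constructor, the same machinery invoked by the s() formula syntax. A minimal sketch (variable names are illustrative):

```r
## Sketch: extracting the K x K penalty matrix S for a cubic regression spline.
library(mgcv)

tgrid <- seq(0, 1, length.out = 100)
sm <- smoothCon(s(tgrid, bs = "cr", k = 10),
                data = data.frame(tgrid = tgrid))[[1]]
S <- sm$S[[1]]             # 10 x 10 penalty matrix for b' S b
all(abs(S - t(S)) < 1e-8)  # penalty matrices are symmetric
```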
The complete Bayesian SoFR model is:
\[\begin{cases} \mathbf{Y} \sim \text{Exponential Family}(\boldsymbol{\eta}, a) \\ \boldsymbol{\eta} = \eta_0 \mathbf{J}_n + \tilde{\mathbf{X}}_r^t \tilde{\mathbf{b}}_r + \tilde{\mathbf{X}}_f^t \tilde{\mathbf{b}}_f + \mathbf{Z}^t \boldsymbol{\gamma} \\ \tilde{\mathbf{b}}_r \sim N(\mathbf{0}, \sigma_b^2 \mathbf{I}) \\ \eta_0 \sim p(\eta_0);\; \tilde{\mathbf{b}}_f \sim p(\tilde{\mathbf{b}}_f);\; \boldsymbol{\gamma} \sim p(\boldsymbol{\gamma}) \\ \sigma_b^2 \sim p(\sigma_b^2);\; a \sim p(a) \end{cases}\]
where \(\tilde{\mathbf{X}}_r\) and \(\tilde{\mathbf{X}}_f\) are the correspondingly transformed design matrices, and \(p(\cdot)\) denotes non-informative priors (uniform or weakly informative). The smoothing variance \(\sigma_b^2\) is assigned an inverse-Gamma prior \(IG(0.001, 0.001)\).
The sofr_bayes() Function

sofr_bayes(
  formula,
  data,
  family = gaussian(),
  joint_FPCA = NULL,
  intercept = TRUE,
  runStan = TRUE,
  niter = 3000,
  nwarmup = 1000,
  nchain = 3,
  ncores = 1
)

| Argument | Description |
|---|---|
| formula | Functional regression formula, using the same syntax as mgcv::gam. Functional predictors are specified using the s() term with by = lmat * wmat to encode the Riemann sum integration (see the example below). |
| data | A data frame containing all scalar and functional variables used in the model. |
| family | Distribution of the outcome variable. Currently supports gaussian() and binomial(). Default is gaussian(). |
| joint_FPCA | A logical (TRUE/FALSE) vector of the same length as the number of functional predictors, indicating whether to jointly model FPCA for each functional predictor. Default is NULL, which sets all entries to FALSE (no joint FPCA). |
| intercept | Logical. Whether to include an intercept term in the linear predictor. Default is TRUE. |
| runStan | Logical. Whether to run the Stan program. If FALSE, the function only generates the Stan code and data without sampling, which is useful for inspecting or modifying the generated code. Default is TRUE. |
| niter | Total number of posterior sampling iterations (including warmup). Default is 3000. |
| nwarmup | Number of warmup (burn-in) iterations; these samples are discarded and not used for inference. Default is 1000. |
| nchain | Number of Markov chains for posterior sampling; multiple chains help assess convergence. Default is 3. |
| ncores | Number of CPU cores used to run the chains in parallel. Default is 1. |
The function returns a list of class "refundBayes"
containing the following elements:
| Element | Description |
|---|---|
| stanfit | The Stan fit object (class stanfit). Can be used for convergence diagnostics, traceplots, and additional summaries via the rstan package. |
| spline_basis | Basis functions used to reconstruct the functional coefficients from the posterior samples. |
| stancode | A character string containing the generated Stan model code. |
| standata | A list containing the data passed to the Stan model. |
| int | A vector of posterior samples for the intercept term \(\eta_0\). NULL if intercept = FALSE. |
| scalar_coef | A matrix of posterior samples for the scalar coefficients \(\boldsymbol{\gamma}\): each row is one posterior sample and each column corresponds to one scalar predictor. NULL if no scalar predictors are included. |
| func_coef | A list of posterior samples for the functional coefficients. Each element is a matrix in which each row is one posterior sample and each column corresponds to one location on the functional domain. |
| family | The distribution family used for the outcome. |
The formula follows the mgcv::gam syntax. The key component for specifying functional predictors is:

s(tmat, by = lmat * wmat, bs = "cc", k = 10)

where:
- tmat: an \(n \times M\) matrix of time points. Each row contains the same \(M\) observation times (replicated across subjects). The values in tmat define the domain \(\mathcal{T}\) over which the functional predictor is observed and the integral is computed. For example, tmat may contain values from \(1\) to \(100\), from \(0\) to \(1\), or from \(1\) to \(1440\); the integration domain adapts accordingly.
- lmat: an \(n \times M\) matrix of Riemann sum weights. The \((i,m)\)-th entry equals \(L_m = t_{m+1} - t_m\). These weights control the numerical integration and should be consistent with the spacing of the time points in tmat. For equally spaced time points, lmat is a constant matrix; for irregular spacing, it reflects the actual gaps between consecutive time points.
- wmat: an \(n \times M\) matrix of functional predictor values. The \(i\)-th row contains the \(M\) observed values for subject \(i\).
- bs: the type of spline basis. Common choices include "cr" (cubic regression splines) and "cc" (cyclic cubic regression splines, suitable for periodic data).
- k: the number of basis functions (the maximum degrees of freedom for the spline).

Scalar predictors are included as standard formula terms (e.g., y ~ X1 + s(tmat, by = lmat * wmat, bs = "cr", k = 10)).
We demonstrate the sofr_bayes() function using a
simulated example dataset with a binary outcome and one functional
predictor.
## Alternative: load a saved example dataset
# data.SoFR <- readRDS("data/example_data_sofr.rds")

## Simulate the example data
set.seed(123)
n <- 100
M <- 50
tgrid <- seq(0, 1, length.out = M)
dt <- tgrid[2] - tgrid[1]
tmat <- matrix(rep(tgrid, each = n), nrow = n)
lmat <- matrix(dt, nrow = n, ncol = M)
wmat <- t(apply(matrix(rnorm(n * M), n, M), 1, cumsum)) / sqrt(M)
beta_true <- sin(2 * pi * tgrid)
X1 <- rnorm(n)
eta <- 0.5 * X1 + wmat %*% (beta_true * dt)
prob <- plogis(eta)
y <- rbinom(n, 1, prob)
data.SoFR <- data.frame(y = y, X1 = X1)
data.SoFR$tmat <- tmat
data.SoFR$lmat <- lmat
data.SoFR$wmat <- wmat

The example dataset data.SoFR contains:

- y: a binary outcome variable,
- X1: a scalar predictor,
- tmat: the \(n \times M\) time point matrix (defines the domain \(\mathcal{T}\)),
- lmat: the \(n \times M\) Riemann sum weight matrix (defines the integration weights over \(\mathcal{T}\)),
- wmat: the \(n \times M\) functional predictor matrix.

library(refundBayes)
refundBayes_SoFR <- refundBayes::sofr_bayes(
y ~ X1 + s(tmat, by = lmat * wmat, bs = "cc", k = 10),
data = data.SoFR,
family = binomial(),
runStan = TRUE,
niter = 1500,
nwarmup = 500,
nchain = 3,
ncores = 3
)

In this call:

- The model regresses the binary outcome y on one scalar predictor X1 and one functional predictor encoded via s(tmat, by = lmat * wmat, bs = "cc", k = 10).
- The spline basis is evaluated at the time points in tmat, and the integration over the domain \(\mathcal{T}\) (determined by tmat) is approximated using the weights in lmat.
- bs = "cc" uses cyclic cubic regression splines, which are appropriate when the functional predictor is periodic (e.g., physical activity measured over a 24-hour cycle).
- k = 10 specifies 10 basis functions. In practice, 30–40 basis functions are often sufficient for moderately smooth functional data on dense grids.
- family = binomial() specifies a logistic regression for the binary outcome.

The plot() method for refundBayes objects displays the estimated functional coefficient \(\hat{\beta}(t)\) along with pointwise 95% credible intervals:

plot(refundBayes_SoFR)
Posterior summaries of the functional coefficient can be computed
directly from the func_coef element:
## Posterior mean of the functional coefficient
mean_curve <- apply(refundBayes_SoFR$func_coef[[1]], 2, mean)
## Pointwise 95% credible interval
upper_curve <- apply(refundBayes_SoFR$func_coef[[1]], 2,
                     function(x) quantile(x, probs = 0.975))
lower_curve <- apply(refundBayes_SoFR$func_coef[[1]], 2,
                     function(x) quantile(x, probs = 0.025))

The posterior samples in func_coef[[1]] are stored as a \(Q \times M\) matrix, where \(Q\) is the number of posterior samples and \(M\) is the number of time points on the functional domain.
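The same summaries can be visualized with base R graphics. In the sketch below, simulated draws stand in for refundBayes_SoFR$func_coef[[1]] so the snippet is self-contained; the summary and plotting code applies unchanged to the real posterior matrix.

```r
## Sketch: summarizing and plotting a Q x M matrix of posterior draws.
set.seed(1)
Q <- 200; M <- 50
tgrid <- seq(0, 1, length.out = M)
## Simulated draws: column m is centered at sin(2*pi*t_m) (stand-in for func_coef[[1]])
draws <- matrix(rnorm(Q * M, mean = rep(sin(2 * pi * tgrid), each = Q), sd = 0.2),
                nrow = Q, ncol = M)

mean_curve  <- colMeans(draws)
lower_curve <- apply(draws, 2, quantile, probs = 0.025)
upper_curve <- apply(draws, 2, quantile, probs = 0.975)

## Posterior mean with pointwise 95% credible band
plot(tgrid, mean_curve, type = "l",
     ylim = range(lower_curve, upper_curve),
     xlab = "t", ylab = expression(beta(t)))
lines(tgrid, lower_curve, lty = 2)
lines(tgrid, upper_curve, lty = 2)
```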
The Bayesian results can be compared with frequentist estimates
obtained via mgcv::gam:
library(mgcv)
## Loading required package: nlme
## This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
## Fit frequentist SoFR using mgcv
fit_freq <- gam(
y ~ s(tmat, by = lmat * wmat, bs = "cc", k = 10) + X1,
data = data.SoFR,
family = "binomial"
)
## Extract frequentist estimates
freq_result <- plot(fit_freq)

The functional coefficient estimates from the Bayesian and frequentist approaches are generally comparable in shape and magnitude, though the Bayesian credible intervals tend to be slightly wider because they account for uncertainty in the smoothing parameter.
Setting runStan = FALSE allows you to inspect or modify
the Stan code before running the model:
## Generate Stan code without running the sampler
sofr_code <- refundBayes::sofr_bayes(
y ~ X1 + s(tmat, by = lmat * wmat, bs = "cc", k = 10),
data = data.SoFR,
family = binomial(),
runStan = FALSE
)
## Print the generated Stan code
cat(sofr_code$stancode)

##
## data{
## //Total number of observations
## int<lower=1> N_num;
## //Outcome variable
## int Y[N_num];
## real plocation;
## real pscale;
## //Number of scalar predictors
## int<lower=0> K_num;
## //Matrix of scalar predictors
## matrix[N_num,K_num] Z_mat;
## int<lower=0> Kr_1;
## matrix[N_num, Kr_1] X_mat_r_1;
## int<lower=0> Kf_1;
## matrix[N_num, Kf_1] X_mat_f_1;
## }
## transformed data {
## matrix[N_num, K_num] X_sc;
## vector[K_num] mean_Xs;
## for (i in 1:K_num) {
## mean_Xs[i] = mean(Z_mat[, i]);
##
## X_sc[, i] = Z_mat[, i] - mean_Xs[i];
## }
## }
## parameters{
## //Linear predictor intercept
## real eta_0;
## vector[K_num] gamma;
## vector[Kr_1] zbr_1;
## real<lower=0>sigmabr_1;
## vector[Kf_1] bf_1;
## }
## transformed parameters {
## real lprior = 0;
## vector[Kr_1] br_1;
## br_1 = sigmabr_1 * zbr_1;
## }
## model{
## vector[N_num] mu = rep_vector(0.0, N_num);
## //Linear predictor
## mu += eta_0 + X_sc * gamma+ X_mat_r_1 * br_1+ X_mat_f_1 * bf_1;
## for (n in 1:N_num) {
## //Binomial log-likelihood
## target += bernoulli_logit_lpmf(Y[n]|mu[n]);
## }
## target += student_t_lpdf(eta_0 | 3, plocation, pscale);
## target += std_normal_lpdf(zbr_1);
## target += inv_gamma_lpdf(sigmabr_1^2 | 0.0005,0.0005);
## }
Practical recommendations:

- Number of basis functions (k): for illustrative purposes, k = 10 is often used. In practice, 30–40 basis functions are recommended for moderately smooth functional data observed on dense grids.
- Spline basis (bs): use "cr" (cubic regression splines) for general functional predictors and "cc" (cyclic cubic regression splines) when the functional predictor is periodic (e.g., 24-hour activity patterns).
- Sampling iterations: the defaults are niter = 3000 with nwarmup = 1000, but more complex models may require additional iterations.
- Convergence: use diagnostics from the rstan package (e.g., rstan::traceplot(refundBayes_SoFR$stanfit)) to ensure that the Markov chains have converged.
- Joint FPCA: set joint_FPCA = TRUE for the relevant predictor to jointly estimate the FPCA scores and regression coefficients.