Bayesian Scalar-on-Function Regression with refundBayes::sofr_bayes

2026-03-03

Introduction

This vignette provides a detailed guide to the sofr_bayes() function in the refundBayes package, which fits Bayesian Scalar-on-Function Regression (SoFR) models using Stan. The function is designed with a syntax similar to mgcv::gam, making it accessible to users familiar with the frequentist approach while providing full Bayesian posterior inference.

The methodology follows the tutorial by Jiang, Crainiceanu, and Cui (2025), Tutorial on Bayesian Functional Regression Using Stan, published in Statistics in Medicine.

Install the refundBayes Package

The refundBayes package can be installed from GitHub:

library(remotes)
remotes::install_github("ZirenJiang/refundBayes")

Statistical Model

The SoFR Model

Scalar-on-Function Regression (SoFR) models the relationship between a scalar outcome and one or more functional predictors (curves or trajectories observed over a continuum), along with optional scalar covariates.

For subject \(i = 1, \ldots, n\), let \(Y_i\) be the scalar outcome, \(\mathbf{Z}_i\) be a \(p \times 1\) vector of scalar predictors, and \(\{W_i(t_m), t_m \in \mathcal{T}\}\) for \(m = 1, \ldots, M\) be a functional predictor observed at \(M\) time points over a domain \(\mathcal{T}\). The SoFR model assumes that the distribution of \(Y_i\) belongs to an exponential family with mean \(\mu_i\), and the linear predictor \(\eta_i = g(\mu_i)\) has the following structure:

\[\eta_i = \eta_0 + \int_{\mathcal{T}} W_i(t)\beta(t)\,dt + \mathbf{Z}_i^t \boldsymbol{\gamma}\]

where:

- \(\eta_0\) is the population-level intercept;
- \(\beta(t)\) is the unknown functional coefficient of interest;
- \(\boldsymbol{\gamma}\) is the \(p \times 1\) vector of coefficients for the scalar predictors \(\mathbf{Z}_i\);
- \(g(\cdot)\) is the link function of the exponential family.

In practice, the integral \(\int_{\mathcal{T}} W_i(t)\beta(t)\,dt\) is approximated by a Riemann sum over the observed time points.
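As a quick numerical check of this approximation, the following self-contained base-R sketch (toy functions, not taken from refundBayes) compares a Riemann sum on an equally spaced grid with the exact value of the integral:

```r
## Riemann-sum approximation of the integral int_0^1 W(t) beta(t) dt
M     <- 200
tgrid <- seq(0, 1, length.out = M)
dt    <- tgrid[2] - tgrid[1]      # equal spacing, so L_m = dt for all m
W     <- sin(pi * tgrid)          # toy functional predictor W(t)
beta  <- tgrid^2                  # toy functional coefficient beta(t)

riemann <- sum(dt * W * beta)

## Exact value: int_0^1 t^2 sin(pi t) dt = (pi^2 - 4) / pi^3
exact <- (pi^2 - 4) / pi^3
abs(riemann - exact)              # discretization error is tiny
```

Refining the grid (larger M) shrinks the discretization error, which is why densely observed functional predictors make the Riemann sum a safe substitute for the integral.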

Basis Expansion and Penalized Splines

The functional coefficient \(\beta(t)\) is represented nonparametrically using a set of \(K\) pre-specified spline basis functions \(\psi_1(t), \ldots, \psi_K(t)\):

\[\beta(t) = \sum_{k=1}^K b_k \psi_k(t)\]

With this expansion, the linear predictor becomes:

\[\eta_i = \eta_0 + \mathbf{X}_i^t \mathbf{b} + \mathbf{Z}_i^t \boldsymbol{\gamma}\]

where \(\mathbf{X}_i = (X_{i1}, \ldots, X_{iK})^t\) is a \(K \times 1\) vector with entries:

\[X_{ik} = \sum_{m=1}^M L_m W_i(t_m) \psi_k(t_m)\]

Here \(L_m = t_{m+1} - t_m\) are the Riemann sum integration weights and \(t_m\) are the time points at which the functional predictor is observed.
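In matrix form, all \(n \times K\) entries \(X_{ik}\) can be computed at once. The sketch below is self-contained and uses a toy polynomial basis as a stand-in for the spline basis that mgcv constructs internally; all names and dimensions are illustrative:

```r
## Build the n x K design matrix X with entries
## X[i, k] = sum_m L_m * W_i(t_m) * psi_k(t_m),
## using a toy polynomial basis psi_k(t) = t^(k-1) as a stand-in
## for the spline basis built by mgcv.
set.seed(1)
n <- 5; M <- 100; K <- 4
tgrid <- seq(0, 1, length.out = M)
dt    <- tgrid[2] - tgrid[1]

wmat <- matrix(rnorm(n * M), n, M)    # W_i(t_m), one row per subject
lmat <- matrix(dt, n, M)              # L_m, identical across rows
Psi  <- outer(tgrid, 0:(K - 1), `^`)  # M x K basis matrix psi_k(t_m)

X <- (lmat * wmat) %*% Psi            # n x K: all Riemann sums at once
dim(X)
```

The element-wise product `lmat * wmat` is exactly what the `by = lmat * wmat` term supplies in the model formula, so this matrix product mirrors how the design matrix for the functional term is assembled.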

The Role of tmat, lmat, and wmat

The Riemann sum approximation \(\sum_{m=1}^M L_m W_i(t_m)\psi_k(t_m)\) to the integral \(\int_{\mathcal{T}} W_i(t)\psi_k(t)\,dt\) is constructed directly from three user-supplied matrices in the data argument:

- tmat: an \(n \times M\) matrix whose rows contain the observation time points \(t_1, \ldots, t_M\);
- lmat: an \(n \times M\) matrix whose rows contain the Riemann integration weights \(L_1, \ldots, L_M\);
- wmat: an \(n \times M\) matrix whose \(i\)-th row contains the observed functional values \(W_i(t_1), \ldots, W_i(t_M)\).

In the formula s(tmat, by = lmat * wmat, bs = "cc", k = 10), the mgcv infrastructure uses tmat to construct the spline basis \(\psi_k(t_m)\) at the observed time points, and the by = lmat * wmat argument provides the element-wise product \(L_m \cdot W_i(t_m)\) that enters the Riemann sum. This means the basis functions are evaluated on the scale of tmat, and the integration weights in lmat ensure that the discrete sum correctly approximates the integral over the actual domain \(\mathcal{T}\) — regardless of whether it is \([0, 1]\), \([1, 1440]\), or any other interval.

Note that for all subjects, the time points are assumed to be identical so that \(t_{im} = t_m\) for all \(i = 1, \ldots, n\). Thus every row of tmat is the same, and every row of lmat is the same. The matrices are replicated across rows to match the mgcv syntax, which expects all terms in the formula to have the same dimensions.

Smoothness Penalty

To induce smoothness on \(\beta(t)\), a quadratic penalty on the spline coefficients is applied. The penalty is based on the integrated squared second derivative of \(\beta(t)\):

\[\int \{\beta''(t)\}^2\,dt = \mathbf{b}^t \mathbf{S} \mathbf{b}\]

where \(\mathbf{S} = \int \boldsymbol{\psi}''(t)\{\boldsymbol{\psi}''(t)\}^t\,dt\) is the penalty matrix. In the Bayesian framework, this penalty is equivalent to placing a multivariate normal prior on the spline coefficients:

\[p(\mathbf{b}) \propto \exp\left(-\frac{\mathbf{b}^t \mathbf{S} \mathbf{b}}{2\sigma_b^2}\right)\]

where \(\sigma_b^2\) is the smoothing parameter that controls the smoothness of \(\beta(t)\), and is estimated from the data.
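A common discrete analogue of this penalty replaces the second derivative with second differences of the coefficient vector. The self-contained sketch below illustrates the idea (this is not the exact penalty matrix mgcv computes, which integrates the basis second derivatives):

```r
## Discrete analogue of the smoothness penalty: with D the
## second-difference operator, S = D'D gives b' S b equal to the
## sum of squared second differences of b.
K <- 10
D <- diff(diag(K), differences = 2)   # (K-2) x K second-difference matrix
S <- t(D) %*% D                       # K x K penalty matrix

b_rough  <- rnorm(K)                  # wiggly coefficient vector
b_smooth <- seq(0, 1, length.out = K) # linear, so second differences are 0

drop(t(b_smooth) %*% S %*% b_smooth)  # ~0 for a linear b (no penalty)
drop(t(b_rough) %*% S %*% b_rough)    # positive for a rough b
```

A linear coefficient vector incurs no penalty, exactly as a linear \(\beta(t)\) has zero second derivative; rougher vectors are penalized more heavily.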

Full Bayesian Model

The complete Bayesian SoFR model is:

\[\begin{cases} \mathbf{Y} \sim \text{Exponential Family}(\boldsymbol{\eta}, a) \\ \boldsymbol{\eta} = \eta_0 \mathbf{J}_n + \tilde{\mathbf{X}}_r^t \tilde{\mathbf{b}}_r + \tilde{\mathbf{X}}_f^t \tilde{\mathbf{b}}_f + \mathbf{Z}^t \boldsymbol{\gamma} \\ \tilde{\mathbf{b}}_r \sim N(\mathbf{0}, \sigma_b^2 \mathbf{I}) \\ \eta_0 \sim p(\eta_0);\; \tilde{\mathbf{b}}_f \sim p(\tilde{\mathbf{b}}_f);\; \boldsymbol{\gamma} \sim p(\boldsymbol{\gamma}) \\ \sigma_b^2 \sim p(\sigma_b^2);\; a \sim p(a) \end{cases}\]

where \(\tilde{\mathbf{X}}_r\) and \(\tilde{\mathbf{X}}_f\) are the correspondingly transformed design matrices, and \(p(\cdot)\) denotes non-informative priors (uniform or weakly informative). The smoothing variance \(\sigma_b^2\) is assigned an inverse-Gamma prior \(IG(0.001, 0.001)\).
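The prior \(\tilde{\mathbf{b}}_r \sim N(\mathbf{0}, \sigma_b^2 \mathbf{I})\) is implemented in the generated Stan code via a non-centered parameterization, \(\tilde{\mathbf{b}}_r = \sigma_b \mathbf{z}\) with \(\mathbf{z} \sim N(\mathbf{0}, \mathbf{I})\), which typically improves the sampling geometry. A minimal base-R sketch of the equivalence:

```r
## Non-centered parameterization: b = sigma_b * z with z ~ N(0, 1)
## has the same marginal distribution as b ~ N(0, sigma_b^2).
set.seed(1)
sigma_b <- 2
z <- rnorm(1e5)      # standard normal draws
b <- sigma_b * z     # non-centered draws of b

c(mean(b), sd(b))    # close to 0 and sigma_b
```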

The sofr_bayes() Function

Usage

sofr_bayes(
  formula,
  data,
  family = gaussian(),
  joint_FPCA = NULL,
  intercept = TRUE,
  runStan = TRUE,
  niter = 3000,
  nwarmup = 1000,
  nchain = 3,
  ncores = 1
)

Arguments

Argument Description
formula Functional regression formula, using the same syntax as mgcv::gam. Functional predictors are specified using the s() term with by = lmat * wmat to encode the Riemann sum integration (see Example below).
data A data frame containing all scalar and functional variables used in the model.
family Distribution of the outcome variable. Currently supports gaussian() and binomial(). Default is gaussian().
joint_FPCA A logical (TRUE/FALSE) vector with one entry per functional predictor, indicating whether that predictor should be modeled jointly via functional principal component analysis (FPCA). Default is NULL, which sets all entries to FALSE (no joint FPCA).
intercept Logical. Whether to include an intercept term in the linear predictor. Default is TRUE.
runStan Logical. Whether to run the Stan program. If FALSE, the function only generates the Stan code and data without sampling. This is useful for inspecting or modifying the generated Stan code. Default is TRUE.
niter Total number of Bayesian posterior sampling iterations (including warmup). Default is 3000.
nwarmup Number of warmup (burn-in) iterations. These samples are discarded and not used for inference. Default is 1000.
nchain Number of Markov chains for posterior sampling. Multiple chains help assess convergence. Default is 3.
ncores Number of CPU cores to use when executing the chains in parallel. Default is 1.

Return Value

The function returns a list of class "refundBayes" containing the following elements:

Element Description
stanfit The Stan fit object (class stanfit). Can be used for convergence diagnostics, traceplots, and additional summaries via the rstan package.
spline_basis Basis functions used to reconstruct the functional coefficients from the posterior samples.
stancode A character string containing the generated Stan model code.
standata A list containing the data passed to the Stan model.
int A vector of posterior samples for the intercept term \(\eta_0\). NULL if intercept = FALSE.
scalar_coef A matrix of posterior samples for scalar coefficients \(\boldsymbol{\gamma}\), where each row is one posterior sample and each column corresponds to one scalar predictor. NULL if no scalar predictors are included.
func_coef A list of posterior samples for functional coefficients. Each element is a matrix where each row is one posterior sample and each column corresponds to one location on the functional domain.
family The distribution family used for the outcome.

Formula Syntax

The formula follows the mgcv::gam syntax. The key component for specifying functional predictors is:

s(tmat, by = lmat * wmat, bs = "cc", k = 10)

where:

- tmat supplies the time points at which the spline basis functions \(\psi_k(t)\) are evaluated;
- by = lmat * wmat supplies the element-wise product \(L_m W_i(t_m)\) that multiplies the basis in the Riemann sum;
- bs selects the spline basis type (e.g., "cc" for cyclic cubic splines, "cr" for cubic regression splines);
- k sets the number of basis functions \(K\).

Scalar predictors are included as standard formula terms (e.g., y ~ X1 + s(tmat, by = lmat * wmat, bs = "cr", k = 10)).

Example: Bayesian SoFR with Binary Outcome

We demonstrate the sofr_bayes() function using a simulated example dataset with a binary outcome and one functional predictor.

Load Data

## Alternative sample data 

# data.SoFR <- readRDS("data/example_data_sofr.rds")

## Load the example data
set.seed(123)
n <- 100
M <- 50
tgrid <- seq(0, 1, length.out = M)
dt    <- tgrid[2] - tgrid[1]
tmat  <- matrix(rep(tgrid, each = n), nrow = n)
lmat  <- matrix(dt, nrow = n, ncol = M)
wmat  <- t(apply(matrix(rnorm(n * M), n, M), 1, cumsum)) / sqrt(M)
beta_true <- sin(2 * pi * tgrid)
X1 <- rnorm(n)
eta <- 0.5 * X1 + wmat %*% (beta_true * dt)
prob <- plogis(eta)
y <- rbinom(n, 1, prob)
data.SoFR <- data.frame(y = y, X1 = X1)
data.SoFR$tmat <- tmat
data.SoFR$lmat <- lmat
data.SoFR$wmat <- wmat

The example dataset data.SoFR contains:

- y: the binary outcome;
- X1: a scalar predictor;
- tmat, lmat, wmat: \(n \times M\) matrices holding, respectively, the observation time points, the Riemann integration weights, and the functional predictor values.

Fit the Bayesian SoFR Model

library(refundBayes)

refundBayes_SoFR <- refundBayes::sofr_bayes(
  y ~ X1 + s(tmat, by = lmat * wmat, bs = "cc", k = 10),
  data = data.SoFR,
  family = binomial(),
  runStan = TRUE,
  niter = 1500,
  nwarmup = 500,
  nchain = 3,
  ncores = 3
)

In this call:

- family = binomial() fits a logistic SoFR model for the binary outcome y;
- the functional predictor is encoded by s(tmat, by = lmat * wmat, bs = "cc", k = 10), while X1 enters as an ordinary scalar predictor;
- each of the 3 chains runs for 1500 iterations, of which the first 500 are discarded as warmup;
- the 3 chains are executed in parallel on 3 cores.

Visualization

The plot() method for refundBayes objects displays the estimated functional coefficient \(\hat{\beta}(t)\) along with pointwise 95% credible intervals:

library(ggplot2)
plot(refundBayes_SoFR)

Extracting Posterior Summaries

Posterior summaries of the functional coefficient can be computed directly from the func_coef element:

## Posterior mean of the functional coefficient
mean_curve <- apply(refundBayes_SoFR$func_coef[[1]], 2, mean)

## Pointwise 95% credible interval
upper_curve <- apply(refundBayes_SoFR$func_coef[[1]], 2,
                     quantile, probs = 0.975)
lower_curve <- apply(refundBayes_SoFR$func_coef[[1]], 2,
                     quantile, probs = 0.025)

The posterior samples in func_coef[[1]] are stored as a \(Q \times M\) matrix, where \(Q\) is the number of posterior samples and \(M\) is the number of time points on the functional domain.
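The same summary pattern can be exercised on a toy \(Q \times M\) matrix of draws. This sketch is self-contained and independent of the fitted object; the variable names and numbers are made up for illustration:

```r
## Toy Q x M "posterior sample" matrix: Q draws of a curve at M points.
set.seed(42)
Q <- 1000; M <- 50
true_curve <- sin(2 * pi * seq(0, 1, length.out = M))
draws <- matrix(rnorm(Q * M, mean = rep(true_curve, each = Q), sd = 0.1),
                nrow = Q, ncol = M)

curve_mean <- colMeans(draws)                        # pointwise posterior mean
curve_lo   <- apply(draws, 2, quantile, probs = 0.025)
curve_hi   <- apply(draws, 2, quantile, probs = 0.975)

all(curve_lo <= curve_mean & curve_mean <= curve_hi) # TRUE
```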

Comparison with Frequentist Results

The Bayesian results can be compared with frequentist estimates obtained via mgcv::gam:

library(mgcv)
## Loading required package: nlme

## This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
## Fit frequentist SoFR using mgcv
fit_freq <- gam(
  y ~ s(tmat, by = lmat * wmat, bs = "cc", k = 10) + X1,
  data = data.SoFR,
  family = "binomial"
)

## Extract frequentist estimates
freq_result <- plot(fit_freq)

The functional coefficient estimates from the Bayesian and frequentist approaches are generally comparable in shape and magnitude, though the Bayesian credible intervals tend to be slightly wider because they account for uncertainty in the smoothing parameter.

Inspecting the Generated Stan Code

Setting runStan = FALSE allows you to inspect or modify the Stan code before running the model:

## Generate Stan code without running the sampler
sofr_code <- refundBayes::sofr_bayes(
  y ~ X1 + s(tmat, by = lmat * wmat, bs = "cc", k = 10),
  data = data.SoFR,
  family = binomial(),
  runStan = FALSE
)

## Print the generated Stan code
cat(sofr_code$stancode)
## 
## data{ 
##    //Total number of observations  
##    int<lower=1> N_num;
##    //Outcome variable   
##    int Y[N_num];
##    real plocation;
##    real pscale;
##    //Number of scalar predictors   
##    int<lower=0> K_num;
##    //Matrix of scalar predictors   
##    matrix[N_num,K_num] Z_mat;
##    int<lower=0> Kr_1;
##    matrix[N_num, Kr_1] X_mat_r_1;
##    int<lower=0> Kf_1;
##    matrix[N_num, Kf_1] X_mat_f_1;
## }
## transformed data {
##    matrix[N_num, K_num] X_sc;
##    vector[K_num] mean_Xs;
##    for (i in 1:K_num) {
##       mean_Xs[i] = mean(Z_mat[, i]);
## 
##                                  X_sc[, i] = Z_mat[, i] - mean_Xs[i];
##    }
## }
## parameters{ 
##    //Linear predictor intercept  
##    real eta_0;
##    vector[K_num] gamma;
##    vector[Kr_1] zbr_1;
##    real<lower=0>sigmabr_1;
##    vector[Kf_1] bf_1;
## }
## transformed parameters { 
##  real lprior = 0;
##    vector[Kr_1] br_1;
##    br_1 = sigmabr_1 * zbr_1;
## }
## model{ 
##    vector[N_num] mu = rep_vector(0.0, N_num);
##    //Linear predictor
##    mu += eta_0 + X_sc * gamma+ X_mat_r_1 * br_1+ X_mat_f_1 * bf_1;
##    for (n in 1:N_num) {
##      //Binomial log-likelihood
##      target += bernoulli_logit_lpmf(Y[n]|mu[n]);
##    }
##    target += student_t_lpdf(eta_0 | 3, plocation, pscale);
##    target += std_normal_lpdf(zbr_1);
##    target += inv_gamma_lpdf(sigmabr_1^2 | 0.0005,0.0005);
## }

Practical Recommendations

References

Jiang, Z., Crainiceanu, C. M., and Cui, E. (2025). Tutorial on Bayesian Functional Regression Using Stan. Statistics in Medicine.