version 1.3.0
SVEMnet
implements Self-Validated Ensemble Models (SVEM,
Lemkus et al. 2021) and the SVEM whole model test (Karl 2024) using
Elastic Net regression via the glmnet
package Friedman et
al. (2010). This vignette provides an overview of the package’s
functionality and usage.
library(SVEMnet)
# Example data
data <- iris
svem_model <- SVEMnet(Sepal.Length ~ ., data = data, nBoot = 300)
coef(svem_model)
## Percent of Bootstraps Nonzero
## Sepal.Width 1.0000000
## Petal.Length 1.0000000
## Speciesvirginica 0.9333333
## Speciesversicolor 0.9300000
## Petal.Width 0.9166667
Generate a plot of actual versus predicted values:
Predict outcomes for new data using the predict()
function:
## 1 2 3 4 5 6 7 8
## 5.005617 4.740836 4.773051 4.867489 5.058573 5.383305 4.925047 5.026358
## 9 10 11 12 13 14 15 16
## 4.687880 4.895104 5.185226 5.100055 4.768450 4.547358 5.123003 5.500692
## 17 18 19 20 21 22 23 24
## 5.088516 4.978003 5.357963 5.210569 5.173753 5.129998 4.763784 5.037954
## 25 26 27 28 29 30 31 32
## 5.321147 4.888231 5.044827 5.079314 4.952661 4.994143 4.941187 4.971130
## 33 34 35 36 37 38 39 40
## 5.424665 5.376310 4.867489 4.699354 4.931919 5.086187 4.667139 5.026358
## 41 42 43 44 45 46 47 48
## 4.904305 4.268832 4.773051 5.042555 5.477744 4.713222 5.311880 4.846748
## 49 50 51 52 53 54 55 56
## 5.185226 4.899704 6.473589 6.298580 6.540413 5.508716 6.160453 6.141983
## 57 58 59 60 61 62 63 64
## 6.471317 5.128633 6.268637 5.619229 5.064203 5.971576 5.538602 6.314720
## 65 66 67 68 69 70 71 72
## 5.531663 6.199541 6.192668 5.877080 5.769018 5.596159 6.436830 5.773497
## 73 74 75 76 77 78 79 80
## 6.222676 6.316992 6.047545 6.146584 6.335461 6.505926 6.139712 5.381940
## 81 82 83 84 85 86 87 88
## 5.469505 5.423422 5.674457 6.448369 6.192668 6.376878 6.393019 5.803505
## 89 90 91 92 93 94 95 96
## 5.953106 5.614628 5.989988 6.293979 5.695198 5.075677 5.867935 6.054418
## 97 98 99 100 101 102 103 104
## 5.973848 6.047545 4.932883 5.847194 6.965055 6.149726 6.842945 6.651739
## 105 106 107 108 109 110 111 112
## 6.741634 7.358827 5.656858 7.167620 6.587309 7.197621 6.386893 6.297121
## 113 114 115 116 117 118 119 120
## 6.548156 5.942502 6.064612 6.451445 6.630998 7.828559 7.312866 5.921704
## 121 122 123 124 125 126 127 128
## 6.746235 6.027673 7.354226 6.029945 6.854419 7.105397 6.009204 6.188814
## 129 130 131 132 133 134 135 136
## 6.515941 6.907318 6.939655 7.662695 6.488327 6.313138 6.603326 6.935112
## 137 138 139 140 141 142 143 144
## 6.750836 6.683954 6.115116 6.527415 6.591967 6.251095 6.149726 6.893629
## 145 146 147 148 149 150
## 6.743963 6.271836 5.970116 6.354678 6.631055 6.336208
This is the serial version of the significance test. It is slower but the code is less complicated to read than the faster parallel version.
test_result <- svem_significance_test(Sepal.Length ~ ., data = data)
print(test_result)
plot(test_result)
SVEM Significance Test p-value:
[1] 0
Note that there is a parallelized version that runs much faster
# Simulate data
set.seed(1)
n <- 25
X1 <- runif(n)
X2 <- runif(n)
X3 <- runif(n)
X4 <- runif(n)
X5 <- runif(n)
#y only depends on X1 and X2
y <- 1 + X1 + X2 + X1 * X2 + X1^2 + rnorm(n)
data <- data.frame(y, X1, X2, X3, X4, X5)
# Perform the SVEM significance test
test_result <- svem_significance_test_parallel(
y ~ (X1 + X2 + X3)^2 + I(X1^2) + I(X2^2) + I(X3^2),
data = data
)
# View the p-value
print(test_result)
SVEM Significance Test p-value:
[1] 0.009399093
test_result2 <- svem_significance_test_parallel(
y ~ (X1 + X2 )^2 + I(X1^2) + I(X2^2),
data = data
)
# View the p-value
print(test_result2)
SVEM Significance Test p-value:
[1] 0.006475736
#note that the response does not depend on X4 or X5
test_result3 <- svem_significance_test_parallel(
y ~ (X4 + X5)^2 + I(X4^2) + I(X5^2),
data = data
)
# View the p-value
print(test_result3)
SVEM Significance Test p-value:
[1] 0.8968502
# Plot the Mahalanobis distances
plot(test_result,test_result2,test_result3)
There are many particular scenarios that we might be interested in focusing on in order to optimize SVEMnet settings. Perhaps a certain number of factors with a certain number of interactions, etc. However, when setting a default for a software, we want it to work well over a wide range of scenarios that might be encountered.
Our simulations target a response surface model in p factors. For a
selected density \(d\in[0,1]\),
n_active <- max(1, floor(p * d))
of the \(\frac{(p+1)(p+2)}{2}\) parameters in the
RSM are set to rexp(1)-rexp(1)
. There are n
points in the Latin hypercube design. This is not an endorsement of the
Latin hypercube method for designed experiments: it is merely used as a
quick way to generate space filling points for the simulation. It would
also be possible to run the simulation using optimal designs or other
space filling approaches (such as Fast Flexible Filling, Jones and
Lekivetz (2014)). However, for supersaturated settings (where \(n<\frac{(p+1)(p+2)}{2}\)) the optimal
designs would require additional work to specify, and that is not needed
for this simulation.
The models are trained on n
observations and compared to
an independent test set with n_holdout
observations.
# Define vectors for p, d, n, sd
p_values <- seq(3, 6, 1) # Number of parameters
d_values <- seq(0.1, 0.9, 0.1) # Density (proportion of active parameters)
n_values <- seq(15, 50, 5) # Number of design points
sd_values <- c(.25,0.5, 1, 1.5) # Standard deviations of noise
nSim <- 20 # Number of simulations per setting
n_holdout <- 1000 # Number of holdout points
# Create a grid of all combinations of p, d, n, sd
param_grid <- expand.grid(p = p_values, d = d_values, n = n_values, sd = sd_values)
First we compare the log root mean squared error (LRMSE) on the
holdout set for four different models corresponding to the combinations
of objective={"wAIC","wSSE"}
and
debias={TRUE,FALSE}
. Lemkus (2021) uses
objective={"wSSE"}
and debias=FALSE
. JMP uses
objective={"wSSE"}
and debias=TRUE
. Based on
the simulations below, SVEMnet
defaults to
objective={"wAIC"}
and debias=FALSE
. Note that
this is not a commentary on JMP’s settings or a statement about globally
optimal SVEM settings. These are simply the combinations that
SVEMnet
seems to work best with over the tested
scenarios.
objective="wAIC"
and debias=FALSE
performs best in SVEMnet
.
The script for this simulation is available below. Note that this
script was generated with GPT o1-preview
. The first plot
shows the test LRMSE for each of the four models.
# Load necessary libraries
library(SVEMnet)
library(lhs)
library(ggplot2)
library(dplyr)
library(pbapply)
library(parallel)
# Define vectors for p, d, n, sd
p_values <- seq(3, 6, 1) # Number of parameters
d_values <- seq(0.1, 0.9, 0.1) # Density (proportion of active parameters)
n_values <- seq(15, 50, 5) # Number of design points
sd_values <- c(.25,0.5, 1, 1.5) # Standard deviations of noise
nSim <- 20 # Number of simulations per setting
n_holdout <- 1000 # Number of holdout points
# Create a grid of all combinations of p, d, n, sd
param_grid <- expand.grid(p = p_values, d = d_values, n = n_values, sd = sd_values)
# Prepare a list of simulation parameters
sim_params_list <- list()
sim_id <- 1 # Initialize simulation ID
for (i in 1:nrow(param_grid)) {
p <- param_grid$p[i]
d <- param_grid$d[i]
n <- param_grid$n[i]
sd <- param_grid$sd[i]
for (sim in 1:nSim) {
sim_params_list[[length(sim_params_list) + 1]] <- list(
sim_id = sim_id,
p = p,
d = d,
n = n,
sd = sd
)
sim_id <- sim_id + 1
}
}
total_iterations <- length(sim_params_list)
cat("Total simulations to run:", total_iterations, "\n")
# Set up parallel backend using 'parallel' package
num_cores <- 8
RNGkind("L'Ecuyer-CMRG")
set.seed(1234)
cl <- makeCluster(num_cores)
clusterEvalQ(cl, {
library(SVEMnet)
library(lhs)
library(dplyr)
})
# Export necessary variables to the cluster
clusterExport(cl, varlist = c("n_holdout"))
# Set up RNG for parallel processing
clusterSetRNGStream(cl, 1234)
# Function to perform one simulation for a given setting
simulate_one <- function(sim_params) {
sim_id <- sim_params$sim_id
p <- sim_params$p
d <- sim_params$d
n <- sim_params$n
sd <- sim_params$sd
# Generate design data X using LHS
X <- randomLHS(n, p)
colnames(X) <- paste0("V", 1:p)
# Generate holdout data X_T using LHS
X_T <- randomLHS(n_holdout, p)
colnames(X_T) <- paste0("V", 1:p)
# Select active parameters
n_active <- max(1, floor(p * d))
active_params <- sample(1:p, n_active)
# Generate coefficients
beta <- numeric(p)
beta[active_params] <- rexp(n_active) - rexp(n_active)
# Generate response variable y
y <- X %*% beta + rnorm(n, mean = 0, sd = sd)
y <- as.numeric(y) # Convert to vector
# Compute true y for holdout data X_T
y_T_true <- X_T %*% beta
y_T_true <- as.numeric(y_T_true)
# Define formula for SVEMnet
data_train <- data.frame(y = y, X)
data_holdout <- data.frame(X_T)
colnames(data_holdout) <- colnames(X)
# Include interactions and quadratic terms
formula <- as.formula(paste(
"y ~ (", paste(colnames(X), collapse = " + "), ")^2 + ",
paste("I(", colnames(X), "^2)", collapse = " + ")
))
# Initialize data frame to store results for this simulation
sim_results <- data.frame()
# Loop over objectives and debias options
for (objective in c("wAIC", "wSSE")) {
for (debias in c(TRUE, FALSE)) {
# Fit SVEMnet model
model <- SVEMnet(
formula = formula,
data = data_train,
objective = objective,
nBoot = 200,
glmnet_alpha = c(1), # Lasso
weight_scheme = "SVEM"
)
# Predict on holdout data
y_pred_T <- predict(
object = model,
newdata = data_holdout,
debias = debias
)
# Compute RMSE over X_T
RMSE <- sqrt(mean((y_T_true - y_pred_T)^2))
logRMSE <- log(RMSE)
# Record results
sim_results <- rbind(
sim_results,
data.frame(
sim_id = sim_id,
p = p,
d = d,
n = n,
sd = sd,
objective = objective,
debias = debias,
logRMSE = logRMSE
)
)
}
}
return(sim_results)
}
# Run simulations using pblapply with progress bar
results_list <- pblapply(sim_params_list, simulate_one, cl = cl)
# Stop the cluster
stopCluster(cl)
# Combine all results into a data frame
results <- bind_rows(results_list)
# Convert sim_id to integer (since it may come back as a factor)
results$sim_id <- as.integer(as.character(results$sim_id))
# Compute the mean logRMSE for each simulation ID
results <- results %>%
group_by(sim_id) %>%
mutate(sim_mean_logRMSE = mean(logRMSE)) %>%
ungroup()
# Compute residuals
results <- results %>%
mutate(residual_logRMSE = logRMSE - sim_mean_logRMSE)
# Create a new variable that concatenates 'objective' and 'debias'
results <- results %>%
mutate(obj_debias = paste0(objective, "_", debias))
# Compute average residuals for each combination of 'obj_debias'
average_residuals <- results %>%
group_by(obj_debias) %>%
summarise(mean_residual_logRMSE = mean(residual_logRMSE),
.groups = 'drop')
# Print average residuals
print(average_residuals)
# Plot residuals side by side for the four combinations
# Set factor levels for desired order
results$obj_debias <- factor(results$obj_debias,
levels = c("wAIC_TRUE", "wAIC_FALSE", "wSSE_TRUE", "wSSE_FALSE"))
ggplot(results, aes(x = obj_debias, y = logRMSE)) +
geom_boxplot() +
xlab("Objective_Debias Combination") +
ylab("LogRMSE") +
ggtitle("logRMSE over all simulations") +
theme_minimal()
ggplot(results, aes(x = obj_debias, y = residual_logRMSE)) +
geom_boxplot() +
xlab("Objective_Debias Combination") +
ylab("Residual LogRMSE") +
ggtitle("Residuals (after removing simulation mean logRMSE)") +
theme_minimal()
The second simulation compares performance across the
weight_scheme
argument of SVEMnet
.
weight_scheme="Identity"
corresponds to the single-shot
(traditional) Lasso (when glmnet_alpha=1
) fit on the
training data. It is fit with nBoot=1
.
weight_scheme="FWR"
corresponds to fractional weight
regression (Xu et al. (2020)) and uses the same exponential weights for
the training data as weight_scheme="SVEM"
, but it uses the
exact same weights for validation and does not compute anti-correlated
validation weights as SVEM does (Lemkus et al. (2021)).
SVEM
and Identity
are used with
nBoot=200
and all models are fit with
objective="wAIC"
and debias=FALSE
.
p=6
there are 28 parameters in the RSM, and
when d=0.9
, 25 of them are active. Some of the simulations
include as few as 15 runs, so this is an extreme case of fitting a
supersaturated design where a larger-than-expected proportion of the
parameters are active. Interested readers are encouraged to modify the
simulation code to focus on scenarios of more personal interest, perhaps
focusing on less extreme situations.
The script for this simulation is available below. Note that this
script was generated with GPT o1-preview
. The first plot
shows the test LRMSE for each of the three models.
# Load necessary libraries
library(SVEMnet)
library(lhs)
library(ggplot2)
library(dplyr)
library(pbapply)
library(parallel)
library(tidyr) # For pivot_wider
# Define vectors for p, d, n, sd
p_values <- seq(3, 6, 1) # Number of parameters
d_values <- seq(0.1, 0.9, 0.1) # Density (proportion of active parameters)
n_values <- seq(15, 50, 5) # Number of design points
sd_values <- c(0.25, 0.5, 1, 1.5) # Standard deviations of noise
nSim <- 20 # Number of simulations per setting
n_holdout <- 1000 # Number of holdout points
# Create a grid of all combinations of p, d, n, sd
param_grid <- expand.grid(p = p_values, d = d_values, n = n_values, sd = sd_values)
# Prepare a list of simulation parameters
sim_params_list <- list()
sim_id <- 1 # Initialize simulation ID
for (i in 1:nrow(param_grid)) {
p <- param_grid$p[i]
d <- param_grid$d[i]
n <- param_grid$n[i]
sd <- param_grid$sd[i]
for (sim in 1:nSim) {
sim_params_list[[length(sim_params_list) + 1]] <- list(
sim_id = sim_id,
p = p,
d = d,
n = n,
sd = sd
)
sim_id <- sim_id + 1
}
}
total_iterations <- length(sim_params_list)
cat("Total simulations to run:", total_iterations, "\n")
# Set up parallel backend using 'parallel' package
num_cores <- 8
RNGkind("L'Ecuyer-CMRG")
set.seed(0)
cl <- makeCluster(num_cores)
clusterEvalQ(cl, {
library(SVEMnet)
library(lhs)
library(dplyr)
})
# Export necessary variables to the cluster
clusterExport(cl, varlist = c("n_holdout"))
# Set up RNG for parallel processing
clusterSetRNGStream(cl, 0)
# Function to perform one simulation for a given setting
simulate_one <- function(sim_params) {
sim_id <- sim_params$sim_id
p <- sim_params$p
d <- sim_params$d
n <- sim_params$n
sd <- sim_params$sd
# Generate design data X using LHS
X <- randomLHS(n, p)
colnames(X) <- paste0("V", 1:p)
# Generate holdout data X_T using LHS
X_T <- randomLHS(n_holdout, p)
colnames(X_T) <- paste0("V", 1:p)
# Select active parameters
n_active <- max(1, floor(p * d))
active_params <- sample(1:p, n_active)
# Generate coefficients
beta <- numeric(p)
beta[active_params] <- rexp(n_active) - rexp(n_active)
# Generate response variable y
y <- X %*% beta + rnorm(n, mean = 0, sd = sd)
y <- as.numeric(y) # Convert to vector
# Compute true y for holdout data X_T
y_T_true <- X_T %*% beta
y_T_true <- as.numeric(y_T_true)
# Define formula for SVEMnet
data_train <- data.frame(y = y, X)
data_holdout <- data.frame(X_T)
colnames(data_holdout) <- colnames(X)
# Include interactions and quadratic terms
formula <- as.formula(paste(
"y ~ (", paste(colnames(X), collapse = " + "), ")^2 + ",
paste("I(", colnames(X), "^2)", collapse = " + ")
))
# Initialize data frame to store results for this simulation
sim_results <- data.frame()
# Set debias = FALSE and objective = "wAIC"
debias <- FALSE
objective <- "wAIC"
# Loop over model types
for (model_type in c("SVEM", "FWR", "Identity")) {
# Set weight_scheme and nBoot
if (model_type == "SVEM") {
weight_scheme <- "SVEM"
nBoot <- 200
} else if (model_type == "FWR") {
weight_scheme <- "FWR"
nBoot <- 200
} else if (model_type == "Identity") {
weight_scheme <- "Identity"
nBoot <- 1
}
# Fit SVEMnet model
model <- SVEMnet(
formula = formula,
data = data_train,
objective = objective,
nBoot = nBoot,
glmnet_alpha = c(1), # Lasso
weight_scheme = weight_scheme
)
# Predict on holdout data
y_pred_T <- predict(
object = model,
newdata = data_holdout,
debias = debias
)
# Compute RMSE over X_T
RMSE <- sqrt(mean((y_T_true - y_pred_T)^2))
logRMSE <- log(RMSE)
# Record results
sim_results <- rbind(
sim_results,
data.frame(
sim_id = sim_id,
p = p,
d = d,
n = n,
sd = sd,
model_type = model_type,
logRMSE = logRMSE
)
)
}
return(sim_results)
}
# Run simulations using pblapply with progress bar
results_list <- pblapply(sim_params_list, simulate_one, cl = cl)
# Stop the cluster
stopCluster(cl)
# Combine all results into a data frame
results <- bind_rows(results_list)
# Convert sim_id to integer
results$sim_id <- as.integer(as.character(results$sim_id))
# Compute the mean logRMSE for each simulation ID
results <- results %>%
group_by(sim_id) %>%
mutate(sim_mean_logRMSE = mean(logRMSE)) %>%
ungroup()
# Compute residuals
results <- results %>%
mutate(residual_logRMSE = logRMSE - sim_mean_logRMSE)
# Compute average residuals for each model_type
average_residuals <- results %>%
group_by(model_type) %>%
summarise(mean_residual_logRMSE = mean(residual_logRMSE),
.groups = 'drop')
# Print average residuals
print(average_residuals)
# Plot residuals side by side for the three models
results$model_type <- factor(results$model_type, levels = c("SVEM", "FWR", "Identity"))
ggplot(results, aes(x = model_type, y = logRMSE)) +
geom_boxplot() +
xlab("Model Type") +
ylab("LogRMSE") +
ggtitle("LogRMSE by Model") +
theme_minimal()
ggplot(results, aes(x = model_type, y = residual_logRMSE)) +
geom_boxplot() +
xlab("Model Type") +
ylab("Residual LogRMSE") +
ggtitle("Residuals (after removing simulation mean logRMSE)") +
theme_minimal()
# Compute paired differences between models for each simulation
results_wide <- results %>%
select(sim_id, model_type, logRMSE) %>%
pivot_wider(names_from = model_type, values_from = logRMSE)
# Compute differences
results_wide <- results_wide %>%
mutate(
diff_SVEM_FWR = SVEM - FWR,
diff_SVEM_Identity = SVEM - Identity,
diff_FWR_Identity = FWR - Identity
)
# Summarize differences
differences_summary <- results_wide %>%
summarise(
mean_diff_SVEM_FWR = mean(diff_SVEM_FWR, na.rm = TRUE),
mean_diff_SVEM_Identity = mean(diff_SVEM_Identity, na.rm = TRUE),
mean_diff_FWR_Identity = mean(diff_FWR_Identity, na.rm = TRUE),
sd_diff_SVEM_FWR = sd(diff_SVEM_FWR, na.rm = TRUE),
sd_diff_SVEM_Identity = sd(diff_SVEM_Identity, na.rm = TRUE),
sd_diff_FWR_Identity = sd(diff_FWR_Identity, na.rm = TRUE)
)
print(differences_summary)
# Optionally, perform paired t-tests
t_test_SVEM_FWR <- t.test(results_wide$SVEM, results_wide$FWR, paired = TRUE)
t_test_SVEM_Identity <- t.test(results_wide$SVEM, results_wide$Identity, paired = TRUE)
t_test_FWR_Identity <- t.test(results_wide$FWR, results_wide$Identity, paired = TRUE)
# Print t-test results
print(t_test_SVEM_FWR)
print(t_test_SVEM_Identity)
print(t_test_FWR_Identity)
This example shows the newly added wrapper for cv.glmnet() to compare performance of SVEM to glmnet’s native CV implementation. In this example, the factors are mixture factors (generated with rdirichlet()).
In this simulation, both SVEM (with objective=wAIC
) and
cv.glmnet
outperform the single-shot elastic net using AIC.
The cv.glmnet function is slightly outperforming SVEM. This raises a
question of when it is better to use
SVEMnet(...,objective="wAIC")
and when it is better to use
cv.glmnet
. It is again suggested that debiasing harms the
predictive performance on holdout data.
GPT o1-preview
. The plot shows the
residual test LRMSE for each of the models (after subtracting the mean
teast LRMSE of each of the models).
# Install if not already installed
# install.packages("gtools")
library(SVEMnet)
library(ggplot2)
library(dplyr)
library(parallel)
library(tidyr)
library(gtools) # for rdirichlet
# Define vectors for p, d, n, sd, and outlier_prop
p_values <- seq(4, 7, 1) # Number of parameters (mixture components)
d_values <- seq(.1, .9, .1) # Density (proportion of terms to be active)
n_values <- seq(20, 70, 10) # Number of design points
sd_values <- c(0.25, 0.5, 1) # Standard deviations of noise
outlier_prop_values <- c(0,0.1) # Proportions of outliers
nSim <- 5 # Number of simulations per setting
n_holdout <- 2000 # Number of holdout points
mult <- 200 # Oversampling multiple for candidate set generation
# Create a grid of all combinations of p, d, n, sd, outlier_prop
param_grid <- expand.grid(
p = p_values,
d = d_values,
n = n_values,
sd = sd_values,
outlier_prop = outlier_prop_values
)
# Prepare a list of simulation parameters
sim_params_list <- list()
sim_id <- 1
for (i in 1:nrow(param_grid)) {
p <- param_grid$p[i]
d <- param_grid$d[i]
n <- param_grid$n[i]
sd <- param_grid$sd[i]
outlier_prop <- param_grid$outlier_prop[i]
for (sim in 1:nSim) {
sim_params_list[[length(sim_params_list) + 1]] <- list(
sim_id = sim_id,
p = p,
d = d,
n = n,
sd = sd,
outlier_prop = outlier_prop
)
sim_id <- sim_id + 1
}
}
total_iterations <- length(sim_params_list)
cat("Total simulations to run:", total_iterations, "\n")
# Set up parallel backend
num_cores <- 10
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
cl <- makeCluster(num_cores)
clusterSetRNGStream(cl, 1)
clusterEvalQ(cl, {
library(SVEMnet)
library(dplyr)
library(gtools)
})
clusterExport(cl, varlist = c("n_holdout","mult"))
simulate_one <- function(sim_params) {
sim_id <- sim_params$sim_id
p <- sim_params$p
d <- sim_params$d
n <- sim_params$n
sd <- sim_params$sd
outlier_prop <- sim_params$outlier_prop
# Step 1: Generate a large candidate set of points in the simplex
N <- n * mult
candidates <- rdirichlet(N, rep(1, p))
# Step 2: k-means to get n cluster centroids as final design
km_result <- kmeans(candidates, centers = n, nstart = 10)
# Final mixture design
X <- km_result$centers
colnames(X) <- paste0("V", 1:p)
# Use a fourth-order model: (X1 + X2 + ... + Xp)^4
formula_str <- paste("y ~ (", paste(colnames(X), collapse = " + "), ")^",as.character(3))
formula <- as.formula(formula_str)
#model formula
formula_str_model <- paste("y ~ (", paste(colnames(X), collapse = " + "), ")^3")
formula_model <- as.formula(formula_str)
# Temporary data frame with y=0 just to build model matrix
data_train <- data.frame(y = 0, X)
mf <- model.frame(formula, data_train)
X_full <- model.matrix(formula, mf)
# Count how many terms (excluding intercept)
M <- ncol(X_full) - 1
n_active <- max(1, floor(M * d))
# Select active terms from all non-intercept terms
active_terms <- sample(2:(M+1), n_active, replace = FALSE)
# Create coefficients
beta <- numeric(M+1)
beta[active_terms] <- rexp(n_active) - rexp(n_active)
# Generate response y
data_train$y <- drop(X_full %*% beta + rnorm(n, mean = 0, sd = sd))
# Introduce outliers
num_outliers <- ceiling(outlier_prop * n)
if (num_outliers > 0) {
outlier_indices <- sample(seq_len(n), num_outliers, replace = FALSE)
data_train$y[outlier_indices] <- data_train$y[outlier_indices] + rnorm(num_outliers, mean = 0, sd = 3 * sd)
}
# Holdout data
X_T <- rdirichlet(n_holdout, rep(1, p))
colnames(X_T) <- paste0("V", 1:p)
data_holdout <- data.frame(X_T)
# Remove response for holdout
terms_obj <- terms(formula, data = data_train)
terms_obj_no_resp <- delete.response(terms_obj)
mf_T <- model.frame(terms_obj_no_resp, data_holdout)
X_T_full <- model.matrix(terms_obj_no_resp, mf_T)
y_T_true <- drop(X_T_full %*% beta)
# Fit models
model_svem <- SVEMnet(
formula = formula_model,
data = data_train,
standardize = TRUE,
objective = "wAIC",
nBoot = 200,
glmnet_alpha = c(0,0.5,1),
weight_scheme = "SVEM"
)
model_glmnet_cv <- glmnet_with_cv(
formula = formula_model,
data = data_train,
glmnet_alpha = c(0,0.5,1),
standardize = TRUE
)
model_svem_identity <- SVEMnet(
formula = formula_model,
data = data_train,
objective = "wAIC",
standardize=TRUE,
nBoot = 1,
glmnet_alpha = c(0,0.5,1),
weight_scheme = "Identity"
)
pred_list <- list()
# SVEMnet debias=FALSE
y_pred_svem_false <- predict(model_svem, newdata = data_holdout, debias = FALSE)
RMSE_svem_false <- sqrt(mean((y_T_true - y_pred_svem_false)^2))
logRMSE_svem_false <- log(RMSE_svem_false)
pred_list[[length(pred_list)+1]] <- data.frame(
sim_id = sim_id, p = p, d = d, n = n, sd = sd, outlier_prop = outlier_prop,
model = "SVEMnet", debias = FALSE, logRMSE = logRMSE_svem_false
)
# SVEMnet debias=TRUE
y_pred_svem_true <- predict(model_svem, newdata = data_holdout, debias = TRUE)
RMSE_svem_true <- sqrt(mean((y_T_true - y_pred_svem_true)^2))
logRMSE_svem_true <- log(RMSE_svem_true)
pred_list[[length(pred_list)+1]] <- data.frame(
sim_id = sim_id, p = p, d = d, n = n, sd = sd, outlier_prop = outlier_prop,
model = "SVEMnet", debias = TRUE, logRMSE = logRMSE_svem_true
)
# glmnet_cv debias=FALSE
y_pred_glmnet_false <- predict_cv(model_glmnet_cv, newdata = data_holdout, debias = FALSE)
RMSE_glmnet_false <- sqrt(mean((y_T_true - y_pred_glmnet_false)^2))
logRMSE_glmnet_false <- log(RMSE_glmnet_false)
pred_list[[length(pred_list)+1]] <- data.frame(
sim_id = sim_id, p = p, d = d, n = n, sd = sd, outlier_prop = outlier_prop,
model = "glmnet_cv", debias = FALSE, logRMSE = logRMSE_glmnet_false
)
# glmnet_cv debias=TRUE
y_pred_glmnet_true <- predict_cv(model_glmnet_cv, newdata = data_holdout, debias = TRUE)
RMSE_glmnet_true <- sqrt(mean((y_T_true - y_pred_glmnet_true)^2))
logRMSE_glmnet_true <- log(RMSE_glmnet_true)
pred_list[[length(pred_list)+1]] <- data.frame(
sim_id = sim_id, p = p, d = d, n = n, sd = sd, outlier_prop = outlier_prop,
model = "glmnet_cv", debias = TRUE, logRMSE = logRMSE_glmnet_true
)
# SVEMnet_Identity debias=FALSE
y_pred_svem_identity_false <- predict(model_svem_identity, newdata = data_holdout, debias = FALSE)
RMSE_svem_identity_false <- sqrt(mean((y_T_true - y_pred_svem_identity_false)^2))
logRMSE_svem_identity_false <- log(RMSE_svem_identity_false)
pred_list[[length(pred_list)+1]] <- data.frame(
sim_id = sim_id, p = p, d = d, n = n, sd = sd, outlier_prop = outlier_prop,
model = "SVEMnet_Identity", debias = FALSE, logRMSE = logRMSE_svem_identity_false
)
# SVEMnet_Identity debias=TRUE
y_pred_svem_identity_true <- predict(model_svem_identity, newdata = data_holdout, debias = TRUE)
RMSE_svem_identity_true <- sqrt(mean((y_T_true - y_pred_svem_identity_true)^2))
logRMSE_svem_identity_true <- log(RMSE_svem_identity_true)
pred_list[[length(pred_list)+1]] <- data.frame(
sim_id = sim_id, p = p, d = d, n = n, sd = sd, outlier_prop = outlier_prop,
model = "SVEMnet_Identity", debias = TRUE, logRMSE = logRMSE_svem_identity_true
)
sim_results <- bind_rows(pred_list)
return(sim_results)
}
# Run load-balanced parallel computations
results_list <- parLapplyLB(cl, sim_params_list, simulate_one)
stopCluster(cl)
results <- bind_rows(results_list)
results <- results %>%
group_by(sim_id) %>%
mutate(sim_mean_logRMSE = mean(logRMSE)) %>%
ungroup() %>%
mutate(residual_logRMSE = logRMSE - sim_mean_logRMSE)
average_residuals <- results %>%
group_by(model, debias) %>%
summarise(mean_residual_logRMSE = mean(residual_logRMSE), .groups = 'drop')
print(average_residuals)
results$model_debias <- paste0(results$model, "_debias=", results$debias)
results$model_debias <- factor(results$model_debias,
levels = c("SVEMnet_debias=FALSE", "SVEMnet_debias=TRUE",
"glmnet_cv_debias=FALSE", "glmnet_cv_debias=TRUE",
"SVEMnet_Identity_debias=FALSE", "SVEMnet_Identity_debias=TRUE"))
# Your ggplot calls:
ggplot(results, aes(y = model_debias, x = logRMSE, fill = as.factor(debias))) +
geom_boxplot() +
ylab("Model & Debias Setting") +
xlab("LogRMSE") +
ggtitle("LogRMSE by Model and Debias Setting") +
theme_minimal() +
coord_flip() +
scale_fill_discrete(name = "Debias", labels = c("FALSE", "TRUE"))
ggplot(results, aes(y = model_debias, x = residual_logRMSE, fill = as.factor(debias))) +
geom_boxplot() +
ylab("Model & Debias Setting") +
xlab("Residual LogRMSE") +
ggtitle("Residuals (after removing simulation mean)") +
theme_minimal() +
coord_flip() +
scale_fill_discrete(name = "Debias", labels = c("FALSE", "TRUE"))
results_wide <- results %>%
dplyr::select(sim_id, model_debias, logRMSE) %>%
pivot_wider(names_from = model_debias, values_from = logRMSE)
results_wide <- results_wide %>%
mutate(
diff_SVEM_wAIC_debiasTRUE_FALSE = `SVEMnet_debias=TRUE` - `SVEMnet_debias=FALSE`,
diff_glmnet_cv_debiasTRUE_FALSE = `glmnet_cv_debias=TRUE` - `glmnet_cv_debias=FALSE`,
diff_SVEM_glmnet_cv_FALSE = `SVEMnet_debias=FALSE` - `glmnet_cv_debias=FALSE`,
diff_SVEM_glmnet_cv_TRUE = `SVEMnet_debias=TRUE` - `glmnet_cv_debias=TRUE`
)
differences_summary <- results_wide %>%
summarise(
mean_diff_SVEM_debias = mean(diff_SVEM_wAIC_debiasTRUE_FALSE, na.rm = TRUE),
sd_diff_SVEM_debias = sd(diff_SVEM_wAIC_debiasTRUE_FALSE, na.rm = TRUE),
mean_diff_glmnet_debias = mean(diff_glmnet_cv_debiasTRUE_FALSE, na.rm = TRUE),
sd_diff_glmnet_debias = sd(diff_glmnet_cv_debiasTRUE_FALSE, na.rm = TRUE),
mean_diff_SVEM_glmnet_F = mean(diff_SVEM_glmnet_cv_FALSE, na.rm = TRUE),
sd_diff_SVEM_glmnet_F = sd(diff_SVEM_glmnet_cv_FALSE, na.rm = TRUE),
mean_diff_SVEM_glmnet_T = mean(diff_SVEM_glmnet_cv_TRUE, na.rm = TRUE),
sd_diff_SVEM_glmnet_T = sd(diff_SVEM_glmnet_cv_TRUE, na.rm = TRUE)
)
print(differences_summary)
t_test_SVEM_debias <- t.test(results_wide$`SVEMnet_debias=TRUE`, results_wide$`SVEMnet_debias=FALSE`, paired = TRUE)
t_test_glmnet_debias <- t.test(results_wide$`glmnet_cv_debias=TRUE`, results_wide$`glmnet_cv_debias=FALSE`, paired = TRUE)
t_test_SVEM_vs_glmnet_FALSE <- t.test(results_wide$`SVEMnet_debias=FALSE`, results_wide$`glmnet_cv_debias=FALSE`, paired = TRUE)
t_test_SVEM_vs_glmnet_TRUE <- t.test(results_wide$`SVEMnet_debias=TRUE`, results_wide$`glmnet_cv_debias=TRUE`, paired = TRUE)
print(t_test_SVEM_debias)
print(t_test_glmnet_debias)
print(t_test_SVEM_vs_glmnet_FALSE)
print(t_test_SVEM_vs_glmnet_TRUE)
Lemkus, T., Gotwalt, C., Ramsey, P., & Weese, M. L.
(2021). Self-Validated Ensemble Models for Elastic Net
Regression.
Chemometrics and Intelligent Laboratory Systems, 219,
104439.
DOI: 10.1016/j.chemolab.2021.104439
Karl, A. T. (2024). A Randomized Permutation
Whole-Model Test for SVEM.
Chemometrics and Intelligent Laboratory Systems, 249,
105122.
DOI: 10.1016/j.chemolab.2024.105122
Friedman, J. H., Hastie, T., & Tibshirani, R.
(2010). Regularization Paths for Generalized Linear Models
via Coordinate Descent.
Journal of Statistical Software, 33(1), 1–22.
DOI: 10.18637/jss.v033.i01
Gotwalt, C., & Ramsey, P. (2018). Model
Validation Strategies for Designed Experiments Using Bootstrapping
Techniques With Applications to Biopharmaceuticals.
JMP Discovery Conference.
Link
Ramsey, P., Gaudard, M., & Levin, W. (2021).
Accelerating Innovation with Space-Filling Mixture Designs, Neural
Networks, and SVEM.
JMP Discovery Conference.
Link
Ramsey, P., & Gotwalt, C. (2018). Model
Validation Strategies for Designed Experiments Using Bootstrapping
Techniques With Applications to Biopharmaceuticals.
JMP Discovery Summit Europe.
Link
Ramsey, P., Levin, W., Lemkus, T., & Gotwalt, C.
(2021). SVEM: A Paradigm Shift in Design and Analysis of
Experiments.
JMP Discovery Summit Europe.
Link
Ramsey, P., & McNeill, P. (2023). CMC,
SVEM, Neural Networks, DOE, and Complexity: It’s All About
Prediction.
JMP Discovery Conference.
Karl, A., Wisnowski, J., & Rushing, H.
(2022). JMP Pro 17 Remedies for Practical Struggles with
Mixture Experiments.
JMP Discovery Conference.
Link
Xu, L., Gotwalt, C., Hong, Y., King, C. B., & Meeker,
W. Q. (2020). Applications of the Fractional-Random-Weight
Bootstrap.
The American Statistician, 74(4), 345–358.
Link
Karl, A. T. (2024). SVEMnet: Self-Validated
Ensemble Models with Elastic Net Regression.
R package version 1.1.1.
JMP Help Documentation Overview of
Self-Validated Ensemble Models.
Link