Abstract
FLASHMM is a package for analysis of single-cell differential expression (DE) using a linear mixed-effects model (LMM). The mixed-effects models have become a powerful tool in single-cell studies due to their ability to model intra-subject correlation and inter-subject variability. The classic LMM estimation method faces limitations of speed and computer memory in analysis of large scale scRNA-seq data. The package FLASHMM provides a fast and scalable method to address the scalability of the LMM estimation. FLASHMM is fast and requires less computer memory to fit the LMM. This vignette describes the method for LMM estimation and inference and the R functions for implementing the method, and demonstrates the use of the package via an example.Consider the linear mixed-effects model (LMM) as expressed below (Searle, Casella, and McCulloch 2006) \[\begin{equation} \tag{1.1} y = X\beta + Zb + \epsilon, \end{equation}\] where \(y\) is an \(n\times 1\) vector of observed responses, \(X\) is an \(n\times p\) design matrix for fixed effects \(\beta\), \(Z\) is an \(n\times q\) design matrix for random effects \(b\), and \(\epsilon\) is an \(n\times 1\) vector of residual errors. The term of random effects may be a combination of various components, \[ Zb = Z_1 b_1 + \cdots + Z_K b_K, \] where \(Z=[Z_1,\ldots,Z_K]\), \(b=[b^T_1,\ldots,b^T_K]^T\), \(K\) is the number of random-effect components, and \(Z_k\) is an \(n\times q_k\) design matrix for the \(k\)-th random-effect component. The superscript \(T\) denotes a transpose of vector or matrix. The basic assumptions are as follows:
Here \(\mathbf{0}\) is a vector or matrix of zero elements, \(I_n\) is an \(n\times n\) identity matrix, and \(\sigma^2_k\) and \(\sigma^2\) are unknown parameters, called variance components.
Hartley and Rao (1967) developed the maximum likelihood estimation (MLE) method for estimating the LMM parameters (fixed effects and variance components). Patterson and Thompson (1971) proposed a modified maximum likelihood procedure which partitions the data into two mutually uncorrelated parts, one being free of the fixed effects used for estimating variance components, called restricted maximum likelihood (REML) estimator. The REML estimator is unbiased. The MLE of variance components is biased in general. Both methods are asymptotically identical for estimating variance components. With variance components, \(\theta\), estimated, the fixed effects estimated by either MLE or REML are given as follows \[ \hat\beta = (X^TV_{\theta}^{-1}X )^{-1}X^TV_{\theta}^{-1}y, \] with covariance matrix \[var(\hat\beta) = (X^TV_{\theta}^{-1}X)^{-1},\] where \(\theta = (\theta_0, \theta_1,\ldots, \theta_K)\), \(\theta_0 = \sigma^2\), \(\theta_k = \sigma^2_k\), and \[V_{\theta} = \theta_0 I_n + \theta_1 Z_1Z_1^T + \ldots + \theta_K Z_KZ_K^T.\]
Estimating the variance components by either MLE or REML is a difficult numerical problem. Various iterative methods based on the log likelihood, called gradient methods, have been proposed (Searle, Casella, and McCulloch 2006). The gradient methods are represented by the iteration equation \[\begin{equation} \tag{1.2} \theta^{(i+1)} = \theta^{(i)} + \Gamma(\theta^{(i)})\frac{\partial l(\theta^{(i)})}{\partial\theta}, \end{equation}\] where \(\partial l(\theta)/\partial\theta\) is the gradient of the log likelihood function, and \(\Gamma(\theta)\) is a modifier matrix of the gradient direction, which can be specified by Newton–Raphson, Fisher scoring or average information, see Supplementary Materials of the manuscript for the details.
The hypotheses for testing fixed effects and variance components can be respectively defined as \[ H_{0, i}: \beta_i = 0 ~~\text{versus}~~H_{1,i}: \beta_i\ne 0, \] \[ H_{0, k}: \theta_k \leq 0 ~~\text{versus}~~H_{1,k}: \theta_k > 0, \] where \(\theta_k\), \(k=1, \ldots, K\), represent the parameters of variance components \(\sigma^2_k\) but are allowed to be negative. The lower boundary of the parameters, \(\theta\), can be the negative value such that the variance-covariance matrix, \(V_{\theta}\), remains definable (positive definite). Note that the negative lower boundary exists and can be less than \(- \sigma^2/\lambda_{max}\), where \(\lambda_{max} > 0\), is the largest singular value of \(ZZ^T\). If \(\theta_k > 0\), then \(\sigma_k^2 = \theta_k\) is definable and the mixed model is well-specified. Otherwise, if \(\theta_k \le 0\), the mixed model is miss-specified and no random effects are needed.
Allowing the parameters of variance components to take negative value avoids the zero boundary problem in hypothesis testing for the variance components. Consequently, the asymptotic properties of the maximum likelihood estimation hold under regularity conditions, which enables us to use z-statistics or t-statistics for hypothesis testing of fixed effects and variance components. The t-statistics for fixed effects are given by \[\begin{equation} \tag{1.3} T_i = \frac{\hat\beta_i}{\sqrt{var(\hat\beta_i)}} = \frac{\hat\beta_i}{\sqrt{var(\hat\beta)_{ii}}} ~\sim ~t(n - p). \end{equation}\] The t-statistic for a contrast, a linear combination of the estimated fixed effects, \(c^T\hat\beta\), is \[\begin{equation} \tag{1.4} T_c = \frac{c^T\hat\beta}{\sqrt{c^Tvar(\hat\beta) c}} \sim t(n-p). \end{equation}\] The z-statistics for the parameters of variance components are given by \[\begin{equation} \tag{1.5} Z_k = \frac{\hat\theta_k}{\sqrt{[I(\hat\theta)^{-1}]_{kk}}} \sim N(0, 1), \end{equation}\] where \(I(\theta)\) is the Fisher information matrix.
We developed a summary statistics based algorithm for implementing the gradient method (1.2). Let \(XX\), \(XY\), \(ZX\), \(ZY\), and \(ZZ\) denote a matrix, respectively, which define the summary statistics that are computed from cell-level data \(X\), \(Y\) and \(Z\) as follows: \[\begin{equation} \tag{1.6} \begin{array}{l} XX = X^TX, ~XY = X^TY^T,\\ ZX = Z^TX, ~ZY = Z^TY^T, ~ZZ = Z^TZ,\\ Ynorm = [y_1y_1^T, \ldots, y_my_m^T], \end{array} \end{equation}\] where \(Y = [y_1^T, \ldots, y_m^T]^T\) is a \(m\)-by-\(n\) matrix of gene expression profile with each row \(y_i\) corresponding to the expression of gene \(i\), \(i=1,\ldots,m\). Once the summary statistics are precomputed, the summary-level data based algorithm has a complexity of \(O(m(p^3 + q^3))\), which doesn’t depend on number of cells (sample size \(n\)). In single-cell DE analysis, the numbers of fixed and random effects, \(p\) and \(q\), are relatively small. Therefore, the algorithm is fast and scalable, and requires less computer memory. See Supplemental Materials of the manuscript for details.
FLASHMM package provides two functions, lmm and lmmfit, for fitting LMM. The lmm function uses summary statistics as arguments. The lmmfit function is a wrapper function of lmm, which directly uses cell-level data and computes the summary statistics inside the function. The lmmfit function is simple to be operated but it has a limitation of memory use. For large scale data, we should use lmm instead of lmmfit. That is, we precompute the summary-level data by (1.6) or sslmm function in FLASHMM package, and then use lmm function to fit LMM. FLASHMM provides the lmmtest function to perform statistical test on the fixed effects and the contrasts of fixed effects.
In summary, FLASHMM package provides the following functions.
#Install FLASHMM from Github.
devtools::install_github("https://github.com/Baderlab/FLASHMM")
#Load the package.
library(FLASHMM)
Simulate a multi-sample multi-cell-cluster scRNA-seq dataset comprising 25 samples and 4 clusters (cell-types) with 2 treatments.
##(1) Model design
##Y: gene expression profile (log-transformed counts)
##X: design matrix for fixed effects
##Z: design matrix for random effects
Y <- log(counts + 1)
X <- model.matrix(~ 0 + log(libsize) + cls + cls:trt, data = metadata)
Z <- model.matrix(~ 0 + sam, data = metadata)
d <- ncol(Z)
##
##(2) LMM fitting
##Fit LMM by lmmfit using cell-level data.
fit <- lmmfit(Y, X, Z, d = d)
##Fit LMM by lmm using summary-level data computed as follows.
##- Computing summary statistics
n <- nrow(X)
XX <- t(X)%*%X; XY <- t(Y%*%X)
ZX <- t(Z)%*%X; ZY <- t(Y%*%Z); ZZ <- t(Z)%*%Z
Ynorm <- rowSums(Y*Y)
##- Fitting LMM
fitss <- lmm(XX, XY, ZX, ZY, ZZ, Ynorm = Ynorm, n = n, d = d)
identical(fit, fitss)
#> [1] TRUE
##
##Fit LMM by lmm using summary-level data computed by sslmm.
##- Computing summary statistics
ss <- sslmm(X, Y, Z)
##- Fitting LMM
fitss <- lmm(summary.stats = ss, d = d)
identical(fit, fitss)
#> [1] TRUE
##
##(3) Hypothesis tests
test <- lmmtest(fit)
#head(test)
##t-values
all(t(fit$t) == test[, grep("_t", colnames(test))])
#> [1] TRUE
fit$t[, 1:5]
#> Gene1 Gene2 Gene3 Gene4 Gene5
#> log(libsize) 3.5564382 6.1353128 3.6098989 4.1239445 2.8179995
#> clsC1 -2.4158938 -4.8697164 -2.3122737 -2.6709945 -1.3220206
#> clsC2 -2.5957105 -4.9918921 -2.2467924 -2.5108410 -1.2387859
#> clsC3 -2.5576263 -5.0249733 -2.1484996 -2.5856150 -1.2535998
#> clsC4 -2.4466344 -5.0065480 -2.2202673 -2.6983076 -1.2058758
#> clsC1:trtB 0.8697778 0.3659413 0.1515357 0.9707563 -0.9577914
#> clsC2:trtB 2.0700456 1.0976568 -0.1112106 0.7423556 -0.5997369
#> clsC3:trtB 1.5066385 1.5001771 -0.3659261 0.8883383 -1.0410003
#> clsC4:trtB -0.3263183 1.6810047 -0.4559326 0.6462646 -1.7515738
##
##p-values
all(t(fit$p) == test[, grep("_p", colnames(test))])
#> [1] TRUE
fit$p[, 1:5]
#> Gene1 Gene2 Gene3 Gene4 Gene5
#> log(libsize) 0.0003936946 1.226867e-09 0.0003216502 4.036515e-05 0.004928509
#> clsC1 0.0158766072 1.300111e-06 0.0209667982 7.686618e-03 0.186466367
#> clsC2 0.0095791682 7.060831e-07 0.0248727531 1.220264e-02 0.215718133
#> clsC3 0.0106867912 5.971329e-07 0.0319158551 9.862323e-03 0.210283172
#> clsC4 0.0145925607 6.556356e-07 0.0266262016 7.087769e-03 0.228153191
#> clsC1:trtB 0.3846324624 7.144869e-01 0.8795840262 3.319065e-01 0.338401544
#> clsC2:trtB 0.0387066712 2.726210e-01 0.9114719020 4.580478e-01 0.548818677
#> clsC3:trtB 0.1322220329 1.338870e-01 0.7144983040 3.745743e-01 0.298129310
#> clsC4:trtB 0.7442524470 9.307711e-02 0.6485383571 5.182577e-01 0.080156444
We use a simulated multi-sample multi-cell-type scRNA-seq dataset to illustrate how to utilize FLASHMM to perform single-cell differential expression analysis. We are interested in identifying the genes differentially expressed between two treatments (conditions) within a cell type.
We generate a multi-sample multi-cell-type scRNA-seq dataset using a reference dataset by simuRNAseq function in FLASHMM package. We use PBMC 10X droplet-based scRNA-seq dataset (Kang et al. 2018) as the reference dataset, which contains count data and metadata. The metadata should include 3 columns: individuals (subjects or samples), cell types (clusters), and treatments (conditions) which are named as ‘sam’, ‘cls’, and ‘trt’, respectively.
Note that the simuRNAseq function can also generate the scRNA-seq dataset without a reference dataset. In this case, the following step for preparing the reference dataset can be skipped.
To load the PBMC dataset, we need ExperimentHub package.
library(ExperimentHub)
##Load data.
eh <- ExperimentHub()
#query(eh, "Kang")
sce <- eh[["EH2259"]]
##Remove undetected genes.
sce <- sce[rowSums(counts(sce) > 0) > 0, ]
##Remove cells with few or many detected genes (outliers).
nGenes <- colSums(counts(sce) > 0)
bx <- boxplot(log(nGenes), plot = FALSE)
sce <- sce[, nGenes >= exp(bx$stats[1]) & nGenes <= exp(bx$stats[5])]
##Remove lowly expressed genes.
##counts per cell (cpc)
cpc <- rowSums(counts(sce))/ncol(sce)
sce <- sce[(rowSums(counts(sce) > 1) >= 10) & (cpc > 0.005), ]
##counts and metadata
counts <- assay(sce, "counts")
coldata <- as.data.frame(colData(sce))
head(coldata)
#> ind stim cluster cell multiplets
#> AAACATACAATGCC-1 107 ctrl 5 CD4 T cells doublet
#> AAACATACATTTCC-1 1016 ctrl 9 CD14+ Monocytes singlet
#> AAACATACCAGAAA-1 1256 ctrl 9 CD14+ Monocytes singlet
#> AAACATACCAGCTA-1 1256 ctrl 9 CD14+ Monocytes doublet
#> AAACATACCATGCA-1 1488 ctrl 3 CD4 T cells singlet
#> AAACATACCTCGCT-1 1256 ctrl 9 CD14+ Monocytes singlet
all(colnames(counts) == rownames(coldata))
#> [1] TRUE
dim(counts)
#> [1] 7483 28721
rm(eh, sce)
We use the reference dataset to generate a scRNA-seq dataset by simuRNAseq function. First we need to specify which columns represent samples (individuals), cell clusters (types), and treatments (experimental conditions) in the reference metadata by ‘sam’, ‘cls’, and ‘trt’, respectively. We also specify the numbers of genes, cells and differentially expressed (DE) genes to be generated. We use default settings for other arguments in simuRNAseq function. The generated dataset contains count data, metadata, and the DE genes specific to a cell cluster. The DE genes can be considered as the positive controls and the others the negative controls. Without confusion, DE can represent either ‘differentially expressed’ or ‘differential expression’.
##Specify which columns represent samples, treatments, and cell-types.
colnames(coldata)[colnames(coldata) == "ind"] <- "sam" #samples
colnames(coldata)[colnames(coldata) == "stim"] <- "trt" #treatments
colnames(coldata)[colnames(coldata) == "cell"] <- "cls" #cell-types
coldata <- coldata[, c("sam", "trt", "cls")]
head(coldata)
#> sam trt cls
#> AAACATACAATGCC-1 107 ctrl CD4 T cells
#> AAACATACATTTCC-1 1016 ctrl CD14+ Monocytes
#> AAACATACCAGAAA-1 1256 ctrl CD14+ Monocytes
#> AAACATACCAGCTA-1 1256 ctrl CD14+ Monocytes
#> AAACATACCATGCA-1 1488 ctrl CD4 T cells
#> AAACATACCTCGCT-1 1256 ctrl CD14+ Monocytes
##
##Generate the dataset by simuRNAseq function.
set.seed(2412)
dat <- simuRNAseq(counts, nGenes = 100, nCells = 120000, metadata = coldata,
nsam = 25, ncls = 10, ntrt = 2, nDEgenes = 10,
minbeta = 0.5, maxbeta = 1, var.randomeffects = 0.1)
str(dat)
#> List of 5
#> $ ref.mean.dispersion:'data.frame': 7483 obs. of 2 variables:
#> ..$ mu : num [1:7483] 0.0718 0.2008 19.708 0.0725 0.0755 ...
#> ..$ dispersion: num [1:7483] 0.5183 0.0814 0.1902 0.0963 0.0483 ...
#> $ metadata :'data.frame': 120000 obs. of 4 variables:
#> ..$ sam : int [1:120000] 1256 101 1488 1016 1244 1244 107 1256 101 1256 ...
#> ..$ trt : Factor w/ 2 levels "ctrl","stim": 1 2 2 1 1 2 2 1 2 2 ...
#> ..$ cls : Factor w/ 8 levels "B cells","CD14+ Monocytes",..: 3 3 3 4 3 3 3 4 3 3 ...
#> ..$ libsize: num [1:120000] 909 1459 1030 1680 856 ...
#> $ counts : num [1:100, 1:120000] 0 0 0 0 0 0 0 0 0 0 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:100] "EPC2" "TNFRSF18" "TMEM87A" "SIRT6" ...
#> .. ..$ : chr [1:120000] "GCAGGGCTATCGTG-1" "AGTCGAACAGAGTA-1" "CCTAAACTAACAGA-1" "AAATGGGAATTCCT-1" ...
#> $ DEgenes :'data.frame': 10 obs. of 3 variables:
#> ..$ gene : chr [1:10] "RCE1" "CD68" "FUNDC2_ENSG00000165775" "PAIP2" ...
#> ..$ beta : num [1:10] 0.714 0.689 -0.77 -0.716 -0.932 ...
#> ..$ cluster: chr [1:10] "B cells" "B cells" "CD14+ Monocytes" "CD4 T cells" ...
#> $ treatment : chr "stim"
##Remove the reference dataset that is no longer needed in the following analysis.
rm(counts, coldata)
We perform differential expression analysis of the simulated dataset using FLASHMM package. We are to identify the significantly differentially expressed genes between two treatments in a cell-type. The analyses involve following steps: LMM design, LMM fitting, and exploring LMM fit and the DE genes.
LMM design: construct design matrices for fixed and random effects described as (1.1), and compute the gene expression profile. The gene expression is taken as log-transformed count matrix, \[Y = \log(1+\text{counts}),\] in which each row corresponds to the expression profile for a gene. The design matrix for the fixed effects can be created by the function model.matrix, as follows \[ X = model.matrix(\sim 0 + log(library.size) + cell.type + cell.type:treatment), \] where the interaction term \(cell.type:treatment\) represents the treatment effect in a specific cell-type. \(library.size\) is defined as the total sum of counts across all genes for each cell, a normalization factor of the scRNA-seq counts. We consider the subjects (samples) as random effects that reflect the intra-subject correlation and inter-subject variability. The design matrix for the random effects is given by \[ Z = model.matrix(\sim 0 + subject). \]
LMM fitting: We use either lmm or lmmfit function to fit LMMs for all genes. With the cell-level data matrices \(Y\), \(X\) and \(Z\), the LMMs can be fit by \[lmmfit(Y, X, Z, d = d),\] where \(d\) is the number of random-effects. For \(K\) components of random-effects, \(d = (q_1, \ldots, q_K)\), where \(q_k\) is the number of columns of the design matrix \(Z_k\) for the \(k\)-th random-effect component. For a large-scale data, the lmmfit function has a problem of computer memory limit. In this case, it is recommended to pre-compute the summary-level data: \(XX\), \(XY\), \(ZX\), \(ZY\), \(ZZ\), and \(Y_{norm}\), defined in (1.6), and then use the lmm function to fit the LMMs as follows: \[lmm(XX, XY, ZX, ZY, ZZ, Ynorm = Ynorm, n = n, d = d).\] The summary-level data doesn’t depend on the sample size \(n\). This makes lmm memory-efficient. Both lmm and lmmfit functions also perform hypothesis testing on the fixed effects. We can use lmmtest function to further perform statistical test on the contrasts of fixed effects.
LMM fitting returns a list of estimates of LMM parameters (coefficients and variance components), standard errors of the estimates, covariance matrix of the coefficients (fixed effects), and t-values and p-values for hypothesis testing on the coefficients, for each gene. The LMM fitting also returns the numbers of iterations and the first partial derivatives of log likelihood at the termination of iterations for each gene.
Exploring LMM fit and the DE genes: We check if the LMM fitting is convergent, and perform hypothesis testing on the variance components of random effects. Then we identify the DE genes based on hypothesis testing p-values for the coefficients (fixed effects). If the absolute first partial derivatives of log likelihood are all less than the convergence tolerance, the LMM fitting converges, otherwise it doesn’t converge. The genes for which LMM fitting doesn’t converge should be excluded in the subsequent analysis because the estimated coefficients for these genes are not reliable. The DE genes can be identified by adjusting p-values obtained by false discovery rate (FDR) or family-wise error rate (Bonferroni correction). We might exclude the genes with small coefficients (effect size or log-fold-change).
##Gene expression matrix, Y = log2(1 + counts)
Y <- log2(1 + dat$counts)
dat$counts <- NULL #Remove the counts.
##Design matrix for fixed effects
X <- model.matrix(~ 0 + log(libsize) + cls + cls:trt, data = dat$meta)
colnames(X) <- gsub("\\+", "p", colnames(X))
colnames(X) <- gsub(" ", "_", colnames(X))
##Design matrix for random effects
Z <- model.matrix(~ 0 + as.factor(sam), data = dat$meta)
##Dimension of random effects
d <- ncol(Z)
##(1) Fit LMM by cell-level data.
max.iter <- 100
epsilon <- 1e-5
fit <- lmmfit(Y, X, Z, d = d, max.iter = max.iter, epsilon = epsilon)
str(fit)
#> List of 12
#> $ method : chr "REML-FS"
#> $ dlogL : num [1:2, 1:100] 6.60e-06 -9.31e-09 7.68e-06 -1.01e-08 3.24e-08 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : NULL
#> .. ..$ : chr [1:100] "dl" "dl" "dl" "dl" ...
#> $ niter : num [1:100] 7 9 6 6 13 11 9 8 7 8 ...
#> $ coef : num [1:17, 1:100] 0.021 -0.134 -0.13 -0.132 -0.132 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:17] "log(libsize)" "clsB_cells" "clsCD14p_Monocytes" "clsCD4_T_cells" ...
#> .. ..$ : chr [1:100] "EPC2" "TNFRSF18" "TMEM87A" "SIRT6" ...
#> $ se : num [1:17, 1:100] 0.00109 0.00817 0.00862 0.00799 0.00811 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:17] "log(libsize)" "clsB_cells" "clsCD14p_Monocytes" "clsCD4_T_cells" ...
#> .. ..$ : chr [1:100] "EPC2" "TNFRSF18" "TMEM87A" "SIRT6" ...
#> $ t : num [1:17, 1:100] 19.3 -16.4 -15.1 -16.6 -16.3 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:17] "log(libsize)" "clsB_cells" "clsCD14p_Monocytes" "clsCD4_T_cells" ...
#> .. ..$ : chr [1:100] "EPC2" "TNFRSF18" "TMEM87A" "SIRT6" ...
#> $ p : num [1:17, 1:100] 5.62e-83 2.39e-60 2.89e-51 1.47e-61 8.16e-60 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:17] "log(libsize)" "clsB_cells" "clsCD14p_Monocytes" "clsCD4_T_cells" ...
#> .. ..$ : chr [1:100] "EPC2" "TNFRSF18" "TMEM87A" "SIRT6" ...
#> $ cov : num [1:17, 1:17, 1:100] 1.18e-06 -8.37e-06 -9.02e-06 -8.35e-06 -8.27e-06 ...
#> ..- attr(*, "dimnames")=List of 3
#> .. ..$ : chr [1:17] "log(libsize)" "clsB_cells" "clsCD14p_Monocytes" "clsCD4_T_cells" ...
#> .. ..$ : chr [1:17] "log(libsize)" "clsB_cells" "clsCD14p_Monocytes" "clsCD4_T_cells" ...
#> .. ..$ : chr [1:100] "EPC2" "TNFRSF18" "TMEM87A" "SIRT6" ...
#> $ df : int 119983
#> $ theta : num [1:2, 1:100] 3.08e-05 2.16e-02 1.05e-04 8.07e-02 3.31e-04 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:2] "var1" "var0"
#> .. ..$ : chr [1:100] "EPC2" "TNFRSF18" "TMEM87A" "SIRT6" ...
#> $ se.theta: num [1:2, 1:100] 1.75e-05 8.84e-05 6.02e-05 3.29e-04 1.80e-04 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:2] "var1" "var0"
#> .. ..$ : chr [1:100] "EPC2" "TNFRSF18" "TMEM87A" "SIRT6" ...
#> $ RE : NULL
##
##(2) Fit LMM by summary-level data.
##- Compute the summary-level data.
n <- nrow(X)
XX <- t(X)%*%X
XY <- t(Y%*%X)
ZX <- t(Z)%*%X
ZY <- t(Y%*%Z)
ZZ <- t(Z)%*%Z
Ynorm <- rowSums(Y*Y)
rm(X, Y, Z) #release the memory.
##- Fit LMM.
fitss <- lmm(XX, XY, ZX, ZY, ZZ, Ynorm = Ynorm, n = n, d = d,
max.iter = max.iter, epsilon = epsilon)
identical(fit, fitss)
#> [1] TRUE
rm(fitss)
##Test the treatment effect within all cell-types.
##- Construct a contrast by summing the treatment effects across cell-types.
contrast <- cbind("trt" = numeric(nrow(fit$coef)))
contrast[grep(":", rownames(fit$coef)), ] <- 1
##- Test the contrast.
test <- lmmtest(fit, contrast = contrast)
head(test)
#> trt_t trt_p
#> EPC2 -0.9250580 0.3549376
#> TNFRSF18 -0.4655622 0.6415298
#> TMEM87A 0.1783232 0.8584694
#> SIRT6 0.9348075 0.3498894
#> IL12RB1 -0.4848279 0.6277993
#> RBFA -1.3544274 0.1756026
##(1) Check which LMM fittings converge.
cvg <- (apply(abs(fit$dlogL), 2, max) < epsilon)
sum(cvg)
#> [1] 100
##
##(2) Hypothesis testing for variance components:
## H0, theta </= 0 vs H1, theta > 0.
z <- fit$theta["var1", ]/fit$se.theta["var1", ]
p <- pnorm(z, lower.tail = FALSE)
sum(p <= 0.05)
#> [1] 90
##
##(3) The DE genes specific to a cell-type
##Coefficients and p-values for the genes specific to a cell-type.
index <- grep(":", rownames(fit$coef))
beta <- t(fit$coef[index, cvg])
p <- t(fit$p[index, cvg])
##Adjust p-values by FDR.
padj <- matrix(p.adjust(c(p), method = "fdr"), nrow = nrow(p), ncol = ncol(p))
##The DE genes specific to a cell cluster with FDR < 0.05.
degenes <- NULL
for (j in 1:ncol(p)){
i <- (padj[, j] < 0.05)
if (sum(i) > 0) degenes <- rbind(degenes, data.frame(gene = rownames(p)[i], cluster = j, coef = beta[i, j], p = p[i, j], FDR = padj[i, j]))
}
rownames(degenes) <- NULL
degenes
#> gene cluster coef p FDR
#> 1 RCE1 1 0.023144076 1.285161e-10 2.056257e-08
#> 2 CD68 1 0.169924579 3.359959e-40 8.959890e-38
#> 3 N4BP1 2 0.013871752 1.377216e-04 1.001611e-02
#> 4 BEST1 2 0.011185860 1.056919e-03 4.973737e-02
#> 5 MYADM 2 0.011748386 2.521124e-07 2.881284e-05
#> 6 C1orf174 2 0.010434876 5.622747e-04 2.998798e-02
#> 7 FUNDC2_ENSG00000165775 2 -0.074345865 1.513410e-92 6.053640e-90
#> 8 PAIP2 3 -0.162167570 2.257004e-218 1.805603e-215
#> 9 CAMK2D 3 -0.004915633 6.519633e-07 6.519633e-05
#> 10 NOC4L 5 0.026286349 4.995860e-04 2.854777e-02
#> 11 MALSU1 5 -0.033691602 8.530013e-05 6.824010e-03
#> 12 FAM53B 5 -0.017410267 4.704017e-04 2.854777e-02
#> 13 HGS 6 -0.013463985 6.440482e-04 3.220241e-02
#> 14 UTRN 6 0.018717417 1.546949e-04 1.031299e-02
#> 15 INTS10 6 -0.065494690 3.951917e-27 7.903833e-25
#> 16 ZNF770 7 0.058368842 1.127945e-05 1.002618e-03
#> 17 RBBP7 8 0.044868013 1.457681e-09 1.943575e-07
##The simulated DE genes
dat$DEgenes
#> gene beta cluster
#> 91 RCE1 0.7138792 B cells
#> 92 CD68 0.6893330 B cells
#> 93 FUNDC2_ENSG00000165775 -0.7704019 CD14+ Monocytes
#> 94 PAIP2 -0.7157209 CD4 T cells
#> 95 CAMK2D -0.9319393 CD4 T cells
#> 96 DMXL1 -0.7755333 Dendritic cells
#> 97 SLC35F6 -0.7766885 Dendritic cells
#> 98 INTS10 -0.8749181 FCGR3A+ Monocytes
#> 99 ZNF770 0.8655967 Megakaryocytes
#> 100 RBBP7 0.5337121 NK cells
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-apple-darwin20
#> Running under: macOS Monterey 12.7.6
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
#>
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: America/Toronto
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] muscData_1.18.0 SingleCellExperiment_1.26.0
#> [3] SummarizedExperiment_1.34.0 Biobase_2.64.0
#> [5] GenomicRanges_1.56.2 GenomeInfoDb_1.40.1
#> [7] IRanges_2.38.1 S4Vectors_0.42.1
#> [9] MatrixGenerics_1.16.0 matrixStats_1.4.1
#> [11] ExperimentHub_2.12.0 AnnotationHub_3.12.0
#> [13] BiocFileCache_2.12.0 dbplyr_2.5.0
#> [15] BiocGenerics_0.50.0 FLASHMM_1.0.0
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 dplyr_1.1.4 blob_1.2.4
#> [4] filelock_1.0.3 Biostrings_2.72.1 fastmap_1.2.0
#> [7] promises_1.3.2 digest_0.6.37 mime_0.12
#> [10] lifecycle_1.0.4 ellipsis_0.3.2 processx_3.8.4
#> [13] KEGGREST_1.44.1 RSQLite_2.3.9 magrittr_2.0.3
#> [16] compiler_4.4.2 rlang_1.1.4 sass_0.4.9
#> [19] tools_4.4.2 yaml_2.3.10 knitr_1.49
#> [22] S4Arrays_1.4.1 htmlwidgets_1.6.4 bit_4.5.0.1
#> [25] pkgbuild_1.4.5 curl_6.0.1 DelayedArray_0.30.1
#> [28] abind_1.4-8 pkgload_1.4.0 miniUI_0.1.1.1
#> [31] withr_3.0.2 purrr_1.0.2 desc_1.4.3
#> [34] grid_4.4.2 urlchecker_1.0.1 profvis_0.4.0
#> [37] xtable_1.8-4 MASS_7.3-64 cli_3.6.3
#> [40] rmarkdown_2.29 crayon_1.5.3 generics_0.1.3
#> [43] remotes_2.5.0 rstudioapi_0.17.1 httr_1.4.7
#> [46] sessioninfo_1.2.2 DBI_1.2.3 cachem_1.1.0
#> [49] zlibbioc_1.50.0 AnnotationDbi_1.66.0 BiocManager_1.30.25
#> [52] XVector_0.44.0 vctrs_0.6.5 devtools_2.4.5
#> [55] Matrix_1.7-1 jsonlite_1.8.9 bookdown_0.41
#> [58] callr_3.7.6 bit64_4.5.2 jquerylib_0.1.4
#> [61] glue_1.8.0 ps_1.8.1 BiocVersion_3.19.1
#> [64] later_1.4.1 UCSC.utils_1.0.0 tibble_3.2.1
#> [67] pillar_1.10.0 rappdirs_0.3.3 htmltools_0.5.8.1
#> [70] GenomeInfoDbData_1.2.12 R6_2.5.1 evaluate_1.0.1
#> [73] shiny_1.10.0 lattice_0.22-6 png_0.1-8
#> [76] memoise_2.0.1 httpuv_1.6.15 bslib_0.8.0
#> [79] Rcpp_1.0.13-1 SparseArray_1.4.8 xfun_0.49
#> [82] fs_1.6.5 usethis_3.1.0 pkgconfig_2.0.3
Building design matrices for fixed and random effects is a key step for LMM-based DE analysis. Including library size, a normalization factor for scRNA-seq, as a fixed effect can help reduce p-value inflation. If needed, we can add principal components (PCs) as fixed effects to further remove the unknown batch effect.