`biglasso` extends lasso and elastic-net linear and logistic regression models to ultrahigh-dimensional, multi-gigabyte data sets that cannot be loaded into memory. It utilizes memory-mapped files to store the massive data on disk, reading it into memory only when necessary during model fitting. Moreover, it implements newly proposed, advanced feature screening rules that accelerate model fitting. As a result, this package is much more memory- and computation-efficient, and far more scalable, than existing lasso-fitting packages such as `glmnet` and `ncvreg`. Benchmarking experiments using both simulated and real data sets show that `biglasso` is not only 1.5x to 4x faster than existing packages, but also at least 2x more memory-efficient. More importantly, to the best of our knowledge, `biglasso` is the first R package that enables users to fit lasso models with data sets larger than available RAM, thus allowing for powerful big data analysis on an ordinary laptop.
To install the latest stable release version from CRAN:

```r
install.packages("biglasso")
```

To install the latest development version from GitHub:

```r
remotes::install_github("pbreheny/biglasso")
```
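Once installed, the basic workflow looks roughly like this (a minimal sketch; the design matrix must be a `big.matrix` from the bigmemory package):

```r
library(biglasso)
library(bigmemory)

# simulate a small design matrix and response for illustration
set.seed(1)
n <- 100; p <- 1000
X <- as.big.matrix(matrix(rnorm(n * p), n, p))  # memory-mapped matrix type
y <- rnorm(n)

# fit the full lasso path; family = "binomial" gives logistic regression
fit <- biglasso(X, y, family = "gaussian")
plot(fit)  # coefficient paths versus lambda
```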
`biglasso` is at least 2x more memory-efficient than `glmnet`.

The benchmarking setup on simulated data:

* Packages compared: `biglasso` (1.4-0), `glmnet` (4.0-2), `ncvreg` (3.12-0), and `picasso` (1.3-1).
* Experiments: solving the lasso along the entire path of 100 `lambda` values equally spaced on the log scale of `lambda / lambda_max` from 0.1 to 1; varying number of observations `n` and number of features `p`; 20 replications, with the mean computing time (in seconds) reported.
* Data-generating model: `y = X * beta + 0.1 * eps`, where `X` and `eps` are i.i.d. sampled from `N(0, 1)`.
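As a sketch, a simulated data set of this form can be generated as follows (the text does not specify `beta`, so a sparse coefficient vector is assumed here purely for illustration):

```r
# y = X %*% beta + 0.1 * eps, with entries of X and eps i.i.d. N(0, 1);
# beta is not specified in the benchmark description -- the sparse
# vector below is an illustrative assumption
simulate_data <- function(n, p, beta = c(rep(1, 10), rep(0, p - 10))) {
  X <- matrix(rnorm(n * p), nrow = n, ncol = p)
  eps <- rnorm(n)
  y <- as.numeric(X %*% beta) + 0.1 * eps
  list(X = X, y = y)
}

dat <- simulate_data(n = 1000, p = 10000)
```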
`biglasso` is more computation-efficient: in all settings, `biglasso` (1 core) is uniformly faster than `picasso`, `glmnet`, and `ncvreg`. As the data get bigger, `biglasso` achieves a 6-9x speed-up compared to the other packages. Moreover, the computing time of `biglasso` can be further cut in half via parallel computation on multiple cores.
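Parallel fitting is exposed through the `ncores` argument; for example (assuming `X` and `y` as in the earlier sketch):

```r
# same model fit, but with the pathwise computation spread over 4 cores
fit_parallel <- biglasso(X, y, family = "gaussian", ncores = 4)
```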
`biglasso` is more memory-efficient: to demonstrate this, we simulated a `1000 x 100000` feature matrix; the raw data is 0.75 GB. We used Syrupy to measure the memory used in RAM (i.e., the resident set size, RSS) every 1 second during lasso model fitting by each of the packages.

The maximum RSS (in GB) used by a single fit and by 10-fold cross-validation is reported in the table below. In the single-fit case, `biglasso` consumes 0.60 GB of memory, 23% of that used by `glmnet` and 24% of that used by `ncvreg`. Note that the memory consumed by `glmnet` and `ncvreg` is, respectively, 3.4x and 3.3x the size of the raw data. `biglasso` also requires less additional memory to perform cross-validation than the other packages: for serial 10-fold cross-validation, `biglasso` requires just 31% of the memory used by `glmnet` and 11% of that used by `ncvreg`, making it 3.2x and 9.4x more memory-efficient than these two, respectively.
Package | picasso | ncvreg | glmnet | biglasso |
---|---|---|---|---|
Single fit | 0.74 | 2.47 | 2.57 | 0.60 |
10-fold CV | - | 4.62 | 3.11 | 0.96 |
Note:

* The memory savings offered by `biglasso` would be even more significant if cross-validation were conducted in parallel. However, measuring memory usage across parallel processes is not straightforward and is not implemented in Syrupy.
* Cross-validation is not implemented in `picasso` at this point.
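For reference, a serial 10-fold cross-validation run like the one measured above can be set up as follows (a sketch; setting `ncores > 1` in `cv.biglasso` would instead parallelize over the folds):

```r
# serial 10-fold cross-validation over the lasso path
cvfit <- cv.biglasso(X, y, nfolds = 10, ncores = 1)
plot(cvfit)   # cross-validation error versus lambda
coef(cvfit)   # coefficients at the lambda minimizing CV error
```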
The performance of the packages is also tested using diverse real data sets:

* Breast cancer gene expression data (GENE)
* MNIST handwritten image data (MNIST)
* Cardiac fibrosis genome-wide association study data (GWAS)
* Subset of New York Times bag-of-words data (NYT)
The following table summarizes the mean (SE) computing time (in seconds) for solving the lasso along the entire path of 100 `lambda` values equally spaced on the log scale of `lambda / lambda_max` from 0.1 to 1, over 20 replications.
Package | GENE | MNIST | GWAS | NYT |
---|---|---|---|---|
 | n=536 | n=784 | n=313 | n=5,000 |
 | p=17,322 | p=60,000 | p=660,495 | p=55,000 |
picasso | 0.67 (0.02) | 2.94 (0.01) | 14.96 (0.01) | 15.91 (0.16) |
ncvreg | 0.87 (0.01) | 4.22 (0.00) | 19.78 (0.01) | 25.59 (0.12) |
glmnet | 0.74 (0.01) | 3.82 (0.01) | 16.19 (0.01) | 24.94 (0.16) |
biglasso | 0.31 (0.01) | 0.61 (0.02) | 4.82 (0.01) | 5.91 (0.78) |
To demonstrate the out-of-core computing capability of `biglasso`, a 96 GB real data set from a large-scale genome-wide association study is analyzed. The dimensions of the design matrix are `n = 973`, `p = 11,830,470`. Note that the data is 3x the size of the installed 32 GB of RAM.
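A sketch of how such an analysis is set up (the file name is hypothetical; `setupX()` performs a one-time conversion of the raw text file into a file-backed, memory-mapped `big.matrix`, so subsequent fits never load the full 96 GB into RAM):

```r
library(biglasso)

# one-time step: creates .bin/.desc backing files on disk
# ("gwas_genotypes.txt" is a hypothetical file name)
X <- setupX("gwas_genotypes.txt", sep = "\t")

# y (the phenotype vector) is assumed to be read in separately;
# fitting memory-maps the backing file rather than loading it
fit <- biglasso(X, y, family = "gaussian", ncores = 4)
```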
Since the other three packages cannot handle this data-larger-than-RAM case, we compare the performance of the screening rules `SSR` and `Adaptive`, both implemented in `biglasso`. In addition, two cases for `lambda_min` are considered: (1) `lam_min = 0.1 lam_max`; and (2) `lam_min = 0.5 lam_max`, since in practice there is typically less interest in small values of `lambda` for very high-dimensional data such as this. Again, the entire solution path of 100 `lambda` values is obtained. The table below summarizes the overall computing time (in minutes) for the screening rule `SSR` (which is what the other three packages use) and our new rule `Adaptive`. (No replication is conducted.)
Case | SSR | Adaptive |
---|---|---|
`lam_min / lam_max = 0.1`, 1 core | 189.67 | 66.05 |
`lam_min / lam_max = 0.1`, 4 cores | 86.31 | 46.91 |
`lam_min / lam_max = 0.5`, 1 core | 177.84 | 24.84 |
`lam_min / lam_max = 0.5`, 4 cores | 85.67 | 15.14 |
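For reference, the two screening rules can be timed along these lines (a sketch, assuming `X` is the file-backed matrix set up above and that `screen`, `lambda.min`, and `nlambda` behave as in the package documentation):

```r
# time the full 100-value path under each screening rule;
# lambda.min is the ratio lam_min / lam_max
t_ssr <- system.time(
  biglasso(X, y, screen = "SSR", lambda.min = 0.1, nlambda = 100)
)
t_adaptive <- system.time(
  biglasso(X, y, screen = "Adaptive", lambda.min = 0.1, nlambda = 100)
)
```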