Compute fully-factorized variational approximation for Bayesian variable selection in linear (family = gaussian) or logistic regression (family = binomial). More precisely, find the "best" fully-factorized approximation to the posterior distribution of the coefficients, with spike-and-slab priors on the coefficients. By "best", we mean the approximating distribution that locally minimizes the Kullback-Leibler divergence between the approximating distribution and the exact posterior.
varbvs (X, Z, y, family = c("gaussian","binomial"), sigma, sa, logodds, alpha, mu, eta, update.sigma, update.sa, optimize.eta, initialize.params, nr = 100, sa0 = 1, n0 = 10, tol = 1e-4, maxiter = 1e4, verbose = TRUE)
n x p input matrix, where n is the number of samples, and p is the number of variables. X cannot be sparse, and cannot have any missing values (NA).
n x m covariate data matrix, where m is the number of
covariates. Do not supply an intercept as a covariate (i.e.,
a column of ones), because an intercept is automatically
included in the regression model. For no covariates, set
Vector of length n containing observations of binary
"gaussian" for linear regression model, or "binomial" for logistic regression model.
Candidate settings for the residual variance
parameter. Must be of the same length as inputs sa and logodds (or
have length equal to the number of columns of logodds). Only used for
linear regression, and will generate an error if
Hyperparameter sa is the prior variance of regression
coefficients for variables that are included in the model. This
prior variance is always scaled by sigma (for logistic regression,
we take sigma = 1). Scaling the variance of the coefficients in this
way is necessary to ensure that this prior is invariant to
measurement scale (e.g., switching from grams to kilograms). This
input specifies the candidate settings for sa, of the same length as
inputs sigma and logodds (or have length equal to the number of
columns of logodds). If missing, prior variance is automatically
fitted to data by compute approximate maximum (ML) estimates, or
maximum a posteriori estimates when
Hyperparameter logodds is the prior log-odds that a variable is included in the regression model; it is defined as \(logodds = log10(q/(1-q)),\) where q is the prior probability that a variable is included in the regression model. Note that we use the base-10 logarithm instead of the natural logarithm because it is usually more natural to specify prior log-odds settings in this way. The prior log-odds may also be specified separately for each variable, which is useful is there is prior information about which variables are most relevant to the outcome. This is accomplished by setting logodds to a p x ns matrix, where p is the number of variables, and ns is the number of hyperparameter settings. Note it is not possible to fit the logodds parameter; if logodds input is not provided as input, then it is set to the default value when sa and sigma are missing, and otherwise an error is generated.
Good initial estimate for the variational parameter alpha for each hyperparameter setting. Either missing, or a p x ns matrix, where p is the number of variables, and ns is the number of hyperparameter settings.
Good initial estimate for the variational parameter mu for each hyperparameter setting. Either missing, or a p x ns matrix, where p is the number of variables, and ns is the number of hyperparameter settings.
Good initial estimate of the additional free parameters specifying the variational approximation to the logistic regression factors. Either missing, or an n x ns matrix, where n is the number of samples, and ns is the number of hyperparameter settings.
Setting this to TRUE ensures that sigma is always fitted to data, in which case input vector sigma is used to provide initial estimates.
Setting this to TRUE ensures that sa is always fitted to data, in which case input vector sa is used to provide initial estimates.
If FALSE, the initialization stage of the variational inference algorithm (see below) will be skipped, which saves computation time for large data sets.
Number of samples of "model PVE" to draw from posterior.
Scale parameter for a scaled inverse chi-square prior on hyperparameter sa. Must be >= 0.
Number of degrees of freedom for a scaled inverse chi-square
prior on hyperparameter sa. Must be >= 0. Large settings of
Convergence tolerance for inner loop.
Maximum number of inner loop iterations.
Two types of outcomes (y) are modeled: (1) a continuous outcome,
also a "quantitative trait" in the genetics literature; or (2) a
binary outcome with possible values 0 and 1. For the former, set
family = "gaussian", in which case, the outcome is
i.i.d. normal with mean \(u0 + Z*u + X*b\) and variance sigma, in
which u and b are vectors of regresion coefficients, and u0 is the
intercept. In the second case, we use logistic regression to model
the outcome, in which the probability that y = 1 is equal to
\(sigmoid(u0 + Z*u + X*b).\) See
help(sigmoid) for a
description of the sigmoid function. Note that the regression always
includes an intercept term (u0).
For both regression models, the fitting procedure consists of an inner
loop and an outer loop. The outer loop iterates over each of the
hyperparameter settings (sa, sigma and logodds). Given a setting of
the hyperparameters, the inner loop cycles through coordinate ascent
updates to tighten the lower bound on the marginal likelihood,
\(Pr(Y | X, sigma, sa, logodds)\). The inner loop coordinate
ascent updates terminate when either (1) the maximum number of inner
loop iterations is reached, as specified by
maxiter, or (2)
the maximum difference between the estimated posterior inclusion
probabilities is less than
To provide a more accurate variational approximation of the posterior distribution, by default the fitting procedure has two stages. In the first stage, the entire fitting procedure is run to completion, and the variational parameters (alpha, mu, s, eta) corresponding to the maximum lower bound are then used to initialize the coordinate ascent updates in a second stage. Although this has the effect of doubling the computation time (in the worst case), the final posterior estimates tend to be more accurate with this two-stage fitting procedure.
Outputs alpha, mu and s specify the approximate posterior distribution
of the regression coefficients. Each of these outputs is a p x ns
matrix. For the ith hyperparameter setting, alpha[,i] is the
variational estimate of the posterior inclusion probability (PIP)
for each variable; mu[,i] is the variational estimate of the
posterior mean coefficient given that it is included in the model;
and s[,i] is the estimated posterior variance of the coefficient
given that it is included in the model. These are also the
quantities that are optimized as part of the inner loop coordinate
ascent updates. An additional free parameter, eta, is needed for
fast computation with the logistic regression model
"binomial"). The fitted value of eta is returned as an n x ns
The variational estimates should be interpreted carefully, especially when variables are strongly correlated. For example, consider the simple scenario in which 2 candidate variables are closely correlated, and at least one of them explains the outcome with probability close to 1. Under the correct posterior distribution, we would expect that each variable is included with probability ~0.5. However, the variational approximation, due to the conditional independence assumption, will typically get this wrong, and concentrate most of the posterior weight on one variable (the actual variable that is chosen will depend on the starting conditions of the optimization). Although the individual PIPs are incorrect, a statistic summarizing the variable selection for both correlated variables (e.g., the total number of variables included in the model) should be reasonably accurate.
More generally, if variables can be reasonably grouped together
based on their correlations, we recommend interpreting the variable
selection results at a group level. For example, in genome-wide
association studies (see the vignettes) ,a SNP with a high PIP
indicates that this SNP is probably associated with the trait, and
one or more nearby SNPs within a chromosomal region, or ``locus,''
may be associated as well. Therefore, we interpreted the GWAS
variable selection results at the level of loci, rather than at the
level of individual SNPs.
Also note that special care is required for interpreting the results
of the variational approximation with the logistic regression
model. In particular, interpretation of the individual estimates of
the regression coefficients (e.g., the posterior mean estimates
fit$mu) is not straightforward due to the additional
approximation introduced on the individual nonlinear factors in the
likelihood. As a general guideline, only the relative magnitudes of
the coefficients are meaningful.
In many settings, it is good practice to account for uncertainty in the hyperparameters when reporting final posterior quantities. For example, hyperparameter sa is often estimated with a high degree of uncertainty when only a few variables are included in the model. Provided that (1) the hyperparameter settings sigma, sa and logodds adequately represent the space of possible hyperparameter settings with high posterior mass, (2) the hyperparameter settings are drawn from the same distribution as the prior, and (3) the fully-factorized variational approximation closely approximates the true posterior distribution, then final posterior quantities can be calculated by using logw as (unnormalized) log-marginal probabilities.
Even when conditions (1), (2) and/or (3) are not satisfied, this can
approach can still often yield reasonable estimates of averaged
posterior quantities. The examples below demonstrate how final
posterior quantities are reported by function
more details). To account for discrepancies between the prior on
(sigma,sa,logodds) and the sampling density used to draw candidate
settings of the hyperparameters, adjust the log-probabilities by
fit$logw <- fit$logw + logp/logq, where logp is the
log-density of the prior distribution, and logq is the log-density
of the sampling distribution. (This is importance sampling; see, for
example, R. M. Neal, Annealed importance sampling, Statistics
and Computing, 2001.)
Specifying the prior variance of the regression coefficients (sa) can
be difficult, which is why we have included the option of fitting this
hyperparameter to the data (see input argument
above). However, in many settings, especially when a small number of
variables are included in the regression model, it is preferrable to
average over candidate settings of sa instead of fitting sa to the
data. To choose a set of candidate settings for sa, we have advocated
for setting sa indirectly through a prior estimate of the proportion
of variance in the outcome explained by the variables (abbreviated as
PVE), since it is often more natural to specify the PVE rather than
the prior variance (see references below). This is technically only
suitable or the linear regression model (
family = "gaussian"),
but could potentially be used for the linear regression model in an
For example, one could approximate a uniform prior on the PVE by drawing the PVE uniformly between 0 and 1, additionally specifying candidate settings for the prior log-odds, then computing the prior variance (sa) as follows:
sx <- sum(var1.cols(X)) sa <- PVE/(1-PVE)/(sigmoid(log(10)*logodds)*sx)
Note that this calculation will yield
sa = 0 when
sa = Inf when
PVE = 1.
Also, bear in mind that if there are additional covariates (Z)
included in the linear regression model that explain variance in Y,
then it will usually make more sense to first remove the linear
effects of these covariates before performing this calculation. The
PVE would then represent the prior proportion of variance in the
residuals of Y that are explained by the candidate variables. For an
example of how to do this, see the code for function
file varbvs.R, under "preprocessing steps". Alternatively, one could
include the matrix Z in the calculation above, taking care to ensure
that the covariates are included in the model with probability 1.
Finally, we point out that the optimization procedures were
carefully designed so that they can be applied to very large data
sets; to date, this code has been tested on data sets with >500,000
variables and >10,000 samples. An important limiting factor is the
ability to store the data matrix X in memory. To reduce memory
requirements, in the MATLAB interface we require that X be single
precision, but this option is not available in R. Additionally, we
mostly avoid generating intermediate products that are of the same
size as X. Only one such intermediate product is generated when
family = "gaussian", and none for
family = "binomial".
An object with S3 class
Either "gaussian" or "binomial".
Number of data samples used to fit model.
Settings for sigma (family = "gaussian" only).
Settings for prior variance parameter.
Prior log-odds settings.
TRUE if prior is identical for all variables. When
logodds is a p x ns matrix,
prior.same = FALSE.
Scale parameter for prior on hyperparameter sa.
Degrees of freedom for prior on hyperparameter sa.
If TRUE, sigma was fit to data for each setting of
prior logodds (
family = "gaussian" only).
If TRUE, sa was fit to data for each setting of prior logodds.
An array with ns elements, in which
logw[i] is the
variational lower bound on the marginal log-likelihood for setting i
of the hyperparameters. These provide approximate values of the
marginal log-likelihood for each hyperparameter setting.
Normalized weights (posterior probabilities) for each of the
hyperparameter settings computed from
Variational estimates of posterior inclusion probabilities for each hyperparameter setting.
Variational estimates of posterior mean coefficients for each hyperparameter setting.
Variational estimates of posterior variances for each hyperparameter setting.
"Averaged" posterior inclusion probabilities computed as a
weighted average of the individual PIPs (
alpha), with weights
"Averaged" posterior mean regression coefficients.
Posterior mean regression coefficients for covariates, including intercept, for each hyperparameter setting.
Additional variational parameters for
If TRUE, eta was fit to data (
For each hyperparameter setting, and for each variable,
mean estimate of the proportion of variance in outcome explained
conditioned on variable being included in the model (
Samples drawn from posterior distribution giving
estimates of proportion of variance in outcome (y) explained by fitted
variable selection model. This is for
family = "gaussian"
only). To obtain the posterior mean estimate of the proportion of
variance explained (PVE), for example, simply type
P. Carbonetto and M. Stephens (2012). Scalable variational inference for Bayesian variable selection in regression, and its accuracy in genetic association studies. Bayesian Analysis 7, 73--108.
Y. Guan and M. Stephens (2011). Bayesian variable selection regression for genome-wide association studies and other large-scale problems. Annals of Applied Statistics 5, 1780--1815.
X. Zhou, P. Carbonetto and M. Stephens (2013). Polygenic modeling with Bayesian sparse linear mixed models. PLoS Genetics 9, e1003264.
# LINEAR REGRESSION EXAMPLE # ------------------------- # Data are 200 uncorrelated ("unlinked") single nucleotide polymorphisms # (SNPs) with simulated genotypes, in which the first 20 of them have an # effect on the outcome. Also generate data for 3 covariates. maf <- 0.05 + 0.45*runif(200) X <- (runif(400*200) < maf) + (runif(400*200) < maf) X <- matrix(as.double(X),400,200,byrow = TRUE) Z <- randn(400,3) # Generate the ground-truth regression coefficients for the variables # (X) and additional 3 covariates (Z). Adjust the QTL effects so that # the variables (SNPs) explain 50 percent of the variance in the # outcome. u <- c(-1,2,1) beta <- c(rnorm(20),rep(0,180)) beta <- 1/sd(c(X %*% beta)) * beta # Generate the quantitative trait measurements. y <- c(-2 + Z %*% u + X %*% beta + rnorm(400)) # Fit the variable selection model. fit <- varbvs(X,Z,y,logodds = seq(-3,-1,0.1)) print(summary(fit)) # Compute the posterior mean estimate of hyperparameter sa. sa <- with(fit,sum(sa * w)) # Compare estimated outcomes against observed outcomes. y.fit <- predict(fit,X,Z) print(cor(y,y.fit)) # LOGISTIC REGRESSION EXAMPLE # --------------------------- # Data are 100 uncorrelated ("unlinked") single nucleotide polymorphisms # (SNPs) with simulated genotypes, in which the first 10 of them have an # effect on the outcome. Also generate data for 2 covariates. maf <- 0.05 + 0.45*runif(100) X <- (runif(750*100) < maf) + (runif(750*100) < maf) X <- matrix(as.double(X),750,100,byrow = TRUE) Z <- randn(750,2) # Generate the ground-truth regression coefficients for the variables # (X) and additional 2 covariates (Z). u <- c(-1,1) beta <- c(0.5*rnorm(10),rep(0,90)) # Simulate the binary trait (case-control status) as a coin toss with # success rates given by the logistic regression. y <- as.double(runif(750) < sigmoid(-1 + Z %*% u + X %*% beta)) # Fit the variable selection model. fit <- varbvs(X,Z,y,"binomial",logodds = seq(-2,-0.5,0.5)) print(summary(fit))