Compute a fully-factorized variational approximation for Bayesian variable selection in linear regression (family = "gaussian") or logistic regression (family = "binomial"). More precisely, find the "best" fully-factorized approximation to the posterior distribution of the coefficients, with spike-and-slab priors on the coefficients. By "best", we mean the approximating distribution that locally minimizes the Kullback-Leibler divergence between the approximating distribution and the exact posterior.

## Usage

varbvs(X, Z, y, family = c("gaussian","binomial"), sigma, sa,
       logodds, alpha, mu, eta, update.sigma, update.sa,
       optimize.eta, initialize.params, nr = 100, sa0 = 1, n0 = 10,
       tol = 1e-4, maxiter = 1e4, verbose = TRUE)

## Arguments

- X: n x p input matrix, where n is the number of samples and p is the number of variables. X cannot be sparse, and cannot have any missing values (NA).

- Z: n x m covariate data matrix, where m is the number of covariates. Do not supply an intercept as a covariate (i.e., a column of ones), because an intercept is automatically included in the regression model. For no covariates, set Z = NULL. The covariates are assigned an improper, uniform prior. Although improper priors are generally not advisable because they can result in improper posteriors and Bayes factors, this choice allows us to easily integrate out these covariates.

- y: Vector of length n containing observations of a binary (family = "binomial") or continuous (family = "gaussian") outcome. For a binary outcome, all entries of y must be 0 or 1.

- family: "gaussian" for the linear regression model, or "binomial" for the logistic regression model.

- sigma: Candidate settings for the residual variance parameter. Must be of the same length as inputs sa and logodds (or have length equal to the number of columns of logodds). Only used for linear regression, and will generate an error if family = "binomial". If missing, the residual variance parameter is automatically fitted to the data by computing an approximate maximum-likelihood (ML) estimate.

- sa: Candidate settings for hyperparameter sa, the prior variance of regression coefficients for variables that are included in the model. This prior variance is always scaled by sigma (for logistic regression, we take sigma = 1). Scaling the variance of the coefficients in this way is necessary to ensure that this prior is invariant to measurement scale (e.g., switching from grams to kilograms). Must be of the same length as inputs sigma and logodds (or have length equal to the number of columns of logodds). If missing, the prior variance is automatically fitted to the data by computing approximate maximum-likelihood (ML) estimates, or maximum a posteriori estimates when n0 > 0 and sa0 > 0.
- logodds: Hyperparameter logodds is the prior log-odds that a variable is included in the regression model; it is defined as $$logodds = log10(q/(1-q)),$$ where q is the prior probability that a variable is included in the regression model. Note that we use the base-10 logarithm instead of the natural logarithm because it is usually more natural to specify prior log-odds settings in this way. The prior log-odds may also be specified separately for each variable, which is useful if there is prior information about which variables are most relevant to the outcome. This is accomplished by setting logodds to a p x ns matrix, where p is the number of variables and ns is the number of hyperparameter settings. Note that it is not possible to fit the logodds parameter; if logodds is not provided as input, it is set to a default value when sa and sigma are missing, and otherwise an error is generated.

- alpha: Good initial estimate of the variational parameter alpha for each hyperparameter setting. Either missing, or a p x ns matrix, where p is the number of variables and ns is the number of hyperparameter settings.

- mu: Good initial estimate of the variational parameter mu for each hyperparameter setting. Either missing, or a p x ns matrix, where p is the number of variables and ns is the number of hyperparameter settings.

- eta: Good initial estimate of the additional free parameters specifying the variational approximation to the logistic regression factors. Either missing, or an n x ns matrix, where n is the number of samples and ns is the number of hyperparameter settings.

- update.sigma: Setting this to TRUE ensures that sigma is always fitted to the data, in which case input vector sigma is used to provide initial estimates.

- update.sa: Setting this to TRUE ensures that sa is always fitted to the data, in which case input vector sa is used to provide initial estimates.
- optimize.eta: When optimize.eta = TRUE, eta is fitted to the data during the inner-loop coordinate ascent updates, even when good estimates of eta are provided as input.

- initialize.params: If FALSE, the initialization stage of the variational inference algorithm (see below) is skipped, which saves computation time for large data sets.

- nr: Number of samples of the "model PVE" to draw from the posterior.

- sa0: Scale parameter for a scaled inverse chi-square prior on hyperparameter sa. Must be >= 0.

- n0: Number of degrees of freedom for a scaled inverse chi-square prior on hyperparameter sa. Must be >= 0. Large settings of n0 provide greater stability of the parameter estimates in cases when the model is "sparse"; that is, when few variables are included in the model.

- tol: Convergence tolerance for the inner loop.

- maxiter: Maximum number of inner-loop iterations.

- verbose: If verbose = TRUE, print progress of the algorithm to the console.
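The base-10 log-odds convention for the logodds argument can be sketched with two small helper functions (these are illustrative only, not part of the package):

```r
# Illustrative helpers (not part of varbvs) for the base-10 log-odds
# convention; q is the prior probability that a variable is included.
prior.logodds <- function(q) log10(q/(1 - q))
prior.prob    <- function(logodds) 1/(1 + 10^(-logodds))

prior.logodds(0.5)   # 0: even odds of inclusion
prior.prob(-3)       # about 0.001: roughly 1 variable in 1000 included
```

So, for example, setting logodds = -2 corresponds to a prior in which roughly 1 variable in 100 is included in the model.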

## Regression models

Two types of outcomes (y) are modeled: (1) a continuous outcome, also called a "quantitative trait" in the genetics literature; or (2) a binary outcome with possible values 0 and 1. In the former case, set family = "gaussian", and the outcome is modeled as i.i.d. normal with mean $$u0 + Z*u + X*b$$ and variance sigma, in which u and b are vectors of regression coefficients, and u0 is the intercept. In the latter case, we use logistic regression to model the outcome, in which the probability that y = 1 is equal to $$sigmoid(u0 + Z*u + X*b).$$ See help(sigmoid) for a description of the sigmoid function. Note that the regression model always includes an intercept term (u0).
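As a rough sketch of the two generative models (sigmoid here is assumed to be the standard logistic function, matching its usual definition):

```r
# Sketch of the two outcome models, assuming sigmoid(x) = 1/(1 + exp(-x)).
sigmoid <- function(x) 1/(1 + exp(-x))

n  <- 100
u0 <- -2                 # intercept
x  <- rnorm(n)           # a single variable, for illustration
b  <- 1                  # its regression coefficient

# family = "gaussian": y is i.i.d. normal with mean u0 + x*b, variance sigma.
sigma <- 1
y.lin <- u0 + x*b + sqrt(sigma)*rnorm(n)

# family = "binomial": Pr(y = 1) = sigmoid(u0 + x*b).
y.bin <- as.double(runif(n) < sigmoid(u0 + x*b))
```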

## Co-ordinate ascent optimization procedure

For both regression models, the fitting procedure consists of an inner loop and an outer loop. The outer loop iterates over each of the hyperparameter settings (sa, sigma and logodds). Given a setting of the hyperparameters, the inner loop cycles through coordinate ascent updates to tighten the lower bound on the marginal likelihood, $$Pr(Y | X, sigma, sa, logodds).$$ The inner-loop coordinate ascent updates terminate when either (1) the maximum number of inner-loop iterations is reached, as specified by maxiter, or (2) the maximum change in the estimated posterior inclusion probabilities between successive iterations is less than tol.
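The stopping rule for the inner loop can be sketched as follows; update.step here is a toy stand-in (a simple contraction), not the actual coordinate ascent update:

```r
# Toy sketch of the inner-loop stopping rule. update.step is a stand-in
# contraction with fixed point 0.9, not the real coordinate ascent update.
update.step <- function(a) (a + 0.9)/2

tol     <- 1e-4
maxiter <- 1e4
alpha   <- rep(0.5,10)                     # initial inclusion probabilities
for (iter in 1:maxiter) {
  alpha.old <- alpha
  alpha     <- update.step(alpha)
  if (max(abs(alpha - alpha.old)) < tol)   # criterion (2) in the text
    break
}
```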

To provide a more accurate variational approximation of the posterior distribution, by default the fitting procedure has two stages. In the first stage, the entire fitting procedure is run to completion, and the variational parameters (alpha, mu, s, eta) corresponding to the maximum lower bound are then used to initialize the coordinate ascent updates in a second stage. Although this has the effect of doubling the computation time (in the worst case), the final posterior estimates tend to be more accurate with this two-stage fitting procedure.

## Variational approximation

Outputs alpha, mu and s specify the approximate posterior distribution of the regression coefficients. Each of these outputs is a p x ns matrix. For the ith hyperparameter setting, alpha[,i] is the variational estimate of the posterior inclusion probability (PIP) for each variable; mu[,i] is the variational estimate of the posterior mean coefficient given that it is included in the model; and s[,i] is the estimated posterior variance of the coefficient given that it is included in the model. These are also the quantities that are optimized as part of the inner loop coordinate ascent updates. An additional free parameter, eta, is needed for fast computation with the logistic regression model (family = "binomial"). The fitted value of eta is returned as an n x ns matrix.
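The per-setting outputs are typically combined into single estimates by averaging over the hyperparameter settings with normalized weights; the toy matrices and weight vector w below stand in for outputs of a real fit:

```r
# Sketch: model-average the variational outputs over ns hyperparameter
# settings using normalized weights w. Toy values stand in for a real fit.
p  <- 4                                 # number of variables
ns <- 3                                 # number of hyperparameter settings
w     <- c(0.2,0.5,0.3)                 # normalized weights (sum to 1)
alpha <- matrix(runif(p*ns),p,ns)       # PIPs, one column per setting
mu    <- matrix(rnorm(p*ns),p,ns)       # conditional posterior means

pip       <- c(alpha %*% w)             # averaged inclusion probabilities
beta.mean <- c((alpha * mu) %*% w)      # averaged posterior mean coefficients
```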

The variational estimates should be interpreted carefully, especially when variables are strongly correlated. For example, consider the simple scenario in which 2 candidate variables are closely correlated, and at least one of them explains the outcome with probability close to 1. Under the correct posterior distribution, we would expect that each variable is included with probability ~0.5. However, the variational approximation, due to the conditional independence assumption, will typically get this wrong, and concentrate most of the posterior weight on one variable (the actual variable that is chosen will depend on the starting conditions of the optimization). Although the individual PIPs are incorrect, a statistic summarizing the variable selection for both correlated variables (e.g., the total number of variables included in the model) should be reasonably accurate.
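A toy illustration of this point, with hypothetical PIPs for a pair of strongly correlated variables: the individual PIPs flip between runs, but their sum (the expected number of included variables in the pair) is stable.

```r
# Hypothetical PIPs for two strongly correlated variables, from two runs
# of the optimization started at different initializations.
pip.run1 <- c(0.95,0.04)
pip.run2 <- c(0.06,0.93)
sum(pip.run1)   # 0.99
sum(pip.run2)   # 0.99
```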

## References

P. Carbonetto and M. Stephens (2012). Scalable variational inference for Bayesian variable selection in regression, and its accuracy in genetic association studies. Bayesian Analysis 7, 73--108.

Y. Guan and M. Stephens (2011). Bayesian variable selection regression for genome-wide association studies and other large-scale problems. Annals of Applied Statistics 5, 1780--1815.

X. Zhou, P. Carbonetto and M. Stephens (2013). Polygenic modeling with Bayesian sparse linear mixed models. PLoS Genetics 9, e1003264.

## See also

summary.varbvs, varbvscoefcred, varbvspve, varbvsnorm, varbvsbin, varbvsbinz, normalizelogweights, varbvs-internal

## Examples

# LINEAR REGRESSION EXAMPLE
# -------------------------
# Data are 200 uncorrelated ("unlinked") single nucleotide polymorphisms
# (SNPs) with simulated genotypes, in which the first 20 of them have an
# effect on the outcome. Also generate data for 3 covariates.
maf <- 0.05 + 0.45*runif(200)
X   <- (runif(400*200) < maf) + (runif(400*200) < maf)
X   <- matrix(as.double(X),400,200,byrow = TRUE)
Z   <- randn(400,3)

# Generate the ground-truth regression coefficients for the variables
# (X) and additional 3 covariates (Z). Adjust the QTL effects so that
# the variables (SNPs) explain 50 percent of the variance in the
# outcome.
u    <- c(-1,2,1)
beta <- c(rnorm(20),rep(0,180))
beta <- 1/sd(c(X %*% beta)) * beta

# Generate the quantitative trait measurements.
y <- c(-2 + Z %*% u + X %*% beta + rnorm(400))

# Fit the variable selection model.
fit <- varbvs(X,Z,y,logodds = seq(-3,-1,0.1))
print(summary(fit))

# Compute the posterior mean estimate of hyperparameter sa.
sa <- with(fit,sum(sa * w))

# Compare estimated outcomes against observed outcomes.
y.fit <- predict(fit,X,Z)
print(cor(y,y.fit))

# LOGISTIC REGRESSION EXAMPLE
# ---------------------------
# Data are 100 uncorrelated ("unlinked") single nucleotide polymorphisms
# (SNPs) with simulated genotypes, in which the first 10 of them have an
# effect on the outcome. Also generate data for 2 covariates.
maf <- 0.05 + 0.45*runif(100)
X   <- (runif(750*100) < maf) + (runif(750*100) < maf)
X   <- matrix(as.double(X),750,100,byrow = TRUE)
Z   <- randn(750,2)

# Generate the ground-truth regression coefficients for the variables
# (X) and additional 2 covariates (Z).
u    <- c(-1,1)
beta <- c(0.5*rnorm(10),rep(0,90))

# Simulate the binary trait (case-control status) as a coin toss with
# success rates given by the logistic regression.
y <- as.double(runif(750) < sigmoid(-1 + Z %*% u + X %*% beta))

# Fit the variable selection model.
fit <- varbvs(X,Z,y,"binomial",logodds = seq(-2,-0.5,0.5))
print(summary(fit))