truncash

truncash (Truncated ASH) is an exploratory project with Matthew, built on ashr.
Note: All the results and plots presented in the pages below should be reproducible on your computer. Follow the setup instructions if you are interested in reproducing the results for yourself.
Prof. Matthew Stephens did a quick investigation of the p values and z scores obtained for simulated null data (using just voom transform, no correction) from real RNA-seq data of GTEx. Here is what he found.
“I found something that I hadn’t realized, although is obvious in hindsight: although you sometimes see inflation under null of \(p\)-values/\(z\)-scores, the most extreme values are not inflated compared with expectations (and tend to be deflated). That is the histograms of \(p\)-values that show inflation near \(0\) (and deflation near \(1\)) actually hide something different going on in the very left hand side near \(0\). The qq-plots are clearer… showing most extreme values are deflated, or not inflated. This is expected under positive correlation i think. For example, if all \(z\)-scores were the same (complete correlation), then most extreme of n would just be \(N(0,1)\). but if independent the most extreme of n would have longer tails…”
Matthew’s initial observation inspired this project. If, under positive correlation, the most extreme observations tend not to be inflated, maybe we can use them to control the false discoveries. Meanwhile, if the moderate observations are more prone to inflation due to correlation, maybe it’s better to make only partial use of their information.
As Prof. Michael Stein pointed out during a conversation with Matthew, if the marginal distribution is correct then the expected number exceeding any threshold should be correct. So if the tail is “usually” deflated, then with some small probability there must be many large \(z\)-scores (even in the tail). Therefore, if “on average” we have the right number of large \(z\)-scores/small \(p\)-values, and “usually” we have too few, then “rarely” we should have too many. A simulation is run to check this intuition.
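This intuition is easy to check with a small sketch (an illustration of ours, not the project’s actual simulation): an equicorrelated factor model keeps each \(z\)-score marginally \(N(0, 1)\), so the expected exceedance count matches the independent case, while the distribution of that count does not.

```r
# Equicorrelated null: z_j = sqrt(rho) * u + sqrt(1 - rho) * e_j, so each
# z_j is marginally N(0, 1) but all z_j share the common factor u.
set.seed(1)
n <- 1e4; rho <- 0.5; B <- 1e3
counts <- replicate(B, {
  u <- rnorm(1)
  z <- sqrt(rho) * u + sqrt(1 - rho) * rnorm(n)
  sum(abs(z) > 2)                    # how many |z| exceed 2 this time
})
expected <- n * 2 * pnorm(-2)        # exceedance count under independence
c(mean   = mean(counts),             # close to `expected` on average
  median = median(counts),           # "usually" well below it
  q99    = quantile(counts, 0.99))   # "rarely" far above it
```

The mean matches the marginal expectation, the median sits well below it, and the upper quantiles sit far above it: usually too few large \(z\)-scores, rarely far too many.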
In order to understand the behavior of \(p\)-values of top-expressed, correlated genes under the global null, simulated from GTEx data, we apply two FWER-controlling multiple comparison procedures: Holm’s “step-down” ([Holm 1979]) and Hochberg’s “step-up” ([Hochberg 1988]).
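Both procedures are available in base R through p.adjust; a minimal toy example (the \(p\)-values here are made up for illustration):

```r
# Holm's step-down and Hochberg's step-up via base R's p.adjust();
# both control the FWER (Hochberg under additional assumptions).
p <- c(0.0001, 0.004, 0.019, 0.095, 0.5)   # toy p-values
p_holm     <- p.adjust(p, method = "holm")
p_hochberg <- p.adjust(p, method = "hochberg")
# Hochberg's adjusted p-values are never larger than Holm's:
all(p_hochberg <= p_holm)   # TRUE
```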
Using a toy model, we examine and document the pipeline to simulate null summary statistics at each step, including edgeR::calcNormFactors, limma::voom, limma::lmFit, and limma::eBayes.
Apply two FDR-controlling procedures, BH and BY, as well as two \(s\) value models, ash and truncash, to the simulated, correlated null data, and compare the numbers of false discoveries obtained (by definition, all discoveries should be false). Part 1 uses \(z\) scores only; Part 2 uses \(\hat\beta\) and moderated \(\hat s\).
\(\hat\pi_0\) estimated by ash and truncash with \(T = 1.96\) on correlated global null data simulated from GTEx/Liver. Ideally they should be close to \(1\).
For various FWER / FDR controlling procedures, and for truncash, what matters most is the behavior of the most extreme observations. Here these extreme \(p\) values are plotted along with common critical values used by various procedures, in order to shed light on their behavior.
This is very exploratory, and may be related to Extreme Value Theory and Concentration of Measure. To be continued.
What will happen if we allow the threshold in truncash to depend on the data? Let’s start from the case where we only know the single most extreme observation.
When moving to a \(t\) likelihood, or in other words, when taking the randomness of \(\hat s\) into consideration, things get trickier. Here we review several techniques currently used in Matthew’s lab regarding \(t\) likelihoods and uniform mixture priors, based on a discussion with Matthew.
ash-hostile data sets
BH-hostile data sets
An implicit key question of this inquiry is: what is the behavior of \(z\) scores under dependency? We take a look at their histograms: first for randomly sampled data sets, second for those most “hostile” to ash, and third for those most “hostile” to BH. The bottom line is that we are reproducing what Efron observed in microarray data, that “the theoretical null may fail” in different ways. Now the key questions are
Why may the theoretical null fail? What do we mean by correlation?
Can truncash make ash more robust against some of the foes that make the theoretical null fail?
Generally, how robust is empirical Bayes? Is empirical Bayes robust or non-robust to certain kinds of correlation?
Inspired by Schwartzman 2010, we experiment with a new way to tackle the “empirical null.”
In the Gaussian derivatives decomposition, the weights \(W_k\), and especially the normalized weights \(W_k^s = W_k\sqrt{k!}\), contain substantial information. In order to produce a proper density, and in order to regularize the fitted density so that it describes a plausible empirical correlated null, constraints need to be imposed on the weights.
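As a sketch of the decomposition (with illustrative weights chosen by us, not fitted ones), \(f(z) = \sum_k W_k \varphi^{(k)}(z)\) can be evaluated through probabilists’ Hermite polynomials, since \(\varphi^{(k)}(z) = (-1)^k h_k(z)\varphi(z)\):

```r
hermite <- function(z, k) {
  # probabilists' Hermite polynomials by the recurrence
  # h_0 = 1, h_1 = z, h_k = z h_{k-1} - (k - 1) h_{k-2}
  h <- list(rep(1, length(z)), z)
  if (k <= 1) return(h[[k + 1]])
  for (j in 2:k) h[[j + 1]] <- z * h[[j]] - (j - 1) * h[[j - 1]]
  h[[k + 1]]
}
gd_density <- function(z, w) {
  # w = (W_0, W_1, ..., W_K); W_0 = 1 keeps total mass 1, because every
  # higher derivative of phi integrates to 0
  K <- length(w) - 1
  rowSums(sapply(0:K, function(k) w[k + 1] * (-1)^k * hermite(z, k)) *
            dnorm(z))
}
w   <- c(1, 0, 0.3, 0, 0.1)                    # illustrative weights
w_s <- w * sqrt(factorial(0:4))                # normalized weights W_k^s
z   <- seq(-6, 6, length.out = 1e4)
sum(gd_density(z, w)) * diff(z[1:2])           # ~1: total mass preserved
```

Note that \(W_0 = 1\) only fixes the total mass; non-negativity of the resulting density is a separate constraint that these particular weights happen to satisfy.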
Both true effects and correlation can distort the empirical distribution away from the standard normal \(N(0, 1)\), and Gaussian derivatives are presumably able to fit both. Therefore, is there a way to let Gaussian derivatives with a reasonable number of reasonable weights automatically distinguish a correlated null from true effects? Currently it works well when effects are not too small.
Continuing from the empirical studies above, we are looking at why \(N\left(0, \sigma^2\right)\) when \(\sigma^2\) is small can be satisfactorily fitted by a limited number of Gaussian derivatives.
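One heuristic argument (our sketch via Fourier transforms, not necessarily the one used in that analysis): matching \(\hat\varphi_\sigma(t) = e^{-\sigma^2 t^2/2}\) against \(\widehat{\varphi^{(2k)}}(t) = (-1)^k t^{2k} e^{-t^2/2}\) gives

```latex
e^{-\sigma^2 t^2 / 2}
  = e^{-t^2 / 2} \sum_{k=0}^\infty \frac{1}{k!}
    \left( -\frac{(\sigma^2 - 1)\, t^2}{2} \right)^k
\quad\Longrightarrow\quad
\varphi_\sigma(z)
  = \sum_{k=0}^\infty \frac{(\sigma^2 - 1)^k}{2^k\, k!}\, \varphi^{(2k)}(z)
```

so the coefficients are bounded in magnitude by \(1/(2^k k!)\) whenever \(0 < \sigma^2 < 2\), and a small number of even-order Gaussian derivatives already captures \(N\left(0, \sigma^2\right)\).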
ASH

ASH on correlated null
ASH on correlated null: BH vs \(\hat\pi_0\)
ASH on correlated null: BH vs lfsr
Under the exchangeability assumption, the goodness of fit of empirical Bayes can be measured by the behavior of \(\left\{\hat F_j = \hat F_{\hat\beta_j | \hat s_j}\left(\hat\beta_j\mid\hat s_j\right)\right\}\). Meanwhile, if ASH is applied to correlated null \(z\) scores, not only will the estimated \(\hat g\) be different from \(\delta_0\); moreover, with this estimated \(\hat g\), \(\left\{\hat F_j\right\}\) might not behave like \(\text{Unif}\left[0, 1\right]\).
Essentially we hope to tell whether \(n\) random samples are uniformly distributed, a task that is more complicated than it sounds. There are multiple ways to do it, and some of them have been absorbed into the ashr package.
Take a look at the null \(z\) scores we’ve been experimenting with so far and check whether they truly are marginally \(N\left(0, 1\right)\). It’s an enhanced version of the simulation on occurrence of extreme observations. The non-null \(z\) scores are produced for comparison.
This whole project stems from a ubiquitous issue in the simultaneous inference problem: the deviation from what would be expected under the null can come both from distortion due to correlation and from enrichment due to true effects. Ideally, it would be perfect if we could create controls which keep the distortion by correlation unchanged but remove the true effects. Efron, in a series of papers (for example, Efron et al 2001), suggests creating the null from the raw data by resampling or permutation. We apply the idea to our data sets, and the results are not promising.
The previous investigation is boiled down to a prototypical problem: the classic normal means inference, but with heteroskedastic and especially correlated noise. Gaussian derivatives and ASH provide a way to tackle it.
cvxr is still under active development. We are moving to the more established Rmosek, exploring and experimenting with this convex optimization package.
Rmosek: Primal and dual
Rmosek: Normalization
Rmosek: Regularization
Several ways are explored to make fitting Gaussian derivatives more efficient and accurate.
ASH: Simulated true effects with homoscedastic correlated noise
ASH: Simulated true effects with heteroskedastic correlated noise
Fitting the correlated null with Gaussian derivatives has two constraints, non-negativity and orderly decay, and both are intractable. We are using \(L_1\) regularization to try to solve both.
Using the seqgendiff pipeline developed by David, Mengyin, and Matthew, we add simulated \(\pi_0\delta_0 + \left(1 - \pi_0\right)N\left(0, \sigma^2\right)\) signals to the GTEx/Liver RNA-seq expression matrix, and compare our method with BH, qvalue, locfdr, ash, sva, cate, and mouthwash. The pipeline and results here are mostly exploratory and at the development stage; a more mature version will follow.
Following Gao’s suggestion, we take a look at whether Gaussian derivatives can satisfactorily fit synthetic correlated \(N\left(0, 1\right)\) \(z\) scores. As Matthew suggested, this might have something to do with the distribution of eigenvalues of a random matrix and the Wigner semicircle distribution. It’s also an interesting comparison with Gaussian mixtures.
Large-scale simulations test GD-ASH with simulated effects of varying sparsity, strength, and shape, under realistic correlated noise.
AUC-wise, \(p\) values and BH should perform almost the same, since BH doesn’t change the order of \(p\) values, but in practice their performance might differ. It turns out ties are one important reason.
Talked about using Gaussian derivatives to handle correlation at Matthew’s group meeting.
An RNA-seq data set of mouse hearts contains \(23K\)+ genes and only 4 samples, two for each condition. Applying GD-ASH to this data set raises several interesting questions.
GD-ASH with Gaussian derivative likelihood: Initial simulation
GD-ASH with Gaussian derivative likelihood: Model, formulae, and simulation
When the noise is correlated, the likelihood should be correlated too. However, in the GD-ASH model, the noise is essentially treated as independently sampled from an empirical null fitted by Gaussian derivatives, so it would make more sense to use Gaussian derivatives, rather than a simple Gaussian, as the likelihood.
A more in-depth look into the primal vs dual problem in the Rmosek implementation.
A close look at Efron’s classic BRCA data, used in many of his papers to illustrate the empirical Bayes idea that inspired this project.
CASH

CASH Simulation, Part 1: Overall data sets
CASH Simulation, Part 2: Over-dispersed data sets
CASH Simulation, Part 3: Under-dispersed data sets
CASH Simulation, Part 4: Different effect size distributions
CASH Simulation, Part 5: pFDP variability
CASH Simulation, Part 6: Synthetic vs real data
CASH Simulation, Part 7: More on synthetic data
Using the same framework of Gaussian derivative likelihood, extensive simulations are performed to compare CASH with other methods, smaller simulations are used to facilitate the exploration, and different visualization methods are explored to illustrate the advantages of CASH.
CASH on FDR, Part 1: \(g = 0.5\delta_0 + 0.5N(0, \sigma_n^2)\)
CASH on FDR, Part 2: \(g = 0.5\delta_0 + 0.5N(0, (2\sigma_n)^2)\)
CASH on FDR, Part 3: \(g = 0.9\delta_0 + 0.1N(0, \sigma_n^2)\)
CASH on FDR, Part 4: \(g = 0.9\delta_0 + 0.1N(0, (2\sigma_n)^2)\)
CASH on FDR, Part 5: \(g = 0.5\delta_0 + 0.25N(-2\sigma_n, \sigma_n^2) + 0.25N(2\sigma_n, \sigma_n^2)\)
CASH on FDR, Part 6: \(g = 0.6\delta_0 + 0.3N(0, \sigma_n^2) + 0.1N(0, (2\sigma_n)^2)\)
An extensive study of CASH and other methods in different scenarios shows that 1. CASH produces more accurate \(\hat\pi_0\); 2. CASH produces less variable false discovery proportions at the nominal FDR cutoff; 3. CASH tends to overfit when the noise is under-dispersed or looks independent; 4. CASH doesn’t work well when the Unimodal Assumption (UA) is broken.
On the deconvolution front, CASH generates a more robust \(\hat g\) than ASH in the correlated noise case, and doesn’t show noticeable overfitting in the independent noise case.
First public presentation of CASH.
This R Markdown site was created with workflowr.