Last updated: 2017-12-21
Code version: 6e42447
Up until now, truncash only uses a threshold that is pre-specified, that is, independent of the data. So a natural question is: what happens if we choose a data-driven threshold, such as the \(n^\text{th}\) most extreme observation or the top \(q\%\) quantile?
For a start, Matthew raised the following question: what if the only thing we know is the most extreme observation \((\hat\beta_{(n)}, \hat s_{(n)})\), along with the total number of observations \(n\)? What does this single data point tell us?
Start with our usual ash model.
\[ \begin{array}{c} \hat\beta_j | \hat s_j, \beta_j \sim N(\beta_j, \hat s_j^2)\\ \beta_j \sim \sum_k\pi_k N(0, \sigma_k^2) \end{array} \] Now we only observe \((\hat\beta_{(n)}, \hat s_{(n)})\), with the information that \(|\hat\beta_{(n)}/\hat s_{(n)}| \geq |\hat\beta_{j}/\hat s_{j}|\), \(j = 1, \ldots, n\). This essentially separates the \(n\) observations into two groups.
\[
\text{Group 1: }(\hat\beta_{(1)}, \hat s_{(1)}), \ldots, (\hat\beta_{(n - 1)}, \hat s_{(n - 1)}), \text{ with } |\hat\beta_j/\hat s_j| \leq t = |\hat\beta_{(n)}/\hat s_{(n)}|
\] \[
\text{Group 2: }(\hat\beta_{(n)}, \hat s_{(n)}), \text{ with } |\hat\beta_{(n)}/\hat s_{(n)}| = t
\] In other words, it should be related to truncash using the threshold \(t = |\hat\beta_{(n)}/\hat s_{(n)}|\), at least from the likelihood principle point of view.
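To make the setup concrete, here is a minimal simulation sketch of this scheme. The mixture weights `pi_k`, standard deviations `sigma_k`, the distribution of \(\hat s_j\), and the sample size are all made-up illustrative choices, not part of the model above.

```r
set.seed(777)

## hypothetical mixture g: pi_k and sigma_k are illustrative, not estimated
n       <- 1e4
pi_k    <- c(0.8, 0.15, 0.05)
sigma_k <- c(0, 1, 3)

## simulate beta_j ~ g, then betahat_j | shat_j, beta_j ~ N(beta_j, shat_j^2)
k       <- sample(length(pi_k), n, replace = TRUE, prob = pi_k)
beta    <- rnorm(n, mean = 0, sd = sigma_k[k])
shat    <- sqrt(rchisq(n, df = 10) / 10)   # arbitrary heteroskedastic s.e.'s
betahat <- rnorm(n, mean = beta, sd = shat)

## the only "observed" data point: the most extreme z-score and its s.e.
z     <- betahat / shat
i.max <- which.max(abs(z))
t     <- abs(z[i.max])                     # threshold t = |betahat_(n) / shat_(n)|
c(betahat.max = betahat[i.max], shat.max = shat[i.max], t = t, n = n)
```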
Suppose \(X_1 \sim F_1, X_2\sim F_2, \ldots, X_n \sim F_n\), with \(F_i\) being the cdf of the random variable \(X_i\), with pdf \(f_i\). In ash's setting, we can think of \(X_i = |\hat\beta_i/ \hat s_i|\), and \(f_i\) as the convolution of a common unimodal distribution \(g\) (to be estimated) and the idiosyncratic likelihood of \(|\hat\beta_i / \hat s_i|\) given \(\hat s_i\) (usually related to the normal or Student's t, but it could be generalized to others). Let \(X_{(n)}:=\max\{X_1, X_2, \ldots, X_n\}\), the extreme value of these \(n\) random variables.
\[ \begin{array}{rl} & P(X_{(n)} \leq t) = \prod_{i = 1}^n F_i(t) \\ \Rightarrow & p_{X_{(n)}}(t) = dP(X_{(n)} \leq t)/dt = \sum_{j = 1}^n f_j(t)\prod_{i \neq j} F_i(t) \neq \prod_{i = 1}^{n-1} F_i(t)f_n(t) \end{array} \] where \(\{1, \ldots, n-1\}\) indexes the less extreme observations and \(n\) the most extreme one. So in general these two statements are not equivalent.
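As a quick numerical illustration of why the two expressions differ when the \(F_i\) are not identical, here is a small sketch; the choice of \(X_i = |N(0, s_i^2)|\) is arbitrary and only serves the check.

```r
## three non-identical distributions: X_i = |N(0, s_i^2)| (arbitrary choices)
s  <- c(1, 2, 3)
Fi <- function(t, s) pnorm(t, sd = s) - pnorm(-t, sd = s)  # cdf of |X_i|
fi <- function(t, s) 2 * dnorm(t, sd = s)                  # pdf of |X_i|

t0 <- 2.5
## density of the maximum: sum_j f_j(t) * prod_{i != j} F_i(t)
p.max <- sum(sapply(seq_along(s), function(j)
  fi(t0, s[j]) * prod(Fi(t0, s[-j]))))
## the expression prod_{i = 1}^{n - 1} F_i(t) * f_n(t)
p.trunc <- prod(Fi(t0, s[-length(s)])) * fi(t0, s[length(s)])
c(p.max = p.max, p.trunc = p.trunc)   # the two differ when F_i are not identical
```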
If \(F_1 = F_2 = \cdots = F_n\), the two statements are indeed closely related, because \[
\begin{array}{rl}
& P(X_{(n)} \leq t) = (F(t))^n \\
\Rightarrow & p_{X_{(n)}}(t) = dP(X_{(n)} \leq t)/dt = n(F(t))^{n-1}f(t) \\
\propto & (F(t))^{n-1}f(t)
\end{array}
\] In other words, we can regard “knowing only the largest observation” as equivalent to “using the largest observation as the threshold in truncash.”
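A numerical check of the iid case (again a sketch with an arbitrary choice of \(F\), here the distribution of \(|N(0,1)|\) z-scores):

```r
## iid case: X_i = |Z_i| with Z_i ~ N(0, 1), so F and f are those of |Z|
n  <- 50
F1 <- function(t) pnorm(t) - pnorm(-t)   # common cdf F
f1 <- function(t) 2 * dnorm(t)           # common pdf f

## closed form n * F(t)^(n-1) * f(t) agrees with the numerical derivative
## of P(X_(n) <= t) = F(t)^n, and is proportional to F(t)^(n-1) * f(t),
## the truncash-style likelihood with threshold t
t0 <- 3
h  <- 1e-5
c(closed.form = n * F1(t0)^(n - 1) * f1(t0),
  numerical   = (F1(t0 + h)^n - F1(t0 - h)^n) / (2 * h))
```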
\(F_1 = F_2 = \cdots = F_n\) in the current setting implies that \(\hat\beta_j / \hat s_j\) has the same marginal distribution for every observation. Actually, this is not a wild assumption. For example, we always have
\[ \hat\beta_j / \hat s_j | \beta_j, s_j, \nu_j \sim t_{\nu_j}(\beta_j / s_j) \] If we further assume
\[
\beta_j / s_j \sim g
\]
then we arrive at the result that \(\hat\beta_j / \hat s_j\) has the same marginal distribution for every \(j\). This assumption is essentially the gold standard everybody implicitly makes, referred to as the \(\alpha = 1\) assumption in ash.
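Below is a Monte Carlo sketch of this point, using the normal-likelihood version of the model for simplicity; the particular \(g\) and the two sets of standard errors are made up for illustration. Under \(\beta_j / s_j \sim g\), the marginal distribution of \(\hat\beta_j / \hat s_j\) does not depend on \(s_j\).

```r
set.seed(98105)
m <- 1e5

## hypothetical unimodal g for beta_j / s_j; any choice works for this check
g.sample <- function(m) rnorm(m, mean = 0, sd = 2)

## alpha = 1: beta_j = s_j * (draw from g), betahat_j | beta_j ~ N(beta_j, s_j^2)
## (treating shat_j = s_j as known, i.e. the normal-likelihood version)
simulate.z <- function(s) {
  beta    <- s * g.sample(length(s))
  betahat <- rnorm(length(s), mean = beta, sd = s)
  betahat / s
}

z.small <- simulate.z(rep(0.1, m))   # very precise observations
z.large <- simulate.z(rep(10, m))    # very noisy observations
ks.test(z.small, z.large)            # the two marginals are indistinguishable
```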
sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.2
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.4.3 backports_1.1.2 magrittr_1.5 rprojroot_1.3-1
[5] tools_3.4.3 htmltools_0.3.6 yaml_2.1.16 Rcpp_0.12.14
[9] stringi_1.1.6 rmarkdown_1.8 knitr_1.17 git2r_0.20.0
[13] stringr_1.2.0 digest_0.6.13 workflowr_0.8.0 evaluate_0.10.1