(Representative) Faculty of Science and Engineering, School of Fundamental Science and Engineering, Professor Tetsuya Sakai

I published the following full paper at SIGIR 2017, the top conference in information retrieval.
The following is the abstract:

Using classical statistical significance tests, researchers can only
discuss P(D+|H), the probability of observing the data D at hand or
something more extreme, under the assumption that the hypothesis
H is true (i.e., the p-value). But what we usually want is P(H|D),
the probability that a hypothesis is true, given the data. If we use
Bayesian statistics with state-of-the-art Markov Chain Monte Carlo
(MCMC) methods for obtaining posterior distributions, this is no
longer a problem. That is, instead of the classical p-values and 95%
confidence intervals, which are often misinterpreted respectively
as “probability that the hypothesis is (in)correct” and “probability
that the true parameter value falls within the interval is 95%,” we
can easily obtain P(H|D) and credible intervals which represent
exactly the above. Moreover, with Bayesian tests, we can easily
handle virtually any hypothesis, not just “equality of means,” and
obtain an Expected A Posteriori (EAP) value of any statistic that
we are interested in. We provide simple tools to encourage the
IR community to take up paired and unpaired Bayesian tests for
comparing two systems. Using a variety of TREC and NTCIR data,
we compare P(H|D) with p-values, credible intervals with confidence
intervals, and Bayesian EAP effect sizes with classical ones.
Our results show that (a) p-values and confidence intervals can
respectively be regarded as approximations of what we really want,
namely, P(H|D) and credible intervals; and (b) sample effect sizes
from classical significance tests can differ considerably from the
Bayesian EAP effect sizes, which suggests that the former can be
poor estimates of population effect sizes. For both paired and unpaired
tests, we propose that the IR community report the EAP, the
credible interval, and the probability of the hypothesis being true, not
only for the raw difference in means but also for the effect size in
terms of Glass’s delta.
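
As a rough illustration of the quantities listed above, the following is a minimal sketch of an unpaired Bayesian comparison of two systems. It is not the tool provided with the paper: instead of MCMC, it draws directly from a closed-form posterior, which is possible under the assumptions made here (each system's scores follow a normal distribution, with the noninformative prior p(mu, sigma^2) ∝ 1/sigma^2). The function names, the choice of system Y as the baseline for Glass's delta, and the per-topic scores are all hypothetical and serve only to show how the EAP, the 95% credible interval, and P(H|D) can be read off posterior draws.

```python
import random
import statistics


def posterior_draws(scores, n_draws, rng):
    """Draw (mu, sigma) pairs from the posterior of a normal model.

    Assumption: noninformative prior p(mu, sigma^2) ∝ 1/sigma^2, so the
    posterior can be sampled directly and no MCMC is needed.
    """
    n = len(scores)
    mean = statistics.mean(scores)
    var = statistics.variance(scores)  # sample variance (n - 1 denominator)
    draws = []
    for _ in range(n_draws):
        # sigma^2 | D  ~  (n - 1) * s^2 / chi-square(n - 1);
        # a chi-square(k) variate is Gamma(shape=k/2, scale=2).
        chi2 = rng.gammavariate((n - 1) / 2.0, 2.0)
        sigma2 = (n - 1) * var / chi2
        # mu | sigma^2, D  ~  Normal(mean, sigma^2 / n)
        mu = rng.gauss(mean, (sigma2 / n) ** 0.5)
        draws.append((mu, sigma2 ** 0.5))
    return draws


def summarize(values):
    """Return the EAP (posterior mean) and the central 95% credible interval."""
    ordered = sorted(values)
    lower = ordered[int(0.025 * len(ordered))]
    upper = ordered[int(0.975 * len(ordered)) - 1]
    return statistics.mean(ordered), (lower, upper)


def compare_unpaired(scores_x, scores_y, n_draws=20000, seed=0):
    """Compare systems X and Y; hypothesis H is 'mean of X > mean of Y'."""
    rng = random.Random(seed)
    post_x = posterior_draws(scores_x, n_draws, rng)
    post_y = posterior_draws(scores_y, n_draws, rng)
    # Raw difference in means, one value per posterior draw.
    diffs = [mx - my for (mx, _), (my, _) in zip(post_x, post_y)]
    # Glass's delta: difference standardized by the baseline (Y) std. dev.
    deltas = [(mx - my) / sy for (mx, _), (my, sy) in zip(post_x, post_y)]
    # P(H|D): fraction of posterior draws in which the hypothesis holds.
    p_h = sum(d > 0 for d in diffs) / n_draws
    return {
        "diff": summarize(diffs),
        "glass_delta": summarize(deltas),
        "p_h_given_d": p_h,
    }


if __name__ == "__main__":
    # Made-up per-topic scores, for illustration only.
    system_x = [0.50, 0.55, 0.60, 0.52, 0.58, 0.56, 0.54, 0.61]
    system_y = [0.20, 0.25, 0.30, 0.22, 0.28, 0.26, 0.24, 0.31]
    for name, value in compare_unpaired(system_x, system_y).items():
        print(name, value)
```

Because every quantity is computed from the same set of posterior draws, other hypotheses (for example, "the difference exceeds 0.05") can be evaluated by changing only the condition counted in `p_h`.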