Project number: 2017K-188
Date: 2018/02/21

Research project: Building an information access evaluation framework based on Bayesian statistics

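| | Researcher affiliation (at the time) | Position | Name |
|---|---|---|---|
| (Principal investigator) | Faculty of Science and Engineering, School of Fundamental Science and Engineering | Professor | 酒井 哲也 (Tetsuya Sakai) |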

- Summary of research results
- I published the following full paper at SIGIR 2017, the top conference in information retrieval.

The following is the abstract:

Using classical statistical significance tests, researchers can only discuss P(D+|H), the probability of observing the data D at hand or something more extreme, under the assumption that the hypothesis H is true (i.e., the p-value). But what we usually want is P(H|D), the probability that a hypothesis is true, given the data. If we use Bayesian statistics with state-of-the-art Markov Chain Monte Carlo (MCMC) methods for obtaining posterior distributions, this is no longer a problem. That is, instead of the classical p-values and 95% confidence intervals, which are often misinterpreted respectively as “probability that the hypothesis is (in)correct” and “probability that the true parameter value drops within the interval is 95%,” we can easily obtain P(H|D) and credible intervals which represent exactly the above. Moreover, with Bayesian tests, we can easily handle virtually any hypothesis, not just “equality of means,” and obtain an Expected A Posteriori (EAP) value of any statistic that we are interested in. We provide simple tools to encourage the IR community to take up paired and unpaired Bayesian tests for comparing two systems. Using a variety of TREC and NTCIR data, we compare P(H|D) with p-values, credible intervals with confidence intervals, and Bayesian EAP effect sizes with classical ones. Our results show that (a) p-values and confidence intervals can respectively be regarded as approximations of what we really want, namely, P(H|D) and credible intervals; and (b) sample effect sizes from classical significance tests can differ considerably from the Bayesian EAP effect sizes, which suggests that the former can be poor estimates of population effect sizes. For both paired and unpaired tests, we propose that the IR community report the EAP, the credible interval, and the probability of the hypothesis being true, not only for the raw difference in means but also for the effect size in terms of Glass’s delta.
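
The quantities the abstract proposes reporting (EAP, credible interval, and probability of the hypothesis being true, for both the raw difference in means and Glass's delta) can be illustrated with a minimal sketch. It is not the tool provided with the paper: instead of MCMC it draws directly from the closed-form posterior of a normal model under the noninformative prior p(mu, sigma^2) ∝ 1/sigma^2, and it assumes the two systems' scores are independent (the unpaired case) with the hypothesis H being "the mean of system X exceeds the mean of the baseline Y". The function names `posterior_draws`, `summarize`, and `unpaired_bayes` are hypothetical.

```python
import random
import statistics


def posterior_draws(scores, n_draws, rng):
    """Draw (mu, sigma) pairs from the posterior of a normal model.

    Assumes the noninformative prior p(mu, sigma^2) ∝ 1/sigma^2, under which
    sigma^2 | D = (n-1) s^2 / chi2_{n-1} and mu | sigma^2, D ~ N(xbar, sigma^2/n).
    """
    n = len(scores)
    xbar = statistics.fmean(scores)
    s2 = statistics.variance(scores)  # sample variance (n - 1 denominator)
    draws = []
    for _ in range(n_draws):
        # A chi-square variate with n-1 degrees of freedom is Gamma((n-1)/2, scale=2).
        chi2 = rng.gammavariate((n - 1) / 2.0, 2.0)
        sigma2 = (n - 1) * s2 / chi2
        mu = rng.gauss(xbar, (sigma2 / n) ** 0.5)
        draws.append((mu, sigma2 ** 0.5))
    return draws


def summarize(values, level=0.95):
    """Return the EAP, a percentile credible interval, and P(value > 0 | D)."""
    ordered = sorted(values)
    m = len(ordered)
    lower = ordered[int((1 - level) / 2 * m)]
    upper = ordered[int((1 + level) / 2 * m) - 1]
    return {
        "eap": statistics.fmean(ordered),
        "interval": (lower, upper),
        "p_positive": sum(v > 0 for v in ordered) / m,
    }


def unpaired_bayes(system_x, baseline_y, n_draws=20000, seed=0):
    """Summarize the posterior of the mean difference and of Glass's delta."""
    rng = random.Random(seed)
    dx = posterior_draws(system_x, n_draws, rng)
    dy = posterior_draws(baseline_y, n_draws, rng)
    diff = [mx - my for (mx, _), (my, _) in zip(dx, dy)]
    # Glass's delta standardizes the difference by the baseline's standard deviation.
    delta = [(mx - my) / sy for (mx, _), (my, sy) in zip(dx, dy)]
    return {"diff": summarize(diff), "glass_delta": summarize(delta)}
```

Calling `unpaired_bayes(x_scores, y_scores)` on two lists of per-topic scores returns, for each quantity, the EAP under `"eap"`, the 95% credible interval under `"interval"`, and an estimate of P(H|D) under `"p_positive"`. A paired test could follow the same pattern by passing the per-topic score differences to `posterior_draws`.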