Project number: 2017K-188
Date: 2018/02/21
    
Research project: Building an information access evaluation framework based on Bayesian statistics
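| | Affiliation (at the time) | Position | Name |
|---|---|---|---|
| (Principal investigator) | Faculty of Science and Engineering, School of Fundamental Science and Engineering | Professor | 酒井 哲也 (Tetsuya Sakai) |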
- Summary of research results
- I published a full paper at SIGIR 2017, the top conference in information retrieval.
 Its abstract follows:
 Using classical statistical significance tests, researchers can only
 discuss P(D+|H), the probability of observing the data D at hand or
 something more extreme, under the assumption that the hypothesis
 H is true (i.e., the p-value). But what we usually want is P(H|D),
 the probability that a hypothesis is true, given the data. If we use
 Bayesian statistics with state-of-the-art Markov Chain Monte Carlo
 (MCMC) methods for obtaining posterior distributions, this is no
 longer a problem. That is, instead of the classical p-values and 95%
 confidence intervals, which are often misinterpreted respectively
 as “probability that the hypothesis is (in)correct” and “probability
 that the true parameter value falls within the interval is 95%,” we
 can easily obtain P(H|D) and credible intervals which represent
 exactly the above. Moreover, with Bayesian tests, we can easily
 handle virtually any hypothesis, not just “equality of means,” and
 obtain an Expected A Posteriori (EAP) value of any statistic that
 we are interested in. We provide simple tools to encourage the
 IR community to take up paired and unpaired Bayesian tests for
 comparing two systems. Using a variety of TREC and NTCIR data,
 we compare P(H|D) with p-values, credible intervals with confidence
 intervals, and Bayesian EAP effect sizes with classical ones.
 Our results show that (a) p-values and confidence intervals can
 respectively be regarded as approximations of what we really want,
 namely, P(H|D) and credible intervals; and (b) sample effect sizes
 from classical significance tests can differ considerably from the
 Bayesian EAP effect sizes, which suggests that the former can be
 poor estimates of population effect sizes. For both paired and unpaired
 tests, we propose that the IR community report the EAP, the
 credible interval, and the probability of the hypothesis being true, not
 only for the raw difference in means but also for the effect size in
 terms of Glass’s delta.
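
The quantities the abstract recommends reporting can be illustrated with a minimal sketch of an unpaired Bayesian test. This is not the tool provided with the paper. It assumes each system's per-topic scores are normally distributed with their own mean and variance, uses the noninformative prior p(mu, sigma^2) proportional to 1/sigma^2, and replaces MCMC with direct Monte Carlo sampling, which is possible because this particular posterior has a closed form. The function names `posterior_draws` and `unpaired_bayesian_test` are hypothetical, and Glass's delta is taken here as the difference in means divided by the baseline system's standard deviation.

```python
import numpy as np


def posterior_draws(scores, n_draws, rng):
    """Draw (mu, sigma) from the posterior of a normal model.

    Assumes the prior p(mu, sigma^2) proportional to 1/sigma^2, for which
    sigma^2 | data = (n - 1) * s^2 / chi2(n - 1) and
    mu | sigma^2, data ~ Normal(sample mean, sigma^2 / n).
    """
    x = np.asarray(scores, dtype=float)
    n = x.size
    s2 = x.var(ddof=1)  # unbiased sample variance
    sigma2 = (n - 1) * s2 / rng.chisquare(n - 1, size=n_draws)
    mu = rng.normal(x.mean(), np.sqrt(sigma2 / n))
    return mu, np.sqrt(sigma2)


def summarize(draws):
    """EAP (posterior mean) and 95% credible interval from posterior draws."""
    lo, hi = np.percentile(draws, [2.5, 97.5])
    return {"eap": float(draws.mean()), "ci95": (float(lo), float(hi))}


def unpaired_bayesian_test(system_x, baseline_y, n_draws=100_000, seed=0):
    """Compare two systems; the hypothesis H is 'mean of X > mean of Y'."""
    rng = np.random.default_rng(seed)
    mu_x, _ = posterior_draws(system_x, n_draws, rng)
    mu_y, sigma_y = posterior_draws(baseline_y, n_draws, rng)
    diff = mu_x - mu_y       # raw difference in means
    delta = diff / sigma_y   # Glass's delta, standardized by the baseline
    return {
        "p_h_given_d": float((diff > 0).mean()),  # estimate of P(H|D)
        "diff": summarize(diff),
        "glass_delta": summarize(delta),
    }


if __name__ == "__main__":
    # Invented per-topic scores, for illustration only.
    x = [0.52, 0.61, 0.48, 0.70, 0.66, 0.59, 0.63, 0.55]
    y = [0.40, 0.45, 0.38, 0.52, 0.47, 0.41, 0.50, 0.43]
    print(unpaired_bayesian_test(x, y))
```

Because every summary is computed from the same set of posterior draws, the probability of any other hypothesis (for example, that the difference exceeds some threshold) can be estimated by changing the condition `diff > 0`.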