Research Notes

Accurate Methods for the Statistics of Surprise and Coincidence. Ted Dunning. Computational Linguistics 1993.

"ordinary words are 'rare', any statistical work with texts must deal with the reality of rare events ... Unfortunately, the foundational assumption of most common statistical analyses used in computational linguistics is that the events being analyzed are relatively common."
Counting a word w can be viewed as a series of Bernoulli trials. Each token is tested to see if it is w. Assuming a uniform probability p that it is w, the count is distributed binomially and, for np(1-p)>5 (where n=number of tokens), it is distributed almost normally. But this is not true when np(1-p)<1.
Given outcomes k, propose a model with parameters w. The likelihood of the parameter value w is P(k|w). A hypothesis is a subset of the parameter space W.
The likelihood ratio of the hypothesis (parameter subspace) W0 = LR = max_{w \in W0} P(k|w) / max_{w \in W} P(k|w).
Fact: -2 log LR ~ \Chi^2(dim W - dim W0).

More info on parametric and distribution-free tests: Bradley (1968), and Mood, Graybill, and Boes (1974).
Likelihood ratio tests: Mood et al. (1974)

Friday, April 26, 2013