Wednesday, October 1, 2014

Relation between precision and recall in binary classification

Let $tp, fp, fn,$ and $tn$ be the number of true positives, false positives, false negatives and true negatives obtained on some set by a binary classifier. Let $P$ and $R$ be the precision and recall given by $P=\frac{tp}{tp+fp}, R = \frac{tp}{tp+fn}$. Let $N=tp+fp+fn+tn$ be the total size of the set. The precision and recall for the negative class are $P'=\frac{tn}{tn+fn}, R'=\frac{tn}{tn+fp}$.

We want to show that $P'$ increases with $R$, and $R'$ increases with $P$ (and hence analyzing the performance of any one class is sufficient).

We use $$P'=\frac{N-(tp+fp)-fn}{N-(tp+fp)} = 1 - \frac{fn}{N-(tp+fp)},$$ and
$$R = \frac{tp}{tp+fn} \Rightarrow tp+fn = \frac{tp}{R} \Rightarrow fn = tp\left(\frac{1}{R}-1\right),$$ to get $$P' = 1 - \left(\frac{1}{R}-1\right) \frac{tp}{N-(tp+fp)} = 1 - \left(\frac{1}{R}-1\right) \frac{1}{\frac{N}{tp}-\left(1+\frac{fp}{tp}\right)}.$$

For a fixed data set, $tp+fn$ (the number of actual positives) is constant, so $R$ increases exactly when $tp$ increases (equivalently, when $fn$ decreases); and a change in $tp$ has no effect on $fp$ (and hence none on $tn$). Now note that the second term on the RHS simplifies to $\frac{fn}{N-(tp+fp)} = \frac{fn}{fn+tn}$, with $tn$ fixed. Thus increasing $R$ (i.e., decreasing $fn$) decreases this term, which increases $P'$. QED.

A similar relation holds between $P$ and $R'$ by symmetry between the two classes.
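As a quick numeric sanity check of the claim (a minimal sketch; all counts below are made up), hold $fp$ fixed and trade $fn$ for $tp$: both $R$ and $P'$ should rise together.

```python
# Sanity check: with fp (and hence tn) fixed, raising tp at the expense
# of fn increases both R and P'. All counts are arbitrary.
N = 1000
fp = 50
actual_pos = 200  # tp + fn is fixed for a given data set

for tp in [100, 140, 180]:
    fn = actual_pos - tp
    tn = N - tp - fp - fn
    R = tp / (tp + fn)        # recall of the positive class
    P_neg = tn / (tn + fn)    # precision of the negative class
    print(f"tp={tp}: R={R:.3f}, P'={P_neg:.3f}")
```

Running this prints monotonically increasing values for both $R$ and $P'$.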

So what?
  1. When comparing the results of two binary classifiers, we might be interested in how the classifiers perform on both classes, rather than just one. In such cases, it is sufficient to compute precision and recall for one of the classes and compare those two numbers across the classifiers. If one classifier has better precision on the positive class, it also has better recall on the negative class.
  2. If we want to find the best hyper-parameters for a classifier using grid search, and we want parameters for which the classifier does well on both classes, then we could define "best" in terms of a score that incorporates both precision and recall (e.g. the F1 score). This ensures, e.g., that precision is not optimized at the cost of recall (i.e. at the cost of precision on the negative class).
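For instance, with scikit-learn (a minimal sketch; the SVC classifier, the synthetic data set, and the parameter grid are placeholder choices):

```python
# Grid search scored by F1, so that neither precision nor recall is
# optimized at the other's expense. Classifier, data, and grid are
# placeholders.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
grid = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]},
                    scoring="f1", cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```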

Tuesday, July 16, 2013

Language Models for Keyword Search over Data Graphs. Yosi Mass, Yehoshua Sagiv. WSDM 2012
  • Problem
    • given a keyword query, find entities in a graph of entities
      • the graph is probably derived from a database, and it is presumed that the user will find SQL difficult to use.
      • examples of such databases include Wikipedia, IMDB, and Mondial.

Tuesday, July 9, 2013

Characterizing the Influence of Domain Expertise on Web Search Behavior. Ryen W. White, Susan T. Dumais, Jaime Teevan. WSDM 2009
  • Look up
    • maximum-margin averaged perceptron (Collins, M. Discriminative training methods for hidden markov models: Theory and experiments with perceptron algorithms. EMNLP 2002) 
Disorder Inequality: A Combinatorial Approach to Nearest Neighbor Search. Navin Goyal, Yury Lifshits, Hinrich Schütze. WSDM 2008
  • Ideas
    • "combinatorial" setting: only comparisons between similarity scores allowed.
    • two randomized algorithms for nearest-neighbor search, requiring O(n^2) and O(n log n) space for the preprocessed data

Sunday, June 30, 2013

Cross-Lingual Latent Topic Extraction. Duo Zhang, Qiaozhu Mei, ChengXiang Zhai. ACL 2010
  • Key ideas
    • Input: unaligned document sets in two languages, a bilingual dictionary
    • Output: 
      • a set of aligned topics (word distributions) in the two languages, that can characterize the shared topics
      • a topic coverage distribution for each language (coverage of each topic in that language)
    • Method:
      • Start with ML objective of PLSA
      • Add a term to incorporate dictionary constraints (DC)
      • Dictionary modeled as a weighted bipartite graph (weight = translation probability)
      • ML using Generalized EM (because DC maximization has no closed-form solution)
        • We can't maximize the DC term exactly; instead, just try to improve over the current value
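The accept-only-if-better idea of generalized EM is generic; here is a minimal sketch of a generalized M-step (the objective `q` and the numerical gradient-ascent update are stand-ins, not the paper's PLSA-plus-dictionary model):

```python
import numpy as np

def gem_m_step(theta, q, lr=0.1, eps=1e-6):
    """One generalized-EM M-step: instead of maximizing Q(theta) in
    closed form, take a gradient-ascent step and keep it only if it
    improves Q over the current parameters."""
    # numerical gradient of Q at theta
    grad = np.array([
        (q(theta + eps * e) - q(theta - eps * e)) / (2 * eps)
        for e in np.eye(len(theta))
    ])
    candidate = theta + lr * grad
    # accept only if Q improves; otherwise keep the current parameters
    return candidate if q(candidate) > q(theta) else theta
```

Inside the EM loop this replaces the usual argmax M-step; the likelihood still never decreases, which is all EM's convergence argument needs.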

Friday, April 26, 2013

Accurate Methods for the Statistics of Surprise and Coincidence. Ted Dunning. Computational Linguistics 1993.
  • Ideas
    • "ordinary words are 'rare', any statistical work with texts must deal with the reality of rare events ... Unfortunately, the foundational assumption of most common statistical analyses used in computational linguistics is that the events being analyzed are relatively common."
    • Counting a word w can be viewed as a series of Bernoulli trials. Each token is tested to see if it is w. Assuming a uniform probability p that it is w, the count is distributed binomially and, for np(1-p)>5 (where n=number of tokens), it is distributed almost normally. But this is not true when np(1-p)<1.
    • Given outcomes k, propose a model with parameters w. The likelihood of the parameter value w is P(k|w). A hypothesis is a subset of the parameter space W.
    • The likelihood ratio of the hypothesis (parameter subspace) W0 = LR = max_{w \in W0} P(k|w) / max_{w \in W} P(k|w).
    • Fact: -2 log LR ~ χ²(dim W - dim W0).
  • References
    • More info on parametric and distribution-free tests: Bradley (1968), and Mood, Graybill, and Boes (1974).
    • Likelihood ratio tests: Mood et al. (1974)
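A concrete instance of the likelihood-ratio test (the standard co-occurrence application, sketched here in Python rather than taken from the paper): for a 2x2 contingency table the statistic reduces to $G^2 = 2\sum O \ln(O/E)$, which is asymptotically χ² with one degree of freedom.

```python
import math

def g2(k11, k12, k21, k22):
    """Log-likelihood-ratio statistic for a 2x2 contingency table,
    e.g. counts of (w2 after w1), (other after w1), (w2 after other),
    (other after other)."""
    table = [[k11, k12], [k21, k22]]
    n = k11 + k12 + k21 + k22
    rows = [k11 + k12, k21 + k22]
    cols = [k11 + k21, k12 + k22]
    g = 0.0
    for i in range(2):
        for j in range(2):
            o = table[i][j]
            if o > 0:                       # 0 * log(0) -> 0
                e = rows[i] * cols[j] / n   # expected count under independence
                g += o * math.log(o / e)
    return 2 * g

# Values above 3.84 (the 5% critical value of chi^2 with df=1)
# indicate a significant association.
```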

Thursday, April 25, 2013

Statistical Extraction and Comparison of Pivot Words for Bilingual Lexicon Extension. Daniel Andrade, Takuya Matsuzaki, Jun'ichi Tsujii. TALIP 2012
  • Ideas
    • Use only statistically significant context---determined using Bayesian estimate of PMI
    • "calculate a similarity score ... using the probability that the same pivots [(words from the seed lexicon)] will be extracted for both the query word and the translation candidate."
    • "several context [features] ... a bag-of-words of one sentence, and the successors, predecessors, and siblings with respect to the dependency parse tree of the sentence."
    • "In order to make these context positions comparable across Japanese and English ... we use several heuristics to adjust the dependency trees appropriately."
  • Comments
    •  "the degree of association is defined as a measurement for finding words that co-occur, or which do not co-occur, more often than we would expect by pure chance [e.g.] Log-Likelihood-Ratio ... As an alternative, we suggest to use the statistical significance of a positive association"
    • Heuristics to make dependency trees comparable are language-pair-specific (for EN-JP only)
  • Related work
    • Standard approach
      • Fix corpora in two languages, and pivot words (seed lexicon)
      • For each query word, construct a vector of pivot words and compare (a minimal sketch of this pipeline appears after this list).
        • Construct vector: some measure of association between query word and pivot word
        • Compare: some similarity measure suitable for association measure
      • "Context" is a bag-of-words, usually a sentence (or doc?).
    • Variations of standard approach
      • Peters and Picchi 1997: PMI
      • Fung 1998: tf.idf and cosine similarity
      • Rapp 1999: log-likelihood ratio and Manhattan distance
      • Koehn and Knight 2002: Spearman correlation
      • Pekar et al 2006: conditional probability
      • Laroche and Langlais 2010: log-odds-ratio and cosine similarity
    • Variations incorporating syntax
      • Rapp 1999: use word order (assumes word ordering is similar for both languages)
      • Pekar et al 2006: use verb-noun dependency
      • Otero & Campos 2008: POS tag the corpus and use lexico-syntactic patterns as features; e.g. extract (see, SUBJ, man) from "A man sees a dog." and use (see, SUBJ, *) to find translations for "man".
      • Garera et al 2009: use predecessors and successors in dependency graph (and do not use bag-of-words at all)
    • Variations incorporating non-pivot words (to overcome the "seed lexicon bottleneck")
      • Gaussier et al 2004: construct vector of all words (and not just pivot words) for the query word and each pivot word. Now construct vector of pivot words, and instead of association measure between query and pivot, use the similarity between all-words query vector and all-words pivot vector.
      • Dejean et al 2002: use domain-specific multilingual thesaurus
    • Variations incorporating senses
      • Ismail and Manandhar 2010: construct query vector "given" another word (the sense-disambiguator (SD) word, say). For a query word, one can construct different vectors given different SD words. For each vector, find translation.
    • Probabilistic approach
      • Haghighi et al 2008: use a generative model where source and target words are generated from a common latent subspace. Maximize likelihood in the graphical model to learn the source-target matchings.
        • "suffers from high computational costs ... They did not compare [with] ...  standard context vector approaches, which makes it difficult to estimate the possible gains from their method."
    • Graph-based approach
      • Laws et al 2010
        • one graph per language, words as nodes, 3 types of nodes (adjectives, verbs, nouns) and 3 types of edges (adjectival modification, verb-object relation, noun coordination), edge weights represent strength of correlation
        • seed lexicon for connecting the two graphs
        • node pair similarity computed using SimRank-like algorithm
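A minimal sketch of the standard context-vector pipeline noted above (PMI as the association measure, cosine as the similarity; the corpora, counts, and pivot list are placeholders):

```python
import math

def pmi_vector(word, pivots, cooc, count, total):
    """Vector of PMI scores between `word` and each pivot word.
    cooc[(a, b)]: co-occurrence count; count[a]: occurrence count;
    total: number of contexts. All of these are placeholder inputs."""
    vec = []
    for p in pivots:
        c = cooc.get((word, p), 0)
        vec.append(math.log(c * total / (count[word] * count[p])) if c else 0.0)
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# With pivot dimensions aligned across languages via the seed lexicon:
# best = max(candidates,
#            key=lambda c: cosine(query_vec,
#                                 pmi_vector(c, tgt_pivots, tgt_cooc,
#                                            tgt_count, tgt_total)))
```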
Attended a literature review on Question Answering by Akihiro Katsura. Some interesting references.
  • Green, Chomsky, et al. 1961. The BASEBALL system.
    • rule-based
  • Isozaki et al. 2009.
    • machine learning-based
  • Methodology of QA
    • Question analysis
      • Xue et al. SIGIR 2008.---Retrieval models for QA archives
    • Text retrieval
      • Jones et al. IPM 2000.---Okapi/BM25
      • Berger et al. SIGIR 2000.---Bridging the lexical chasm (OOV problem)
      • Brill et al. TREC 2001.---Data intensive QA
    • Answer candidate extraction
      • Lafferty et al. ICML 2001.---CRF
      • Ravichandran & Hovy. ACL 2002.---Surface text patterns for QA
    • Answer selection
      • Clarke et al. SIGIR 2001.---Redundancy in QA
  • Other related work
    • Cao et al. WWW 2008.---Recommending questions
    • Wang et al. SIGIR 2009.---Similar questions
    • Wang et al. SIGIR 2009.---answer ranking
    • Jeon et al. SIGIR 2006.---answer ranking with non-textual features (votes, etc.)