Research Notes

Statistical Extraction and Comparison of Pivot Words for Bilingual Lexicon Extension. Daniel Andrade, Takuya Matsuzaki, Jun'ichi Tsujii. TALIP 2012

Ideas

Use only statistically significant context, determined using a Bayesian estimate of PMI
"calculate a similarity score ... using the probability that the same pivots [(words from the seed lexicon)] will be extracted for both the query word and the translation candidate."
"several context [features] ... a bag-of-words of one sentence, and the successors, predecessors, and siblings with respect to the dependency parse tree of the sentence."
"In order to make these context positions comparable across Japanese and English ... we use several heuristics to adjust the dependency trees appropriately."

Comments

"the degree of association is defined as a measurement for finding words that co-occur, or which do not co-occur, more often than we would expect by pure chance [e.g.] Log-Likelihood-Ratio ... As an alternative, we suggest to use the statistical significance of a positive association"
The heuristics for making dependency trees comparable are language-pair-specific (here, English-Japanese only)

Related work

Standard approach

Fix a corpus in each of the two languages, and a set of pivot words (the seed lexicon)
For each query word and each translation candidate, construct a vector over the pivot words, then compare the vectors.

Construct vector: each entry is some measure of association between the word and a pivot word
Compare: some similarity measure suited to the association measure
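
The two steps above can be sketched as follows. This is a minimal illustration, not the method of any particular paper: it assumes sentence-level co-occurrence, positive PMI as the association measure, and cosine as the similarity measure. The function names and the dictionary-based seed lexicon are hypothetical.

```python
import math
from collections import Counter

def pmi_vectors(sentences, pivots):
    """Map each word to {pivot: PMI}, keeping only positive PMI values.

    sentences: list of token lists; pivots: set of pivot words.
    Co-occurrence is counted once per sentence (bag-of-words context).
    """
    n = len(sentences)
    word_df = Counter()  # number of sentences containing each word
    pair_df = Counter()  # number of sentences containing (word, pivot)
    for sent in sentences:
        words = set(sent)
        word_df.update(words)
        for w in words:
            for p in words & pivots:
                if p != w:
                    pair_df[(w, p)] += 1
    vectors = {}
    for (w, p), c in pair_df.items():
        pmi = math.log(c * n / (word_df[w] * word_df[p]))
        if pmi > 0:
            vectors.setdefault(w, {})[p] = pmi
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def rank_candidates(query_vec, target_vectors, seed):
    """Translate the query's pivots through the seed lexicon, then rank target words."""
    translated = {seed[p]: s for p, s in query_vec.items() if p in seed}
    scores = {t: cosine(translated, vec) for t, vec in target_vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

The seed lexicon is what makes the two vector spaces comparable: only dimensions that have a translation survive the mapping, so a small seed lexicon leaves few dimensions to compare on.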

"Context" is a bag-of-words, usually a sentence (or doc?).

Variations of standard approach

Peters and Picchi 1997: PMI
Fung 1998: tf.idf and cosine similarity
Rapp 1999: log-likelihood ratio and Manhattan distance
Koehn and Knight 2002: Spearman correlation
Pekar et al 2006: conditional probability
Laroche and Langlais 2010: log-odds-ratio and cosine similarity
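
Several of the association measures listed above can be computed from the same 2x2 contingency table of a word x and a pivot y. The sketch below uses the textbook definitions of PMI, the log-odds-ratio, and the log-likelihood ratio (G²); the exact variants used in the cited papers may differ, and the smoothing constant `k=0.5` is an assumption made here to avoid division by zero.

```python
import math

def contingency(n_xy, n_x, n_y, n):
    """2x2 table (a, b, c, d) for word x and pivot y over n contexts."""
    a = n_xy                    # contexts containing both x and y
    b = n_x - n_xy              # x without y
    c = n_y - n_xy              # y without x
    d = n - n_x - n_y + n_xy    # neither x nor y
    return a, b, c, d

def pmi(a, b, c, d):
    """Pointwise mutual information: log p(x,y) / (p(x) p(y))."""
    if a == 0:
        return float("-inf")
    n = a + b + c + d
    return math.log(a * n / ((a + b) * (a + c)))

def log_odds_ratio(a, b, c, d, k=0.5):
    """Log-odds-ratio with additive smoothing k (k is an assumption here)."""
    return math.log(((a + k) * (d + k)) / ((b + k) * (c + k)))

def log_likelihood_ratio(a, b, c, d):
    """G^2 statistic: 2 * sum over cells of observed * ln(observed / expected)."""
    n = a + b + c + d
    cells = (
        (a, a + b, a + c),
        (b, a + b, b + d),
        (c, c + d, a + c),
        (d, c + d, b + d),
    )
    total = 0.0
    for observed, row_sum, col_sum in cells:
        if observed > 0:
            expected = row_sum * col_sum / n
            total += observed * math.log(observed / expected)
    return 2.0 * total
```

Note that the log-likelihood ratio is large for both positive and negative association, whereas PMI and the log-odds-ratio are signed.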

Variations incorporating syntax

Rapp 1999: use word order (assumes word ordering is similar for both languages)
Pekar et al 2006: use verb-noun dependency
Otero & Campos 2008: POS tag the corpus and use lexico-syntactic patterns as features; e.g. extract (see, SUBJ, man) from "A man sees a dog." and use (see, SUBJ, *) to find translations for "man".
Garera et al 2009: use predecessors and successors in dependency graph (and do not use bag-of-words at all)
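
The lexico-syntactic pattern idea from the Otero & Campos entry above can be sketched as follows, assuming the parser output is already available as (head, relation, dependent) triples. The function name and the additional head-side feature are assumptions for illustration, not taken from the paper.

```python
from collections import Counter, defaultdict

def pattern_features(triples):
    """Map each word to a Counter of the patterns whose open slot it fills.

    Each triple is (head, relation, dependent). The slot occupied by the
    word itself is replaced with "*", so "man" in (see, SUBJ, man)
    receives the feature (see, SUBJ, *).
    """
    features = defaultdict(Counter)
    for head, rel, dep in triples:
        features[dep][(head, rel, "*")] += 1   # feature for the dependent
        features[head][("*", rel, dep)] += 1   # feature for the head (assumed here)
    return features

# Triples for "A man sees a dog."; the OBJ label is illustrative.
triples = [("see", "SUBJ", "man"), ("see", "OBJ", "dog")]
features = pattern_features(triples)
```

These pattern counts then take the place of bag-of-words counts when building the context vector for a word.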

Variations incorporating non-pivot words (to overcome the "seed lexicon bottleneck")

Gaussier et al 2004: for the query word and for each pivot word, construct a vector over all words (not just pivot words). Then construct the query's pivot vector as usual, but replace the association measure between query and pivot with the similarity between the all-words query vector and the all-words pivot vector.
Dejean et al 2002: use domain-specific multilingual thesaurus
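
A rough sketch of the second-order construction in the Gaussier et al entry above, assuming the all-words vectors are already computed and stored as sparse dicts, and assuming cosine as the similarity (the paper may use a different one). The names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def second_order_vector(query, pivots, all_words_vectors):
    """Pivot vector whose entries are similarities between all-words vectors.

    all_words_vectors: {word: {context_word: association}} for the query
    word and every pivot word, built over the full vocabulary.
    """
    q = all_words_vectors.get(query, {})
    return {p: cosine(q, all_words_vectors.get(p, {})) for p in pivots}
```

A query word can thus get a non-zero entry for a pivot it never co-occurs with, as long as the two appear in similar contexts.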

Variations incorporating senses

Ismail and Manandhar 2010: construct the query vector "given" another word, the sense-disambiguator (SD) word. A query word gets a different vector for each SD word, and a translation is found for each vector.

Probabilistic approach

Haghighi et al 2008: use a generative model where source and target words are generated from a common latent subspace. Maximize likelihood in the graphical model to learn the source-target matchings.

"suffers from high computational costs ... They did not compare [with] ... standard context vector approaches, which makes it difficult to estimate the possible gains from their method."

Graph-based approach

Laws et al 2010

one graph per language, words as nodes, 3 types of nodes (adjectives, verbs, nouns) and 3 types of edges (adjectival modification, verb-object relation, noun coordination), edge weights represent strength of correlation
seed lexicon for connecting the two graphs
node pair similarity computed using SimRank-like algorithm
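
To make the SimRank-like step concrete, here is a heavily simplified sketch. It is not the algorithm of Laws et al: it ignores edge weights and edge types, treats both graphs as unweighted neighbour sets, and pins seed pairs to similarity 1. The decay factor `c` and the iteration count are arbitrary assumptions.

```python
def bilingual_simrank(src_graph, tgt_graph, seed, c=0.8, iterations=5):
    """Similarity between source and target nodes, propagated from seed pairs.

    src_graph, tgt_graph: {node: set of neighbouring nodes}
    seed: {source_node: target_node}, the seed lexicon
    Two nodes are similar to the extent that their neighbours are similar.
    """
    sim = {(s, t): 1.0 for s, t in seed.items()}
    for _ in range(iterations):
        # Seed pairs stay fixed at 1.0 in every iteration.
        new_sim = {(s, t): 1.0 for s, t in seed.items()}
        for s, s_nbrs in src_graph.items():
            for t, t_nbrs in tgt_graph.items():
                if (s, t) in new_sim or not s_nbrs or not t_nbrs:
                    continue
                total = sum(sim.get((a, b), 0.0) for a in s_nbrs for b in t_nbrs)
                if total > 0:
                    new_sim[(s, t)] = c * total / (len(s_nbrs) * len(t_nbrs))
        sim = new_sim
    return sim
```

Unlike the context-vector approach, similarity here can propagate over several hops, so a word with no direct link to any seed word can still be matched.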

Thursday, April 25, 2013
