Research Notes

Identifying Word Translations from Comparable Documents Without a Seed Lexicon. Reinhard Rapp, Serge Sharoff, Bogdan Babych. LREC 2012

Idea

Assume only document-aligned comparable corpora and no seed lexicon (one "typically comprising at least 10,000 words")
Characterize each article by a set of keywords
"Formulate translation identification as a variant of the word alignment problem in a noisy setting"

In practice, the alignment is solved using a neural-net-style algorithm by Rumelhart & McClelland (1987)

Comments

"If ... in language A two words co-occur more often than expected by chance, then their translated equivalents in language B should also co-occur more frequently than expected."
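To make "more often than expected by chance" concrete, here is a minimal sketch of my own, not the paper's procedure. It assumes each document is reduced to a set of keywords and compares observed document-level co-occurrence with what independence would predict. The function name cooccurrence_ratios and the observed/expected ratio are illustrative choices; the paper may use a different association measure.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_ratios(docs):
    """Observed/expected document-level co-occurrence ratio per word pair.

    docs: iterable of keyword collections, one per document.
    A ratio above 1 means the pair shares documents more often than
    independence would predict. (Hypothetical helper, not from the paper.)
    """
    docs = [set(d) for d in docs]
    n = len(docs)
    doc_freq = Counter()   # number of documents containing each word
    pair_freq = Counter()  # number of documents containing both words
    for keywords in docs:
        doc_freq.update(keywords)
        pair_freq.update(combinations(sorted(keywords), 2))

    ratios = {}
    for (w1, w2), observed in pair_freq.items():
        # Expected number of shared documents if w1 and w2 were independent.
        expected = doc_freq[w1] * doc_freq[w2] / n
        ratios[(w1, w2)] = observed / expected
    return ratios


docs_en = [
    {"bank", "money"},
    {"bank", "money"},
    {"river", "water"},
    {"river", "water"},
]
ratios_en = cooccurrence_ratios(docs_en)
```

Running the same function on the aligned documents of language B gives a second set of ratios; the quoted assumption is that pairs with high ratios in A should map to pairs with high ratios in B.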

Experiments

Preprocessing

lemmatization (of corpora and evaluation pairs)
"we use the log-likelihood score as a measure of keyness [or salience of words in a document], since it has been shown to be robust to small [documents] ... the threshold of 15.13 for the log-likelihood score is a conservative recommendation for statistical significance."
"[we] applied a threshold of five [occurrences] ... [and] added all words of the ... gold standard(s) [even if they were below the threshold]"
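A rough sketch of the two filtering steps above, under stated assumptions. The keyness function uses the common two-term log-likelihood formulation from corpus linguistics (word frequency in the document vs. in a reference corpus); the notes do not say which exact formula the paper uses, so treat that as my assumption. Keeping only over-represented words, reading the frequency threshold as "at least five", and all function names are also illustrative.

```python
import math

LL_THRESHOLD = 15.13  # threshold quoted in the notes above


def log_likelihood(a, b, c, d):
    """Two-term log-likelihood keyness score (assumed formulation).

    a: frequency of the word in the document
    b: frequency of the word in the reference corpus
    c: total tokens in the document
    d: total tokens in the reference corpus
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in the document
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def keywords(doc_counts, ref_counts, threshold=LL_THRESHOLD):
    """Words over-represented in the document with keyness >= threshold."""
    c = sum(doc_counts.values())
    d = sum(ref_counts.values())
    selected = set()
    for word, a in doc_counts.items():
        b = ref_counts.get(word, 0)
        # Assumption: only words relatively more frequent in the document
        # than in the reference count as keywords.
        if a / c > b / d and log_likelihood(a, b, c, d) >= threshold:
            selected.add(word)
    return selected


def vocabulary(corpus_counts, gold_words, min_freq=5):
    """Words with at least min_freq occurrences, plus all gold-standard words."""
    frequent = {w for w, n in corpus_counts.items() if n >= min_freq}
    return frequent | set(gold_words)


doc = {"the": 50, "translation": 20, "lexicon": 30, "other": 900}
ref = {"the": 50000, "translation": 10, "lexicon": 5, "other": 949985}
doc_keywords = keywords(doc, ref)
vocab = vocabulary({"a": 10, "b": 2, "c": 5}, {"b"})
```

In this toy data "the" has the same relative frequency in both counts, so its score is zero and it is dropped, while the two topical words pass the threshold easily.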

Gold standard

"The source language words in the gold standard were supposed to be systematically derived from a large corpus, covering a wide range of frequencies, parts of speech, and variances of their distribution. In addition, the corpus from which the gold standard was derived was supposed to be completely separate from the development set (Wikipedia)."
"list of words extracted from the British National Corpus (BNC) by Adam Kilgarriff for the purpose of examining distributional variability." http://kilgarriff.co.uk/bnc-readme.html

A Linguistically Grounded Graph Model for Bilingual Lexicon Extraction. Florian Laws, Lukas Michelbacher, Beate Dorow, Christian Scheible, Ulrich Heid, Hinrich Schütze. COLING 2010

TOREAD

Wednesday, April 24, 2013