Thursday, April 25, 2013

Statistical Extraction and Comparison of Pivot Words for Bilingual Lexicon Extension. Daniel Andrade, Takuya Matsuzaki, Jun'ichi Tsujii. TALIP 2012
  • Ideas
    • Use only statistically significant context, determined using a Bayesian estimate of PMI
    • "calculate a similarity score ... using the probability that the same pivots [(words from the seed lexicon)] will be extracted for both the query word and the translation candidate."
    • "several context [features] ... a bag-of-words of one sentence, and the successors, predecessors, and siblings with respect to the dependency parse tree of the sentence."
    • "In order to make these context positions comparable across Japanese and English ... we use several heuristics to adjust the dependency trees appropriately."
  • Comments
    • "the degree of association is defined as a measurement for finding words that co-occur, or which do not co-occur, more often than we would expect by pure chance [e.g.] Log-Likelihood-Ratio ... As an alternative, we suggest to use the statistical significance of a positive association"
    • Heuristics to make dependency trees comparable are language-pair-specific (for EN-JP only)
  • Related work
    • Standard approach
      • Fix corpora in two languages, and pivot words (seed lexicon)
      • For each query word, construct a vector over the pivot words, and compare it with the vectors of the translation candidates.
        • Construct vector: some measure of association between query word and pivot word
        • Compare: some similarity measure suitable for association measure
      • "Context" is a bag-of-words, usually a sentence (or doc?).
    • Variations of standard approach
      • Peters and Picchi 1997: PMI
      • Fung 1998: tf.idf and cosine similarity
      • Rapp 1999: log-likelihood ratio and Manhattan distance
      • Koehn and Knight 2002: Spearman correlation
      • Pekar et al 2006: conditional probability
      • Laroche and Langlais 2010: log-odds-ratio and cosine similarity
    • Variations incorporating syntax
      • Rapp 1999: use word order (assumes word ordering is similar for both languages)
      • Pekar et al 2006: use verb-noun dependency
      • Otero & Campos 2008: POS tag the corpus and use lexico-syntactic patterns as features; e.g. extract (see, SUBJ, man) from "A man sees a dog." and use (see, SUBJ, *) to find translations for "man".
      • Garera et al 2009: use predecessors and successors in dependency graph (and do not use bag-of-words at all)
    • Variations incorporating non-pivot words (to overcome the "seed lexicon bottleneck")
      • Gaussier et al 2004: construct a vector over all words (not just pivot words) for the query word and for each pivot word. Then construct the pivot-word vector for the query, but instead of an association measure between query and pivot, use the similarity between the all-words query vector and the all-words pivot vector.
      • Dejean et al 2002: use domain-specific multilingual thesaurus
    • Variations incorporating senses
      • Ismail and Manandhar 2010: construct query vector "given" another word (the sense-disambiguator (SD) word, say). For a query word, one can construct different vectors given different SD words. For each vector, find translation.
    • Probabilistic approach
      • Haghighi et al 2008: use a generative model where source and target words are generated from a common latent subspace. Maximize likelihood in the graphical model to learn the source-target matchings.
        • "suffers from high computational costs ... They did not compare [with] ... standard context vector approaches, which makes it difficult to estimate the possible gains from their method."
    • Graph-based approach
      • Laws et al 2010
        • one graph per language, words as nodes, 3 types of nodes (adjectives, verbs, nouns) and 3 types of edges (adjectival modification, verb-object relation, noun coordination), edge weights represent strength of correlation
        • seed lexicon for connecting the two graphs
        • node pair similarity computed using SimRank-like algorithm
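
To make the "standard approach" under Related work concrete, here is a minimal sketch of it in Python. It is not the method of Andrade et al.; it simply pairs one association measure from the list above (PMI, clipped at zero) with one similarity measure (cosine). The toy German/English sentences, the seed lexicon, and the function names `pmi_vector`, `cosine`, and `rank_translations` are all made up for illustration, and the context is assumed to be a sentence-level bag of words.

```python
import math


def pmi_vector(word, sentences, pivots):
    """Positive PMI between `word` and each pivot, from sentence co-occurrence."""
    n = len(sentences)
    word_count = sum(1 for s in sentences if word in s)
    vec = {}
    for p in pivots:
        pivot_count = sum(1 for s in sentences if p in s)
        joint = sum(1 for s in sentences if word in s and p in s)
        if joint == 0 or word_count == 0 or pivot_count == 0:
            vec[p] = 0.0
        else:
            pmi = math.log((joint * n) / (word_count * pivot_count))
            vec[p] = max(pmi, 0.0)  # keep positive associations only
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_translations(query, candidates, src_sents, tgt_sents, seed):
    """Rank target-language candidates for a source-language query word.

    `seed` maps source pivot words to target pivot words (the seed lexicon).
    """
    src_vec = pmi_vector(query, src_sents, seed.keys())
    # Translate the query vector into the target pivot space.
    query_vec = {seed[p]: w for p, w in src_vec.items()}
    scores = {
        c: cosine(query_vec, pmi_vector(c, tgt_sents, seed.values()))
        for c in candidates
    }
    return sorted(candidates, key=lambda c: scores[c], reverse=True)


# Toy, non-parallel corpora (hypothetical data): each sentence is a set of tokens.
src_sents = [
    {"hund", "bellt", "laut"},
    {"hund", "bellt"},
    {"katze", "miaut"},
    {"katze", "miaut", "laut"},
    {"auto", "faehrt"},
]
tgt_sents = [
    {"dog", "barks"},
    {"dog", "barks", "loudly"},
    {"cat", "meows"},
    {"cat", "meows"},
    {"car", "drives"},
]
seed = {"bellt": "barks", "miaut": "meows", "faehrt": "drives"}

ranked = rank_translations("hund", ["dog", "cat", "car"], src_sents, tgt_sents, seed)
print(ranked[0])  # "dog" ranks first on this toy data
```

Swapping `pmi_vector` for another association measure, or `cosine` for another similarity, gives the other variations listed above; the syntax-based variations would instead change what counts as a co-occurrence.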
