Sunday, January 20, 2013

Bilingual lexicon extraction from comparable corpora using in-domain terms. Azniah Ismail, Suresh Manandhar. COLING 2010.

  • Problem: Bilingual lexicon induction from comparable corpora without using orthographic features or large seed dictionaries.
  • Key ideas:
    • Context vectors have many noise words; choosing the right words (and omitting the others) should improve accuracy.
    • How to choose: Given a source word s, find a highly associated source word a (one with a high log-likelihood ratio, LLR). Find the in-domain terms by intersecting the context words of s and a (these are also high-LLR words, but drawn from a larger list). Assume a translation of a exists in the target language. Get the in-domain terms for each candidate target word t. Then get the translation of s, given a, by comparing the in-domain term vectors. [Note: This requires translations of the in-domain terms in the target language.]
    • A rank-binning similarity measure is used to overcome the need for a dictionary (not sure how this works).
  • Experiments: Interesting performance comparison across different seed dictionaries.
  • Interesting papers:
    • Wilson Yiksen Wong. Learning lightweight ontologies from text across different domains using the web as background knowledge. Ph.D. thesis, 2009.
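
To make the "How to choose" step concrete for myself, here is a rough sketch of the source-side part only: score word associations with LLR, pick the most associated word a for a given s, and intersect their high-LLR context lists to get in-domain terms. This is my own reading of the idea, not the paper's implementation. The sentence-level co-occurrence window, the function names (llr, association_scores, top_context, in_domain_terms), and the n_context cutoff are all assumptions made for illustration. The target-side comparison and the rank-binning measure are not covered.

```python
import math
from collections import Counter
from itertools import combinations


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 co-occurrence table."""
    def h(*counts):
        total = sum(counts)
        return sum(c * math.log(c / total) for c in counts if c > 0)

    cells = h(k11, k12, k21, k22)
    rows = h(k11 + k12, k21 + k22)
    cols = h(k11 + k21, k12 + k22)
    return 2 * (cells - rows - cols)


def association_scores(sentences):
    """LLR for every word pair that co-occurs in at least one sentence.

    Assumption: co-occurrence is counted once per sentence.
    """
    n = len(sentences)
    word_df, pair_df = Counter(), Counter()
    for sent in sentences:
        words = sorted(set(sent))
        word_df.update(words)
        pair_df.update(combinations(words, 2))

    scores = {}
    for (u, v), k11 in pair_df.items():
        k12 = word_df[u] - k11          # sentences with u but not v
        k21 = word_df[v] - k11          # sentences with v but not u
        k22 = n - k11 - k12 - k21       # sentences with neither
        score = llr(k11, k12, k21, k22)
        scores.setdefault(u, {})[v] = score
        scores.setdefault(v, {})[u] = score
    return scores


def top_context(scores, word, n):
    """The n context words of `word` with the highest LLR."""
    ranked = sorted(scores.get(word, {}).items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:n]]


def in_domain_terms(scores, s, n_context=20):
    """Pick the most associated word a for s, then intersect context lists."""
    best = top_context(scores, s, 1)
    if not best:
        return None, set()
    a = best[0]
    terms = set(top_context(scores, s, n_context)) & set(top_context(scores, a, n_context))
    return a, terms


if __name__ == "__main__":
    toy = [
        ["bank", "money", "loan", "interest"],
        ["bank", "money", "loan", "credit"],
        ["bank", "money", "interest", "credit"],
        ["river", "water", "fish"],
        ["river", "water", "boat"],
        ["water", "fish", "boat"],
    ]
    print(in_domain_terms(association_scores(toy), "bank"))
```

One caveat with this sketch: LLR measures strength of association in either direction, so a real implementation would probably also check that the pair co-occurs more often than chance before treating a high score as a positive association.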
