Bilingual lexicon extraction from comparable corpora using in-domain terms. Azniah Ismail, Suresh Manandhar. COLING 2010.
- Problem: Bilingual lexicon induction from comparable corpora without using orthographic features or large seed dictionaries.
- Key ideas:
- Context vectors have many noise words; choosing the right words (and omitting the others) should improve accuracy.
- How to choose: Given a source word s, find a highly associated source word a (one with a high log-likelihood ratio, LLR, with s). Find the in-domain terms by intersecting the context words of s and a (these are also high-LLR words, but taken from a larger list). Assume a translation of a exists in the target language. Get the in-domain terms for each target word t. Get the translation of s given a by comparing the in-domain term vectors. [Note: This requires translations of the in-domain terms in the target language.]
- A rank-binning similarity measure is used to overcome the need for a dictionary (not sure how this works).
- Experiments: Interesting performance comparison across different seed dictionaries.
- Interesting papers:
- Wilson Yiksen Wong. Learning lightweight ontologies from text across different domains using the web as background knowledge. Ph.D. Thesis. 2009
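To make the "how to choose" step concrete for myself, here is a minimal sketch of the in-domain term selection: score co-occurring words by LLR, keep the top of each ranked context list, and intersect the lists for s and a. Everything here is my own assumption rather than the paper's setup: sentence-level co-occurrence, the positive-association filter, the `top_n` cutoff, and the function names `ranked_context` and `in_domain_terms` are all hypothetical. It does not cover the translation step or the rank-binning measure.

```python
import math
from collections import Counter
from itertools import combinations


def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 contingency table."""
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    g2 = 0.0
    for observed, r, c in cells:
        if observed > 0:  # a zero cell contributes nothing to the sum
            expected = rows[r] * cols[c] / total
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2


def cooccurrence_counts(sentences):
    """Count, per word and per word pair, the sentences they appear in."""
    word_df, pair_df = Counter(), Counter()
    for sentence in sentences:
        types = sorted(set(sentence))
        word_df.update(types)
        pair_df.update(combinations(types, 2))  # pairs are in sorted order
    return word_df, pair_df, len(sentences)


def ranked_context(word, word_df, pair_df, n_sentences):
    """Context words of `word`, ranked by decreasing LLR."""
    scores = {}
    for other in word_df:
        if other == word:
            continue
        k11 = pair_df[tuple(sorted((word, other)))]
        # Assumption: keep only positively associated words
        # (observed co-occurrence above what independence predicts).
        if k11 * n_sentences <= word_df[word] * word_df[other]:
            continue
        k12 = word_df[word] - k11
        k21 = word_df[other] - k11
        k22 = n_sentences - k11 - k12 - k21
        scores[other] = llr(k11, k12, k21, k22)
    return sorted(scores, key=lambda w: (-scores[w], w))


def in_domain_terms(s, a, word_df, pair_df, n_sentences, top_n=50):
    """Intersect the top-ranked context words of s and its associate a."""
    ctx_s = set(ranked_context(s, word_df, pair_df, n_sentences)[:top_n])
    ctx_a = set(ranked_context(a, word_df, pair_df, n_sentences)[:top_n])
    return (ctx_s & ctx_a) - {s, a}


# Toy corpus (made up for illustration): "bank" is ambiguous, and pairing it
# with the associate "loan" should keep only the finance-related context.
toy_corpus = [
    ["bank", "loan", "interest", "credit"],
    ["bank", "loan", "interest", "rate"],
    ["bank", "river", "water"],
    ["loan", "interest", "credit"],
    ["river", "water", "fish"],
    ["fish", "water", "boat"],
]
word_df, pair_df, n_sentences = cooccurrence_counts(toy_corpus)
terms = in_domain_terms("bank", "loan", word_df, pair_df, n_sentences)
print(sorted(terms))
```

The intersection is what filters the noise: a word has to be strongly associated with both s and a to survive, so context words from other senses or domains of s drop out.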