Research Notes

Sunday, January 20, 2013

Problem: Bilingual lexicon induction from comparable corpora without using orthographic features or large seed dictionaries.
Key ideas:

Context vectors contain many noise words; keeping the right words (and omitting the others) should improve accuracy.
How to choose: Given a source word s, find a highly associated source word a (one with a high LLR score). Find the in-domain terms by intersecting the context words of s and a (these are also high-LLR words, but taken from a larger list). Assume a translation of a exists in the target language. Get the in-domain terms for each target word t. Get the translation of s, given a, by comparing the in-domain term vectors. [Note: This requires translations of the in-domain terms in the target language.]
A rank-binning similarity measure is used to overcome the need for a dictionary (not sure how this works).
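
A rough sketch of the "how to choose" step on the source side, to make the idea concrete. This is my own reading of the notes, not the paper's code. Assumptions: LLR is Dunning's log-likelihood ratio over sentence-level co-occurrence counts, the "larger list" is simply the top n_context words by LLR, and all function and parameter names (count_cooccurrences, ranked_associates, in_domain_terms, n_context) are made up for illustration. The target side and the cross-lingual comparison are not covered.

```python
import math
from collections import Counter
from itertools import combinations


def _xlogx(x):
    # x * log(x), with the usual convention 0 * log(0) = 0.
    return x * math.log(x) if x > 0 else 0.0


def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 contingency table."""
    n = k11 + k12 + k21 + k22
    cells = _xlogx(k11) + _xlogx(k12) + _xlogx(k21) + _xlogx(k22)
    rows = _xlogx(k11 + k12) + _xlogx(k21 + k22)
    cols = _xlogx(k11 + k21) + _xlogx(k12 + k22)
    # Clamp at zero to absorb tiny negative values from rounding.
    return max(0.0, 2.0 * (cells + _xlogx(n) - rows - cols))


def count_cooccurrences(sentences):
    """Count, per word and per word pair, the sentences they occur in."""
    freq, cooc = Counter(), Counter()
    for sentence in sentences:
        words = set(sentence)
        freq.update(words)
        for u, v in combinations(sorted(words), 2):
            cooc[(u, v)] += 1
    return freq, cooc, len(sentences)


def ranked_associates(word, freq, cooc, total):
    """Words that co-occur with `word`, sorted by decreasing LLR."""
    scores = {}
    for other in freq:
        if other == word:
            continue
        k11 = cooc[tuple(sorted((word, other)))]
        if k11 == 0:
            continue  # never seen together; skip
        k12 = freq[word] - k11    # sentences with word but not other
        k21 = freq[other] - k11   # sentences with other but not word
        k22 = total - k11 - k12 - k21
        scores[other] = llr(k11, k12, k21, k22)
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(scores, key=lambda w: (-scores[w], w))


def in_domain_terms(s, freq, cooc, total, n_context=50):
    """Pick the top associate a of s, then intersect their context words."""
    ranked_s = ranked_associates(s, freq, cooc, total)
    if not ranked_s:
        return None, set()
    a = ranked_s[0]
    context_s = set(ranked_s[:n_context])
    context_a = set(ranked_associates(a, freq, cooc, total)[:n_context])
    return a, (context_s & context_a) - {s, a}
```

Usage would be something like freq, cooc, total = count_cooccurrences(tokenized_sentences), followed by in_domain_terms("bank", freq, cooc, total). Note that LLR is also high for strong negative association, so a real implementation would probably need to filter on the direction of the association as well.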

Experiments: Interesting performance comparison across different seed dictionaries.
Interesting papers:

Wilson Yiksen Wong. Learning lightweight ontologies from text across different domains using the web as background knowledge. Ph.D. Thesis. 2009
