Cross-Lingual Latent Topic Extraction. Duo Zhang, Qiaozhu Mei, ChengXiang Zhai. ACL 2010
- Key ideas
- Input: unaligned document sets in two languages, a bilingual dictionary
- Output:
- a set of aligned topics (word distributions) in the two languages that characterize the shared topics
- a topic coverage distribution for each language (coverage of each topic in that language)
- Method:
- Start with the maximum-likelihood (ML) objective of PLSA
- Add a term to incorporate dictionary constraints (DC)
- Dictionary modeled as a weighted bipartite graph (weight = translation probability)
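- Maximum-likelihood estimation via Generalized EM (the objective with the DC term has no closed-form maximizer in the M-step)
- The M-step therefore does not fully maximize the objective; each iteration only finds parameters that improve on the current ones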
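
The method bullets above can be illustrated with a small sketch. This is not the paper's implementation. It assumes a merged vocabulary over both languages, a squared-difference penalty on dictionary-linked words weighted by translation probability, and a simple graph-smoothing step as the way to look for a better point. The names `gem_fit`, `dictionary_penalty`, `lam`, `alpha`, and `n_smooth` are all hypothetical. Each iteration keeps a candidate only if it improves the regularized M-step objective over the current parameters, which is the Generalized EM idea.

```python
import numpy as np

def safe_log(x):
    # Guard against log(0); the clamp only matters for values below 1e-300.
    return np.log(np.maximum(x, 1e-300))

def dictionary_penalty(phi, edges):
    # Assumed DC term: 0.5 * sum over edges and topics of w * (phi[k,u] - phi[k,v])^2.
    return 0.5 * sum(w * np.sum((phi[:, u] - phi[:, v]) ** 2) for u, v, w in edges)

def objective(counts, theta, phi, edges, lam):
    # PLSA log-likelihood traded off against the dictionary penalty.
    loglik = np.sum(counts * safe_log(theta @ phi))
    return (1 - lam) * loglik - lam * dictionary_penalty(phi, edges)

def gem_fit(counts, edges, n_topics, lam=0.5, alpha=0.2, n_iter=20, n_smooth=5, seed=0):
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    theta = rng.dirichlet(np.ones(n_topics), size=n_docs)  # p(topic | doc)
    phi = rng.dirichlet(np.ones(n_words), size=n_topics)   # p(word | topic)

    # Column-stochastic matrix T[u, v] = w(u, v) / deg(v); words with no
    # dictionary entry get a self-loop so smoothing preserves their mass.
    T = np.zeros((n_words, n_words))
    for u, v, w in edges:
        T[u, v] += w
        T[v, u] += w
    isolated = np.flatnonzero(T.sum(axis=0) == 0)
    T[isolated, isolated] = 1.0
    T /= T.sum(axis=0, keepdims=True)

    history = [objective(counts, theta, phi, edges, lam)]
    for _ in range(n_iter):
        # E-step: posterior p(topic | doc, word), shape (docs, topics, words).
        joint = theta[:, :, None] * phi[None, :, :]
        post = joint / joint.sum(axis=1, keepdims=True)
        expected = counts[:, None, :] * post

        # theta has the usual closed-form PLSA update.
        theta = expected.sum(axis=2)
        theta /= theta.sum(axis=1, keepdims=True)
        topic_word = expected.sum(axis=0)  # expected counts, (topics, words)

        def q_phi(p):
            # Part of the regularized M-step objective that depends on phi.
            return ((1 - lam) * np.sum(topic_word * safe_log(p))
                    - lam * dictionary_penalty(p, edges))

        # Generalized M-step: start from the current phi and keep a candidate
        # only if it improves q_phi; no exact maximization is attempted.
        best, best_q = phi, q_phi(phi)
        cand = topic_word / topic_word.sum(axis=1, keepdims=True)  # plain PLSA update
        for _ in range(n_smooth + 1):
            q = q_phi(cand)
            if q > best_q:
                best, best_q = cand, q
            cand = (1 - alpha) * cand + alpha * (cand @ T.T)  # smooth along dictionary
        phi = best
        history.append(objective(counts, theta, phi, edges, lam))
    return theta, phi, history

# Toy data: words 0-2 stand in for one language, 3-5 for the other.
toy_counts = np.array([[5, 3, 0, 0, 0, 0],
                       [0, 1, 4, 0, 0, 0],
                       [0, 0, 0, 4, 3, 0],
                       [0, 0, 0, 0, 2, 5]], dtype=float)
toy_edges = [(0, 3, 1.0), (1, 4, 1.0), (2, 5, 0.5)]  # (word, word, translation prob.)
theta, phi, history = gem_fit(toy_counts, toy_edges, n_topics=2)
```

Because the candidate is accepted only when it does not lower the regularized M-step objective, the values in `history` never decrease, even though no step fully maximizes anything. The rows of `phi` are the aligned topics over the merged vocabulary, and averaging the rows of `theta` per language gives a rough topic coverage for that language.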