Recent Advances in Methods of Lexical Semantic Relatedness – a Survey. Ziqi Zhang, Anna Lisa Gentile, Fabio Ciravegna. NLE 2012
- Corpora: Wikipedia, Wiktionary, Wordnet, various biomedical corpora
- Methods:
- based on Path, Information Content, Gloss, Vector
- all methods use structure, mainly from Wordnet/Wikipedia
- Some methods that treat Wiki articles as concepts (and use no other structure)
- based on distributional similarity
- PMI, Chi-squared test
- Dice, Jaccard and Cosine (search engine based)
- hybrid
- combination: run each method separately, and then combine scores e.g. by linear combination
- integration: run each method separately, and then use scores as features in hybrid model
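
A minimal sketch of three of the co-occurrence measures listed above (PMI, Dice, Jaccard) and of the "combination" style of hybrid, assuming raw occurrence and co-occurrence counts are already available (e.g. from a corpus or from search-engine hit counts). The function names, the counts and the weights are illustrative assumptions, not taken from the survey.

```python
import math

def pmi(n_xy, n_x, n_y, n_total):
    """Pointwise mutual information (log base 2) from counts.

    n_xy: contexts containing both words; n_x, n_y: contexts containing
    each word; n_total: total number of contexts.
    """
    if n_xy == 0:
        return float("-inf")  # the words never co-occur
    return math.log2((n_xy * n_total) / (n_x * n_y))

def dice(n_xy, n_x, n_y):
    """Dice coefficient: 2|X and Y| / (|X| + |Y|)."""
    return 2 * n_xy / (n_x + n_y)

def jaccard(n_xy, n_x, n_y):
    """Jaccard coefficient: |X and Y| / |X or Y|."""
    return n_xy / (n_x + n_y - n_xy)

def linear_combination(scores, weights):
    """Hybrid by combination: weighted sum of separately computed scores."""
    return sum(w * s for w, s in zip(weights, scores))

# Hypothetical counts: each word occurs in 100 of 10,000 contexts,
# and the two words co-occur in 10 of them.
d = dice(10, 100, 100)
j = jaccard(10, 100, 100)
combined = linear_combination([d, j], weights=[0.5, 0.5])
```

Dice and Jaccard both lie in [0, 1], so they can be averaged directly; PMI is unbounded and would presumably need rescaling before being combined with them.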
- Notes
- "distributional similarity methods ... have been used as a proxy for [methods of semantic relatedness]."
- Distinguish concept and word; model relatedness between concepts, and between words separately; usually a (polysemous) word w has several associated concepts C(w), and the relatedness between words w1 and w2 would be some function of the relatedness between the concepts in C(w1) and C(w2).
- Distinguish similarity and relatedness between words/concepts; model them separately; also model distance.
- "two words are distributionally similar if (1) they tend to occur in each other’s context; or (2) the contexts each tends to occur in are similar; or (3) that if one word is substituted for another in a context, its “plausibility” is unchanged [(measured using search engines)]. Different methods have adopted different definitions of contexts ..."
- Method surveys: Weeds (2003), Turney and Pantel (2010)
- "Budanitsky and Hirst (2006) argued that there are three essential differences between [semantic relatedness and distributional similarity] ... Firstly, semantic relatedness is inherently a relation on concepts, while distributional similarity is a relation on words; secondly, semantic relatedness is typically symmetric, whereas distributional similarity can be potentially asymmetric; finally, semantic relatedness depends on a structured lexicographic or knowledge bases, distributional similarity is relative to a corpus."
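
A sketch of the concept/word distinction noted above: word relatedness is computed as some function of the relatedness between the concepts in C(w1) and C(w2). Taking the maximum over all concept pairs is one possible choice and is an assumption here, as are the toy sense inventory, the scores and all names.

```python
from itertools import product

def word_relatedness(w1, w2, concepts, concept_rel, aggregate=max):
    """Relatedness of two words from the relatedness of their concepts.

    concepts: maps a word w to its concepts C(w).
    concept_rel: function scoring a pair of concepts.
    aggregate: how to reduce the pairwise scores (max by default).
    """
    pairs = product(concepts.get(w1, ()), concepts.get(w2, ()))
    scores = [concept_rel(c1, c2) for c1, c2 in pairs]
    return aggregate(scores) if scores else 0.0

# Toy sense inventory and concept-level scores (hypothetical values).
CONCEPTS = {
    "bank": ["bank#finance", "bank#river"],
    "money": ["money#currency"],
}
TOY_SCORES = {
    frozenset(["bank#finance", "money#currency"]): 0.9,
    frozenset(["bank#river", "money#currency"]): 0.1,
}

def toy_concept_rel(c1, c2):
    # Symmetric lookup; unknown pairs score 0.0.
    return TOY_SCORES.get(frozenset([c1, c2]), 0.0)

score = word_relatedness("bank", "money", CONCEPTS, toy_concept_rel)
```

With `aggregate=max` the score is driven by the closest pair of senses; passing a mean instead would penalise words with many unrelated senses.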
- Evaluation
- In-vitro
- "In-vitro evaluation ... [i.e.] correlation with human judgement ... does not assess how well the method performs on real data ... Spearman correlation is a more robust measure ... [but] it may yield skewed results on datasets with many tied ranks."
- "we argue that vector based methods are generally superior to other[s]"
- In-vivo
- text similarity, word choice (e.g. TOEFL), WSD, sense clustering, IR: {document ranking, query expansion}, coreference resolution, ontology construction and matching, malapropism detection
- "there is no strong evidence of a positive correlation between the ... [performance] in in-vitro evaluation ... and in in-vivo evaluation"
- Data sets: Rubenstein and Goodenough, Finkelstein et al., and many others; all were originally used for similarity (not relatedness)
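
A sketch of the in-vitro evaluation described above: Spearman correlation between human judgements and method scores, computed as the Pearson correlation of ranks, with tied values given their average rank. This is one common way of handling ties and is an assumption here; the function names and the example scores are hypothetical.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if x or y is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation as Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical human judgements and method scores for five word pairs.
human = [3.9, 3.5, 3.5, 1.2, 0.4]
method = [0.81, 0.66, 0.70, 0.30, 0.05]
rho = spearman(human, method)
```

The two pairs tied at 3.5 in `human` share rank 3.5, which illustrates why datasets with many tied ranks can make the resulting coefficient harder to interpret.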
- Tools
- Wikipedia: Parse::MediaWikiDump, Ponzetto and Strube (2007)
- DEXTRACT: creating evaluation datasets
- WordNet::Similarity