#### A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches. Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca, Aitor Soroa. NAACL-HLT 2009

- Claims
- a supervised combination of [our methods] yields the best published results on all datasets
- we pioneer cross-lingual similarity
- A discussion on the differences between learning similarity and relatedness scores
- Cross-lingual similarity
- WordNet-based: since the WordNet used is multilingual and aligned across languages, the monolingual methods are directly applicable
- Distributional: translate target into source language (using machine translation) and then use monolingual method
- Results
- [among distributional methods], the method based on context windows provides the best results for similarity, and the bag-of-words representation [does best] for relatedness.
- upper-bounding combined performance: "we took the output[s] of three systems ... we implemented an oracle that chooses [among the outputs] ... the rank that is most similar to the rank of the pair in the gold-standard. ... gives as an indication of the correlations that could be achieved by choosing for each pair the rank output by the best classifier for that pair."
- On evaluation
- "Pearson correlation suffers much when the scores of two systems are not linearly correlated, [e.g.] due to the different nature of the techniques applied ... Spearman correlation provides an evaluation metric that is independent of such data-dependent transformations"
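
The point about Pearson versus Spearman can be illustrated with a small sketch. The scores below are made-up toy values, not data from the paper, and the helper names (`pearson`, `ranks`, `spearman`) are hypothetical. The sketch applies a monotone but non-linear transformation to a system's scores and compares how the two correlations react.

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Toy values (assumed for illustration only).
gold = [0.5, 1.0, 2.0, 3.0, 3.5, 4.0]
system = [0.1, 0.3, 0.2, 0.6, 0.7, 0.9]
# Same ordering as `system`, but on a very different (exponential) scale.
warped = [math.exp(10 * s) for s in system]

print("Pearson  original:", pearson(gold, system))
print("Pearson  warped:  ", pearson(gold, warped))
print("Spearman original:", spearman(gold, system))
print("Spearman warped:  ", spearman(gold, warped))
```

Because the exponential map preserves the ordering of the system scores, the ranks are unchanged and so is the Spearman value, while the Pearson value drops.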

#### Lexical Co-occurrence, Statistical Significance, and Word Association. Dipak L. Chaudhari, Om P. Damani, Srivatsan Laxman. EMNLP 2011

- Claims
- We propose a new measure of word association based on a new notion of statistical significance for lexical co-occurrences.
- We ... construct a significance test that allows us to detect different kinds of co-occurrences within a single unified framework
- Key ideas
- Existing co-occurrence measures ... assume that each document is drawn from a multinomial distribution based on global unigram frequencies ... [The problem with this] is
- the overbearing influence of the unigram frequencies on the detection of word associations. For example, the association between anomochilidae (dwarf pipe snakes) and snake could go undetected ... since less than 0.1% of the pages containing snake also contained anomochilidae.
- the expected span of a word pair is very sensitive to the associated unigram frequencies: the expected span of a word pair composed of low frequency unigrams is much larger than that with high frequency unigrams. This is contrary to how word associations appear in language, where semantic relationships manifest with small inter-word distances irrespective of the underlying unigram distributions.
- To solve the above, "we employ a null model that represents each document as a bag of words"
- A random permutation of the associated bag of words gives a linear representation for the document.
- If the observed span distribution of a word-pair resembles that under the (random permutation) null model, then the relation between the words is not strong enough for one word to influence the placement of the other.
- Experiments
- New data sets (from the "free association" problem)
- Edinburgh (Kiss et al., 1973), Florida (Nelson et al., 1980), Goldfarb-Halpern (Goldfarb and Halpern, 1984), Kent (Kent and Rosanoff, 1910), Minnesota (Russell and Jenkins, 1954), White-Abrams (White and Abrams, 2004)
- Comments
- The basic approach in this kind of modeling: "We need a null hypothesis that can account for an observed co-occurrence as a pure chance event and this in-turn requires a corpus generation model. Documents in a corpus can be assumed to be generated independent of each other."
- Comprehensive list of co-occurrence measures
- CSR, CWCD (Washtell and Markert, 2009), Dice (Dice, 1945), LLR (Dunning, 1993), Jaccard (Jaccard, 1912), Ochiai (Janson and Vegelius, 1981), Pearson’s χ² test, PMI (Church and Hanks, 1989), SCI (Washtell and Markert, 2009), T-test
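
The random-permutation null model from the key ideas above can be sketched with a simple Monte Carlo simulation. This is not the paper's actual significance test or its CSR measure; it is a minimal illustration, assuming a single document, two distinct words, and a hypothetical statistic (the smallest distance between the two words). The function names and the toy document are assumptions made for the example.

```python
import random

def min_span(tokens, a, b):
    """Smallest distance between an occurrence of a and one of b.

    Returns None if either word is absent. Assumes a != b.
    """
    best = None
    last_a = last_b = None
    for i, tok in enumerate(tokens):
        if tok == a:
            last_a = i
            if last_b is not None and (best is None or i - last_b < best):
                best = i - last_b
        elif tok == b:
            last_b = i
            if last_a is not None and (best is None or i - last_a < best):
                best = i - last_a
    return best

def null_span_pvalue(tokens, a, b, trials=2000, seed=0):
    """Fraction of random permutations of the document's bag of words
    whose span for (a, b) is at most the observed span."""
    observed = min_span(tokens, a, b)
    if observed is None:
        return None
    rng = random.Random(seed)
    bag = list(tokens)  # the document treated as a bag of words
    hits = 0
    for _ in range(trials):
        rng.shuffle(bag)  # one random linear arrangement under the null
        if min_span(bag, a, b) <= observed:
            hits += 1
    return hits / trials

# Toy document (assumed): 38 filler tokens with the pair placed side by side.
filler = [f"w{i}" for i in range(38)]
doc = filler[:19] + ["snake", "venom"] + filler[19:]
p = null_span_pvalue(doc, "snake", "venom")
print("observed span:", min_span(doc, "snake", "venom"), "estimated p:", p)
```

A small estimated fraction means the observed span is rarely matched by chance arrangements of the same bag of words, which is the intuition behind rejecting the null model for that pair. Note that the unigram frequencies of the two words never enter the computation beyond their counts inside the document.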

#### Harnessing different knowledge sources to measure semantic relatedness under a uniform model. Ziqi Zhang, Anna Lisa Gentile, Fabio Ciravegna. EMNLP 2011

- Claims
- introduces a method of harnessing different knowledge sources under a uniform model for measuring semantic relatedness between words or concepts.
- we identify two issues that have not been addressed in the previous works. First, existing works typically employ a single knowledge source of semantic evidence ... Second, ... evaluated in general domains only ... evaluation ... in specific domains is ... important.
- Key ideas
- knowledge from different sources is mapped into a graph representation in 3 stages, and a general graph-based (random walk) algorithm is used for the final relatedness computation
- Random walk: "formalizes the idea that taking successive steps along the paths in a graph, the “easier” it is to arrive at a target node starting from a source node, the more related the two nodes are ... P^(t)(j|i) [is] the probability of reaching other nodes from a starting node on the graph after t steps ... [following] Rowe and Ciravegna (2010) ... set t=2 in order to preserve locally connected nodes ... Effectively, this formalizes the notion that two concepts related to a third concept is also semantically related [similar to] Patwardhan and Pedersen (2006)".
- The stages involve “feature integration” as merging feature types from different knowledge sources into single types of features based on their similarity in semantics.
- "the difference between cross-source feature *combination* and *integration* is that the former introduces more types of features, whereas the latter retains same number of feature types but increases feature values for each type. Both have the effect of establishing additional path (via features) between concepts, but in different ways."
- Comments
- "Zhang et al. (2010) argue that ... different knowledge sources may complement each other."
- "evaluation of [semantic relatedness] methods in specific domains is increasingly important" (They also evaluate on (biomedical) domain-specific data sets)
- "Wikipedia ... [has] reasonable coverage of many domains (Holloway et al., 2007; Halavais, 2008)."
- Classifies SR approaches
- path based: use a WordNet-like semantic network
- Information Content (IC) based: use taxonomy (a special case of network) and a corpus
- statistical
- distributional
- co-occurrence-based
- hybrid: combine the above, e.g. Riensche et al. (2007), Pozo et al. (2008), Han and Zhao (2010). Note: the idea of combining methods is distinguished from the idea of combining knowledge sources.
- Evaluation
- Data sets: general (Rubenstein and Goodenough, Miller and Charles, Finkelstein et al.) and biomedical (Petrakis et al. (2006), Pedersen et al. (2006))
- Measure: Spearman correlation ("better metric ... (Zesch and Gurevych, 2010)")
- "some datasets have a ... low sample size, ... correlation values [could have] occurred by chance. Therefore, we measure the statistical significance of correlation by computing the p-value for the correlation values"
- Interesting papers
- Random walk for semantic relatedness
- Zhang, Z., Gentile, A., Xia, L., Iria, J., Chapman, S. A random graph walk based approach to compute semantic relatedness using knowledge from Wikipedia. LREC 2010. (compare this with Manning's paper)
- Rowe, M., Ciravegna, F. Disambiguating identity web references using Web 2.0 data and semantics. The Journal of Web Semantics 2010
- Hybrid methods
- Tsang, V., Stevenson, S. A graph-theoretic framework for semantic distance. CL 2010
- Han, X., Zhao, J. Structural semantic relatedness: a knowledge-based method to named entity disambiguation. ACL 2010
- Patwardhan, S., Pedersen, T. 2006. Using WordNet-based context vectors to estimate the semantic relatedness of concepts. EACL 2006 ("second-order context")
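
The t-step random walk described under the key ideas above can be sketched as follows. The graph, node names, and edge weights are invented for illustration, and the paper's actual graph construction and final scoring may differ. The sketch only shows how P^(t)(j|i) arises from a row-normalized weight matrix, and why t=2 connects two concepts that share a feature node.

```python
def row_normalize(weights):
    """Turn a non-negative weight matrix into one-step transition probabilities."""
    rows = []
    for row in weights:
        total = sum(row)
        rows.append([w / total if total else 0.0 for w in row])
    return rows

def matmul(a, b):
    """Plain-Python product of two matrices given as lists of rows."""
    m, p = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(m)) for j in range(p)] for row in a]

def t_step_probabilities(weights, t=2):
    """Entry [i][j] is the probability of being at node j after t steps from node i."""
    p = row_normalize(weights)
    result = p
    for _ in range(t - 1):
        result = matmul(result, p)
    return result

# Hypothetical toy graph: three concepts linked only through feature nodes.
nodes = ["cat", "dog", "car", "f:animal", "f:pet", "f:vehicle"]
edges = [("cat", "f:animal"), ("cat", "f:pet"),
         ("dog", "f:animal"), ("dog", "f:pet"),
         ("car", "f:vehicle")]

index = {name: i for i, name in enumerate(nodes)}
weights = [[0.0] * len(nodes) for _ in nodes]
for u, v in edges:  # undirected edges with unit weight (an assumption)
    weights[index[u]][index[v]] = 1.0
    weights[index[v]][index[u]] = 1.0

p2 = t_step_probabilities(weights, t=2)
print("P2(dog|cat) =", p2[index["cat"]][index["dog"]])
print("P2(car|cat) =", p2[index["cat"]][index["car"]])
```

In this toy graph "cat" and "dog" share two feature nodes, so a two-step walk from one can reach the other, while "car" shares none and gets probability zero. Raising a feature's edge weight (as feature integration would) or adding new feature nodes (as feature combination would) both change these two-step probabilities, but through different paths.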
