Thursday, March 14, 2013

A Wikipedia-Based Multilingual Retrieval Model. Martin Potthast, Benno Stein, and Maik Anderka. ECIR 2008

  • Key idea
    • Use aligned Wiki articles (concepts) in two languages to map words/documents in different languages into a common concept space.
  • Comments
    • "A reasonable trade-off between retrieval quality and runtime is achieved for a concept space dimensionality between 1000 and 10000."
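The mapping idea can be sketched in a few lines. This is a toy illustration of cross-lingual ESA under my own simplifications (tiny hand-made "articles", plain term overlap instead of the paper's TF-IDF weighting); the concept ids and vocabularies are invented:

```python
from collections import Counter
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors represented as dicts."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def concept_vector(doc_tokens, concept_articles):
    """Map a document into the concept space: one dimension per aligned
    concept, weighted by term overlap with that concept's article
    (a crude stand-in for the paper's TF-IDF weighting)."""
    doc = Counter(doc_tokens)
    return {cid: cosine(doc, Counter(article))
            for cid, article in concept_articles.items()}

# Toy aligned "Wikipedia" articles: same concept ids, two languages.
concepts_en = {"music": "melody rhythm song".split(),
               "sport": "ball goal team".split()}
concepts_de = {"music": "melodie rhythmus lied".split(),
               "sport": "ball tor mannschaft".split()}

doc_en = "a song with a catchy melody".split()
doc_de = "ein lied mit melodie und rhythmus".split()

v_en = concept_vector(doc_en, concepts_en)
v_de = concept_vector(doc_de, concepts_de)

# Cross-lingual similarity is computed in the shared concept space.
print(round(cosine(v_en, v_de), 3))
```

Both documents activate only the "music" concept, so their concept-space cosine is (near) 1 even though they share no surface words.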

Wednesday, March 13, 2013

Wikipedia-based Semantic Interpretation for Natural Language Processing. Evgeniy Gabrilovich, Shaul Markovitch. JAIR 2009

(Noting here the details not mentioned in the entry for the IJCAI paper...)
  • Corpus preprocessing
    • Discard articles that have fewer than 100 non-stop words or fewer than 5 incoming and outgoing links; discard articles that describe specific dates, as well as Wikipedia disambiguation pages, category pages and "the like"; remove stop words and rare words (occurring in fewer than 3 articles), and stem the remaining words.
    • For text categorization: Document text is first tokenized, and title words are replicated twice to emphasize their importance. Then, stop words, numbers and mixed alphanumeric strings are removed, and the remaining words are stemmed. The bag of words is next merged with the set of features generated for the document ... and rare features occurring in fewer than 3 documents are removed. ... The generated features ... undergo feature selection using the information gain criterion.
  • Concept vector preprocessing
    • "The algorithm for pruning the inverted index operates as follows. We first sort all the concepts for a given word according to their TFIDF weights in decreasing order. We then scan the resulting sequence of concepts with a sliding window of length 100, and truncate the sequence when the difference in scores between the first and last concepts in the window drops below 5% of the highest-scoring concept for this word (which is positioned first in the sequence). This technique looks for fast drops in the concept scores, which would signify that the concepts in the tail of the sequence are only loosely associated with the word"
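The quoted pruning procedure is concrete enough to sketch directly. One ambiguity: the prose does not say exactly where inside the window to cut; this sketch keeps everything up to the window's second-to-last item. A small window is used in the demo instead of the paper's 100:

```python
def prune_concepts(weighted_concepts, window=100, rel_threshold=0.05):
    """Prune a word's inverted-index entry as described in the quote:
    sort concepts by TF-IDF weight (descending), slide a window over the
    sequence, and truncate where the score drop across the window falls
    below rel_threshold of the word's top score."""
    seq = sorted(weighted_concepts, key=lambda cw: cw[1], reverse=True)
    if not seq:
        return seq
    top = seq[0][1]
    for i in range(len(seq) - window + 1):
        first, last = seq[i][1], seq[i + window - 1][1]
        if first - last < rel_threshold * top:
            # Flat tail detected: keep the sequence up to this window.
            return seq[:i + window - 1]
    return seq

# Fast drop, then a flat, loosely-associated tail.
entry = list(enumerate([1.0, 0.9, 0.5, 0.1, 0.09, 0.089, 0.088]))
pruned = prune_concepts(entry, window=3)
print(len(pruned))
```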
  • Ideas
    • Use link structure---called "second order model"
    • Keep only general concepts
  • Comments
    • "using a larger knowledge base is beneficial for ESA"
    • "We evaluated the effect of using second-order interpretation for computing semantic relatedness of texts, but it only yielded negligible improvements. We hypothesize that the reason for this finding is that computing semantic relatedness essentially uses all available Wikipedia concepts, so second-order interpretation can only slightly modify the weights of existing concepts ... [in] the application of ESA to text categorization, we trim the interpretation vectors for the sake of efficiency, and only consider a few highest-scoring concepts ... In this scenario, second-order interpretation does have a positive effect and actually improves the accuracy of text categorization ... This happens because only a few selected Wikipedia concepts are used ... and the second-order approach selectively adds highly related concepts identified by analyzing Wikipedia links."
  • Evaluation
    • Data sets for text categorization: 
      • Reuters-21578 with the ModApte split
      • 20 Newsgroups (20NG): noisy.
      • Movie Reviews (Movies) (Pang, Lee, & Vaithyanathan, 2002)---sentiment (rather than topical) classification
      • Reuters Corpus Volume I (RCV1) (Lewis, Yang, Rose, & Li, 2004)
      • OHSUMED, a subset of MEDLINE. 
    • Reporting results
      • "precision-recall break-even point (BEP) to measure text categorization performance"
      • "we report both micro- and macro-averaged BEP, since their categories differ in size significantly. Micro-averaged BEP operates at the document level and is primarily affected by categorization performance on larger categories ... macro-averaged BEP averages results for individual categories, and thus small categories with few training examples have large impact on the overall performance."
      • "[for] a fixed train/test split [we] used macro sign test (S-test) (Yang & Liu, 1999) to assess the statistical significance of differences in classifier performance... [for] 4-fold cross-validation [we] used paired t-test ... [we] used the non-parametric Wilcoxon signed-ranks test (Demsar, 2006) to compare ... classifiers over multiple data sets."
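For reference, a generic two-sided sign test over paired per-category scores can be written with the stdlib alone. This is my own minimal version, not necessarily the exact macro S-test variant of Yang & Liu (1999):

```python
from math import comb

def sign_test(scores_a, scores_b):
    """Two-sided sign test over paired per-category scores (ties
    dropped): under H0, classifier A beats B on each category with
    probability 0.5, so the win count is Binomial(n, 0.5)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    n = len(diffs)
    wins = sum(d > 0 for d in diffs)
    k = max(wins, n - wins)
    # P(X >= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# A wins on 9 of 10 categories -> p = 2 * (C(10,9) + C(10,10)) / 2^10.
a = [0.9] * 9 + [0.1]
b = [0.8] * 9 + [0.2]
print(round(sign_test(a, b), 4))
```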
  • Interesting
    • Analysis using examples: "For example, given ... phrase "scientific article" ... ESA determines ... the following Wikipedia concepts ... among the top 20"
  • Related work
    • Distributional similarity methods (Lee, 1999) compute the similarity of a pair of words w1 and w2 by comparing the distributions of other words given these two, e.g., by comparing vectors of probabilities P(v|w1) and P(v|w2) for a large vocabulary V of words (v \in V ). 
    • LSA [manipulates] a vector of ... latent concepts ... obtained through SVD ... of a word-by-document matrix. ... CYC represents semantics of words through an elaborate network of interconnected and richly-annotated concepts ... [and] depend[s] on manual encoding of inference rules.
    • With the exception of LSA, most prior approaches to semantic interpretation explicitly represent semantics of individual words, and require an extra level of sophistication to represent longer texts.
    • Sahami and Heilman (2006) ... send two snippets as queries to a search engine, and compare the bags of words for the two sets of returned documents. ... it is only applicable to short texts, because sending a long text as a query to a search engine is likely to return few or even no results at all.
    • "The above-mentioned ... techniques are inherently limited to individual words, and their adaptation for comparing longer texts requires an extra level of complexity (Mihalcea et al., 2006)."
    • Text categorization: Zelikovitz and Hirsh (2000) [use] unlabeled examples as intermediaries in comparing testing examples with the training ones. ... when an unknown test instance does not appear to resemble any labeled training instances, unlabeled examples that are similar to both may be used as "bridges." ... [when] the training and the test document have few or no words in common ... unlabeled documents are utilized to define a cosine similarity metric, which is then used by the KNN algorithm for actual text categorization.
    • Potthast, Stein, and Anderka (2008) and Sorg and Cimiano (2008) adapted ESA for multi-lingual and cross-lingual information retrieval.
    • Zesch, Mueller, and Gurevych (2008) proposed to use Wiktionary for computing semantic relatedness.
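The distributional-similarity view described above (comparing P(v|w1) and P(v|w2) over a vocabulary V) can be sketched on a toy corpus. The corpus, window size, and use of cosine rather than a probabilistic divergence are my own choices for illustration:

```python
from collections import Counter, defaultdict
import math

def context_distributions(sentences, window=2):
    """P(v | w): for each word w, the empirical distribution of words v
    appearing within +/- window tokens of w."""
    counts = defaultdict(Counter)
    for toks in sentences:
        for i, w in enumerate(toks):
            lo, hi = max(0, i - window), min(len(toks), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[w][toks[j]] += 1
    return {w: {v: n / sum(c.values()) for v, n in c.items()}
            for w, c in counts.items()}

def cosine(p, q):
    dot = sum(p[v] * q.get(v, 0.0) for v in p)
    np_ = math.sqrt(sum(x * x for x in p.values()))
    nq = math.sqrt(sum(x * x for x in q.values()))
    return dot / (np_ * nq) if np_ and nq else 0.0

corpus = [s.split() for s in [
    "the cat chased the mouse",
    "the dog chased the mouse",
    "the cat ate fish",
    "the dog ate meat",
]]
dist = context_distributions(corpus)
# "cat" and "dog" occur in near-identical contexts, so they come out
# more distributionally similar than "cat" and "mouse".
print(cosine(dist["cat"], dist["dog"]) > cosine(dist["cat"], dist["mouse"]))
```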

Thursday, March 7, 2013

Recent Advances in Methods of Lexical Semantic Relatedness – a Survey. Ziqi Zhang, Anna Lisa Gentile, Fabio Ciravegna. NLE 2012

  • Corpora: Wikipedia, Wiktionary, WordNet, various biomedical corpora
  • Methods: 
    • based on Path, Information Content, Gloss, Vector
      • all methods use structure, mainly from WordNet/Wikipedia
      • Some methods that treat Wiki articles as concepts (and use no other structure)
    • based on distributional similarity
      • PMI, Chi-squared test
      • Dice, Jaccard and Cosine (search engine based)
    • hybrid
      • combination: run each method separately, and then combine scores e.g. by linear combination
      • integration: run each method separately, and then use scores as features in hybrid model
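The count-based measures named above (PMI, Dice, Jaccard) reduce to a few lines given occurrence and co-occurrence counts; search-engine-based variants substitute page-hit counts for corpus counts. The counts below are invented:

```python
import math

def pmi(n_xy, n_x, n_y, n_total):
    """Pointwise mutual information from co-occurrence counts:
    log2( P(x,y) / (P(x) * P(y)) )."""
    return math.log2((n_xy * n_total) / (n_x * n_y))

def dice(n_xy, n_x, n_y):
    return 2 * n_xy / (n_x + n_y)

def jaccard(n_xy, n_x, n_y):
    return n_xy / (n_x + n_y - n_xy)

# Toy counts: "car" in 1000 docs, "automobile" in 800, both in 600,
# out of 100000 documents.
print(round(pmi(600, 1000, 800, 100000), 2),
      round(dice(600, 1000, 800), 2),
      round(jaccard(600, 1000, 800), 2))
```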
  • Notes
    • "distributional similarity methods ... have been used as a proxy for [methods of semantic relatedness]."
    • Distinguish concept and word; model relatedness between concepts and between words separately; usually a (polysemous) word w has several associated concepts C(w), and the relatedness between words w1 and w2 would be some function of the relatedness between the concepts in C(w1) and C(w2).
    • Distinguish similarity and relatedness between words/concepts; model them separately; also model distance.
      • "two words are distributionally similar if (1) they tend to occur in each other’s context; or (2) the contexts each tends to occur in are similar; or (3) that if one word is substituted for another in a context, its “plausibility” is unchanged [(measured using search engines)]. Different methods have adopted different definitions of contexts ..."
      • Method surveys: Weeds (2003), Turney and Pantel (2010)
      • "Budanitsky and Hirst (2006) argued that there are three essential differences between [semantic relatedness and distributional similarity] ... Firstly, semantic relatedness is inherently a relation on concepts, while distributional similarity is a relation on words; secondly, semantic relatedness is typically symmetric, whereas distributional similarity can be potentially asymmetric; finally, semantic relatedness depends on a structured lexicographic or knowledge base[, while] distributional similarity is relative to a corpus."
  • Evaluation
    • In-vitro
      • "In-vitro evaluation ... [i.e.] correlation with human judgement ... does not assess how well the method performs on real data ... Spearman correlation is a more robust measure ... [but] it may yield skewed results on datasets with many tied ranks."
      • "we argue that vector based methods are generally superior to other[s]"
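On the tied-ranks caveat: Spearman correlation computed as Pearson on average (midrank) ranks stays well defined under ties, unlike the classic rank-difference formula. A stdlib-only sketch:

```python
def average_ranks(xs):
    """1-based ranks, with tied values assigned the average of the
    positions they occupy (midranks)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the rank vectors;
    this form remains correct in the presence of tied ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

print(round(spearman([1, 2, 3], [3, 2, 1]), 3))
```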
    • In-vivo
      • text similarity, word choice (e.g. TOEFL), WSD, sense clustering, IR: {document ranking, query expansion}, coreference resolution, ontology construction and matching, Malapropism detection
    • "there is no strong evidence of a positive correlation between the ... [performance] in in-vitro evaluation ... and in in-vivo evaluation"
    • Data sets: Rubenstein and Goodenough, Finkelstein et al., and many others; all were originally used for similarity (not relatedness)
  • Tools
    • Wikipedia: Parse::MediaWikiDump, Ponzetto and Strube (2007)
    • DEXTRACT: creating evaluation datasets
    • WordNet::Similarity

Monday, February 11, 2013

Comparison of Semantic Similarity for Different Languages Using the Google n-gram Corpus and Second-Order Co-occurrence Measures. Colette Joubarne, Diana Inkpen. Advances in AI 2011

  • Claims
    • many languages lack corpora sufficient to support valid measures of semantic similarity
    • manually-assigned similarity scores from one language can be transferred to another language
    • proposes an automatic word similarity measure based on second-order co-occurrences in the Google n-gram corpus, for English, German, and French

Semantic similarity estimation from multiple ontologies. Montserrat Batet, David Sánchez, Aida Valls, Karina Gibert. Appl Intell 2013

  • Claims
    • enable similarity estimation across multiple ontologies
    • solve missing values, when partial knowledge is available
    • capture the strongest semantic evidence that results in the most accurate similarity assessment, when dealing with overlapping knowledge
  • Key ideas
    • Consider sub-cases
      • both concepts appear in one ontology
      • concepts appear in different ontologies
      • missing concepts
      • etc.
    • requires a taxonomy structure (other relations not useful?)
  • Related work
    • mapping the local terms of distinct ontologies into a single existing one
    • creating a new ontology by integrating existing ones
    • compute the similarity between terms as a function of some ontological features
    • ontologies are connected by a new imaginary root node
    • matching concept labels of different ontologies
    • graph-based ontology alignment ... by means of path-based similarity measures.
    • combines path length and common specificity.
  • Experiments
    • general purpose and biomedical benchmarks of word pairs
    • baseline: related works in multi-ontology similarity assessment.

Friday, February 8, 2013

A Graph-Theoretic Framework for Semantic Distance. Vivian Tsang, Suzanne Stevenson. CL 2010

  • Problem: similarity of texts (not single words)
  • Claims
    • "[we do] integration of distributional and ontological factors in measuring semantic distance between two sets of concepts (mapped from two texts) [within a network flow formalism]"
  • Key ideas
    • "Our goal is to measure the distance between two subgraphs (representing two texts to be compared), taking into account both the ontological distance between the component concepts and their frequency distributions. To achieve this, we measure the amount of “effort” required to transform one profile to match the other graphically: The more similar they are, the less effort it takes to transform one into the other. (This view is similar to that motivating the use of “earth mover’s distance” in computer vision [Levina and Bickel 2001].)"
    • "[our] notion of semantic distance as transport effort of concept frequency over the relations (edges) of an ontology differs significantly from ... [using] concept vectors of frequency. ... our approach can [compare] texts that use related but non-equivalent concepts." (seems to be the main argument in favor of graph-based iterative methods)
    • "viewed as a supply–demand problem, in which we find the minimum cost flow (MCF) from the supply profile to the demand profile ... Each edge ... has a cost ... Each node [has a] supply ... [or] demand ... The goal is to find a flow from supply nodes to demand nodes that satisfies the supply/demand constraints of each node and minimizes the overall “transport cost.”"
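In the special case where the compared concepts all lie on a single taxonomy path with unit edge costs, this minimum-cost-flow formulation reduces to the 1-D earth mover's distance: the transport effort is the sum of absolute cumulative differences between the two normalized profiles. This toy case is my own simplification, not the paper's general MCF solver:

```python
def path_transport_cost(p, q):
    """Minimum transport effort between two concept-frequency profiles
    whose concepts lie along a single taxonomy path with unit edge
    costs (the 1-D special case of minimum cost flow / earth mover's
    distance). Profiles are normalized so supply equals demand."""
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    cost = carried = 0.0
    for a, b in zip(p, q):
        carried += a - b   # net mass that must cross to the next node
        cost += abs(carried)
    return cost

# All mass must travel two edges along the path: effort 2.
print(path_transport_cost([1, 0, 0], [0, 0, 1]))
```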
  • Comments
    • Requires an ontology
    • "Distributional" refers to term frequencies within the compared text (not in some corpus)
  • Interesting papers
    • Using network flows
      • Pang, Bo and Lillian Lee. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. ACL 2004
      • Barzilay, Regina and Mirella Lapata. Collective content selection for concept-to-text generation. HLT/EMNLP 2005
    • Mihalcea, Rada. Unsupervised large-vocabulary word sense disambiguation with graph-based algorithms for sequence data labeling. HLT/EMNLP 2005

Structural Semantic Relatedness: A Knowledge-Based Method to Named Entity Disambiguation. Xianpei Han, Jun Zhao. ACL 2010

  • Claims
    • "proposes a reliable semantic relatedness measure between concepts ... which can capture both the explicit semantic relations between concepts and the implicit semantic knowledge embedded in [multiple] graphs and networks."
  • Key ideas
    • “two concepts are semantic related if they are both semantic related to the neighbor concepts of each other”
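The quoted principle has the same recursive flavor as SimRank, where s(a,b) is defined from the average similarity of a's and b's neighbors. A minimal SimRank sketch on an invented toy graph (illustrative of the principle, not the paper's actual measure):

```python
def simrank(neighbors, C=0.8, iters=20):
    """SimRank-style fixed-point iteration:
    s(a,a) = 1; s(a,b) = C * avg over neighbor pairs of s(x, y)."""
    nodes = list(neighbors)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iters):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                elif neighbors[a] and neighbors[b]:
                    total = sum(sim[(x, y)]
                                for x in neighbors[a]
                                for y in neighbors[b])
                    denom = len(neighbors[a]) * len(neighbors[b])
                    new[(a, b)] = C * total / denom
                else:
                    new[(a, b)] = 0.0
        sim = new
    return sim

# "a" and "b" share the neighbor "c", so they come out related.
g = {"a": ["c"], "b": ["c"], "c": ["a", "b"]}
s = simrank(g)
print(round(s[("a", "b")], 3))
```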
  • Interesting papers
    • Amigo, E., Gonzalo, J., Artiles, J. and Verdejo, F. A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval 2008

Disambiguating Identity Web References using Web 2.0 Data and Semantics. Matthew Rowe, Fabio Ciravegna. Journal of Web Semantics 2010

  • Comments
    • Use ideas such as "Average First-Passage Time" of a graph
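Average first-passage time m(i -> j) for a Markov chain satisfies m(j) = 0 and m(i) = 1 + sum_k P[i][k] * m(k) for i != j. A fixed-point-iteration sketch on an invented 3-node walk (real solvers would use the linear system directly):

```python
def first_passage_times(P, target, iters=2000):
    """Average first-passage times to `target` for a Markov chain with
    row-stochastic transition matrix P, via fixed-point iteration on
    m(i) = 1 + sum_k P[i][k] * m(k), with m(target) = 0."""
    n = len(P)
    m = [0.0] * n
    for _ in range(iters):
        m = [0.0 if i == target else
             1.0 + sum(P[i][k] * m[k] for k in range(n))
             for i in range(n)]
    return m

# Symmetric random walk on the path a-b-c; expected steps to reach c
# are 4 from a and 3 from b.
P = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 1.0, 0.0]]
print([round(x, 1) for x in first_passage_times(P, target=2)])
```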
  • Interesting papers
    • L. Lovasz, Random walks on graphs: A survey. Combinatorics 1993
    • M. Saerens, F. Fouss, L. Yen, P. Dupont, The principal components analysis of a graph, and its relationships to spectral clustering. ECML 2004