Thursday, March 14, 2013

A Wikipedia-Based Multilingual Retrieval Model. Martin Potthast, Benno Stein, and Maik Anderka. ECIR 2008

  • Key idea
    • Use aligned Wikipedia articles (concepts) in two languages to map words/documents in different languages into a common concept space (sketched below).
  • Comments
    • "A reasonable trade-off between retrieval quality and runtime is achieved for a concept space dimensionality between 1000 and 10000."

Wednesday, March 13, 2013

Wikipedia-based Semantic Interpretation for Natural Language Processing. Evgeniy Gabrilovich, Shaul Markovitch. JAIR 2009

(Noting here the details not mentioned in the entry for the IJCAI paper...)
  • Corpus preprocessing
    • Discard articles that have fewer than 100 non-stop words or fewer than 5 incoming and outgoing links; discard articles that describe specific dates, as well as Wikipedia disambiguation pages, category pages, and "the like"; remove stop words and rare words (occurring in fewer than 3 articles), and stem the remaining words.
    • For text categorization: Document text is first tokenized, and title words are replicated twice to emphasize their importance. Then, stop words, numbers and mixed alphanumeric strings are removed, and the remaining words are stemmed. The bag of words is next merged with the set of features generated for the document ... and rare features occurring in fewer than 3 documents are removed. ... The generated features ... undergo feature selection using the information gain criterion.
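
A hedged sketch of the article filters quoted above; the `Article` record and the title-based tests are my stand-ins, since the paper does not spell out how date and meta pages are detected:

```python
from dataclasses import dataclass

@dataclass
class Article:
    title: str
    non_stopword_count: int
    in_links: int
    out_links: int

def keep(a: Article) -> bool:
    """Apply the quoted ESA corpus filters to one Wikipedia article."""
    if a.non_stopword_count < 100:
        return False
    if a.in_links < 5 or a.out_links < 5:  # "fewer than 5 incoming and outgoing links"
        return False
    # Title conventions stand in for the unspecified date/meta-page tests.
    if a.title.startswith("Category:") or a.title.endswith("(disambiguation)"):
        return False
    return True
```
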
  • Concept vector preprocessing
    • "The algorithm for pruning the inverted index operates as follows. We first sort all the concepts for a given word according to their TFIDF weights in decreasing order. We then scan the resulting sequence of concepts with a sliding window of length 100, and truncate the sequence when the difference in scores between the first and last concepts in the window drops below 5% of the highest-scoring concept for this word (which is positioned first in the sequence). This technique looks for fast drops in the concept scores, which would signify that the concepts in the tail of the sequence are only loosely associated with the word"
  • Ideas
    • Use the link structure (called the "second-order model")
    • Keep only general concepts
  • Comments
    • "using a larger knowledge base is beneficial for ESA"
    • "We evaluated the effect of using second-order interpretation for computing semantic relatedness of texts, but it only yielded negligible improvements. We hypothesize that the reason for this finding is that computing semantic relatedness essentially uses all available Wikipedia concepts, so second-order interpretation can only slightly modify the weights of existing concepts ... [in] the application of ESA to text categorization, we trim the interpretation vectors for the sake of efficiency, and only consider a few highest-scoring concepts ... In this scenario, second-order interpretation does have a positive effect and actually improves the accuracy of text categorization ... This happens because only a few selected Wikipedia concepts are used ... and the second-order approach selectively adds highly related concepts identified by analyzing Wikipedia links."
  • Evaluation
    • Data sets for text categorization: 
      • Reuters-21578 with the ModApte split
      • 20 Newsgroups (20NG): noisy.
      • Movie Reviews (Movies) (Pang, Lee, & Vaithyanathan, 2002): sentiment (rather than topical) classification
      • Reuters Corpus Volume I (RCV1) (Lewis, Yang, Rose, & Li, 2004)
      • OHSUMED, a subset of MEDLINE. 
    • Reporting results
      • "precision-recall break-even point (BEP) to measure text categorization performance"
      • "we report both micro- and macro-averaged BEP, since their categories differ in size significantly. Micro-averaged BEP operates at the document level and is primarily affected by categorization performance on larger categories ... macro-averaged BEP averages results for individual categories, and thus small categories with few training examples have large impact on the overall performance."
      • "[for] a fixed train/test split [we] used macro sign test (S-test) (Yang & Liu, 1999) to assess the statistical significance of differences in classifier performance... [for] 4-fold cross-validation [we] used paired t-test ... [we] used the non-parametric Wilcoxon signed-ranks test (Demsar, 2006) to compare ... classifiers over multiple data sets. 
  • Interesting
    • Analysis using examples: "For example, given ... phrase "scientific article" ... ESA determines ... the following Wikipedia concepts ... among the top 20"
  • Related work
    • Distributional similarity methods (Lee, 1999) compute the similarity of a pair of words w1 and w2 by comparing the distributions of other words given these two, e.g., by comparing the vectors of probabilities P(v|w1) and P(v|w2) for each word v in a large vocabulary V (a toy sketch follows after this list).
    • LSA [manipulates] a vector of ... latent concepts ... obtained through SVD ... of a word-by-document matrix. ... CYC represents semantics of words through an elaborate network of interconnected and richly-annotated concepts ... [and] depend[s] on manual encoding of inference rules.
    • With the exception of LSA, most prior approaches to semantic interpretation explicitly represent semantics of individual words, and require an extra level of sophistication to represent longer texts.
    • Sahami and Heilman (2006) ... send two snippets as queries to a search engine, and compare the bags of words for the two sets of returned documents. ... it is only applicable to short texts, because sending a long text as a query to a search engine is likely to return few or even no results at all.
    • "The above-mentioned based techniques are inherently limited to individual words, and their adaptation for comparing longer texts requires an extra level of complexity (Mihalcea et al., 2006)."
    • Text categorization: Zelikovitz and Hirsh (2000) [use] unlabeled examples as intermediaries in comparing testing examples with the training ones. ... when an unknown test instance does not appear to resemble any labeled training instances, unlabeled examples that are similar to both may be used as "bridges." ... [when] the training and the test document have few or no words in common ... unlabeled documents are utilized to define a cosine similarity metric, which is then used by the KNN algorithm for actual text categorization.
    • Potthast, Stein, and Anderka (2008) and Sorg and Cimiano (2008) adapted ESA for multi-lingual and cross-lingual information retrieval.
    • Zesch, Mueller, and Gurevych (2008) proposed to use Wiktionary for computing semantic relatedness.
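
A toy sketch of the distributional-similarity notion from the first bullet of this related-work list: estimate P(v|w) from co-occurrence windows and compare the two distributions. The corpus representation and the use of Jensen-Shannon divergence are my choices, not the paper's:

```python
from collections import Counter
from math import log2

def context_dist(word, windows):
    """Estimate P(v|word) from words co-occurring with `word`."""
    counts = Counter(v for w in windows if word in w for v in w if v != word)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

def jsd(p, q):
    """Jensen-Shannon divergence between two distributions (dicts)."""
    m = {v: 0.5 * (p.get(v, 0) + q.get(v, 0)) for v in set(p) | set(q)}
    def kl(a):
        return sum(pv * log2(pv / m[v]) for v, pv in a.items() if pv > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

windows = [["cat", "purrs", "softly"], ["dog", "barks"], ["cat", "meows"]]
p1, p2 = context_dist("cat", windows), context_dist("dog", windows)
print(jsd(p1, p2))  # lower = more distributionally similar
```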

Thursday, March 7, 2013

Recent Advances in Methods of Lexical Semantic Relatedness – a Survey. Ziqi Zhang, Anna Lisa Gentile, Fabio Ciravegna. NLE 2012

  • Corpora: Wikipedia, Wiktionary, WordNet, various biomedical corpora
  • Methods: 
    • based on Path, Information Content, Gloss, Vector
      • all methods use structure, mainly from WordNet/Wikipedia
      • some methods treat Wikipedia articles as concepts (and use no other structure)
    • based on distributional similarity
      • PMI, Chi-squared test
      • Dice, Jaccard and Cosine (search engine based)
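
PMI transcribes directly from co-occurrence counts; a minimal sketch of the standard formulation (nothing here is specific to the survey):

```python
from math import log2

def pmi(count_xy, count_x, count_y, total):
    """PMI(x, y) = log2( P(x, y) / (P(x) P(y)) ), estimated from counts."""
    return log2((count_xy * total) / (count_x * count_y))
```
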
    • hybrid
      • combination: run each method separately, then combine the scores, e.g., by linear combination
      • integration: run each method separately, then use the scores as features in a learned hybrid model (both sketched below)
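
A sketch of the two hybrid strategies, with made-up scores from two methods; "combination" mixes the scores directly, while "integration" lets a model learn the weights from labeled pairs:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

path_scores   = np.array([0.9, 0.2, 0.6])   # e.g. a WordNet path-based method
vector_scores = np.array([0.8, 0.1, 0.7])   # e.g. an ESA-style vector method

# Combination: a fixed linear mix of the individual scores.
combined = 0.5 * path_scores + 0.5 * vector_scores

# Integration: individual scores become features of a learned model.
X = np.column_stack([path_scores, vector_scores])
y = np.array([1, 0, 1])                     # toy gold "related?" labels
model = LogisticRegression().fit(X, y)
print(combined, model.predict_proba(X)[:, 1])
```
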
  • Notes
    • "distributional similarity methods ... have been used as a proxy for [methods of semantic relatedness]."
    • Distinguish concept and word; model relatedness between concepts and between words separately; usually a (polysemous) word w has several associated concepts C(w), and the relatedness between words w1 and w2 would then be some function (e.g., the maximum; see the sketch after this list) of the relatedness between the concepts in C(w1) and C(w2).
    • Distinguish similarity and relatedness between words/concepts; model them separately; also model distance.
      • "two words are distributionally similar if (1) they tend to occur in each other’s context; or (2) the contexts each tends to occur in are similar; or (3) that if one word is substituted for another in a context, its “plausibility” is unchanged [(measured using search engines)]. Different methods have adopted different definitions of contexts ..."
      • Method surveys: Weeds (2003), Turney and Pantel (2010)
      • "Budantisky and Hirst (2006) argued that there are three essential differences between [semantic relatedness and distributional similarity] ... Firstly, semantic relatedness is inherently a relation on concepts, while distributional similarity is a relation on words; secondly, semantic relatedness is typically symmetric, whereas distributional similarity can be potentially asymmetric; finally, semantic relatedness depends on a structured lexicographic or knowledge bases, distributional similarity is relative to a corpus."
  • Evaluation
    • In-vitro
      • "In-vitro evaluation ... [i.e.] correlation with human judgement ... does not assess how well the method performs on real data ... Spearman correlation is a more robust measure ... [but] it may yield skewed results on datasets with many tied ranks."
      • "we argue that vector based methods are generally superior to other[s]"
    • In-vivo
      • text similarity, word choice (e.g., TOEFL), WSD, sense clustering, IR (document ranking, query expansion), coreference resolution, ontology construction and matching, malapropism detection
    • "there is no strong evidence of a positive correlation between the ... [performance] in in-vitro evaluation ... and in in-vivo evaluation"
    • Data sets: Rubenstein and Goodenough, Finkelstein et al., and many others; all were originally used for similarity (not relatedness)
  • Tools
    • Wikipedia: Parse::MediaWikiDump, Ponzetto and Strube (2007)
    • DEXTRACT: creating evaluation datasets
    • WordNet::Similarity