## Wednesday, March 13, 2013

#### Wikipedia-based Semantic Interpretation for Natural Language Processing. Evgeniy Gabrilovich, Shaul Markovitch. JAIR 2009

(Noting here the details not mentioned in the entry for the IJCAI paper...)
• Corpus preprocessing
• Discard articles that have fewer than 100 non-stop words or fewer than 5 incoming and outgoing links; discard articles that describe specific dates, as well as Wikipedia disambiguation pages, category pages and the like; remove stop words and rare words (occurring in fewer than 3 articles), and stem the remaining words.
• For text categorization: Document text is first tokenized, and title words are replicated twice to emphasize their importance. Then, stop words, numbers and mixed alphanumeric strings are removed, and the remaining words are stemmed. The bag of words is next merged with the set of features generated for the document ... and rare features occurring in fewer than 3 documents are removed. ... The generated features ... undergo feature selection using the information gain criterion.
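A minimal sketch of the document-side preprocessing described above, to make the order of the steps concrete. Several pieces are assumptions rather than details from the paper: the stop list is a tiny hypothetical one, `toy_stem` is a crude stand-in for a real stemmer, and "replicated twice" is read as adding the title tokens two times to a body that does not already contain the title. Feature generation and information-gain selection are left out.

```python
import re
from collections import Counter

# Hypothetical stop list for illustration only; the paper's list is not given in these notes.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for"}


def toy_stem(word):
    """Crude suffix stripper standing in for a real stemmer (an assumption)."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def preprocess(title, body):
    """Return a bag of stemmed words for one document."""
    # Title tokens are added twice to emphasize them (one reading of the notes).
    tokens = tokenize(title) * 2 + tokenize(body)
    kept = [
        toy_stem(t)
        for t in tokens
        # isalpha() drops pure numbers and mixed alphanumeric strings such as "mp3".
        if t not in STOP_WORDS and t.isalpha()
    ]
    return Counter(kept)


def drop_rare(bags, min_docs=3):
    """Remove features that occur in fewer than `min_docs` documents."""
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())
    return [
        Counter({t: c for t, c in bag.items() if doc_freq[t] >= min_docs})
        for bag in bags
    ]
```

For instance, `preprocess("Training models", "The models use 3 GPUs and mp3 files")` keeps stemmed forms of "training", "models", "use", "GPUs" and "files", and drops "the", "and", "3" and "mp3".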
• Concept vector preprocessing
• "The algorithm for pruning the inverted index operates as follows. We first sort all the concepts for a given word according to their TFIDF weights in decreasing order. We then scan the resulting sequence of concepts with a sliding window of length 100, and truncate the sequence when the difference in scores between the first and last concepts in the window drops below 5% of the highest-scoring concept for this word (which is positioned first in the sequence). This technique looks for fast drops in the concept scores, which would signify that the concepts in the tail of the sequence are only loosely associated with the word"
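The quoted pruning rule can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: the function name and the `(concept, score)` input format are made up here, and the quote does not say exactly where the cut falls, so this version keeps everything up to the end of the first "flat" window. Lists shorter than the window are never truncated.

```python
def prune_concepts(scored_concepts, window=100, threshold=0.05):
    """Prune one word's inverted-index entry.

    scored_concepts: list of (concept, tfidf_score) pairs for a single word.
    Returns the pairs sorted by decreasing score, truncated at the first
    window whose first-to-last score difference is below `threshold` times
    the top score.
    """
    ranked = sorted(scored_concepts, key=lambda pair: pair[1], reverse=True)
    if len(ranked) < window:
        return ranked

    cutoff = threshold * ranked[0][1]
    for start in range(len(ranked) - window + 1):
        first_score = ranked[start][1]
        last_score = ranked[start + window - 1][1]
        if first_score - last_score < cutoff:
            # Assumption: keep concepts up to and including this window.
            return ranked[: start + window]
    return ranked


# Small usage example with a short window so the effect is visible.
scores = [100, 80, 60, 40, 20, 10, 9.5, 9, 8.9, 8.8, 8.7]
entry = [(f"concept_{i}", s) for i, s in enumerate(scores)]
kept = prune_concepts(entry, window=3, threshold=0.05)
print(len(kept), "of", len(entry), "concepts kept")
```

With the paper's settings (`window=100`, 5%), the effect is to cut a long, nearly flat tail of weakly associated concepts while leaving the steep head of the ranking untouched.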
• Ideas
• Use link structure---called "second order model"
• Keep only general concepts
• "using a larger knowledge base is beneficial for ESA"
• "We evaluated the effect of using second-order interpretation for computing semantic relatedness of texts, but it only yielded negligible improvements. We hypothesize that the reason for this finding is that computing semantic relatedness essentially uses all available Wikipedia concepts, so second-order interpretation can only slightly modify the weights of existing concepts ... [in] the application of ESA to text categorization, we trim the interpretation vectors for the sake of efficiency, and only consider a few highest-scoring concepts ... In this scenario, second-order interpretation does have a positive effect and actually improves the accuracy of text categorization ... This happens because only a few selected Wikipedia concepts are used ... and the second-order approach selectively adds highly related concepts identified by analyzing Wikipedia links."
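A toy illustration of why second-order interpretation matters more for trimmed vectors than for full ones. Everything here is a hedged sketch: the discount factor `alpha`, the function names, and the dictionary-based link graph are assumptions made for the example, not details taken from these notes.

```python
def top_k(vector, k):
    """Keep only the k highest-scoring concepts of an interpretation vector."""
    ranked = sorted(vector.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])


def second_order(vector, links, alpha=0.5):
    """Add concepts linked from those in `vector`, discounted by `alpha`.

    vector: dict mapping concept -> weight.
    links:  dict mapping concept -> iterable of concepts it links to.
    alpha:  hypothetical discount applied to weight passed along a link.
    """
    expanded = dict(vector)
    for concept, weight in vector.items():
        for target in links.get(concept, ()):
            expanded[target] = expanded.get(target, 0.0) + alpha * weight
    return expanded


full = {"Physics": 1.0, "Science": 0.8, "Journal": 0.1}
links = {"Physics": ["Academic publishing"], "Science": ["Academic publishing", "Physics"]}

# Trimmed vector: the linked concept "Academic publishing" is genuinely new.
trimmed = top_k(full, 2)
print(second_order(trimmed, links))
```

When the full vector already contains nearly every concept, the same step can only shift existing weights, which matches the quoted explanation for the negligible gain on semantic relatedness.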
• Evaluation
• Data sets for text categorization:
• Reuters-21578 with the ModApte split
• 20 Newsgroups (20NG): noisy.
• Movie Reviews (Movies) (Pang, Lee, & Vaithyanathan, 2002)---sentiment (rather than topical) classification
• Reuters Corpus Volume I (RCV1) (Lewis, Yang, Rose, & Li, 2004)
• OHSUMED, a subset of MEDLINE.
• Reporting results
• "precision-recall break-even point (BEP) to measure text categorization performance"
• "we report both micro- and macro-averaged BEP, since their categories differ in size significantly. Micro-averaged BEP operates at the document level and is primarily affected by categorization performance on larger categories ... macro-averaged BEP averages results for individual categories, and thus small categories with few training examples have large impact on the overall performance."
• "[for] a fixed train/test split [we] used macro sign test (S-test) (Yang & Liu, 1999) to assess the statistical significance of differences in classifier performance... [for] 4-fold cross-validation [we] used paired t-test ... [we] used the non-parametric Wilcoxon signed-ranks test (Demsar, 2006) to compare ... classifiers over multiple data sets."
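The difference between micro- and macro-averaged BEP is easy to see on a small example. The sketch below makes a simplifying assumption: for each category it takes the top R ranked documents as positive, where R is the number of true positives in that category, so that precision and recall coincide exactly. Published evaluations may instead interpolate between nearby precision/recall points; the function names are invented for this illustration.

```python
def bep_counts(scores, labels):
    """Return (true positives among the top R, R) for one category.

    R is the number of positive documents; predicting exactly R positives
    makes precision equal recall, both being tp / R.
    """
    r = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = sum(label for _, label in ranked[:r])
    return tp, r


def macro_bep(per_category):
    """Average the per-category BEP values; every category counts equally."""
    beps = [tp / r for tp, r in per_category if r > 0]
    return sum(beps) / len(beps)


def micro_bep(per_category):
    """Pool document-level decisions; large categories dominate."""
    total_tp = sum(tp for tp, _ in per_category)
    total_r = sum(r for _, r in per_category)
    return total_tp / total_r


# One large category classified fairly well, one tiny category missed entirely.
large = bep_counts([0.9, 0.8, 0.7, 0.6, 0.5, 0.4], [1, 1, 0, 1, 1, 0])
small = bep_counts([0.9, 0.2, 0.1], [0, 1, 0])
print("macro:", macro_bep([large, small]))
print("micro:", micro_bep([large, small]))
```

The missed small category pulls the macro average down much further than the micro average, which is the effect the quote describes.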
• Interesting
• Analysis using examples: "For example, given ... phrase "scientific article" ... ESA determines ... the following Wikipedia concepts ... among the top 20"
• Related work
• Distributional similarity methods (Lee, 1999) compute the similarity of a pair of words w1 and w2 by comparing the distributions of other words given these two, e.g., by comparing vectors of probabilities P(v|w1) and P(v|w2) for a large vocabulary V of words (v ∈ V).
• LSA [manipulates] a vector of ... latent concepts ... obtained through SVD ... of a word-by-document matrix. ... CYC represents semantics of words through an elaborate network of interconnected and richly-annotated concepts ... [and] depend[s] on manual encoding of inference rules.
• With the exception of LSA, most prior approaches to semantic interpretation explicitly represent semantics of individual words, and require an extra level of sophistication to represent longer texts.
• Sahami and Heilman (2006) ... send two snippets as queries to a search engine, and compare the bags of words for the two sets of returned documents. ... it is only applicable to short texts, because sending a long text as a query to a search engine is likely to return few or even no results at all.
• "The above-mentioned based techniques are inherently limited to individual words, and their adaptation for comparing longer texts requires an extra level of complexity (Mihalcea et al., 2006)."
• Text categorization: Zelikovitz and Hirsh (2000) [use] unlabeled examples as intermediaries in comparing testing examples with the training ones. ... when an unknown test instance does not appear to resemble any labeled training instances, unlabeled examples that are similar to both may be used as "bridges." ... [when] the training and the test document have few or no words in common ... unlabeled documents are utilized to define a cosine similarity metric, which is then used by the KNN algorithm for actual text categorization.
• Potthast, Stein, and Anderka (2008) and Sorg and Cimiano (2008) adapted ESA for multi-lingual and cross-lingual information retrieval.
• Zesch, Mueller, and Gurevych (2008) proposed to use Wiktionary for computing semantic relatedness.
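To make the distributional similarity idea from the related work concrete, here is a small sketch that estimates P(v|w) from co-occurrence counts and compares two such distributions. The choices are assumptions for illustration: a symmetric context window, maximum-likelihood estimates with no smoothing, and Jensen-Shannon divergence as the comparison function (the notes do not say which measure is meant).

```python
import math
from collections import Counter, defaultdict


def cooccurrence_distributions(sentences, window=2):
    """Estimate P(v | w) from counts of words v seen within `window` of w."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, w in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[w][tokens[j]] += 1
    return {
        w: {v: c / sum(ctx.values()) for v, c in ctx.items()}
        for w, ctx in counts.items()
    }


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    total = 0.0
    for v in set(p) | set(q):
        pv, qv = p.get(v, 0.0), q.get(v, 0.0)
        m = 0.5 * (pv + qv)
        if pv > 0:
            total += 0.5 * pv * math.log2(pv / m)
        if qv > 0:
            total += 0.5 * qv * math.log2(qv / m)
    return total


corpus = [
    ["cat", "drinks", "milk"],
    ["kitten", "drinks", "milk"],
    ["dog", "drinks", "water"],
]
dist = cooccurrence_distributions(corpus)
print("cat vs kitten:", js_divergence(dist["cat"], dist["kitten"]))
print("cat vs dog:", js_divergence(dist["cat"], dist["dog"]))
```

Note that this compares individual words only, which is the limitation the notes point out: representing longer texts needs an extra layer on top.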