## Tuesday, July 16, 2013

Language Models for Keyword Search over Data Graphs. Yosi Mass, Yehoshua Sagiv. WSDM 2012
• Problem
• given a keyword query, find entities in a graph of entities
• the graph is probably derived from a database, and it is presumed that the user will find SQL difficult to use.
• examples of such databases include Wikipedia, IMDB, and Mondial.

## Tuesday, July 9, 2013

Characterizing the Influence of Domain Expertise on Web Search Behavior. Ryen W. White, Susan T. Dumais, Jaime Teevan. WSDM 2009
•  Look up
• maximum-margin averaged perceptron (Collins, M. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. EMNLP 2002)
Disorder Inequality: A Combinatorial Approach to Nearest Neighbor Search. Navin Goyal, Yury Lifshits, Hinrich Schütze. WSDM 2008
• Ideas
• "combinatorial" setting: only comparisons between similarity scores allowed.
• two randomized algorithms for nearest-neighbor search, requiring O(n^2) and O(n log n) space for preprocessed data

## Sunday, June 30, 2013

Cross-Lingual Latent Topic Extraction. Duo Zhang, Qiaozhu Mei, ChengXiang Zhai. ACL 2010
• Key ideas
• Input: unaligned document sets in two languages, a bilingual dictionary
• Output:
• a set of aligned topics (word distributions) in the two languages, that can characterize the shared topics
•  a topic coverage distribution for each language (coverage of each topic in that language)
• Method:
• Add a term to incorporate dictionary constraints (DC)
• Dictionary modeled as a weighted bipartite graph (weight = translation probability)
• Maximum-likelihood estimation via Generalized EM (because the DC term has no closed-form maximization)
• In each M-step, the DC term is not maximized exactly; it is merely improved over the current value

## Friday, April 26, 2013

Accurate Methods for the Statistics of Surprise and Coincidence. Ted Dunning. Computational Linguistics 1993.
• Ideas
• "ordinary words are 'rare', any statistical work with texts must deal with the reality of rare events ... Unfortunately, the foundational assumption of most common statistical analyses used in computational linguistics is that the events being analyzed are relatively common."
• Counting a word w can be viewed as a series of Bernoulli trials. Each token is tested to see if it is w. Assuming a uniform probability p that it is w, the count is distributed binomially and, for np(1-p)>5 (where n=number of tokens), it is distributed almost normally. But this is not true when np(1-p)<1.
• Given outcomes k, propose a model with parameters w. The likelihood of the parameter value w is P(k|w). A hypothesis is a subset of the parameter space W.
• The likelihood ratio of the hypothesis (parameter subspace) W0 = LR = max_{w \in W0} P(k|w) / max_{w \in W} P(k|w).
• Fact: -2 log LR is asymptotically \chi^2(dim W - dim W0).
• References
• Likelihood ratio tests: Mood et al. (1974)
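The likelihood-ratio construction above can be sketched for the two-binomial case (comparing a word's rate in two contexts). A minimal sketch with invented toy counts, not Dunning's exact experimental setup; assumes 0 < k < n so no log(0):

```python
import math

def log_l(k, n, p):
    # log binomial likelihood log P(k | n, p), dropping the constant C(n, k)
    return k * math.log(p) + (n - k) * math.log(1 - p)

def g2(k1, n1, k2, n2):
    """-2 log LR for H0: p1 == p2 against H1: p1, p2 free (two binomials).
    Asymptotically chi^2 with 1 degree of freedom (dim W - dim W0 = 2 - 1)."""
    p = (k1 + k2) / (n1 + n2)      # MLE under H0 (shared rate)
    p1, p2 = k1 / n1, k2 / n2      # MLEs under H1
    return 2 * (log_l(k1, n1, p1) + log_l(k2, n2, p2)
                - log_l(k1, n1, p) - log_l(k2, n2, p))
```

A word occurring 10 times in 100 tokens after some cue word, versus 5 times in 10,000 tokens elsewhere, gives a G² far above the usual significance thresholds, while equal rates give G² = 0.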

## Thursday, April 25, 2013

Statistical Extraction and Comparison of Pivot Words for Bilingual Lexicon Extension. Daniel Andrade, Takuya Matsuzaki, Jun'ichi Tsujii. TALIP 2012
• Ideas
• Use only statistically significant context---determined using Bayesian estimate of PMI
• "calculate a similarity score ... using the probability that the same pivots [(words from the seed lexicon)] will be extracted for both the query word and the translation candidate."
• "several context [features] ... a bag-of-words of one sentence, and the successors, predecessors, and siblings with respect to the dependency parse tree of the sentence."
• "In order to make these context positions comparable across Japanese and English ... we use several heuristics to adjust the dependency trees appropriately."
•  "the degree of association is defined as a measurement for finding words that co-occur, or which do not co-occur, more often than we would expect by pure chance [e.g.] Log-Likelihood-Ratio ... As an alternative, we suggest to use the statistical significance of a positive association"
• Heuristics to make dependency trees comparable are language-pair-specific (for EN-JP only)
• Related work
• Standard approach
• Fix corpora in two languages, and pivot words (seed lexicon)
• For each query word, construct vector of pivot words, and compare.
• Construct vector: some measure of association between query word and pivot word
• Compare: some similarity measure suitable for association measure
• "Context" is a bag-of-words, usually a sentence (or doc?).
• Variations of standard approach
• Peters and Picchi 1997: PMI
• Fung 1998: tf.idf and cosine similarity
• Rapp 1999: log-likelihood ratio and Manhattan distance
• Koehn and Knight 2002: Spearman correlation
• Pekar et al 2006: conditional probability
• Laroche and Langlais 2010: log-odds-ratio and cosine similarity
• Variations incorporating syntax
• Rapp 1999: use word order (assumes word ordering is similar for both languages)
• Pekar et al 2006: use verb-noun dependency
• Otero & Campos 2008: POS tag the corpus and use lexico-syntactic patterns as features; e.g. extract (see, SUBJ, man) from "A man sees a dog." and use (see, SUBJ, *) to find translations for "man".
• Garera et al 2009: use predecessors and successors in dependency graph (and do not use bag-of-words at all)
• Variations incorporating non-pivot words (to overcome the "seed lexicon bottleneck")
• Gaussier et al 2004: construct vector of all words (and not just pivot words) for the query word and each pivot word. Now construct vector of pivot words, and instead of association measure between query and pivot, use the similarity between all-words query vector and all-words pivot vector.
• Dejean et al 2002: use domain-specific multilingual thesaurus
• Variations incorporating senses
• Ismail and Manandhar 2010: construct query vector "given" another word (the sense-disambiguator (SD) word, say). For a query word, one can construct different vectors given different SD words. For each vector, find translation.
• Probabilistic approach
• Haghighi et al 2008: use a generative model where source and target words are generated from a common latent subspace. Maximize likelihood in the graphical model to learn the source-target matchings.
• "suffers from high computational costs ... They did not compare [with] ...  standard context vector approaches, which makes it difficult to estimate the possible gains from their method."
• Graph-based approach
• Laws et al 2010
• one graph per language, words as nodes, 3 types of nodes (adjectives, verbs, nouns) and 3 types of edges (adjectival modification, verb-object relation, noun coordination), edge weights represent strength of correlation
• seed lexicon for connecting the two graphs
• node pair similarity computed using SimRank-like algorithm
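The "standard approach" outlined above (pivot vectors, an association measure, a similarity measure) can be sketched end-to-end. The toy counts, the word `Hund`, and the PMI/cosine combination are all illustrative choices, not Andrade et al.'s exact setup:

```python
import math

def pivot_vector(word, cooc, freq, n_tokens, pivots):
    # association vector over pivot words; here PMI, floored at 0
    vec = {}
    for p in pivots:
        joint = cooc.get((word, p), 0)
        if joint:
            pmi = math.log(joint * n_tokens / (freq[word] * freq[p]))
            vec[p] = max(pmi, 0.0)
    return vec

def cosine(u, v):
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Pivots come from the seed lexicon, so vectors from both corpora share keys.
pivots = ["bark", "leash"]
# source-corpus counts (toy numbers) for the query word
src = pivot_vector("Hund", {("Hund", "bark"): 8, ("Hund", "leash"): 5},
                   {"Hund": 20, "bark": 10, "leash": 10}, 1000, pivots)
# target-corpus counts for two translation candidates
tgt_freq = {"dog": 25, "cat": 25, "bark": 12, "leash": 12}
dog = pivot_vector("dog", {("dog", "bark"): 9, ("dog", "leash"): 6},
                   tgt_freq, 1200, pivots)
cat = pivot_vector("cat", {("cat", "leash"): 1}, tgt_freq, 1200, pivots)
```

Ranking candidates by `cosine(src, ·)` puts `dog` first, which is the whole method in miniature.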

Attended a literature review on Question Answering by Akihiro Katsura. Some interesting references.
• Green, Chomsky, et al. 1961. The BASEBALL system.
• rule-based
• Isozaki et al. 2009.
• machine learning-based
• Methodology of QA
• Question analysis
• Xue et al. SIGIR 2008. Retrieval  models for QA archives.
• Text retrieval
• Jones et al. IPM 2000. ---Okapi/BM25
• Berger et al. SIGIR 2000.---Bridging the lexical chasm (OOV problem)
• Brill et al. TREC 2001.---Data intensive QA
• Lafferty et al. ICML 2001.---CRF
• Ravichandran & Hovy. ACL 2002.---Surface text patterns for QA
• Clarke et al. SIGIR 2001.---Redundancy in QA
• Other related work
• Cao et al. WWW 2008.---Recommending questions
• Wang et al. SIGIR 2009.---Similar questions
• Wang et al. SIGIR 2009.---answer ranking
• Jeon et al. SIGIR 2006.---answer ranking with non-textual features (votes, etc.)

## Wednesday, April 24, 2013

Identifying Word Translations from Comparable Documents Without a Seed Lexicon. Reinhard Rapp, Serge Sharoff, Bogdan Babych. LREC 2012
• Idea
• Assume only document-aligned comparable corpora (and no seed lexicon---"typically comprising at least 10,000 words")
• Characterize each article by a set of keywords
• "Formulate translation identification as a variant of the word alignment problem in a noisy setting"
• actually solved using a neural net-style algorithm by Rumelhart & McClelland (1987)
• "If ... in language A two words co-occur more often than expected by chance, then their translated equivalents in language B should also co-occur more frequently than expected."
• Experiments
• Preprocessing
• lemmatization (of corpora and evaluation pairs)
• "we use the log-likelihood score as a measure of keyness [or salience of words in a document], since it has been shown to be robust to small [documents] ... the threshold of 15.13 for the log-likelihood score is a conservative recommendation for statistical significance."
• "[we] applied a threshold of five [occurrences] ... [and] added all words of the ... gold standard(s) [even if they were below the threshold]"
• Gold standard
• "The source language words in the gold standard were supposed to be systematically derived from a large corpus, covering a wide range of frequencies, parts of speech, and variances of their distribution. In addition, the corpus from which the gold standard was derived was supposed to be completely separate from the development set (Wikipedia)."
• "list of words extracted from the British National Corpus (BNC) by Adam Kilgarriff for the purpose of examining distributional variability." http://kilgarriff.co.uk/bnc-readme.html

A Linguistically Grounded Graph Model for Bilingual Lexicon Extraction. Florian Laws, Lukas Michelbacher, Beate Dorow, Christian Scheible, Ulrich Heid, Hinrich Schutze. COLING 2010

## Tuesday, April 23, 2013

Addressing polysemy in bilingual lexicon extraction from comparable corpora. Darja Fiser, Nikola Ljubesic, Ozren Kubelka. LREC 2012
• Idea
• Get source word senses (using sense tagger), construct context vectors for each sense, and then find target translation.
• To compute sense-specific vectors: split occurrences of source word into groups, and build context vectors separately for each group.
•  Translate context vectors into target language using seed lexicon
• Combine info from several taggers to improve accuracy.
• Take only those words where the tags of both taggers agree.
• Comments (on the classical approach, using a context vector of words)
• "The main idea behind [the classical] approach is the assumption that a source word and its translation appear in similar contexts in their respective languages, so that in order to identify them their contexts are compared via a seed dictionary (Fung, 1998; Rapp, 1999)"
• "[the classical approach] approach gives good results for a specialized domain even though the seed dictionary is quite small (Fiser et al., 2011)."
• "... for closely related languages, ... the same quality of the results can be achieved by exploiting the lexical overlap between the languages instead of using a seed dictionary (Ljubesic and Fiser, 2011)."

## Thursday, March 14, 2013

#### A Wikipedia-Based Multilingual Retrieval Model. Martin Potthast, Benno Stein, and Maik Anderka. ECIR 2008

• Key idea
• Use aligned Wiki articles (concepts) in two languages to map words/documents in different languages into a common concept space.
• "A reasonable trade-off between retrieval quality and runtime is achieved for a concept space dimensionality between 1000 and 10000."
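The mapping into a shared concept space can be sketched with toy "aligned concepts". Real CL-ESA builds TFIDF vectors from language-aligned Wikipedia articles; the tiny term profiles here are invented stand-ins:

```python
import math
from collections import Counter

# Hypothetical aligned concepts: one term profile per language per concept.
CONCEPTS = {
    "Music":   {"en": Counter(song=3, guitar=2), "de": Counter(lied=3, gitarre=2)},
    "Cooking": {"en": Counter(recipe=3, oven=1), "de": Counter(rezept=3, ofen=1)},
}

def concept_vector(tokens, lang):
    # map a text into the shared concept space by term overlap with each concept
    return {c: float(sum(prof[lang][t] for t in tokens))
            for c, prof in CONCEPTS.items()}

def cosine(u, v):
    dot = sum(u[c] * v[c] for c in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

en_doc = concept_vector(["song", "guitar"], "en")
de_doc = concept_vector(["lied", "gitarre"], "de")   # German text, same topic
de_other = concept_vector(["rezept", "ofen"], "de")  # German text, other topic
```

Because both languages land in the same concept dimensions, cross-lingual cosine is meaningful: the same-topic German text scores higher against the English one than the off-topic German text does.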

## Wednesday, March 13, 2013

#### Wikipedia-based Semantic Interpretation for Natural Language Processing. Evgeniy Gabrilovich, Shaul Markovitch. JAIR 2009

(Noting here the details not mentioned in the entry for the IJCAI paper...)
• Corpus preprocessing
• Discard articles that have fewer than 100 non-stop words or fewer than 5 incoming and outgoing links; discard "articles that describe specific dates, as well as Wikipedia disambiguation pages, category pages and the like"; remove stop words and rare words (occurring in fewer than 3 articles), and stem the remaining words.
• For text categorization: Document text is first tokenized, and title words are replicated twice to emphasize their importance. Then, stop words, numbers and mixed alphanumeric strings are removed, and the remaining words are stemmed. The bag of words is next merged with the set of features generated for the document ... and rare features occurring in fewer than 3 documents are removed. ... The generated features ... undergo feature selection using the information gain criterion.
• Concept vector preprocessing
• "The algorithm for pruning the inverted index operates as follows. We first sort all the concepts for a given word according to their TFIDF weights in decreasing order. We then scan the resulting sequence of concepts with a sliding window of length 100, and truncate the sequence when the difference in scores between the first and last concepts in the window drops below 5% of the highest-scoring concept for this word (which is positioned first in the sequence). This technique looks for fast drops in the concept scores, which would signify that the concepts in the tail of the sequence are only loosely associated with the word"
• Ideas
• Use link structure---called "second order model"
• Keep only general concepts
• "using a larger knowledge base is beneficial for ESA"
• "We evaluated the effect of using second-order interpretation for computing semantic relatedness of texts, but it only yielded negligible improvements. We hypothesize that the reason for this finding is that computing semantic relatedness essentially uses all available Wikipedia concepts, so second-order interpretation can only slightly modify the weights of existing concepts ... [in] the application of ESA to text categorization, we trim the interpretation vectors for the sake of efficiency, and only consider a few highest-scoring concepts ... In this scenario, second-order interpretation does have a positive effect and actually improves the accuracy of text categorization ... This happens because only a few selected Wikipedia concepts are used ... and the second-order approach selectively adds highly related concepts identified by analyzing Wikipedia links."
• Evaluation
• Data sets for text categorization:
• Reuters-21578 with the ModApte split
• 20 Newsgroups (20NG): noisy.
• Movie Reviews (Movies) (Pang, Lee, & Vaithyanathan, 2002)---sentiment (rather than topical) classification
• Reuters Corpus Volume I (RCV1) (Lewis, Yang, Rose, & Li, 2004)
• OHSUMED, a subset of MEDLINE.
• Reporting results
• "precision-recall break-even point (BEP) to measure text categorization performance"
• "we report both micro- and macro-averaged BEP, since their categories differ in size significantly. Micro-averaged BEP operates at the document level and is primarily affected by categorization performance on larger categories ... macro-averaged BEP averages results for individual categories, and thus small categories with few training examples have large impact on the overall performance."
• "[for] a fixed train/test split [we] used macro sign test (S-test) (Yang & Liu, 1999) to assess the statistical significance of differences in classifier performance... [for] 4-fold cross-validation [we] used paired t-test ... [we] used the non-parametric Wilcoxon signed-ranks test (Demsar, 2006) to compare ... classifiers over multiple data sets."
• Interesting
• Analysis using examples: "For example, given ... phrase "scientific article" ... ESA determines ... the following Wikipedia concepts ... among the top 20"
• Related work
• Distributional similarity methods (Lee, 1999) compute the similarity of a pair of words w1 and w2 by comparing the distributions of other words given these two, e.g., by comparing vectors of probabilities P(v|w1) and P(v|w2) for a large vocabulary V of words (v \in V ).
• LSA [manipulates] a vector of ... latent concepts ... obtained through SVD ... of a word-by-document matrix. ... CYC represents semantics of words through an elaborate network of interconnected and richly-annotated concepts ... [and] depend[s] on manual encoding of inference rules.
• With the exception of LSA, most prior approaches to semantic interpretation explicitly represent semantics of individual words, and require an extra level of sophistication to represent longer texts.
• Sahami and Heilman (2006) ... send two snippets as queries to a search engine, and compares the bags of words for the two sets of returned documents. ... it is only applicable to short texts, because sending a long text as a query to a search engine is likely to return few or even no results at all.
• "The above-mentioned techniques are inherently limited to individual words, and their adaptation for comparing longer texts requires an extra level of complexity (Mihalcea et al., 2006)."
• Text categorization: Zelikovitz and Hirsh (2000) [use] unlabeled examples as intermediaries in comparing testing examples with the training ones. ... when an unknown test instance does not appear to resemble any labeled training instances, unlabeled examples that are similar to both may be used as "bridges." ... [when] the training and the test  document have few or no words in common ... unlabeled documents are utilized to define a cosine similarity metric, which is then used by the KNN algorithm for actual text categorization.
• Potthast, Stein, and Anderka (2008) and Sorg and Cimiano (2008) adapted ESA for multi-lingual and cross-lingual information retrieval.
• Zesch, Mueller, and Gurevych (2008) proposed to use Wiktionary for computing semantic relatedness.
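The sliding-window pruning heuristic quoted above (under "Concept vector preprocessing") can be sketched directly. The exact truncation point is one reading of "truncate the sequence", since the paper's wording leaves it slightly ambiguous:

```python
def prune_concepts(scored, window=100, drop=0.05):
    """Prune a word's concept list: sort by TFIDF descending, slide a window
    over the sequence, and cut once first-minus-last inside the window falls
    below drop * (highest score), i.e. once the scores have gone flat."""
    seq = sorted(scored, key=lambda cs: cs[1], reverse=True)
    if len(seq) < window:
        return seq
    top = seq[0][1]
    for i in range(len(seq) - window + 1):
        if seq[i][1] - seq[i + window - 1][1] < drop * top:
            # one reading of "truncate": keep concepts before the flat window
            return seq[:i] if i else seq[:window]
    return seq

# toy data: sharp drop after the 4th concept, then a long flat tail
scored = list(zip("abcdefghi", [10, 9, 8, 7, 1.0, 0.99, 0.98, 0.97, 0.96]))
kept = prune_concepts(scored, window=3)
```

On this toy list the flat tail of loosely associated concepts is cut, keeping only the four high-scoring ones.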

## Thursday, March 7, 2013

#### Recent Advances in Methods of Lexical Semantic Relatedness – a Survey. Ziqi Zhang, Anna Lisa Gentile, Fabio Ciravegna. NLE 2012

• Corpora: Wikipedia, Wiktionary, Wordnet, various biomedical corpora
• Methods:
• based on Path, Information Content, Gloss, Vector
• all methods use structure, mainly from Wordnet/Wikipedia
• Some methods that treat Wiki articles as concepts (and use no other structure)
• based on distributional similarity
• PMI, Chi-squared test
• Dice, Jaccard and Cosine (search engine based)
• hybrid
• combination: run each method separately, and then combine scores e.g. by linear combination
• integration: run each method separately, and then use scores as features in hybrid model
• Notes
• "distributional similarity methods ... have been used as a proxy for [methods of semantic relatedness]."
• Distinguish concept and word; model relatedness between concepts and between words separately; usually a (polysemous) word w has several associated concepts C(w), and the relatedness between words w1 and w2 would be some function of the relatedness between the concepts in C(w1) and C(w2).
• Distinguish similarity and relatedness between words/concepts; model them separately; also model distance.
• "two words are distributionally similar if (1) they tend to occur in each other’s context; or (2) the contexts each tends to occur in are similar; or (3) that if one word is substituted for another in a context, its “plausibility” is unchanged [(measured using search engines)]. Different methods have adopted different definitions of contexts ..."
• Method surveys: Weeds (2003), Turney and Pantel (2010)
• "Budanitsky and Hirst (2006) argued that there are three essential differences between [semantic relatedness and distributional similarity] ... Firstly, semantic relatedness is inherently a relation on concepts, while distributional similarity is a relation on words; secondly, semantic relatedness is typically symmetric, whereas distributional similarity can be potentially asymmetric; finally, semantic relatedness depends on a structured lexicographic or knowledge bases, distributional similarity is relative to a corpus."
• Evaluation
• In-vitro
• "In-vitro evaluation ... [i.e.] correlation with human judgement ... does not assess how well the method performs on real data ... Spearman correlation is a more robust measure ... [but] it may yield skewed results on datasets with many tied ranks."
• "we argue that vector based methods are generally superior to other[s]"
• In-vivo
• text similarity, word choice (e.g. TOEFL), WSD, sense clustering, IR: {document ranking, query expansion}, coreference resolution, ontology construction and matching, Malapropism detection
• "there is no strong evidence of a positive correlation between the ... [performance] in in-vitro evaluation ... and in in-vivo evaluation"
• Data sets: Rubenstein and Goodenough, Finkelstein et al., and many others; all were originally used for similarity (not relatedness)
• Tools
• Wikipedia: Parse::MediaWikiDump, Ponzetto and Strube (2007)
• DEXTRACT: creating evaluation datasets
• WordNet::Similarity

## Monday, February 11, 2013

#### Comparison of Semantic Similarity for Different Languages Using the Google n-gram Corpus and Second-Order Co-occurrence Measures. Colette Joubarne, Diana Inkpen. Advances in AI 2011

• Claims
• many languages lack sufficient corpora to achieve valid measures of semantic similarity
• manually-assigned similarity scores from one language can be transferred to another language
• automatic word similarity measure based on second-order co-occurrences in the Google n-gram corpus, for English, German, and French

#### Semantic similarity estimation from multiple ontologies. Montserrat Batet, David Sánchez, Aida Valls, Karina Gibert. Appl Intell 2013

• Claims
• enable similarity estimation across multiple ontologies
• solve missing values, when partial knowledge is available
• capture the strongest semantic evidence that results in the most accurate similarity assessment, when dealing with overlapping knowledge
• Key ideas
• Consider  sub-cases
• both concepts appear in one ontology
• concepts appear in different ontologies
• missing concepts
• etc.
• requires a taxonomy structure (other relations not useful?)
• Related work
• mapping the local terms of distinct ontologies into an existent single one
• creating a new ontology by integrating existing ones
• compute the similarity between terms as a function of some ontological features
• ontologies are connected by a new imaginary root node
• matching concept labels of different ontologies
• graph-based ontology alignment ... by means of path-based similarity measures.
• combines path length and common specificity.
• Experiments
• general purpose and biomedical benchmarks of word pairs
• baseline: related works in multi-ontology similarity assessment.
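The "imaginary root node" trick from the related-work list can be sketched with a generic path-length similarity (not Batet et al.'s exact measure): two taxonomies are joined under a shared root, and similarity decays with the shortest path through the first common subsumer.

```python
def path_to_root(term, parent):
    # walk up the is-a hierarchy until a term with no parent (the root)
    path = [term]
    while term in parent:
        term = parent[term]
        path.append(term)
    return path

def path_similarity(a, b, parent):
    pa = path_to_root(a, parent)
    depth_b = {t: i for i, t in enumerate(path_to_root(b, parent))}
    for i, t in enumerate(pa):          # first shared ancestor = common subsumer
        if t in depth_b:
            return 1.0 / (1.0 + i + depth_b[t])
    return 0.0

# two toy taxonomies joined by the imaginary node "ROOT"
parent = {
    "dog": "canine", "canine": "animal", "animal": "ROOT",      # ontology A
    "cat": "feline", "feline": "creature", "creature": "ROOT",  # ontology B
}
```

Cross-ontology pairs are now comparable (through ROOT), but at a distance penalty relative to pairs within one taxonomy, which is why more refined multi-ontology measures exist.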

## Friday, February 8, 2013

#### A Graph-Theoretic Framework for Semantic Distance. Vivian Tsang, Suzanne Stevenson. CL 2010

• Problem: similarity of texts (not single words)
• Claims
• "[we do] integration of distributional and ontological factors in measuring semantic distance between two sets of concepts (mapped from two texts) [within a network flow formalism]"
• Key ideas
• "Our goal is to measure the distance between two subgraphs (representing two texts to be compared), taking into account both the ontological distance between the component concepts and their frequency distributions. To achieve this, we measure the amount of “effort” required to transform one profile to match the other graphically: The more similar they are, the less effort it takes to transform one into the other. (This view is similar to that motivating the use of “earth mover’s distance” in computer vision [Levina and Bickel 2001].)"
• "[our] notion of semantic distance as transport effort of concept frequency over the relations (edges) of an ontology differs significantly from ... [using] concept vectors of frequency. ... our approach can [compare] texts that use related but non-equivalent concepts." (seems to be the main argument in favor of graph-based iterative methods)
• "viewed as a supply–demand problem, in which we find the minimum cost flow (MCF) from the supply profile to the demand profile ... Each edge ... has a cost ... Each node [has a] supply ... [or] demand ... The goal is to find a flow from supply nodes to demand nodes that satisfies the supply/demand constraints of each node and minimizes the overall “transport cost.”"
• Requires an ontology
• "Distributional" refers to term frequencies within the compared text (not in some corpus)
• Interesting papers
• Using network flows
• Pang, Bo and Lillian Lee. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. ACL 2004
• Barzilay, Regina and Mirella Lapata. Collective content selection for concept-to-text generation. HLT/EMNLP 2005
• Mihalcea, Rada. Unsupervised large-vocabulary word sense disambiguation with graph-based algorithms for sequence data labeling. HLT/EMNLP 2005
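The "transport effort" intuition can be illustrated in the special case where concepts lie on a chain with unit-cost edges: there the minimum-cost flow has a closed form (sum of absolute cumulative differences). This is a sketch of the idea only, not the paper's general MCF over an ontology graph:

```python
def chain_transport_cost(supply, demand):
    # minimum effort to transform one concept-frequency profile into another
    # when concepts sit on a chain: push the running surplus one edge at a time
    assert abs(sum(supply) - sum(demand)) < 1e-9, "profiles must have equal mass"
    cost = carry = 0.0
    for s, d in zip(supply, demand):
        carry += s - d          # surplus carried over to the next concept
        cost += abs(carry)      # effort of moving it across one edge
    return cost
```

Moving all mass across two edges costs 2, identical profiles cost 0, and nearby concepts cost less than distant ones, which is exactly the sensitivity to "related but non-equivalent concepts" that a flat concept-vector comparison misses.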

#### Structural Semantic Relatedness: A Knowledge-Based Method to Named Entity Disambiguation. Xianpei Han, Jun Zhao. ACL 2010

• Claims
• "proposes a reliable semantic relatedness measure between concepts ... which can capture both the explicit semantic relations between concepts and the implicit semantic knowledge embedded in [multiple] graphs and networks."
• Key ideas
• “two concepts are semantic related if they are both semantic related to the neighbor concepts of each other”
• Interesting papers
• Amigo, E., Gonzalo, J., Artiles, J. and Verdejo, F. A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval 2008

#### Disambiguating Identity Web References using Web 2.0 Data and Semantics. Matthew Rowe, Fabio Ciravegna. Journal of Web Semantics 2010

• Use ideas such as "Average First-Passage Time" of a graph
• Interesting papers
• L. Lovasz, Random walks on graphs: A survey. Combinatorics 1993
• M. Saerens, F. Fouss, L. Yen, P. Dupont, The principal components analysis of a graph, and its relationships to spectral clustering. ECML 2004

## Thursday, February 7, 2013

#### A Random Graph Walk based Approach to Computing Semantic Relatedness Using Knowledge from Wikipedia. Ziqi Zhang, Anna Lisa Gentile, Lei Xia, José Iria, Sam Chapman. LREC 2010

• Key ideas
• Model many kinds of features on a graph
• Convert edge weights into probabilities; use the t-step transition probability p^{(t)}(i|j) to model relatedness (where t is the number of steps in the walk)
• Interesting papers
• Hughes, T., Ramage, D. Lexical semantic relatedness with random graph walks. EMNLP-CONLL 2007
• Weale, T., Brew, C., and Fosler-Lussier, E. (2009). Using the Wiktionary Graph Structure for Synonym Detection. ACL-IJCNLP 2009 (applies page rank to sem-rel)

#### Lexical Semantic Relatedness with Random Graph Walks. Thad Hughes and Daniel Ramage. EMNLP-CONLL 2007

• Key ideas
• "[We] compute word-specific probability distributions over how often a particle visits all other nodes in the graph when “starting” from a specific word. We compute the relatedness of two words as the similarity of their stationary distributions."
• Construct graph-cum-Markov chain (stochastic matrix) for each edge-type. Add the matrices and normalize. (Is there a case for weighted combination?)
• "to compute [word-specific] stationary distribution ... we [start at the word, and] at every step of the walk, we will return to [it] with probability \beta . Intuitively, ... nodes close to [the word] should be given higher weight ... also guarantees that the stationary distribution exists and is unique (Bremaud, 1999)."
• "because [the matrix] is sparse, each iteration of the above computation is ... linear in the total number of edges. Introducing an edge type that is dense would dramatically increase running time."
• Claims
• "the application of random walk Markov chain theory to measuring lexical semantic relatedness"
• "[past work] has only considered using one stationary distribution per specially-constructed graph as a probability estimator ... we [use] distinct stationary distributions resulting from random walks centered at different positions in the word graph."
• Evaluation
• "For consistency with previous literature, we use rank correlation (Spearman’s coefficient) ... because [it models the] ordering of the scores ... many applications that make use of lexical relatedness scores (e.g. as features to a machine learning algorithm) would better be served by scores on a linear scale with human judgments."
• Interesting papers
• Julie Weeds and David Weir. Co-occurrence retrieval: A flexible framework for lexical distributional similarity. CL 2005 (survey of co-occurrence-based distributional similarity measures of semantic relatedness)
• P. Berkhin. A survey on pagerank computing. Internet Mathematics 2005
• Lillian Lee. On the effectiveness of the skew divergence for statistical language analysis. Artificial Intelligence and Statistics 2001 (surveys divergence measures for distributions)
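The walk-with-restart described above can be sketched with plain power iteration over a sparse adjacency dict. The toy word graph and the cosine comparison of stationary distributions are illustrative, not the WordNet-derived graph the paper builds:

```python
import math

def ppr(adj, teleport, beta=0.15, iters=60):
    """Stationary distribution of a walk that, at each step, returns to the
    `teleport` distribution with probability beta (guarantees uniqueness)."""
    out = {u: sum(nbrs.values()) for u, nbrs in adj.items()}
    r = dict(teleport)
    for _ in range(iters):
        nxt = {u: beta * teleport.get(u, 0.0) for u in adj}
        for u, ru in r.items():
            for v, w in adj[u].items():
                nxt[v] += (1 - beta) * ru * w / out[u]
        r = nxt
    return r

def cos(u, v):
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    return dot / math.sqrt(sum(w * w for w in u.values())
                           * sum(w * w for w in v.values()))

# toy undirected word graph; "car"/"road" form a separate component
adj = {"cat": {"pet": 1}, "dog": {"pet": 1},
       "pet": {"cat": 1, "dog": 1, "animal": 1}, "animal": {"pet": 1},
       "car": {"road": 1}, "road": {"car": 1}}
d_cat = ppr(adj, {"cat": 1.0})
d_dog = ppr(adj, {"dog": 1.0})
d_car = ppr(adj, {"car": 1.0})
```

Each iteration touches only the graph's edges, matching the note that the cost per iteration is linear in the number of edges; `cat` and `dog` end up with similar distributions, while `car` shares no support with either.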

#### Random Walks for Text Semantic Similarity. Daniel Ramage, Anna N. Rafferty, and Christopher D. Manning. ACL-IJCNLP 2009

• Claims
• "random graph walk algorithm for semantic similarity of texts [(not words)] ... [faster than a] mathematically equivalent model based on summed similarity judgments of individual words."
• "walks effectively aggregate information over multiple types of links and multiple input words"
• Key ideas
• "determine an initial distribution ... for a [given text passage,] ... simulate a random walk ... we compare the resulting stationary distributions from two such walks".
• "We compute the stationary distribution ... with probability \beta of returning to the initial distribution at each time step as the limit as t goes to \infty. ... the resulting stationary distribution can be factored as the weighted sum of the stationary distributions of each word represented in the initial distribution ... [it can be shown that] the stationary distribution is ... the weighted sum of the stationary distribution of each underlying word"
• Evaluation
• "We add the additional baseline of always guessing the majority class label because the data set is skewed toward “paraphrase”."
• Another measure for comparing distributions: the dice measure extended to weighted features (Curran, 2004).
• "The random walk framework smoothes an initial distribution of words into a much larger lexical space. In one sense, this is similar to the technique of query expansion used in information retrieval ... this expansion is analogous to taking only a single step of the random walk."
• "The random walk framework can be used to evaluate changes to lexical resources ... this provides a more semantically relevant evaluation of updates to a resource than, for example, counting how many new words or links between words have been added."
• Interesting papers
• T. H. Haveliwala. Topic-sensitive pagerank. WWW 2002 (computational steps are similar; see teleport vector)
• R. Mihalcea, C. Corley, and C. Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. AAAI 2006 (includes a survey of semantic similarity measures)
• E. Minkov and W. W. Cohen. Learning to rank typed graph walks: Local and global approaches. WebKDD and SNA-KDD 2007.
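The factorization claim above (the stationary distribution of a text equals the weighted sum of its words' stationary distributions) follows from the linearity of the walk update, and is easy to check numerically. A toy 3-node chain, not the paper's WordNet graph:

```python
def ppr(P, tele, beta=0.2, iters=60):
    # power iteration; P is row-stochastic, walk restarts into `tele` w.p. beta
    n = len(P)
    r = tele[:]
    for _ in range(iters):
        r = [beta * tele[j]
             + (1 - beta) * sum(r[i] * P[i][j] for i in range(n))
             for j in range(n)]
    return r

# 3-node chain: 0 - 1 - 2
P = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 1.0, 0.0]]
d0 = ppr(P, [1.0, 0.0, 0.0])      # walk centered on word 0
d2 = ppr(P, [0.0, 0.0, 1.0])      # walk centered on word 2
dmix = ppr(P, [0.5, 0.0, 0.5])    # "text" = {word 0, word 2}, equal weights
```

Since every update step is linear in the teleport vector, `dmix` coincides (up to float rounding) with `0.5*d0 + 0.5*d2`, which is what makes the per-word caching speedup possible.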

#### WikiWalk: Random walks on Wikipedia for Semantic Relatedness. Eric Yeh, Daniel Ramage, Christopher D. Manning, Eneko Agirre, Aitor Soroa. ACL-IJCNLP 2009

• Claims
• "we base our random walk algorithms after the ones described in (Hughes and Ramage, 2007) and (Agirre et al., 2009), but use Wikipedia-based methods to construct the graph."
• "evaluates methods for building the graph, including link selection strategies, and two methods for representing input texts as distributions over the graph nodes"
• "previous work on Wikipedia has made limited use of [link] information,"
• "Our results show the importance of pruning the dictionary" (i.e. words from Wikipedia chosen as nodes)
• Key ideas
• computation of relatedness for a word pair has three steps
• each input word/text ... [is converted to a] teleport vector.
• Personalized PageRank is ... [run to get] the stationary distribution
• the stationary distributions [are compared]
• Interesting papers
• E. Agirre and A. Soroa. Personalizing pagerank for word sense disambiguation. EACL 2009
• M. D. Lee, B. Pincombe, and M. Welsh. An empirical evaluation of models of text document similarity. Cognitive Science Society, 2005 (data set)

## Wednesday, February 6, 2013

#### A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches. Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca, Aitor Soroa. NAACL-HLT 2009

• Claims
• a supervised combination of [our methods] yields the best published results on all datasets
• we pioneer cross-lingual similarity
• A discussion on the differences between learning similarity and relatedness scores
• Cross lingual similarity
• WordNet-based: since it is a multilingually aligned WordNet, the monolingual methods are directly applicable
• Distributional: translate target into source language (using machine translation) and then use monolingual method
• Results
• [among distributional methods], the method based on context windows provides the best results for similarity, and the bag-of-words representation [does best] for relatedness.
• upper-bounding combined performance: "we took the output[s] of three systems ... we implemented an oracle that chooses [among the outputs] ... the rank that is most similar to the rank of the pair in the gold-standard. ... gives [us] an indication of the correlations that could be achieved by choosing for each pair the rank output by the best classifier for that pair."
• On evaluation
• "Pearson correlation suffers much when the scores of two systems are not linearly correlated, [e.g.] due to the different nature of the techniques applied ... Spearman correlation provides an evaluation metric that is independent of such data-dependent transformations"
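The Pearson-vs-Spearman point above can be demonstrated with a toy: a monotone but nonlinear rescaling of a system's scores leaves Spearman unchanged while lowering Pearson. The gold scores and system outputs below are invented.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(xs):
    """Rank positions (no ties in these toy examples)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman correlation = Pearson on the ranks."""
    return pearson(ranks(xs), ranks(ys))

gold = [1.0, 2.0, 3.0, 4.0, 5.0]
sys_a = [0.1, 0.3, 0.35, 0.7, 0.9]   # roughly linear in the gold scores
sys_b = [s ** 8 for s in sys_a]      # monotone but highly nonlinear
```

Both systems rank the pairs identically, so Spearman treats them as equivalent; Pearson penalizes the nonlinear one.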

#### Lexical Co-occurrence, Statistical Significance, and Word Association. Dipak L. Chaudhari, Om P. Damani, Srivatsan Laxman. EMNLP 2011

• Claims
• We propose a new measure of word association based on a new notion of statistical significance for lexical co-occurrences.
• We ... construct a significance test that allows us to detect different kinds of co-occurrences within a single unified framework
• Key ideas
• Existing co-occurrence measures ... assume that each document is drawn from a multinomial distribution based on global unigram frequencies ... [The problem with this] is
• the overbearing influence of the unigram frequencies on the detection of word associations. For example, the association between anomochilidae (dwarf pipe snakes) and snake could go undetected ... since less than 0.1% of the pages containing snake also contained anomochilidae.
• the expected span of a word pair is very sensitive to the associated unigram frequencies: the expected span of a word pair composed of low frequency unigrams is much larger than that with high frequency unigrams. This is contrary to how word associations appear in language, where semantic relationships manifest with small inter-word distances irrespective of the underlying unigram distributions.
• To solve the above, "we employ a null model that represents each document as a bag of words"
• A random permutation of the associated bag of words gives a linear representation for the document.
• If the observed span distribution of a word-pair resembles that under the (random permutation) null model, then the relation between the words is not strong enough for one word to influence the placement of the other.
• Experiments
• New data sets (from the "free association" problem)
• Edinburgh (Kiss et al., 1973), Florida (Nelson et al., 1980), Goldfarb-Halpern (Goldfarb and Halpern, 1984), Kent (Kent and Rosanoff, 1910), Minnesota (Russell and Jenkins, 1954), White-Abrams (White and Abrams, 2004)
• The basic approach in this kind of modeling: "We need a null hypothesis that can account for an observed co-occurrence as a pure chance event and this in-turn requires a corpus generation model. Documents in a corpus can be assumed to be generated independent of each other."
• Comprehensive list of co-occurrence measures
• CSR, CWCD (Washtell and Markert, 2009), Dice (Dice, 1945), LLR (Dunning, 1993), Jaccard (Jaccard, 1912), Ochiai (Janson and Vegelius,1981), Pearson’s X^2 test, PMI (Church and Hanks, 1989), SCI (Washtell and Markert, 2009), T-test
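The permutation null model above can be sketched as follows: compare the observed span (distance between nearest occurrences) of a word pair against its span distribution under random shuffles of the same bag of words. The toy document, the minimum-span statistic, and the trial count are illustrative choices, not the paper's exact test.

```python
import random

def min_span(doc, w1, w2):
    """Smallest distance between an occurrence of w1 and one of w2."""
    pos1 = [i for i, w in enumerate(doc) if w == w1]
    pos2 = [i for i, w in enumerate(doc) if w == w2]
    return min(abs(i - j) for i in pos1 for j in pos2)

def null_spans(doc, w1, w2, trials=2000, seed=0):
    """Span distribution under the bag-of-words (permutation) null model."""
    rng = random.Random(seed)
    doc = list(doc)
    spans = []
    for _ in range(trials):
        rng.shuffle(doc)
        spans.append(min_span(doc, w1, w2))
    return spans

# "snake" and "anomochilidae" occur adjacently in this toy document,
# even though both are rare relative to the filler words.
doc = ["snake", "anomochilidae"] + ["the"] * 48
observed = min_span(doc, "snake", "anomochilidae")
null = null_spans(doc, "snake", "anomochilidae")
p_value = sum(s <= observed for s in null) / len(null)
```

A small `p_value` says the observed tight span is unlikely under random placement, so the pair is flagged as associated regardless of how rare the unigrams are.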

#### Harnessing different knowledge sources to measure semantic relatedness under a uniform model. Ziqi Zhang, Anna Lisa Gentile, Fabio Ciravegna. EMNLP 2011

• Claims
• introduces a method of harnessing different knowledge sources  under a uniform model for measuring semantic relatedness between words or concepts.
• we identify two issues that have not been addressed in the previous works. First, existing works typically employ a single knowledge  source of semantic evidence ... Second, ... evaluated in general domains only ... evaluation ... in specific domains is ... important.
• Key ideas
• knowledge from different sources are mapped into a graph representation in 3 stages, and a general graph-based (random walk) algorithm is used for final relatedness computation
• Random walk: "formalizes the idea that taking successive steps along the paths in a graph, the “easier” it is to arrive at a target node starting from a source node, the more related the two nodes are ... P(t)(j|i) [is] the probability of reaching other nodes from a starting node on the graph after t steps ... [following] Rowe and Ciravegna (2010) ... set t=2 in order to preserve locally connected nodes ... Effectively, this formalizes the notion that two concepts related to a third concept [are] also semantically related [similar to] Patwardhan and Pedersen (2006)".
• The stages involve “feature integration” as merging feature types from different knowledge sources into single types of features based on their similarity in semantics.
• "the difference between cross-source feature combination and integration is that the former introduces more types of features, whereas the latter retains same number of feature types but increases feature values for each type. Both have the effect of establishing additional path (via features) between concepts, but in different ways."
• "Zhang et al. (2010) argue that ... different knowledge sources may complement each other."
• "evaluation of [semantic relatedness] methods in specific domains is increasingly important" (They also evaluate on (biomedical) domain-specific data sets)
• "Wikipedia ... [has] reasonable coverage of many domains (Holloway et al., 2007; Halavais, 2008)."
• Classifies SR approaches
• path based: use wordnet-like semantic network
• Information Content (IC) based: use taxonomy (a special case of network) and a corpus
• statistical
• distributional
• co-occurrence-based
• hybrid: combine the above, e.g. Riensche et al. (2007), Pozo et al. (2008), Han and Zhao (2010). Note: the idea of combining methods is distinguished from the idea of combining knowledge sources.
• Evaluation
• Data sets: general (Rubenstein and Goodenough, Miller and Charles, Finkelstein et al.) and biomedical (Petrakis et al. (2006), Pedersen et al. (2006))
• Measure: Spearman correlation ("better metric ... (Zesch and Gurevych, 2010)")
• "some datasets have a ... low sample size, ... correlation values [could have] occurred by chance. Therefore, we measure the statistical significance of correlation by computing the p-value for the correlation values"
• Interesting papers
• Random walk for semantic relatedness
• Zhang, Z., Gentile, A., Xia, L., Iria, J., Chapman, S. A random graph walk based approach to compute semantic relatedness using knowledge from Wikipedia. LREC 2010. (compare this with Manning's paper)
• Rowe, M., Ciravegna, F. Disambiguating identity web references using Web 2.0 data and semantics. The Journal of Web Semantics 2010
• Hybrid methods
• Tsang, V., Stevenson, S. A graph-theoretic framework for semantic distance. CL 2010
• Han, X., Zhao, J. Structural semantic relatedness: a knowledge-based method to named entity disambiguation. ACL 2010
• Patwardhan, S., Pedersen, T. 2006. Using WordNet-based context vectors to estimate the semantic relatedness of concepts. EACL 2006 ("second-order context")
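The t=2 walk in the Key ideas above can be sketched minimally: relatedness of i and j is read off P(2)(j|i), the probability of reaching j from i in two steps, so two concepts sharing a neighbor become related even without a direct edge. The tiny graph is invented for illustration.

```python
# toy concept graph: node -> neighbors (invented example)
edges = {
    "jaguar": ["cat"],
    "tiger": ["cat"],
    "cat": ["jaguar", "tiger"],
}
nodes = sorted(edges)

def step_prob(i, j):
    """One-step transition probability P(j|i): uniform over i's neighbors."""
    return 1.0 / len(edges[i]) if j in edges[i] else 0.0

def two_step_prob(i, j):
    """P(2)(j|i) = sum over k of P(k|i) * P(j|k)."""
    return sum(step_prob(i, k) * step_prob(k, j) for k in nodes)
```

Here "jaguar" and "tiger" have no direct edge (one-step probability 0), but both connect to "cat", so the two-step probability is positive.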

## Tuesday, February 5, 2013

#### Cross-lingual Semantic Relatedness Using Encyclopedic Knowledge. Samer Hassan and Rada Mihalcea. EMNLP 2009

• Key Ideas
• Introduce the problem of cross-lingual semantic relatedness.
• Map words in different languages to their concept vectors (concepts are Wikipedia articles, similar to Gabrilovich and Markovitch, AAAI 2007). Map concepts using Wikipedia langlinks. The vectors are now comparable.
• Wikipedia data accessed using Wikipedia Miner.
• Experiments
• WS-30 (G. Miller and W. Charles. Contextual correlates of semantic similarity. Language and Cognitive Processes 1991) and WS-353 (L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin. Placing search in context: the concept revisited. WWW 2001) semantic similarity evaluation sets were translated and used for evaluation.
• Detailed description of creation of data sets and evaluation sets (including instructions given to annotators).
• Also devise an "obvious" baseline which illustrates where their method helps.
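The mapping above can be sketched as a toy: words in each language get sparse concept vectors over that language's Wikipedia articles; langlinks map target-language article ids onto source-language ids, after which the vectors are comparable via cosine. All vectors and links below are invented values, not real ESA weights.

```python
import math

# invented concept vectors (concept = Wikipedia article title)
en_vec = {"moon": {"Moon": 0.9, "Tide": 0.4}}
es_vec = {"luna": {"Luna": 0.8, "Marea": 0.5}}
langlink = {"Luna": "Moon", "Marea": "Tide"}   # es article -> en article

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

def cross_lingual_relatedness(en_word, es_word):
    # re-key the Spanish vector into English concept space via langlinks
    mapped = {langlink[a]: w for a, w in es_vec[es_word].items()
              if a in langlink}
    return cosine(en_vec[en_word], mapped)
```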

#### Evaluating WordNet-based Measures of Lexical Semantic Relatedness. Alexander Budanitsky, Graeme Hirst. CL 2006

• Distinguishing semantic similarity and relatedness
• "... semantic relatedness is a more general concept than similarity; similar entities are semantically related by virtue of their similarity (bank–trust company), but dissimilar entities may also be semantically related by lexical relationships such as meronymy (car–wheel) and antonymy (hot–cold), ..."
• "the more-general idea of relatedness, not just similarity ... not just ... relationships in WordNet ... but also associative and ad hoc relationships ... just about any kind of functional relation or frequent association in the world. ... Morris and Hirst (2004, 2005) have termed these non-classical lexical semantic relationships ... shown in experiments ... that around 60% of the lexical relationships ... in a text are of this nature."
• "[A study found that] the words sex, drinking, and drag racing were semantically related, by all being “dangerous behaviors”, in the context of an article about teenagers emulating what they see in movies. Thus lexical semantic relatedness is sometimes constructed in context and cannot always be determined purely from an a priori lexical resource ... However, [such] ad hoc relationships accounted for only a small fraction of those reported [in the study]"
• "... in this paper the term concept will refer to a particular sense of a given word. ... when we say that two words are “similar”, ... they denote similar concepts; ... [and] not ... similarity of distributional or co-occurrence behavior of the words, ... While similarity of denotation might be inferred from similarity of distributional or co-occurrence behavior (Dagan 2000; Weeds 2003), the two are distinct ideas."
• "All approaches to measuring semantic relatedness that use a lexical resource construe the resource, in one way or another, as a network or directed graph, and then base the measure of relatedness on properties of paths in this graph." (Compare with probabilistic approaches.)
• Relating semantic relatedness and distributional similarity
• "Weeds (2003), in her study of 15 distributional-similarity measures, found that words distributionally similar to hope (noun) included confidence, dream, feeling, and desire; Lin (1998b) found pairs such as earnings–profit, biggest–largest, nylon–silk, and pill–tablet. ... if two concepts are similar or related, it is likely that their role in the world will be similar, so similar things will be said about them, and so the contexts of occurrence of the corresponding words will be similar. And conversely (albeit with less certainty), if the contexts of occurrence of two words are similar, then similar things are being said about each, so they are playing similar roles in the world and hence are semantically similar — at least to the extent of these roles."
• Differences between the two
• "while semantic relatedness is inherently a relation on concepts, ... distributional similarity is a (corpus-dependent) relation on words."
• "whereas semantic relatedness is symmetric, distributional similarity is a potentially asymmetrical relationship. If distributional similarity is conceived of as substitutability, ... then asymmetries arise ...; for example, ... fruit substitutes for apple better than apple substitutes for fruit."
• "Imbalance in the corpus and data sparseness is an additional source of anomalous results even for “good” measures."
• Evaluation issues
• "severe limitation on the data means that this was not really a fair test of the principles underlying the [distributional] hypothesis; a fair test would require data allowing the comparison of any ... two words in WordNet, but obtaining such [corpus] data for less-frequent words ... would be a massive task."
• Lists 3 kinds of evaluation
• "theoretical examination ... for ... mathematical properties thought desirable, such as whether it is a metric ..., whether it has singularities, whether its parameter-projections are smooth functions, ..."
• "comparison with human judgments. Insofar as human judgments of similarity and relatedness are deemed to be correct by definition, this clearly gives the best assessment of the “goodness” of a measure."
• "evaluate ... with respect to ... performance in the framework of a particular application."
• "While comparison with human judgments is the ideal way to evaluate a measure of similarity or semantic relatedness, in practice the tiny amount of data available (and only for similarity, not relatedness) is quite inadequate." and "Finkelstein [-353] ... is still very small, and, as Jarmasz and Szpakowicz (2003) point out, is culturally and politically biased."
• "... often what we are really interested in is the relationship between the concepts for which the words are merely surrogates; the human judgments that we need are of the relatedness of word-senses, not words. So the experimental situation would need to set up contexts that bias the sense selection for each target word and yet don’t bias the subject’s judgment of their a priori relationship, an almost self-contradictory situation." (and hence justifying extrinsic evaluation)
• Application to malapropism detection
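The survey's common pattern above (construe the resource as a graph, base relatedness on path properties) can be sketched as 1 / (1 + shortest-path length) on a toy graph. The graph and the conversion formula are illustrative choices, not WordNet or any specific measure from the paper.

```python
from collections import deque

# toy lexical graph mixing similarity and other relations (invented)
graph = {
    "car": ["vehicle", "wheel"],
    "vehicle": ["car", "bicycle"],
    "bicycle": ["vehicle"],
    "wheel": ["car"],
    "hot": ["cold"],
    "cold": ["hot"],
}

def shortest_path_len(a, b):
    """Breadth-first search for the shortest path length, or None."""
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:
        node, d = frontier.popleft()
        if node == b:
            return d
        for n in graph[node]:
            if n not in seen:
                seen.add(n)
                frontier.append((n, d + 1))
    return None  # disconnected

def relatedness(a, b):
    d = shortest_path_len(a, b)
    return 0.0 if d is None else 1.0 / (1.0 + d)
```

car–wheel (one hop) scores higher than car–bicycle (two hops), and disconnected pairs score zero.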

## Monday, February 4, 2013

#### Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. Evgeniy Gabrilovich and Shaul Markovitch. IJCAI 2007

• Classifies work in the field into three main directions:
• text fragments as bags of words in vector space (distributional similarity)
• text fragments as bags of concepts (using Latent Semantic Analysis)
• using lexical resources (Wordnet etc.) (also use concepts but not based on world knowledge, and work only at the level of individual words; also, it relies on human-organized knowledge)
• Distinguishes similarity and relatedness, e.g.
• "Budanitsky and Hirst [2006] argued that the notion of relatedness is more general than that of similarity, as the former subsumes many different kind of specific relations, including meronymy, antonymy, functional association, and others. They further maintained that computational linguistics applications often require measures of relatedness rather than the more narrowly defined measures of similarity."
• "When only the similarity relation is considered, using lexical resources was often successful enough, ... However, when the entire language wealth is considered in an attempt to capture more general semantic relatedness, lexical techniques yield substantially inferior results"
• Interesting papers
• Statistical methods
• S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. JASIS 1990
• Lillian Lee. Measures of distributional similarity. ACL 1999.
• Ido Dagan, Lillian Lee, and Fernando C. N. Pereira. Similarity-based models of word cooccurrence probabilities. ML 1999
• Expert-based methods (e.g. using Wordnet)
• Alexander Budanitsky and Graeme Hirst. Evaluating wordnet-based measures of lexical semantic relatedness. CL 2006
• Mario Jarmasz. Roget’s thesaurus as a lexical resource for natural language processing. Master's Thesis 2003
• Others
• Justin Zobel and Alistair Moffat. Exploring the similarity space. ACM SIGIR Forum, 1998. [for justifying the cosine metric?]

#### BabelRelate! A Joint Multilingual Approach to Computing Semantic Relatedness. Roberto Navigli and Simone Paolo Ponzetto. AAAI 2012

• Key claims
• Our approach is based on ... a ... multilingual knowledge base, which is used to compute semantic graphs in a variety of languages. ... information from these graphs is then combined to produce ... disambiguated translations [which] are connected by means of strong semantic relations.
• ... what we explore here is the joint contribution obtained by using a multilingual knowledge base for this [(cross language semantic relatedness)] task.
• Given a pair of words in two languages we use BabelNet to collect their translations, compute semantic graphs in a variety of languages, and then combine the empirical evidence from these different languages by intersecting their respective graphs.
• Interesting papers
• Hassan, S., and Mihalcea, R. Cross-lingual semantic relatedness using encyclopedic knowledge. EMNLP 2009 (introduces knowledge-based approach to computing semantic relatedness across different languages)
• Agirre, E.; Alfonseca, E.; Hall, K.; Kravalova, J.; Pasca, M.; Soroa, A. A study on similarity and relatedness using distributional and WordNet-based approaches. NAACL-HLT 2009 (seminal finding that knowledge-based approaches to semantic relatedness can compete and even outperform distributional methods in a cross-lingual setting)
• Nastase, V.; Strube, M.; Börschinger, B.; Zirn, C.; and Elghafari, A. WikiNet: A very large scale multi-lingual concept network. LREC 2010.
• de Melo, G., and Weikum, G. MENTA: Inducing multilingual taxonomies from Wikipedia. CIKM 2010

#### WikiRelate! Computing Semantic Relatedness Using Wikipedia. Michael Strube and Simone Paolo Ponzetto. AAAI 2006

• Key ideas
• Use the Wikipedia category hierarchy instead of the WordNet hierarchy.

#### Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. Philip Resnik. JAIR 1999

• Key Ideas
• Semantic similarity as a special case of semantic relatedness (relation is IS-A)
• For example, car-gasoline are related, but car-bicycle are similar.
• "... measures of similarity ... are seldom accompanied by an independent characterization of the phenomenon they are measuring ... The worth of a similarity measure is in its fidelity to human behavior, as measured by predictions of human performance on experimental tasks."
• Note
• polysemy as a special case of homonymy (the different meanings have a common aspect, and are hence called 'senses')
• For example, 'man' is polysemous (species, gender, adult) while 'bank' is homonymous (river-edge, money-place).
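An information-content measure in Resnik's spirit can be sketched as follows: similarity of two concepts is the information content, -log p(c), of their most informative common subsumer in an IS-A taxonomy. The toy taxonomy and the corpus-derived probabilities below are invented values.

```python
import math

# invented IS-A taxonomy and concept probabilities
parent = {"car": "vehicle", "bicycle": "vehicle",
          "vehicle": "entity", "gasoline": "entity"}
p = {"entity": 1.0, "vehicle": 0.2, "car": 0.05,
     "bicycle": 0.05, "gasoline": 0.1}

def ancestors(c):
    """The concept itself plus all its IS-A ancestors."""
    out = {c}
    while c in parent:
        c = parent[c]
        out.add(c)
    return out

def resnik_sim(c1, c2):
    """IC of the most informative common subsumer."""
    common = ancestors(c1) & ancestors(c2)
    return max(-math.log(p[c]) for c in common)
```

This reproduces the note's example: car–bicycle share the informative subsumer "vehicle", while car–gasoline meet only at the root, so they are related in other ways but not IS-A-similar.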

#### An Approach for Measuring Semantic Similarity between Words Using Multiple Information Sources. Yuhua Li, Zuhair A. Bandar, and David McLean. IEEE KDE 2003

• Key Ideas
• "Similarity between two words is often represented by similarity between concepts associated with the two words."
• "Evidence from psychological experiments demonstrate that similarity is context-dependent and may be asymmetric ... Experimental results investigating the effects of asymmetry suggest that the average difference in ratings for a word pair is less than 5 percent"

## Friday, February 1, 2013

#### Learning Discriminative Projections for Text Similarity Measures. Wen-tau Yih, Kristina Toutanova, John C. Platt, Christopher Meek. CoNLL 2011

• Claims:
• We propose a new projection learning framework, Similarity Learning via Siamese Neural Network (S2Net), to discriminatively learn the concept vector representations of input text objects.
• Comment:
• Input is pairs of words that are known to be similar/dissimilar.
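The core idea can be sketched without the training machinery: one shared ("Siamese") linear projection maps both raw term vectors into a low-dimensional concept space, and similarity is the cosine there; training (omitted here) would adjust the projection so labeled similar pairs score high. The projection matrix and vectors are toy values, not learned S2Net parameters.

```python
import math

# toy projection: 3-term vocabulary -> 2-d concept space (invented weights)
W = [[1.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

def project(x):
    """Apply the shared projection W to a raw term vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def siamese_score(x, y):
    """Similarity in concept space; both sides share the same W."""
    return cosine(project(x), project(y))

# two one-hot "documents" that share no raw term, but whose terms the
# first projection row maps onto the same concept dimension
doc1, doc2 = [1, 0, 0], [0, 1, 0]
```

Raw cosine of `doc1` and `doc2` is zero; after projection they coincide, which is the effect a learned projection is meant to achieve for known-similar pairs.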

#### Web-Scale Distributional Similarity and Entity Set Expansion. Patrick Pantel, Eric Crestan, Arkady Borkovsky, Ana-Maria Popescu, Vishnu Vyas. EMNLP 2009

• Claims
• propose an algorithm for large-scale term similarity computation
• Lists applications of semantic similarity: word classification, word sense disambiguation, context-spelling correction, fact extraction, semantic role labeling, query expansion, textual advertising
• they apply the learned similarity matrix to the task of automatic set expansion

#### Corpus-based Semantic Class Mining: Distributional vs. Pattern-Based Approaches. Shuming Shi, Huibin Zhang, Xiaojie Yuan, Ji-Rong Wen. COLING 2010

• Claims
• perform an empirical comparison of [previous research work] [on semantic class mining]
• propose a frequency-based rule to select appropriate approaches for different types of terms.
• Jargon: semantically similar words are also called "peer terms or coordinate terms".
• States that "DS [distributional similarity] approaches basically exploit second-order co-occurrences to discover strongly associated concepts." How is that?
• Extrinsic evaluation by set expansion
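A possible answer to the "How is that?" question above, as a toy: two words that never co-occur in the same sentence can still share context words (second-order co-occurrence), making their context vectors similar. The two-sentence corpus is invented.

```python
import math

# "car" and "automobile" never appear in the same sentence
corpus = [
    ["drive", "car", "road"],
    ["drive", "automobile", "road"],
]

def context_vector(word):
    """Counts of words co-occurring with `word` in a sentence."""
    vec = {}
    for sent in corpus:
        if word in sent:
            for w in sent:
                if w != word:
                    vec[w] = vec.get(w, 0) + 1
    return vec

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)
```

Here distributional similarity finds car–automobile maximally similar purely through the shared contexts "drive" and "road".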

#### A Mixture Model with Sharing for Lexical Semantics. Joseph Reisinger, Raymond Mooney. EMNLP 2010

• Claims
• Multi-prototype representations [are good for] words with several unrelated meanings (e.g. bat and club), but are not suitable for representing the common ... structure [shared across senses] found in highly polysemous words such as line or run. We introduce a mixture model for capturing this---mixture of a Dirichlet Process clustering model and a background model.
• we derive a multi-prototype representation capable of capturing varying degrees of sharing between word senses, and demonstrate its effectiveness in the word-relatedness task in the presence of highly polysemous words.
• Positions lexical semantics as the umbrella task with subtasks such as word relatedness and selectional preferences

## Thursday, January 31, 2013

#### Fast Large-Scale Approximate Graph Construction for NLP. Amit Goyal, Hal Daum´e III, Raul Guerra. EMNLP 2012

• Claims:
• In FLAG, we first propose a novel distributed online-PMI algorithm
• We propose novel variants of PLEB to address the issue of reducing the pre-processing time for PLEB.
• Finally, we show the applicability of large-scale graphs built from FLAG on two applications: the Google-Sets problem and learning concrete and abstract words.

#### Sketch Algorithms for Estimating Point Queries in NLP. Amit Goyal, Hal Daumé III, Graham Cormode. EMNLP 2012

• Claims
• We propose novel variants of existing sketches by extending the idea of conservative update to them.
• We empirically compare and study the errors in approximate counts for several sketches.
• We use sketches to solve three important NLP problems: pseudo-words, semantic orientation (pos/neg), distributional similarity (using PMI and LLR).
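A minimal sketch of the conservative-update idea applied to a count-min sketch: on each increment, only the cells currently equal to the item's estimate are raised, which reduces overestimation while keeping the estimate an upper bound. The width, depth, and md5-based hashing are illustrative choices, not the paper's configuration.

```python
import hashlib

class CountMinCU:
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        """One hashed cell per row for `item`."""
        for row in range(self.depth):
            h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(h, 16) % self.width

    def update(self, item):
        cells = list(self._cells(item))
        est = min(self.table[r][c] for r, c in cells)
        for r, c in cells:
            # conservative update: raise only the minimum-valued cells
            if self.table[r][c] == est:
                self.table[r][c] = est + 1

    def query(self, item):
        """Estimated count: minimum over the item's cells."""
        return min(self.table[r][c] for r, c in self._cells(item))
```

With few items and a wide table, estimates match true counts exactly; under heavy collision the conservative rule keeps estimated counts closer to the truth than the plain update.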

#### Automatic Evaluation of Topic Coherence. David Newman, Jey Han Lau, Karl Grieser, Timothy Baldwin. NAACL-HLT 2010

• Claims
• we develop methods for evaluating the quality of a given topic, in terms of its coherence to a human (intrinsic qualitative evaluation).
• we ask humans to [judge] topics, propose models to predict topic coherence, demonstrate that our methods achieve nearly perfect agreement with humans

#### Multi-Prototype Vector-Space Models of Word Meaning. Joseph Reisinger, Raymond J. Mooney. NAACL-HLT 2010

• Claims
• We present a new resource-lean vector-space model that represents a word’s meaning by a set of distinct “sense specific” vectors.
• The model supports judging the similarity of both words in isolation and words in context.

## Wednesday, January 30, 2013

#### A Relational Model of Semantic Similarity between Words using Automatically Extracted Lexical Pattern Clusters from the Web. Danushka Bollegala, Yutaka Matsuo, Mitsuru Ishizuka. EMNLP 2009

• Key ideas
• Past work modelled similarity between two words in terms of context overlap, where context consisted of other words known to be closely related to the word (derived either from a corpus or an ontology like wordnet). On the other hand, the authors claim:
• We propose a relational model to compute the semantic similarity between two words. Intuitively, if the relations that exist between a and b are typical relations that hold between synonymous word pairs, then we get a high similarity score for a and b.
• Define relations as patterns such as "X is a Y". For each word pair, compute a feature vector with a weight for each pattern (relation). Do this for a set of seed pairs, and compute a "prototype" vector. For a new word pair, declare similar if its vector is similar to the prototype vector (i.e. n^T p is high).
• Many patterns represent same/similar relations. They solve this problem at 2 levels:
• They cluster similar patterns together, and use the clusters as features (instead of patterns).
• Since the clusters may also be similar, use a correlation matrix in the dot product, i.e. instead of n^T p, use n^T C p.
• Presents a view of the SS task as an integral part of various tasks including synonym generation (same as lexicon induction?), thesaurus generation, WSD, IR---query expansion, cluster labeling, etc.
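The scoring above can be sketched with toy numbers: word pairs become feature vectors over pattern clusters, a prototype p is averaged from synonym seed pairs, and a new pair n is scored as n^T C p, where the correlation matrix C lets similar clusters contribute to each other. All vectors and C below are invented values.

```python
# feature vectors of seed synonym pairs over 3 pattern clusters (invented)
seed_pairs = [[0.9, 0.1, 0.0],
              [0.8, 0.2, 0.1]]
# prototype = component-wise mean of the seed-pair vectors
prototype = [sum(col) / len(seed_pairs) for col in zip(*seed_pairs)]

# cluster-cluster correlation matrix; identity would give plain n^T p
C = [[1.0, 0.3, 0.0],
     [0.3, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

def score(n):
    """n^T C p for a candidate pair's cluster-feature vector n."""
    Cp = [sum(C[i][j] * prototype[j] for j in range(3)) for i in range(3)]
    return sum(n[i] * Cp[i] for i in range(3))

candidate = [0.7, 0.3, 0.0]   # pattern profile resembling the seeds
unrelated = [0.0, 0.1, 0.9]   # mass on a cluster the seeds rarely use
```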

#### Machine Learning that Matters. Kiri L. Wagstaff. ICML 2012

• Key message:  An analysis of what ails ML research today, especially w.r.t. its impact to real life problems
• Needed: domain interpretation of reported results
• Which classes were well-classified; which were not
• What are the common error types
• Why particular data sets were chosen
• Metrics
• Instead of domain-independent metrics like accuracy or F-measure, domain-specific metrics might shed more light
• For example, in classification of mushrooms, 80% might be good for botany, but we need more than 99% for deciding if a mushroom is poisonous to eat or not.
• Don't just compare the performance of algorithms; analyze
• how each algorithm is doing well
• what is the effect of domain characteristics
• Threshold ablation
• Also discuss which threshold ranges or performance regimes are relevant to the domain
• Do not summarize over all regimes, especially those irrelevant to the domain
• Take the method all the way through, to deployment
• "What matters is achieving performance sufficient to make an impact on the world. As an analogy, consider a sick child in a rural setting. A neighbor who runs two miles to fetch the doctor need not achieve Olympic-level running speed (performance), so long as the doctor arrives in time to address the sick child’s needs (impact)."
• The proposed solution might be complex internally, but easy to use externally, i.e. a lay person should be able to apply it to his problem without having to know a lot about ML.
• Interesting citations
• The changing science of machine learning. Pat Langley. Machine Learning 2011.

## Sunday, January 20, 2013

#### Bilingual lexicon extraction from comparable corpora using in-domain terms. Azniah Ismail, Suresh Manandhar. COLING 2010.

• Problem: Bilingual lexicon induction from comparable corpora without using orthographic features or large seed dictionaries.
• Key ideas:
• Context vectors have many noise words; choosing the right words (and omitting the others) should improve accuracy.
• How to choose: given source word s, find a highly associated source word a (high LLR). Find in-domain terms by intersecting their context words (also high-LLR words, but a larger list). Assume a translation of a exists in the target language. Get in-domain terms for each target word t. Get the translation of s given a by comparing in-domain term vectors. [Note: this requires translations of the in-domain terms in the target language.]
• Rank-binning similarity measure to overcome need for dictionary (not sure how this works.)
• Experiments: Interesting performance comparison---with different seed dictionaries.
• Interesting papers:
• Wilson Yiksen Wong. Learning lightweight ontologies from text across different domains using the web as background knowledge. Ph.D. Thesis. 2009

## Wednesday, January 16, 2013

### Notes on COLING 2012 - Part 3

#### Grammarless Parsing for Joint Inference. Jason Naradowsky, Tim Vieira, David A. Smith

• Problem: Jointly do parsing and NER (rather than doing one after the other), in the hope that they help each other, e.g. an NE span suggests there is a noun phrase
• Approach: New to the methods applied in this area. Need background to make sense.
• Interesting papers:
• Finkel, J. R. and Manning, C. D. Joint parsing and named entity recognition. NAACL-HLT 2009
• Sarawagi, S. and Cohen, W. W. Semi-Markov conditional random fields for information extraction. NIPS 2004

#### Text Reuse Detection Using a Composition of Text Similarity Measures. Daniel Bär, Torsten Zesch, Iryna Gurevych.

• Problem: Measure similarity of two pieces of text (for e.g. plagiarism detection)
• Key idea: Previous efforts used content-based measures; they use in addition structure and style as features.
• content: words, synonyms, semantically related words, LSA representations
• structure: stopword/POS n-grams
• style: type/token ratio, function word frequency, token/sentence length
• Use above as features for a machine-learned classifier (Naive Bayes, and decision tree)
• Experiments on each corpus discussed separately, including error analysis.
• Report confusion matrix when discussing classification performance.
• Interesting papers:
• Lin, D. An information-theoretic definition of similarity. ICML 1998
• Gabrilovich, E. and Markovitch, S. Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. IJCAI 2007
• Artstein, R. and Poesio, M. Inter-Coder Agreement for Computational Linguistics. CL 2008
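Two of the non-content features above can be sketched directly: type/token ratio (style) and stopword-bigram overlap (structure). The tiny stopword list and the Jaccard overlap are illustrative choices; the paper feeds such features to a learned classifier rather than thresholding them.

```python
# minimal stopword list, for illustration only
STOPWORDS = {"the", "a", "of", "is", "in"}

def type_token_ratio(tokens):
    """Style feature: distinct types divided by token count."""
    return len(set(tokens)) / len(tokens)

def stopword_bigrams(tokens):
    """Structure feature: bigrams over the stopword subsequence."""
    sw = [t for t in tokens if t in STOPWORDS]
    return {(a, b) for a, b in zip(sw, sw[1:])}

def structural_overlap(t1, t2):
    """Jaccard overlap of stopword bigrams of two texts."""
    b1, b2 = stopword_bigrams(t1), stopword_bigrams(t2)
    if not b1 or not b2:
        return 0.0
    return len(b1 & b2) / len(b1 | b2)

# a rewritten sentence keeps the stopword skeleton even though the
# content words change, which is what makes this feature useful
orig = "the history of the empire is told in the book".split()
reuse = "the story of the empire is told in the text".split()
```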

## Tuesday, January 15, 2013

### Notes on COLING 2012 - Part 2

#### Inducing Crosslingual Distributed Representations of Words. Alexandre Klementiev, Ivan Titov, Binod Bhattarai.

• Problem: Learning a semantic space where points represent words, and similar words are nearby
• Key ideas:
• Use deep learning (neural network-based) to learn a low-dimensional (d) representation of words (d fixed arbitrarily).
• Do the above in a multi-task learning setting to learn the low-d representation that holds across languages.
• Use a parallel corpus to learn a similarity matrix between words---used for training the multi-task+neural-net model.
• Found several interesting papers that might be worth reading (esp. starred ones)
• * Täckström, O., McDonald, R., and Uszkoreit, J. Cross-lingual word clusters for direct transfer of linguistic structure. NAACL 2012
• * Turian, J., Ratinov, L., and Bengio, Y. Word representations: a simple and general method for semi-supervised learning. ACL 2010
• * Fouss, F., Pirotte, A., Renders, J., and Saerens, M. Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. IEEE KDE 2007
• Cavallanti, G., Cesa-bianchi, N., and Gentile, C. Linear algorithms for online multitask classification. JMLR 2010
• Socher, R., Huang, E. H., Pennin, J., Ng, A. Y., and Manning, C. D. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. NIPS 2011
• Huang, E., Socher, R., Manning, C., and Ng, A. Improving word representations via global context and multiple word prototypes. ACL 2012
• Callison-Burch, C., Koehn, P., Monz, C., Post, M., Soricut, R., and Specia, L. Findings of the 2012 workshop on statistical machine translation. WMT ACL 2012. (see for preprocessing steps)
• Shi, L., Mihalcea, R., and Tian, M. Cross language text classification by model translation and semi-supervised learning. EMNLP 2010
• Titov, I. Domain adaptation by constraining inter-domain variability of latent feature representation. ACL 2011
• Glorot, X., Bordes, A., and Bengio, Y. Domain adaptation for large-scale sentiment classification: A deep learning approach. ICML 2011
• Zhang, D., Mei, Q., and Zhai, C. Cross-lingual latent topic extraction. ACL 2010
• Fortuna, B. and Shawe-Taylor, J. The use of machine translation tools for cross-lingual text mining. Workshop on Learning with Multiple Views, ICML 2005
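The cross-lingual piece of the model above can be sketched roughly as follows. This is my simplification, not the paper's objective: I assume the parallel-corpus similarities arrive as a matrix `S` (rows: source words, columns: target words) and that the multi-task coupling acts as a weighted attraction term between translation pairs in the shared space.

```python
import numpy as np

def crosslingual_reg_step(W_src, W_tgt, S, lr=0.05):
    """One gradient step on the cross-lingual regularizer
    R = sum_ij S[i,j] * ||W_src[i] - W_tgt[j]||^2,
    which pulls translation pairs (weighted by similarity S)
    toward each other in the shared embedding space."""
    row = S.sum(axis=1, keepdims=True)       # total weight per source word
    col = S.sum(axis=0, keepdims=True).T     # total weight per target word
    # dR/dW_src[i] = 2 * sum_j S[i,j] * (W_src[i] - W_tgt[j])
    g_src = 2 * (row * W_src - S @ W_tgt)
    g_tgt = 2 * (col * W_tgt - S.T @ W_src)
    return W_src - lr * g_src, W_tgt - lr * g_tgt
```

Words with no translation weight get zero gradient, so only dictionary-linked pairs are pulled together; the monolingual (language-model) part of the training would supply the rest of the signal.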

#### Long-tail Distributions and Unsupervised Learning of Morphology. Qiuye Zhao, Mitch Marcus

• Problem: Learning unsupervised morphological analyzers. Previous work assumed power-law rank-frequency distributions for morphological units; this paper proposes a log-normal distribution instead.
• Comments: Current approaches to morph analysis have moved beyond ILP and finite state machines. Need to do background reading to understand this work, e.g. Chan, E. Structures and distributions in morphology learning. PhD thesis 2008.
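To see the modeling choice concretely, here is an illustrative sketch (not from the paper) comparing maximum-likelihood fits of a log-normal and a continuous power law to a bag of frequencies; whichever gives the higher log-likelihood fits the data better.

```python
import numpy as np

def lognormal_loglik(x):
    """MLE log-likelihood of positive values x under a log-normal."""
    lx = np.log(x)
    mu, sigma = lx.mean(), lx.std()
    return np.sum(-np.log(x * sigma * np.sqrt(2 * np.pi))
                  - (lx - mu) ** 2 / (2 * sigma ** 2))

def powerlaw_loglik(x, xmin=1.0):
    """MLE log-likelihood under a continuous power law p(x) ~ x^-alpha,
    with the standard closed-form estimate of alpha."""
    alpha = 1.0 + len(x) / np.sum(np.log(x / xmin))
    return np.sum(np.log((alpha - 1) / xmin) - alpha * np.log(x / xmin))
```

On data that actually is log-normal (an interior mode on the log scale), the monotone power-law density cannot match the body of the distribution and loses on likelihood.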

#### Graph-based Multi-tweet Summarization Using Social Signals. Xiaohua Liu, Yitong Li, Furu Wei, Ming Zhou.

• Problem: Given a set of tweets, find one that is representative of the lot.
• Approach: Uses scoring functions and features tailored to the problem, taking into account salience, readability, and tweeter diversity; builds on existing work on multi-document summarization and on tweets.
• Comments: Check out the user study.
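A toy stand-in for the selection step (ignoring the paper's social signals and readability features): pick the tweet with the highest total cosine similarity to the rest, a centroid-style proxy for graph centrality.

```python
import numpy as np

def representative_tweet(tweets):
    """Pick the tweet most similar, in total, to all the others,
    using bag-of-words cosine similarity."""
    vocab = sorted({w for t in tweets for w in t.lower().split()})
    idx = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(tweets), len(vocab)))
    for r, t in enumerate(tweets):
        for w in t.lower().split():
            X[r, idx[w]] += 1
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit vectors
    sims = X @ X.T                                 # pairwise cosines
    np.fill_diagonal(sims, 0.0)                    # ignore self-similarity
    return tweets[int(np.argmax(sims.sum(axis=1)))]
```

An off-topic tweet shares few words with the rest, so it scores near zero and is never selected.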

#### To Exhibit is not to Loiter: A Multilingual, Sense-Disambiguated Wiktionary for Measuring Verb Similarity. Christian M. Meyer, Iryna Gurevych

• Problem:
• Given links between words, identify which senses of the words are actually (supposed to have been) linked.
• Given links between senses of words, infer new links, e.g. between words in different languages.
• Given links between senses of words, compute verb similarity.
• Comments: Check out resources created.
• Interesting papers mentioned
• computing semantic relatedness by measuring path lengths (Budanitsky and Hirst, 2006)
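The path-length idea can be sketched on a toy sense graph. The 1/(1+d) scoring below is one common variant of path-based relatedness, not necessarily the exact measure from the survey; sense IDs and edges are made up for illustration.

```python
from collections import deque

def path_similarity(graph, a, b):
    """Relatedness of senses a, b as inverse shortest-path length
    on an undirected sense graph: 1 / (1 + d(a, b)); 0 if unreachable."""
    if a == b:
        return 1.0
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:
        node, dist = frontier.popleft()
        for nbr in graph.get(node, ()):
            if nbr == b:
                return 1.0 / (2 + dist)  # path length is dist + 1
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, dist + 1))
    return 0.0
```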

## Monday, January 14, 2013

### Notes on COLING 2012 - Part 1

#### Extraction of domain-specific bilingual lexicon from comparable corpora: compositional translation and ranking. Estelle DELPECH, Béatrice DAILLE, Emmanuel MORIN, Claire LEMAIRE

• Problem: Extract translations of phrases (not just single words). Focus on fertile translations---target has more words than source.
• Key ideas:
• split source term into morphemes (helps handle the multi-word case, and also fertility.)
• translate morphemes (A key assumption here is that the parts of the source phrase are compositional)
• recompose into a target phrase. This creates several candidates (e.g. by permutation), which are then ranked.
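The three steps can be sketched as follows, with a toy morpheme lexicon standing in for the paper's bilingual resources (the morphemes and translations below are invented for illustration). Note how a one-word source term with two morphemes yields two-word, i.e. fertile, candidates.

```python
from itertools import permutations

def compositional_translations(source_morphs, lexicon):
    """Translate each morpheme of the (already split) source term,
    then recompose candidate target phrases by permuting the parts.
    `lexicon` maps a source morpheme to a list of target words."""
    def expand(morphs):
        # Cartesian product of the per-morpheme translation choices.
        if not morphs:
            yield []
            return
        for trans in lexicon.get(morphs[0], []):
            for rest in expand(morphs[1:]):
                yield [trans] + rest

    candidates = set()
    for words in expand(source_morphs):
        for perm in permutations(words):      # recomposition by permutation
            candidates.add(" ".join(perm))
    return sorted(candidates)                  # ranking would go here
```

The final ranking step (scoring candidates against a target corpus) is left out; this only shows how the candidate set is generated.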

#### Multi-way Tensor Factorization for Unsupervised Lexical Acquisition. Tim Van de Cruys, Laura Rimell, Thierry Poibeau, Anna Korhonen

• Problem: Cluster verbs in a corpus based on (a) what arguments they can take, (b) what arguments they prefer (among those possible), and (c) the first two jointly.
• Key idea: Use non-negative tensor factorization (Shashua, A. and Hazan, T. Non-negative tensor factorization with applications to statistics and computer vision. ICML 2005) to cluster the verbs.
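A compact sketch of rank-k non-negative CP factorization with multiplicative least-squares updates, which is the flavor of NTF cited; here the three modes could be verbs × subject slots × object slots. This is a generic NTF routine, not the paper's exact pipeline.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product: (I x k), (J x k) -> (I*J x k)."""
    return np.einsum('ik,jk->ijk', A, B).reshape(-1, A.shape[1])

def ntf(T, k, iters=500, eps=1e-9, seed=0):
    """Rank-k non-negative CP factorization of a 3-way tensor T
    via multiplicative updates: each factor stays non-negative
    because it is only ever multiplied by non-negative ratios."""
    rng = np.random.default_rng(seed)
    dims = T.shape
    F = [rng.random((d, k)) + 0.1 for d in dims]  # strictly positive init
    # Mode-m unfoldings: rows index mode m, columns the other two modes.
    unf = [np.moveaxis(T, m, 0).reshape(dims[m], -1) for m in range(3)]
    for _ in range(iters):
        for m in range(3):
            A, B = (F[i] for i in range(3) if i != m)
            gram = (A.T @ A) * (B.T @ B)          # Hadamard of Gram matrices
            F[m] *= (unf[m] @ khatri_rao(A, B)) / (F[m] @ gram + eps)
    return F
```

Rows of the verb-mode factor give each verb a k-dimensional loading over latent argument patterns, which can then be clustered or thresholded.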

#### Incremental Learning of Affix Segmentation. Wondwossen Mulugeta, Michael Gasser, Baye Yimam

• Problem: Affix segmentation (or morpho-analysis) for Amharic (whose morphology seems as complex as that of Indian languages).
• Approach: Directly used Inductive Logic Programming as described in [Manandhar, S. , Džeroski, S. and Erjavec, T. Learning multilingual morphology with CLOG. ILP 1998]
• Given data of the form: stem([s,e,b,e,r,k,u,l,h], [s,e,b,e,r], [1,1,1,2]). [seber is the stem of seberkulh]
• Learn a set of rules of the form "p :- q", meaning "Do p, if q is true". Example:
• stem(Word, Stem, [1, 2, 7, 0]) :- set_affix(Word, Stem, [y], [], [u], []), feature([1, 2, 7, 0], [simplex, imperfective, tppn, noobj]), template(Stem, [1, 0, 1, 1]).
• The order of training data matters a lot: simpler examples should be given first, followed by more complex ones.
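A toy Python reading of what one learned rule does at application time, using the seber/seberkulh example from above. The real CLOG-style rules also check feature codes and stem templates, which are omitted here.

```python
def apply_rule(word, prefix, suffix):
    """Apply one affix-stripping rule: if `word` carries the given
    prefix and suffix, return the remaining stem, else None."""
    if not (word.startswith(prefix) and word.endswith(suffix)):
        return None
    end = len(word) - len(suffix) if suffix else len(word)
    stem = word[len(prefix):end]
    return stem or None
```

A learner would keep a rule only while it segments the training examples seen so far correctly, which is why feeding simple examples before complex ones matters.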