Thursday, January 31, 2013

Fast Large-Scale Approximate Graph Construction for NLP. Amit Goyal, Hal Daumé III, Raul Guerra. EMNLP 2012

  • Claims:
    • In FLAG, we first propose a novel distributed online-PMI algorithm [a minimal PMI sketch follows this list]
    • We propose novel variants of PLEB to address the issue of reducing the pre-processing time for PLEB.
    • Finally, we show the applicability of large-scale graphs built from FLAG on two applications: the Google-Sets problem and learning concrete and abstract words.
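  • Aside: a minimal Python sketch of PMI computed from windowed co-occurrence counts, as a refresher on what the first claim's online-PMI refers to. This is not the paper's distributed online algorithm; the corpus format, window size, and thresholds are illustrative assumptions.
      import math
      from collections import Counter

      def pmi_table(sentences, window=5, min_count=5):
          """sentences: iterable of token lists. Returns {(w, v): PMI} with rough ML estimates."""
          word_counts, pair_counts, total_words = Counter(), Counter(), 0
          for sent in sentences:
              total_words += len(sent)
              word_counts.update(sent)
              for i, w in enumerate(sent):
                  for v in sent[i + 1:i + 1 + window]:
                      pair_counts[(w, v)] += 1
          total_pairs = sum(pair_counts.values())
          pmi = {}
          for (w, v), c in pair_counts.items():
              if word_counts[w] < min_count or word_counts[v] < min_count:
                  continue  # skip rare words, whose PMI estimates are unreliable
              p_wv = c / total_pairs
              p_w, p_v = word_counts[w] / total_words, word_counts[v] / total_words
              pmi[(w, v)] = math.log(p_wv / (p_w * p_v))
          return pmi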

Sketch Algorithms for Estimating Point Queries in NLP. Amit Goyal, Hal Daumé III, Graham Cormode. EMNLP 2012

  • Claims
    • We propose novel variants of existing sketches by extending the idea of conservative update to them. [a sketch of conservative update follows this list]
    • We empirically compare and study the errors in approximate counts for several sketches.
    • We use sketches to solve three important NLP problems: pseudo-words, semantic orientation (pos/neg), distributional similarity (using PMI and LLR).
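  • Aside: the "conservative update" idea the first claim builds on, shown here as a minimal Python sketch for a plain Count-Min sketch; sizes and hashing are illustrative, and the paper extends the idea to other sketches.
      import random

      class CountMinCU:
          """Count-Min sketch with conservative update: on an update, only the counters that
          fall below the new lower-bound estimate are raised, which reduces overestimation."""
          def __init__(self, width=2**20, depth=4, seed=0):
              rng = random.Random(seed)
              self.width, self.depth = width, depth
              self.tables = [[0] * width for _ in range(depth)]
              self.salts = [rng.getrandbits(64) for _ in range(depth)]

          def _cells(self, item):
              return [hash((salt, item)) % self.width for salt in self.salts]

          def update(self, item, count=1):
              cells = self._cells(item)
              target = min(self.tables[d][c] for d, c in enumerate(cells)) + count
              for d, c in enumerate(cells):
                  if self.tables[d][c] < target:
                      self.tables[d][c] = target

          def query(self, item):
              return min(self.tables[d][c] for d, c in enumerate(self._cells(item)))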

Automatic Evaluation of Topic Coherence. David Newman, Jey Han Lau, Karl Grieser, Timothy Baldwin. NAACL-HLT 2010

  • Claims
    • we develop methods for evaluating the quality of a given topic, in terms of its coherence to a human (intrinsic qualitative evaluation).
    • we ask humans to judge topics, propose models to predict topic coherence, and demonstrate that our methods achieve nearly perfect agreement with humans
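  • Aside: if I recall correctly, the best-performing scoring model in the paper is based on average pairwise PMI over a topic's top-N words, with co-occurrence statistics taken from an external corpus such as Wikipedia. A minimal Python sketch (count, co_count, and total_windows are assumed lookups over that corpus):
      import math
      from itertools import combinations

      def topic_coherence_pmi(top_words, count, co_count, total_windows):
          """Mean pairwise PMI over the topic's top words; higher should mean more coherent."""
          scores = []
          for w1, w2 in combinations(top_words, 2):
              p12 = co_count(w1, w2) / total_windows
              if p12 == 0:
                  continue  # unseen pair; a real implementation would smooth instead
              p1, p2 = count(w1) / total_windows, count(w2) / total_windows
              scores.append(math.log(p12 / (p1 * p2)))
          return sum(scores) / len(scores) if scores else 0.0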

Multi-Prototype Vector-Space Models of Word Meaning. Joseph Reisinger, Raymond J. Mooney. NAACL-HLT 2010

  • Claims
    • We present a new resource-lean vector-space model that represents a word’s meaning by a set of distinct “sense specific” vectors.
    • The model supports judging the similarity of both words in isolation and words in context.
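  • Aside: a minimal Python sketch of prototype-set similarity in the spirit of the paper's AvgSim/MaxSim measures (each word is a set of sense-specific vectors, and all prototype pairs are compared). The vectors themselves come from clustering a word's contexts, which is omitted here.
      import numpy as np

      def cosine(u, v):
          return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

      def avg_sim(protos_a, protos_b):
          """Average similarity over all prototype pairs of the two words."""
          sims = [cosine(a, b) for a in protos_a for b in protos_b]
          return sum(sims) / len(sims)

      def max_sim(protos_a, protos_b):
          """Similarity of the closest prototype pair (useful when context picks out a sense)."""
          return max(cosine(a, b) for a in protos_a for b in protos_b)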

Wednesday, January 30, 2013

A Relational Model of Semantic Similarity between Words using Automatically Extracted Lexical Pattern Clusters from the Web. Danushka Bollegala, Yutaka Matsuo, Mitsuru Ishizuka. EMNLP 2009

  • Key ideas
    • Past work modelled similarity between two words in terms of context overlap, where the context consisted of other words known to be closely related to the word (derived either from a corpus or from an ontology like WordNet). In contrast, the authors claim:
      • We propose a relational model to compute the semantic similarity between two words. Intuitively, if the relations that exist between a and b are typical relations that hold between synonymous word pairs, then we get a high similarity score for a and b.
    • Define relations as patterns such as "X is a Y". For each word pair, compute a feature vector with a weight for each pattern (relation). Do this for a set of seed pairs, and compute a "prototype" vector. For a new word pair, declare the words similar if its vector is similar to the prototype vector (i.e. if n^T p is high); see the sketch after this list.
    • Many patterns represent same/similar relations. They solve this problem at 2 levels:
      • They cluster similar patterns together, and use the clusters as features (instead of patterns).
      • Since the clusters may also be similar, use a correlation matrix in the dot product, i.e. instead of n^T p, use n^T C p.
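  • Aside: a minimal Python sketch of the scoring described above. A word pair is represented by a vector n of pattern-cluster weights and scored against a prototype p built from seed synonym pairs, with a cluster-correlation matrix C in the dot product. All numbers below are toy values.
      import numpy as np

      def relational_similarity(n, p, C):
          """Score a candidate pair vector n against the synonym prototype p via n^T C p."""
          return float(n @ C @ p)

      # toy example with 3 pattern clusters
      n = np.array([0.2, 0.7, 0.1])        # pattern-cluster weights for the candidate word pair
      p = np.array([0.3, 0.6, 0.1])        # prototype vector from seed synonym pairs
      C = np.array([[1.0, 0.4, 0.0],       # correlations between the pattern clusters
                    [0.4, 1.0, 0.1],
                    [0.0, 0.1, 1.0]])
      print(relational_similarity(n, p, C))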
  • Comments
    • Presents a view of the semantic similarity (SS) task as an integral part of various tasks including synonym generation (same as lexicon induction?), thesaurus generation, WSD, IR---query expansion, cluster labeling, etc.

Machine Learning that Matters. Kiri L. Wagstaff. ICML 2012

  • Key message: An analysis of what ails ML research today, especially w.r.t. its impact on real-life problems
  • Comments on empirical analysis
    • Needed: domain interpretation of reported results
      • Which classes were well-classified; which were not
      • What are the common error types
      • Why particular data sets were chosen
    • Metrics
      • Instead of domain-independent metrics like accuracy or F-measure, domain-specific metrics might shed more light
        • For example, in the classification of mushrooms, 80% accuracy might be good enough for botany, but we need more than 99% when deciding whether a mushroom is safe to eat.
      • Don't just compare the performance of algorithms; analyze 
        • how and why each algorithm does well
        • what is the effect of domain characteristics
    • Threshold ablation
      • Also discuss which threshold ranges or performance regimes are relevant to the domain
      • Do not summarize over all regimes, especially those irrelevant to the domain
  • Comments on impact
    • Take the method all the way through, to deployment
    • "What matters is achieving performance sufficient to make an impact on the world. As an analogy, consider a sick child in a rural setting. A neighbor who runs two miles to fetch the doctor need not achieve  Olympic-level running speed (performance), so long as the doctor arrives in time to address the sick child’s needs (impact)."
    • The proposed solution might be complex internally, but easy to use externally, i.e. a lay person should be able to apply it to his problem without having to know a lot about ML.
  • Interesting citations
    • The changing science of machine learning. Pat Langley. Machine Learning 2011.

Sunday, January 20, 2013

Bilingual lexicon extraction from comparable corpora using in-domain terms. Azniah Ismail, Suresh Manandhar. COLING 2010.

  • Problem: Bilingual lexicon induction from comparable corpora without using orthographic features or large seed dictionaries.
  • Key ideas:
    • Context vectors have many noise words; choosing the right words (and omitting the others) should improve accuracy.
    • How to choose: Given a source word s, find a highly associated source word a (i.e. one with high LLR). Find in-domain terms by intersecting their context words (also high-LLR words, but a larger list). Assume a translation of a exists in the target language. Get in-domain terms for each target word t. Get the translation of s given a by comparing the in-domain term vectors (see the sketch after this list). [Note: This requires translations of the in-domain terms in the target language.]
    • Rank-binning similarity measure to overcome need for dictionary (not sure how this works.)
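  • Aside: my loose reading of the in-domain-terms idea as a Python sketch; the helper names (context_words_of, translate) and the final comparison are hypothetical simplifications, and the paper's rank-binning similarity measure is not reproduced here.
      def in_domain_terms(context_words_of, s, a, top_k=100):
          """Intersect the high-LLR context words of s with a (larger) high-LLR list for a."""
          return set(context_words_of(s, top_k)) & set(context_words_of(a, 5 * top_k))

      def score_target_candidate(src_vec, tgt_vec, in_domain, translate):
          """Compare context vectors only over in-domain dimensions; `translate` maps a source
          context word to its target-language counterpart (must exist for the term to count)."""
          score = 0.0
          for w in in_domain:
              t = translate(w)
              if t is not None:
                  score += src_vec.get(w, 0.0) * tgt_vec.get(t, 0.0)
          return score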
  • Experiments: Interesting performance comparison---with different seed dictionaries.
  • Interesting papers:
    • Wilson Yiksen Wong. Learning lightweight ontologies from text across different domains using the web as background knowledge. Ph.D. Thesis. 2009

Wednesday, January 16, 2013

Notes on COLING 2012 - Part 3

Grammarless Parsing for Joint Inference. Jason Naradowsky, Tim Vieira, David A. Smith

  • Problem: Jointly do parsing and NER, rather than one after the other, in the hope that they help each other (e.g. an NE span suggests there is a noun phrase)
  • Approach: I'm new to the methods applied in this area; need background reading to make sense of it.
  • Interesting papers:
    • Finkel, J. R. and Manning, C. D. Joint parsing and named entity recognition. NAACL-HLT 2009
    • Sarawagi, S. and Cohen, W. W. Semi-Markov conditional random fields for information extraction. NIPS 2004

Text Reuse Detection Using a Composition of Text Similarity Measures. Daniel Bär, Torsten Zesch, Iryna Gurevych.

  • Problem: Measure the similarity of two pieces of text (e.g. for plagiarism detection)
  • Key idea: Previous efforts used content-based measures; they additionally use structure and style as features.
    • content: words, synonyms, semantically related words, LSA representations
    • structure: stopword/POS n-grams
    • style: type/token ratio, function word frequency, token/sentence length
    • Use the above as features for a machine-learned classifier (Naive Bayes and decision tree); a minimal sketch follows this list.
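  • Aside: a minimal Python sketch of the feature-combination idea: compute a few content/structure/style similarity features for a text pair and feed them to an off-the-shelf classifier. The specific features and classifier below are illustrative, not the paper's exact setup.
      from sklearn.naive_bayes import GaussianNB

      def pair_features(a_tokens, b_tokens):
          set_a, set_b = set(a_tokens), set(b_tokens)
          content = len(set_a & set_b) / max(1, len(set_a | set_b))              # word overlap (Jaccard)
          style = abs(len(set_a) / len(a_tokens) - len(set_b) / len(b_tokens))   # type/token ratio gap
          length = abs(len(a_tokens) - len(b_tokens)) / max(len(a_tokens), len(b_tokens))
          return [content, style, length]

      # X = [pair_features(a, b) for a, b in pairs]; y = reuse labels
      # clf = GaussianNB().fit(X, y)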
  • Comments
    • Experiments on each corpus discussed separately, including error analysis.
    • Report confusion matrix when discussing classification performance.
  • Interesting papers:
    • Lin, D. An information-theoretic definition of similarity. ICML 1998
    • Gabrilovich, E. and Markovitch, S. Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. IJCAI 2007
    • Artstein, R. and Poesio, M. Inter-Coder Agreement for Computational Linguistics. CL 2008

Tuesday, January 15, 2013

Notes on COLING 2012 - Part 2

Inducing Crosslingual Distributed Representations of Words. Alexandre Klementiev, Ivan Titov, Binod Bhattarai.

  • Problem: Learning a semantic space where points represent words, and similar words are nearby
  • Key ideas:
    • Use deep learning (neural network-based) to learn a low-dimensional representation of words (dimensionality d fixed arbitrarily).
    • Do the above in a multi-task learning setting to learn a low-d representation that holds across languages.
    • Use a parallel corpus to learn a similarity matrix between words---used for training the multi-task + neural-net model (see the loss sketch after this list).
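  • Aside (loss sketch): my schematic reading of the cross-lingual coupling. Alongside the two monolingual language-model objectives, word pairs weighted by the parallel-corpus similarity matrix pull the two embedding spaces together. Only the coupling term is shown; the names and the squared-distance form are assumptions, not the paper's exact objective.
      import numpy as np

      def crosslingual_penalty(E_src, E_tgt, alignment_weights):
          """E_src, E_tgt: embedding matrices (one row per word).
          alignment_weights: {(src_word_id, tgt_word_id): weight} from a parallel corpus."""
          loss = 0.0
          for (i, j), w in alignment_weights.items():
              diff = E_src[i] - E_tgt[j]
              loss += w * float(diff @ diff)
          return loss

      # schematic objective: L_lm_src + L_lm_tgt + lambda * crosslingual_penalty(E_src, E_tgt, A)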
  • Found several interesting papers that might be worth reading (esp. starred ones)
    • * Täckström, O., McDonald, R., and Uszkoreit, J. Cross-lingual word clusters for direct transfer of linguistic structure. NAACL 2012
    • * Turian, J., Ratinov, L., and Bengio, Y. Word representations: a simple and general method for semi-supervised learning. ACL 2010
    • * Fouss, F., Pirotte, A., Renders, J., and Saerens, M. Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. IEEE KDE 2007
    • Cavallanti, G., Cesa-bianchi, N., and Gentile, C. Linear algorithms for online multitask classification. JMLR 2010
    • Socher, R., Huang, E. H., Pennin, J., Ng, A. Y., and Manning, C. D. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. NIPS 2011
    • Huang, E., Socher, R., Manning, C., and Ng, A. Improving word representations via global context and multiple word prototypes. ACL 2012
    • Callison-Burch, C., Koehn, P., Monz, C., Post, M., Soricut, R., and Specia, L. Findings of the 2012 workshop on statistical machine translation. WMT ACL 2012. (see for preprocessing steps)
    • Shi, L., Mihalcea, R., and Tian, M. Cross language text classification by model translation and semi-supervised learning. EMNLP 2010
    • Titov, I. Domain adaptation by constraining inter-domain variability of latent feature representation. ACL 2011
    • Glorot, X., Bordes, A., and Bengio, Y. Domain adaptation for large-scale sentiment classification: A deep learning approach. ICML 2011
    • Zhang, D., Mei, Q., and Zhai, C. Cross-lingual latent topic extraction. ACL 2010
    • Fortuna, B. and Shawe-Taylor, J. The use of machine translation tools for cross-lingual text mining. Workshop on Learning with Multiple Views, ICML 2005

Long-tail Distributions and Unsupervised Learning of Morphology. Qiuye Zhao, Mitch Marcus

  • Problem: Learning unsupervised morphological analyzers. Previous work assumed power-law distributions for the rank-frequency of morph units; they propose a log-normal distribution instead (see the fitting sketch below).
  • Comments: Current approaches to morph analysis have moved beyond ILP and finite state machines. Need to do background reading to understand this work, e.g. Chan, E. Structures and distributions in morphology learning. PhD thesis 2008.
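  • Aside: a quick way to eyeball the distributional claim on one's own data: fit both a log-normal and a (continuous, Pareto-style) power law to type frequencies and compare log-likelihoods. The scipy distributions used are standard; treating a discrete power law as a continuous Pareto is a simplification.
      import numpy as np
      from scipy import stats

      def compare_fits(frequencies):
          freqs = np.asarray(frequencies, dtype=float)
          shape, loc, scale = stats.lognorm.fit(freqs, floc=0)
          ll_lognorm = np.sum(stats.lognorm.logpdf(freqs, shape, loc=loc, scale=scale))
          b, loc_p, scale_p = stats.pareto.fit(freqs, floc=0)
          ll_pareto = np.sum(stats.pareto.logpdf(freqs, b, loc=loc_p, scale=scale_p))
          return {"lognormal": ll_lognorm, "pareto": ll_pareto}  # higher log-likelihood = better fit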

Graph-based Multi-tweet Summarization Using Social Signals. LIU XiaoHua, LI YiTong, WEI FuRu, ZHOU Ming.

  • Problem: Given a set of tweets, find one that is representative of the lot.
  • Approach: Uses scoring functions and features tailored to the problem, taking into account saliency, readability, and tweeter diversity, and builds on existing work on multi-document summarization and on tweets.
  • Comments: Check out user study.

To Exhibit is not to Loiter: A Multilingual, Sense-Disambiguated Wiktionary for Measuring Verb Similarity. Christian M. Meyer, Iryna Gurevych

  • Problem:
    • Given links between words, identify which senses of the words are actually (supposed to have been) linked.
    • Given links between senses of words, infer new links, e.g. between words in different languages.
    • Given links between senses of words, compute verb similarity.
  • Key Ideas: Start with a dictionary with partial sense information. Disambiguate (remove incorrect) and infer (add new) links.
  • Comments: Check out resources created.
  • Interesting papers mentioned
    • computing semantic relatedness by measuring path lengths (Budanitsky and Hirst, 2006)

Monday, January 14, 2013

Notes on COLING 2012 - Part 1

Extraction of domain-specific bilingual lexicon from comparable corpora: compositional translation and ranking. Estelle DELPECH, Béatrice DAILLE, Emmanuel MORIN, Claire LEMAIRE

  • Problem: Extract translations of phrases (not just single words). Focus on fertile translations---target has more words than source.
  • Key ideas: 
    • split source term into morphemes (helps handle the multi-word case, and also fertility.)
    • translate morphemes (A key assumption here is that the parts of the source phrase are compositional)
    • recompose into a target phrase. This creates several candidates (e.g. by permutation), which are then ranked (a sketch follows this list).
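  • Aside: a minimal Python sketch of the split-translate-recompose pipeline described above; the morpheme splitter and the morpheme-level lexicon are assumed inputs, and ranking is left to a separate model.
      from itertools import permutations, product

      def candidate_translations(source_term, split, morpheme_lexicon):
          morphemes = split(source_term)                               # e.g. split a term into its morphemes (splitter assumed given)
          options = [morpheme_lexicon.get(m, [m]) for m in morphemes]  # each morpheme may have several translations
          candidates = set()
          for choice in product(*options):
              for perm in permutations(choice):                        # recomposition may reorder the parts
                  candidates.add(" ".join(perm))
          return candidates                                            # fertile (multi-word) candidates included; rank these afterwards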

Multi-way Tensor Factorization for Unsupervised Lexical Acquisition. Tim Van de Cruys, Laura Rimell, Thierry Poibeau, Anna Korhonen

  • Problem: Cluster verbs in a corpus based on (a) what arguments they can take, (b) what arguments they prefer (among those possible), and (c) both of these jointly.
  • Key idea: Use non-negative tensor factorization (Shashua, A. and Hazan, T. Non-negative tensor factorization with applications to statistics and computer vision. ICML 2005) to cluster the verbs.
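  • Aside: a minimal sketch of running an off-the-shelf non-negative CP (PARAFAC) decomposition on a verb x subject x object count tensor, here via the tensorly library (assuming a recent tensorly version; the paper uses the Shashua & Hazan formulation, so this is only an approximation of the setup).
      import numpy as np
      import tensorly as tl
      from tensorly.decomposition import non_negative_parafac

      # toy verb x subject x object co-occurrence counts; replace with real corpus counts
      counts = np.random.poisson(1.0, size=(50, 200, 200)).astype(float)

      cp = non_negative_parafac(tl.tensor(counts), rank=10, n_iter_max=200)
      verb_factor = tl.to_numpy(cp.factors[0])     # rows: verbs, columns: latent dimensions
      verb_clusters = verb_factor.argmax(axis=1)   # crude hard assignment of verbs to "clusters"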

Incremental Learning of Affix Segmentation. Wondwossen Mulugeta, Michael Gasser, Baye Yimam

  • Problem: Affix segmentation (or morphological analysis) for Amharic (whose morphology seems as complex as that of Indian languages).
  • Approach: Directly used Inductive Logic Programming as described in [Manandhar, S., Džeroski, S., and Erjavec, T. Learning multilingual morphology with CLOG. ILP 1998]
    • Given data of the form: stem([s,e,b,e,r,k,u,l,h], [s,e,b,e,r], [1,1,1,2]). [seber is the stem of seberkulh]
    • Learn a set of rules of the form "p :- q", meaning "p holds if q is true". Example:
      stem(Word, Stem, [1, 2, 7, 0]) :-
          set_affix(Word, Stem, [y], [], [u], []),
          feature([1, 2, 7, 0], [simplex, imperfective, tppn, noobj]),
          template(Stem, [1, 0, 1, 1]).
    • The order of training data matters a lot: simpler examples should be given first, followed by more complex ones.