Research Notes

Posted 2014-10-01 by gtholpadi

Relation between precision and recall in binary classification

Let $tp$, $fp$, $fn$, and $tn$ be the numbers of true positives, false positives, false negatives, and true negatives obtained on some set by a binary classifier. Let $P$ and $R$ be the precision and recall, given by $P=\frac{tp}{tp+fp}$ and $R = \frac{tp}{tp+fn}$. Let $N=tp+fp+fn+tn$ be the total size of the set. The precision and recall for the negative class are $P'=\frac{tn}{tn+fn}$ and $R'=\frac{tn}{tn+fp}$.
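The four quantities defined above can be computed directly from the confusion counts. The following is a minimal sketch, not part of the original note: the function name `precision_recall`, the choice to return `None` for an undefined ratio, and the example counts are all assumptions made for illustration.

```python
def precision_recall(tp, fp, fn, tn):
    """Return (P, R, P', R') for the positive and negative classes."""
    def ratio(num, den):
        # Assumption of this sketch: an undefined ratio (zero denominator)
        # is reported as None rather than raising an error.
        return num / den if den else None

    p = ratio(tp, tp + fp)      # precision, positive class
    r = ratio(tp, tp + fn)      # recall, positive class
    p_neg = ratio(tn, tn + fn)  # precision, negative class
    r_neg = ratio(tn, tn + fp)  # recall, negative class
    return p, r, p_neg, r_neg


# Made-up counts with N = tp + fp + fn + tn = 100.
tp, fp, fn, tn = 40, 10, 20, 30
P, R, P_neg, R_neg = precision_recall(tp, fp, fn, tn)
print(P, R, P_neg, R_neg)
```

Note that the negative-class values are obtained by swapping the roles of the two classes: $tn$ takes the place of $tp$, and $fp$ and $fn$ exchange places.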

Posted 2013-07-16

Language Models for Keyword Search over Data Graphs. Yosi Mass, Yehoshua Sagiv. WSDM 2012.
Problem: given a keyword query, find entities in a graph of entities. The graph is probably derived from a database, and it is presumed that the user will find SQL difficult to use. Examples of such databases include Wikipedia, IMDB, and Mondial.

Posted 2013-07-09

Characterizing the Influence of Domain Expertise on Web Search Behavior. Ryen W. White, Susan T. Dumais, Jaime Teevan. WSDM 2009.
Look up: maximum-margin averaged perceptron (Collins, M. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. EMNLP 2002).
Disorder Inequality: A Combinatorial Approach to Nearest Neighbor Search.

Posted 2013-06-30

Cross-Lingual Latent Topic Extraction. Duo Zhang, Qiaozhu Mei, ChengXiang Zhai. ACL 2010.
Key ideas
Input: unaligned document sets in two languages, a bilingual dictionary
Output: (1) a set of aligned topics (word distributions) in the two languages that can characterize the shared topics; (2) a topic coverage distribution for each language (coverage of each topic in that language)
Method:

Posted 2013-04-26

Accurate Methods for the Statistics of Surprise and Coincidence. Ted Dunning. Computational Linguistics 1993.
Ideas: "ordinary words are 'rare', any statistical work with texts must deal with the reality of rare events ... Unfortunately, the foundational assumption of most common statistical analyses used in computational linguistics is that the events being analyzed are relatively common."

Posted 2013-04-25

Statistical Extraction and Comparison of Pivot Words for Bilingual Lexicon Extension. Daniel Andrade, Takuya Matsuzaki, Jun'ichi Tsujii. TALIP 2012.
Ideas: Use only statistically significant context---determined using a Bayesian estimate of PMI.
"calculate a similarity score ... using the probability that the same pivots [(words from the seed lexicon)] will be extracted for both the query word and

Posted 2013-04-25

Attended a literature review on Question Answering by Akihiro Katsura. Some interesting references:
Green, Chomsky, et al. 1961. The BASEBALL system. (rule-based)
Isozaki et al. 2009. (machine learning-based)
Methodology of QA
Question analysis
Xue et al. SIGIR 2008. Retrieval models for QA archives.
Text retrieval
Jones et al. IPM 2000. ---Okapi/BM25
Berger et al. SIGIR 2000.--

Posted 2013-04-24

Identifying Word Translations from Comparable Documents Without a Seed Lexicon. Reinhard Rapp, Serge Sharoff, Bogdan Babych. LREC 2012.
Idea: Assume only document-aligned comparable corpora (and no seed lexicon---"typically comprising at least 10,000 words"). Characterize each article by a set of keywords.
"Formulate translation identification as a variant of the word alignment problem in a noisy

Posted 2013-04-23

Addressing polysemy in bilingual lexicon extraction from comparable corpora. Darja Fiser, Nikola Ljubesic, Ozren Kubelka. LREC 2012.
Idea: Get source word senses (using a sense tagger), construct context vectors for each sense, and then find the target translation. To compute sense-specific vectors: split occurrences of the source word into groups, and build context vectors separately for each group.

Posted 2013-03-14

A Wikipedia-Based Multilingual Retrieval Model. Martin Potthast, Benno Stein, and Maik Anderka. ECIR 2008.
Key idea: Use aligned Wiki articles (concepts) in two languages to map words/documents in different languages into a common concept space.
Comments: "A reasonable trade-off between retrieval quality and runtime is achieved for a concept space dimensionality between 1000 and 10000."

Posted 2013-03-13

Wikipedia-based Semantic Interpretation for Natural Language Processing. Evgeniy Gabrilovich, Shaul Markovitch. JAIR 2009. (Noting here the details not mentioned in the entry for the IJCAI paper...)
Corpus preprocessing: Discard articles that have fewer than 100 non-stop words or fewer than 5 incoming and outgoing links; discard articles that describe specific dates, as well as Wikipedia
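The two numeric thresholds in the preprocessing note above could be applied as a simple filter. This is only a sketch under stated assumptions: the note does not say whether the link threshold applies to the combined count or to each direction separately, so the combined count is assumed here; the function name `keep_article` and the example articles are hypothetical; and the remaining rules (date articles and the rest of the truncated list) are not modelled.

```python
def keep_article(non_stop_words, incoming_links, outgoing_links,
                 min_words=100, min_links=5):
    """Return True if an article passes the two numeric thresholds."""
    # Assumption: "incoming and outgoing links" is read as the combined count.
    total_links = incoming_links + outgoing_links
    return non_stop_words >= min_words and total_links >= min_links


# Hypothetical articles: (title, non-stop words, incoming links, outgoing links)
articles = [
    ("Long and well linked", 850, 12, 30),
    ("Short stub", 40, 9, 9),
    ("Long but isolated", 400, 1, 2),
]

kept = [title for title, words, inc, out in articles
        if keep_article(words, inc, out)]
print(kept)
```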

Posted 2013-03-07

Recent Advances in Methods of Lexical Semantic Relatedness – a Survey. Ziqi Zhang, Anna Lisa Gentile, Fabio Ciravegna. NLE 2012.
Corpora: Wikipedia, Wiktionary, Wordnet, various biomedical corpora
Methods: based on Path, Information Content, Gloss, Vector; all methods use structure, mainly from Wordnet/Wikipedia
Some methods that treat Wiki articles as concepts (and use no other

Posted 2013-03-04

Information about Krippendorff's alpha: http://cswww.essex.ac.uk/Research/nle/arrau/alpha.html

Posted 2013-02-11

Comparison of Semantic Similarity for Different Languages Using the Google n-gram Corpus and Second-Order Co-occurrence Measures. Colette Joubarne, Diana Inkpen. Advances in AI 2011.
Claims: many languages lack sufficient corpora to achieve valid measures of semantic similarity; manually-assigned similarity scores from one language can be transferred to another language,

Posted 2013-02-08

A Graph-Theoretic Framework for Semantic Distance. Vivian Tsang, Suzanne Stevenson. CL 2010.
Problem: similarity of texts (not single words)
Claims: "[we do] integration of distributional and ontological factors in measuring semantic distance between two sets of concepts (mapped from two texts) [within a network flow formalism]"
Key ideas: "Our goal is to measure the distance between two

Posted 2013-02-08

Disambiguating Identity Web References using Web 2.0 Data and Semantics. Matthew Rowe, Fabio Ciravegna. Journal of Web Semantics 2010.
Comments: Use ideas such as "Average First-Passage Time" of a graph
Interesting papers:
L. Lovasz, Random walks on graphs: A survey. Combinatorics 1993
M. Saerens, F. Fouss, L. Yen, P. Dupont, The principal components analysis of a graph, and its relationships

Posted 2013-02-07

A Random Graph Walk based Approach to Computing Semantic Relatedness Using Knowledge from Wikipedia. Ziqi Zhang, Anna Lisa Gentile, Lei Xia, José Iria, Sam Chapman. LREC 2010.
Key ideas: Model many kinds of features on a graph. Convert edge weights into probabilities; use $p^{(t)}(i|j)$ to model relatedness (where $t$ is the number of steps in the walk).
Interesting papers: Hughes, T., Ramage, D.
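The key idea in this entry (turn edge weights into step probabilities, then look at the distribution after t steps) can be sketched in a few lines. This is an illustration only, not the paper's implementation: row normalisation is assumed as the way to convert weights into probabilities, the function names and the toy graph are hypothetical, and entry `[j][i]` of the result is read as the probability of being at node i after t steps when starting from node j.

```python
def transition_matrix(weights):
    """Row-normalise non-negative edge weights into one-step probabilities."""
    rows = []
    for row in weights:
        total = sum(row)
        # Assumption: a node with no outgoing weight keeps an all-zero row.
        rows.append([w / total if total else 0.0 for w in row])
    return rows


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def t_step_probabilities(weights, t):
    """Return the t-step matrix; entry [j][i] is the walk probability j -> i."""
    step = transition_matrix(weights)
    n = len(step)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(t):
        result = matmul(result, step)
    return result


# Toy weighted graph on three nodes (made-up weights).
weights = [
    [0, 2, 0],
    [1, 0, 1],
    [0, 3, 0],
]
probs = t_step_probabilities(weights, 2)
print(probs[0])  # distribution over nodes after 2 steps from node 0
```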

Posted 2013-02-06

A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches. Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca, Aitor Soroa. NAACL-HLT 2009.
Claims: a supervised combination of [our methods] yields the best published results on all datasets; we pioneer cross-lingual similarity
A discussion on the differences between learning similarity and

Posted 2013-02-05

Cross-lingual Semantic Relatedness Using Encyclopedic Knowledge. Samer Hassan and Rada Mihalcea. EMNLP 2009.
Key ideas: Introduce the problem of cross-lingual semantic relatedness. Map words in different languages to their concept vectors (concepts are Wikipedia articles, similar to Gabrilovich and Markovitch, AAAI 2007). Map concepts using Wikipedia langlinks. The vectors are now comparable.
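The pipeline described above (word to concept vector, concepts mapped through langlinks, then comparison) can be sketched with sparse dictionaries. This is a toy illustration under several assumptions: the concept weights and the langlink table are made up, the helper names are hypothetical, concepts with no langlink are simply dropped, and plain cosine is used as the comparison, which is not necessarily the relatedness measure used in the paper.

```python
import math


def map_concepts(vector, langlinks):
    """Rewrite a concept vector using the other language's concept names."""
    mapped = {}
    for concept, weight in vector.items():
        target = langlinks.get(concept)
        # Assumption: concepts without a langlink are dropped.
        if target is not None:
            mapped[target] = mapped.get(target, 0.0) + weight
    return mapped


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up concept vectors for a Spanish word and an English word.
es_vector = {"Gato": 2.0, "Animal": 1.0, "Siesta": 0.5}
en_vector = {"Cat": 2.0, "Animal": 1.0}

# Made-up langlink table from Spanish to English article titles.
langlinks_es_en = {"Gato": "Cat", "Animal": "Animal"}

mapped = map_concepts(es_vector, langlinks_es_en)
score = cosine(mapped, en_vector)
print(mapped, score)
```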

Posted 2013-02-04

Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. Evgeniy Gabrilovich and Shaul Markovitch. IJCAI 2007.
Comments: Classifies work in the field into three main directions: (1) text fragments as bags of words in vector space (distributional similarity); (2) text fragments as bags of concepts (using Latent Semantic Analysis); (3) using lexical resources (Wordnet etc.) (also use

Posted 2013-02-04

Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. Philip Resnik. JAIR 1999.
Key Ideas
Comments: Semantic similarity as a special case of semantic relatedness (relation is IS-A). For example, car-gasoline are related, but car-bicycle are similar.
"... measures of similarity ... are seldom accompanied by an

Posted 2013-02-01

Learning Discriminative Projections for Text Similarity Measures. Wen-tau Yih, Kristina Toutanova, John C. Platt, Christopher Meek. CoNLL 2011.
Claims: We propose a new projection learning framework, Similarity Learning via Siamese Neural Network (S2Net), to discriminatively learn the concept vector representations of input text objects.
Comment: Input is pairs of words that are known to

Posted 2013-01-31

Fast Large-Scale Approximate Graph Construction for NLP. Amit Goyal, Hal Daumé III, Raul Guerra. EMNLP 2012.
Claims: In FLAG, we first propose a novel distributed online-PMI algorithm. We propose novel variants of PLEB to address the issue of reducing the pre-processing time for PLEB. Finally, we show the applicability of large-scale graphs built from FLAG on two applications: the

Posted 2013-01-30

A Relational Model of Semantic Similarity between Words using Automatically Extracted Lexical Pattern Clusters from the Web. Danushka Bollegala, Yutaka Matsuo, Mitsuru Ishizuka. EMNLP 2009.
Key ideas: Past work modelled similarity between two words in terms of context overlap, where context consisted of other words known to be closely related to the word (derived either from a corpus or an

Posted 2013-01-30

Machine Learning that Matters. Kiri L. Wagstaff. ICML 2012.
Key message: An analysis of what ails ML research today, especially w.r.t. its impact on real-life problems.
Comments on empirical analysis
Needed: domain interpretation of reported results
Which classes were well-classified; which were not
What are the common error types
Why particular data sets were chosen
Metrics
Instead of
