Tuesday, November 6, 2012

Talk on crowdsourcing at CSA

Interesting talk today by Sriganesh Madhvanath from HP Labs on the research problems in crowdsourcing. Many aspects were introduced:
  • What should be automated, and what should be crowdsourced?
  • How can we guarantee SLO's (service level objectives) when offering services to customers based on crowdsourcing?
  • The three axes for trade-off: Error (Accuracy) vs. Time vs. Cost (trying to reduce one causes the others to increase).
  • How to reduce the task into small sizes, which can be crowdsourced?
  • How to design the user interface to improve the chances of task completion?
  • How to combine the results of the crowdsourcing (majority vote, etc.)? 
  • How to learn the quality of workers? How to recruit for a given task?
  • etc.

Wednesday, July 25, 2012

Hindi Stop Words List

I found two sources of Hindi stop word lists online: sarai.net, and UniNE. Srivaths R, an undergrad student (from NITK) I am working with, extended the first list to this.

Friday, May 25, 2012

Online Vocabulary Adaptation using Limited Adaptation Data

C. E. Liu, K. Thambiratnam, F. Seide. InterSpeech 2007.

Application: Given an audio clip, and some text metadata, generate terms that should be used to index the clip.

Problem: Given the text metadata, adapt the vocabulary using an external corpus (e.g. the internet), and then choose indexing terms from the adapted vocabulary. More precisely, look at an external corpus and guess which words have been mentioned in the audio but are not present in the vocabulary of the indexer (and also, of course, in the metadata).

  • Use text metadata to query internet search engine. Pick useful words from retrieved document set, and update the vocabulary. Also use the retrieved document set for adapting the language model.
  • Distinguish between term frequency TF_td = c(t,d)/\sum{t'} c(t',d), and tapered term frequency TTF_td = 1+log(TF_td). In the Stanford NLP course, they define TF_td = 1 + log(c(t,d)). Which one is used when?
  • Similar to the above, they define TFIDF and TTFIDF.
  • Problem reduces to this: for each word in each retrieved document, classify it as a candidate OOV term (i.e. predict that it has been mentioned in the audio) or not. To do this, build a classifier as usual, using audio transcripts as the ground truth training data. The feature vector for a word consisted of its TF, TFIDF, TTF, TTFIDF, POS, etc.

Friday, April 6, 2012

Notes from ECIR 2012

quantum computing
- lattice (set theory)
- spectral theorem

designing search
- design depends on the context, which comprises
    - user (expert, novice, disabled, etc.)
    - task (adhoc, targeted, transactional, etc.)
    - environment (at home, on the move, on a desktop, on an phone, etc.)
- prioritize design goals
    - who am i designing for?
    - who am i _not_ designing for?
- faceted browsing: facets should be
    - exhaustive
    - consistent (e.g. (N.India, S.India) is ok, but (N.India, Bangalore) or (N.India, > Rs.10000) is not ok)
    - Orthogonal / Non-overlapping
    - Of roughly the same size

Paolo Boldi
    - centrality measures; empirical study of how it correlates to node importance
    - has open software on his website for this (HyperANF at ...unimi.it)
    - geodesic, centrality
    - centrality measures (first 3 are geometric indexes)
        - based on degree
        - based on no. of paths
        - based on how close it is to others
        - spectral indexes   
    - Lin centrality, harmonic centrality
    - Kendall's tau
    - computing centrality using diffusion
    - Pregel implementation
    - whitelist
    - naturalistic study

- neyman pearson lemma
- probability ranking principle (stephen robertson)

Explaining query modifications: An alternative interpretation of term addition and removal
- how and why do users modify queries
- what do they assume about how the search engine works

Predicting the Future Impact of News Events
- use various attributes to predict the various attributes
- use std ML techniques (SVR, feature selection)

Detection of News Feeds Items Appropriate for Children
- classify an article as 'appropriate' or 'not'
- use BBC data (labeled 'for children', and 'for adults')
- std readability measures (use info gain for feature selection)
    - ARI, flesch-kincaid

Yoelle Marak
- usage data
    - query log
    - click data
    (- mouse track
     -eye track)
- searchwiki
- demographics of web search. wsdm 2011
- anatomy of the long tail. wsdm 2010

- B-cubed precision and recall: to evaluate soft-clustering

- kdd 2006. "... center ... graph ... extraction ..."

- trustRank (pagerank with bias towards inlinks from reliable pages)

- unique sets ratio

- Geometric MAP, instead if MAP

- statistical significance testing
    - parametric tests (eg. T-test)
    - non-parametric tests
- TREC. Banks. Variance due to topics.
- Chi-square test: to check if disribution is Normal
- Shapiro-Wilks test

- BoxCox transformation
- ACE algo (alternating conditional expectation)

- axiomatic approach. fang et al. sigir 2004

- 'bursty' distribution

- Lemur, Terrier systems

Leif Azzopardi et al. top-K retrieval.
    Chen and Karger. SIGIR 2006. Top-K retrieval.
    Prob Ranking Principle. Robertson.
    MMR. Carbonell and Goldstein. 1998
    modern portfolio theory. wang and zhu. ecir 2009
    quantum prob ranking principle. zuccon and azzopardi. ecir 2010
    comparison of ranking principles. zuccon, azzopardi, van rijsbergen. ictir 2011
    gollapudi and sharma. 2009. facilities placement and top-k
- language model with dirichlet smoothing
- alpha-nDCG@10
- diversification
    - kuland and kee, santos et al

latent variable model
- max margin latent variable model, for inducing relevance functions
- log linear model p(x) = e^h(x) / sum_x' e^h(x')
- subgradient method for parameter estimation
- data sets (image)    - SUN, MIR Flickr
- Global SVM, Transductive SVM

1000 search engines
- problem of search
    - many data sources: www, wiki, news, patents, tweet
    - many result types: docs, temp, curr, people
    - many relevances: topical, recency
- probabilistic relational algebra (PRA)

context aware recommendation
- tensor factorization
- ndcg, auc, loss functions
- stochastic GD, alternating least squares, bundle methods

- tagme.dl.unipi.it
- milne and witten 2008
- earthmover distance (on graphs?)
- "combinatorial"
- wsdm 2012 paper on result clustering
- Lingo, Lingo 3G
- Carpineto, Osinski, et al. ACM Comp Surv 2009
- Carpineto et al. SIGIR 2010.
    - Optimal seach results clustering
    - kSSL (method for evaluating search result clustering), AMBIENT (data set)
- relative mean difference
- TreeNet tool
- Boosted decision trees

* Listen to all talks, to guage the audience. Then tune own presentation accordingly. (Dont say things that everyone knows. Dont forget to say things that most dont know.)

Friday, January 13, 2012

SIGIR Poster on Multi-Document Summarization

I read the poster MSSF: A Multi-Document Summarization Framework based on Submodularity (SIGIR 2011). It talked solving the problem using a budgeted maximum coverage problem based on submodularity. Some points to read about
  • Let E be a finite set. The function f:Pow(E)->R is submodular if, for any S \subset T \subset E, and s \in E \ T we have
f(T U {s}) - f(T) <= f(S U {s}) - f(S).
Submodular functions are closed under non-negative linear combinations.
  • Budgeted Maximum Coverage Problem (NP-Hard): Given a set E of sentences, where each subset S has an influence i(S) and a cost c(S), and a budget b, find a subset S of E which has the larget possible influence while the total cost does not exceed b.
  • For example, b is the maximum number of sentences in the summary, c(S) = number of sentences in S, and i(S) is a submodular function measuring the influence of the summary S, then there exist algorithms to solve this problem approximately. Different functions i(S) were proposed in the poster, for different settings (simple summarization, query-focused ~, update ~, comparitive ~). For the simple case
i(S) = \sum{s_i \in E\S} \sum{s_j \in S} sim(s_i, s_j) - \sum{s_i,s_j\in S, s_i != s_j} sim(s_i, s_j) = f1(S) - f2(S)
  • f1(S) measures to what degree S represents E, and f2(S) measures redundancies within S. f1(S) and -f2(S) are both submodular, so i(S) is also submodular.
Things to look up
  • Budgeted Maximum Coverage Problem. Khuller, Moss, Naor. 1999, Minoux 1978.
  • Maximum Marginal Relevance, Update summarization, Comparitive summarization, Query-focused summarization

Thursday, January 5, 2012

I am attending a School on Network Science at ECE, IISc, organized as part of the IMI, and funded by ICTS. I jotted down some points for revision/future look-up.

Networks, Rumours, and Epidemics [Dr Ayalvadi Ganesh]
  • Grimmett and Frieze
  • Damon Mosk-Aoyoma and Devavrat Shah. Distributed Computation of Separable Functions. IEEE Trans. Info. Theory. 2008
  • Consensus. de Groot et al.
  • Voter model. Hassin and Peleg, James Cruise and Ganesh Ayalvadi
  • Wright-Fisher model, Morale model
  • Wisdom of Crowds. Golub and Jackson
  • Acemoglu, Dahleh, Lobel, Ozdaglar
  • Kermack and McKendrick. Epidemic model using ODEs
  • Ganesh Ayalvadi, Moussalie, Tousley. Effect effect of network topology on epidemic spread. IEEE InfoComm. 2005
  • Draief, Ganesh, Moussalie. thresholds for Virus Spread on Nertworks. Ann. Appl. Prob. 2008
  • Coupon Collector Problem
  • Stochastic Domination, Strassen's Theorem
  • Markov Chains---irreducible, aperiodoc, ergodic, balance equations, invariant distribution, etc.
  • Markov Process (continuous time)
  • Poisson Process---thinning
  • Martingales, stopping time, Optional Stopping Theorem
  • Binary Entropy function, Conditional entropy
  • Random walks---coalescing RW
  • Coupling from the past in MCMC
  • Erdos-Renyi graphs, scale-free graphs, expander/ing graphs, unimodular graphs
  • Linear recurrences, harmonic functions
  • Lattices---infinite lattice
  • Perron-Frobenius Theorem---sprectral radius, Perron eigenvalue
  • Eigenvalue of e^A, where A is a matrix
  • Generalized isoperimetric constant
Ising Models [Andrea Montanari]
  • Hammersley-Clifford Theorem
  • Inclusion-Exclusion Principle
  • Union bound
  • Measures, Concentration
  • Gibbs sampling, MCMC, Metropolis
  • Constrastive Divergence (Hinton)
  • Glauber Markov Chain
  • tanh, atanh---properties
  • "One-dimensional recursion"
  • Inequalities---Markov, Chebyshev, ...
  • Sterling number---as related to enumerating number of graphs with p vertices and Cp edges 
  • Belief propagation, Generalized BP
  • Contraction (on graphs)