Sunday, December 21, 2008

FIRE 2008 at Kolkata

FIRE (Forum For Information Retrieval Evaluation) Workshop 2008 at Kolkata, Dec 13-15

Donna Harman - failure analysis
Xerox - Nicola Cancedda - CCA for EU langs
Kalervo Järvelin - Finland - morphological analysis
Noriko Kando, Tetsuya Sakai - NTCIR, Galaxy of Words - interaction in IR
Doug Oard - Univ. of Maryland
Carol Peters - CLEF
Mark Sanderson - Diversity in results (evaluation using clustering)
CLIA (Cross-Lingual Information Access) - a pan-India project (IITB/K, IIITH, ISI, JU, AU) - building IR for Indian languages + building resources for that (dictionaries, corpora, ontologies, rules etc.)

JHU - n-grams, skipgrams, language-independent
Univ. of Neuchatel (Jacques Savoy) (recommended reading by Donna Harman) - various techniques - Okapi BM25, DFR (probabilistic), LM (statistical), tf.idf, data fusion

Measures - MAP is the most preferred, Prec. vs. Recall also used, Performance never discussed
People - Manoj Chinnakotla, Vishal Vachhani, Ashish Almeida, Pabitra Mitra (IIT Kgp)
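
Since MAP came up as the preferred measure, here is a minimal sketch of how it is usually computed. The function names and the toy document IDs are my own, purely for illustration; this uses the standard textbook definition (average precision per query, then the mean over queries), not the actual FIRE evaluation tooling.

```python
from typing import Iterable, Sequence, Set, Tuple


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Average precision for one query.

    Sums precision@k at every rank k where a relevant document appears,
    then divides by the total number of relevant documents.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for k, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / k  # precision at rank k
    return total / len(relevant)


def mean_average_precision(
    runs: Iterable[Tuple[Sequence[str], Set[str]]]
) -> float:
    """Mean of the per-query average precision values."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    # Two hypothetical queries with made-up document IDs.
    queries = [
        (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
        (["d5", "d6"], {"d6"}),                    # AP = (1/2) / 1
    ]
    print(round(mean_average_precision(queries), 3))  # 0.667
```

Relevant documents that never appear in the ranking still count in the denominator, which is why MAP rewards recall as well as precision.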

Upcoming confs
NE workshop at ACL-IJCNLP - NE task - Dates: Jan 31 - task details, May 01 - paper submission
Workshop in CLIA - talks and papers - Dates: Mar 06 - paper submission
Discourse Anaphora and Anaphora Resolution Colloquium (DAARC2009) - Nov 5-6 2009, Goa - Dates: April 25 - paper submission

Monday, March 24, 2008

IIIT Hyd's work

Indian language font/script related
- identifying language of given text
- identifying encoding of given text
- developing a metaformat for uniformity
- converting from one encoding to another
Techniques used - TF*IDF, Glyph assimilation, IT3 (phonetic transliteration scheme)

Language and encoding identification (Anil Kumar Singh)
- Choices -> (1) What kind of language modeling should be used for representing training texts? (2) Which similarity measure should be used for comparing the models obtained from the texts?
- Models -> (1) n-gram model
- Similarity measures -> (1) out of rank - sum[for all n-grams in test data: diff(rank of n-gram in test data, rank in training data)] (2) mutual cross entropy (3) translator approaches (4) compare profiles of n-gram frequencies (5)
- orthographical features (based on letter sequences and frequencies)
- add-k smoothing
- pruning
- Monte Carlo sampling
- cross entropy
- Prediction by partial matching (Teahan and Harper 2001)
- Cavnar -> top 300 n-grams indicate language of text, bottom n-grams indicate topic of text
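
To make the rank-based measure in the list above concrete, here is a rough sketch of n-gram profile comparison in the spirit of the Cavnar note (keep the top 300 n-grams, sum the rank differences). The function names, the penalty for n-grams missing from the training profile, and the tiny training strings are all my assumptions; this is not the IIIT Hyderabad implementation.

```python
from collections import Counter
from typing import Dict, List


def ngram_profile(text: str, n_max: int = 3, top_k: int = 300) -> List[str]:
    """Character n-grams (n = 1..n_max) ordered from most to least frequent."""
    counts: Counter = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    # Ties are broken alphabetically so the ranking is deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:top_k]


def out_of_place(test_profile: List[str], train_profile: List[str]) -> int:
    """Sum of rank differences between a test profile and a training profile.

    An n-gram absent from the training profile gets a fixed maximum
    penalty (assumed here to be the length of the training profile).
    """
    train_rank = {gram: rank for rank, gram in enumerate(train_profile)}
    max_penalty = len(train_profile)
    distance = 0
    for rank, gram in enumerate(test_profile):
        if gram in train_rank:
            distance += abs(rank - train_rank[gram])
        else:
            distance += max_penalty
    return distance


def identify(text: str, training: Dict[str, str]) -> str:
    """Return the label whose training profile is closest to the text."""
    test_profile = ngram_profile(text)
    profiles = {label: ngram_profile(sample) for label, sample in training.items()}
    return min(profiles, key=lambda label: out_of_place(test_profile, profiles[label]))


if __name__ == "__main__":
    # Toy training texts; real profiles need far more data per language/encoding.
    training = {
        "en": "the quick brown fox jumps over the lazy dog and the cat",
        "de": "der schnelle braune fuchs springt über den faulen hund und die katze",
    }
    print(identify("the dog and the fox", training))
```

The same scheme should work for encoding identification if the profiles are built over bytes instead of characters, with one training sample per (language, encoding) pair.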

POS tagging for Gujarati using CRFs
- 26 POS tags, 600 tagged sentences, 5000 untagged sentences, 10000 training corpus (??)
- use CRF methods for tagging (why?)
- errors attributed to not enough training data

Similarity measures for sentence alignment
- weighted sentence length (charc,wordc,sig) -> Poisson (how?)
- word correspondence (based on distribution of words in the language??)
- NPC matching
- common word count
- synonym/hypernym intersection (using WordNet) -> for more abstract similarity measurement
- F-measure??
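
On the F-measure question: one plausible reading (my assumption, the notes do not confirm it) is that it was used to score the aligner's output against a gold alignment rather than as a similarity measure itself. A small sketch under that assumption, with hypothetical sentence-index pairs:

```python
from typing import Iterable, Tuple

Pair = Tuple[int, int]  # (source sentence index, target sentence index)


def alignment_f_measure(
    predicted: Iterable[Pair], gold: Iterable[Pair], beta: float = 1.0
) -> float:
    """F-measure of predicted sentence pairs against gold sentence pairs."""
    predicted_set, gold_set = set(predicted), set(gold)
    correct = len(predicted_set & gold_set)
    precision = correct / len(predicted_set) if predicted_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Made-up alignments: 2 of 3 predicted pairs are correct, 2 of 4 gold pairs found.
    predicted = [(0, 0), (1, 1), (2, 3)]
    gold = [(0, 0), (1, 1), (2, 2), (3, 3)]
    print(round(alignment_f_measure(predicted, gold), 3))  # 0.571
```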

Tuesday, February 19, 2008

Some possible Kannada problems on the internet

1. Linguistic search for keywords
1.1 Subproblem - identifying proper nouns
1.2 Subproblem - transliteration ambiguity
2. Font rendering technology - improvement and standardization
2.1 Encoding standardization

What already exists -
- Searching Kannada documents for English terms (by transliteration) -> Google
- Searching Kannada documents for Kannada terms (in Unicode) -> Google, Wikipedia

Friday, February 15, 2008

Machine translation applications - first findings

Home users
  1. Automatically translate web pages
  2. Translate chat
Organizational users (companies, government)
  1. Classifying documents as “needing human translation” or “not”; estimating effort needed for translation
  2. Localization support [e.g. for instruction manuals]
  3. Translation of email, documents, reports etc.
Professional users (translators)
  1. Support tools for translators who do post-editing
Spoken language translation (where is it used?)
  1. From European languages to Chinese/Japanese/Arabic and vice versa.
  2. From one European language to another
  3. Other languages include Korean
In general, even the best MT systems in use today are mainly useful for getting the general idea/gist of a text. Grammar and preservation of meaning cannot be guaranteed. The main use cases for such limited functionality could be:
  • Automatic translation of websites, but only where the objective is to do something on the website [e.g. booking tickets/hotel rooms, shopping for goods which shoppers already know about] or to get some information. It is not suited for reading articles or literary works. The sentences should be short (and hence easier to translate) [e.g. titles of menus, short descriptions of the services offered by the site, news snippets].
  • Tools that assist human translators [e.g. localization support tools].
  • Chatting
  • Online service for naive users – for applications similar to the above, except that the text is in some other system where there is no translation feature provided [e.g. where the chat client does not provide translation].
Applications of MT system components
  1. Spell-check, grammar-check
  2. Dictionary/thesaurus – mono and bi-lingual
  3. Multi-lingual search (thematic search, query translation)
Applications where MT is a component
  1. Speech translation
  2. OCR