Research Notes: March 2008

Indian language font/script related
- identifying language of given text
- identifying encoding of given text
- developing a metaformat for uniformity
- converting from one encoding to another
Techniques used - TF*IDF, Glyph assimilation, IT3 (phonetic transliteration scheme)
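
A minimal sketch of how TF*IDF weights could drive encoding identification, assuming the "terms" are character n-grams of the raw text and each known encoding has a single training text. The function names (char_ngrams, tfidf_vectors, identify) and the bigram default are illustrative assumptions, not taken from the work noted above.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Return all overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(docs, n=2):
    """docs maps a label (e.g. an encoding name) to its training text.

    Returns a dict: label -> {ngram: tf * idf}.
    """
    counts = {label: Counter(char_ngrams(text, n)) for label, text in docs.items()}
    num_docs = len(counts)

    # Document frequency: in how many training texts does each n-gram occur?
    df = Counter()
    for c in counts.values():
        df.update(c.keys())

    vectors = {}
    for label, c in counts.items():
        total = sum(c.values())
        vectors[label] = {
            g: (f / total) * math.log(num_docs / df[g]) for g, f in c.items()
        }
    return vectors


def identify(text, vectors, n=2):
    """Pick the label whose TF*IDF weights best cover the n-grams of text."""
    grams = char_ngrams(text, n)
    return max(vectors, key=lambda label: sum(vectors[label].get(g, 0.0) for g in grams))
```

N-grams shared by every training text get weight log(1) = 0, so only the sequences that are distinctive for one encoding contribute to the score.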

Language and encoding identification (Anil Kumar Singh)
- choices -> (1) What kind of language modeling should be used for representing training texts? (2) Which similarity measure should be used for comparing the models obtained from the texts?
- Models -> (1) n-gram model
- Similarity measures -> (1) out of rank: sum[for all n-grams in test data: diff(rank of n-gram in test data, rank in training data)] (2) mutual cross entropy (3) translator approaches (4) comparing profiles of n-gram frequencies (5)
- orthographical features (based on letter sequences and frequencies)
- add-k smoothing
- pruning
- Monte Carlo sampling
- cross entropy
- Prediction by partial matching (Teahan and Harper 2001)
- Cavnar -> top 300 n-grams indicate language of text, bottom n-grams indicate topic of text
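
Two of the measures above can be sketched in a few lines: the rank-difference ("out of rank") measure over top-n-gram profiles, and cross entropy with add-k smoothing. This is only a sketch under assumptions: character n-grams up to length 3, a fixed penalty of 300 for n-grams missing from the training profile, and a smoothing vocabulary taken as the union of training and test n-grams. None of these settings are confirmed by the notes.

```python
import math
from collections import Counter


def ngram_counts(text, max_n=3):
    """Count all character n-grams of length 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def rank_profile(text, top=300, max_n=3):
    """Map each of the `top` most frequent n-grams to its rank (0 = most frequent)."""
    counts = ngram_counts(text, max_n)
    # Ties are broken alphabetically so the profile is deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:top]
    return {g: rank for rank, g in enumerate(ranked)}


def out_of_rank(test_profile, train_profile, max_penalty=300):
    """Sum of rank differences; unseen n-grams cost max_penalty (assumed value)."""
    return sum(
        abs(rank - train_profile[g]) if g in train_profile else max_penalty
        for g, rank in test_profile.items()
    )


def cross_entropy(test_text, train_text, n=2, k=1.0):
    """Cross entropy (bits per n-gram) of test n-grams under an add-k smoothed model."""
    train = Counter(train_text[i:i + n] for i in range(len(train_text) - n + 1))
    test = Counter(test_text[i:i + n] for i in range(len(test_text) - n + 1))
    vocab = set(train) | set(test)  # simplification: vocabulary = union of both
    denom = sum(train.values()) + k * len(vocab)
    total_test = sum(test.values())
    return -sum(
        (c / total_test) * math.log2((train[g] + k) / denom) for g, c in test.items()
    )
```

In both cases the test text would be assigned to the language/encoding pair whose training text gives the smallest value.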

POS tagging for Gujarati using CRFs
- 26 POS tags, 600 tagged sentences, 5000 untagged sentences, 10000 training corpus (??)
- use CRF methods for tagging (why?)
- errors attributed to not enough training data
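
One plausible answer to the "why?" above: a CRF can combine many overlapping features of a token and its neighbours, which tends to help when tagged data is scarce. The sketch below shows what such a per-token feature dictionary might look like. The feature set (prefix, suffixes, neighbouring words) and the function names are hypothetical and not taken from the paper; the resulting dictionaries would still have to be passed to a CRF toolkit for training.

```python
def token_features(tokens, i):
    """Build a feature dict for the token at position i (hypothetical feature set)."""
    word = tokens[i]
    return {
        "word": word,
        "prefix2": word[:2],
        "suffix2": word[-2:],   # suffixes often carry inflection cues
        "suffix3": word[-3:],
        "is_digit": word.isdigit(),
        "prev_word": tokens[i - 1] if i > 0 else "<BOS>",
        "next_word": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",
    }


def sentence_features(tokens):
    """One feature dict per token, in sentence order."""
    return [token_features(tokens, i) for i in range(len(tokens))]
```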

Similarity measures for sentence alignment
- weighted sentence length (charc,wordc,sig) -> Poisson (how?)
- word correspondence (based on distribution of words in the language??)
- NPC matching
- common word count
- synonym/hypernym intersection (using WordNet) -> for more abstract similarity measurement
- F-measure??
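
A rough sketch of two of the measures above. For the "Poisson (how?)" question, one possible reading is to treat the target sentence length as Poisson-distributed around an expected length derived from the source sentence. That reading, the use of plain word counts instead of the weighted length, and the ratio parameter are all assumptions, not the method from the notes. The common word count is a plain multiset intersection of tokens.

```python
import math
from collections import Counter


def poisson_log_prob(k, lam):
    """Log of the Poisson probability of observing k events with mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def length_score(src_sentence, tgt_sentence, ratio=1.0):
    """Log-probability of the target word count given the source word count.

    ratio is the assumed average target/source length ratio.
    """
    src_len = len(src_sentence.split())
    tgt_len = len(tgt_sentence.split())
    lam = max(ratio * src_len, 1e-9)  # guard against log(0) for empty sources
    return poisson_log_prob(tgt_len, lam)


def common_word_count(src_sentence, tgt_sentence):
    """Number of tokens shared by both sentences, counting repeats."""
    shared = Counter(src_sentence.split()) & Counter(tgt_sentence.split())
    return sum(shared.values())
```

A higher length_score means the two lengths are more compatible; tokens such as numbers or names that survive translation unchanged are what the common word count would typically pick up.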

Monday, March 24, 2008

IIIT Hyd's work