Thursday, December 29, 2011

I recently started reading the book Introduction to Information Retrieval by Manning, Raghavan and Schuetze. It's a wonderful book. I plan to jot down some notes here from time to time.

Chapter 3 (Dictionaries and Tolerant Retrieval) is about data structures for storing terms, and about spelling correction/suggestion.
  • A simple way to store terms is in a binary search tree.
  • If the terms keep changing (additions/deletions), use a B-tree instead.
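
Here is a tiny Python sketch of the dictionary idea (my own toy code, not the book's): a sorted list with binary search stands in for the search tree; a dictionary with frequent additions/deletions would use a B-tree as noted above.

import bisect

def build_dictionary(terms):
    # sorted list of terms; binary search over it plays the role of the tree
    return sorted(set(terms))

def lookup(dictionary, term):
    # exact-match lookup in O(log n) comparisons
    i = bisect.bisect_left(dictionary, term)
    return i < len(dictionary) and dictionary[i] == term

terms = build_dictionary(["schumann", "schubert", "schuetze", "nietzsche"])
print(lookup(terms, "schuetze"))   # True
print(lookup(terms, "mozart"))     # False
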
Wild-card queries need special data structures.
  • For queries like `sch*', the binary search tree or B-tree suffices: follow the prefix `sch' down the tree to get the matching range of terms.
  • For queries like `*tze', a reversed tree can be used. Comment: Start with the original term set, reverse each term, and use the usual algorithm for tree construction.
  • For queries like `sch*tze', use both the trees to get terms with matching prefix/suffix, and then intersect the two sets.
  • For general queries like `sch*zen*ger', we need other data structures.
    • Permuterm indexes: Augment each term `abc' to `abc$'. Construct a B-tree index where all rotations (abc$, bc$a, c$ab, $abc) point to `abc'. For the query, append $ and rotate until a wildcard is at the end, giving `ger$sch*zen*'; look in the B-tree index for `ger$sch*', and get all matches. In this set, match the complete wild-card expression to get correct matches. Comment: It seems that this can be done directly with the tree + reversed tree combination. What is the advantage of the permuterm index? (A small sketch of the permuterm lookup follows this list.)
    • k-gram indexes: For each term `disco', add its k-grams $di, dis, isc, sco, co$ (assuming k=3) to the index. Each k-gram points to the terms that contain it. Given a query `sch*zen*ger', use the Boolean query `$sc AND sch AND zen AND ger AND er$' to get matches. In this set, match the complete wild-card expression to get correct matches. Comment: Not all matches for the Boolean query are correct; e.g. `ded*' gives `$de AND ded', which matches `deluded', which is not a match for the original wild-card expression. Hence the post-filtering step. (A k-gram sketch also follows this list.)
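
A rough Python sketch of the permuterm lookup as I understand it (the function names are mine, and a plain dict scan stands in for the B-tree prefix lookup):

from fnmatch import fnmatchcase

def build_permuterm(terms):
    # map every rotation of term+'$' to the term
    index = {}
    for t in terms:
        aug = t + "$"
        for i in range(len(aug)):
            index[aug[i:] + aug[:i]] = t
    return index

def permuterm_lookup(index, query):
    # query must contain at least one '*'
    aug = query + "$"
    while not aug.endswith("*"):
        aug = aug[-1] + aug[:-1]      # rotate until a wildcard is at the end
    prefix = aug.split("*", 1)[0]     # e.g. 'ger$sch' for 'sch*zen*ger'
    candidates = {t for rot, t in index.items() if rot.startswith(prefix)}
    # post-filter with the full wild-card expression to drop false matches
    return {t for t in candidates if fnmatchcase(t, query)}

idx = build_permuterm(["schwarzenegger", "schuetze", "singer"])
print(permuterm_lookup(idx, "sch*zen*ger"))   # {'schwarzenegger'}
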
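And a similar toy sketch for the k-gram approach, including the post-filtering step that removes false positives like `deluded' for `ded*':

from collections import defaultdict
from fnmatch import fnmatchcase

def kgrams(term, k=3):
    # k-grams of '$term$'; the '$' marks the term boundary
    aug = "$" + term + "$"
    return {aug[i:i + k] for i in range(len(aug) - k + 1)}

def build_kgram_index(terms, k=3):
    index = defaultdict(set)
    for t in terms:
        for g in kgrams(t, k):
            index[g].add(t)
    return index

def kgram_wildcard(index, query, k=3):
    # AND together the k-grams of the fixed parts of the query,
    # then post-filter with the full wild-card expression
    grams = set()
    for part in ("$" + query + "$").split("*"):
        grams |= {part[i:i + k] for i in range(len(part) - k + 1)}
    candidate_sets = [index.get(g, set()) for g in grams]
    candidates = set.intersection(*candidate_sets) if candidate_sets else set()
    return {t for t in candidates if fnmatchcase(t, query)}

idx = build_kgram_index(["deduced", "deluded", "dedicate"])
print(kgram_wildcard(idx, "ded*"))   # {'deduced', 'dedicate'}; 'deluded' is filtered out
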
Spelling correction requires a notion of closeness between strings, and a way to choose the best one among nearby strings.
  • Closeness between strings: This is measured in terms of edit distance or Levenshtein distance, which is computed using dynamic programming (a sketch of the computation follows this list).
  • Finding the nearest strings
    • One simple way is to go through all terms in the dictionary, compute the edit distance to each, and take the closest ones.
    • When the dictionary is big, one trick is to assume the first character is correct and restrict the search accordingly. Or, consider all rotations of the query, look for a fixed-size prefix of each rotation in a permuterm index, and collect all the terms found. Or, consider all k-grams of the query, look for them in a k-gram index, and intersect the postings lists (whose postings are terms, by the way).
    • We can be tolerant and say "I don't need all k-grams to match" (an intersection of postings lists effectively demands that all of them match). So we could say, e.g., "I want at least 2 k-grams to match". For this, go through the postings lists and, for each term, compare its set of k-grams with the k-grams of the query. More generally, use the Jaccard coefficient between the two sets of k-grams (of the term in the postings list, and of the query) to decide whether to keep the term or not (a Jaccard sketch follows this list).
    • All the above methods treat each query term in isolation. For context-sensitive spelling correction, do the following. Find the nearest strings for each query term in isolation (as explained above). Try all combinations of the candidates for the different query terms. By `try', we mean query the index with each combination of terms, and rank the combinations by the number of documents matched (highest first). Comment: The set of candidates for each query term can be pruned by `try'ing them too. Also, the pruning can be done stage-wise, i.e. if the query is `were eegles dare', first get candidates for `were' and prune to get {`were', `where', `wire'}. Then get candidates for each of {`were eegles', `where eegles', `wire eegles'} and prune to get {`were eagles', `where eagles', ...}. Finally get candidates for each of {`were eagles dare', `where eagles dare', ...} and prune to get the final list.
  • Phonetic closeness between strings: This is usually measured using Soundex algorithms. The basic idea is as follows. Hash every term into a 4-character string and build an inverted index for that (postings are the terms). Hash the query too, and search in the index. An example of a hash function is:
    • Retain the first letter. Change AEIOUHWY->0, BFPV->1, CGJKQSXZ->2, DT->3, L->4, MN->5, R->6. Compress consecutive identical digits to one. Then remove 0s, and return the first 4 positions (padded with 0 if required). For example, Manning -> M055052 -> M05052 -> M552. (A Soundex sketch follows this list.) Comment: It seems that Soundex is not a very good idea in general, especially for non-European languages/scripts.
  • Probabilistic models for spelling correction by Toutanova and Moore (2002) are the state of the art, and include ways to model edit distance, phonetic similarity, closeness on the keyboard, and data from popular spelling mistakes.
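
A quick sketch of the edit distance computation (the standard dynamic program; nothing here is specific to the book):

def edit_distance(s, t):
    # d[i][j] = cost of turning the first i chars of s into the first j chars of t
    d = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        d[i][0] = i
    for j in range(len(t) + 1):
        d[0][j] = j
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(s)][len(t)]

print(edit_distance("eegles", "eagles"))   # 1
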
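A sketch of the k-gram/Jaccard candidate filtering. The 0.3 threshold is an arbitrary value I picked, and in a real system the candidates would come from the k-gram postings lists rather than a scan of the whole dictionary:

def kgrams(term, k=3):
    aug = "$" + term + "$"
    return {aug[i:i + k] for i in range(len(aug) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

def candidates_by_jaccard(query, dictionary, threshold=0.3, k=3):
    # keep terms whose k-gram sets overlap enough with the query's,
    # instead of demanding that every k-gram match
    q = kgrams(query, k)
    return [t for t in dictionary if jaccard(q, kgrams(t, k)) >= threshold]

print(candidates_by_jaccard("eegles", ["eagles", "beagles", "legless", "query"]))
# ['eagles', 'beagles', 'legless']; 'query' shares no k-grams and is dropped.
# These are only candidates; edit distance (above) would then pick 'eagles'.
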
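Finally, a sketch of the Soundex-style hash described above (the simplified variant from these notes; real Soundex implementations differ in small details):

def soundex(term):
    codes = {}
    for letters, digit in [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
                           ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]:
        for ch in letters:
            codes[ch] = digit
    term = term.upper()
    digits = [codes.get(ch, "0") for ch in term[1:]]
    # compress runs of identical digits, drop zeros, pad to 4 positions
    out = []
    for d in digits:
        if not out or d != out[-1]:
            out.append(d)
    out = [d for d in out if d != "0"]
    return (term[0] + "".join(out) + "000")[:4]

print(soundex("Manning"))   # M552
print(soundex("Hermann"))   # H655
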
