Research Notes: December 2011

Friday, December 30, 2011

I recently started reading the book Introduction to Information Retrieval by Manning, Raghavan and Schuetze. It's a wonderful book. I plan to jot down some notes here from time to time.

Chap. 5 is about index compression. There are 3 motivations for this-

reduce disk space required to store the index
reduce transfer time between memory and disk
make it possible to store parts of the index in main memory (with on-the-fly decompression when it is used)

Statistical properties of the index are interesting, as they enable estimation of quantities such as number of terms, number of postings, average number of postings, etc. This in turn helps in the choice of appropriate compression algorithms. Two laws are discussed, both power laws.

Heap's Law: to estimate the number of (unique) terms (M)

M = k T^b, where T is the number of tokens, and k and b are constants.
b>0. We get a straight line with positive slope in the log-log scale.
Roughly speaking, this implies that the vocabulary keeps growing as the size of the corpus grows---it does not level off.

Zipf's Law: to estimate the size of the postings lists (f, for frequency/no. of occurrences in the collection)

f = k r^b, where r is the rank of the term (ordered by frequency), and k and b are constants (b<0).
Roughly speaking, this implies that the r^th most frequent word occurs 1/r times the most frequent word.

Dictionary compression allows us to keep the dictionary in memory, leading to faster query processing (by avoiding one disk seek).

The simplest case is no compression. Store records of the form <term, pointer to postings list>, where the term field is restricted to a certain length. Look for terms using binary search.
Dictionary as a string: Store the terms in the form <t1,t2,t3,...> as one long string. Then store records of the form <termPtr, postingsPtr>. termPtr is an offset from the beginning of the string where a particular term begins. The number of characters to be read is calculated by subtracting from the next termPtr. Look for terms using binary search.

Blocks of strings: Fix a block size k. Store strings as <l1,t1,l2,t2,l3,t3,l4,t4,...> where li is the length of ti stored in a single byte, and (l1,t1,...lk,tk) constitutes a block. Then store records of the form <blkPtr, postingsPtr1, ..., postingsPtrk>. blkPtr is an offset from the beginning of the string where a particular block begins. Look for terms using binary search to identify the block, and using sequential search within the block.

Front coding: Suppose the block coding string is <6,drawee, 6,drawer, 7,drawing, 5,drawl>, then instead use <6,draw*ee, 2,#er, 3,#ing, 1,#l>. That is, identify a prefix for a subsequence of the dictionary string, and use a special character (#) for that prefix.
Minimal perfect hashing: Choose a minimal perfect hash function h() for the vocabulary. Store records of the form <n, postingsPtr> where n = h(t) is the integer that the term hashes to. Looking for a term is a straightforward hash calculation. This method is not suited when the set of terms keeps changing.
If the dictionary cannot be stored in memory even after compression, store them on disk and add the first term of each disk page to a B tree. Looking for a term will require at most one disk seek.

Postings compression saves disk space, allows faster reads from disk, and allows caching of postings of popular terms in memory.

With no compression, the postings list looks like <docId1, docId2, ...>. If there are N documents in the collection, storing each docId requires logN bits. However, storing the gap between consecutive docIds requires fewer bits. For example, suppose there are 1024 documents, then storing <3, 7, 36, 49> requires log1024*4 = 40 bits. But if we knew that consecutive docIds are never more than 32 apart, then storing the offsets as <3, +4, +29, +13> requires 10 + 3*log32 = 25 bits.
Variable Byte encoding: In the example above, the offsets are stored in 5-bits. In principle, an offset can be as large as N, and require logN bits. But most offsets are much smaller. Hence, a variable length encoding of offsets is used as follows. Given the postings list <824, 829, 215406>, the `gap list' is <-, +5, +214577>. In binary, this looks like <110 0111000, 101, 1101 0001100 0110001>. In the encoding, it is stored as <00000110 10111000, 10000101, 00001101 00001100 10110001>. The 0 and 1 indicate whether this is the last byte of the encoding or not. The underlined sections how where 7-bit pieces from the binary form of the number are stored in the encoding.

The unit of encoding can be changed from 8 bytes to 4 or 16 or 32. Larger units lead to lesser bit manipulation but results in less effective compression.

Gamma coding: Suppose the offset is 13 (i.e. 1101). Then store 1110101. This is got by removing the leading bit and taking the rest (101)---called the offset, and prepending it with the length of the offset (3) stored in unary (1110). While reading a gamma code, read upto the first 0---this is the length of the offset that follows. Read the offset and prepend a 1 to it.

The optimality of the encoding depends on the data. It can be shown that gamma codes take at most twice as many bits as the optimal encoding for any data.
Gamma codes are prefix free (no code is a prefix of another, so that no delimiters are required) and parameter free (no parameters to fit/estimate before encoding can begin; also no re-estimation required when the postings lists change).

Gamma codes gives better compression than variable byte encoding but are more expensive to decode.

Thursday, December 29, 2011

I recently started reading the book Introduction to Information Retrieval by Manning, Raghavan and Schuetze. It's a wonderful book. I plan to jot down some notes here from time to time.

The Chap. 3 (Dictionaries and Tolerant Retrieval) is about data structures for storing terms, and spelling correction/suggestion.

A simple way to store terms is in a binary search tree.
If the terms keep changing (additions/deletions), use a B-tree instead.

Wild-card queries need special data structures.

For queries like `sch*', the binary/B tree suffices.
For queries like `*tze', a reversed tree can be used. Comment: Start with the original term set, reverse each term, and use the usual algorithm for tree construction.
For queries like `sch*tze', use both the trees to get terms with matching prefix/suffix, and then intersect the two sets.
For general queries like `sch*zen*ger', we need other data structures.

Permuterm indexes: Augment each term `abc' to `abc$'. Construct a B-tree index where all rotations (abc$, $abc, c$ab, bc$a, ...) point to `abc'. Rotate the query to `sch'$zen*ger*', look in the B-tree index for `ger$sch*', get all matches. In this set, match the complete wild card expression to get correct matches. Comment: It seems that this can be done directly with the tree + reversed tree combination. What is the advantage of the permuterm index?
k-gram indexes: For each term `disco', add its k-grams $di, dis, isc, sco, co$ (assuming k=3) to the index. Each k-gram points to words that contain it. Given a query `sch*zen*ger', use the query `$sc AND sch AND zen AND ger AND er$' to get matches. In this set, match the complete wild card expression to get correct matches. Comment: All matches for the Boolean query need not be correct; e.g. `ded*' gives `$de AND ded' which matches `deluded', which is not a match for the original wild-card expression.

Spelling correction requires a notion of closeness between strings, and a way to choose the best one among nearby strings.

Closeness between strings: This is measured in terms of edit distance or Levenshtein distance, which is computed using dynamic programming.
Finding the nearest strings

One simple way is to look through all terms in the dictionary, and compute the edit distance to each, and take the closest ones.
When the dictionary is big, assume the first character is correct and then look. Or, consider all rotations of the query, and look for a fixed size prefix of each rotation in a permuterm index, and collect all the terms found. Or, consider all k-grams of the query, look for them in a k-gram index, and intersect the postings lists (which contain terms, btw).
We can be tolerant and say "I don't need all k-grams to match" (which is what an intersection of posting lists does). So, we could say e.g. "I want at least 2 k-grams to match". For this, go through the posting lists and for each term, compare its set of k-grams with the k-grams of the query. More generally, use the Jaccard coefficient between the two sets of k-grams (of the term in the postings list, and the query) to decide whether to keep the term or not.
All the above methods treat each query term in isolation. For context-sensitive spelling correction, do the following. Find the nearest strings for each query term in isolation (as explained above). Try all combinations of the candidates for the different query terms. By `try', we mean query the index with each combination of terms, and rank them based on the number of document matched (highest first). Comment: The set of candidates for each query term can be pruned by `try'ing them too. Also, the pruning can be done stage-wise, i.e. if the query is `were eegles dare', first get candidates for `were' and prune to get {`were', `where', `wire'}. Then get candidates for each of {`were eegles', `where eegles', `wire eegles'} and prune to get {`were eagles', `where eagles', ...). Finally get candidates for each of {`were eagles dare', `where eagles dare', ...} and prune to get the final list.

Phonetic closeness between strings: This is usually measured using Soundex algorithms. The basic idea is as follows. Hash every term into a 4-character string and build an inverted index for that (postings are the terms). Hash the query too, and search in the index. An example of a hash function is:

Retain the first letter. Change AEIOUHWY->0, BFPV->1, CGJKQSXZ->2, DT->3, L->4, MN->5, R->6. Compress consecutive identical digits to one. Then remove 0s, and return the first 4 positions (padded with 0 if required). For example, Manning -> M5055052 -> M505052 -> M5552 -> M555. Comment: It seems that Soundex is not a very good idea in general, especially for non-European languages/scripts.

Probabilistic models for spelling correction by Toutanova and Moore (2002) are the state of the art, and include ways to model edit distance, phonetic similarity, closeness on the keyboard, and data from popular spelling mistakes.

Wednesday, December 28, 2011

I recently started reading the book Introduction to Information Retrieval by Manning, Raghavan and Schuetze. It's a wonderful book. I plan to jot down some notes here from time to time.

I am in the Chap. 4 right now, on Index Construction.

The basic idea is as follows: Break your text into terms. Get a stream of <term, docID> pairs. Sort them on term first and then on docID. Add the terms to a dictionary. For each term, add the docIDs to a postings list (sorted by docID). The inverted index is ready.
For larger document collections, use Blocked Sort-Based Indexing (BSBI): Break the data into chunks that can fit in memory. Then do the above for each chunk. Finally, merge the chunk indexes. [Comment: If you are using termIDs instead of terms, you might have to do an initial pass over the data to construct the dictionary. Alternatively, augment an existing dictionary when handling each successive chunk.]
When the vocabulary is too large to fit in memory, use Single-Pass In-Memory Indexing (SPIMI): Similar to BSBI, but uses terms instead of termIDs, so that when the merge is done, the terms are comparable. Also, instead of creating a stream of <term, docID> pairs, maintain a dynamically allocated postings list for each term, and add new occurrences to that. Also, all chunk data structures are compressed before being written to disk.
When everything is too large, use Distributed Indexing (using MapReduce): In the Map Phase, get the <term, docID> pairs. The framework sorts this for you. In the Reduce phase, create the postings list. Note: The output of Map is segmented as e.g. [a-g], [h-l],[m-r],[t-z]. One segment (say [a-g]) of all Mapper outputs is sent to one Reducer. This way, that Reducer can build the complete postings lists for all terms in its segment.

When the data keeps changing, use Dynamic Indexing.

A simple solution is: reconstruct the index periodically.
If changes need to be seen immediately, use an auxiliary index. Queries hit both indexes, and the results are "combined". When the aux becomes too large, merge it with the main index. Comment: I did not understand the merging process described in the book. Given a single index file containing all postings, why do we need T/n merges?

Other flavors of indexes

With positional information: <termID, docID, (pos1, pos2,...)>.
With compression: store only gaps between docIDs (assumes sorted posting list).
With Ranking information: docIDs in order of relevance to term. Comment: Complicates posting list updates---cannot simply append new documents.
With access control: Represent each doc by the users who may access it. Invert this corpus to get a postings list per user. Intersect this with the result list for queries from that user.