I recently started reading the book Introduction to Information Retrieval by Manning, Raghavan and Schuetze. It's a wonderful book. I plan to jot down some notes here from time to time.
Chap. 5 is about index compression. There are 3 motivations for this-
- reduce disk space required to store the index
- reduce transfer time between memory and disk
- make it possible to store parts of the index in main memory (with on-the-fly decompression when it is used)
- Heaps' Law: to estimate the number of (unique) terms (M)
- M = k T^b, where T is the number of tokens, and k and b are constants.
- b>0, so we get a straight line with positive slope in the log-log scale. Typical values are 30 ≤ k ≤ 100 and b ≈ 0.5.
- Roughly speaking, this implies that the vocabulary keeps growing as the size of the corpus grows---it does not level off.
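As a quick illustration, here is a small Python sketch of Heaps' Law used as an estimator. The constants k = 44 and b = 0.49 are only illustrative (of the order reported for the Reuters-RCV1 collection); real values have to be fit to the corpus at hand, and the function name is mine.

```python
# Heaps' Law: M = k * T^b estimates vocabulary size from token count.
# k and b below are illustrative; fit them to your own collection.
def heaps_vocab_size(tokens: int, k: float = 44, b: float = 0.49) -> int:
    return round(k * tokens ** b)

for t in (10**6, 10**8, 10**10):
    print(f"{t:>14,} tokens -> ~{heaps_vocab_size(t):,} distinct terms")
```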
- Zipf's Law: to estimate the size of the postings lists (f, for frequency/no. of occurrences in the collection)
- f = k r^b, where r is the rank of the term (ordered by decreasing frequency), and k and b are constants (b<0, classically b ≈ -1).
- Roughly speaking, this implies that the r^th most frequent word occurs 1/r as often as the most frequent word.
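A corresponding sketch for Zipf's Law, using the idealised b = -1 (again, the names and numbers are just for illustration):

```python
# Zipf's Law: f = k * r^b with b ≈ -1, so the rank-r term occurs
# roughly 1/r as often as the most frequent term.
def zipf_frequency(rank: int, top_count: int, b: float = -1.0) -> int:
    return round(top_count * rank ** b)

# If the most frequent term occurs 1,000,000 times:
for r in (1, 2, 3, 10, 100):
    print(f"rank {r:>3}: ~{zipf_frequency(r, 1_000_000):,} occurrences")
```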
- Dictionary compression. The simplest case is no compression: store fixed-width records of the form <term, pointer to postings list>, where the term field is restricted to a certain length. Look for terms using binary search.
- Dictionary as a string: Store the terms in the form <t1,t2,t3,...> as one long string. Then store records of the form <termPtr, postingsPtr>. termPtr is an offset from the beginning of the string where a particular term begins. The number of characters to read for a term is calculated by subtracting its termPtr from the next record's termPtr. Look for terms using binary search.
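A small Python sketch of this layout; the postings pointers are just stand-in values, and the helper names are mine:

```python
# Dictionary-as-a-string: one long string plus <termPtr, postingsPtr> records.
terms = sorted(["draw", "drawee", "drawer", "drawing", "drawl"])
dict_string = "".join(terms)
term_ptrs, pos = [], 0
for t in terms:
    term_ptrs.append(pos)
    pos += len(t)
postings_ptrs = [f"postings-{i}" for i in range(len(terms))]  # stand-ins

def term_at(i):
    end = term_ptrs[i + 1] if i + 1 < len(term_ptrs) else len(dict_string)
    return dict_string[term_ptrs[i]:end]  # length = next termPtr - this one

def lookup(term):
    lo, hi = 0, len(term_ptrs) - 1
    while lo <= hi:                       # binary search over the records
        mid = (lo + hi) // 2
        t = term_at(mid)
        if t == term:
            return postings_ptrs[mid]
        lo, hi = (mid + 1, hi) if t < term else (lo, mid - 1)
    return None

print(lookup("drawer"))  # postings-2
```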
- Blocks of strings: Fix a block size k. Store the string as <l1,t1,l2,t2,l3,t3,l4,t4,...> where li is the length of ti stored in a single byte, and (l1,t1,...,lk,tk) constitutes a block. Then store records of the form <blkPtr, postingsPtr1, ..., postingsPtrk>. blkPtr is an offset from the beginning of the string where a particular block begins. Look for terms using binary search on the first term of each block to identify the block, then sequential search within the block.
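A sketch with block size k = 4, simulating the length bytes with chr()/ord(); the term index returned stands in for the postings pointer:

```python
K = 4  # block size k

terms = sorted(["draw", "drawee", "drawer", "drawing", "drawl",
                "dread", "dream", "dredge"])

# Build the <l1,t1,l2,t2,...> string, plus one pointer per block of K terms.
parts, blk_ptrs, pos = [], [], 0
for i, t in enumerate(terms):
    if i % K == 0:
        blk_ptrs.append(pos)
    parts.append(chr(len(t)) + t)   # length byte, then the term itself
    pos += 1 + len(t)
blob = "".join(parts)

def first_term(b):
    p = blk_ptrs[b]
    return blob[p + 1 : p + 1 + ord(blob[p])]

def lookup(term):
    lo, hi = 0, len(blk_ptrs) - 1
    while lo < hi:                        # binary search over blocks
        mid = (lo + hi + 1) // 2
        if first_term(mid) <= term:
            lo = mid
        else:
            hi = mid - 1
    p = blk_ptrs[lo]
    for i in range(K):                    # sequential search within the block
        if p >= len(blob):
            return None
        l = ord(blob[p])
        if blob[p + 1 : p + 1 + l] == term:
            return lo * K + i             # stands in for postingsPtr
        p += 1 + l
    return None

print(lookup("dream"))  # 6
```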
- Front coding: Suppose the block-coded string is <6,drawee, 6,drawer, 7,drawing, 5,drawl>. Then instead use <6,draw*ee, 2,#er, 3,#ing, 1,#l>. That is, identify a common prefix for a subsequence of the dictionary string, mark where it ends (*), and use a special character (#) to stand for that prefix in the following terms.
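A minimal encoder for one block, reproducing the example above (the function name is mine):

```python
import os

def front_code_block(terms):
    """Front-code one block: emit the common prefix once, then suffixes."""
    prefix = os.path.commonprefix(terms)
    parts = [f"{len(terms[0])},{prefix}*{terms[0][len(prefix):]}"]
    for t in terms[1:]:
        suffix = t[len(prefix):]
        parts.append(f"{len(suffix)},#{suffix}")
    return " ".join(parts)

print(front_code_block(["drawee", "drawer", "drawing", "drawl"]))
# -> 6,draw*ee 2,#er 3,#ing 1,#l
```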
- Minimal perfect hashing: Choose a minimal perfect hash function h() for the vocabulary. Store records of the form <n, postingsPtr>, where n = h(t) is the integer that the term hashes to. Looking up a term is a straightforward hash computation. This method is not suited to dynamic collections where the set of terms keeps changing.
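Constructing such hash functions is its own topic; real systems use dedicated MPHF algorithms, but for a toy vocabulary one can simply search for parameters by brute force. This sketch is only an illustration of the record layout, not how production MPHF libraries work:

```python
import zlib

def find_mph(terms):
    """Brute-force search for (a, c) making h(t) = (a*crc32(t) + c) % n
    a bijection onto 0..n-1. Only viable for toy vocabularies."""
    n = len(terms)
    for a in range(1, 10000):
        for c in range(n):
            if len({(a * zlib.crc32(t.encode()) + c) % n for t in terms}) == n:
                return lambda t: (a * zlib.crc32(t.encode()) + c) % n
    raise ValueError("no perfect hash found in search range")

terms = ["draw", "drawee", "drawer", "drawing", "drawl"]
h = find_mph(terms)
records = [None] * len(terms)            # slot n holds <n, postingsPtr>
for t in terms:
    records[h(t)] = f"postings-for-{t}"  # stand-in postings pointer
print([h(t) for t in terms])             # a permutation of 0..4
```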
- If the dictionary cannot be stored in memory even after compression, store it on disk and add the first term of each disk page to a B-tree kept in main memory. Looking up a term then requires at most one disk seek.
- Postings file compression. With no compression, the postings list looks like <docId1, docId2, ...>. If there are N documents in the collection, storing each docId requires log N bits. However, storing the gap between consecutive docIds requires fewer bits. For example, suppose there are 1024 documents; then storing <3, 7, 36, 49> requires 4 × log 1024 = 40 bits. But if we knew that consecutive docIds are never more than 32 apart, then storing the list as <3, +4, +29, +13> requires only 10 + 3 × log 32 = 25 bits.
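The arithmetic from the example, worked in a few lines of Python:

```python
from math import ceil, log2

doc_ids = [3, 7, 36, 49]
gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
print(gaps)                                    # [3, 4, 29, 13]
print(4 * ceil(log2(1024)))                    # 40 bits: four full docIds
print(ceil(log2(1024)) + 3 * ceil(log2(32)))   # 25 bits: one docId + 3 gaps
```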
- Variable byte encoding: In the example above, the gaps are stored in 5 bits each. In principle, a gap can be as large as N and require log N bits, but most gaps are much smaller, so a variable-length encoding of gaps is used, as follows. Given the postings list <824, 829, 215406>, the `gap list' is <824, +5, +214577> (the first entry is the docId itself). In binary, this looks like <110 0111000, 101, 1101 0001100 0110001>, already split into 7-bit pieces (the leftmost piece may be shorter). The encoding stores each 7-bit piece in the low bits of one byte: <00000110 10111000, 10000101, 00001101 00001100 10110001>. The leading bit of each byte is a continuation bit; it is 1 in the last byte of a number's encoding and 0 otherwise.
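Here is a Python version of the scheme, which reproduces the byte strings above (the function names are mine):

```python
def vb_encode_number(n: int) -> bytes:
    """VB-encode one gap: 7 payload bits per byte, high bit set on the last."""
    out = bytearray()
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128          # set the continuation bit on the final byte
    return bytes(out)

def vb_encode(gaps):
    return b"".join(vb_encode_number(g) for g in gaps)

def vb_decode(data: bytes):
    gaps, n = [], 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte            # more bytes follow
        else:
            gaps.append(128 * n + byte - 128)  # last byte of this gap
            n = 0
    return gaps

code = vb_encode([824, 5, 214577])
print(" ".join(f"{b:08b}" for b in code))
# 00000110 10111000 10000101 00001101 00001100 10110001
print(vb_decode(code))  # [824, 5, 214577]
```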
- The unit of encoding can be changed from 8 bits to 4, 16, or 32 bits. Larger units mean less bit manipulation but less effective compression; smaller units compress better at the cost of more bit manipulation.
- Gamma coding: Suppose the gap is 13 (i.e. 1101 in binary). Then store 1110101. This is obtained by removing the leading 1-bit and taking the rest (101)---called the offset---and prepending the length of the offset (3) written in unary (1110). While reading a gamma code, read up to and including the first 0: this gives the length of the offset that follows. Then read the offset and prepend a 1 to it.
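A sketch of the encoder and decoder, using strings of '0'/'1' characters for readability rather than packed bits:

```python
def gamma_encode(n: int) -> str:
    """Gamma-code n >= 1 as unary(length) followed by the offset."""
    offset = bin(n)[3:]               # binary form minus the leading 1-bit
    length = "1" * len(offset) + "0"  # length of the offset, in unary
    return length + offset

def gamma_decode(bits: str):
    numbers, i = [], 0
    while i < len(bits):
        length = bits.index("0", i) - i               # unary part up to first 0
        offset = bits[i + length + 1 : i + 2 * length + 1]
        numbers.append(int("1" + offset, 2) if length else 1)
        i += 2 * length + 1
    return numbers

print(gamma_encode(13))                                  # 1110101
print(gamma_decode(gamma_encode(13) + gamma_encode(5)))  # [13, 5]
```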
- The optimality of the encoding depends on the data. It can be shown that gamma codes take at most twice as many bits as the optimal encoding for any data. In practice, gamma codes give better compression than variable byte encoding but are more expensive to decode.
- Gamma codes are prefix free (no code is a prefix of another, so that no delimiters are required) and parameter free (no parameters to fit/estimate before encoding can begin; also no re-estimation required when the postings lists change).