I recently started reading the book Introduction to Information Retrieval by Manning, Raghavan and Schuetze. It's a wonderful book. I plan to jot down some notes here from time to time.
Chap. 5 is about index compression. There are 3 motivations for this-
Gamma codes gives better compression than variable byte encoding but are more expensive to decode.Chap. 5 is about index compression. There are 3 motivations for this-
- reduce disk space required to store the index
- reduce transfer time between memory and disk
- make it possible to store parts of the index in main memory (with on-the-fly decompression when it is used)
- Heap's Law: to estimate the number of (unique) terms (M)
- M = k T^b, where T is the number of tokens, and k and b are constants.
- b>0. We get a straight line with positive slope in the log-log scale.
- Roughly speaking, this implies that the vocabulary keeps growing as the size of the corpus grows---it does not level off.
- Zipf's Law: to estimate the size of the postings lists (f, for frequency/no. of occurrences in the collection)
- f = k r^b, where r is the rank of the term (ordered by frequency), and k and b are constants (b<0).
- Roughly speaking, this implies that the r^th most frequent word occurs 1/r times the most frequent word.
- The simplest case is no compression. Store records of the form <term, pointer to postings list>, where the term field is restricted to a certain length. Look for terms using binary search.
- Dictionary as a string: Store the terms in the form <t1,t2,t3,...> as one long string. Then store records of the form <termPtr, postingsPtr>. termPtr is an offset from the beginning of the string where a particular term begins. The number of characters to be read is calculated by subtracting from the next termPtr. Look for terms using binary search.
- Blocks of strings: Fix a block size k. Store strings as <l1,t1,l2,t2,l3,t3,l4,t4,...> where li is the length of ti stored in a single byte, and (l1,t1,...lk,tk) constitutes a block. Then store records of the form <blkPtr, postingsPtr1, ..., postingsPtrk>. blkPtr is an offset from the beginning of the string where a particular block begins. Look for terms using binary search to identify the block, and using sequential search within the block.
- Front coding: Suppose the block coding string is <6,drawee, 6,drawer, 7,drawing, 5,drawl>, then instead use <6,draw*ee, 2,#er, 3,#ing, 1,#l>. That is, identify a prefix for a subsequence of the dictionary string, and use a special character (#) for that prefix.
- Minimal perfect hashing: Choose a minimal perfect hash function h() for the vocabulary. Store records of the form <n, postingsPtr> where n = h(t) is the integer that the term hashes to. Looking for a term is a straightforward hash calculation. This method is not suited when the set of terms keeps changing.
- If the dictionary cannot be stored in memory even after compression, store them on disk and add the first term of each disk page to a B tree. Looking for a term will require at most one disk seek.
- With no compression, the postings list looks like <docId1, docId2, ...>. If there are N documents in the collection, storing each docId requires logN bits. However, storing the gap between consecutive docIds requires fewer bits. For example, suppose there are 1024 documents, then storing <3, 7, 36, 49> requires log1024*4 = 40 bits. But if we knew that consecutive docIds are never more than 32 apart, then storing the offsets as <3, +4, +29, +13> requires 10 + 3*log32 = 25 bits.
- Variable Byte encoding: In the example above, the offsets are stored in 5-bits. In principle, an offset can be as large as N, and require logN bits. But most offsets are much smaller. Hence, a variable length encoding of offsets is used as follows. Given the postings list <824, 829, 215406>, the `gap list' is <-, +5, +214577>. In binary, this looks like <110 0111000, 101, 1101 0001100 0110001>. In the encoding, it is stored as <00000110 10111000, 10000101, 00001101 00001100 10110001>. The 0 and 1 indicate whether this is the last byte of the encoding or not. The underlined sections how where 7-bit pieces from the binary form of the number are stored in the encoding.
- The unit of encoding can be changed from 8 bytes to 4 or 16 or 32. Larger units lead to lesser bit manipulation but results in less effective compression.
- Gamma coding: Suppose the offset is 13 (i.e. 1101). Then store 1110101. This is got by removing the leading bit and taking the rest (101)---called the offset, and prepending it with the length of the offset (3) stored in unary (1110). While reading a gamma code, read upto the first 0---this is the length of the offset that follows. Read the offset and prepend a 1 to it.
- The optimality of the encoding depends on the data. It can be shown that gamma codes take at most twice as many bits as the optimal encoding for any data.
- Gamma codes are prefix free (no code is a prefix of another, so that no delimiters are required) and parameter free (no parameters to fit/estimate before encoding can begin; also no re-estimation required when the postings lists change).