I recently started reading the book Introduction to Information Retrieval by Manning, Raghavan and Schuetze. It's a wonderful book. I plan to jot down some notes here from time to time.

Chap. 5 is about index compression. There are 3 motivations for this-

Gamma codes gives better compression than variable byte encoding but are more expensive to decode.Chap. 5 is about index compression. There are 3 motivations for this-

- reduce disk space required to store the index
- reduce transfer time between memory and disk
- make it possible to store parts of the index in main memory (with on-the-fly decompression when it is used)

*power law*s.*Heap's Law*: to estimate the number of (unique) terms (M)- M = k T^b, where T is the number of tokens, and k and b are constants.
- b>0. We get a straight line with positive slope in the log-log scale.
- Roughly speaking, this implies that the vocabulary keeps growing as the size of the corpus grows---it does not level off.
*Zipf's Law*: to estimate the size of the postings lists (f, for frequency/no. of occurrences in the collection)- f = k r^b, where r is the rank of the term (ordered by frequency), and k and b are constants (b<0).
- Roughly speaking, this implies that the r^th most frequent word occurs 1/r times the most frequent word.

- The simplest case is
*no compression*. Store records of the form <term, pointer to postings list>, where the term field is restricted to a certain length. Look for terms using binary search. *Dictionary as a string*: Store the terms in the form <t1,t2,t3,...> as one long string. Then store records of the form <termPtr, postingsPtr>. termPtr is an offset from the beginning of the string where a particular term begins. The number of characters to be read is calculated by subtracting from the next termPtr. Look for terms using binary search.*Blocks of strings*: Fix a block size k. Store strings as <l1,t1,l2,t2,l3,t3,l4,t4,...> where li is the length of ti stored in a single byte, and (l1,t1,...lk,tk) constitutes a block. Then store records of the form <blkPtr, postingsPtr1, ..., postingsPtrk>. blkPtr is an offset from the beginning of the string where a particular block begins. Look for terms using binary search to identify the block, and using sequential search within the block.*Front coding*: Suppose the block coding string is <6,drawee, 6,drawer, 7,drawing, 5,drawl>, then instead use <6,draw*ee, 2,#er, 3,#ing, 1,#l>. That is, identify a prefix for a subsequence of the dictionary string, and use a special character (#) for that prefix.*Minimal perfect hashing*: Choose a minimal perfect hash function*h()*for the vocabulary. Store records of the form <n, postingsPtr> where n =*h(*t*)*is the integer that the term hashes to. Looking for a term is a straightforward hash calculation. This method is not suited when the set of terms keeps changing.- If the dictionary cannot be stored in memory even after compression, store them on disk and add the first term of each disk page to a B tree. Looking for a term will require at most one disk seek.

- With no compression, the postings list looks like <docId1, docId2, ...>. If there are N documents in the collection, storing each docId requires logN bits. However, storing the gap between consecutive docIds requires fewer bits. For example, suppose there are 1024 documents, then storing <3, 7, 36, 49> requires log1024*4 = 40 bits. But if we knew that consecutive docIds are never more than 32 apart, then storing the offsets as <3, +4, +29, +13> requires 10 + 3*log32 = 25 bits.
*Variable Byte encoding*: In the example above, the offsets are stored in 5-bits. In principle, an offset can be as large as N, and require logN bits. But most offsets are much smaller. Hence, a variable length encoding of offsets is used as follows. Given the postings list <824, 829, 215406>, the `gap list' is <-, +5, +214577>. In binary, this looks like <__110____0111000__,__101__,__1101____0001100____0110001__>. In the encoding, it is stored as <**0**0000__110__**1**__0111000__,**1**0000__101__,**0**000__1101__**0**__0001100__**1**__0110001__>. The**0**and**1**indicate whether this is the last byte of the encoding or not. The underlined sections how where 7-bit pieces from the binary form of the number are stored in the encoding.- The unit of encoding can be changed from 8 bytes to 4 or 16 or 32. Larger units lead to lesser bit manipulation but results in less effective compression.
*Gamma coding*: Suppose the offset is 13 (i.e. 1__101__). Then store*1110*__101__. This is got by removing the leading bit and taking the rest (__101__)---called the offset, and prepending it with the length of the offset (3) stored in unary (*1110*). While reading a gamma code, read upto the first 0---this is the length of the offset that follows. Read the offset and prepend a 1 to it.- The optimality of the encoding depends on the data. It can be shown that gamma codes take at most twice as many bits as the optimal encoding for any data.
- Gamma codes are
*prefix free*(no code is a prefix of another, so that no delimiters are required) and*parameter free*(no parameters to fit/estimate before encoding can begin; also no re-estimation required when the postings lists change).

## No comments:

## Post a Comment