I recently started reading the book Introduction to Information Retrieval by Manning, Raghavan and Schuetze. It's a wonderful book. I plan to jot down some notes here from time to time.
I am on Chapter 4 right now, on Index Construction.
- The basic idea is as follows: break the text into terms to get a stream of <term, docID> pairs. Sort them by term first and then by docID. Add the terms to a dictionary, and for each term add the docIDs to a postings list (sorted by docID). The inverted index is ready. (A minimal sketch of this appears after this list.)
- For larger document collections, use Blocked Sort-Based Indexing (BSBI): break the data into chunks that fit in memory, do the above for each chunk, and finally merge the chunk indexes (see the merge sketch after this list). [Comment: If you are using termIDs instead of terms, you may need an initial pass over the data to construct the dictionary. Alternatively, augment an existing dictionary while handling each successive chunk.]
- When the vocabulary is too large to fit in memory, use Single-Pass In-Memory Indexing (SPIMI): similar to BSBI, but it uses terms instead of termIDs, so the postings from different chunks are directly comparable at merge time. Also, instead of creating a stream of <term, docID> pairs, it maintains a dynamically allocated postings list for each term and adds new occurrences to it (sketch below). All chunk data structures are compressed before being written to disk.
- When everything is too large, use Distributed Indexing (using MapReduce): in the map phase, emit the <term, docID> pairs; the framework sorts these for you. In the reduce phase, create the postings lists. Note: the output of each mapper is partitioned into term segments, e.g. [a-g], [h-l], [m-r], [s-z]. One segment (say [a-g]) from all mapper outputs is sent to a single reducer. This way, that reducer can build the complete postings lists for all terms in its segment. (A toy sketch of the map/reduce functions is below.)
- For dynamic collections (documents being added, updated, or deleted), a simple solution is to reconstruct the index periodically.
- If changes need to be visible immediately, use an auxiliary index. Queries hit both indexes and the results are "combined" (see the sketch below). When the auxiliary index becomes too large, merge it with the main index. Comment: I did not understand the merging process described in the book. Given a single index file containing all postings, why do we need T/n merges?
- With positional information: <termID, docID, (pos1, pos2, ...)> (sketch below).
- With compression: store only the gaps between successive docIDs (this assumes the postings list is sorted); see the gap-encoding sketch below.
- With ranking information: docIDs stored in order of relevance to the term. Comment: this complicates postings list updates, since new documents can no longer simply be appended.
- With access control: represent each document by the set of users who may access it. Invert this "corpus" to get a postings list per user. For queries from a given user, intersect that user's postings list with the result list (see the intersection sketch below).
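Here is a minimal sketch of the basic in-memory construction from the first bullet. The toy tokenizer (lower-case and split on whitespace) and the tiny document collection are my own placeholders, not from the book.

```python
from collections import defaultdict

def build_index(docs):
    """docs: dict mapping docID -> text. Returns term -> sorted list of docIDs."""
    pairs = []
    for doc_id, text in docs.items():
        for term in text.lower().split():        # stand-in for real tokenization
            pairs.append((term, doc_id))
    pairs.sort()                                 # sort by term, then by docID
    index = defaultdict(list)
    for term, doc_id in pairs:
        postings = index[term]
        if not postings or postings[-1] != doc_id:   # skip duplicate docIDs
            postings.append(doc_id)
    return dict(index)

docs = {1: "new home sales top forecasts",
        2: "home sales rise in july",
        3: "increase in home sales in july"}
print(build_index(docs)["home"])   # [1, 2, 3]
```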
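For the BSBI merge step, one way to sketch it is a k-way merge of the per-chunk sorted runs of <term, docID> pairs. Here the runs are in-memory lists, where a real implementation would stream them from disk.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def merge_blocks(sorted_runs):
    """sorted_runs: iterables of (term, docID) pairs, each already sorted.
    Yields (term, postings) with docIDs sorted and de-duplicated."""
    merged = heapq.merge(*sorted_runs)               # k-way merge keeps global order
    for term, group in groupby(merged, key=itemgetter(0)):
        yield term, sorted({doc_id for _, doc_id in group})

run1 = [("home", 1), ("home", 2), ("sales", 1)]
run2 = [("home", 3), ("july", 2), ("sales", 2)]
for term, postings in merge_blocks([run1, run2]):
    print(term, postings)        # home [1, 2, 3] / july [2] / sales [1, 2]
```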
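SPIMI's per-chunk inversion, sketched under the simplifying assumption that a chunk fits in a Python dict; the compression step before writing to disk is left out.

```python
from collections import defaultdict

def spimi_invert(token_stream):
    """token_stream: iterable of (term, docID) pairs for one chunk.
    Returns the chunk's dictionary with postings lists, terms sorted for merging."""
    dictionary = defaultdict(list)
    for term, doc_id in token_stream:
        postings = dictionary[term]              # postings list grows dynamically
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)              # docIDs arrive in order, so just append
    # sort the terms only once, when the chunk is written out
    return dict(sorted(dictionary.items()))

chunk = [("home", 1), ("sales", 1), ("home", 2), ("july", 2)]
print(spimi_invert(chunk))   # {'home': [1, 2], 'july': [2], 'sales': [1]}
```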
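The map and reduce steps of the distributed approach, written as plain Python functions just to show the data flow; the function names and the partition helper are my own and do not correspond to a real MapReduce API.

```python
def map_phase(doc_id, text):
    """Emit (term, docID) pairs for one document."""
    return [(term, doc_id) for term in text.lower().split()]

def partition(term, segments=("ag", "hl", "mr", "sz")):
    """Route a term to the reducer owning its segment, e.g. [a-g]."""
    for i, seg in enumerate(segments):
        if seg[0] <= term[0] <= seg[1]:
            return i
    return len(segments) - 1                     # fallback for non-alphabetic terms

def reduce_phase(term, doc_ids):
    """Build the complete postings list for one term from all mappers' pairs."""
    return term, sorted(set(doc_ids))

print(partition("home"))                         # 1, i.e. the [h-l] reducer
print(reduce_phase("home", [3, 1, 2, 1]))        # ('home', [1, 2, 3])
```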
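Query-time combination for the auxiliary-index scheme: postings for a term are taken from both indexes and unioned. The dictionary-of-lists representation below is just an assumption for the sketch.

```python
def query(term, main_index, aux_index):
    """Combine the postings for `term` from the main and auxiliary indexes."""
    main = main_index.get(term, [])
    aux = aux_index.get(term, [])
    return sorted(set(main) | set(aux))

main_index = {"home": [1, 2, 3], "sales": [1, 2]}
aux_index = {"home": [7], "july": [2, 7]}       # recently added documents
print(query("home", main_index, aux_index))     # [1, 2, 3, 7]
print(query("july", main_index, aux_index))     # [2, 7]
```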
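A possible in-memory shape for positional postings, with a list of positions per (term, docID) pair; the nested-dict layout is illustrative, not the book's on-disk format.

```python
from collections import defaultdict

def build_positional_index(docs):
    """docs: docID -> text. Returns term -> docID -> list of term positions."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

idx = build_positional_index({1: "home sales rise home prices rise"})
print(dict(idx["home"]))   # {1: [0, 3]}
```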
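Gap encoding for the compression bullet: store the first docID and then the differences between consecutive docIDs. A real index would further compress the small gaps with variable-byte or gamma codes; this sketch only shows the gap transform.

```python
def encode_gaps(doc_ids):
    """doc_ids must be sorted; returns the first docID followed by the gaps."""
    gaps, prev = [], 0
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps

def decode_gaps(gaps):
    doc_ids, total = [], 0
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids

postings = [824, 829, 215406]
print(encode_gaps(postings))                      # [824, 5, 214577]
assert decode_gaps(encode_gaps(postings)) == postings
```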
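And the access-control bullet: each user has a postings list of documents they may see, and the query's result list is intersected with it. The linear-merge intersection below assumes both lists are sorted by docID.

```python
def intersect(p1, p2):
    """Standard intersection of two sorted postings lists."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

user_postings = {"alice": [2, 4, 7, 9], "bob": [1, 2, 3]}   # docs each user may access
query_results = [2, 3, 7, 8]
print(intersect(query_results, user_postings["alice"]))     # [2, 7]
```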