hymloth edited this page Apr 29, 2012 · 1 revision

We distinguish between external document ids and internal ones. An external id can be of any length (for example a MongoDB ObjectId or a UUID), so to save space we map each one to a short internal id.
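The page does not say how internal ids are generated. As a minimal sketch, assuming a simple incrementing counter rendered in base 62, the mapping could look like this (`IdMapper` and `encode_base62` are hypothetical names, not part of the project):

```python
import string

# 62 symbols: 0-9, a-z, A-Z (an assumed alphabet, chosen only for compactness)
ALPHABET = string.digits + string.ascii_letters


def encode_base62(n: int) -> str:
    """Render a non-negative integer as a short base-62 string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


class IdMapper:
    """Hypothetical two-way map between external ids and short internal ids."""

    def __init__(self) -> None:
        self._next = 0
        self._to_internal = {}
        self._to_external = {}

    def internal_id(self, external_id: str) -> str:
        # Assign the next counter value the first time an external id is seen.
        if external_id not in self._to_internal:
            short = encode_base62(self._next)
            self._next += 1
            self._to_internal[external_id] = short
            self._to_external[short] = external_id
        return self._to_internal[external_id]

    def external_id(self, internal_id: str) -> str:
        return self._to_external[internal_id]


mapper = IdMapper()
short = mapper.internal_id("4f9d1c2e8a7b6c5d4e3f2a1b")  # made-up 24-character id
```

With this scheme a 24-character external id is stored in the index as a one- or two-character string until the collection grows past a few thousand documents.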

We keep sorted sets of the form ( term: [(doc_id, term_frequency_in_this_doc_id),...] ). This way, we can intersect those sets and calculate the tf-idf score on the fly by providing WEIGHTS based on each term's document frequency, which is simply the cardinality of that term's sorted set.
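The weighted intersection can be illustrated without any data store. The sketch below models each sorted set as a plain dictionary of `doc_id -> term frequency` and sums `weight * tf` across the terms, which is the behaviour the WEIGHTS wording suggests (it resembles the `WEIGHTS` option of Redis's `ZINTERSTORE`, whose default aggregate is a sum). The choice of `log(N / df)` as the weight and the name `tfidf_intersection` are assumptions, not taken from the project:

```python
import math


def tfidf_intersection(postings, terms, total_docs):
    """Rank documents containing every term by the sum of tf * idf.

    postings: {term: {doc_id: term_frequency}}, standing in for the sorted sets.
    """
    sets = [postings.get(term, {}) for term in terms]
    if not sets or any(not s for s in sets):
        return []  # a missing term makes the intersection empty

    # Document frequency is the cardinality of each term's set.
    weights = [math.log(total_docs / len(s)) for s in sets]

    # Only documents present in every set survive the intersection.
    common = set(sets[0]).intersection(*sets[1:])
    scored = [
        (doc, sum(w * s[doc] for w, s in zip(weights, sets)))
        for doc in common
    ]
    # Highest score first; ties are broken by doc id for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


postings = {
    "redis": {"d1": 2, "d2": 1, "d3": 1},
    "search": {"d1": 1, "d3": 3},
}
ranking = tfidf_intersection(postings, ["redis", "search"], total_docs=4)
```

Here `d2` drops out because it lacks "search", and `d3` outranks `d1` because the rarer term "search" carries the larger weight and occurs three times in `d3`.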

To do proximity ranking, we keep hashes of the form: term:{ doc_id: positions, doc_id: positions }.
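The page does not describe the proximity formula or how positions are serialized. As one possible illustration, assuming positions are available as sorted lists of integer word offsets, a score can be derived from the smallest gap between any occurrence of two terms in the same document (`min_gap` and `proximity_score` are hypothetical helpers):

```python
def min_gap(a, b):
    """Smallest absolute distance between two sorted position lists."""
    i = j = 0
    best = None
    while i < len(a) and j < len(b):
        gap = abs(a[i] - b[j])
        if best is None or gap < best:
            best = gap
        # Advance whichever list currently points at the smaller position.
        if a[i] < b[j]:
            i += 1
        else:
            j += 1
    return best


def proximity_score(positions, term_a, term_b, doc_id):
    """Return a value in (0, 1]; closer terms score higher, 0.0 if either is absent.

    positions: {term: {doc_id: [word offsets]}}, standing in for the hashes.
    """
    pos_a = positions.get(term_a, {}).get(doc_id)
    pos_b = positions.get(term_b, {}).get(doc_id)
    if not pos_a or not pos_b:
        return 0.0
    return 1.0 / (1 + min_gap(pos_a, pos_b))


positions = {
    "redis": {"d1": [0, 7], "d3": [4]},
    "search": {"d1": [8], "d3": [20]},
}
close = proximity_score(positions, "redis", "search", "d1")  # adjacent words
far = proximity_score(positions, "redis", "search", "d3")    # 16 words apart
```

Such a score could be combined with the tf-idf score from the sorted sets, though how the project actually blends the two is not stated here.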

We also keep similar hashes for the terms' positions in the title, as well as simple sets (term:(doc_ids..)) to perform intersections.
