Internals
hymloth edited this page Apr 29, 2012 · 1 revision
We distinguish between external and internal document ids. An external id can be of any length (for example, a MongoDB ObjectId or a UUID), so to save space we encode each one internally as a small id.
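The page does not say how the encoding is done. As a minimal sketch, assuming internal ids are simply consecutive integers handed out on first sight, the mapping could look like the following. The `IdEncoder` class and its method names are hypothetical and used only for illustration.

```python
import itertools


class IdEncoder:
    """Hypothetical helper: maps arbitrary external ids to small integer ids."""

    def __init__(self):
        # Assumption: internal ids are consecutive integers starting at 1.
        self._counter = itertools.count(1)
        self._to_internal = {}
        self._to_external = {}

    def encode(self, external_id):
        """Return the internal id for external_id, allocating one if needed."""
        internal = self._to_internal.get(external_id)
        if internal is None:
            internal = next(self._counter)
            self._to_internal[external_id] = internal
            self._to_external[internal] = external_id
        return internal

    def decode(self, internal_id):
        """Return the external id that was encoded as internal_id."""
        return self._to_external[internal_id]


encoder = IdEncoder()
a = encoder.encode("4f9d2c1e8b3a7d0012345678")              # ObjectId-like string
b = encoder.encode("123e4567-e89b-12d3-a456-426614174000")  # UUID-like string
```

Keeping the reverse mapping lets search results be translated back to external ids before they are returned to the caller.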
We keep sorted sets of the form ( term: [(doc_id, term_frequency_in_this_doc_id),...] ). This way, we can intersect those sets and calculate the tf-idf score on the fly by providing WEIGHTS (the terms' document frequencies, each of which is simply the cardinality of the corresponding sorted set).
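The wording above resembles a weighted sorted-set intersection such as Redis `ZINTERSTORE ... WEIGHTS`, but the page does not name the store, so the sketch below emulates the idea with plain dictionaries. The function name `tfidf_intersect`, the sample data, and the idf formula `log(N / df)` are assumptions made for illustration, not the project's actual scoring.

```python
import math


def tfidf_intersect(postings, terms, total_docs):
    """Score docs that contain every term as the sum of tf * idf per term."""
    if not terms:
        return {}
    weights = {}
    for term in terms:
        # Document frequency is the cardinality of the term's posting set.
        df = len(postings.get(term, {}))
        if df == 0:
            return {}  # a missing term makes the intersection empty
        weights[term] = math.log(total_docs / df)  # assumed idf formula
    # Only documents present in every posting set survive the intersection.
    common = set.intersection(*(set(postings[t]) for t in terms))
    return {
        doc: sum(postings[t][doc] * weights[t] for t in terms)
        for doc in common
    }


# term -> {doc_id: term_frequency_in_this_doc_id}
postings = {
    "apple": {1: 3, 2: 1, 4: 2},
    "pie": {1: 1, 4: 5},
}
scores = tfidf_intersect(postings, ["apple", "pie"], total_docs=8)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Here each stored score (the term frequency) is multiplied by its term's weight and the products are summed per document, which is what a weighted intersection with sum aggregation computes.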
To do proximity ranking, we keep hashes of the form: term:{ doc_id: positions, doc_id: positions }.
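How the positions are used for ranking is not spelled out here. One common approach, sketched below under the assumption that `positions` is a sorted list of integer token offsets, ranks documents by the smallest gap between occurrences of two query terms. The names `min_distance` and `proximity_rank` are hypothetical.

```python
def min_distance(positions_a, positions_b):
    """Smallest absolute gap between two sorted lists of token positions."""
    i = j = 0
    best = float("inf")
    while i < len(positions_a) and j < len(positions_b):
        best = min(best, abs(positions_a[i] - positions_b[j]))
        # Advance the pointer that sits at the smaller position.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return best


def proximity_rank(positions, term_a, term_b):
    """Order docs containing both terms by how close the terms appear."""
    shared = positions[term_a].keys() & positions[term_b].keys()
    gaps = {
        doc: min_distance(positions[term_a][doc], positions[term_b][doc])
        for doc in shared
    }
    return sorted(gaps, key=gaps.get)


# term -> {doc_id: positions}
positions = {
    "apple": {1: [0, 14], 4: [3]},
    "pie": {1: [15], 4: [40, 90]},
}
order = proximity_rank(positions, "apple", "pie")
```

In this sample, document 1 has the two terms adjacent (positions 14 and 15), so it ranks ahead of document 4, where the closest occurrences are 37 tokens apart.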
We also keep similar hashes for the terms' positions in the title, as well as simple sets (term:(doc_ids..)) to perform intersections.
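The simple sets can be intersected to find the documents that contain every query term before any scoring happens. The sketch below is one possible way to do that with plain Python sets; `candidate_docs` and the sample data are hypothetical, and intersecting the smallest set first is an optimisation assumed here rather than something the page describes.

```python
def candidate_docs(term_sets, terms):
    """Return the doc ids that appear in the set of every query term."""
    # Start from the smallest set so intermediate results stay small.
    sets = sorted((term_sets.get(t, set()) for t in terms), key=len)
    if not sets:
        return set()
    result = set(sets[0])
    for other in sets[1:]:
        result &= other
        if not result:
            break  # no document can match once the intersection is empty
    return result


# term -> set of doc ids
term_sets = {
    "apple": {1, 2, 4},
    "pie": {1, 4},
    "crust": {4, 7},
}
matches = candidate_docs(term_sets, ["apple", "pie", "crust"])
```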