Architecture Overview

dburry edited this page Nov 5, 2012 · 11 revisions

For the generally curious, and for those who wish to help, here’s an overview of how this library works internally. Note that these internals are subject to change. This document does not list every detail; for that, use the rdoc pages generated from the code comments (or simply read the source). It focuses on how the pieces fit together.

The Index Data

The index consists of two database-backed models, generated into your app with the provided migration:

  • Word index (IndexedSearch::Word) - every unique word in the index, after parsing, case folding, etc. Also includes cached information about each word that makes querying faster and easier (such as its soundex value, how frequently the word is found, etc.).

  • Result entry index (IndexedSearch::Entry) - many-to-many relationship between words and app model rows. One row per word, per app model, per app model row. If a row exists, the word is there; otherwise it’s not. It’s a boolean index. Also includes some cached information about each hit, such as a score for how important each word is in each app model row (being in the name of the object is better than only being in the description, for example, and being in both is better still).
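
To make the "boolean index" idea concrete, here is a minimal plain-Ruby sketch of the two tables. It does not use the library: the Struct names, field names, and sample data are all assumptions made for illustration, and the real models are database-backed and may store different columns.

```ruby
# Plain-Ruby stand-ins for the two index tables (hypothetical field names).
Word  = Struct.new(:id, :word, :entry_count)
Entry = Struct.new(:word_id, :model_type, :model_id, :rank)

words = [
  Word.new(1, 'red',   2),
  Word.new(2, 'apple', 1)
]

entries = [
  Entry.new(1, 'Product', 10, 5), # "red" in Product 10 (name and description)
  Entry.new(1, 'Product', 11, 1), # "red" in Product 11 (description only)
  Entry.new(2, 'Product', 10, 3)  # "apple" in Product 10
]

# Boolean index: a word is in a record if and only if an entry row exists.
def indexed?(entries, word, model_type, model_id)
  entries.any? do |e|
    e.word_id == word.id && e.model_type == model_type && e.model_id == model_id
  end
end

red, apple = words
red_in_11   = indexed?(entries, red,   'Product', 11) # an entry row exists
apple_in_11 = indexed?(entries, apple, 'Product', 11) # no entry row
```

The score lives on the entry row rather than on the word, because the same word can matter more in one record than in another.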

Search Lookup Process

Looking up and displaying results is a four-step process internally:

  • Parse a literal query string into words (search terms). For example:

query = IndexedSearch::Query.new('whole string of whole words')
# query => ['whole', 'string', 'of', 'words'] (similar to an array, but with extra methods)

  • Look in the IndexedSearch::Word model table to find which indexed words match the parsed query. Internally this uses all the enabled matchers (IndexedSearch::Match::Base subclasses, enabled by being listed in IndexedSearch::Match.perform_match_types). For example:

results = IndexedSearch::Match::ResultList.new(query)
# or, equivalently, the shortcut that caches the results in the query object:
results = query.results
# results => similar to an array of IndexedSearch::Match::Result

  • Query the big IndexedSearch::Entry model table to find the actual app model matches. For example:

scope = IndexedSearch::Entry.matching_query(results).ranked_rows(results).paged(page_size, page_number)
# or, equivalently, the shortcut:
scope = IndexedSearch::Entry.find_results(query, page_size, page_number)
# scope => an entry table scope that has one row per found result

  • Loop over the resulting models and display them. Here’s a simplified version:

scope.each { |s| puts s.model.to_s }
# models are cached in the entry objects, so you can call s.model multiple times without pain
# note that the scope can be lazily loaded at the last possible moment, in your view
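
The four steps above can be imitated end to end with a toy in-memory index. This is only a sketch under stated assumptions: the constants, method names, and sample data below are illustrative and are not the library's API. It uses exact word matching only, whereas the library can apply several matchers.

```ruby
# Toy in-memory index; every name here is illustrative, not the library's API.
WORDS = { 'whole' => 1, 'string' => 2, 'words' => 3 } # word => word id
ENTRIES = [                                            # [word id, record, score]
  [1, 'Doc A', 2], [2, 'Doc A', 1],
  [1, 'Doc B', 1], [3, 'Doc B', 4]
]

# Step 1: parse the query string into unique, case-folded words.
def parse_query(string)
  string.downcase.scan(/[a-z0-9]+/).uniq
end

# Step 2: find which indexed words match (exact matching only in this sketch).
def match_words(terms)
  terms.map { |t| WORDS[t] }.compact
end

# Step 3: sum entry scores per record, rank by total score, then page.
def find_results(word_ids, page_size, page_number)
  totals = Hash.new(0)
  ENTRIES.each do |word_id, record, score|
    totals[record] += score if word_ids.include?(word_id)
  end
  ranked = totals.sort_by { |record, score| [-score, record] }
  ranked.each_slice(page_size).to_a[page_number - 1] || []
end

terms    = parse_query('whole string of whole words')
word_ids = match_words(terms)          # 'of' is not indexed, so it is dropped
page     = find_results(word_ids, 10, 1)

# Step 4: display the results.
page.each { |record, score| puts "#{record} (score #{score})" }
```

Splitting word matching (step 2) from entry lookup (step 3) means the small word table is consulted first, and the much larger entry table is only queried for word ids already known to match.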

Search Indexing Process

The search index can be built in one of two ways: manually via rake tasks, or automatically as your data changes via indexers (which are really a kind of observer). Both ways use the mixin methods from the IndexedSearch::Index class:

  • Foo.create_search_index - creates a new index where one does not exist yet (do NOT call if index data already exists for this model/scope/record, or you will get bogus duplicate data).

  • Foo.update_search_index - reindexes an existing index (also creates it if it doesn’t exist, and makes no changes if it is already up to date).

  • Foo.delete_search_index - deletes an index (makes no changes if it doesn’t exist).

All of these can be called in the following ways:

  • class level - to operate on an entire app model at one time

  • instance level - to operate on just one row

  • scope level - to operate on any arbitrary group of rows

Internally, creating and updating both call search_index_info on your app model, run each item it returns through IndexedSearch::Query.split_into_words, and add up the scores for every instance of every word into one master score per unique word for the whole record. Updating additionally reads the existing index, works out which modifications are needed, and efficiently makes only those changes. Deleting is simply an efficient removal of all relevant index data. Transactions keep changes to the two index tables in sync.
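
The score summing and the "only the changes necessary" update can be sketched in plain Ruby. This is not the library's code. The input shape (an array of [text, score] pairs) is an assumption about what a model might supply and is not the documented return value of search_index_info. The method names word_scores and index_changes are hypothetical, and the word splitting is a simple regex stand-in for IndexedSearch::Query.split_into_words.

```ruby
# Hypothetical input: [text, score] pairs. A higher score marks a more
# important field (for example, a name outranks a description).
def word_scores(pairs)
  scores = Hash.new(0)
  pairs.each do |text, score|
    # Every occurrence of a word adds that field's score to the word's total.
    text.downcase.scan(/[a-z0-9]+/).each { |word| scores[word] += score }
  end
  scores
end

# Compare a freshly computed index with the stored one and return only the
# changes needed, so an up-to-date record results in no writes at all.
def index_changes(existing, fresh)
  {
    insert: fresh.reject { |word, _| existing.key?(word) },
    update: fresh.select { |word, score| existing.key?(word) && existing[word] != score },
    delete: existing.keys - fresh.keys
  }
end

fresh    = word_scores([['Red Apple', 10], ['A crisp red apple from Oregon', 1]])
existing = { 'red' => 11, 'apple' => 10, 'green' => 1 } # what is stored now
changes  = index_changes(existing, fresh)
```

In this example "red" is untouched because its stored score already matches, "apple" is updated, "green" is deleted, and the remaining new words are inserted. In the library these writes would be wrapped in a transaction, as described above.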