Possible Future Directions

dburry edited this page Nov 4, 2012 · 20 revisions

Here are a few ideas I have for how this gem could be further improved:

Use an acts_as-style API, instead of declaring methods

Benefits:

  • It gives a shorter, cleaner look in your model.

  • This is something many Rails developers are familiar with; it has become a de facto standard.

  • It is easier to provide sensible defaults that you opt into simply by omitting parameters, instead of cluttering your model definition up with things like scope :search_index_scope, lambda {}

Drawbacks:

  • Currently, if you do not need to update your index live as it changes, you do not need to mix any search code into your model classes in the running web app at all; you can just do it in the rake task that creates/updates the index. But I’m not sure this actually saves any resources, given the way Bundler/Rails loads the universe for you anyway (unless we split index reading into a separate gem from index writing).
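As a rough illustration, an acts_as-style macro might look something like this (the macro and option names here are hypothetical, not the gem’s actual API):

```ruby
# Hypothetical acts_as-style macro sketch. Nothing here is the gem's real
# API; it only illustrates how a one-line declaration could replace several
# method definitions, with defaults filled in for omitted parameters.
module ActsAsSearchIndexable
  def acts_as_search_indexable(fields: [], scope: nil, rank_multiplier: 1)
    @search_index_config = {
      fields: fields,
      scope: scope,                     # nil means "index everything"
      rank_multiplier: rank_multiplier  # default supplied automatically
    }
  end

  attr_reader :search_index_config
end

# In a model, one declaration instead of several method definitions:
class Article
  extend ActsAsSearchIndexable
  acts_as_search_indexable fields: [:title, :body]
end
```

A declaration like this keeps the model terse while still letting power users override the scope or ranking options when the defaults don’t fit.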

Support different indexing paradigms

Ideally both a boolean index and a proximity index. The former is the only one currently supported; the latter would be somewhat larger for the same data set (see: github.com/dburry/indexed_search/wiki/Proximity-Searches).

This could be done by splitting the index reading/building process into separate gems based on the type of index, with a common internal API that the main gem unifies. That way you only carry the code you need, since in many cases it may not make sense to have both index styles at once in the same app.

We could perhaps also support both kinds of indexes at once, each for different kinds of model data. For example, it might be desirable to have exact full-phrase matching or similarity for the title or name of an object, yet cheaper boolean matching for everything else about it.
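To make the split concrete, here is a sketch of what such a common internal API might look like (class and method names invented for illustration):

```ruby
# Illustrative sketch only: each index style, shipped as its own gem, would
# implement the same small interface, and the main gem would talk only to
# that interface.
class BooleanIndex
  def initialize
    @entries = Hash.new { |h, k| h[k] = [] }
  end

  # A boolean index records only that a word occurs in a record.
  def add(word, record_id)
    @entries[word] << record_id
  end

  def lookup(word)
    @entries[word]
  end
end

# A proximity index would satisfy the same interface but also store word
# positions, making it larger yet able to answer phrase/nearness queries.
class ProximityIndex
  def initialize
    @entries = Hash.new { |h, k| h[k] = [] }
  end

  def add(word, record_id, position = 0)
    @entries[word] << [record_id, position]
  end

  def lookup(word)
    @entries[word].map(&:first)
  end
end
```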

Support different querying methods

Yes, this means different dialects of operators and such… Even though I’ve ranted about how this shouldn’t be necessary (github.com/dburry/indexed_search/wiki/Operator-Affection), I do recognize that someday someone may need it anyway.

Again I imagine this would be provided by separate add-on gems when the time comes.

Support different databases, not just MySQL!

This one is fairly important to me; I just haven’t yet taken the time. While the gem mostly tries to limit itself to common Arel constructs, there are a couple of places where we had to stoop to database-specific code to make everything work properly and keep it from becoming ridiculously slow.

Write a “similarity” matcher

See:

Note that we would need to index the letter pairs, so that full word-table lookups wouldn’t be necessary to find good matches. Therefore we cannot use the text gem’s version, and must write our own implementation.
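A minimal sketch of the letter-pair approach (a Dice coefficient over letter pairs, similar in spirit to what the text gem computes; the function names here are ours):

```ruby
# Break a word into overlapping letter pairs ("word" => wo, or, rd).
# Indexing these pairs per word would let us find candidate matches by
# shared pairs instead of scanning the whole words table.
def letter_pairs(word)
  (0...word.length - 1).map { |i| word[i, 2] }
end

# Dice coefficient over letter pairs: 2 * shared / (pairs_a + pairs_b).
def pair_similarity(a, b)
  pa = letter_pairs(a)
  pb = letter_pairs(b)
  total = pa.size + pb.size
  return 0.0 if total.zero?
  remaining = pb.dup
  shared = pa.count do |pair|
    if (i = remaining.index(pair))
      remaining.delete_at(i)
      true
    end
  end
  2.0 * shared / total
end
```

For example, pair_similarity("france", "french") comes out to 0.4, since the two words share the pairs fr and nc out of ten pairs total.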

Write an “edit_distance” matcher

See:

To avoid full word table lookups to find the matches, we cannot actually use the gem, and must write our own implementation, such as generating a condensed version of this SQL:

SELECT *
FROM words
WHERE
-- added 1 character
word LIKE '_word' OR word LIKE 'w_ord' OR word LIKE 'wo_rd' OR word LIKE 'wor_d' OR word LIKE 'word_'
OR
-- deleted 1 character
word IN ('ord', 'wrd', 'wod', 'wor')
OR
-- substituted 1 character
word LIKE '_ord' OR word LIKE 'w_rd' OR word LIKE 'wo_d' OR word LIKE 'wor_'
OR
-- transposed 2 adjacent characters
word IN ('owrd', 'wrod', 'wodr')

It may also be possible that we’d have to index the results of that, to get appropriate performance (experimentation will tell).
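The condensed SQL above could be generated along these lines (a sketch; the method name is invented):

```ruby
# Generate all edit-distance-1 patterns for a word: LIKE patterns for
# insertions and substitutions, literal lists for deletions and adjacent
# transpositions, mirroring the hand-written SQL above.
def edit_distance_1_conditions(word)
  n = word.length
  insertions     = (0..n).map  { |i| word[0, i] + "_" + word[i, n - i] }
  substitutions  = (0...n).map { |i| word[0, i] + "_" + word[i + 1, n] }
  deletions      = (0...n).map { |i| word[0, i] + word[i + 1, n] }.uniq
  transpositions = (0...n - 1).map do |i|
    word[0, i] + word[i + 1] + word[i] + word[i + 2, n]
  end.uniq
  { like: insertions + substitutions, in: deletions + transpositions }
end
```

From the returned hash, each LIKE pattern becomes a word LIKE '…' clause and each literal list becomes a word IN (…) clause, all ORed together as in the SQL above.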

Reference models by name/symbol, instead of an integer id

I originally made the design decision to match every model class to a unique integer number (see: IndexedSearch::Index.models_by_id), instead of using an ActiveRecord-STI-style “type” column to know what model each match is for. This was to save a lot of space (and a little time) and be more efficient, especially in large indexes.

However, this has led to a barrier to getting started with this gem, because there is extra setup involved. It also makes things more error-prone as a user adds models, because they might forget to update the mapping in the initializer file. And frankly, most people simply don’t have indexes large enough to need this optimization!

So… if I were to reverse this design decision by default and do it the way STI does, then a database-specific column type (like MySQL’s ENUM) could still be used by the few users who need to squeeze a bit of extra efficiency out of their database. That way, only those who need the optimization would feel the pain of maintaining the mapping (in the form of a migration that changes the column type to add a new ENUM value each time a new model is added to the index).

And, as a nice side effect, we’d also be one step closer to being able to function just fine with defaults with a blank config initializer file.
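The difference between the two designs can be sketched in a few lines (the data shapes are hypothetical; only IndexedSearch::Index.models_by_id is the gem’s real construct):

```ruby
# Current design: every indexed model maps to a unique integer, roughly
# what IndexedSearch::Index.models_by_id provides, and the mapping must be
# maintained by hand in an initializer.
MODELS_BY_ID = { 1 => "Article", 2 => "Comment" }.freeze

def model_for_match_current(row)
  # Raises if a newly indexed model was never added to the mapping.
  MODELS_BY_ID.fetch(row[:model_id])
end

# STI-style design: the match row stores the class name itself, so there
# is no mapping to forget; MySQL users wanting compactness could later
# migrate this column to an ENUM.
def model_for_match_sti_style(row)
  row[:model_type]
end
```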

Support “whole field” matches

We need to make sure an exact match on the title or name of a record ranks first. Currently the ranking numbers on the word matches aren’t always enough to make exact title matches show up first if things add up enough in other ways, especially if a longer title also matches all of the same terms.

There has also been a need to match whole fields in general, not just titles. So perhaps the same principle could be applied more generally.

One possible way of implementing this is to “mark” name/title words as special in some way, then count your special word matches and compare with the total number of special words for that record. This would give you a cheap way of detecting one full-field exact match without adding a whole new index.
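A cheap version of that counting scheme, sketched with invented names:

```ruby
# If every "special" (title) word of a record matched the query, the whole
# title matched exactly (ignoring word order); the ratio could also serve
# as a rank boost for near-complete title matches. Note Array#& ignores
# duplicate words, which is fine for a rough sketch.
def title_match_ratio(query_words, title_words)
  return 0.0 if title_words.empty?
  matched = (title_words & query_words).size
  matched.to_f / title_words.size
end
```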

Of course another more generic way to solve this is to support proximity indexing on just the name/title field. This would open up a whole variety of options for full name/title similarity matching and ranking.

Support multiple matches within different fields or groups of fields

There has been a need to specify that a certain query should match only within a certain field of the target records.

This would be useful, for example, when detecting records similar to a given one on a field-by-field basis. The current workaround of looping through the top matches and examining the fields again is not optimal.

This could also be more generic by specifying named buckets or groups in the indexing, instead of just by field.

Support indexing the value of some custom fields for scoping

The idea is to scope before searching, so that you can better limit search results when you only want to search certain portions of the index rather than all of it. You can already sort of do this by model, but not efficiently at any finer grain (you’d currently have to do a pre-lookup and scope by an id list, which is not efficient for large data sets).

This would be useful for access control, for example you could index whatever criteria you use to control access, and then limit your searches to that based on your user roles.

Note this is not about searching and matching against what’s in those fields (there’s another entry for that), this is just about scoping to permit/deny what kind of records we’re going to match against.
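One hypothetical shape this could take (all names illustrative): store a scoping value alongside each index entry, then constrain queries by it instead of pre-looking-up an id list.

```ruby
class ScopedIndex
  def initialize
    @entries = [] # [word, record_id, scope_value]
  end

  # Index an arbitrary scoping value (e.g. an access-control group) with
  # each entry at indexing time.
  def add(word, record_id, scope_value)
    @entries << [word, record_id, scope_value]
  end

  # Entries outside the allowed scopes are never even considered, so no
  # per-search pre-lookup by id list is needed.
  def search(word, allowed_scopes)
    @entries.select { |w, _id, s| w == word && allowed_scopes.include?(s) }
            .map { |_w, id, _s| id }
  end
end
```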

Support a “certainty percentage” instead of just a “rank”

There has been a need to detect how “close” a match is, and to limit which results are displayed based on that metric.

In principle this is unlike the typical Google-search approach, where you display as many results as you can and simply rank them really well. But there are certain applications where it may be useful, for example in trying to detect slightly-different-but-likely-duplicate entries of your model.

There may also be a need to configure how strict this is, on a search-by-search basis. If you wanted to block a user from entering a likely duplicate without further review, you might want it very strict (i.e. only block if a very similar record exists)… but if you just wanted to offer suggestions, you might want it less strict (i.e. offer a somewhat wider selection of suggestions than would actually block, but not pages of them).
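A sketch of what that could look like: normalize each result’s rank against the best possible rank for the query, then apply a per-search threshold (names invented):

```ruby
# results: array of [record_id, rank] pairs. Certainty is rank as a
# fraction of the maximum rank the query could possibly score; a strict
# duplicate check might use min_certainty 0.9, looser suggestions 0.5.
def filter_by_certainty(results, max_possible_rank, min_certainty)
  results.filter_map do |id, rank|
    certainty = rank.to_f / max_possible_rank
    [id, certainty] if certainty >= min_certainty
  end
end
```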