Some guidelines on how to create, test and add to the repository a new implementation of the fts_fuzzy_match algorithm in another language.
Table of Contents
Besides the original algorithm implementations by Forrest Smith, you'll find in this repository third implementations in various other languages. Depending on the target language you're porting the algorithm to, and the programming languages you're fluent in, you'll find some of these implementations more useful references to start with.
Before starting to work on your code, you're strongly advised to read the Reverse Engineering Sublime Text’s Fuzzy Match article by Forrest Smith, which inspired the fts_fuzzy_match algorithm in the first place.
Having read that, you are strongly advised to read the optimization notes provided by Philip Jones along with his C port of v0.2.0 (more on his C port further down).
Forrest Smith has created two versions of the algorithm:
- v0.1.0 — the original algorithm mentioned in the article.
- v0.2.0 — an improved version, based on feedback on the article from Sublime Text author, Jon Skinner.
You should consider implementing v0.1.0 of the algorithm first, and then start working on v0.2.0, since the latter is an extended version of the former, and it's easier to understand the updated version once you grasped its first incarnation.
If you do so, please add to this repository both version v0.1.0 and v0.2.0 of your ported algorithm, for they will be helpful references to other porters.
The original algorithm implementation by Forrest Smith can be found in the following folders:
/0.1.0/cpp/
— C++ algorithm v0.1.0/0.1.0/js/
— JavaScript algorithm v0.1.0/0.2.0/cpp/
— C++ algorithm v0.2.0
If you need the JavaScript implementation of v0.2.0, there's the implementation by @nrgwsth:
/0.2.0/js/
— JavaScript algorithm v0.2.0
Beside the original algorithm by Forrest Smith, mentioned above, you might also want to study the source code of the following ports:
/0.2.0/c/
— C algorithm v0.2.0 (slender)
The C port of algorithm v0.2.0 by Philip Jones contains some nice optimizations that render the code shorter and less entangled than the original, therefore easier to understand and port to other languages — especially non OOP languages.
Furthermore, the author kindly documented his code enhancements, explaining in detail how his changes improve upon the original and why his approach is simpler to understand and implement.
While porting the algorithm, try to stay as close to the original implementation as possible, and resist temptations to add your own improvements. The goal of this project is to present ports of the original algorithms, as they are, to various other languages. That said, any adaptations of the original code in order to adhere to the target language idiomatic approach is desirable, including optimization which come natural due to syntax features specific to the target language.
There's obviously room for a lot of improvements in the original fts_fuzzy_match algorithm, but its beauty and success lie in its simplicity. It's an entry level algorithm and introduction to fuzzy searching and matching, and it's meant as a starting point to get your feet wet with fuzzy matching, without the complexity of more advanced algorithms.
If you've created improved versions of the base algorithm, I'd love to add them to the project, just not in the fts_fuzzy_match/
folder, but in a new dedicated section of the project instead, created ad hoc for optimized and improved versions, and some documentation explaining their benefits and implementation details.
In order to test that your ported implementation of the algorithm works as expected, I've come up with the following testing strategy.
Every algorithm implementation should provide a test file that:
- Invokes the
fuzzy_match()
scored function against every entry in theue4_filenames.txt
dataset, passing "LLL" as the search pattern. - For every matching call, write to the
test_results.txt
file a string containing:- <match score> +
|
+ <matched entry> +\n
- <match score> +
- The test ends when 100 matches have been written to
test_results.txt
, or the dataset entries are exhausted (if the latter occurs, then there's a problem in your algorithm).
The resulting test_results.txt
file can then be diffed with the equivalent reference file generated by the original algorithm of the same version, if they are identical the test has passed:
Both expected_results.txt
files are copies of the test_results.txt
files generated by the original C++ algorithms by Forrest.
For examples of test code implementations and results validation, see:
- JavaScript:
0.1.0/js/node-test.js
— test code that generatestest_results.txt
.0.1.0/js/test.sh
— runs the test code then diffstest_results.txt
againstexpected_results.txt
.
- C++:
0.2.0/cpp/test.cpp
— test code that generatestest_results.txt
.
Note that the test code and execution script for any given language are identical for both v0.1.0 and v0.2.0 of the algorithm, only the internal implementation of the algorithm changes. If you're implementing an algorithm in a language for which there's already an implementation in either version (v0.1.0 or v0.2.0), you can just copy and reuse its test code and script, if available.