Scala has multiple assets that make it a good language for this small project: immutability and surety, functional paradigm, lambdas, parallel programming features, object oriented programming, portability, etc.
To compile and run the project, the best way is to use sbt, the Scala build tool. The project is written in Scala 2.13.1 and was tested it with sbt 1.2.8 but any of the greater should do the work.
You can download sbt from Scala website
Installing sbt is the best way to install Scala as pointed out by the Scala download page
If you have not installed sbt before, the installation will probably take some time in order to deploy all the dependencies on your system.
- To check if the project compile from the command line, go to the root of project with the command line and enter:
$sbt compile
- To run the project direclty from sbt in the root folder, enter:
$sbt run
- To produce a jar from sbt in the root folder and run the tests, enter:
$sbt assembly
Then, navigate to the project folder and execute the jar with Scala:
$ cd target/scala-X-XX
$ scala search-assembly-X-X.jar path/of/your/folder/of/txt/files
If everything has gone well you should see the following prompt:
Working dir: /path/to/the/project/target/scala-2.13
Please enter a command -->>
| load: ## Load all resources from the resources folder, no parameters.
| index: name path ## New document from file.
| search: List, of, strings ## Exact match. None alphanumeric chars are separator.
| fuzzy: List, of, strings ## Fuzzy search. Based on Levenshtein distance.
| quit: ## Quit the app.
|
If you choose this method to run the program, then you will have to set your own folder. The default folder of the app is for test purpose and work only when executed from sbt at the root folder of the project (or from an IDE).
To set and load the default resources folder then execute from the target/scala.X.XX folder
$ cd target/scala.X.XX
$ scala search-assembly-X-X.jar ../../resources
$
$ Log...
The CLI only understand and execute 5 commands:
- load : Will load the files from the folder passed as first parameter to the application. This is normally done by default at startup time, but it is possible to re-index
- index : index a new document placed at a specific path.
- search: Search a list of words with an exact match.
- fuzzy : Proceed a fuzzy search. This may take a long time if one of the word in the given list is complex. For small words, it should be relatively usable.
- quit : Quit the program.
A help prompt is given when a command fails.
The global design is quite simple. The entry point is in the class com.search.Search
, then CLI
is executed and can be found at com.search.api.CLI
The CLI interact with the engine described at com.search.engine.Engine which interact with model.
The model can be found at com.search.model
and contains three classes:
- SearchTree: a generic binary search tree
- Token: generic wrapper that hold the data manipulated by the tree
- StringItem: a specialization of the
Token
class dedicated to String. It has been designed to use Java String.
A very simple sequence diagram of a search would look like this:
CLI --> Engine --> SearchTree --> Token
Normally, only the engine is able to interact with the SearchTree and Token objects. The CLI would be the view in a MVC, the engine would be the controller, and the classes in the model package, the model.
The code is documented and contains some notes for further improvements. The override annotated methods are generally not commented.
Tests are given in order to provide clues on how the API of each class is intended to be used.
Implementing a good search engine is not a trivial thing, and I will not pretend that this program is well-designed. It will surely not be suitable for production purposes and is largely perfectible.
Some clues are (for an in memory storage):
- Reimplement the search tree to a red / black tree,
- Change the approach of proceeding a search (scalability, security, etc.),
- Add reference object to a document for each indexed token,
- Refactor the engine to proceed search on a global index,
- Implement a command pattern for the API package,
- Implement a sharding algorithm for distributed search,
- And much more...