
Search engine for semi-structured data (text and structured data) that provides all kinds of intelligent search features (keyword search, autocompletion, faceted search, error-tolerant search, synonym search, semantic search) very efficiently also on very large data.


hannahbast/completesearch

 
 


CompleteSearch


CompleteSearch is a fast and interactive search engine for context-sensitive prefix search on a given collection of documents. It provides not only search results, like a regular search engine, but also completions for the last (possibly only partially typed) query word that lead to a hit. This can be used to provide very efficient support for a variety of features: query autocompletion, faceted search, synonym search, error-tolerant search, semantic search. A list of publications on the techniques behind CompleteSearch and the many applications is provided at the end of this page.

For a demo on various datasets, just check out this repository and follow the instructions below. With a single command line, you get a working demo on one of a selection of datasets (each with a few million documents, so not particularly large, but also not small). CompleteSearch easily scales to collections with tens or even hundreds of millions of documents without losing its interactivity.

1. Checkout

Check out the repository:

git clone https://github.com/ad-freiburg/completesearch
cd completesearch

2. Demos

Just run the following command line, where for the value of DB you can choose between a number of demo datasets (one for every subdirectory of applications). A generic UI will then be available under the specified PORT. Note that the CompleteSearch backend provides an API for answering search and completion queries and serves as a simple HTTP server at the same time.

    export DB=movies && PORT=1622 && docker build -t completesearch . && docker run -it --rm -e DB=${DB} -p ${PORT}:8080 -v $(pwd)/applications:/applications -v $(pwd)/data/:/data -v $(pwd)/ui:/ui --name completesearch.${DB} completesearch -c "make DATA_DIR=/data/${DB} DB=${DB} csv pall start"

This command line downloads and uncompresses the CSV, builds the index, and starts the server, all in one go. If you have already downloaded the CSV, it will not be downloaded again (the Makefile target csv: then has no effect). If you have already built the index once, you can omit the Makefile target pall: (which stands for precompute all).
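For readability, the one-liner can be split into its two Docker steps. The following Python sketch only assembles the same arguments as the command above. The function name `demo_commands` and its parameters are made up for illustration, and nothing is executed unless you pass the lists to `subprocess.run` yourself.

```python
import os
import shlex


def demo_commands(db="movies", port=1622, repo_dir=None):
    """Return the two Docker invocations of the demo one-liner as argument lists.

    Hypothetical helper: it mirrors the command line from this README and
    does not run anything by itself.
    """
    repo_dir = repo_dir or os.getcwd()
    # Step 1: build the image from the code in this repository.
    build = ["docker", "build", "-t", "completesearch", "."]
    # Step 2: run a container that mounts the three volumes and calls make.
    run = [
        "docker", "run", "-it", "--rm",
        "-e", f"DB={db}",
        "-p", f"{port}:8080",
        "-v", f"{repo_dir}/applications:/applications",
        "-v", f"{repo_dir}/data/:/data",
        "-v", f"{repo_dir}/ui:/ui",
        "--name", f"completesearch.{db}",
        "completesearch",
        "-c", f"make DATA_DIR=/data/{db} DB={db} csv pall start",
    ]
    return build, run


if __name__ == "__main__":
    # Print the two commands in a form that can be pasted into a shell.
    for cmd in demo_commands():
        print(shlex.join(cmd))
```

Once the CSV has been downloaded and the index has been built, the same sketch could be adapted by dropping `csv` and `pall` from the make targets, as described above.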

3. Relevant files

Read this section if you want to understand in a little more depth what is going on with the fancy command line above. The command line first builds a docker image from the code in this repository. So far so good. It then runs a docker container, which mounts three volumes, which we briefly explain next:

applications This folder contains the configuration for each application. Each configuration consists of just two files: a Makefile that specifies how to build the index (this is highly customizable, see below) and a config.js for customizing the generic UI.

data This folder contains the CSV file with the original data (one record per line, in columns) and the index files. They all have a common prefix. See below for more information on the index.

ui This folder contains the code for the generic UI. If you just want to use CompleteSearch as backend and build your own UI, you don't have to mount this volume. It's nice, however, to always have a working UI available for testing, without any extra work.

4. The CompleteSearch index

Like all search engines, CompleteSearch builds an index with the help of which it can then answer queries efficiently. It is not an ordinary inverted index, but something more fancy: a half-inverted index or hybrid (HYB) index. You don't have to understand this if you just want to use CompleteSearch. But if you are interested, you can learn more about it in the publications below.
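For a rough intuition of context-sensitive prefix search, here is a toy Python sketch loosely following the block idea behind the HYB index: the sorted vocabulary is cut into blocks of consecutive words, and each block stores (document, word) pairs. This is a strong simplification for illustration only, not the actual C++ implementation; the class name `ToyHybIndex`, its methods, and the block size are all made up.

```python
from bisect import bisect_left
from collections import defaultdict


class ToyHybIndex:
    """Toy block-based index (illustrative, not the real HYB data structure)."""

    def __init__(self, docs, block_size=2):
        # docs: dict mapping a document id to its text.
        self.vocab = sorted({w for t in docs.values() for w in t.lower().split()})
        word_id = {w: i for i, w in enumerate(self.vocab)}
        self.block_size = block_size
        self.blocks = defaultdict(list)
        # Each block covers a range of consecutive word ids and stores
        # (doc id, word id) pairs, appended in order of increasing doc id.
        for doc_id in sorted(docs):
            for w in sorted(set(docs[doc_id].lower().split())):
                wid = word_id[w]
                self.blocks[wid // block_size].append((doc_id, wid))

    def _word_range(self, word, prefix):
        # Half-open range [lo, hi) of word ids matching the word or prefix.
        lo = bisect_left(self.vocab, word)
        if prefix:
            hi = bisect_left(self.vocab, word + "\uffff")
        else:
            hi = lo + 1 if lo < len(self.vocab) and self.vocab[lo] == word else lo
        return lo, hi

    def _scan(self, word_range, candidate_docs):
        # Scan only the blocks overlapping the word range; keep the pairs whose
        # word is in range and whose document is among the candidates.
        lo, hi = word_range
        hits, completions = set(), set()
        for b in range(lo // self.block_size, (hi - 1) // self.block_size + 1):
            for doc_id, wid in self.blocks.get(b, []):
                in_context = candidate_docs is None or doc_id in candidate_docs
                if lo <= wid < hi and in_context:
                    hits.add(doc_id)
                    completions.add(self.vocab[wid])
        return hits, sorted(completions)

    def query(self, text):
        # All words but the last are matched exactly; the last one is a prefix.
        words = text.lower().split()
        docs, completions = None, []
        for i, word in enumerate(words):
            is_last = i == len(words) - 1
            docs, completions = self._scan(self._word_range(word, is_last), docs)
        return sorted(docs or []), completions


if __name__ == "__main__":
    index = ToyHybIndex({
        1: "information retrieval",
        2: "informal talk",
        3: "retrieval of infants",
    })
    print(index.query("retrieval inf"))
```

In this toy example, the query `retrieval inf` yields the completions `infants` and `information` but not `informal`, because `informal` only occurs in a document that does not contain `retrieval`. That is the sense in which completions are restricted to those that lead to a hit.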

To build the index, CompleteSearch requires two input files, one with suffix .words and one with suffix .docs. The first contains the contents of your documents split into words. The second contains the data that you want to display as search engine hits. The two are usually related, but not exactly the same. The format is very simple and is described by example here.

If you have special wishes, you can build these two input files yourself, from whatever your data is. Then you have full control over what CompleteSearch will and can do for you. However, in most applications, you can use our generic CSV parser. It takes a CSV file (one record per line, with a fixed number of columns per line) as input, and from that produces the .words and the .docs file.

The CSV parser is very powerful and highly customizable. You can see how it is used in the Makefile of the various example applications (in the subdirectories of the directory applications). A subset of the options is described in more detail here. For a complete list, look at the code that parses the options.
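The split into a words part and a docs part can be sketched in a few lines of Python. This is not the generic CSV parser from this repository and it does not write the real file formats (those are defined by the format description linked above). The function name `csv_to_words_and_docs` and the tuple layouts are assumptions made purely for illustration.

```python
import csv
import io
import re


def csv_to_words_and_docs(csv_text, delimiter=","):
    """Split CSV records into word postings and display records (toy version).

    Returns:
      words: sorted list of (word, record id, position) tuples, standing in
             for the contents of a .words file (assumed layout).
      docs:  list of (record id, display text) tuples, standing in for the
             contents of a .docs file (assumed layout).
    """
    words, docs = [], []
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    for record_id, row in enumerate(reader, start=1):
        position = 0
        for column in row:
            # Lowercase and tokenize each column into alphanumeric words.
            for word in re.findall(r"\w+", column.lower()):
                position += 1
                words.append((word, record_id, position))
        # The display record keeps the original, untokenized column values.
        docs.append((record_id, " | ".join(row)))
    return sorted(words), docs


if __name__ == "__main__":
    words, docs = csv_to_words_and_docs("The Matrix,1999\nHeat,1995\n")
    for entry in words:
        print(entry)
    for entry in docs:
        print(entry)
```

The sketch shows why the two outputs are related but not the same: the words part is normalized for searching, while the docs part keeps what should be displayed as a hit.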

4. The CompleteSearch engine

The binary to start the CompleteSearch engine is called startCompletionServer. It is very powerful and has a lot of options. For some example uses, you can have a look at the Makefile in the directory applications and at the included Makefile of one of the example applications. A detailed documentation of all the options can be found in a README.md in the directory src.

Once the engine is started, you can ask queries either through our generic and customizable UI (see above) or directly via the HTTP API provided by startCompletionServer. The API is very simple and described at the end of this page. Play around with it for one of the example applications to get a feeling for what it does. You can also look at the (rather simple) JavaScript code of the generic UI to get a feeling for how it works and what it can be used for.
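As a starting point for playing with the backend, the following Python sketch builds a query URL and fetches the raw response. The parameter names `q` (query), `h` (number of hits), `c` (number of completions), and `format` are assumptions and are not confirmed by this page, so check them against the documentation in the directory src. The function names and the default port (taken from the demo command above) are likewise only illustrative.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(query, host="localhost", port=1622, hits=10, completions=5):
    """Build a query URL for a running backend (parameter names are assumed)."""
    params = {"q": query, "h": hits, "c": completions, "format": "json"}
    return f"http://{host}:{port}/?{urlencode(params)}"


def ask_backend(query, **kwargs):
    """Send the query to the backend and return the raw response text."""
    with urlopen(build_query_url(query, **kwargs), timeout=10) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only prints the URL; call ask_backend(...) once a server is running.
    print(build_query_url("matrix rel"))
```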

5. (Optional) Setup a subdomain

To show off your CompleteSearch instance to your friends, you may want it to run under a fancy URL, and not http://my.weird.hostname.somewhere:76154. Let us assume you have an Apache webserver running on your machine. Then you can add the following section in your apache.conf or in a separate config file included by apache.conf. You have to replace servername by the fully qualified domain name (FQDN) of the machine on which your Apache webserver is running. You have to replace hostname by the FQDN of the machine on which the CompleteSearch frontend is running. This can be the same machine as servername, but does not have to be.

<VirtualHost *:80>
  ServerName example.cs.uni-freiburg.de
  ServerAlias dblp example.cs.uni-freiburg.de
  ServerAdmin webmaster@localhost

  ProxyPreserveHost On
  ProxyRequests Off

  ProxyPass / http://<hostname>:5000/
  ProxyPassReverse / http://<hostname>:5000/

  ...
</VirtualHost>

6. Publications

Here are some of the publications explaining the techniques behind CompleteSearch and what it can be used for. This work was done at the Max-Planck-Institute for Informatics. That was a while ago, but it turns out that the features and the efficiency provided by CompleteSearch are still very much state of the art.

Type Less, Find More: Fast Autocompletion with a Succinct Index @ SIGIR 2006

The CompleteSearch Engine: Interactive, Efficient, and Towards IR&DB Integration @ CIDR 2007

ESTER: efficient search on text, entities, and relations @ SIGIR 2007

Efficient interactive query expansion with complete search @ CIKM 2007

Output-Sensitive Autocompletion Search @ Information Retrieval 2008

Semantic Full-Text Search with ESTER: Scalable, Easy, Fast @ ICDM 2008
