Skip to content

Practical Examples Using Cpp

masajiro edited this page Nov 25, 2019 · 2 revisions

Practical examples with a large-scale dataset for two types of primary NGT graphs (ANNG, ONNG) are described.

Dataset generation

First, to describe how to search large-scale datasets, an NGT dataset needs to be generated. Download the fastText dataset as follows.

curl -O https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M-subword.vec.zip
unzip wiki-news-300d-1M-subword.vec.zip

The dataset above should be converted to a format that our sample scripts can read by using the following python script.

# dataset.py 
with open('wiki-news-300d-1M-subword.vec', 'r') as fi,\
     open('objects.tsv', 'w') as fov, open('words.tsv', 'w') as fow:
    n, dim = map(int, fi.readline().split())
    fov.write('{0}\t{1}\n'.format(n, dim))
    for line in fi:
        tokens = line.rstrip().split(' ')
        fow.write(tokens[0] + '\n')
        fov.write('{0}\n'.format('\t'.join(tokens[1:])))

ANNG Construction and Search

Below is an example of how to construct an ANNG with cosine similarity for metric space.

#include	"NGT/Index.h"
using namespace std;
int main(int argc, char **argv)
{
  ifstream	is("objects.tsv");
  string	line;
  getline(is, line);
  stringstream	linestream(line);
  NGT::Property	property;
  size_t n;
  linestream >> n;
  linestream >> property.dimension;
  property.objectType	= NGT::ObjectSpace::ObjectType::Float;
  property.distanceType	= NGT::Index::Property::DistanceType::DistanceTypeCosine;
  NGT::Index	index(property);
  while (getline(is, line)) {
    stringstream	linestream(line);
    vector<float>	object;
    while (!linestream.eof()) {
      float value;
      linestream >> value;
      object.push_back(value);
    }
    index.append(object);
  }
  index.createIndex(16);	// 16 is the number of threads to construct.
  index.saveIndex("fasttext.anng");
  return 0;
}

The ANNG can be searched with a query by using the following program.

// search-fasttext.cpp
#include	"NGT/Index.h"
#include	<sstream>
#include	<iomanip>
using namespace std;
int main(int argc, char **argv)
{
  ifstream		is("words.tsv");
  string		word;
  vector<string>	words;
  while (getline(is, word)) {
    words.push_back(word);
  }
  NGT::Index		index("fasttext.anng");
  int			queryID = 10001;
  vector<float>		query;
  index.getObjectSpace().getObject(queryID, query);
  NGT::SearchQuery	searchQuery(query);
  NGT::ObjectDistances	results;
  searchQuery.setResults(&results);
  searchQuery.setSize(10);
  searchQuery.setEpsilon(0.1);
  index.search(searchQuery);
  cout << "Query No." << queryID << endl;
  cout << "Rank\tID\tDistance" << setprecision(6) << fixed << endl;
  for (size_t i = 0; i < results.size(); i++) {
    cout << i + 1 << "\t" << results[i].id << "\t" << results[i].distance << "\t" << words[results[i].id - 1] << endl;
  }
  cout << endl;
  return 0;
}

Below are the search results.

Query No.10001
Rank	ID	Distance
1	10001	0.000000	Doctors
2	4632	0.244096	Doctor
3	79543	0.258944	Medics
4	2045	0.263412	doctors
5	339398	0.274972	Doctoring
6	20668	0.280508	Physicians
7	80647	0.292580	Dentists
8	24256	0.292752	Nurses
9	9481	0.322196	Scientists
10	623161	0.330500	Practioners

When a higher accuracy is needed, you can specify an epsilon value of SearchQuery higher than the default 0.1 as shown below.

  searchQuery.setEpsilon(0.15);

When a short query time is needed at the expense of accuracy, you can specify a smaller epsilon value.

ANNG generation to reconstruct ONNG

Since further improvement of the search performance by using an ONNG can be expected, how to generate ONNG is described. ONNG generation requires an ANNG with more edges as the excess edges are removed to optimize the graph. Note that constructing an ANNG with too many edges requires more time. The following example shows the creation of an ANNG with 100 edges. Since 100 edges are considered excessive for most datasets, as fewer edges would be sufficient to achieve a high accuracy, this number can be reduced on the basis of the resultant accuracy of the dataset.

#include	"NGT/Index.h"
using namespace std;
int main(int argc, char **argv)
{
  ifstream	is("objects.tsv");
  string	line;
  getline(is, line);
  stringstream	linestream(line);
  NGT::Property	property;
  size_t n;
  linestream >> n;
  linestream >> property.dimension;
  property.objectType		= NGT::ObjectSpace::ObjectType::Float;
  property.distanceType		= NGT::Index::Property::DistanceType::DistanceTypeCosine;
  property.edgeSizeForCreation	= 100;
  NGT::Index	index(property);
  while (getline(is, line)) {
    stringstream	linestream(line);
    vector<float>	object;
    while (!linestream.eof()) {
      float value;
      linestream >> value;
      object.push_back(value);
    }
    index.append(object);
  }
  index.createIndex(16);	// 16 is the number of threads to construct.
  index.saveIndex("fasttext.anng-100");
  return 0;
}

Conversion from ANNG to ONNG

An ONNG can be converted from an ANNG with the NGT::GraphOptimizer as follows.

#include	"NGT/Index.h"
#include	"NGT/GraphOptimizer.h"

using namespace std;
int main(int argc, char **argv)
{
  NGT::GraphOptimizer	optimizer;
  optimizer.numOfOutgoingEdges = 10;
  optimizer.numOfIncomingEdges = 100;
  optimizer.execute("fasttext.anng-100", "fasttext.onng");
  return 0;
}

execute() executes pass adjustment and dynamic degree adjustment optimization. numOfOutgoingEdges and numOfIncomingEdges specify the number of outgoing edges and incoming edges, respectively. In this example, the number of incoming edges is the same as the number of edges for ANNG creation, i.e., 100. However, since the number of real edges of an ANNG is more than the specified number, a larger number of incoming edges can be specified.

Search with ONNG

The search-fasttext.cpp above must be modified to use the ONNG by replacing 'fasttext.anng' with 'fasttext.onng' on the following line.

  NGT::Index		index("fasttext.onng");

Even if the search time increases compared with the result of the first ANNG, the search accuracy of the ONNG should be higher. Therefore, since the epsilon value for an ANNG needs to be increased to acquire higher accuracies, the search time of an ANNG increases more than that of an ONNG.