-
Notifications
You must be signed in to change notification settings - Fork 115
Practical Examples Using Cpp
Practical examples with a large-scale dataset for two types of primary NGT graphs (ANNG, ONNG) are described.
First, to describe how to search large-scale datasets, an NGT dataset needs to be generated. Download the fastText dataset as follows.
curl -O https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M-subword.vec.zip
unzip wiki-news-300d-1M-subword.vec.zip
The dataset above should be converted to a format that our sample scripts can read by using the following python script.
# dataset.py
with open('wiki-news-300d-1M-subword.vec', 'r') as fi,\
open('objects.tsv', 'w') as fov, open('words.tsv', 'w') as fow:
n, dim = map(int, fi.readline().split())
fov.write('{0}\t{1}\n'.format(n, dim))
for line in fi:
tokens = line.rstrip().split(' ')
fow.write(tokens[0] + '\n')
fov.write('{0}\n'.format('\t'.join(tokens[1:])))
Below is an example of how to construct an ANNG with cosine similarity for metric space.
#include "NGT/Index.h"
using namespace std;
int main(int argc, char **argv)
{
ifstream is("objects.tsv");
string line;
getline(is, line);
stringstream linestream(line);
NGT::Property property;
size_t n;
linestream >> n;
linestream >> property.dimension;
property.objectType = NGT::ObjectSpace::ObjectType::Float;
property.distanceType = NGT::Index::Property::DistanceType::DistanceTypeCosine;
NGT::Index index(property);
while (getline(is, line)) {
stringstream linestream(line);
vector<float> object;
while (!linestream.eof()) {
float value;
linestream >> value;
object.push_back(value);
}
index.append(object);
}
index.createIndex(16); // 16 is the number of threads to construct.
index.saveIndex("fasttext.anng");
return 0;
}
The ANNG can be searched with a query by using the following program.
// search-fasttext.cpp
#include "NGT/Index.h"
#include <sstream>
#include <iomanip>
using namespace std;
int main(int argc, char **argv)
{
ifstream is("words.tsv");
string word;
vector<string> words;
while (getline(is, word)) {
words.push_back(word);
}
NGT::Index index("fasttext.anng");
int queryID = 10001;
vector<float> query;
index.getObjectSpace().getObject(queryID, query);
NGT::SearchQuery searchQuery(query);
NGT::ObjectDistances results;
searchQuery.setResults(&results);
searchQuery.setSize(10);
searchQuery.setEpsilon(0.1);
index.search(searchQuery);
cout << "Query No." << queryID << endl;
cout << "Rank\tID\tDistance" << setprecision(6) << fixed << endl;
for (size_t i = 0; i < results.size(); i++) {
cout << i + 1 << "\t" << results[i].id << "\t" << results[i].distance << "\t" << words[results[i].id - 1] << endl;
}
cout << endl;
return 0;
}
Below are the search results.
Query No.10001
Rank ID Distance
1 10001 0.000000 Doctors
2 4632 0.244096 Doctor
3 79543 0.258944 Medics
4 2045 0.263412 doctors
5 339398 0.274972 Doctoring
6 20668 0.280508 Physicians
7 80647 0.292580 Dentists
8 24256 0.292752 Nurses
9 9481 0.322196 Scientists
10 623161 0.330500 Practioners
When a higher accuracy is needed, you can specify an epsilon value of SearchQuery higher than the default 0.1 as shown below.
searchQuery.setEpsilon(0.15);
When a short query time is needed at the expense of accuracy, you can specify a smaller epsilon value.
Since further improvement of the search performance by using an ONNG can be expected, how to generate ONNG is described. ONNG generation requires an ANNG with more edges as the excess edges are removed to optimize the graph. Note that constructing an ANNG with too many edges requires more time. The following example shows the creation of an ANNG with 100 edges. Since 100 edges are considered excessive for most datasets, as fewer edges would be sufficient to achieve a high accuracy, this number can be reduced on the basis of the resultant accuracy of the dataset.
#include "NGT/Index.h"
using namespace std;
int main(int argc, char **argv)
{
ifstream is("objects.tsv");
string line;
getline(is, line);
stringstream linestream(line);
NGT::Property property;
size_t n;
linestream >> n;
linestream >> property.dimension;
property.objectType = NGT::ObjectSpace::ObjectType::Float;
property.distanceType = NGT::Index::Property::DistanceType::DistanceTypeCosine;
property.edgeSizeForCreation = 100;
NGT::Index index(property);
while (getline(is, line)) {
stringstream linestream(line);
vector<float> object;
while (!linestream.eof()) {
float value;
linestream >> value;
object.push_back(value);
}
index.append(object);
}
index.createIndex(16); // 16 is the number of threads to construct.
index.saveIndex("fasttext.anng-100");
return 0;
}
An ONNG can be converted from an ANNG with the NGT::GraphOptimizer as follows.
#include "NGT/Index.h"
#include "NGT/GraphOptimizer.h"
using namespace std;
int main(int argc, char **argv)
{
NGT::GraphOptimizer optimizer;
optimizer.numOfOutgoingEdges = 10;
optimizer.numOfIncomingEdges = 100;
optimizer.execute("fasttext.anng-100", "fasttext.onng");
return 0;
}
execute() executes pass adjustment and dynamic degree adjustment optimization. numOfOutgoingEdges and numOfIncomingEdges specify the number of outgoing edges and incoming edges, respectively. In this example, the number of incoming edges is the same as the number of edges for ANNG creation, i.e., 100. However, since the number of real edges of an ANNG is more than the specified number, a larger number of incoming edges can be specified.
The search-fasttext.cpp above must be modified to use the ONNG by replacing 'fasttext.anng' with 'fasttext.onng' on the following line.
NGT::Index index("fasttext.onng");
Even if the search time increases compared with the result of the first ANNG, the search accuracy of the ONNG should be higher. Therefore, since the epsilon value for an ANNG needs to be increased to acquire higher accuracies, the search time of an ANNG increases more than that of an ONNG.
Command line tool
Python
C++