Two-Stage Routing with Optimized Guided Search and Greedy Algorithm is more efficient and effective routing strategy than state-of-the-art method, including SIGKDD'2011, CVPR'2016, PR'2019, and TPAMI'2021. At present, TOGG has been deployed in several state-of-the-art proximity graph algorithms (such as HNSW, NSG, and NSSG), and experiments on multiple real world datasets have verified the efficiency and effectiveness of TOGG.
We evaluate four advanced routing strategies applied to the proximity graph algorithms, and their papers are given in the following table. In addition, we propose TOGG, including TOGG that uses OGS-KDT + OGA (TOGG-KDT) and OGS-KMC + OGA (TOGG-KMC); they achieve better performance of efficiency and effectiveness than state-of-the-art work.
OGS-KDT: optimized Guided Search based on modified KD-Tree
OGS-KMC: optimized Guided Search based on k-means clustering
OGA: optimized greedy algorithm
Strategy | Paper Source | Year |
---|---|---|
SIGKDD'2011 | SIGKDD | 2011 |
CVPR'2016 | CVPR | 2016 |
PR'2019 | PR | 2019 |
TPAMI'2021 | TPAMI | 2021 |
We applied both TOGG to three state-of-the-art proximity graphs, i.e., HNSW, NSG, and NSSG, and then their scalability were tested on four real-world datasets. Note that greedy algorithm (GA) is the routing strategy currently used by these proximity graph algorithms.
Algorithm | Paper Source | Year |
---|---|---|
HNSW | TPAMI | 2018 |
NSG | VLDB | 2019 |
NSSG | TPAMI | 2021 |
Our experiment involves four real-world datasets popularly deployed by existing works. All datasets are pre-split into base data and query data and come with groundtruth data in the form of the top 100 neighbors.
dataset | base_num | base_dim | query_num | query_dim | groundtruth_num/query | download |
---|---|---|---|---|---|---|
SIFT1M | 1000000 | 128 | 10000 | 128 | 100 | sift.tar.gz(161MB) |
GIST1M | 1000000 | 960 | 1000 | 960 | 100 | gist.tar.gz(2.6GB) |
GloVe-100 | 1183514 | 100 | 10000 | 100 | 100 | glove-100.tar.gz(424MB) |
MNIST | 60000 | 784 | 10000 | 100 | 100 | mnist.tar.gz(19.8MB) |
b: interval factor (TOGG-KDT), it controls the size of the partition interval, thereby affecting the number of neighbors in each subtree.
m: the number of clusters (TOGG-KMC), it controls the number of neighbor clusters, which affects the number of neighbors in each neighbor cluster.
To obtain the optimal value of b (0 ≤ b < infinity) or m (1 ≤ m ≤ c, c is maximum outdegree), we first traverse the values in a large interval within their value range, and then reduce the interval and traverse the value within a small range until an optimal value is obtained.
- GCC 4.9+ with OpenMP
- CMake 2.8+
- Boost 1.55+
Download:
$ git clone https://github.com/whenever5225/TOGG.git
Locate directory:
$ cd TOGG/routing_evalutaion #for routing strategies evaluation
$ cd TOGG/algorithms/A/B/ #A is GA or TOGG-KDT or TOGG-KMC, B is hnsw or nsg or nssg, for scalability evaluation
$ mkdir build && cd build/
Compile:
$ cmake ..
$ make -j
Before building index, you should set the root directory for the dataset in TOGG/routing_evaluation/test/evaluation.cpp
first. Then, you can run the following instructions for build graph index.
$ cd TOGG/routing_evaluation/build/test/
$ ./evaluation strategy_name dataset_name build [other optional information]
After the build is completed, the graph index will be written in the current folder in binary format (for index size). The index construction time can be viewed from the output log information. With the index built, you can run the following commands to perform the search. Related information about the search such as distance evaluation times, candidate set size, memory load can be obtained or calculated according to the output log information.
$ cd TOGG/routing_evaluation/build/test/
$ ./evaluation strategy_name dataset_name search [other optional information]
Before building index, you should set the root directory for the dataset in TOGG/algorithms/A/B/test/evaluation.cpp
first. Then, you can run the following instructions for build graph index.
$ cd TOGG/algorithms/A/B/build/test/
$ ./evaluation dataset_name build [other optional information] #see evaluation.cpp file for more detail
After the build is completed, the graph index will be written in the current folder in binary format (for index size). The index construction time can be viewed from the output log information. With the index built, you can run the following commands to perform the search. Related information about the search such as distance evaluation times, candidate set size, memory load can be obtained or calculated according to the output log information.
$ cd TOGG/algorithms/A/B/build/test/
$ ./evaluation dataset_name search [other optional information] #see evaluation.cpp file for more detail
Thanks to everyone who provided references for this project. Special thanks to n2, nsg, and ssg for their assistance in the necessary implementation of this project.