Make and Query read-only binary-search tree file supporting billions of entries in files of tens of GB.
Immutable implicit naive Binary Search Tree structure stored in a file.
The tree structure (possibly larger than the available RAM) is created at once using bulk-loading. It is then possible to perform queries on it (nn query, knn query, range query, ...), but not to update it.
It has been developed for (with a more general usage than) static astronomical catalogues.
The datastructure is implicit: it is basically a flat array of entries ordered in a pre-defined way depending on a few parameters like the number of elements in the tree, the size of both the L1 and the disk caches.
The simple design inputs are:
- a metadata part followed by the data part
- data part as compact as possible, but without compression
- => hence the choice of an implicit structure with an unbalanced rightmost part of the tree
Remark: I do not claim this is the best possible structure, it is a quite naive implementation by a non-expert, rushing to have something working among other duties; any feedback welcome.
Contains 3 CLI tools:
mkbst
: to MaKe a Binary Search Tree fileqbst
: to Query a Binary Search Tree filegenfile
: to generate files with random values for testing purposes
Perform fast queries on a single catalogue column. The binary-search tree basically stores both values and OIDs (row indices).
In input, the tools takes an identifiers (which can be an implicit sequential number) and a value. The indexation is made on the values. Queries return entries which basically are tuples containing an identifier and value couple.
Although the first step is an external k-way merge sort, the final file is not ordered sequentially. It consists in a sequence of binary search tree blocks. There is two levels of blocks:
- blocks fitting in the L1 cache
- groups of blocks fitting in the disk cache The full tree is not balanced:
- it is made of a main balance tree
- plus a rightmost sub-tree recursively consisting in
- a main balanced tree +plus a rightmost sub-tree... The tree has 0 unused byte.
Main functionalities are complete (building and querying), but all parts may not necessarily be production ready: more testing is needed (please report any bug).
However, this code is already in production in VizieR, mainly to index large catalogues identifiers.
For performances purposes, the code makes a large use of monomorphization (no dynamic dispatch at all!). It leads to:
- very long compilation time (1min/10min in debug/release mode)
- large binaries:
mkbst
(tree creation) is about 9/65 MB in release/debug modeqbst
(tree query) is about 29/116 MB in release/debug mode
For a larger project that may fulfill the need (and more), see:
In the while project, we talk about key/value
pairs.
Although it may be counter-intuitive, we actually index the value
(e.g. a magnitude, which is not necessarily unique) and
we return the associated key
(e.g. a unique record number in a table).
A few ideas:
- common use cases:
- index a unique identifier (value) and return a recno (key)
- index magnitudes (value) and return recnos (key)
- to get the position of a Gaia source from its identifier:
index the unique identifier (value) and return formatted JNAMEs (string key). - for fast magnitude based density maps: index a magnitude (value) and return the order 12 healpix index (key).
- index CSV rows using as key row starting byte offset
The standard way to install the mkbst
, qbst
and genfile
binaries is:
- install rust see here, possibly removing
--tlsv1.2
in the command line - fork and download this repository
- type
cargo install --path .
from the downloaded directory (can take ~1-2min) - WARNING: by default only a subset of (key, value) pair is available. For all possibilities, use
cargo install --path . --features "all"
See Cargo.toml for the list of features.
Fast compilation to test for a two key/value datatype couples, e.g. (unsigned integer/unsigned integer) and (unsigned integer/float):
cargo build --release --no-default-features --features u32_u32,u32_f32 --bin qbst
Full compilation (can tae 10min!):
cargo install --all-features --path .
Bash script mkseq.bash to generate a simple sequence of value from 0 to n
:
#!/bin/bash
declare -rx END=$1
echo "val"
COUNTER=0
until [ $COUNTER -ge ${END} ]; do
echo $COUNTER
let COUNTER+=1
done
Remark: one can use the shuf
command (first removing and then adding val
) to shuffle the input lines.
The genfile
tool generates simple files to test the BSTree code.
Example:
genfile 10000000000 randf64 | mkbst -h test.10b.randf64 --id-type u5 --val-type f4
Build a Binary Search Tree on 10 billion single precision floats having values in [0.0, 1.0)
.
Generate a very simple sequential file containing integers with both the id and value columns:
genfile -o 10000000000 out.csv seqint
Remarks:
- the purpose of a same sequential number for both the identifier and the value is to test that a value is associated with the correct identifier.
- one can use the
shuf
command (first removing and then addingval
) to shuffle the input lines.
Create a simple tree containing 2 billion entries consisting of tuples having the same identifiers (a row number) and value (the row number):
./mkseq.bash 2000000000 | ./mkbst -h bigtest --id-type u4 --val-type u4
Both identifiers and values are stored on regular 32bit unsigned integers.
Perform queries:
time qbst bigtest.bstree get --value 256984
time qbst bigtest.bstree knn --value 69853145 -k 10
time qbst bigtest.bstree range --limit 10000 --from 25639 --to 250000
time qbst bigtest.bstree range --cout --from 25639 --to 250000
Generate 10 million random f64 and create a bstree storing id on 32 bit integers and value on 32bit floats
genfile 10000000 randf64 | mkbst -h test_10m --id-type u4 --val-type f4
Look at the nearest value from 0.5
time qbst test_10m.bstree nn value 0.5
Look at the 10 nearest values from 0.2 (the result is ordered by distance to 0.2)
time qbst test_10m.bstree knn -v 0.2 -k 10
Count the number of entries havig value in 0.4 and 0.6
time qbst test_10m.bstree range -f 0.4 -t 0.6 -c
Priny the value in the range 0.49999 and 0.50001 (the result is ordered by increasing values)
time qbst test_10m.bstree range -f 0.49999 -t 0.50001
Test on my personal desktop (MVNe SSD, 16 GB RAM, AMD Ryzen) on a 20 GB file containing sequential numbers (from 0 to 1999999999).
> time mkbst -h --input test.2billion.csv test2b --id-type u4 --val-type u4
real 9m49,775s
user 7m37,361s
sys 0m50,522s
It took less than 10min to build the 15 GB ouput file (ok, the input file is already sorted).
> qbst test2b.bstree info
{
"types": [
"U32",
"U32"
],
"constants": {
"n_entries": 2000000000,
"entry_byte_size": 8,
"n_entries_per_l1page": 4096,
"n_l1page_per_ldpage": 255
},
"layout": {
"depth": 2,
"n_entries_root": 1914,
"n_entries_main": 1999622790,
"rigthmost_subtree": {
"depth": 1,
"n_entries_root": 92,
"n_entries_main": 376924,
"rigthmost_subtree": {
"depth": 0,
"n_entries_root": 286,
"n_entries_main": 286,
"rigthmost_subtree": null
}
}
}
}
Simple exact value query:
> time qbst test2b.bstree get value 1569874529
id,val
1569874529,1569874529
real 0m0,002s
user 0m0,000s
sys 0m0,002s
Nearest neighbour query
> time qbst test2b.bstree nn -v 3000000000
distance,id,val
1000000001,1999999999,1999999999
real 0m0,002s (0m0,009s at the first execution)
user 0m0,002s
sys 0m0,000s
K nearest neighbours query
> time qbst test2b.bstree knn -v 25684 -k 10
distance,id,val
0,25684,25684
1,25685,25685
1,25683,25683
2,25686,25686
2,25682,25682
3,25681,25681
3,25687,25687
4,25688,25688
4,25680,25680
5,25679,25679
real 0m0,002s (0m0,005s at the first execution)
user 0m0,002s
sys 0m0,000s
Range query
> time qbst test2b.bstree range -l 10 -f 25698470 -t 25698570
id,val
25698470,25698470
25698471,25698471
25698472,25698472
25698473,25698473
25698474,25698474
25698475,25698475
25698476,25698476
25698477,25698477
25698478,25698478
25698479,25698479
real 0m0,002s
user 0m0,000s
sys 0m0,002s
At first execution, the limiting factor is the disk access. At the second execution, the limiting factor is the time required by the OS to handle the process. The 2ms includes the time needed to read the tree metadata.
Generate 100,000 random points in 0 - 2billion:
./genfile 2000000000 randint | head -100001 | tail -n +2 > toto.list
In release mode (third execution):
time qbst test2b.bstree get list toto.list > toto.res.txt
real 0m0,732s
user 0m0,245s
sys 0m0,481s
i.e. a mean of less than 8 micro second per query (including parsing, conversion to string and writing the result in a file)! (The mean is ~0.14 ms/query at the first execution)
> time qbst test2b.bstree nn list toto.list > toto.res.txt
real 0m16,224s
user 0m0,482s
sys 0m3,222s
> time qbst test2b.bstree nn list toto.list > toto.res.txt
real 0m1,651s
user 0m0,256s
sys 0m0,804s
> time qbst test2b.bstree nn list toto.list > toto.res.txt
real 0m0,760s
user 0m0,251s
sys 0m0,509s
- First execution: 0.16 ms/query
- Third execution: 7.60 us/query
We redo those query sorting the random number:
./genfile 2000000000 randint | head -100001 | tail -n +2 | sort -n > toto.list
The results are similar.
We recall that the index file is 15 GB large, so 2nd execution is faster since the data is in the disk cache.
Generate the value and the index at once:
nohup sh -c "genfile 10000000000 randf64 | mkbst -h test.10b.randf64 --id-type u5 --val-type f4" &
Or in two steps
genfile 10000000000 randf64 > test_10b_randf64.csv
mkbst -h --input test_10b_randf64.csv test.10b.randf64 --id-type u5 --val-type f4
The size of the index is ~85 GB
.
The queries have been tested on 2 distinct hardware:
- Desktop computer
- 1 TB, 7200 RPM, 32 MB cache HDD (HGST HTS721010A9E630)
- Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz (6 MB "smart cache")
- 16 GB DDR4 2133 MHz
- Server
- RAID of 5 SSDs (Samsung SATA SSDs)
- 2x Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz (25 MB "smart cache")
- 64 GB DDR4 2133 MHz
The table return the "real" time provided by the "time" command.
Each command starts by time qbst test.10b.bstree
plus the Query params
.
Here the command to generate the list input file in the queries using list
:
genfile 10000 randf64 | egrep -v "val" | sort -n > randf64_4q.csv
Query params | 1st or 2nd | Result | Desktop | Server |
---|---|---|---|---|
nn value 0.5 | 1 | 0m0,071s | 0m0.013s | |
nn value 0.5 | 2 | 0m0,004s | 0m0.003s | |
knn -v 0.8 -k 10 | 1 | 0m0,034s | 0m0.007s | |
knn -v 0.8 -k 10 | 2 | 0m0,004s | 0m0.003s | |
all -v 0.8 -c | 1/2 | 588 | 0m0,004s | 0m0.005s |
all -v 0.2 -c | 1 | 148 | 0m0,026s | 0m0.011s |
all -v 0.2 -c | 2 | 148 | 0m0,004s | 0m0.003s |
range -f 0.4 -t 0.5 -c | 1/2 | 1000028688 | 1m5,450s | 1m57.212s |
range -f 0.150 -t 0.149 -c | 1 | 157 | 0m0,028s | 0m0.025s |
range -f 0.150 -t 0.149 -c | 2 | 157 | 0m0,004s | 0m0.003s |
get list 1000.csv > res.csv | 1 | 0m9,392s | 0m0.251s | |
get list 1000.csv > res.csv | 2 | 0m0,026s | 0m0.034s | |
get list 10000.csv > res.csv | 1 | 1m40,354s | 0m3.807s | |
get list 10000.csv > res.csv | 2 | 0m0,181s | 0m0.207s | |
get list 100000.csv > res.csv | 1 | 0m24.548s |
In the last case (100000 queries, no data in the disk cache), the mean query time is about 0.24 ms (which is probably not far from the disk access time).
In the previous results, we clearly see the effect of the spinning vs SSD disk at the first execution.
On the query range -f 0.4 -t 0.5 -c
we see that the server has a slower CPU.
The query get list 10000.csv
is very interesting (a factor more than x20 in performances!):
- the mean time on a spinning disk is about 10 ms
- the mean time on a SSD disk is about 0.4 ms
For the query get list 100000.csv > res.csv
, the result time is about the
same the input being sorted or not.
The use case is simple: from the Gaia DR2
Source
unique identifier, I want to retrieve the associated position (I am using the formatted position
%015.9%+015.9f
as a string made of 30 ASCII chars).
Thus here the value I want to index is Source
and the associated identifier is the position
(yes it is different from a HashMap in which Source
would be the (unique) key and the position the associated value
because in a bs-tree the indexed value is not necessarily unique).
In input, the files looks like:
ra,dec,source
45.00431616421,0.02104503269,34361129088
44.99615368416,0.00561580621,4295806720
45.00497424498,0.01987700037,38655544960
44.96389532530,0.04359518482,343597448960
...
and contains 1,69,2919,136 rows (including the header line).
I build and exec the following script
#!/bin/bash
LC_NUMERIC=C LC_COLLATE=C; \
tail -n +2 gaia_dr2.idradec.csv | tr ',' ' ' |\
while read line; do printf "%015.11f%+015.11f,%d\n" $line; done |\
mkbst gaia_dr2_source --val 1 --id 0 --id-type t30 --val-type u8
(the process is quite slow due to the bash while
and printf
).
I then have a 2.5 GB file Gaia_source.txt
containing more 132,739,322 Source
, looking like:
2448780173659609728
2448781208748235648
2448689605685695488
2448689777484387072
2448783991887042176
...
Two consecutive executions (slow 7200 RPM HDD) gives:
time qbst ../gaia_dr2_source.bstree get list Gaia_source.txt > Gaia.test.csv
real 24m4,323s
user 5m8,834s
sys 4m52,898s
and
time qbst ../gaia_dr2_source.bstree get list Gaia_source.txt > Gaia.test.csv
real 13m58,572s
user 4m2,162s
sys 3m32,534s
I guess that the second execution benefits from HDD cache.
Now sorting the input and querying again leads to:
time qbst ../gaia_dr2_source.bstree get list Gaia_source.sorted.txt > Gaia.test.sorted.csv
real 5m53,569s
user 2m43,339s
sys 2m28,872s
The output file is 6.3 GB large.
I install PSQL10 on Ubuntu via apt
, create a user and move the database out of the system disk:
sudo apt-get install postgresql-10
sudo -u postgres createuser --interactive
sudo -u postgres createdb fxtests
sudo systemctrl stop postgresql
sudo rsync -av /var/lib/postgresql /data-local/psql/
sudo vim /etc/postgresql/10/main/postgresql.conf
sudo systemctrl start postgresql
I create two tables (the index and the data tables) and copy data:
CREATE TABLE gaia2_idpos (
source BIGINT PRIMARY KEY,
pos CHAR(30) NOT NULL
);
COPY gaia2_idpos(pos, source)
FROM '/data-local/org/gaia_dr2.idradec.4index.csv'
DELIMITER ',';
and
CREATE TABLE gaia2_id (
source BIGINT PRIMARY KEY
)
COPY gaia2_id(source)
FROM '/data-local/org/aas/Gaia_source.txt'
DELIMITER ','
And perform the query:
time psql -d fxtests -t -A -F"," -c "SELECT b.* FROM gaia2_id as a NATURAL JOIN gaia2_idpos as b" > output.csv
real 53m17,051s
user 1m4,109s
sys 0m50,723s
Remarks:
- I tested PSQL out of the box, without modifying any parameters
- In the PSQL test, using a PRIMARY KEY for both tables, the result is to be compared with the sorted input in the BSTree test.
- add the possibility to query by a list of target
- make a simple test with PSQL
- replace memory map by pread/pwrite? (see e.g. positioned-io or scroll)
- remove the code which is now obsolete (e.g.
get
overwritten byget exact visitor
) - add much more tests
- add benchmarks
- add CI
- try to reduce the code redundancy (particularly in
SubTreeW
andSubTreeR
) - add support for NULL values (storing them separately, out of the tree structure)
- in the meanwhile, we can use a pre-defined value coding null
- perform tests with SQLx and PostgreSQL to have a reference time (would be nice if we are at least as fast)
If you use this code and work in a scientific public domain (especially astronomy), please acknowledge its usage and the CDS who developed it. It may help us in promoting our work to our financiers.
If the compilation fails with a message like
Caused by:
process didn't exit successfully: `rustc --crate-name bstree_file [...]
(signal: 9, SIGKILL: kill)
, try
dmesg | egrep -i 'killed process'
If the result looks like
[...] Out of memory: Killed process xxxxx (rustc)
it means that your machine was out of memory.
Like most projects in Rust, this project is licensed under either of
- Apache License, Version 2.0, (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this project by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.