Store the list of updated IDs directly in LMDB instead of a roaring bitmap #99

irevoire · 2024-10-01T13:45:57Z

Pull Request

After chatting with @ManyTheFish, we realized that deserializing a roaring bitmap, inserting an item, and serializing it again into LMDB every time we insert a new item wasn’t the smartest idea.
Ideally, we should write each document in a simple file or in RAM, but that adds a lot of complexity to the usage of arroy.
By simply writing each updated item ID directly as its own entry in a database, we were able to add 1M items in 1.7s instead of 18s+.
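To make the difference concrete, here is a minimal sketch of the two strategies. It is not arroy's code: a `BTreeMap` stands in for the LMDB database, a sorted list of `u32` stands in for the roaring bitmap, and every name (`Db`, `UPDATED_SET_KEY`, `UPDATED_PREFIX`, the functions) is made up for illustration.

```rust
use std::collections::BTreeMap;

/// Stand-in for an LMDB database: an ordered map from key bytes to value bytes.
/// (Assumption for this sketch; the real code goes through LMDB.)
type Db = BTreeMap<Vec<u8>, Vec<u8>>;

/// Hypothetical key under which the whole set of updated IDs is stored.
const UPDATED_SET_KEY: &[u8] = b"updated-set";
/// Hypothetical one-byte prefix for "one entry per updated ID" keys.
const UPDATED_PREFIX: u8 = 3;

/// Old approach: one value holds the whole serialized set.
/// Every insert decodes the set, adds one ID, and re-encodes everything.
fn mark_updated_single_value(db: &mut Db, item: u32) {
    let mut ids = updated_ids_single_value(db);
    if let Err(pos) = ids.binary_search(&item) {
        ids.insert(pos, item);
    }
    let encoded: Vec<u8> = ids.iter().flat_map(|id| id.to_be_bytes()).collect();
    db.insert(UPDATED_SET_KEY.to_vec(), encoded);
}

/// Decode the single serialized set back into a sorted list of IDs.
fn updated_ids_single_value(db: &Db) -> Vec<u32> {
    let bytes = db.get(UPDATED_SET_KEY).cloned().unwrap_or_default();
    bytes
        .chunks_exact(4)
        .map(|c| u32::from_be_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

/// New approach: one small entry per updated ID, no read-modify-write.
fn mark_updated_one_key_per_id(db: &mut Db, item: u32) {
    let mut key = Vec::with_capacity(5);
    key.push(UPDATED_PREFIX);
    key.extend_from_slice(&item.to_be_bytes());
    db.insert(key, Vec::new());
}

/// Collect the updated IDs by scanning the prefixed key range.
/// Big-endian keys keep the IDs in numeric order.
fn updated_ids_one_key_per_id(db: &Db) -> Vec<u32> {
    db.range(vec![UPDATED_PREFIX]..vec![UPDATED_PREFIX + 1])
        .map(|(k, _)| u32::from_be_bytes([k[1], k[2], k[3], k[4]]))
        .collect()
}
```

In the first variant the work per insert grows with the number of IDs already recorded, while in the second it stays a single small write, which is the effect described above.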

The drawback is that we use slightly more memory, but as a reminder, adding 150M vectors in one batch would only use around 0.5GiB (plus the LMDB internal structures).
This is clearly acceptable considering it’s memory mapped, and storing the vectors is going to take at least around 300GiB.
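As a rough check of the 0.5GiB figure, the sketch below computes the raw payload of the stored IDs. It assumes each item ID takes 4 bytes (a `u32`) and ignores LMDB's own per-entry overhead; the function name is made up for illustration.

```rust
/// Number of bytes in one GiB.
const GIB: f64 = (1u64 << 30) as f64;

/// Raw size, in GiB, of `items` IDs stored at `bytes_per_id` bytes each.
/// Assumption: this counts only the ID payload, not LMDB pages or keys' overhead.
fn updated_ids_payload_gib(items: u64, bytes_per_id: u64) -> f64 {
    (items * bytes_per_id) as f64 / GIB
}
```

With 150M items and 4 bytes per ID this gives 600,000,000 bytes, a little under 0.56GiB.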

What does this PR do?

- Create a new kind of node ID, updated, where every associated ItemId is the ID of an item we updated
- Update the rest of the code accordingly
- The tests still work 👌
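The idea of a node ID "kind" can be sketched as a tagged key. The layout below is an assumption for illustration, not the definition used in the PR: the set of modes, the numeric tag values, and the 5-byte encoding are all hypothetical.

```rust
/// Item identifier; assumed to be a `u32` in this sketch.
type ItemId = u32;

/// Hypothetical set of node kinds; `Updated` is the new one.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
enum NodeMode {
    Item = 0,
    Tree = 1,
    Updated = 2,
}

/// A node ID is a kind plus an item ID.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct NodeId {
    mode: NodeMode,
    item: ItemId,
}

impl NodeId {
    /// Key marking `item` as updated.
    fn updated(item: ItemId) -> Self {
        NodeId { mode: NodeMode::Updated, item }
    }

    /// Encode as one tag byte followed by the big-endian item ID,
    /// so all keys of one kind are contiguous and sorted by item ID.
    fn to_bytes(self) -> [u8; 5] {
        let id = self.item.to_be_bytes();
        [self.mode as u8, id[0], id[1], id[2], id[3]]
    }

    /// Decode a key, returning `None` for an unknown tag byte.
    fn from_bytes(bytes: [u8; 5]) -> Option<Self> {
        let mode = match bytes[0] {
            0 => NodeMode::Item,
            1 => NodeMode::Tree,
            2 => NodeMode::Updated,
            _ => return None,
        };
        let item = u32::from_be_bytes([bytes[1], bytes[2], bytes[3], bytes[4]]);
        Some(NodeId { mode, item })
    }
}
```

With a layout like this, listing the updated items amounts to iterating over the keys that carry the `Updated` tag, and recording a new one is a single insert.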


ManyTheFish

Nice!

Store the list of updated IDs directly in LMDB instead of a roaring bitmap

950b792

irevoire added the indexing (Everything related to indexing), performance, and db-breaking labels Oct 1, 2024

irevoire added this to the v0.4.0 milestone Oct 1, 2024

irevoire requested a review from ManyTheFish October 1, 2024 13:45

change the todo! in unreachable!

d9a5694

ManyTheFish approved these changes Oct 1, 2024


irevoire merged commit d72b469 into main Oct 1, 2024
8 checks passed

irevoire deleted the store-the-updated-id-in-lmdb branch October 1, 2024 14:00
