Store the list of updated IDs directly in LMDB instead of a roaring bitmap #99
Pull Request
After chatting with @ManyTheFish, we realized that deserializing a roaring bitmap, inserting an item, and serializing it again into LMDB every time we insert a new item wasn’t the smartest idea.
Ideally, we should write each document in a simple file or in RAM, but that adds a lot of complexity to the usage of arroy.
By simply writing each updated item ID directly into a database, we were able to add 1M items in 1.7s instead of 18s+.
The drawback is that we use slightly more memory, but as a reminder, adding 150M vectors in one batch would only use around 0.5GiB (plus the LMDB internal structures).
This is clearly acceptable considering it’s memory-mapped, and storing the vectors themselves is going to take at least around 300GiB.
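For reference, the 0.5GiB figure can be checked with a quick back-of-the-envelope calculation. This is only a sketch: it assumes each stored ID takes 4 bytes (a `u32`), ignores LMDB's own per-entry overhead, and the function names are hypothetical.

```rust
/// Assumption: each updated item ID is stored as a 4-byte u32.
const ITEM_ID_BYTES: u64 = 4;
const GIB: f64 = 1024.0 * 1024.0 * 1024.0;

/// Raw size, in GiB, of the IDs alone (LMDB internal structures not counted).
fn updated_ids_gib(item_count: u64) -> f64 {
    (item_count * ITEM_ID_BYTES) as f64 / GIB
}

/// Fraction of the vector storage that the ID list represents.
fn overhead_ratio(item_count: u64, vectors_gib: f64) -> f64 {
    updated_ids_gib(item_count) / vectors_gib
}
```

With 150M items, `updated_ids_gib(150_000_000)` comes out to roughly 0.56 GiB, and `overhead_ratio(150_000_000, 300.0)` is below 0.2% of the 300GiB needed for the vectors.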
What does this PR do?
Stores the updated IDs directly in LMDB as updated entries, where every associated ItemId represents the ID of an item we updated.
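To illustrate the difference between the two approaches, here is a minimal sketch that uses a `BTreeMap` as a stand-in for an LMDB database. Everything in it is an assumption made for illustration: the key layout (a one-byte prefix followed by a big-endian ID), the `bitmap` key, and all function names are hypothetical, and a sorted list of `u32` plays the role of the roaring bitmap. It is not the code from this PR.

```rust
use std::collections::BTreeMap;

/// Stand-in for an LMDB database: an ordered map from byte keys to byte values.
type Db = BTreeMap<Vec<u8>, Vec<u8>>;
type ItemId = u32;

/// Hypothetical key holding the whole serialized set (old approach).
const UPDATED_SET_KEY: &[u8] = b"bitmap";
/// Hypothetical one-byte prefix for per-item entries (new approach).
const UPDATED_PREFIX: u8 = b'u';

/// Old approach: decode the whole set, insert one ID, encode it all again.
/// The work per insert grows with the number of IDs already stored.
fn mark_updated_rmw(db: &mut Db, id: ItemId) {
    let mut ids: Vec<ItemId> = db
        .get(UPDATED_SET_KEY)
        .map(|bytes| {
            bytes
                .chunks_exact(4)
                .map(|c| ItemId::from_be_bytes([c[0], c[1], c[2], c[3]]))
                .collect()
        })
        .unwrap_or_default();
    if let Err(pos) = ids.binary_search(&id) {
        ids.insert(pos, id);
    }
    let encoded: Vec<u8> = ids.iter().flat_map(|i| i.to_be_bytes()).collect();
    db.insert(UPDATED_SET_KEY.to_vec(), encoded);
}

/// Builds the per-item key: prefix byte followed by the big-endian ID,
/// so that byte order matches numeric order.
fn updated_key(id: ItemId) -> Vec<u8> {
    let mut key = Vec::with_capacity(5);
    key.push(UPDATED_PREFIX);
    key.extend_from_slice(&id.to_be_bytes());
    key
}

/// New approach: one small entry per updated ID, with an empty value.
fn mark_updated_direct(db: &mut Db, id: ItemId) {
    db.insert(updated_key(id), Vec::new());
}

/// Reads the updated IDs back by scanning every key under the prefix.
fn updated_ids(db: &Db) -> Vec<ItemId> {
    db.range(vec![UPDATED_PREFIX]..vec![UPDATED_PREFIX + 1])
        .filter(|(k, _)| k.len() == 5)
        .map(|(k, _)| ItemId::from_be_bytes([k[1], k[2], k[3], k[4]]))
        .collect()
}
```

In this sketch, marking an item as updated with `mark_updated_direct` is a single small write, while `mark_updated_rmw` rewrites the entire set on every call. Inserting the same ID twice simply overwrites the same key, so duplicates are still deduplicated.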