Binary quantization #82
Merged
Kerollmops reviewed Jul 8, 2024
… from the distance trait
irevoire added the enhancement (New feature or request), breaking (Something that will break in the next release), indexing (Everything related to indexing) and performance labels Sep 16, 2024
… same way to is_positive under x86 and aarch64
Kerollmops requested changes Sep 17, 2024
Kerollmops reviewed Sep 17, 2024
Kerollmops reviewed Sep 17, 2024
Kerollmops requested changes Sep 17, 2024
Kerollmops approved these changes Sep 19, 2024
It looks perfect to me 👌 That is a wonderful job that you've done here @irevoire 👏
Thank you!
Related issue
Fixes #69

What does this PR do?
This PR introduces a new `UnalignedVector` generic type to replace the `UnalignedF32Slice` type we were using before, with:
- A `Codec` for `f32` (equivalent to the old `UnalignedF32Slice` type)
- A codec storing each dimension as a single bit, `0` or `1`, depending on whether its value was negative or positive
- A way to convert a `Vec<f32>` quickly using SIMD
- The `Euclidean`, `Manhattan` and `Angular` distances over binary quantized vectors, respectively named `BinaryQuantizedEuclidean`, `BinaryQuantizedManhattan` and `BinaryQuantizedAngular`
- A `two_means` that « un-binary quantizes » the vectors before searching for the centroid
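As a loose sketch of the sign-based bit codec described above (illustrative only, not arroy's actual implementation), each `f32` dimension could map to one bit of a packed `u64` word:

```rust
/// Illustrative sketch, not arroy's code: quantize each dimension to one
/// bit (1 for positive sign, 0 for negative) and pack the bits into u64s.
fn binary_quantize(vector: &[f32]) -> Vec<u64> {
    let mut words = vec![0u64; (vector.len() + 63) / 64];
    for (i, &value) in vector.iter().enumerate() {
        // `is_sign_positive` behaves the same way under x86 and aarch64.
        if value.is_sign_positive() {
            words[i / 64] |= 1 << (i % 64);
        }
    }
    words
}

fn main() {
    // Dimensions 0 and 2 are positive, so bits 0 and 2 are set: 0b0101 = 5.
    let bits = binary_quantize(&[0.5, -1.2, 3.0, -0.1]);
    println!("{}", bits[0]);
}
```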
Warning: below is the initial investigation of the different methods we could use to make binary quantization work without losing too much relevancy.
First version
Just binary quantize every operation.
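For intuition (a hedged sketch, not the PR's actual distance code): once both vectors are reduced to bits, a squared Euclidean distance over `{0, 1}` coordinates is just the number of differing bits, i.e. a Hamming distance computed with XOR and popcount:

```rust
// Sketch under assumptions, not arroy's implementation: with coordinates
// in {0, 1}, (a - b)^2 is 1 exactly when the bits differ, so the squared
// Euclidean distance equals the Hamming distance over the packed words.
fn hamming(a: &[u64], b: &[u64]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

fn main() {
    // 0b1011 and 0b0010 differ in bits 0 and 3.
    println!("{}", hamming(&[0b1011], &[0b0010]));
}
```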
Here's the measured relevancy:
Second version to improve relevancy
One issue we found is that when binary quantizing vectors, we basically end up creating a bunch of clusters. In two dimensions it would look like this: all vectors end up on one of the four edges of the square. The more dimensions we have, the more clusters we get; the number of clusters is 2^nb_dimensions.
By making the internal computation of `two_means` use the real vectors instead of the binary quantized ones, here are the results:
Warning: it's actually worse than the first version.
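A toy illustration of the clustering effect described above (hypothetical code, not from the PR): many distinct real-valued vectors collapse onto at most 2^d sign patterns after quantization.

```rust
use std::collections::HashSet;

// Hypothetical illustration: sign quantization maps every vector to one of
// at most 2^d bit patterns, i.e. 2^nb_dimensions clusters.
fn quantize(v: &[f32]) -> Vec<u8> {
    v.iter().map(|x| x.is_sign_positive() as u8).collect()
}

fn main() {
    // Five distinct 2-D vectors collapse onto the four possible patterns.
    let samples: [[f32; 2]; 5] =
        [[0.9, 0.1], [2.0, 3.0], [-1.0, 0.5], [0.3, -7.0], [-2.0, -0.4]];
    let clusters: HashSet<Vec<u8>> = samples.iter().map(|v| quantize(v)).collect();
    println!("{}", clusters.len());
}
```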
Third idea to improve relevancy
In the second solution I was computing the `two_means` loop with non-binary-quantized distances, which greatly improved the relevancy. But then the output of `two_means` was binary quantized again. We should try to compute the `normal` on non-binary-quantized distances as well, and then binary quantize this vector right before storing it in the DB.
Note: this improved the relevancy by almost 10 points of recall in the worst case over the previous best solution.
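A minimal sketch of the third idea, assuming a hypothetical `normal_from_centroids` helper (arroy's real `two_means` is more involved): the hyperplane normal is computed on real `f32` values and only binary quantized right before storage.

```rust
// Hypothetical helpers, not arroy's code.
fn normal_from_centroids(c0: &[f32], c1: &[f32]) -> Vec<f32> {
    // The split hyperplane's normal points from one centroid to the other.
    c0.iter().zip(c1).map(|(a, b)| a - b).collect()
}

fn quantize(v: &[f32]) -> Vec<u8> {
    // Only at storage time does the vector become bits.
    v.iter().map(|x| x.is_sign_positive() as u8).collect()
}

fn main() {
    let normal = normal_from_centroids(&[1.0, -2.0, 0.5], &[0.0, 1.0, 2.0]);
    println!("{:?}", quantize(&normal)); // normal = [1.0, -3.0, -1.5]
}
```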
With the bits being [0:1] the relevancy is terrible
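One way to read that remark (an assumption on my part, not spelled out in the PR): with bits kept in {0, 1}, zero coordinates contribute nothing to a dot product, whereas remapping them to {-1, +1} lets every dimension push the score in one direction or the other:

```rust
// Illustrative contrast, not arroy's code: the same bit vectors scored
// with {0, 1} values versus remapped to {-1, +1} via b -> 2*b - 1.
fn dot_01(a: &[u8], b: &[u8]) -> i32 {
    a.iter().zip(b).map(|(&x, &y)| x as i32 * y as i32).sum()
}

fn dot_pm1(a: &[u8], b: &[u8]) -> i32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| (2 * x as i32 - 1) * (2 * y as i32 - 1))
        .sum()
}

fn main() {
    let (a, b) = ([1u8, 0, 1, 1], [1u8, 1, 0, 1]);
    println!("{} {}", dot_01(&a, &b), dot_pm1(&a, &b));
}
```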
Fourth solution to improve relevancy
Store the non-binary-quantized distance in the `SplitNode` in the database directly. This increases the size of the database by a lot (though still way less than with no binary quantization at all). The search may become slower as well.
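The fourth solution could look roughly like this (hypothetical types and field names, not arroy's actual split node): the split normal stays in `f32` in the database while item vectors remain bits, at the cost of larger split nodes.

```rust
// Hypothetical sketch, not arroy's actual types: a split node whose normal
// is stored unquantized while item vectors stay binary quantized.
struct SplitNode {
    normal: Vec<f32>, // stored as real f32s: bigger on disk, more precise
}

fn side(node: &SplitNode, item_bits: &[u8]) -> bool {
    // Dot product between the real-valued normal and the {0, 1} item bits
    // decides which side of the hyperplane the item falls on.
    let dot: f32 = node
        .normal
        .iter()
        .zip(item_bits)
        .map(|(n, &b)| n * b as f32)
        .sum();
    dot.is_sign_positive()
}

fn main() {
    let node = SplitNode { normal: vec![0.5, -1.0, 0.25] };
    println!("{}", side(&node, &[1, 1, 1])); // 0.5 - 1.0 + 0.25 = -0.25
}
```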
The results are bad and I can't explain why:
I made another branch in case we want to investigate further: #84
In conclusion
The best version is the third one.
The next steps are:
Next step in Arroy: