Closes #58
This will keep the 90 percent of items closest to the center (average) embedding of the keys provided. Alternatively, you can set `samples` to an integer count of items to keep (or a negative number to throw away). You can also assume a Gaussian distribution and set the key `std` to a number of standard deviations out from the center, instead of setting `samples`.

Small note about embeddings: if you embed values that are too short, some embedding models will yield a very "sparse" distribution, where the vast majority of points lie on the surface of a hypersphere, meaning that this operation will not work very well!
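To make the `samples` behaviour concrete, here is a minimal sketch in plain Python. It is not the code from this PR: the function names are hypothetical, Euclidean distance is an assumption (the description does not name a metric), and reading a value between 0 and 1 as a fraction is an assumption based on the "90 percent" wording.

```python
import math

def mean_vector(vectors):
    """Component-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def keep_count(samples, total):
    """Translate `samples` into a number of items to keep (assumed semantics)."""
    if isinstance(samples, float) and 0 < samples < 1:
        return round(total * samples)        # fraction: 0.9 keeps 90 percent
    if samples < 0:
        return max(total + int(samples), 0)  # negative: throw away that many
    return min(int(samples), total)          # positive integer: keep that many

def closest_to_center(vectors, samples):
    """Return the indices of the vectors closest to the average embedding."""
    center = mean_vector(vectors)
    # Assumption: Euclidean distance; the real operation may use another metric.
    order = sorted(range(len(vectors)),
                   key=lambda i: math.dist(vectors[i], center))
    return sorted(order[:keep_count(samples, len(vectors))])

# Toy "embeddings": four points near the origin and one far away.
points = [[0, 0], [1, 0], [0, 1], [1, 1], [10, 10]]
print(closest_to_center(points, -1))  # drop the single farthest point
```

Note that the far-away point also pulls the average towards itself, which is one reason a small `samples` cut can behave differently from what you would expect on tiny inputs.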
Optional keys:

- If `keep` is `true`, outliers will be returned instead of non-outliers.
- If `center` is provided, it must have the same keys as those listed under `embedding_keys`, and their values will be used to calculate the "center" embedding, instead of using the average of all embeddings of the input items. This makes it possible to use this operation as a poor man's RAG :P
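The effect of the two optional keys can be sketched as follows. This is an illustration only, not the PR's implementation: `split_by_center` and its `count` parameter are hypothetical names, the vectors stand in for already-computed embeddings, and Euclidean distance is again an assumption.

```python
import math

def split_by_center(vectors, center, count, keep=False):
    """Return indices of the `count` vectors nearest to `center`.

    With keep=True the remaining vectors (the outliers) are returned instead.
    """
    order = sorted(range(len(vectors)),
                   key=lambda i: math.dist(vectors[i], center))
    inliers, outliers = order[:count], order[count:]
    return sorted(outliers if keep else inliers)

# Hypothetical document embeddings and a query embedding used as the center.
docs = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [9.0, 9.0]]
query = [1.2, 0.8]

print(split_by_center(docs, query, 2))             # nearest documents
print(split_by_center(docs, query, 2, keep=True))  # everything else
```

Passing a query embedding as the center turns the filter into a nearest-neighbour lookup, which is the retrieval-like use mentioned above.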