Outliers #91

Merged
merged 6 commits into from
Oct 12, 2024

@redhog redhog commented Oct 10, 2024

Closes #58

- name: remove-worst-10
  type: outliers
  samples: 0.9
  embedding_keys:
    - concept
    - description

This will keep the 90 percent of items closest to the center (average) embedding of the keys provided. Alternatively, you can set samples to an integer count of items to keep (or a negative number to throw that many away). Instead of setting samples, you can also assume a Gaussian distribution and set the key std to a number of standard deviations out from the center.
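
A rough sketch of how the samples setting could be interpreted, for illustration only; this is not the code from the PR. The function names (`count_to_keep`, `filter_outliers`) are made up here, the embeddings are assumed to be already computed as plain lists of floats, and distance is assumed to be Euclidean. The std option is left out.

```python
import math


def count_to_keep(samples, total):
    """Translate the `samples` setting into a number of items to keep."""
    if isinstance(samples, float):
        return int(round(samples * total))  # fraction of items to keep
    if samples < 0:
        return max(total + samples, 0)      # negative: throw away that many
    return min(samples, total)              # positive integer: keep that many


def filter_outliers(items, embeddings, samples, keep=False):
    """Return the items closest to the mean embedding.

    With keep=True the outliers are returned instead (assumed semantics).
    """
    n_items = len(items)
    # Center is the component-wise average of all embeddings.
    center = [sum(col) / n_items for col in zip(*embeddings)]
    # Item indices ordered from closest to farthest from the center.
    order = sorted(range(n_items), key=lambda i: math.dist(embeddings[i], center))
    n = count_to_keep(samples, n_items)
    chosen = order[n:] if keep else order[:n]
    return [items[i] for i in sorted(chosen)]  # preserve input order
```

With five items and samples: 0.8, the four items closest to the center are returned; samples: -1 gives the same result by throwing away the single farthest item.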

Small note about embeddings: if you embed values that are too short, some embedding models will yield a very "sparse" distribution, where the vast majority of points lie on the surface of a hypersphere. In that case all items are almost equally far from the center, so this operation will not work very well!
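
One way to check for this before trusting the result is to look at how much the distances to the center actually vary. The helper below is a hypothetical diagnostic, not part of the operation; it assumes embeddings are plain lists of floats and uses Euclidean distance.

```python
import math
import statistics


def distance_spread(embeddings):
    """Relative spread (std / mean) of distances to the mean embedding.

    A value near 0 means every point is about equally far from the center,
    so ranking items by distance carries little information.
    """
    n = len(embeddings)
    center = [sum(col) / n for col in zip(*embeddings)]
    distances = [math.dist(vec, center) for vec in embeddings]
    mean_distance = statistics.mean(distances)
    if mean_distance == 0:
        return 0.0  # all embeddings are identical
    return statistics.pstdev(distances) / mean_distance
```

What counts as "too small" a spread depends on the embedding model and the data, so treat any threshold as something to tune by hand.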

Optional keys:

  keep: false
  center:
    concept: Horse
    description: A horse is a large steppe roaming and grazing animal. Humans have utilized horses for transport throughout historical times

If keep is true, outliers will be returned instead of non-outliers.

If center is provided, it must have the same keys as those listed under embedding_keys, and their values will be used to calculate the "center" embedding, instead of using the average of all embeddings of the input items. This makes it possible to use this operation as a poor man's RAG :P
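
To illustrate the poor man's RAG idea, here is a minimal sketch that ranks items against an explicit center record. Everything in it is an assumption made for the example: `toy_embed` is a word-count stand-in for a real embedding model, `closest_to_center` is a made-up name, and joining the key values into one string before embedding is just one plausible way to combine them.

```python
import math


def toy_embed(text, vocabulary):
    """Stand-in for a real embedding model: counts of known words."""
    words = text.lower().split()
    return [float(words.count(term)) for term in vocabulary]


def closest_to_center(items, embedding_keys, center, embed, n):
    """Keep the n items whose embedded keys lie closest to the embedded center."""

    def joined(record):
        # Assumption: the values of all embedding keys are embedded together.
        return " ".join(str(record[key]) for key in embedding_keys)

    center_vec = embed(joined(center))

    def dist(record):
        return math.dist(embed(joined(record)), center_vec)

    return sorted(items, key=dist)[:n]
```

Passing a query-like record as center turns the filter into a nearest-neighbour lookup: the items returned are the ones most similar to the query rather than the ones most typical of the dataset.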

@redhog redhog marked this pull request as ready for review October 11, 2024 14:55
@redhog redhog marked this pull request as draft October 11, 2024 15:00
@redhog redhog marked this pull request as ready for review October 11, 2024 15:09

redhog commented Oct 11, 2024

  • This is in some ways similar to the sample operation. Maybe they should just be one operation with different options?
  • Maybe there should be an option to keep the outliers instead of the center (for debugging purposes?)?

@shreyashankar shreyashankar changed the base branch from main to staging October 12, 2024 19:42
@shreyashankar shreyashankar merged commit 4a12935 into ucbepic:staging Oct 12, 2024
1 of 4 checks passed
shreyashankar added a commit that referenced this pull request Oct 14, 2024

Successfully merging this pull request may close these issues.

Outliers-filter