Outliers #91

redhog · 2024-10-10T18:25:16Z

Closes #58

- name: remove-worst-10
  type: outliers
  samples: 0.9
  embedding_keys:
    - concept
    - description

This keeps the 90 percent of items closest to the center (average) embedding of the keys provided. Alternatively, you can set samples to an integer count of items to keep (or a negative number for the count of items to throw away). Instead of setting samples, you can also assume a Gaussian distribution and set the key std to a number of standard deviations out from the center.
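To make the three ways of setting the cutoff concrete, here is a minimal pure-Python sketch of the selection logic. It is not the code in this PR: the function name `closest_to_center` is hypothetical, embeddings are assumed to be plain lists of floats, and the handling of std (a cutoff at std times the root-mean-square distance from the center) is only one plausible reading of "standard deviations out from the center".

```python
import math


def closest_to_center(vectors, samples=None, std=None):
    """Return indices (in input order) of the vectors kept as non-outliers."""
    if samples is None and std is None:
        raise ValueError("set either samples or std")

    total = len(vectors)
    # Center = component-wise average of all embeddings.
    center = [sum(column) / total for column in zip(*vectors)]
    distances = [math.dist(vector, center) for vector in vectors]

    if std is not None:
        # Assumption: the cutoff is `std` times the root-mean-square
        # distance from the center (a standard-deviation-like spread).
        spread = math.sqrt(sum(d * d for d in distances) / total)
        return [i for i, d in enumerate(distances) if d <= std * spread]

    if isinstance(samples, float):
        count = int(samples * total)     # fraction of items to keep
    elif samples < 0:
        count = max(total + samples, 0)  # negative: number to throw away
    else:
        count = min(samples, total)      # positive integer: number to keep

    # Rank by distance to the center and keep the `count` closest.
    ranked = sorted(range(total), key=lambda i: distances[i])
    return sorted(ranked[:count])


vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]]
print(closest_to_center(vectors, samples=0.75))  # [0, 1, 2]
print(closest_to_center(vectors, samples=-1))    # [0, 1, 2]
print(closest_to_center(vectors, samples=2))     # [1, 2]
print(closest_to_center(vectors, std=1.0))       # [0, 1, 2]
```

In this toy data the point at (10, 10) pulls the average towards itself but is still the farthest from it, so every variant drops it first.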

A small note about embeddings: if you embed values that are too short, some embedding models will yield a very "sparse" distribution, where the vast majority of points lie on the surface of a hypersphere, so this operation will not work very well!

Optional keys:

  keep: false
  center:
    concept: Horse
    description: A horse is a large steppe roaming and grazing animal. Humans have utilized horses for transport throughout historical times

If keep is true, outliers will be returned instead of non-outliers.

If center is provided, it must have the same keys as those listed under embedding_keys, and their values will be used to calculate the "center" embedding, instead of using the average of all embeddings of the input items. This makes it possible to use this operation as a poor man's RAG :P
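As a rough illustration of how an explicit center and the keep flag interact, here is a small sketch, again not the PR's implementation. The helper `split_by_distance` is hypothetical, and the two-dimensional vectors are hand-written stand-ins for real embeddings of the items and of the center.

```python
import math


def split_by_distance(items, vectors, center, n_closest, keep=False):
    """Return the n_closest items to `center`, or the rest when keep is True."""
    # Rank item indices by Euclidean distance to the supplied center.
    order = sorted(range(len(items)), key=lambda i: math.dist(vectors[i], center))
    inliers = set(order[:n_closest])
    # keep=False returns the non-outliers, keep=True returns the outliers.
    return [item for i, item in enumerate(items) if (i in inliers) != keep]


items = ["horse", "pony", "car"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy stand-ins for embeddings
center = [1.0, 0.0]                              # stand-in for the "Horse" center

print(split_by_distance(items, vectors, center, n_closest=2))             # ['horse', 'pony']
print(split_by_distance(items, vectors, center, n_closest=2, keep=True))  # ['car']
```

With a query-like center, the non-outliers are simply the items most similar to the query, which is what makes the retrieval-style use possible.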

redhog · 2024-10-11T15:10:37Z

This is in some ways similar to the sample operation. Maybe they should just be one operation with different options?
Maybe there should be an option to keep the outliers instead of the center (for debugging purposes)?

Outliers

3d04b1e

redhog marked this pull request as ready for review October 11, 2024 14:55

redhog marked this pull request as draft October 11, 2024 15:00

Changed api to look more like sample. Maybe these two should even be merged?

2e4b8c3

redhog marked this pull request as ready for review October 11, 2024 15:09

Egil added 3 commits October 11, 2024 17:35

More options for outliers

a3e53ab

Added docs

7a13061

Added more docs

b2ee5a2

shreyashankar changed the base branch from main to staging October 12, 2024 19:42

Merge branch 'staging' into outliers

40c853e

shreyashankar merged commit 4a12935 into ucbepic:staging Oct 12, 2024
1 of 4 checks passed

shreyashankar added a commit that referenced this pull request Oct 14, 2024

Merge pull request #103 from garuna-m6/#91-document-->-item-renaming

0c6a4dd

#91 document > item renaming
