De-duplicate between image data collections #39292
I'm collecting data from cameras to create a training dataset. I want to exclude similar or duplicate images, since storing them would be wasteful, and I want to judge similarity by how close the vector embeddings are. The methodology I am trying is: when an image comes in, calculate its distance to the embeddings already in the dataset (collection); if the closest vector is farther away than a threshold a, add the image to the dataset (collection), otherwise skip it. Is this methodology efficient and correct? I would appreciate it if you could answer these questions. Have a great day!
Replies: 2 comments 1 reply
-
It is difficult to define the threshold. The score/distance values do not follow a linear curve, and if you use a different embedding model, the threshold will be different. You can run some tests, observe the results, and determine a "good threshold", but that threshold may not work well in other cases.
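One way such a test could look, as a rough sketch only: measure distances for a few pairs you have labelled by hand as duplicates and as distinct images, then check whether a gap separates the two groups. The helper name `suggest_threshold` and the numbers in the usage lines are hypothetical and purely illustrative; they do not come from any real model.

```python
def suggest_threshold(dup_distances, distinct_distances):
    """Suggest a cut-off between hand-labelled duplicate and distinct pairs.

    Returns the midpoint of the gap between the largest duplicate distance
    and the smallest distinct distance, or None when the two groups overlap
    (in which case no single threshold separates them cleanly).
    """
    max_dup = max(dup_distances)
    min_distinct = min(distinct_distances)
    if max_dup >= min_distinct:
        return None  # overlapping groups: no clean threshold exists
    return (max_dup + min_distinct) / 2


# Illustrative, made-up distances for one embedding model and one metric.
separable = suggest_threshold([0.01, 0.03], [0.20, 0.40])
overlapping = suggest_threshold([0.01, 0.25], [0.20, 0.40])
```

A value found this way holds only for the embedding model and distance metric it was measured with, so it would need to be re-measured after changing either.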
-
I think this can be implemented on the client side. You can search before you insert and skip the insert when the nearest result is within your distance threshold. Have you considered deduplicating offline? I guess that could potentially be faster.
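A minimal client-side sketch of the search-before-insert idea, assuming cosine distance and non-zero embedding vectors. `DedupCollection` is a hypothetical in-memory stand-in, not a real database client; with an actual vector database, the brute-force scan would be replaced by a top-1 search against the collection.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 0 means same direction. Assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


class DedupCollection:
    """Hypothetical in-memory stand-in for a vector collection."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.vectors = []

    def nearest_distance(self, vec):
        # Brute-force scan; returns infinity when the collection is empty.
        return min(
            (cosine_distance(vec, v) for v in self.vectors),
            default=math.inf,
        )

    def add_if_novel(self, vec):
        """Insert vec only if its nearest neighbour is farther than the threshold."""
        if self.nearest_distance(vec) > self.threshold:
            self.vectors.append(vec)
            return True
        return False  # treated as a near-duplicate and skipped


coll = DedupCollection(threshold=0.05)  # threshold value is illustrative
first = coll.add_if_novel([1.0, 0.0])          # empty collection: inserted
near_dup = coll.add_if_novel([0.999, 0.01])    # almost same direction: skipped
different = coll.add_if_novel([0.0, 1.0])      # orthogonal: inserted
```

The scan above costs one distance computation per stored vector for every incoming image, which is one reason an indexed search or an offline batch pass might scale better.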