De-duplicate between image data collections #39292

Go-MinSeong · 2025-01-15T08:48:53Z

I'm collecting images from cameras to build a training dataset. I want to exclude similar or duplicate images, since keeping them would be wasteful, so I want to avoid storing images whose vector embeddings are too close to each other.

To achieve this, I am trying the following approach: when an image comes in, compute its distance to the embeddings in the existing dataset (collection). If the closest vector is farther than a threshold, add the image to the dataset (collection); otherwise skip it.

Is the above methodology efficient or correct?
Also, how should I set the threshold: should I just take the existing 100 images as a dataset and use the statistics?

I would appreciate it if you could answer these questions. Have a great day!

Answered by yhmo

yhmo · 2025-01-15T10:46:51Z

Collaborator

It is difficult to define the threshold. The score/distance values do not follow a linear curve, and if you use a different embedding model, the threshold will be different. You can run some tests to observe the results and determine a "good threshold", but that threshold may not work well in other cases.
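One way to run the kind of test described above is to look at nearest-neighbor distances inside a small sample (for example the 100 existing images mentioned in the question) and take a low percentile as a starting threshold. The following is a minimal sketch under assumptions that are not in the thread: cosine distance as the metric, non-zero embedding vectors, and the hypothetical helper names `nearest_neighbor_distances`, `percentile`, and `suggest_threshold`. The value it returns is only valid for the embedding model that produced the sample and should be checked by eye against real image pairs.

```python
import math

def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity); assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest_neighbor_distances(vectors):
    """For each vector, the distance to its closest other vector in the sample."""
    distances = []
    for i, v in enumerate(vectors):
        distances.append(
            min(cosine_distance(v, w) for j, w in enumerate(vectors) if j != i)
        )
    return distances

def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def suggest_threshold(vectors, q=10):
    """Hypothetical heuristic: a low percentile of nearest-neighbor distances."""
    return percentile(nearest_neighbor_distances(vectors), q)
```

A low percentile only makes sense if the sample already contains some near-duplicates; if the sample is clean, the suggested value describes how close distinct images get, which is an upper bound for the threshold, not the threshold itself.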

xiaofan-luan · 2025-01-18T16:17:36Z

Maintainer

I think this can be implemented on the client side.

You can search before you insert, and skip the insert when the closest existing vector is within a certain threshold.

Have you considered deduplicating offline? I guess that could potentially be faster.
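The search-before-insert idea can be sketched as follows. This is an illustration only: the `InMemoryCollection` class is a hypothetical stand-in for a real collection, the linear scan stands in for a vector search against the database, and cosine distance and the threshold value are assumptions.

```python
import math

def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity); assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

class InMemoryCollection:
    """Hypothetical stand-in for a vector collection with client-side dedup."""

    def __init__(self, threshold):
        self.threshold = threshold  # distances at or below this count as duplicates
        self.vectors = []

    def nearest_distance(self, query):
        """Distance to the closest stored vector, or None if the collection is empty."""
        if not self.vectors:
            return None
        return min(cosine_distance(query, v) for v in self.vectors)

    def insert_if_novel(self, query):
        """Search first; insert only when nothing stored is within the threshold."""
        nearest = self.nearest_distance(query)
        if nearest is not None and nearest <= self.threshold:
            return False  # too close to an existing image, skip it
        self.vectors.append(list(query))
        return True
```

With a real vector database, `nearest_distance` would become a top-1 search on the collection, and the comparison direction depends on the metric: for similarity scores where higher means closer, the check flips to "skip when the score is at or above the threshold".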

1 reply

xiaofan-luan Jan 18, 2025
Maintainer

We plan to implement exact-match vector dedup, but right now we have no plan to implement similarity dedup.

We are thinking of implementing an offline dedup algorithm that runs on a large dataset and returns both the list of duplicated data and the deduplicated list.
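To make the shape of such an offline pass concrete, here is a small greedy sketch that returns both lists. It is not the algorithm the maintainers have in mind, which the thread does not describe; the function name `dedup_offline`, the cosine metric, and the quadratic scan are all assumptions chosen for brevity.

```python
import math

def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity); assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def dedup_offline(vectors, threshold):
    """Greedy single pass over the dataset.

    Returns (kept, duplicates): `kept` holds the indices of retained vectors,
    `duplicates` holds (index, kept_index) pairs for every dropped vector.
    """
    kept = []
    duplicates = []
    for i, v in enumerate(vectors):
        # First retained vector that lies within the threshold, if any.
        match = next(
            (k for k in kept if cosine_distance(v, vectors[k]) <= threshold),
            None,
        )
        if match is None:
            kept.append(i)
        else:
            duplicates.append((i, match))
    return kept, duplicates
```

The result depends on input order, because the first vector of each near-duplicate group is the one that survives. The scan compares every vector against all retained ones, so a large dataset would need an approximate nearest-neighbor index in place of the inner loop.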
