Replies: 2 comments 2 replies
-
I think we wrote this section for over-/undersampling because of the constant requests for SMOTE. The problem with this approach is that you are showing the model the exact same instances it already saw (sampling with replacement), so there is no new information there. I think the best person to chime in on this is @janezd.
-
We discussed this in #3269. I stand by my opinion. :)
-
The help section of the Data Sampler widget points out that it can be used for under- or oversampling. I used the Attrition dataset, where the class imbalance is 1233/237. I separated the minority class with the Select Rows widget and connected it to the Data Sampler widget, where I chose Fixed sample size = 1000 and sample with replacement. I then connected the Concatenate widget to the initial dataset, the Select Rows widget, and the Data Sampler widget to create a new dataset in which the minority and majority classes were about the same size. Classification performance (AUC) with logistic regression, neural networks, and AdaBoost improved, and classification accuracy stayed about the same.
I fully understand the Python options for over- or undersampling, but the goal of my workshop for older clinicians is to do as much as possible in Orange. Do you believe this is a legitimate approach to oversampling the minority class?
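For readers comparing the widget workflow to the Python route: the Select Rows → Data Sampler → Concatenate pipeline described above amounts to random oversampling with replacement. A minimal pandas sketch (using toy stand-in data with the same 1233/237 imbalance, not the actual Attrition dataset) might look like this:

```python
# Hypothetical pandas equivalent of the Orange workflow described above:
# oversample the minority class with replacement, then concatenate it
# back onto the original data. The column names are made up for the sketch.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy stand-in for the Attrition data: 1233 majority vs. 237 minority rows.
df = pd.DataFrame({
    "x": rng.normal(size=1470),
    "attrition": ["No"] * 1233 + ["Yes"] * 237,
})

# Select Rows widget: isolate the minority class.
minority = df[df["attrition"] == "Yes"]

# Data Sampler widget: fixed sample size = 1000, sample with replacement.
resampled = minority.sample(n=1000, replace=True, random_state=0)

# Concatenate widget: original data plus the resampled minority rows.
balanced = pd.concat([df, resampled], ignore_index=True)

print(balanced["attrition"].value_counts())
# Classes are now roughly balanced: 1233 "No" vs. 237 + 1000 = 1237 "Yes".
```

Note that, as the reply above says, every resampled row is an exact duplicate of a row the model already sees, which is why this adds no new information.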