
Why does the teach method fit the data? #95

Open
Efesencan opened this issue Aug 3, 2020 · 4 comments

@Efesencan

Efesencan commented Aug 3, 2020

When you teach your ActiveLearner a queried instance and its label, it does not just add the new instance to the training dataset, it also refits the model on the newly labeled data. In my case this is unnecessary: when I learn the label of one instance, I automatically learn the labels of about 300 other instances (this number can vary), since they share the same label. Therefore, I have to teach 300 new instances to the ActiveLearner at each query iteration, which takes a lot of time because of the fit method. For this reason, I believe that fitting should be performed only in the query method.

@Efesencan
Author

One could argue that a user may want to call the predict method right after teaching, and that this is why fit is called inside the teach method. But as I described above, that approach is problematic. At the very least, there should be an option to choose whether fitting is performed or not.

@cosmic-cortex
Member

To only add training data without refitting the estimator, you can use the ActiveLearner._add_training_data method. (Here is the implementation: https://github.com/modAL-python/modAL/blob/master/modAL/models/base.py#L68-L92)

This is a "private" method, so I didn't include it in the documentation, but the method itself is documented, so it should be easy to use.

I don't exactly understand your use case and argument. What is the underlying model you are using?

If by querying a single label you learn multiple other labels indirectly, then you can manually add these to X_new and y_new before calling the teach method, so the model is fitted only once per query iteration instead of once per instance. This is roughly what I mean:

import numpy as np

query_idx, X_query = learner.query(X_pool)

# ...
# get the label(s) y_query for X_query somehow
# ...

# these are the instances and labels you find indirectly after querying a single label
X_other, y_other = ...

X_new = np.concatenate((X_query, X_other))
y_new = np.concatenate((y_query, y_other))

learner.teach(X_new, y_new)

@Efesencan
Author

Efesencan commented Aug 4, 2020

Okay, I got your point. Another question: should I delete the queried instance from X_pool and its corresponding label from y_pool after I make a query (learn the label) and teach it, at each query iteration? Or is that unnecessary?

@cosmic-cortex
Member

Yes, it should be deleted manually. Otherwise, the query strategy might select data that is already part of your training set, possibly leading to model bias in some scenarios.
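A rough sketch of that bookkeeping (assuming X_pool and y_pool are numpy arrays, and the label is simply looked up in y_pool here for illustration):

import numpy as np

query_idx, X_query = learner.query(X_pool)
learner.teach(X_query, y_pool[query_idx])

# remove the queried rows from the pool so they cannot be selected again
X_pool = np.delete(X_pool, query_idx, axis=0)
y_pool = np.delete(y_pool, query_idx, axis=0)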

There is a PR by @talolard, who proposed a data manager class but eventually decided to put it into a completely new package. I don't know its current status, but it would be very useful for this case.
