[Minor] Enable continuation of training #1605

Open · wants to merge 14 commits into main

Conversation

@Constantin343 (Contributor) commented Jul 1, 2024

🔬 Background

Enable continuing the training of a fitted model, as described in #828.

This is also needed to address #1542.

🔮 Key changes

Load model state from the last checkpoint and continue training with a new scheduler. The last learning rate is used as a starting point for the new scheduler.

ExponentialLR is used as the new scheduler, since continuing with OneCycleLR leads to issues. @ourownstory please let me know if you have an idea for a specific scheduler that makes sense here.
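
For illustration, a minimal sketch (not the PR's actual code) of the scheduler hand-off described above, using plain PyTorch; the model, `last_lr`, and `gamma` values are placeholders:

```python
import torch

# Illustrative sketch only: hand the last learning rate of a finished
# OneCycleLR run over to a fresh ExponentialLR schedule.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

last_lr = 3e-4  # hypothetical LR taken from the last checkpoint
for param_group in optimizer.param_groups:
    param_group["lr"] = last_lr  # start the continued run where the old one stopped

scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(5):
    # ... one epoch of training steps with optimizer.step() would go here ...
    scheduler.step()  # decay the LR by `gamma` each epoch, starting from last_lr
```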

📋 Review Checklist

  • I have performed a self-review of my own code.
  • I have commented my code, added docstrings and data types to function definitions.
  • I have added pytests to check whether my feature / fix works.

Please make sure to follow our best practices in the Contributing guidelines.

@ourownstory (Owner) commented

@Constantin343 "The last learning rate is used as a starting point for the new scheduler."
Do you mean the LR found in the LR test, or the actual LR at the end of the cycle?

@ourownstory (Owner) left a review comment

Looks like you got a first version running - good work!

I left a few comments aiming to make the user settings more robust and to encapsulate the code in a manner similar to existing practices.

Furthermore, I think we should avoid hard-coding the optimizer and scheduler for this functionality. I suggest you make them optionally user-settable (with reasonable defaults). And if so, then it may make sense to do the same for the regular optimizer and scheduler in the same manner.

neuralprophet/forecaster.py (review thread resolved)
@@ -1067,6 +1067,10 @@ def fit(

        if self.fitted is True and not continue_training:
            log.error("Model has already been fitted. Re-fitting may break or produce different results.")

        if continue_training and self.metrics_logger.checkpoint_path is None:
            log.error("Continued training requires checkpointing in model.")
@ourownstory (Owner) commented

Can you please explain what necessitates this (for my understanding)?

@Constantin343 (Author) replied

My thinking was that it makes sense to continue from the checkpoint, but it's probably not necessary. All the necessary parameters should still be available in the model itself. I will adapt it.

@Constantin343 (Author) replied

I did some testing, and it seems a checkpoint is indeed necessary to correctly continue training with the PyTorch Lightning trainer. I can get it to run without checkpointing, but fitting again always leads to a complete restart of the training.

Maybe there is some workaround, but I would suggest keeping it like this, as continued training always goes hand in hand with checkpointing in PyTorch Lightning.
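
For reference, a minimal, self-contained sketch of how resuming typically works with the PyTorch Lightning Trainer (a toy module, not NeuralProphet's code; paths, sizes, and epoch counts are placeholders): passing `ckpt_path` to `fit()` restores weights, optimizer/scheduler state, and the epoch counter, while omitting it restarts training from scratch.

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

# Toy illustration only: shows why continued training is tied to checkpointing.
class ToyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-3)

data = TensorDataset(torch.randn(64, 4), torch.randn(64, 1))
loader = DataLoader(data, batch_size=16)

# First run: checkpointing enabled, so a checkpoint is written.
trainer = pl.Trainer(max_epochs=2, enable_checkpointing=True, default_root_dir="ckpts")
trainer.fit(ToyModule(), loader)

# Continued run: ckpt_path restores model, optimizer/scheduler state, and the
# epoch counter; calling fit() without it starts training over from scratch.
trainer2 = pl.Trainer(max_epochs=4, enable_checkpointing=True, default_root_dir="ckpts")
trainer2.fit(ToyModule(), loader, ckpt_path=trainer.checkpoint_callback.best_model_path)
```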

neuralprophet/forecaster.py (outdated, review thread resolved)
neuralprophet/forecaster.py (outdated, review thread resolved)
neuralprophet/time_net.py (review thread resolved)
@Constantin343 (Author) commented

@Constantin343 "The last learning rate is used as a starting point for the new scheduler." Do you mean the LR found in the LR test, or the actual LR at the end of the cycle?

Yeah, the learning rate at the end of the cycle (actually from the last checkpoint, to be precise). The idea was to ensure a smooth transition.
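
To make that concrete, a hypothetical sketch of reading the last LR out of a checkpoint, assuming the checkpoint layout of recent PyTorch Lightning versions; the path is a placeholder:

```python
import torch

# Hypothetical illustration: Lightning checkpoints store optimizer state under
# "optimizer_states"; its param groups carry the LR that was active when the
# checkpoint was written.
ckpt = torch.load("checkpoints/last.ckpt", map_location="cpu")
last_lr = ckpt["optimizer_states"][0]["param_groups"][0]["lr"]
print(f"Starting the new scheduler from lr={last_lr}")
```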

@Constantin343 (Author) commented Jul 5, 2024

> Furthermore, I think we should avoid hard-coding the optimizer and scheduler for this functionality. I suggest you make them optionally user-settable (with reasonable defaults). And if so, then it may make sense to do the same for the regular optimizer and scheduler in the same manner.

Totally agree. I tried again to make it work with the OneCycleLR scheduler, but it just doesn't seem to work for continued training. I will try to make some adaptations over the weekend to accept scheduler_type as a parameter, so the user can choose among different schedulers.
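
A rough sketch of what a user-settable scheduler could look like; the parameter name `scheduler_type`, the option strings, and the default kwargs are illustrative, not the final NeuralProphet API:

```python
import torch

# Hypothetical mapping from a user-facing scheduler_type string to a scheduler
# class plus reasonable default kwargs.
SCHEDULERS = {
    "onecycle": (torch.optim.lr_scheduler.OneCycleLR, {"max_lr": 1e-2, "total_steps": 100}),
    "exponential": (torch.optim.lr_scheduler.ExponentialLR, {"gamma": 0.95}),
    "cosine": (torch.optim.lr_scheduler.CosineAnnealingLR, {"T_max": 100}),
}

def build_scheduler(optimizer, scheduler_type="onecycle", **overrides):
    """Instantiate the requested scheduler, letting user kwargs override defaults."""
    try:
        cls, defaults = SCHEDULERS[scheduler_type]
    except KeyError:
        raise ValueError(f"Unknown scheduler_type {scheduler_type!r}; choose from {sorted(SCHEDULERS)}")
    return cls(optimizer, **{**defaults, **overrides})

# Usage: continued training could default to ExponentialLR instead of OneCycleLR.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = build_scheduler(optimizer, scheduler_type="exponential", gamma=0.9)
```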

@Constantin343 Constantin343 marked this pull request as draft July 9, 2024 08:14
@Constantin343 Constantin343 marked this pull request as ready for review July 9, 2024 12:07
@Constantin343 Constantin343 requested a review from ourownstory July 9, 2024 12:07
codecov bot commented Aug 23, 2024

Codecov Report

Attention: Patch coverage is 81.08108% with 14 lines in your changes missing coverage. Please review.

Project coverage is 88.09%. Comparing base (fa97742) to head (df74dc3).
Report is 25 commits behind head on main.

Files with missing lines      Patch %   Missing lines
neuralprophet/configure.py    68.18%    7 ⚠️
neuralprophet/forecaster.py   89.18%    4 ⚠️
neuralprophet/time_net.py     80.00%    3 ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1605      +/-   ##
==========================================
- Coverage   88.27%   88.09%   -0.18%     
==========================================
  Files          41       41              
  Lines        5364     5428      +64     
==========================================
+ Hits         4735     4782      +47     
- Misses        629      646      +17     

☔ View full report in Codecov by Sentry.
