[Minor] Enable continuation of training #1605

Open · wants to merge 14 commits into main

Conversation

@Constantin343 (Contributor) commented Jul 1, 2024

🔬 Background

Enable continuing the training of a fitted model, as described in #828.

This is also needed to address #1542.

🔮 Key changes

Load model state from the last checkpoint and continue training with a new scheduler. The last learning rate is used as a starting point for the new scheduler.

ExponentialLR is used as the new scheduler, since continuing with OneCycleLR leads to issues. @ourownstory please let me know if you have an idea for a specific scheduler that makes sense here.
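
For illustration, a minimal sketch (not the PR's actual code) of the scheduler hand-off described above, using plain PyTorch; the model, `last_lr`, and `gamma` values are placeholders:

```python
import torch

# Illustrative sketch only: hand the last learning rate of a finished
# OneCycleLR run over to a fresh ExponentialLR schedule.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

last_lr = 3e-4  # hypothetical LR taken from the last checkpoint
for param_group in optimizer.param_groups:
    param_group["lr"] = last_lr  # start the continued run where the old one stopped

scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(5):
    # ... one epoch of training steps with optimizer.step() would go here ...
    scheduler.step()  # decay the LR by `gamma` each epoch, starting from last_lr
```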

📋 Review Checklist

  • I have performed a self-review of my own code.
  • I have commented my code, added docstrings and data types to function definitions.
  • I have added pytests to check whether my feature / fix works.

Please make sure to follow our best practices in the Contributing guidelines.

@ourownstory (Owner) commented

@Constantin343 "The last learning rate is used as a starting point for the new scheduler."
Do you mean the LR found in the LR test, or the actual LR at the end of the cycle?

@ourownstory (Owner) left a review comment

Looks like you got a first version running - good work!

I left a few comments aiming to make the user settings more robust and to encapsulate the code in a manner similar to existing practices.

Furthermore, I think we should avoid hard-coding the optimizer and scheduler for this functionality. I suggest you make them optionally user-settable (with reasonable defaults). And if so, then it may make sense to do the same for the regular optimizer and scheduler in the same manner.

neuralprophet/forecaster.py (review thread resolved)
@@ -1067,6 +1067,10 @@ def fit(

        if self.fitted is True and not continue_training:
            log.error("Model has already been fitted. Re-fitting may break or produce different results.")

        if continue_training and self.metrics_logger.checkpoint_path is None:
            log.error("Continued training requires checkpointing in model.")
@ourownstory (Owner) commented

Can you please explain what necessitates this (for my understanding)?

@Constantin343 (Author) replied

My thinking was that it makes sense to continue from the checkpoint, but it's probably not necessary. All the necessary parameters should still be available in the model itself. I will adapt it.

@Constantin343 (Author) replied

I did some testing, and it seems a checkpoint is indeed necessary to correctly continue training with the PyTorch Lightning trainer. I can get it to run without checkpointing, but fitting again always leads to a complete restart of the training.

Maybe there is some workaround, but I would suggest keeping it like this, as continued training always goes hand in hand with checkpointing in PyTorch Lightning.
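
For reference, a minimal, self-contained sketch of how resuming typically works with the PyTorch Lightning Trainer (a toy module, not NeuralProphet's code; paths, sizes, and epoch counts are placeholders): passing `ckpt_path` to `fit()` restores weights, optimizer/scheduler state, and the epoch counter, while omitting it restarts training from scratch.

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

# Toy illustration only: shows why continued training is tied to checkpointing.
class ToyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-3)

data = TensorDataset(torch.randn(64, 4), torch.randn(64, 1))
loader = DataLoader(data, batch_size=16)

# First run: checkpointing enabled, so a checkpoint is written.
trainer = pl.Trainer(max_epochs=2, enable_checkpointing=True, default_root_dir="ckpts")
trainer.fit(ToyModule(), loader)

# Continued run: ckpt_path restores model, optimizer/scheduler state, and the
# epoch counter; calling fit() without it starts training over from scratch.
trainer2 = pl.Trainer(max_epochs=4, enable_checkpointing=True, default_root_dir="ckpts")
trainer2.fit(ToyModule(), loader, ckpt_path=trainer.checkpoint_callback.best_model_path)
```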

neuralprophet/forecaster.py (outdated, review thread resolved)
neuralprophet/forecaster.py (outdated, review thread resolved)
neuralprophet/time_net.py (review thread resolved)
@Constantin343 (Author) commented

@Constantin343 "The last learning rate is used as a starting point for the new scheduler." Do you mean the LR found in the LR test, or the actual LR at the end of the cycle?

Yeah, the learning rate at the end of the cycle (actually from the last checkpoint, to be precise). The idea was to ensure a smooth transition.
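
To make that concrete, a hypothetical sketch of reading the last LR out of a checkpoint, assuming the checkpoint layout of recent PyTorch Lightning versions; the path is a placeholder:

```python
import torch

# Hypothetical illustration: Lightning checkpoints store optimizer state under
# "optimizer_states"; its param groups carry the LR that was active when the
# checkpoint was written.
ckpt = torch.load("checkpoints/last.ckpt", map_location="cpu")
last_lr = ckpt["optimizer_states"][0]["param_groups"][0]["lr"]
print(f"Starting the new scheduler from lr={last_lr}")
```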

@Constantin343 (Author) commented Jul 5, 2024

> Furthermore, I think we should avoid hard-coding the optimizer and scheduler for this functionality. I suggest you make them optionally user-settable (with reasonable defaults). And if so, then it may make sense to do the same for the regular optimizer and scheduler in the same manner.

Totally agree. I tried again to make it work with the OneCycleLR scheduler, but it just doesn't seem to work for continued training. I will try to make some adaptations over the weekend to accept scheduler_type as a parameter, so the user can choose among different schedulers.
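
A rough sketch of what a user-settable scheduler could look like; the parameter name `scheduler_type`, the option strings, and the default kwargs are illustrative, not the final NeuralProphet API:

```python
import torch

# Hypothetical mapping from a user-facing scheduler_type string to a scheduler
# class plus reasonable default kwargs.
SCHEDULERS = {
    "onecycle": (torch.optim.lr_scheduler.OneCycleLR, {"max_lr": 1e-2, "total_steps": 100}),
    "exponential": (torch.optim.lr_scheduler.ExponentialLR, {"gamma": 0.95}),
    "cosine": (torch.optim.lr_scheduler.CosineAnnealingLR, {"T_max": 100}),
}

def build_scheduler(optimizer, scheduler_type="onecycle", **overrides):
    """Instantiate the requested scheduler, letting user kwargs override defaults."""
    try:
        cls, defaults = SCHEDULERS[scheduler_type]
    except KeyError:
        raise ValueError(f"Unknown scheduler_type {scheduler_type!r}; choose from {sorted(SCHEDULERS)}")
    return cls(optimizer, **{**defaults, **overrides})

# Usage: continued training could default to ExponentialLR instead of OneCycleLR.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = build_scheduler(optimizer, scheduler_type="exponential", gamma=0.9)
```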

@Constantin343 Constantin343 marked this pull request as draft July 9, 2024 08:14
@Constantin343 Constantin343 marked this pull request as ready for review July 9, 2024 12:07
@Constantin343 Constantin343 requested a review from ourownstory July 9, 2024 12:07
codecov bot commented Aug 23, 2024

Codecov Report

Attention: Patch coverage is 81.08108% with 14 lines in your changes missing coverage. Please review.

Project coverage is 88.09%. Comparing base (fa97742) to head (df74dc3).
Report is 25 commits behind head on main.

Files with missing lines      Patch %   Missing lines
neuralprophet/configure.py    68.18%    7 ⚠️
neuralprophet/forecaster.py   89.18%    4 ⚠️
neuralprophet/time_net.py     80.00%    3 ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1605      +/-   ##
==========================================
- Coverage   88.27%   88.09%   -0.18%     
==========================================
  Files          41       41              
  Lines        5364     5428      +64     
==========================================
+ Hits         4735     4782      +47     
- Misses        629      646      +17     

☔ View full report in Codecov by Sentry.
