fix: set max slots and checkpoint gc policy should comply with config policies #10140

Open
wants to merge 3 commits into
base: main

Conversation

amandavialva01
Contributor

@amandavialva01 amandavialva01 commented Oct 25, 2024

Ticket

CM-590

Description

This PR fixes a couple of issues:

  • det e set max-slots now complies with invariant configs and constraints
  • det e set gc-policy now complies with invariant configs
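
One plausible reading of the max-slots check described above, as a minimal Go sketch: a requested value is rejected if it differs from a value pinned by an invariant config, or if it exceeds a constraint. The `policy` type, its fields, and `canSetMaxSlots` are hypothetical names invented for this illustration; they are not the signature or behavior of `configpolicy.CanSetMaxSlots` in this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// policy is a hypothetical, simplified stand-in for a task config policy.
// A nil pointer means the policy does not set that field.
type policy struct {
	InvariantMaxSlots  *int // value pinned by an invariant config
	ConstraintMaxSlots *int // upper bound imposed by a constraint
}

var errPolicyViolation = errors.New("max slots violates config policy")

// canSetMaxSlots reports whether a requested max-slots value is allowed under p.
func canSetMaxSlots(requested int, p policy) error {
	if p.InvariantMaxSlots != nil && requested != *p.InvariantMaxSlots {
		return fmt.Errorf("%w: invariant config pins max_slots to %d",
			errPolicyViolation, *p.InvariantMaxSlots)
	}
	if p.ConstraintMaxSlots != nil && requested > *p.ConstraintMaxSlots {
		return fmt.Errorf("%w: constraint caps max_slots at %d",
			errPolicyViolation, *p.ConstraintMaxSlots)
	}
	return nil
}

func main() {
	limit := 8
	p := policy{ConstraintMaxSlots: &limit}
	fmt.Println(canSetMaxSlots(4, p))  // within the constraint
	fmt.Println(canSetMaxSlots(16, p)) // exceeds the constraint
}
```

Returning only an error, rather than a possibly adjusted slot count, keeps the check from silently altering what the user asked for.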

Test Plan

Checklist

  • Changes have been manually QA'd
  • New features have been approved by the corresponding PM
  • User-facing API changes have the "User-facing API Change" label
  • Release notes have been added as a separate file under docs/release-notes/
    See Release Note for details.
  • Licenses have been included for new code which was copied and/or modified from any external code

@amandavialva01 amandavialva01 requested a review from a team as a code owner October 25, 2024 23:54
@cla-bot cla-bot bot added the cla-signed label Oct 25, 2024

codecov bot commented Oct 25, 2024

Codecov Report

Attention: Patch coverage is 91.66667% with 4 lines in your changes missing coverage. Please review.

Project coverage is 54.77%. Comparing base (782f7a0) to head (caa9179).
Report is 12 commits behind head on main.

Files with missing lines                 Patch %   Lines
master/internal/api_experiment.go        92.59%    2 Missing ⚠️
master/internal/configpolicy/utils.go    80.00%    2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10140      +/-   ##
==========================================
+ Coverage   54.71%   54.77%   +0.06%     
==========================================
  Files        1266     1266              
  Lines      159970   159986      +16     
  Branches     3662     3661       -1     
==========================================
+ Hits        87525    87637     +112     
+ Misses      72312    72216      -96     
  Partials      133      133              
Flag      Coverage Δ
backend   46.20% <91.66%> (+0.18%) ⬆️
harness   72.56% <ø> (ø)
web       54.30% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines                                 Coverage Δ
...ternal/configpolicy/postgres_task_config_policy.go    91.13% <100.00%> (ø)
master/internal/experiment.go                            35.93% <100.00%> (+2.15%) ⬆️
master/internal/api_experiment.go                        61.23% <92.59%> (+4.31%) ⬆️
master/internal/configpolicy/utils.go                    73.20% <80.00%> (-0.85%) ⬇️

... and 3 files with indirect coverage changes


netlify bot commented Oct 25, 2024

Deploy Preview for determined-ui canceled.

Name Link
🔨 Latest commit caa9179
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/6720ec3c4a5c54000827db03

Contributor

@stoksc stoksc left a comment

patch experiment really sketches me out. e.SetGroupWeight(*newResources.Weight) and all call e.db.SaveExperimentConfig(e.ID, e.activeConfig) and so does the endpoint itself. i feel like there are definitely calls where i could make the exp in memory state and db state inconsistent..?

@amandavialva01
Contributor Author

> patch experiment really sketches me out. e.SetGroupWeight(*newResources.Weight) and all call e.db.SaveExperimentConfig(e.ID, e.activeConfig) and so does the endpoint itself. i feel like there are definitely calls where i could make the exp in memory state and db state inconsistent..?

Hmm, yeah, good point.
Is the solution then to just add that call to setMaxSlots instead of moving it in the PATCH handler? It's the only one of the three (max-slots, priority, weight) that doesn't have this call.
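
To make the consistency concern concrete, here is a minimal Go sketch of a setter that persists the new value and updates in-memory state in one locked step, so a failed save cannot leave the two out of sync. Everything here (`configStore`, `memStore`, `experiment`, `setMaxSlots`) is a hypothetical stand-in written for this example, not the `internalExperiment` code or the `e.db.SaveExperimentConfig` call discussed above.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// configStore is a hypothetical stand-in for the experiment's database handle.
type configStore interface {
	SaveMaxSlots(experimentID int, maxSlots int) error
}

// memStore is an in-memory configStore used only for this example.
type memStore struct {
	saved map[int]int
	fail  bool // when true, every save fails
}

func (s *memStore) SaveMaxSlots(id, maxSlots int) error {
	if s.fail {
		return errors.New("simulated database failure")
	}
	s.saved[id] = maxSlots
	return nil
}

// experiment is a hypothetical, pared-down in-memory experiment.
type experiment struct {
	mu       sync.Mutex
	id       int
	maxSlots int
	db       configStore
}

// setMaxSlots saves to the store first and updates in-memory state only if
// the save succeeds, so both views change together or not at all.
func (e *experiment) setMaxSlots(n int) error {
	e.mu.Lock()
	defer e.mu.Unlock()
	if err := e.db.SaveMaxSlots(e.id, n); err != nil {
		return fmt.Errorf("saving max slots: %w", err)
	}
	e.maxSlots = n
	return nil
}

func main() {
	store := &memStore{saved: map[int]int{}}
	e := &experiment{id: 1, db: store}
	if err := e.setMaxSlots(4); err != nil {
		fmt.Println("error:", err)
	}
	fmt.Println(e.maxSlots, store.saved[1])
}
```

With the save living inside the setter, a handler that calls it has no need to save the config a second time, which is the kind of duplicated write the review comment is worried about.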

@@ -418,13 +418,12 @@ func (e *internalExperiment) SetGroupMaxSlots(msg sproto.SetGroupMaxSlots) {
 		return
 	}

-	slots, err := configpolicy.CanSetMaxSlots(msg.MaxSlots, w.ID)
+	err = configpolicy.CanSetMaxSlots(msg.MaxSlots, w.ID)
Contributor

This PR changes/undoes some of the changes you just made in your previous PR ... why is that?

Contributor Author

@amandavialva01 amandavialva01 Oct 28, 2024

This was my bad. When I made the previous PR, I didn't realize that a lot of the code in the PATCH experiment API handler is specifically for det e set CLI commands, and I relied too heavily on integration tests of the functions I implemented for use in the PATCH handler rather than testing the handler itself.
After manually testing the PATCH requests resulting from those commands and adding automated integration tests for such requests, I think these changes should properly address any remaining issues or inconsistencies with the expected config policy-related behavior in this API handler.

Contributor

@stoksc stoksc Oct 29, 2024

i'm still slightly fuzzy here. just to make sure we're all on the same page.. resources.max_slots/slots/slots_per_trial are all different but we're ok with constraints.max_slots controlling all of them? in which case.. maybe if there is a constraint or invariant for max slots it should always get set here, because otherwise an experiment could exceed max slots? once again i confused the check when patched and the check when launched.

Contributor Author

Yea! This is just for patching the resources.max_slots field specifically. Somewhere in the request validation, it could be useful to check resources.slots or resources.slots_per_trial in this case to make sure it doesn't exceed the requested resources.max_slots, but I think this goes back to the conversation of not wanting a constraint to alter configs? I think it could be worth checking in this API handler that resources.max_slots >= resources.slots_per_trial in the experiment config, but that kinda goes beyond the intended fixes for this ticket.
Happy to add it here either way though, since a check like that does make sense!
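
The check suggested above could look roughly like the following Go sketch, which rejects a patch whose max_slots could never fit a single trial. The `resources` struct and `validateMaxSlots` are hypothetical names for this example only; the real experiment config types in the repository may be shaped differently.

```go
package main

import "fmt"

// resources is a hypothetical subset of an experiment's resources config.
type resources struct {
	MaxSlots      *int // nil means no max_slots limit was requested
	SlotsPerTrial int
}

// validateMaxSlots returns an error when the requested max_slots is smaller
// than slots_per_trial, since no trial could then be scheduled.
func validateMaxSlots(r resources) error {
	if r.MaxSlots == nil {
		return nil
	}
	if *r.MaxSlots < r.SlotsPerTrial {
		return fmt.Errorf(
			"resources.max_slots (%d) must be >= resources.slots_per_trial (%d)",
			*r.MaxSlots, r.SlotsPerTrial)
	}
	return nil
}

func main() {
	two := 2
	fmt.Println(validateMaxSlots(resources{MaxSlots: &two, SlotsPerTrial: 4}))
}
```

Because the function only validates and never rewrites the config, it stays in line with the point made above about not wanting a constraint to alter configs.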

@amandavialva01 amandavialva01 changed the title from "chore: save experiment config later" to "fix: setting max slots and checkpoint gc policy should comply with invariant configs" Oct 28, 2024
@amandavialva01 amandavialva01 changed the title from "fix: setting max slots and checkpoint gc policy should comply with invariant configs" to "fix: set max slots and checkpoint gc policy should comply with invariant configs" Oct 28, 2024
@amandavialva01 amandavialva01 changed the title from "fix: set max slots and checkpoint gc policy should comply with invariant configs" to "fix: set max slots and checkpoint gc policy should comply with config policies" Oct 28, 2024
Contributor Author

@amandavialva01 amandavialva01 left a comment

I wanna note that technically, users can still override the name and description of an experiment with det e set name and det e set description, but I think that should be allowed, @kkunapuli wdyt?

@kkunapuli
Contributor

> I wanna note that technically, users can still override the name and description of an experiment with det e set name and det e set description, but I think that should be allowed, @kkunapuli wdyt?

Yeah, I agree.

@@ -1280,6 +1306,11 @@ func (a *apiServer) PatchExperiment(
}
}

// `patch` represents the allowed mutations that can be performed on an experiment, in JSON
Contributor

@stoksc stoksc Oct 29, 2024

this comment seems removable to me? or maybe it should go somewhere else ha.. def is a stray at this point though.

Contributor Author

Removed this!

@amandavialva01 amandavialva01 force-pushed the amanda/TestConfPolicies branch 2 times, most recently from 7693b6d to c287a09 Compare October 29, 2024 13:56