Occasionally Einstein@Home encounters the following situation:
1. The quorum of a workunit is completed, the workunit is validated, a canonical result is found, and the validator sets assimilate_state=1.
2. While the assimilator is working on the workunit (and a chunk of others), another result is reported by the scheduler, need_validate is set, and the validator reads the workunit record.
3. The assimilator finishes assimilation and sets assimilate_state=2. File deletion is triggered.
4. The validator finishes validation and sets assimilate_state=1, overriding the value the assimilator just set.
5. The assimilator tries to assimilate the same workunit again.
Depending on whether the file deleter has already finished, the assimilator either tries to assimilate the same canonical result files a second time or it doesn't find these at all. Both conditions cause our assimilator to halt (for good reason), and require manual intervention to analyze and resolve the situation.
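The interleaving described above is a classic lost update. As a minimal sketch (a toy in-memory model, not BOINC code), the following replays it. `WorkunitRow`, `blind_write`, `mark_ready_if_init` and `replay` are hypothetical names made up for illustration; the state values 1 and 2 are the ones mentioned in this issue, and 0 as the initial state is an assumption.

```cpp
#include <cassert>

// Toy stand-ins for the assimilate_state values discussed in this issue.
// 1 and 2 come from the issue text; 0 as the initial state is an assumption.
enum AssimilateState { ASSIMILATE_INIT = 0, ASSIMILATE_READY = 1, ASSIMILATE_DONE = 2 };

// Hypothetical, heavily simplified workunit record.
struct WorkunitRow {
    int assimilate_state = ASSIMILATE_INIT;
    bool need_validate = false;
};

// Blind write: the daemon writes back its whole private copy of the record.
void blind_write(WorkunitRow& db, const WorkunitRow& local_copy) {
    db = local_copy;
}

// Guarded write: only move INIT -> READY, never touch READY or DONE.
// Returns true if the record was changed.
bool mark_ready_if_init(WorkunitRow& db) {
    if (db.assimilate_state != ASSIMILATE_INIT) return false;
    db.assimilate_state = ASSIMILATE_READY;
    return true;
}

// Replays the interleaving and returns the final assimilate_state.
int replay(bool guarded) {
    WorkunitRow db;
    db.assimilate_state = ASSIMILATE_READY;  // quorum validated, canonical result found
    db.need_validate = true;                 // a late result is reported
    WorkunitRow validator_copy = db;         // validator reads the record (now stale)
    db.assimilate_state = ASSIMILATE_DONE;   // assimilator finishes meanwhile

    if (guarded) {
        // Validator touches only what it owns and guards the state change.
        mark_ready_if_init(db);
        db.need_validate = false;
    } else {
        // Validator writes back its stale copy, clobbering ASSIMILATE_DONE.
        validator_copy.assimilate_state = ASSIMILATE_READY;
        validator_copy.need_validate = false;
        blind_write(db, validator_copy);
    }
    return db.assimilate_state;
}
```

In this model `replay(false)` ends in state 1 (the workunit would be assimilated a second time), while `replay(true)` leaves state 2 intact. With a SQL-backed record, the guard would presumably correspond to an update conditioned on the current value of assimilate_state, but that mapping is an assumption and not something taken from the BOINC sources.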
Also, when setting assimilate_state, the validator doesn't care about results of the workunit that are still "in progress". Triggering assimilation (and subsequently file deletion) of a workunit immediately after finding a canonical result means that these "late" results cannot be (successfully) validated, assimilated and credited, even if they are valid and arrive on time, i.e. before their deadline.
I was about to modify the validator to not set assimilate_state directly, but just let the transitioner know to take a look at that workunit (immediately). The transitioner does handle such late results correctly, and it doesn't have such a delay between reading the workunits and writing modifications as the validator has (at least occasionally).
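A rough sketch of that idea, again as a toy model and not the actual BOINC implementation: the validator only records the canonical result and requests an immediate transition, and a transitioner pass decides when the workunit becomes ready for assimilation. The `Workunit` struct, both function names, the `has_canonical_result` flag and the "no results in progress" policy are assumptions made for illustration; the real transitioner's rules may differ.

```cpp
#include <cassert>
#include <ctime>
#include <vector>

// Toy state values; 0 as the initial state is an assumption.
enum AssimilateState { ASSIMILATE_INIT = 0, ASSIMILATE_READY = 1, ASSIMILATE_DONE = 2 };

// Hypothetical, simplified result states.
enum ResultState { IN_PROGRESS, REPORTED };

// Hypothetical, simplified workunit record.
struct Workunit {
    int assimilate_state = ASSIMILATE_INIT;
    bool has_canonical_result = false;
    std::time_t transition_time = 0;   // when the transitioner should look next
    std::vector<ResultState> results;
};

// Validator side: note the canonical result and request an immediate
// transition. It deliberately does not touch assimilate_state.
void validator_found_canonical(Workunit& wu, std::time_t now) {
    wu.has_canonical_result = true;
    wu.transition_time = now;
}

// Transitioner side: mark the workunit ready for assimilation only once a
// canonical result exists and no result is still in progress.
// Returns true if assimilate_state was changed.
bool transitioner_pass(Workunit& wu, std::time_t now) {
    if (wu.transition_time > now) return false;               // not due yet
    if (!wu.has_canonical_result) return false;
    if (wu.assimilate_state != ASSIMILATE_INIT) return false; // already handled
    for (ResultState r : wu.results) {
        if (r == IN_PROGRESS) return false;                   // wait for late results
    }
    wu.assimilate_state = ASSIMILATE_READY;
    return true;
}
```

With only one daemon writing assimilate_state on the success path, and the write guarded on the current state, a late result can no longer cause the state to be reset after assimilation. Whether this reintroduces the race mentioned in the validator comment is exactly the open question here.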
But then I came across this comment in the validator:
// if we found a canonical result,
// trigger the assimilator, but do NOT trigger
// the transitioner - doing so creates a race condition
What race condition is this comment referring to, and how would we resolve the problems above then?
bema-aei changed the title from "validator - assimilator race condition - should validator set assimilate_state?" to "Server: validator - assimilator race condition - should validator set assimilate_state?" on Jul 24, 2024
Note, you want to use permalinks when referring to code, so the link stays valid over time (i.e. in context with the issue). See the ellipsis in the upper-right corner of the file view.
The interesting thing here is that the commit doesn't actually change the `wu.assimilate_state = ASSIMILATE_READY;` line; it is just left untouched. This means the code before that commit did both: it set assimilate_state to trigger assimilation and it triggered an immediate transition. This will surely create a race condition.
Bruce was correct to avoid this, but I think he fixed it at the wrong end. IMHO immediate transition should be triggered here, and not assimilation directly.
https://github.com/BOINC/boinc/blob/master/sched/validator.cpp#L677-L679