Jira-3785
Context
We occasionally find that the migration run stalls part way through; there are no job failures but also no active/in-progress jobs, which suggests there is a race condition around the logic for queueing the next set of workers. This seems to have become more prominent since we added the run_once_post_migration functionality.
Changes proposed in this pull request
I believe the race condition occurs around how we check for the last worker and then update the completed_at for the current worker; if the last two workers perform the check at the same time, both of them can conclude they are not the last worker and finish without queueing the follow-up migration. Similarly, other timing scenarios could result in the post-migration task being run multiple times.
To fix this we can wrap the check and the completed_at update in a lock. We can also use locking in the coordinator so that we can queue multiple runnable migrators simultaneously and improve throughput (we avoided this previously because it could end up queueing the same migrator multiple times due to a race condition, but the locking here avoids that). A rough sketch of the idea is shown below.
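To illustrate the intent (not the actual implementation), here is a minimal sketch of the "check last worker and set completed_at under a lock" pattern. The table, column, and function names are assumptions for illustration only; the real schema and queueing calls will differ.

```python
import datetime

import psycopg2  # assumption: workers are tracked in a PostgreSQL-backed table


def finish_worker(conn, migration_id, worker_id):
    """Mark this worker complete and report whether it was the last one.

    The row lock (SELECT ... FOR UPDATE) serialises concurrent finishers, so
    two workers finishing at the same time cannot both decide they are "not
    the last worker" and exactly one of them sees itself as the last.
    """
    with conn:  # commit on success, roll back on error
        with conn.cursor() as cur:
            # Lock all worker rows for this migration before checking them.
            cur.execute(
                "SELECT id, completed_at FROM migration_workers"
                " WHERE migration_id = %s FOR UPDATE",
                (migration_id,),
            )
            rows = cur.fetchall()

            # Record completion for the current worker inside the same lock.
            cur.execute(
                "UPDATE migration_workers SET completed_at = %s WHERE id = %s",
                (datetime.datetime.utcnow(), worker_id),
            )

            # We are the last worker iff every other worker is already complete.
            return all(
                completed_at is not None
                for (wid, completed_at) in rows
                if wid != worker_id
            )


# Hypothetical caller, still inside the worker:
# if finish_worker(conn, migration_id, worker_id):
#     enqueue_post_migration(migration_id)  # queue the follow-up exactly once
```

The coordinator change follows the same principle: take a lock before deciding which migrators are runnable, so queueing several of them in parallel cannot enqueue the same migrator twice.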
Guidance for review
I've run this 4 times in the migration environment and it completed each time. There were occasionally a few failures due to the cached plan issue we sometimes see, but this is expected. There is 1 user failure that seems like bad data: