Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix migration race condition #1995

Merged
merged 1 commit into from
Nov 21, 2024
Merged

Conversation

ethax-ross
Copy link
Collaborator

@ethax-ross ethax-ross commented Nov 20, 2024

Jira-3785

Context

We occasionally find that the migration run stalls part way through; there are no job failures but also no active/in-progress jobs, which suggests there is a race condition around the logic for queueing the next set of workers. This seems to have become more prominent since we added the run_once_post_migration functionality.

Changes proposed in this pull request

  • Fix migration race condition

I believe the race condition occurs around how we check for the last worker and then updated the completed_at for the current worker; if the last two workers perform the check at the same time it could result in both of them believing they are not the last worker and they finish without queueing the follow up migration. Similarly, we could end up running the post-migration task multiple times in other timing scenarios.

To fix this we can wrap the check/completed_at update in a lock. We can also use locking in the coordinator so that we can queue multiple runnable migrators simultaneously and improve throughput (we avoided this previously as we found it can end up queueing the same migrator multiple times due to a race condition, but we have avoided that here with locking).

Guidance for review

I've ran this 4 times in the migration environment and it completed each time. There were sometimes a few failures due to the cached plan issue we sometimes see, but this is expected. There is 1 user failure that seems like bad data:

? 'Validation failed: User not found with ecf_id or gai_id, user found with ecf_user_email.
  this user has a linked ecf_user which is not an orphan'
: - 0f1287d9-6911-449c-93e8-8ac717b89e7a
Screenshot 2024-11-20 at 13 16 42

Copy link

We occasionally find that the migration run stalls part way through; there are
no job failures but also no active/in-progress jobs, which suggests there is a
race condition around the logic for queueing the next set of workers. This
seems to have become more prominent since we added the
`run_once_post_migration` functionality.

I believe the race condition occurs around how we check for the last worker and
then updated the `completed_at` for the current worker; if the last two workers
perform the check at the same time it could result in both of them believing
they are not the last worker and they finish without queueing the follow up
migration. Similarly, we could end up running the post-migration task multiple
times in other timing scenarios.

To fix this we can wrap the check/completed_at update in a lock. We can also
use locking in the coordinator so that we can queue multiple runnable migrators
simultaneously (we avoided this previously as we found it can end up queueing
the same migrator multiple times due to a race conditon, but we have avoided
that here with locking).
Copy link

sonarcloud bot commented Nov 20, 2024

@ethax-ross ethax-ross marked this pull request as ready for review November 20, 2024 13:17
@ethax-ross ethax-ross requested a review from a team as a code owner November 20, 2024 13:17
@ethax-ross ethax-ross requested a review from a team November 20, 2024 13:17
@ethax-ross ethax-ross added this pull request to the merge queue Nov 21, 2024
Merged via the queue into main with commit 28cb19e Nov 21, 2024
18 checks passed
@ethax-ross ethax-ross deleted the 3785-fix-migration-race-condition branch November 21, 2024 16:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants