Jira-3785
Context
We occasionally find that the migration run stalls part way through; there are no job failures but also no active/in-progress jobs, which suggests there is a race condition around the logic for queueing the next set of workers. This seems to have become more prominent since we added the run_once_post_migration functionality.
Changes proposed in this pull request
I believe the race condition occurs around how we check for the last worker and then update the completed_at for the current worker; if the last two workers perform the check at the same time, both of them can conclude they are not the last worker and finish without queueing the follow-up migration. Similarly, other timing scenarios could result in the post-migration task being run multiple times.
To fix this we can wrap the check and the completed_at update in a lock. We can also use locking in the coordinator so that we can queue multiple runnable migrators simultaneously and improve throughput (we avoided this previously because it could end up queueing the same migrator multiple times due to a race condition, but the locking here avoids that). A rough sketch of the idea is shown below.
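To illustrate the intent (not the actual implementation), here is a minimal sketch of the "check last worker and set completed_at under a lock" pattern. The table, column, and function names are assumptions for illustration only; the real schema and queueing calls will differ.

```python
import datetime

import psycopg2  # assumption: workers are tracked in a PostgreSQL-backed table


def finish_worker(conn, migration_id, worker_id):
    """Mark this worker complete and report whether it was the last one.

    The row lock (SELECT ... FOR UPDATE) serialises concurrent finishers, so
    two workers finishing at the same time cannot both decide they are "not
    the last worker" and exactly one of them sees itself as the last.
    """
    with conn:  # commit on success, roll back on error
        with conn.cursor() as cur:
            # Lock all worker rows for this migration before checking them.
            cur.execute(
                "SELECT id, completed_at FROM migration_workers"
                " WHERE migration_id = %s FOR UPDATE",
                (migration_id,),
            )
            rows = cur.fetchall()

            # Record completion for the current worker inside the same lock.
            cur.execute(
                "UPDATE migration_workers SET completed_at = %s WHERE id = %s",
                (datetime.datetime.utcnow(), worker_id),
            )

            # We are the last worker iff every other worker is already complete.
            return all(
                completed_at is not None
                for (wid, completed_at) in rows
                if wid != worker_id
            )


# Hypothetical caller, still inside the worker:
# if finish_worker(conn, migration_id, worker_id):
#     enqueue_post_migration(migration_id)  # queue the follow-up exactly once
```

The coordinator change follows the same principle: take a lock before deciding which migrators are runnable, so queueing several of them in parallel cannot enqueue the same migrator twice.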
Guidance for review
I've run this 4 times in the migration environment and it completed each time. There were occasionally a few failures due to the cached plan issue we sometimes see, but this is expected. There is 1 user failure that seems like bad data: