Fix migration race condition #1995

Merged
merged 1 commit into from
Nov 21, 2024

Commits on Nov 20, 2024

  1. Fix migration race condition

    We occasionally find that the migration run stalls part way through; there are
    no job failures but also no active/in-progress jobs, which suggests there is a
    race condition around the logic for queueing the next set of workers. This
    seems to have become more prominent since we added the
    `run_once_post_migration` functionality.
    
    I believe the race condition occurs around how we check for the last worker and
    then update the `completed_at` for the current worker; if the last two workers
    perform the check at the same time, both can conclude that they are not the
    last worker and finish without queueing the follow-up migration. Similarly, we
    could end up running the post-migration task multiple times in other timing
    scenarios.
    
    To fix this we can wrap the check/completed_at update in a lock. We can also
    use locking in the coordinator so that we can queue multiple runnable migrators
    simultaneously (we avoided this previously because we found it could queue the
    same migrator multiple times due to a race condition, but locking prevents that
    here).
    ethax-ross committed Nov 20, 2024
    bc038d7
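
The locking idea in the commit message can be illustrated with a minimal sketch. This is not the project's code: `MigrationRun`, `Worker`, `finish_worker`, and `queue_follow_up` are hypothetical names, and an in-process `threading.Lock` stands in for whatever lock the real workers share. The point is only that the "am I the last worker?" check and the `completed_at` update happen as one atomic step, so exactly one worker sees itself as last.

```python
import threading
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Worker:
    # None means the worker has not finished yet.
    completed_at: Optional[float] = None


class MigrationRun:
    """Hypothetical stand-in for one migration step with several workers."""

    def __init__(self, worker_count: int) -> None:
        self.workers = [Worker() for _ in range(worker_count)]
        self.follow_up_queued = 0
        self._lock = threading.Lock()

    def finish_worker(self, index: int) -> bool:
        # Check and update under one lock. Without the lock, two workers
        # could both run the check before either sets completed_at, and
        # both would conclude that another worker is still running.
        with self._lock:
            others_done = all(
                w.completed_at is not None
                for i, w in enumerate(self.workers)
                if i != index
            )
            self.workers[index].completed_at = time.time()
        if others_done:
            self.queue_follow_up()
        return others_done

    def queue_follow_up(self) -> None:
        # Placeholder for queueing the next migration or post-migration task.
        # Only the single worker that observed others_done=True reaches this.
        self.follow_up_queued += 1


if __name__ == "__main__":
    run = MigrationRun(worker_count=8)
    threads = [
        threading.Thread(target=run.finish_worker, args=(i,)) for i in range(8)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("follow-up queued", run.follow_up_queued, "time(s)")
```

Because each worker's check and update are serialized, only the worker that takes the lock last can find every other worker complete, which rules out both the "nobody queues the follow-up" stall and the duplicate post-migration run. Workers running in separate processes or on separate hosts would need a shared lock (for example one held in the database) rather than a thread lock.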