-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: separate nrf due to nucleus exit #1658
Conversation
8ccc4ce
to
c540795
Compare
Unit Tests Coverage Report
Minimum allowed coverage is Generated by 🐒 cobertura-action against 86e78c4 |
Integration Tests Coverage Report
Minimum allowed coverage is Generated by 🐒 cobertura-action against 86e78c4 |
@@ -100,4 +102,7 @@ if [ $sigterm_received -eq 0 ] && is_directory_link "${GG_ROOT}/alts/old" && is_ | |||
flip_link "${GG_ROOT}/alts/old" "${LAUNCH_DIR}" | |||
fi | |||
|
|||
## Touch an empty file to indicate rollback due to unexpected Nucleus exit | |||
touch "${GG_ROOT}/work/aws.greengrass.Nucleus/restart_panic" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible to just rely on the alts/broken dir link
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
relying on the alts/broken dir links is not sufficient as any form of rollback will create these symlinks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
however, i could create a new symlink pointing to the absolute broken dir if letting loader touch a file is an issue
c540795
to
86e78c4
Compare
Issue #, if available:
Description of changes:
Nucleus returns 1 when it exits unexpectedly. 3 consecutive such exits during a deployment causes the loader to orchestrate a forced rollback. Nucleus, after rolling back, fails the ongoing deployment with a NUCLEUS_RESTART_FAILURE. A caveat here is if Nucleus is unable to persist stage details in deployment metadata during a normal rollback, Nucleus still fails a deployment with NUCLEUS_RESTART_FAILURE after rolling back, which is inaccurate.
In this change, the loader script touches a panic file before orchestrating a forced rollback. Nucleus looks for this panic file after starting up and fails a deployment with NUCLEUS_RESTART_FAILURE only if it finds this file. Any other reason for restart failure should come under device IO error.
A possible risk with this change is Nucleus unable to delete the panic file across deployments and a future deployment being classified as NRF incorrectly. However, this can only happen if the stage details for this future deployment were not set as expected due to a correct NRF issue or device IO issue. In either case, this is a low risk.
Why is this change necessary:
It has become necessary to separate out nucleus update workflow failures due to unexpected exits and device IO issues. Device IO issues can happen anytime and are independent of the Nucleus process and should not be considered as system errors.
How was this change tested:
Any additional information or context required to review the change:
Documentation Checklist:
Compatibility Checklist:
any deprecated method or type.
Refer to Compatibility Guidelines for more information.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.