unhandled error: tokio-runtime-worker panicked - Err value: SSHActivateExit(Some(255)) #287

Open
jficz opened this issue Aug 13, 2024 · 1 comment

@jficz

jficz commented Aug 13, 2024

Following a rather large update I got this unhandled error. Unfortunately I was unable to replicate the problem (so far).

Note that this is a local deploy (i.e. the laptop is both the machine I'm deploying from and the machine I'm deploying to). I don't know whether (or how) a local deploy can be done directly rather than over SSH.

user@laptop% deploy -sk --confirm-timeout 1200  .#laptop.system
🚀 ℹ️ [deploy] [INFO] Evaluating flake in .
warning: Git tree '/home/user/git/my-nixos-deployment' is dirty
trace: evaluation warning: The ‘gnome.gnome-keyring’ was moved to top-level. Please use ‘pkgs.gnome-keyring’ directly.
trace: evaluation warning: nixfmt was renamed to nixfmt-classic. The nixfmt attribute may be used for the new RFC 166-style formatter in the future, which is currently available as nixfmt-rfc-style
🚀 ⚠️ [deploy] [WARN] Interactive sudo is enabled! Using a sudo password is less secure than correctly configured SSH keys.
Please use keys in production environments.
🚀 ℹ️ [deploy] [INFO] You will now be prompted for the sudo password for laptop.
(sudo for laptop) Password: 
🚀 ℹ️ [deploy] [INFO] The following profiles are going to be deployed:
[laptop.system]
user = "root"
ssh_user = "user"
path = "/nix/store/...-activatable-nixos-system-laptop-24.11.20240809.5e0ca22"
hostname = "laptop"
ssh_opts = []

🚀 ℹ️ [deploy] [INFO] Building profile `system` for node `laptop`
🚀 ℹ️ [deploy] [INFO] Copying profile `system` to node `laptop`
🚀 ℹ️ [deploy] [INFO] Activating profile `system` for node `laptop`
🚀 ℹ️ [deploy] [INFO] Creating activation waiter
⭐ ℹ️ [activate] [INFO] Activating profile
👀 ℹ️ [wait] [INFO] Waiting for confirmation event...
Copied "/nix/store/...-systemd-256.2/lib/systemd/boot/efi/systemd-bootx64.efi" to "/boot/EFI/systemd/systemd-bootx64.efi".
Copied "/nix/store/...-systemd-256.2/lib/systemd/boot/efi/systemd-bootx64.efi" to "/boot/EFI/BOOT/BOOTX64.EFI".
updating systemd-boot from 255.6 to 256.2
stopping the following units: NetworkManager.service, audit.service, avahi-daemon.service, avahi-daemon.socket, bluetooth.service, cups-browsed.service, cups.service, cups.socket, ensure-printers.service, fwupd.service, kmod-static-nodes.service, logrotate-checkconf.service, mount-pstore.service, network-local-commands.service, network-setup.service, node-red.service, nscd.service, opensnitchd.service, prometheus-node-exporter.service, resolvconf.service, rtkit-daemon.service, systemd-modules-load.service, systemd-oomd.service, systemd-oomd.socket, systemd-sysctl.service, systemd-timesyncd.service, systemd-udevd-control.socket, systemd-udevd-kernel.socket, systemd-udevd.service, systemd-vconsole-setup.service, systemd-zram-setup@zram0.service, tlp.service, trackpoint.service, udisks2.service, upower.service, wireguard-wg0-peer-server1.service, wireguard-wg0-peer-server2.service, wireguard-wg0.service, wpa_supplicant.service, zfs-mount.service, zfs-share.service, zfs-zed.service
Job for systemd-zram-setup@zram0.service canceled.
NOT restarting the following changed units: greetd.service, systemd-backlight@backlight:amdgpu_bl1.service, systemd-backlight@leds:tpacpi::kbd_backlight.service, systemd-fsck@dev-disk-by\x2duuid-5C85\x2d53D4.service, systemd-journal-flush.service, systemd-logind.service, systemd-random-seed.service, systemd-remount-fs.service, systemd-update-utmp.service, systemd-user-sessions.service, user-runtime-dir@1000.service, user@1000.service
activating the configuration...
[agenix] creating new generation in /run/agenix.d/2
[agenix] decrypting secrets...
decrypting '/nix/store/...-wg-privkey.age' to '/run/agenix.d/2/wg-privkey-laptop'...
[agenix] symlinking new secrets to /run/agenix (generation 2)...
[agenix] removing old secrets (generation 1)...
[agenix] chowning...
setting up /etc...
restarting systemd...
reloading user units for user...
restarting sysinit-reactivation.target
reloading the following units: dbus.service, firewall.service, reload-systemd-vconsole-setup.service
restarting the following units: nix-daemon.service, polkit.service, sshd.service, systemd-journald.service
starting the following units: NetworkManager.service, audit.service, avahi-daemon.socket, bluetooth.service, cups-browsed.service, cups.socket, ensure-printers.service, fwupd.service, kmod-static-nodes.service, logrotate-checkconf.service, mount-pstore.service, network-local-commands.service, network-setup.service, node-red.service, nscd.service, opensnitchd.service, prometheus-node-exporter.service, resolvconf.service, rtkit-daemon.service, systemd-modules-load.service, systemd-oomd.socket, systemd-sysctl.service, systemd-timesyncd.service, systemd-udevd-control.socket, systemd-udevd-kernel.socket, systemd-vconsole-setup.service, systemd-zram-setup@zram0.service, tlp.service, trackpoint.service, udisks2.service, upower.service, wireguard-wg0-peer-server1.service, wireguard-wg0-peer-server2.service, wireguard-wg0.service, wpa_supplicant.service, zfs-mount.service, zfs-share.service, zfs-zed.service
🚀 ❌ [deploy] [ERROR] Waiting over SSH resulted in a bad exit code: Some(255)
🚀 ℹ️ [deploy] [INFO] Revoking previous deploys
thread '🚀 ❌ [deploy] [ERROR] Deployment failed, rolled back to previous generation
tokio-runtime-worker' panicked at /build/source/src/deploy.rs:488:41:
called `Result::unwrap()` on an `Err` value: SSHActivateExit(Some(255))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I see similar timeout-related issues from time to time when updating after a long gap, but I have never gotten an unhandled error directly from Rust before.

I would have assumed that a 20-minute timeout is enough, but it looks as if the deploy always fails when certain things are updated (probably something network- or SSH-related), no matter the timeout. Note that in this case, as far as I can tell, SSH was connecting through [::1]:22.
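
For context on the `Some(255)` value: OpenSSH's `ssh` exits with the status of the remote command, or with 255 if an error occurred in ssh itself. An exit code of 255 would therefore be consistent with the connection dropping during activation (the log above lists sshd.service among the restarted units and NetworkManager.service among the stopped ones), though that is only a guess. A minimal sketch of telling the two cases apart follows; `SshOutcome` and `classify_ssh_exit` are hypothetical names used for illustration and are not taken from the tool's source.

```rust
/// Hypothetical classification of an `ssh` child's exit status.
/// The `Option<i32>` mirrors `std::process::ExitStatus::code()`.
#[derive(Debug, PartialEq, Eq)]
enum SshOutcome {
    /// The remote command ran and returned 0.
    Success,
    /// ssh reported 255: usually a transport problem (connection refused
    /// or dropped, authentication failure, ...).
    TransportError,
    /// The remote command ran and returned this non-zero status.
    RemoteFailure(i32),
    /// No exit code at all: the ssh process was killed by a signal.
    Signalled,
}

fn classify_ssh_exit(code: Option<i32>) -> SshOutcome {
    match code {
        Some(0) => SshOutcome::Success,
        // Caveat: a remote command that itself exits with 255 is
        // indistinguishable from an ssh-level error here.
        Some(255) => SshOutcome::TransportError,
        Some(n) => SshOutcome::RemoteFailure(n),
        None => SshOutcome::Signalled,
    }
}
```

Under this reading, `classify_ssh_exit(Some(255))` yields `TransportError`, while a code such as `Some(1)` would point at the remote command failing.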

@freelock

Hi,

I'm getting something similar trying to deploy to an AWS host.

I do see "Activation succeeded!", but then it times out at "Waiting for confirmation event..." with a 90s confirm-delay.

Running with RUST_BACKTRACE=1, I'm getting this backtrace:

⭐ ❌ [activate] [ERROR] Failed to get activation confirmation: Error waiting for confirmation event: Timeout elapsed for confirmation
thread 'tokio-runtime-worker' panicked at /build/source/src/deploy.rs:488:41:
called `Result::unwrap()` on an `Err` value: SSHActivateExit(Some(1))
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: tokio::runtime::task::harness::poll_future
   4: tokio::runtime::task::raw::poll
   5: tokio::runtime::task::Notified<S>::run
   6: tokio::runtime::thread_pool::worker::Context::run_task
   7: tokio::runtime::task::raw::poll
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
thread 'main' panicked at /build/source/src/deploy.rs:523:30:
called `Result::unwrap()` on an `Err` value: RecvError(())
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: deploy::cli::run_deploy::{{closure}}
   4: deploy::cli::run::{{closure}}
   5: deploy::main::{{closure}}
   6: deploy::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
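
The two panics in this backtrace look like one failure seen twice: a worker task calls `unwrap()` on an `Err`, panics, and drops its end of a channel, and the waiting side then gets a `RecvError` and unwraps that too. The code at deploy.rs:488 and deploy.rs:523 is not shown in this thread, so the following is only a sketch of that general pattern. It uses `std::thread` and `std::sync::mpsc` in place of tokio so that it is self-contained, and every name in it (`DeployError`, `wait_for_ssh`, `run_unwrapping`, `run_forwarding`) is hypothetical.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for the error shown in the panic message.
#[derive(Debug, PartialEq, Eq)]
enum DeployError {
    SshActivateExit(Option<i32>),
}

/// Hypothetical stand-in for waiting on the ssh child process.
fn wait_for_ssh(code: Option<i32>) -> Result<(), DeployError> {
    match code {
        Some(0) => Ok(()),
        other => Err(DeployError::SshActivateExit(other)),
    }
}

/// Fragile variant: the worker unwraps, so an `Err` panics the worker,
/// the sender is dropped during unwinding, and the receiver only sees
/// an uninformative `RecvError`.
fn run_unwrapping(code: Option<i32>) -> Result<(), mpsc::RecvError> {
    let (tx, rx) = mpsc::channel::<()>();
    let worker = thread::spawn(move || {
        wait_for_ssh(code).unwrap();
        let _ = tx.send(());
    });
    let received = rx.recv();
    let _ = worker.join(); // ignore the worker's panic payload
    received
}

/// Sturdier variant: the worker forwards its `Result` through the
/// channel, so the caller receives the real error and can report it.
fn run_forwarding(code: Option<i32>) -> Result<(), DeployError> {
    let (tx, rx) = mpsc::channel::<Result<(), DeployError>>();
    let worker = thread::spawn(move || {
        let _ = tx.send(wait_for_ssh(code));
    });
    let outcome = rx.recv().expect("worker exited without sending a result");
    worker.join().expect("worker panicked");
    outcome
}
```

With the forwarding variant, a failed wait surfaces to the caller as `Err(DeployError::SshActivateExit(..))` instead of a panic, which leaves room for an ordinary error message and exit code.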
