Can't replace or remove a node #2133
Please note that Scylla Operator 1.14.0, ScyllaDB 6.0.4, and ScyllaDB 6.1.2 were all released after we started working on the original task here (upgrading to 6.0.3 and then 6.1.1).
The main problem we have with this cluster state is that the backups are failing:
We would really appreciate any hints on how to move forward!
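For completeness, this is roughly how we check the backup state from Scylla Manager; a hedged sketch, assuming sctool is reachable inside the manager deployment (namespace, deployment, and cluster names are placeholders):

```bash
# Placeholders: adjust the namespace, deployment name and --cluster value to your setup.
# Overall cluster health as seen by Scylla Manager.
kubectl -n scylla-manager exec deploy/scylla-manager -- sctool status

# List the manager tasks (backups included) together with their last run status.
kubectl -n scylla-manager exec deploy/scylla-manager -- sctool tasks --cluster <cluster-name>
```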
What happens now if you add a Kubernetes Node able to host the Pod?
Just did that. The logs from the first ~10 minutes after starting the new node: extract-2024-10-10T10_43_13.937Z.csv.zip. The nodetool status output now shows:
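(For reference, the nodetool status output here and elsewhere in this thread is collected by exec'ing into one of the member pods; a hedged example with placeholder names, assuming the operator's usual container naming:)

```bash
# Placeholders: set the namespace and pod name; "scylla" is the usual container name
# in Scylla Operator-managed pods, but verify it in your ScyllaCluster.
kubectl -n <namespace> exec <scylla-pod> -c scylla -- nodetool status
```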
Isn't
Generating it now...
From the logs it seems to be joining just fine.
True! I hope it will complete correctly. 🤞 I will report the results here; it will probably take many hours or a day or two. Thanks @zimnx!
Unfortunately, something went wrong again, @zimnx. :( One of the errors I see is:
The nodetool status in fact doesn't show this id anymore:
Here's the must-gather generated a few minutes ago: scylla-operator-must-gather-q9lcjrmkkgrh.zip. I will also attach some logs in a few minutes; they are being exported now. Please let me know if you need anything else!
Logs from the first ~6h of the bootstrap (Oct 10th, 10:30-16:15 UTC): extract-2024-10-11T08_49_50.794Z.csv.zip
Looks like the node that was doing the replacement crashed with a core dump:
Could you check if the coredump was saved? At this point the node will crash in a loop. I would suggest retrying the replacement; to do so, remove
Alternatively, you can try removing both
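For anyone following along, a minimal sketch of the retry-the-replacement suggestion above, assuming the operator's data-&lt;pod&gt; PVC naming convention and that deleting the member's storage plus the Pod restarts the replace; all names are placeholders, so double-check them before running anything:

```bash
# Placeholders: the namespace and the failing member's pod name.
NAMESPACE=<namespace>
POD=<failing-scylla-pod>

# Verify the PVC name first; in our clusters it follows the data-<pod-name> convention.
kubectl -n "$NAMESPACE" get pvc

# Remove the member's data PVC and then the Pod, so it is recreated on fresh storage
# and the replacement is retried from scratch.
kubectl -n "$NAMESPACE" delete pvc "data-$POD"
kubectl -n "$NAMESPACE" delete pod "$POD"
```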
Decoded backtrace:
The core dump apparently was not saved; I couldn't find it at that location.
I think I already tried both of these approaches (see the issue description), but I will try again. I will probably start on Monday morning though.
In the meantime, please report an issue in the Scylla repo; it shouldn't crash during replacement. Attach ~3h of logs from before the crash (2024-10-11T07:39:04) and the backtrace.
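To collect those logs, something like the following should work; a hedged sketch with placeholder names (the timestamp is ~3h before the crash reported at 2024-10-11T07:39:04):

```bash
# Placeholders: adjust the namespace and pod name.
# Everything from ~3h before the crash until now.
kubectl -n <namespace> logs <scylla-pod> -c scylla \
  --since-time=2024-10-11T04:39:00Z > scylla-pre-crash.log

# If the container has already restarted, the previous instance's logs are separate.
kubectl -n <namespace> logs <scylla-pod> -c scylla --previous > scylla-previous.log
```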
Trying that now...
It didn't continue to stream the data, @zimnx. :( The disk usage stats suggest that the disk has been cleaned and it's bootstrapping the data from scratch. The nodetool status output:
Looking at the logs from the pod, I can see quite a lot of exceptions there:
The Scylla Operator project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
/lifecycle stale
Context
We had a 7-node Scylla cluster in GCP, n2d-highmem-32 with 6TB local SSDs, running Scylla 6.0.3, Scylla Operator 1.13.0, Scylla Manager 3.3.3.
(It is the same cluster that was the protagonist of #2068)
Before the current issue started, we did:
...and wanted to upgrade to the latest 6.1.1 (as we were trying to fix scylladb/scylladb#19793).
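(The version bumps themselves go through the ScyllaCluster object; a minimal sketch of that step, assuming the standard spec.version field and placeholder names:)

```bash
# Placeholders: namespace and ScyllaCluster name. Bumping spec.version triggers the
# operator's rolling upgrade of the cluster.
kubectl -n <namespace> patch scyllacluster <cluster-name> --type merge \
  -p '{"spec": {"version": "6.1.1"}}'
```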
What happened
Pod 3 loses its data on 16th of Sep
On Sep 16th, 19:14 UTC, pod 3 (id dea17e3f-198a-4ab8-b246-ff29e103941a) lost its data. The local provisioner on that node logged:
The nodetool status output at that time looked like nothing was wrong:
The logs from 2024-09-16 19:10-19:34 UTC: extract-2024-09-30T14_19_57.781Z.csv.zip
But as the disk usage on the nodes was constantly growing, we assumed that the node would automatically get recreated, so we left it like that for ~2 days. Then we noticed that it was failing to start with:
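(For context, the usual quick checks for why a member Pod fails to start look roughly like this; a hedged example with placeholder names:)

```bash
# Placeholders: namespace and the failing member's pod name.
kubectl -n <namespace> get pod <scylla-pod> -o wide
kubectl -n <namespace> describe pod <scylla-pod>

# Logs of the previous (crashed) container instance, if it has already restarted.
kubectl -n <namespace> logs <scylla-pod> -c scylla --previous
```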
Pod 3 replacement fails on 18th of Sep
...and that nodetool status was showing this:
The old node id from the error message was nowhere to be found:
Pod 3 another replacement try fails on 19th of Sep
We tried deleting the old node that had the local SSD issue, creating a new one in its place, and letting the cluster do the node replacement again, but it failed with an error similar to the one above:
Our cluster looked like this then:
Node removal fails on 24th of Sep
At this point we decided to try to remove the down node, with id 3ec289d5-5910-4759-93bc-6e26ab5cda9f, from the cluster to continue our original task of upgrading Scylla to 6.1.1, planning to go back to replacing the missing node after that. However, the node removal operation also failed.
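(The removal boils down to the standard nodetool removenode call against the down host id; roughly the following, with placeholder names, run from any live member:)

```bash
# Placeholders: namespace and any live member pod. The host id is the down node
# reported above (3ec289d5-5910-4759-93bc-6e26ab5cda9f).
kubectl -n <namespace> exec <live-scylla-pod> -c scylla -- \
  nodetool removenode 3ec289d5-5910-4759-93bc-6e26ab5cda9f
```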
We couldn't find meaningful errors from before this message, so I'm attaching ~100k lines of logs from 2024-09-24 14:15-15:22 UTC here: extract-2024-09-30T14_10_55.525Z.csv.zip
Node removal fails after a retry on 27th of Sep
Retrying the node removal didn't work:
We tried doing a rolling restart of the cluster and retrying, similar to what we did in #2068, but that did not help this time. The error message was the same as before, just with a different timestamp:
Additional info
During this time we had surprising moments when our Scylla disks were filling up with snapshots, getting dangerously close to 80% disk usage. For example:
We cleared the snapshots when that happened using the `nodetool clearsnapshot` command.
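For reference, roughly how the snapshots can be inspected and cleared; a hedged example with placeholder names (both subcommands are standard nodetool):

```bash
# Placeholders: namespace and any member pod.
# See which snapshots exist and how much space they hold.
kubectl -n <namespace> exec <scylla-pod> -c scylla -- nodetool listsnapshots

# Clear them; with no arguments this removes all snapshots on that node.
kubectl -n <namespace> exec <scylla-pod> -c scylla -- nodetool clearsnapshot
```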
must-gather output
scylla-operator-must-gather-w7rn9tspr85z.zip