What to do with snapshot failures? #25676
Comments
It might be desirable to add an extra assertion when accepting writes, so that we don't run into issues in snapshotting that could invalidate the writes. We could still hit errors when snapshotting, but they wouldn't invalidate writes that have already been accepted. By assertion, I'm just thinking that after we've validated the line and just before we write to the buffer, we do an extra check (e.g. whether nulls were appended for missing columns, going by the PR for the fix). If this assertion fails, we error out before writing to the WAL file.
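A minimal sketch of what that pre-WAL assertion could look like, assuming hypothetical `TableSchema`/`Row` types and a `WriteAssertionError` (the real influxdb3 types differ); the point is just that every schema column must have a value or an explicit null before anything is appended to the WAL:

```rust
use std::collections::HashMap;

#[derive(Debug)]
enum WriteAssertionError {
    MissingColumn { table: String, column: String },
}

struct TableSchema {
    name: String,
    columns: Vec<String>,
}

struct Row {
    // column name -> value; None represents an explicit null fill
    values: HashMap<String, Option<String>>,
}

fn assert_row_matches_schema(schema: &TableSchema, row: &Row) -> Result<(), WriteAssertionError> {
    for column in &schema.columns {
        // Every schema column must be present, even if only as a null fill,
        // so the arrow builders stay aligned when the buffer is snapshotted.
        if !row.values.contains_key(column) {
            return Err(WriteAssertionError::MissingColumn {
                table: schema.name.clone(),
                column: column.clone(),
            });
        }
    }
    Ok(())
}
```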
This sounds good, but I'm not sure about "we wouldn't be able to clean the log files". Is that the WAL files you're referring to? If so, wouldn't we have the same problem if we globally stop accepting writes (there will always be WAL files that aren't snapshotted yet)? Perhaps as a first step we can stop writes altogether, and later introduce the ability to stop them at the table level? At this stage it's probably better for things to fail sooner in the cycle.
Added a task list and opened #25677
We already do this. The data in the buffer is always valid. The problem is that there are errors that might happen in DataFusion that have nothing to do with the structure of the data specifically. We could hit resource exhaustion, object store errors (we should handle those with infinite retries), and who knows what else. It's more of these unforeseen or very rare errors that I'm thinking of.
Yes. Imagine that we accept a bunch of data into the WAL. Then when we go to snapshot it, we hit some problem in DataFusion where it blows up the memory or for some reason it can't actually sort, dedupe and persist the data. We can stop writes to everything, but then we still have to deal with the WAL files that are now sitting around holding data that is going to cause snapshotting to blow up. Maybe the answer is that we simply have snapshotting do infinite retries and then on the ingest path once we get past X number of WAL files backlogged, we stop accepting writes. Then we have some CLI tooling that will be able to move those WAL files off to a backup location and then inspect and dump them later. That way the operator can get back up quickly if it just happened to be some intermediate bad thing.
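Roughly, that could take a shape like the sketch below. The counter, the threshold constant, and the function names are made up for illustration (the real code would hook into the WAL/snapshot machinery and be async), but it shows the two halves: a snapshot loop that retries forever, and an ingest-side check that rejects writes once the backlog is too large.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::time::Duration;

// Arbitrary threshold for illustration; the real value would be configurable.
const MAX_BACKLOGGED_WAL_FILES: usize = 100;

// Count of WAL files written but not yet snapshotted and persisted.
static BACKLOGGED_WAL_FILES: AtomicUsize = AtomicUsize::new(0);

#[derive(Debug)]
enum IngestError {
    SnapshotBacklog { backlog: usize },
}

// Ingest path: refuse new writes once the snapshot backlog passes the limit,
// so the buffer can't grow without bound while snapshotting is failing.
fn check_backlog_before_write() -> Result<(), IngestError> {
    let backlog = BACKLOGGED_WAL_FILES.load(Ordering::Relaxed);
    if backlog >= MAX_BACKLOGGED_WAL_FILES {
        return Err(IngestError::SnapshotBacklog { backlog });
    }
    Ok(())
}

// Snapshot loop: retry forever, logging each failure, and only shrink the
// backlog counter once a snapshot actually succeeds.
fn run_snapshot_with_retries(mut try_snapshot: impl FnMut() -> Result<(), String>) {
    loop {
        match try_snapshot() {
            Ok(()) => {
                BACKLOGGED_WAL_FILES.fetch_sub(1, Ordering::Relaxed);
                return;
            }
            Err(e) => {
                eprintln!("snapshot failed: {e}; retrying");
                std::thread::sleep(Duration::from_secs(1));
            }
        }
    }
}
```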
Ah ok - skimming through your fix I thought it was missing the nulls for columns (the arrow builder didn't have a null). If the data in the buffer/WAL files is good, could we not just replay it once the underlying issue is fixed? Or is it because the sequence numbers can get messed up and we need a separate tool to reingest the data to backfill it? I agree with your idea though: retry the snapshot process continuously in the background, and if there are more than X WAL files, stop the writes.
I've also logged #25686 to stop writes when we hit a limit on WAL files backed up. |
If you follow the path of the snapshot code in `QueryableBuffer`, specifically `buffer_contents_and_persist_snapshotted_data`, you can see there are a few places where it might panic. The database and table missing scenarios should never happen, so I'm not worried about those. But there are some potential errors that could be encountered specifically with the persist jobs. We should verify that we continually retry writing to object storage (we should never fail out, just keep retrying forever and logging the errors). But there could be problems with creating the plan or executing the DataFusion code. #25673 just fixed one of these errors.
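As a point of reference, the "retry forever and log" behaviour for an object store upload could look roughly like this, assuming a tokio runtime; the `upload` function, `PersistJob` type, and backoff numbers are placeholders, not the actual persister code:

```rust
use std::time::Duration;

struct PersistJob {
    path: String,
    bytes: Vec<u8>,
}

// Placeholder for the real object store put; in practice this would go
// through whatever object store client the persister already uses.
async fn upload(job: &PersistJob) -> Result<(), String> {
    let _ = (&job.path, &job.bytes);
    Ok(())
}

// Never fail the persist job outright: log each error and retry with a
// capped exponential backoff until the upload succeeds.
async fn persist_with_infinite_retry(job: PersistJob) {
    let mut delay = Duration::from_millis(100);
    loop {
        match upload(&job).await {
            Ok(()) => return,
            Err(e) => {
                eprintln!("persist of {} failed: {e}; retrying in {delay:?}", job.path);
                tokio::time::sleep(delay).await;
                delay = (delay * 2).min(Duration::from_secs(30));
            }
        }
    }
}
```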
The question is what we should do when snapshotting fails. If we're unable to snapshot and persist, the WAL won't be able to get cleaned up, which will result in many thousands of WAL files (which could be very tricky to recover from). Also, the buffer will continuously fill with data, eventually eating all the RAM on the machine and getting OOM killed.
So what should we do in these failure scenarios? Start failing writes until the operator intervenes? Fail writes just for the given table and let the other tables continue to advance (in which case we wouldn't be able to clean up the log files for already-accepted writes that include data for that table)?
Either way, we should update those code paths to at least catch the errors and log them.
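For example, in the spots where plan creation or DataFusion execution can currently panic, the pattern would be to log the error and return it so the snapshot loop can decide whether to retry; the names below are illustrative, not the actual `QueryableBuffer` internals:

```rust
#[derive(Debug)]
enum SnapshotError {
    PlanCreation(String),
    Execution(String),
}

// Placeholder for DataFusion plan creation, which can fail for reasons
// unrelated to the shape of the buffered data (resource exhaustion, etc.).
fn build_persist_plan(table: &str) -> Result<String, SnapshotError> {
    Ok(format!("plan for {table}"))
}

fn snapshot_table(table: &str) -> Result<(), SnapshotError> {
    // Previously a failure here could panic; instead, log it and bubble the
    // error up so the caller can retry later or halt writes.
    let plan = build_persist_plan(table).map_err(|e| {
        eprintln!("failed to build persist plan for {table}: {e:?}");
        e
    })?;
    // ...execute the plan and hand the results to the persister...
    let _ = plan;
    Ok(())
}
```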
I think there is some number of backlogged WAL files at which point we'd want to start rejecting writes altogether.
This was observed in this conversation: https://influxdata.slack.com/archives/C084G9LR2HL/p1734445412251109
Tasks