What to do with snapshot failures? #25676
Comments
It might be desirable to add an extra assertion when accepting writes, so that we don't run into issues in snapshotting that could invalidate the writes. We could still hit errors when snapshotting, but they wouldn't invalidate writes that have already been accepted. By assertion, I'm just thinking that after we've validated the line and just before we write to the buffer, we do an extra check (e.g. whether nulls were appended for missing columns, going by the PR for the fix). If this assertion fails, we error out before writing to the WAL file.
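A minimal sketch of what that pre-WAL assertion could look like, assuming hypothetical `TableSchema`/`Row` types and a `WriteAssertionError` (the real influxdb3 types differ); the point is just that every schema column must have a value or an explicit null before anything is appended to the WAL:

```rust
use std::collections::HashMap;

#[derive(Debug)]
enum WriteAssertionError {
    MissingColumn { table: String, column: String },
}

struct TableSchema {
    name: String,
    columns: Vec<String>,
}

struct Row {
    // column name -> value; None represents an explicit null fill
    values: HashMap<String, Option<String>>,
}

fn assert_row_matches_schema(schema: &TableSchema, row: &Row) -> Result<(), WriteAssertionError> {
    for column in &schema.columns {
        // Every schema column must be present, even if only as a null fill,
        // so the arrow builders stay aligned when the buffer is snapshotted.
        if !row.values.contains_key(column) {
            return Err(WriteAssertionError::MissingColumn {
                table: schema.name.clone(),
                column: column.clone(),
            });
        }
    }
    Ok(())
}
```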
This sounds good, but I'm not sure about "we wouldn't be able to clean the log files". Is that the WAL files you're referring to? If so, wouldn't we have the same problem if we globally stop accepting writes (there will always be WAL files that aren't snapshotted yet)? Perhaps as a first step we can stop writes altogether, and later introduce the ability to stop them at the table level? At this stage it's probably better for things to fail sooner in the cycle.
Added a task list and opened #25677
We already do this. The data in the buffer is always valid. The problem is that there are errors that might happen in DataFusion that have nothing to do with the structure of the data specifically. We could hit resource exhaustion, object store errors (we should handle those with infinite retries), and who knows what else. It's more of these unforeseen or very rare errors that I'm thinking of.
Yes. Imagine that we accept a bunch of data into the WAL. Then when we go to snapshot it, we hit some problem in DataFusion where it blows up the memory or for some reason it can't actually sort, dedupe and persist the data. We can stop writes to everything, but then we still have to deal with the WAL files that are now sitting around holding data that is going to cause snapshotting to blow up. Maybe the answer is that we simply have snapshotting do infinite retries and then on the ingest path once we get past X number of WAL files backlogged, we stop accepting writes. Then we have some CLI tooling that will be able to move those WAL files off to a backup location and then inspect and dump them later. That way the operator can get back up quickly if it just happened to be some intermediate bad thing.
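Roughly, that could take a shape like the sketch below. The counter, the threshold constant, and the function names are made up for illustration (the real code would hook into the WAL/snapshot machinery and be async), but it shows the two halves: a snapshot loop that retries forever, and an ingest-side check that rejects writes once the backlog is too large.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::time::Duration;

// Arbitrary threshold for illustration; the real value would be configurable.
const MAX_BACKLOGGED_WAL_FILES: usize = 100;

// Count of WAL files written but not yet snapshotted and persisted.
static BACKLOGGED_WAL_FILES: AtomicUsize = AtomicUsize::new(0);

#[derive(Debug)]
enum IngestError {
    SnapshotBacklog { backlog: usize },
}

// Ingest path: refuse new writes once the snapshot backlog passes the limit,
// so the buffer can't grow without bound while snapshotting is failing.
fn check_backlog_before_write() -> Result<(), IngestError> {
    let backlog = BACKLOGGED_WAL_FILES.load(Ordering::Relaxed);
    if backlog >= MAX_BACKLOGGED_WAL_FILES {
        return Err(IngestError::SnapshotBacklog { backlog });
    }
    Ok(())
}

// Snapshot loop: retry forever, logging each failure, and only shrink the
// backlog counter once a snapshot actually succeeds.
fn run_snapshot_with_retries(mut try_snapshot: impl FnMut() -> Result<(), String>) {
    loop {
        match try_snapshot() {
            Ok(()) => {
                BACKLOGGED_WAL_FILES.fetch_sub(1, Ordering::Relaxed);
                return;
            }
            Err(e) => {
                eprintln!("snapshot failed: {e}; retrying");
                std::thread::sleep(Duration::from_secs(1));
            }
        }
    }
}
```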
Ah ok - skimming through your fix I thought it was missing the nulls for columns (the arrow builder didn't have a null). If the data in the buffer/WAL files is good, could we not just replay it once the underlying issue is fixed? Or is it because the sequence numbers can get messed up and we need a separate tool to reingest the data to backfill it? I agree with your idea though: retry the snapshot process continuously in the background, and if there are more than X WAL files, stop the writes.
I've also logged #25686 to stop writes when we hit a limit on WAL files backed up. |
If you follow the path of the snapshot code in `QueryableBuffer`, specifically `buffer_contents_and_persist_snapshotted_data`, you can see there are a few places where it might panic. The database and table missing scenarios should never happen, so I'm not worried about those. But there are some potential errors that could be encountered specifically with the persist jobs. We should verify that we continually retry writing to object storage (we should never fail out, just keep retrying forever and logging the errors). But there could be problems with creating the plan or executing the DataFusion code. #25673 just fixed one of these errors.
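As a point of reference, the "retry forever and log" behaviour for an object store upload could look roughly like this, assuming a tokio runtime; the `upload` function, `PersistJob` type, and backoff numbers are placeholders, not the actual persister code:

```rust
use std::time::Duration;

struct PersistJob {
    path: String,
    bytes: Vec<u8>,
}

// Placeholder for the real object store put; in practice this would go
// through whatever object store client the persister already uses.
async fn upload(job: &PersistJob) -> Result<(), String> {
    let _ = (&job.path, &job.bytes);
    Ok(())
}

// Never fail the persist job outright: log each error and retry with a
// capped exponential backoff until the upload succeeds.
async fn persist_with_infinite_retry(job: PersistJob) {
    let mut delay = Duration::from_millis(100);
    loop {
        match upload(&job).await {
            Ok(()) => return,
            Err(e) => {
                eprintln!("persist of {} failed: {e}; retrying in {delay:?}", job.path);
                tokio::time::sleep(delay).await;
                delay = (delay * 2).min(Duration::from_secs(30));
            }
        }
    }
}
```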
The question is what we should do when snapshotting fails. If we're unable to snapshot and persist, the WAL won't be able to get cleaned up, which will result in many thousands of WAL files (which could be very tricky to recover from). Also, the buffer will continuously fill with data, eventually eating all the RAM on the machine and getting OOM killed.
So what should we do in these failure scenarios? Start failing writes until the operator intervenes? Fail writes just for the given table and let the other tables continue to advance (in which case we wouldn't be able to clean up the log files for already-accepted writes that include data for that table)?
Either way, we should update those code paths to at least catch the errors and log them.
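For example, in the spots where plan creation or DataFusion execution can currently panic, the pattern would be to log the error and return it so the snapshot loop can decide whether to retry; the names below are illustrative, not the actual `QueryableBuffer` internals:

```rust
#[derive(Debug)]
enum SnapshotError {
    PlanCreation(String),
    Execution(String),
}

// Placeholder for DataFusion plan creation, which can fail for reasons
// unrelated to the shape of the buffered data (resource exhaustion, etc.).
fn build_persist_plan(table: &str) -> Result<String, SnapshotError> {
    Ok(format!("plan for {table}"))
}

fn snapshot_table(table: &str) -> Result<(), SnapshotError> {
    // Previously a failure here could panic; instead, log it and bubble the
    // error up so the caller can retry later or halt writes.
    let plan = build_persist_plan(table).map_err(|e| {
        eprintln!("failed to build persist plan for {table}: {e:?}");
        e
    })?;
    // ...execute the plan and hand the results to the persister...
    let _ = plan;
    Ok(())
}
```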
I think there is some number of backlogged WAL files at which point we'd want to start rejecting writes altogether.
This was observed in this conversation: https://influxdata.slack.com/archives/C084G9LR2HL/p1734445412251109
Tasks