-
Notifications
You must be signed in to change notification settings - Fork 123
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SQLite queue database corruption #324
Comments
Yes, this is excactly what I was seeing as well. Thanks for opening the issue and tagging me @andycroll as I wanted to open an issue for this too 🙌🏼 This happened to me in this open source repo: http://github.com/adrienpoly/rubyvideo You can easily reproduce it by running I think it only happens with the Puma Plugin from what I could tell. |
In production, I'm running as a separate worker, not the puma plugin. |
Ohhh! See also #294, where @wdiechmann provided some advice in the last comment, which I'm not sure if it applies to your case. The Puma plugin detail is quite interesting! Did you manage to reproduce only with that? |
Ah, sorry, just reread and I see @andycroll's error happened when running just the supervisor separately 🤔 |
In my case a deploy (on a Hatchbox managed server) so killing the worker process was certainly happening when the error started. |
It's also very surprising to me that you can corrupt a database with only Active Record and quite regular queries 🤔
I wouldn't know how to cause this on purpose 😅 A row missing from an index? 🤔 I'll use @marcoroth's instructions to reproduce and see if I can find out what's going on! |
I haven't been able to dig in myself yet, but after reading this: https://www.sqlite.org/howtocorrupt.html and knowing the bits that I know about the supervisor architecture, I was first drawn to:
|
If it helps, here's how my
|
Ohhhh, thanks a lot @fractaledmind. That'd make perfect sense as the culprit! I thought Rails would handle this properly and wouldn't reuse the same connection but I need to dig more. |
Yeah, this sounds like it might be what is occurring:
|
The reason I thought that were these PRs that I had looked at in the past, because of another error (not as serious as this one) that someone reported with PostgreSQL 🤔
There must be something else going on here. |
No idea yet if this is related, but rails/rails#31173 has this in the description:
And no |
Thanks @fractaledmind! I see this code has changed quite a lot since these PRs, (for example rails/rails#37296), so I was trying to understand how everything worked... That comment caught my attention too but I honestly wasn't sure why "there is no server to confuse" makes SQLite immune to this problem 😳 @marcoroth, I've been trying to reproduce the problem following your steps but I can't. I have to say I have upgraded Solid Queue in rubyvideo to version 0.8.2 as it was using a quite old version so maybe there's something else at play there. |
Tried without upgrading and same thing... I can't manage to reproduce it by starting the app with |
I've also tried to artificially delay the jobs that are being run (in particular, |
Oh, @andycroll, I just realised you're running version 0.3.4 in your logs above. The issue I referred to previously with PostgreSQL (#264) got fixed in version 0.4.0. However, in the |
@marcoroth: I created this repo to hopefully help us find a reproduction: https://github.com/fractaledmind/solid_queue_bughunt You will see that I have a You can see if the job eventually resolves by just waiting 5 seconds or so after you kill the Rails process and waiting for logs to print. All you have to do is You will see I have commented out automatically killing the Rails server process. I was trying that as well. Still, the job resolves. My hunch is that it is/was somehow possible to force Solid Queue to not stop gracefully and if a job was running and connected to the database, that could corrupt. But I am not sure. Maybe if we all work on this reproduction repo, we can hunt it down. |
Thanks a lot @fractaledmind! 🙏 I've been playing with it a bit, also tried downgrading Solid Queue there to 0.3.4 just in case. I've tried deleting the
but I think those were expected as I tried to start the server again when the workers themselves hadn't realised their parent had been killed and were still around. |
Also, here is a one-liner to check the queue database integrity: |
Nothing so far. Tried also configuring recurring long running jobs (that don't send any signals) every 10 seconds to add more stuff to the mix (and because that other report of a corrupted file was using recurring jobs heavily), but still haven't managed to reproduce this 😖 |
I'm trying to hang to any possible thread here as I'm quite stumped about why I can't manage to reproduce this 😅
@andycroll, do you know how the deploy is happening there? Does it happen by sending a specific signal to the Solid Queue process so it stops? With any grace period to terminate orderly? Or just terminate and then "pull the plug" right away...? |
Is there any chance that this is dependent on the sqlite3 gem variant you get? I'm running the |
I have had this happen twice on my local dev machine this week while running in the puma plugin and im on arm64. Unfortunately have no clue how it happened. I thought it might have something to do with switching between the puma plugin and running bin/jobs start and killing the process mid job but i cant recreate again. /shrug |
Im also using a separate db for queue stuff if that matters at all. |
@npezza93: Try pulling down my repro repo and see if you can set something up that causes corruption consistently: https://github.com/fractaledmind/solid_queue_bughunt Until we find a way to reproduce, we will never solve this. |
I upgraded SolidQueue to But I couldn't reproduce it on Prior to upgrading to |
I think im on to something. Just had it happen to me. Cant quite reproduce reliably yet but i think it may have something to do with me keeping the server running with the puma plugin running and closing my laptop for some period of time. Cant confirm yet but just a hunch. Will report back if i mind reliable instructions. Heres the log. It's weird that there are rollbacks happening trying to update the Process model and then it corrupts.
|
That reminds me that was my hunch when I saw this a while back. I upgrade aggressively and I've been conference prepping, so I haven't seen this personally in a while. But I recall wondering if it was related to closing the laptop. |
@npezza93 thanks a lot for that 🙏 It's telling that the error happens when updating the heartbeat, because the thread that runs the heartbeat task is started on the supervisor before forking 🤔 |
@rosa interesting! |
This is epic; thanks a lot @flavorjones 🙇♀️ 🏆 I think the conditions you describe could very well explain the very first issue from @andycroll. From your two conditions:
The first is always true for Solid Queue. The second shouldn't be, generally, as the supervisor waits for its supervised processes before exiting, but depending on how the deployment worked (e.g. grace period and then However, it wouldn't explain the other issues in macOS (like @marcoroth's example, no the examples where the computer goes to sleep; those are separate for sure), as I haven't been able to get a corrupted DB with your script after over one hour 😅 @marcoroth mentioned:
Maybe it has to do with that... I might try to reproduce with your script and |
This comment was marked as outdated.
This comment was marked as outdated.
Repository with a full Rails app reproduction here: https://github.com/flavorjones/2024-09-13-sqlite-corruption |
Ok, I finally could reproduce this with @flavorjones's test on macOS. Yesterday when I was testing that I didn't realise I was running this version of SQLite3
Today, I've got the corruption error with both:
I'm doing other tests now with different versions, there's also a difference if I reduce the number of children to 3 like yesterday. Update: yes, the number of children makes a difference. With |
Git bisecting, this is the PR/commit where this behavior changed: sparklemotion/sqlite3-ruby#392 / which has to do with the lifecycle of database connections and prepared statements. Getting closer. |
OK, I understand what's happening. Problems result when an inherited (from the parent process) database connection is In a Rails app it seems much more likely to be a statement being finalized because of the cached statements, but it can happen with the database objects as well. I think we should be able to work around this now that I understand the mechanism. |
OMG! You two are such absolute legends! @flavorjones you are a complete baller. @rosa you are a total magician. I am so grateful to get to work in the general vicinity of you both. I will buy you both some kind of drink in Toronto to show my thanks. |
OK, I hacked on the sqlite3 adapter so that it closes database connections before forking, and that seems to prevent the data corruption and disk i/o errors. I think we just need to polish the patch and land it in Rails and everything will be OK! Stay tuned. |
Aww @fractaledmind you're too kind! This was all @flavorjones, he's the magician 🪄. Plus it's me who's beyond grateful to you and everyone here for the help, it's a privilege to have people like you in the community 🙇🏻♀️ I'm definitely the one getting you drinks at Rails World! 😆 and Marco, Nick, Andy... 🥹🙏🏻 |
Epic team work, all 🤘. Sqlite is going to be a much bigger role in Rails apps going forward. |
Sqlite itself is not fork-safe, as documented in https://www.sqlite.org/howtocorrupt.html section 2.6. Any databases that are inherited from the parent process are unsafe to properly close and may impact either the parent or the child and potentially lead to database corruption. In this commit, a pre-fork callback `prepare_to_fork!` is introduced for the connection pools and adapters to take any necessary actions before forking. The callback is only implemented by the SQLite3Adapter which now invokes `disconnect!` to close all open prepared statements and database connections. As a side effect, the parent process will end up having to re-open database connections if it continues to do work, which may be a small performance overhead for some use cases, but is necessary in order to prevent database corruption. See rails/solid_queue#324 for examples of the type of corruption that can occur.
Sqlite itself is not fork-safe, as documented in https://www.sqlite.org/howtocorrupt.html section 2.6. Any databases that are inherited from the parent process are unsafe to properly close and may impact either the parent or the child and potentially lead to database corruption. In this commit, a pre-fork callback `prepare_to_fork!` is introduced for the connection pools and adapters to take any necessary actions before forking. The callback is only implemented by the SQLite3Adapter which now invokes `disconnect!` to close all open prepared statements and database connections. As a side effect, the parent process will end up having to re-open database connections if it continues to do work, which may be a small performance overhead for some use cases, but is necessary in order to prevent database corruption. See rails/solid_queue#324 for examples of the type of corruption that can occur.
See rails/rails#52931 which should address at least one of the issues discussed in this thread (fork-safety). I'd love to get some feedback on it! |
Sqlite itself is not fork-safe, as documented in https://www.sqlite.org/howtocorrupt.html section 2.6. Any databases that are inherited from the parent process are unsafe to properly close and may impact either the parent or the child and potentially lead to database corruption. In this commit, a pre-fork callback `prepare_to_fork!` is introduced for the connection pools and adapters to take any necessary actions before forking. The callback is only implemented by the SQLite3Adapter which now invokes `disconnect!` to close all open prepared statements and database connections. As a side effect, the parent process will end up having to re-open database connections if it continues to do work, which may be a small performance overhead for some use cases, but is necessary in order to prevent database corruption. See rails/solid_queue#324 for examples of the type of corruption that can occur.
Changed direction, here's the latest plan, I'd love feedback: |
@andycroll @marcoroth @fractaledmind I just cut a release candidate of sqlite3-ruby with the fork-safety changes from sparklemotion/sqlite3-ruby#558:
It's an invasive change, it's been a hard issue to reproduce, and it's challenging to test every possible scenario ... so I would really like to have some folks use it and see if we can find problems and/or build confidence. I've tried out the |
Released https://github.com/sparklemotion/sqlite3-ruby/releases/tag/v2.1.0.rc2 addressing a performance regression in rc1. |
FYI I have upgraded rubyvideo.dev with this latest version(adrienpoly/rubyvideo@71ece9b). It is now live in prod. So fare everything looks perfect. |
Been using the new sqlite rc version and havent had any issues. But when running
Should i be seeing that warning with this config? I have a database config of
And a queue.yml of
|
@npezza93 I think it's normal to see this message. I'm not a solid queue expert, but I believe the supervisor process hits the database (as part of active record bootstrapping if nothing else) before forking, and that would result in a warning on your tty. Someone who knows it better might be able to tell me how I'm wrong. Sqlite3-ruby maintainers discussed whether or not to emit warnings, but we decided to err on the side of overcommunicating, at least in the RC and probably the first final release of 2.1.0. |
@flavorjones Maybe we can set a flag for Rails 8 to suppress that warning? Because would be nice if out of the box there is nothing alarming posted about the default setup. Especially since we still have some ways to go to reassure folks that sqlite is good and safe in production. |
@dhh Yep. We can add a method |
Also see rails/rails#53033 |
Perfect, @flavorjones 👌 |
Yes, that's exactly right! It uses the DB to register itself, which happens before forking. It's useful to have the supervisor registered before because, in that way, the other processes can register themselves knowing their
I haven't had time in the last few days, but I hope to address this in Solid Queue as well, closing the connections before forking. |
Does sparklemotion/sqlite3-ruby#558 close this? |
Yes. |
I'm seeing a strange error. I dropped @fractaledmind a note and he suggested I post here.
In
sqlite3
client.Doing a
.clone
doesn't seem to fix.I'm not in "real" production and this is a job queue and thus somewhat destroyable, so I'm going to take that approach.
I've also seen similar corruption on my local development environment too akin to that mentioned by @marcoroth on X.
https://x.com/marcoroth_/status/1831045885423497223
In that case I just destroyed and recreated the file and moved on.
The text was updated successfully, but these errors were encountered: