-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
main boot-collector process exits because of race condition with previous boot-collector process with --usleep arg detected #25
Comments
Riiight - it is possible that we are mis-filtering collectors that have the --usleep still running - but there should be code to detect and simply skip over those. Quite possibly that code doesn't work when they are exiting [ perhaps there is some race condition whereby the cmdline-args are destroyed by glibc before the kernel kills the process or something ? ]. If so that would be an interesting angle [ and yes the intermittent nature is deadly annoying - help fixing it much appreciated ]. Could you enable the debug to collector/dump.c in bootchart_find_running_pid to see if we can get something out of that ? In the medium term, I had hoped to move towards killing the nasty shell-script wrapper, and moving ever more of that into the collector process itself (into plain C) - any progress you can help out with there is much appreciated ! |
The call to fread() on the cmdline file in the proc entry of the process with the --usleep argument returns a length of 0 instead of the expected len of "--usleep". I think it is somehow catching the process so early in its execution that the proc entry hasn't been completely initialized yet. len = fread (abuffer, 1, 4095, argf); Annoyingly enough, the more debug code I throw in, the harder it is to reproduce. I could kill the loop at the beginning of wait_boot, because I've modified the script to wait X seconds as the exit criteria which will fix it for me, but will give some thought on the race condition. |
Wow - that is helpful; so yes there are races and unfinished state at start/finish of processes which is sad/broken but hard to avoid I guess. I think that (really) the sol'n is to bin the use of bootchart-collector --usleep in the code - perhaps we can do this with an installed symlink - so the process has a different name, and then do the usleep mode on argv[0] ? so ln- -s bootchart-collector busleep; busleep 1000 etc. ? |
The process will still have the same name, at least as seen under proc. When busleep is executed, /proc/1269/exe is still a link to exe -> /lib/bootchart/bootchart-collector. I think there will have be a change in the bootchart_find_running_pid function to use some other way to detect whether there is already a bootchart-collector process collecting data. |
Riight - in reality I think it should be fairly trivial to remove the need for usleep. Currently we use it in bootchartd.in for: Patches much appreciated [ of course ]. I guess it is the 'wait_boot' ones that are most problematic. |
What about this? This is only compile tested and break initrd case. |
Patch looks interesting to me, I think we need to ',' delimit arguments of EXIT_PROC (and in strtok) otherwise they'll be separate arguments ;-) I don't quite see how the two steps - running the collector inside some odd chroot, and running the dump phase from some later date / in a different root works out - but if you say it does :-) that's all good. |
Yeah, this is more of a proof of concept and some more work is needed to have it work. |
I've observed that bootchartd is unable to dump logs about 10% of the time because the bootchart-collector process isn't available. It looks like the process that is supposed to be there exits because there is a bootchart-collector process with the argument --usleep 5000 still running. This is from the wait_boot() function thats put in the background right before.
I've added -c to /sbin/bootchartd and a fprintf at the beginning of main() in collector.c to narrow down the issue.
[partial output]
Starting bootchart logging
bootchart: no initrd used; starting
bootchart-collector started as pid 415 with 3 args: '-c' '--usleep' '5000'
bootchart-collector started as pid 414 with 2 args: '-c' '50'
bootchart console.debug [1]
bootchart-collector started as pid 413 with 3 args: '-c' '--usleep' '250000'
bootchart-collector started as pid 414 with 2 args: '-c' '50'
bootchart-collector run outside initrd
bootchart collector already running as pid 416, exiting...
bootchart-collector started as pid 416 with 3 args: '-c' '--usleep' '5000'
bootchart-collector started as pid 417 with 3 args: '-c' '--usleep' '5000'
bootchart-collector started as pid 418 with 3 args: '-c' '--usleep' '5000'
bootchart-collector started as pid 419 with 3 args: '-c' '--usleep' '5000'
bootchart-collector started as pid 420 with 3 args: '-c' '--usleep' '5000'
bootchart-collector started as pid 421 with 3 args: '-c' '--usleep' '5000'
bootchart-collector started as pid 422 with 3 args: '-c' '--usleep' '5000'
bootchart-collector started as pid 423 with 3 args: '-c' '--usleep' '5000'
bootchart-collector started as pid 424 with 3 args: '-c' '--usleep' '5000'
.....
bootchart-collector started as pid 1207 with 3 args: '-c' '--dump' '/tmp/bootchart.XXXXN2niDZ'
bootchart-collector started as pid 1207 with 3 args: '-c' '--dump' '/tmp/bootchart.XXXXN2niDZ'
Failed to find the collector's pid
bootchart-collector pid: 1207 unmounted proc / clean exit
Can't extract boot chart from collector
The text was updated successfully, but these errors were encountered: