Sonar is slow without --batchless because of how we get slurm job IDs #97

Closed
lars-t-hansen opened this issue Aug 21, 2023 · 6 comments
Labels: bug (Something isn't working), performance

lars-t-hansen (Collaborator) commented Aug 21, 2023

Running a sonar release build just now on a lightly loaded ML node (ml7, a beefy AMD system), I see 0.27s real time with --batchless and 2.5s real time without --batchless (about 10x slower). The difference is even starker on my development system (a slightly older Xeon tower): 0.03s vs 1.63s (about 50x).

I run with --exclude-users=root --exclude-system-jobs --rollup to keep the amount of output to a minimum, so we know that output generation is not the main problem.

Running perf on this, it is clear that the problem is in get_slurm_job_id: every profiling hit in the first several pages of profiler output is in the pipeline that function runs to get the job ID. We can probably do much better here (and we'll need to).
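For context, a rough, hypothetical sketch of the kind of per-process shell pipeline at issue; the exact command sonar runs may differ, but the cost profile is the same: every call forks a shell plus the pipeline stages, and that is paid once per examined process.

```rust
use std::process::Command;

// Hypothetical illustration only, not sonar's actual code: fetching the slurm
// job ID by shelling out. Each call pays fork/exec for the shell and every
// pipeline stage, once per process, which is what dominates the profile.
fn get_slurm_job_id_via_shell(pid: u32) -> Option<String> {
    let cmd = format!(
        "cat /proc/{pid}/cgroup | grep -o 'job_[0-9]*' | head -n 1 | sed 's/job_//'"
    );
    let output = Command::new("sh").arg("-c").arg(&cmd).output().ok()?;
    let id = String::from_utf8_lossy(&output.stdout).trim().to_string();
    if id.is_empty() { None } else { Some(id) }
}
```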

@lars-t-hansen added the enhancement (New feature or request), performance, and bug (Something isn't working) labels and removed the enhancement label on Aug 21, 2023
bast (Member) commented Aug 21, 2023

That sounds very slow. This is built with --release?

lars-t-hansen (Collaborator, Author) commented

Yes. The problem, I think, is that we moved process filtering much later in the pipeline, so this task runs much more often than it used to. The two obvious approaches to fixing it are to not compute the job ID until we need it (though I don't know how much that will help) and to avoid the shell pipeline for what is, after all, the very simple job of extracting some text from a file, which does not actually need a regular expression.
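A minimal sketch of the "no shell, no regex" direction, assuming the job ID shows up in /proc/<pid>/cgroup as a path component of the form job_<digits> (as it does under slurm's cgroup hierarchy); the function name and signature are illustrative, not necessarily what sonar ends up with:

```rust
use std::fs;

// Read /proc/<pid>/cgroup directly and scan for a "job_<digits>" path component
// using plain string operations -- no subprocess, no regular expression.
fn get_slurm_job_id(pid: u32) -> Option<usize> {
    let text = fs::read_to_string(format!("/proc/{pid}/cgroup")).ok()?;
    for line in text.lines() {
        if let Some(ix) = line.find("/job_") {
            let digits: String = line[ix + "/job_".len()..]
                .chars()
                .take_while(|c| c.is_ascii_digit())
                .collect();
            if !digits.is_empty() {
                return digits.parse().ok();
            }
        }
    }
    None // e.g. on a machine with no batch system: the whole file is scanned, cheaply
}
```

On a node without slurm this still reads and scans the whole (small) cgroup file, but with no fork/exec it should cost on the order of microseconds per process.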

bast (Member) commented Aug 21, 2023

OK, indeed we need to fix this. It needs to run well below 1 second per poll, ideally in milliseconds. I think we might need to do both: avoid the shell pipeline and delay the computation as much as we can.
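A sketch of what the "delay it" half could look like, under the assumption that we carry a per-process record and only ask for the job ID once a process has survived filtering; ProcInfo is an illustrative name, not sonar's actual type, and get_slurm_job_id is the direct-read helper sketched above:

```rust
// Illustrative per-process record that computes the slurm job ID lazily and
// caches the result, so processes filtered out earlier never pay for the lookup.
struct ProcInfo {
    pid: u32,
    job_id: Option<Option<usize>>, // None = not looked up yet
}

impl ProcInfo {
    fn new(pid: u32) -> Self {
        ProcInfo { pid, job_id: None }
    }

    // First call does the /proc read; later calls return the cached value.
    fn job_id(&mut self) -> Option<usize> {
        if self.job_id.is_none() {
            self.job_id = Some(get_slurm_job_id(self.pid));
        }
        self.job_id.unwrap()
    }
}
```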

lars-t-hansen (Collaborator, Author) commented

I'll take a look after lunch, since technically I introduced this bug :-)

lars-t-hansen (Collaborator, Author) commented

Avoiding the pipeline brings the time of the slow version down to the time of the fast version, and that is on a system where the matching line is never found (there is no batch system here), so the entire cgroup file has to be read and parsed. I'm not sure how well I'll be able to test this locally yet, but I'll look into that.

lars-t-hansen (Collaborator, Author) commented

Another factor that I don't know how to think about yet: at least on the Fox compute nodes, once I've found a slurm ID for one process, it looks as though every process on the node that has a slurm ID has the same one. That would be a tricky invariant to rely on, though, and it's probably not low-hanging fruit. Let's see what the profile looks like after we've fixed #86, #87, and #88.
