How to reduce memory consumption during population calling? #448

Open
tnguyengel opened this issue Jan 8, 2024 · 13 comments

@tnguyengel

We would like to reduce memory consumption during population calling. Is it possible to split SNF files by chromosome or genomic region?

Alternatively, should we supply smaller bams to Sniffles2 by splitting bams such that each bam only contains the reads that align to a chromosome/genomic region?

Related to #282.

@fritzsedlazeck
Owner

There will be a new release coming very soon (days away) that reduces memory consumption and allows splitting. @hermannromanek is on it :)
Thanks
Fritz

@tnguyengel
Author

Has the feature to split up SNF files by chromosome already been released? If so, where can we find the new binaries?

@hermannromanek
Collaborator

Hi,

Sorry for the delay - we encountered some issues which had to be fixed first and are in the process of re-testing.

I just pushed the current release candidate; feel free to give it a try. Bear in mind that it is not yet fully tested: there is one known open bug that causes Sniffles to report the same SVs twice. Please share any other issues you encounter with us.

To enable the improved population calling, please also make sure the library psutil is installed.

Thanks,
Hermann

@hermannromanek hermannromanek self-assigned this Feb 27, 2024
@tnguyengel
Author

tnguyengel commented Apr 24, 2024

I noticed that there is a new release: https://github.com/fritzsedlazeck/Sniffles/releases/tag/v2.3.2. Does it happen to solve this issue of large RAM usage for many samples? (We estimated that Sniffles v2.2 would use ~500-600 GB of RAM for multisample calling on 5000 human ONT samples, with no way to parallelize the effort across multiple machines to reduce RAM consumption.) If so, how does Sniffles v2.3+ handle many samples? Does it automatically throttle memory usage when it detects that usage is becoming too high? We can't seem to find a way to tell Sniffles v2.3+ to process the SNF files by chromosome (thereby increasing parallelism and reducing RAM usage on a single machine).

@fritzsedlazeck
Owner

Hey @tnguyengel
as you can imagine, it's a bit tricky :) What @hermannromanek implemented is a window approach that lets you scale with multithreading and memory. Tight control of the memory is tricky, but Hermann can explain how to run it.
Thanks
Fritz

@hermannromanek
Collaborator

Hi @tnguyengel

Yes, Sniffles 2.3 should use considerably less memory for merging than 2.2 did. It does so by monitoring RAM usage and freeing memory once the footprint exceeds 2 GB per thread/worker process (a threshold that will be hit quite quickly when processing 5000 samples). Also, whereas in 2.2 each thread worked on a separate chromosome, in 2.3 the threads work on the same chromosome in parallel, so you get better parallelization when processing only one chromosome.

To process a single chromosome you can use the new parameter --contig CONTIG (or -c CONTIG) with CONTIG being the contig name you want to process.

What's the command you've been trying to run Sniffles with?

Thanks for your feedback,
Hermann

@tnguyengel
Author

tnguyengel commented Apr 25, 2024

> What's the command you've been trying to run Sniffles with?

For both Sniffles v2.3.2 and Sniffles v2.2, we were running

sniffles -t ${threads} --allow-overwrite --input "${snf_list}" --vcf "${out_merged_vcf}"

> To process a single chromosome you can use the new parameter --contig CONTIG (or -c CONTIG) with CONTIG being the contig name you want to process.

Facepalm! I missed that. My apologies. We'll try scaling tests again with the --contig option.
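
For anyone following along, here is a minimal sketch of how the command above could be combined with --contig to produce one merge job per contig. The flags are the ones quoted in this thread; the contig names, file paths, and thread count are placeholders (assumptions), and `sniffles_cmd` is a hypothetical helper, not part of Sniffles. The loop only prints the commands (a dry run), so nothing is executed until you decide how to dispatch them.

```shell
#!/bin/sh
set -eu

# Placeholder values (assumptions, not from the thread): adjust to your data.
contigs="chr1 chr2 chr3"
snf_list="snf_files.tsv"
out_dir="merged_by_contig"
threads=8

# Hypothetical helper: print the sniffles command line for a single contig.
sniffles_cmd() {
  printf 'sniffles -t %s --allow-overwrite --input %s --vcf %s --contig %s\n' \
    "$threads" "$snf_list" "${out_dir}/${1}.vcf" "$1"
}

# Dry run: emit one command per contig.
for contig in $contigs; do
  sniffles_cmd "$contig"
done
```

Each printed line could be piped to `sh` or submitted as a separate job on a different machine, which is one way to spread the memory load. Note that the helper does not quote its arguments, so paths containing spaces would need extra care.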

@lfpaulin
Collaborator

Dear tnguyengel, did you manage to run the 5000 samples?
We just released a new version (2.3.3) that addresses some issues, and we are continuing to improve the merging of large datasets. Your feedback is much appreciated.

@tnguyengel
Author

We don't have the full 5000 samples to run yet, but that will be the final set that we eventually run with. We will rerun scaling tests with v2.3.3, and report the results here.

@fritzsedlazeck
Owner

Cool. We keep testing and optimizing. Keep us posted and we will push forward.
Thanks
Fritz

@tnguyengel
Author

tnguyengel commented Jun 7, 2024

> Dear tnguyengel, did you manage to run the 5000 samples?
> We just released a new version (2.3.3) that addresses some issues, and we are continuing to improve the merging of large datasets. Your feedback is much appreciated.

FYI, initial scaling tests with up to 35 samples indicate that v2.3.3 would theoretically use ~100 GB of RAM to aggregate a single contig across a 5000-sample cohort. That is much more reasonable in terms of resource usage. I'll report more detailed results as we go along.
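
In case it helps others reproduce this kind of scaling test, below is a rough sketch of one way to record peak memory per run. It assumes GNU time (`/usr/bin/time -v`) is available and that the SNF list has one entry per line; `peak_rss_gib` is a hypothetical helper, and the file names in the commented usage are placeholders. It is not necessarily how the numbers above were obtained.

```shell
# Hypothetical helper: print the peak resident set size (in GiB) recorded
# in a log written by GNU `time -v`, which reports the value in kbytes.
peak_rss_gib() {
  awk -F': ' '/Maximum resident set size/ { printf "%.2f\n", $2 / 1048576 }' "$1"
}

# Assumed usage (placeholders): merge the first n samples and log peak memory.
# n=35
# head -n "$n" "$snf_list" > "subset_${n}.tsv"
# /usr/bin/time -v -o "time_${n}.log" sniffles -t "$threads" --allow-overwrite \
#   --input "subset_${n}.tsv" --vcf "subset_${n}.vcf" --contig chr1
# peak_rss_gib "time_${n}.log"
```

One caveat: GNU time reports the largest resident set size of any single process it waited for, not the sum over all worker processes, so with several workers the total footprint may be higher than this figure suggests.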

@hermannromanek hermannromanek added this to the 2.5 milestone Sep 23, 2024
@hermannromanek
Collaborator

While there are more improvements to come, v2.5 should already improve multisample calling on larger data sets significantly. Merging 35 samples should stay well below 10 GB of RAM.

@fritzsedlazeck
Owner

Hey guys, the new version just went live and is much better in terms of memory consumption. Please test it out.
Cheers
Fritz
