Run time estimation #9

Open
PeteHaitch opened this issue Feb 3, 2020 · 3 comments

@PeteHaitch

Similar to #3 but wondering if things have changed.
Running cellSNP v0.1.7 as

  cellSNP --samFile ${CELLRANGERDIR}/"${SAMPLE}"/outs/possorted_genome_bam.bam \
          --outDir ${OUTDIR} \
          --regionsVCF genome1K.phase3.SNP_AF5e4.chr1toX.hg38.vcf.gz \
          --barcodeFile ${PROJECT_ROOT}/data/emptyDrops/"${SAMPLE}".barcodes.txt \
          --nproc 20 \
          --minMAF 0.1 \
          --minCOUNT 20

with 31,707 barcodes on a 25G BAM file has been going for > 18 days!
It's still writing output, too (as of 2020-02-03 5PM):

% ll -t data/cellSNP/cellSNP.cells.vcf.gz.temp_*
-rw-r----- 1 hickey grpu_mritchie_1 2.0G Feb  3 17:02 data/cellSNP/cellSNP.cells.vcf.gz.temp_17_
-rw-r----- 1 hickey grpu_mritchie_1 2.4G Feb  3 17:02 data/cellSNP/cellSNP.cells.vcf.gz.temp_3_
-rw-r----- 1 hickey grpu_mritchie_1 2.3G Feb  3 17:02 data/cellSNP/cellSNP.cells.vcf.gz.temp_11_
-rw-r----- 1 hickey grpu_mritchie_1 2.2G Feb  3 17:02 data/cellSNP/cellSNP.cells.vcf.gz.temp_15_
-rw-r----- 1 hickey grpu_mritchie_1 2.0G Feb  3 17:01 data/cellSNP/cellSNP.cells.vcf.gz.temp_16_
-rw-r----- 1 hickey grpu_mritchie_1 1.1G Feb  3 16:59 data/cellSNP/cellSNP.cells.vcf.gz.temp_19_
-rw-r----- 1 hickey grpu_mritchie_1 1.9G Feb  3 16:56 data/cellSNP/cellSNP.cells.vcf.gz.temp_12_
-rw-r----- 1 hickey grpu_mritchie_1 2.2G Feb  3 16:55 data/cellSNP/cellSNP.cells.vcf.gz.temp_8_
-rw-r----- 1 hickey grpu_mritchie_1 2.4G Feb  3 16:54 data/cellSNP/cellSNP.cells.vcf.gz.temp_6_
-rw-r----- 1 hickey grpu_mritchie_1 2.0G Feb  3 16:54 data/cellSNP/cellSNP.cells.vcf.gz.temp_9_
-rw-r----- 1 hickey grpu_mritchie_1 2.0G Feb  3 16:51 data/cellSNP/cellSNP.cells.vcf.gz.temp_10_
-rw-r----- 1 hickey grpu_mritchie_1 2.1G Feb  3 16:50 data/cellSNP/cellSNP.cells.vcf.gz.temp_1_
-rw-r----- 1 hickey grpu_mritchie_1 2.2G Feb  3 16:46 data/cellSNP/cellSNP.cells.vcf.gz.temp_2_
-rw-r----- 1 hickey grpu_mritchie_1 2.3G Feb  3 16:38 data/cellSNP/cellSNP.cells.vcf.gz.temp_14_
-rw-r----- 1 hickey grpu_mritchie_1 2.3G Feb  3 16:37 data/cellSNP/cellSNP.cells.vcf.gz.temp_13_
-rw-r----- 1 hickey grpu_mritchie_1 2.0G Feb  3 16:32 data/cellSNP/cellSNP.cells.vcf.gz.temp_4_
-rw-r----- 1 hickey grpu_mritchie_1 2.5G Feb  3 16:19 data/cellSNP/cellSNP.cells.vcf.gz.temp_7_
-rw-r----- 1 hickey grpu_mritchie_1 2.1G Feb  3 15:04 data/cellSNP/cellSNP.cells.vcf.gz.temp_18_
-rw-r----- 1 hickey grpu_mritchie_1 1.9G Feb  3 13:44 data/cellSNP/cellSNP.cells.vcf.gz.temp_0_
-rw-r----- 1 hickey grpu_mritchie_1 1.8G Feb  2 23:52 data/cellSNP/cellSNP.cells.vcf.gz.temp_5_

I've run cellSNP before and although it took a few days it certainly didn't take this long.
I'm wondering:

  • What particular parts of this (e.g., size of BAM, number of barcodes, number of loci in --regionsVCF, ...) might be causing this huge runtime?
  • What might I do to speed cellSNP up for subsequent datasets (I'm anticipating several datasets, many larger than this, over the course of the year)?
  • How can I estimate how much longer this particular process has to run?

Thanks,
Pete

@huangyh09
Collaborator

Hi Pete,

The bottleneck is still there for large data sets. In your case, it is probably caused by the large number of cell barcodes. Normally, it runs within one or two days for ~10k cells, so 31k cells may increase the running time. The running time is also linear in the number of candidate SNPs (i.e., --regionsVCF). I suggest you change from the SNP_AF5e4 version to SNP_AF5e2, or even use the one you got last time (i.e., the output cellSNP.base.vcf.gz).

For speeding up, maybe you could split the candidate SNPs (e.g., by chromosome or at random) and run the pieces on multiple nodes if you are on a cluster.
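
A minimal sketch of the by-chromosome split, assuming the candidate SNPs are in a standard VCF (optionally gzipped) with the chromosome name in the first column. The function name `split_vcf_by_chrom` and the output naming are hypothetical, not part of cellSNP:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of cellSNP): split a candidate-SNP VCF into
# one uncompressed VCF per chromosome, repeating the header in each piece.
# Usage: split_vcf_by_chrom <in.vcf | in.vcf.gz> <out_dir>
split_vcf_by_chrom() {
    local vcf="$1" outdir="$2"
    mkdir -p "$outdir"
    # Decompress on the fly if the input is gzipped.
    local reader="cat"
    case "$vcf" in *.gz) reader="gzip -dc" ;; esac
    $reader "$vcf" | awk -v dir="$outdir" '
        /^#/ { header = header $0 "\n"; next }   # collect header lines
        {
            out = dir "/" $1 ".vcf"              # one file per CHROM value
            if (!(out in seen)) { printf "%s", header > out; seen[out] = 1 }
            print > out
        }'
}
```

Each piece could then be passed as --regionsVCF to a separate cellSNP job with its own --outDir. Compress the pieces first if your setup expects .vcf.gz input; combining the per-piece outputs afterwards is not covered here.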

For estimating the running time, you could read the log file, which shows how many SNPs have been processed.
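
A rough sketch of that estimate, assuming the per-SNP cost is about constant (in line with the linear scaling mentioned above). The log format is not shown in this thread, so the processed-SNP count is read off the log by hand; `count_vcf_records` and `eta_days` are hypothetical helper names:

```shell
#!/usr/bin/env bash
# Count candidate SNPs (non-header lines) in a VCF, gzipped or not.
# Usage: count_vcf_records <in.vcf | in.vcf.gz>
count_vcf_records() {
    local vcf="$1" reader="cat"
    case "$vcf" in *.gz) reader="gzip -dc" ;; esac
    $reader "$vcf" | grep -vc '^#'
}

# Linear extrapolation of the remaining run time.
# Usage: eta_days <processed_snps> <total_snps> <elapsed_days>
eta_days() {
    awk -v p="$1" -v t="$2" -v e="$3" 'BEGIN {
        if (p <= 0 || t < p) { print "NA"; exit }   # no progress yet or bad input
        printf "%.1f\n", e * (t - p) / p            # remaining = elapsed * left / done
    }'
}
```

For example, with made-up numbers, `eta_days 3000000 4000000 18` prints `6.0`: if 3 of 4 million SNPs took 18 days, the last million should take about 6 more.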

Yuanhua

@PeteHaitch
Author

Thanks, Yuanhua.

I've started looking into providing a much-reduced set of candidate SNPs.
Might I suggest adding to the documentation to explain how cellSNP scales in the number of barcodes, candidate SNPs, and number of reads?
It would also be useful to have a reduced set of candidate SNPs for common use cases, e.g., SNP_AF5e4 or SNP_AF5e2 intersected with 3' UTRs (or similar) for use with 10X 3' scRNA-seq data.
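
One way such a reduced set could be built, as a sketch only: keep the candidate SNPs that fall inside intervals from a BED file (e.g., 3' UTRs). This assumes a tab-separated BED with 0-based, half-open coordinates and chromosome names that match the VCF exactly; `filter_vcf_by_bed` is a hypothetical helper, and for genome-scale files a dedicated tool such as bedtools or bcftools would likely be faster than this linear scan:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the VCF header plus the records whose POS lies
# inside any interval of a BED file (BED start is 0-based, end is exclusive).
# Usage: filter_vcf_by_bed <regions.bed> <in.vcf | in.vcf.gz> > reduced.vcf
filter_vcf_by_bed() {
    local bed="$1" vcf="$2" reader="cat"
    case "$vcf" in *.gz) reader="gzip -dc" ;; esac
    $reader "$vcf" | awk -F'\t' -v bed="$bed" '
        BEGIN {
            # Load all intervals, grouped by chromosome name.
            while ((getline line < bed) > 0) {
                if (split(line, f, "\t") < 3) continue
                n[f[1]]++
                start[f[1], n[f[1]]] = f[2] + 0
                stop[f[1], n[f[1]]]  = f[3] + 0
            }
        }
        /^#/ { print; next }                      # keep the header as-is
        {
            pos = $2 + 0                          # VCF POS is 1-based
            for (i = 1; i <= n[$1]; i++)
                if (pos > start[$1, i] && pos <= stop[$1, i]) { print; next }
        }'
}
```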

@micans

micans commented Jul 29, 2020

Thank you for this great tool.
We are running cellSNP without a VCF file (10x data mode 2) and it has now been running for a week.
Is there any downside to further parallelising by running a separate cellSNP process for each chromosome using --chrom (this would give me greater flexibility in task distribution)? Do you have any other/further recommendations for speeding up processing?
Thanks,
Stijn
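
For illustration, the per-chromosome distribution described above could be scripted roughly as follows. This is a dry-run sketch that only prints one command per chromosome for a scheduler to pick up. It assumes --chrom accepts a single chromosome name, reuses only flags that appear in this thread, and leaves open whether the per-chromosome outputs can be combined without side effects; `print_per_chrom_commands` is a hypothetical name:

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: print one cellSNP command per chromosome,
# each writing to its own output directory. Nothing is executed.
# Usage: print_per_chrom_commands <bam> <barcodes> <out_root> <chrom>...
print_per_chrom_commands() {
    local bam="$1" barcodes="$2" outroot="$3"
    shift 3
    local chrom
    for chrom in "$@"; do
        # Assumption: --chrom takes a single chromosome name here.
        printf 'cellSNP --samFile %s --barcodeFile %s --outDir %s/chrom_%s --chrom %s --minMAF 0.1 --minCOUNT 20\n' \
            "$bam" "$barcodes" "$outroot" "$chrom" "$chrom"
    done
}
```

Calling `print_per_chrom_commands possorted_genome_bam.bam barcodes.txt out 1 2 X` prints three lines, one per chromosome, which can be submitted as separate jobs.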
