Run time estimation #9

PeteHaitch · 2020-02-03T06:06:31Z

Similar to #3 but wondering if things have changed.
Running cellSNP v0.1.7 as

  cellSNP --samFile ${CELLRANGERDIR}/"${SAMPLE}"/outs/possorted_genome_bam.bam \
          --outDir ${OUTDIR} \
          --regionsVCF genome1K.phase3.SNP_AF5e4.chr1toX.hg38.vcf.gz \
          --barcodeFile ${PROJECT_ROOT}/data/emptyDrops/"${SAMPLE}".barcodes.txt \
          --nproc 20 \
          --minMAF 0.1 \
          --minCOUNT 20

with 31,707 barcodes on a 25G BAM file has been going for > 18 days!
It's still writing output, too (as of 2020-02-03 5PM):

% ll -t data/cellSNP/cellSNP.cells.vcf.gz.temp_*
-rw-r----- 1 hickey grpu_mritchie_1 2.0G Feb  3 17:02 data/cellSNP/cellSNP.cells.vcf.gz.temp_17_
-rw-r----- 1 hickey grpu_mritchie_1 2.4G Feb  3 17:02 data/cellSNP/cellSNP.cells.vcf.gz.temp_3_
-rw-r----- 1 hickey grpu_mritchie_1 2.3G Feb  3 17:02 data/cellSNP/cellSNP.cells.vcf.gz.temp_11_
-rw-r----- 1 hickey grpu_mritchie_1 2.2G Feb  3 17:02 data/cellSNP/cellSNP.cells.vcf.gz.temp_15_
-rw-r----- 1 hickey grpu_mritchie_1 2.0G Feb  3 17:01 data/cellSNP/cellSNP.cells.vcf.gz.temp_16_
-rw-r----- 1 hickey grpu_mritchie_1 1.1G Feb  3 16:59 data/cellSNP/cellSNP.cells.vcf.gz.temp_19_
-rw-r----- 1 hickey grpu_mritchie_1 1.9G Feb  3 16:56 data/cellSNP/cellSNP.cells.vcf.gz.temp_12_
-rw-r----- 1 hickey grpu_mritchie_1 2.2G Feb  3 16:55 data/cellSNP/cellSNP.cells.vcf.gz.temp_8_
-rw-r----- 1 hickey grpu_mritchie_1 2.4G Feb  3 16:54 data/cellSNP/cellSNP.cells.vcf.gz.temp_6_
-rw-r----- 1 hickey grpu_mritchie_1 2.0G Feb  3 16:54 data/cellSNP/cellSNP.cells.vcf.gz.temp_9_
-rw-r----- 1 hickey grpu_mritchie_1 2.0G Feb  3 16:51 data/cellSNP/cellSNP.cells.vcf.gz.temp_10_
-rw-r----- 1 hickey grpu_mritchie_1 2.1G Feb  3 16:50 data/cellSNP/cellSNP.cells.vcf.gz.temp_1_
-rw-r----- 1 hickey grpu_mritchie_1 2.2G Feb  3 16:46 data/cellSNP/cellSNP.cells.vcf.gz.temp_2_
-rw-r----- 1 hickey grpu_mritchie_1 2.3G Feb  3 16:38 data/cellSNP/cellSNP.cells.vcf.gz.temp_14_
-rw-r----- 1 hickey grpu_mritchie_1 2.3G Feb  3 16:37 data/cellSNP/cellSNP.cells.vcf.gz.temp_13_
-rw-r----- 1 hickey grpu_mritchie_1 2.0G Feb  3 16:32 data/cellSNP/cellSNP.cells.vcf.gz.temp_4_
-rw-r----- 1 hickey grpu_mritchie_1 2.5G Feb  3 16:19 data/cellSNP/cellSNP.cells.vcf.gz.temp_7_
-rw-r----- 1 hickey grpu_mritchie_1 2.1G Feb  3 15:04 data/cellSNP/cellSNP.cells.vcf.gz.temp_18_
-rw-r----- 1 hickey grpu_mritchie_1 1.9G Feb  3 13:44 data/cellSNP/cellSNP.cells.vcf.gz.temp_0_
-rw-r----- 1 hickey grpu_mritchie_1 1.8G Feb  2 23:52 data/cellSNP/cellSNP.cells.vcf.gz.temp_5_

I've run cellSNP before and although it took a few days it certainly didn't take this long.
I'm wondering:

What particular parts of this (e.g., size of BAM, number of barcodes, number of loci in --regionsVCF, ...) might be causing this huge runtime?
What might I do to speed cellSNP up for subsequent datasets (I'm anticipating several datasets, many larger than this, over the course of the year)?
How can I estimate how much longer this particular process has to run?

Thanks,
Pete


huangyh09 · 2020-02-03T06:54:40Z

Hi Pete,

The bottleneck is still there for large datasets. In your case, it is probably caused by the large number of cell barcodes: cellSNP normally finishes within one or two days for ~10k cells, so 31k cells will increase the running time. The running time also scales linearly with the number of candidate SNPs (i.e., --regionsVCF). I suggest you switch from the SNP_AF5e4 version to SNP_AF5e2, or even to the SNPs you got last time (i.e., the output cellSNP.base.vcf.gz).

To speed things up, you could split the candidate SNPs (e.g., by chromosome or at random) and run the pieces on multiple nodes, if you are running on a cluster.
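As a rough illustration of the splitting idea, here is a minimal Python sketch that writes one candidate-SNP VCF per chromosome, each with the original header. The function name `split_vcf_by_chrom` and the `candidates.<chrom>.vcf.gz` naming are made up for this example and are not part of cellSNP. The pieces are written with plain gzip; whether cellSNP needs bgzip compression or an index for them is not checked here.

```python
import gzip
import os


def split_vcf_by_chrom(vcf_path, out_dir):
    """Write one gzip-compressed VCF per chromosome; return {chrom: path}.

    Hypothetical helper for illustration; not part of cellSNP.
    """
    opener = gzip.open if vcf_path.endswith(".gz") else open
    os.makedirs(out_dir, exist_ok=True)
    header = []   # all '#' lines, copied into every piece
    handles = {}  # chrom -> open output file
    paths = {}    # chrom -> output path
    try:
        with opener(vcf_path, "rt") as vcf:
            for line in vcf:
                if line.startswith("#"):
                    header.append(line)
                    continue
                chrom = line.split("\t", 1)[0]
                if chrom not in handles:
                    paths[chrom] = os.path.join(out_dir, f"candidates.{chrom}.vcf.gz")
                    handles[chrom] = gzip.open(paths[chrom], "wt")
                    handles[chrom].writelines(header)
                handles[chrom].write(line)
    finally:
        for handle in handles.values():
            handle.close()
    return paths
```

Each resulting file could then be passed to a separate cellSNP job via --regionsVCF, with its own --outDir so that the jobs do not overwrite each other's output.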

For estimating the running time, you could read the log file, which shows how many SNPs have been processed.
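Once the processed-SNP count has been read from the log, the remaining time can be extrapolated linearly. The sketch below assumes the processing rate stays roughly constant, which may not hold in practice. The function name and the numbers in the usage line are hypothetical, and the log format itself is not parsed here because it is not described in this thread.

```python
from datetime import timedelta


def estimate_remaining(elapsed, snps_done, snps_total):
    """Linearly extrapolate the remaining run time from SNPs processed so far.

    Assumes a constant processing rate; hypothetical helper, not part of cellSNP.
    """
    if snps_done <= 0:
        raise ValueError("need at least one processed SNP to extrapolate")
    if snps_done > snps_total:
        raise ValueError("snps_done exceeds snps_total")
    # remaining / elapsed == SNPs left / SNPs done, under a constant rate
    return elapsed * ((snps_total - snps_done) / snps_done)


# Made-up counts, for illustration only.
print(estimate_remaining(timedelta(days=18), snps_done=600_000, snps_total=1_000_000))
```

With --nproc greater than 1 the work is spread over several processes, so the count should cover all of them before it is plugged in.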

Yuanhua

PeteHaitch · 2020-02-04T05:20:15Z

Thanks, Yuanhua.

I've started looking into providing a much-reduced set of candidate SNPs.
Might I suggest adding to the documentation to explain how cellSNP scales in the number of barcodes, candidate SNPs, and number of reads?
It would also be useful to have a reduced set of candidate SNPs for common use cases, e.g., SNP_AF5e4 or SNP_AF5e2 intersected with 3' UTRs (or similar) for use with 10X 3' scRNA-seq data.

micans · 2020-07-29T22:25:40Z

Thank you for this great tool.
We are running cellSNP without a VCF file (10x data mode 2) and it has now been running for a week.
Is there any downside to further parallelising by running a separate cellSNP process for each chromosome using --chrom (this would give me greater flexibility in task distribution)? Do you have any other/further recommendations for speeding up processing?
Thanks,
Stijn
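For reference, the per-chromosome idea could be scripted along these lines. This is only a sketch: it uses just the flags that appear in this thread (--samFile, --barcodeFile, --outDir, --chrom, --nproc), the helper name and output directory layout are invented for the example, and any other options a real run needs (e.g., --minMAF, --minCOUNT) would have to be added. Chromosome names are assumed to match those in the BAM header, and whether merged per-chromosome results are equivalent to a single run is not confirmed here.

```python
import shlex


def per_chrom_commands(bam, barcodes, out_root, chroms, nproc=1):
    """Build one cellSNP argument list per chromosome.

    Hypothetical helper; only uses flags mentioned in this thread.
    """
    commands = []
    for chrom in chroms:
        commands.append([
            "cellSNP",
            "--samFile", bam,
            "--barcodeFile", barcodes,
            "--outDir", f"{out_root}/chr_{chrom}",  # separate output per job
            "--chrom", str(chrom),
            "--nproc", str(nproc),
        ])
    return commands


# Print shell-ready command lines, e.g. to paste into a job scheduler script.
for cmd in per_chrom_commands("possorted_genome_bam.bam", "barcodes.txt",
                              "cellSNP_out", ["1", "2", "X"]):
    print(shlex.join(cmd))
```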

vincycheng mentioned this issue Jun 14, 2022

Proper steps for cellSNP and Vireo for large dataset single-cell-genetics/vireo#61
