Integration of RelocaTE2 results into the relocate2_nonredundant.bed and summary files appears to be incorrect #95

Open
andreabours opened this issue May 24, 2022 · 2 comments

Comments

@andreabours

Hi,

I'm testing your pipeline on a few individuals for which I have subsetted data (to see how well the programs perform at different read depths on my specific species and data). Looking at the summary output, I saw that RelocaTE2 reports exactly the same number of reference TEs at every depth (8x, 16x, 24x, and 32x), and the same number across individuals as well.
When I checked the unfiltered All.all_ref_insert.gff file, I saw that many of these reported reference TEs have no junction reads and no left/right support reads, so I assume they should be excluded:

chr_10	RelocaTE2	not_given	1136933	1137334	.	-	.	ID=repeat_chr_10_1136933_1137334;TSD=Reference_Only;Note=Reference_Only;Right_junction_reads:0;Left_junction_reads:0;Right_support_reads:0;Left_support_reads:0;
chr_10	RelocaTE2	not_given	1152973	1153332	.	+	.	ID=repeat_chr_10_1152973_1153332;TSD=Reference_Only;Note=Reference_Only;Right_junction_reads:0;Left_junction_reads:0;Right_support_reads:0;Left_support_reads:0;
chr_10	RelocaTE2	not_given	1162183	1162264	.	-	.	ID=repeat_chr_10_1162183_1162264;TSD=Reference_Only;Note=Reference_Only;Right_junction_reads:0;Left_junction_reads:0;Right_support_reads:0;Left_support_reads:0;

However, these same records appear in the summary nonredundant.bed file:

chr_10	1136932	1137334	CR1_J1a_fAlb_LINE_CR1|reference|NA|4L23062_L1_L2_L3_R1|relocate2|sr|1601	0	-
chr_10	1152972	1153332	SylAtr025_LTR_ERV2|reference|NA|4L23062_L1_L2_L3_R1|relocate2|sr|1602	0	+
chr_10	1162182	1162264	SylAtr318_LINE_CR1|reference|NA|4L23062_L1_L2_L3_R1|relocate2|sr|1603	0	-

I can also find them in the summary HTML output, where they shouldn't appear.
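For anyone wanting to reproduce the match between the two files, here is a minimal sketch of how a GFF record can be paired with its BED row. It assumes both files are tab-separated and that the BED start is the GFF start minus one (0-based), which is what the coordinates above suggest. The helper names `gff_key` and `bed_key` are hypothetical and not part of McClintock.

```python
def gff_key(line):
    """Return (chrom, 0-based start, end, strand) for a tab-separated GFF line."""
    f = line.rstrip("\n").split("\t")
    # GFF columns: 1=seqid, 4=start (1-based), 5=end, 7=strand
    return (f[0], int(f[3]) - 1, int(f[4]), f[6])


def bed_key(line):
    """Return (chrom, start, end, strand) for a tab-separated 6-column BED line."""
    f = line.rstrip("\n").split("\t")
    # BED columns: 1=chrom, 2=start (0-based), 3=end, 6=strand
    return (f[0], int(f[1]), int(f[2]), f[5])


# First record from each excerpt above.
gff_line = (
    "chr_10\tRelocaTE2\tnot_given\t1136933\t1137334\t.\t-\t.\t"
    "ID=repeat_chr_10_1136933_1137334;TSD=Reference_Only;Note=Reference_Only;"
    "Right_junction_reads:0;Left_junction_reads:0;"
    "Right_support_reads:0;Left_support_reads:0;"
)
bed_line = (
    "chr_10\t1136932\t1137334\t"
    "CR1_J1a_fAlb_LINE_CR1|reference|NA|4L23062_L1_L2_L3_R1|relocate2|sr|1601\t0\t-"
)

# Both keys describe the same interval and strand.
same_record = gff_key(gff_line) == bed_key(bed_line)
print(same_record)
```

Building a set of `bed_key` values and looking up each `gff_key` in it would list every unsupported GFF record that still reaches the BED file.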

This also happens when the tag reports insufficient data:

chr_4	RelocaTE2	not_given	4250994	4251609	.	-	.	ID=repeat_chr_4_4250994_4251609;TSD=insufficient_data;Note=insufficient_data;Right_junction_reads:0;Left_junction_reads:5;Right_support_reads:1;Left_support_reads:0;

This record also appears in the summary and the BED file, although it shouldn't.

Interestingly, if a reference TE is not found at all in the sample, it is correctly excluded from the summary outputs. To clarify, what I'm seeing is that as soon as a reference TE is found once (with appropriate confidence), all other occurrences listed in the GFF are also included, even when they have no support at all.
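To make the expected behaviour concrete, here is a rough sketch of the filter I have in mind. It is only an illustration of my expectation, not McClintock's or RelocaTE2's actual logic: it assumes a record should be dropped when its `Note` tag is `Reference_Only` or `insufficient_data`, and the function names are hypothetical. Note that the attribute column mixes `key=value` and `key:value` pairs, so the parser accepts both separators.

```python
import re

# Assumption: these Note values mark records that should not be reported.
UNSUPPORTED_NOTES = {"Reference_Only", "insufficient_data"}
READ_KEYS = (
    "Right_junction_reads",
    "Left_junction_reads",
    "Right_support_reads",
    "Left_support_reads",
)


def parse_attributes(column9):
    """Parse the GFF attribute column, which uses both '=' and ':' separators."""
    attrs = {}
    for item in column9.strip().split(";"):
        parts = re.split(r"[=:]", item, maxsplit=1)
        if len(parts) == 2:
            attrs[parts[0]] = parts[1]
    return attrs


def total_reads(attrs):
    """Sum the four junction/support read counts (missing keys count as 0)."""
    return sum(int(attrs.get(key, 0)) for key in READ_KEYS)


def is_unsupported(attrs):
    """True when the Note tag says the call lacks evidence in this sample."""
    return attrs.get("Note") in UNSUPPORTED_NOTES


# Attribute columns copied from the GFF records quoted in this issue.
ref_only = (
    "ID=repeat_chr_10_1136933_1137334;TSD=Reference_Only;Note=Reference_Only;"
    "Right_junction_reads:0;Left_junction_reads:0;"
    "Right_support_reads:0;Left_support_reads:0;"
)
insufficient = (
    "ID=repeat_chr_4_4250994_4251609;TSD=insufficient_data;Note=insufficient_data;"
    "Right_junction_reads:0;Left_junction_reads:5;"
    "Right_support_reads:1;Left_support_reads:0;"
)

for column9 in (ref_only, insufficient):
    attrs = parse_attributes(column9)
    print(attrs["ID"], total_reads(attrs), is_unsupported(attrs))
```

Filtering on the `Note` tag rather than on the read counts alone matters for the `insufficient_data` case, where some reads are present but RelocaTE2 still does not consider the call supported.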

I'm happy to share the files if needed.

Best,
Andrea

@cbergman
Member

Hi @andreabours

Thanks for reporting this issue. This looks like an area for improvement in how McClintock parses RelocaTE2 output. It would be helpful to have test files to replicate this result. Can you email me offline to arrange a data transfer?

Thanks
Casey

@andreabours
Author

andreabours commented May 24, 2022 via email
