Issue with Contig Assembly in TELR #37

Open
prakashnarayanan98 opened this issue Dec 1, 2023 · 3 comments

Comments

@prakashnarayanan98

Description:

Sample of processing error:
Successfully created the directory 
/TELR/intermediate_files/vcf_ins_repeatmask 
RepeatMasker version open-4.0.7
Search Engine: NCBI/RMBLAST [ 2.6.0+ ]
Rebuilding RepeatMaskerLib.embl library
  - Read in 216 sequences from /miniconda3/envs/TELR/share/RepeatMasker/Libraries/DfamConsensus.embl
RepeatMaskerLib.embl: 216 total sequences.
Master RepeatMasker Database: /miniconda3/envs/TELR/share/RepeatMasker/Libraries/RepeatMaskerLib.embl ( Complete Database: dc20170127 )
Custom Repeat Library: 
/TELR/intermediate_files/LIBRARY.fasta


Warning...unknown stuff <
>
Building general libraries in: /miniconda3/envs/TELR/share/RepeatMasker/Libraries/dc20170127/general

analyzing file /TELR/intermediate_files/Read.vcf_ins.fasta
identifying matches to LIBRARY.fasta  sequences in batch 1 of 11
identifying matches to LIBRARY.fasta sequences in batch 2 of 11
identifying matches to LIBRARY.fasta sequences in batch 3 of 11
identifying matches to LIBRARY.fasta sequences in batch 4 of 11
identifying matches to LIBRARY.fasta sequences in batch 5 of 11
identifying matches to LIBRARY.fasta sequences in batch 6 of 11
identifying matches to LIBRARY.fasta sequences in batch 7 of 11
identifying matches to LIBRARY.fasta sequences in batch 8 of 11
identifying matches to LIBRARY.fasta sequences in batch 9 of 11
identifying matches to LIBRARY.fasta sequences in batch 10 of 11
identifying matches to LIBRARY.fasta sequences in batch 11 of 11
processing output: 
cycle 1 .
cycle 2 .
Generating output... .
masking
done
Successfully created the directory /TELR/intermediate_files/sv_reads 
Successfully created the directory /TELR/intermediate_files/contig_assembly 
assembly failed
assembly failed
assembly failed
assembly failed

assembly failed
assembly failed
assembly failed
assembly failed
assembly failed
assembly failed
assembly failed
assembly failed
assembly failed
assembly failed
assembly failed
assembly failed
assembly failed
assembly failed
assembly failed
assembly failed
assembly failed
assembly failed
assembly failed
assembly failed
assembly failed
Successfully created the directory /TELR/intermediate_files/vcf_seq2contig 
Use repeatmasker to annotate contig TE families instead of minimap2
Successfully created the directory /TELR/intermediate_files/contig_te_repeatmask 
RepeatMasker version open-4.0.7
Search Engine: NCBI/RMBLAST [ 2.6.0+ ]
Master RepeatMasker Database: /miniconda3/envs/TELR/share/RepeatMasker/Libraries/RepeatMaskerLib.embl ( Complete Database: dc20170127 )
Custom Repeat Library: /TELR/intermediate_files/LIBRARY.fasta


Warning...unknown stuff <
>

analyzing file /TELR/intermediate_files/Read.fa
identifying matches to LIBRARY.fasta sequences in batch 1 of 10
identifying matches to LIBRARY.fasta sequences in batch 2 of 10
identifying matches to LIBRARY.fasta sequences in batch 3 of 10
identifying matches to LIBRARY.fasta sequences in batch 4 of 10
identifying matches to LIBRARY.fasta sequences in batch 5 of 10
identifying matches to LIBRARY.fasta sequences in batch 6 of 10
identifying matches to LIBRARY.fasta sequences in batch 7 of 10
identifying matches to LIBRARY.fasta sequences in batch 8 of 10
identifying matches to LIBRARY.fasta sequences in batch 9 of 10
identifying matches to LIBRARY.fasta sequences in batch 10 of 10
processing output: 
cycle 1 .
cycle 2 .
Generating output... .
masking
done
Done

Successfully created the directory /TELR/intermediate_files/telr_reads 
Scf_2L_22107544_22107544 no assembly
Scf_2L_22734202_22734202 no assembly
Scf_2R_380878_380881 no assembly
Scf_2R_2670123_2670123 no assembly
Scf_3L_23424019_23424021 no assembly
Scf_NODE_103476_626_627 no assembly
Scf_NODE_105063_6969_6970 no assembly
Scf_NODE_11571_24517_24519 no assembly
Scf_NODE_12809_1023_1023 no assembly
Scf_NODE_18214_489_489 no assembly
Scf_NODE_24465_1162_1163 no assembly
Scf_NODE_26715_949_952 no assembly
Scf_NODE_3168_468_468 no assembly
Scf_NODE_36936_1052_1057 no assembly
Scf_NODE_37551_815_815 no assembly
Scf_NODE_39506_601_603 no assembly
Scf_NODE_46678_5042_5042 no assembly
Scf_NODE_5267_896_897 no assembly
Scf_NODE_59901_2861_2861 no assembly
Scf_NODE_60709_627_628 no assembly
Scf_NODE_68951_87_88 no assembly
Scf_NODE_69473_1091_1091 no assembly
Scf_NODE_72290_1975_1976 no assembly
Scf_NODE_76112_1320_1320 no assembly
Scf_NODE_98642_1306_1307 no assembly
Successfully created the directory /TELR/intermediate_files/ref_repeatmask 
RepeatMasker version open-4.0.7
Search Engine: NCBI/RMBLAST [ 2.6.0+ ]
Master RepeatMasker Database: /miniconda3/envs/TELR/share/RepeatMasker/Libraries/RepeatMaskerLib.embl ( Complete Database: dc20170127 )
Custom Repeat Library: /TELR/intermediate_files/LIBRARY.fasta


Warning...unknown stuff <
>

Environment:

  • Operating System: x86_64 GNU/Linux
  • Command Executed: telr -i read.fastq -l library.fasta -r reference.fasta

Issue:
Contig assembly in TELR fails repeatedly ("assembly failed"), and several insertion candidates are later reported as "no assembly".

Observed Behavior:

  • Successful creation of directories and initial processes.
  • Consistent failure in the contig assembly phase.
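
To put a number on "consistent failure", the console output can be tallied. In the excerpt above, "assembly failed" and "no assembly" each appear 25 times. The following is a minimal sketch, not part of TELR: the function name is hypothetical, and reading a locus label such as Scf_2L_22107544_22107544 as contig, start, end is an assumption based on how the labels look in the output.

```python
import re
from collections import Counter

# Matches lines such as "Scf_2L_22107544_22107544 no assembly".
# Assumption: the label is <contig>_<start>_<end>; TELR's docs are not quoted here.
NO_ASSEMBLY = re.compile(r"^(?P<contig>\S+)_(?P<start>\d+)_(?P<end>\d+) no assembly$")


def summarize_console_output(text):
    """Count failure messages and collect the loci reported as 'no assembly'."""
    counts = Counter()
    loci = []
    for raw in text.splitlines():
        line = raw.strip()
        if line == "assembly failed":
            counts["assembly failed"] += 1
            continue
        match = NO_ASSEMBLY.match(line)
        if match:
            counts["no assembly"] += 1
            loci.append((match["contig"], int(match["start"]), int(match["end"])))
    return counts, loci


sample = """\
assembly failed
assembly failed
Scf_2L_22107544_22107544 no assembly
Scf_NODE_103476_626_627 no assembly
masking
"""
counts, loci = summarize_console_output(sample)
print(counts["assembly failed"], counts["no assembly"])
print(loci)
```

Comparing these counts with the number of insertion candidates in the VCF would show whether the failures affect most loci or only a subset.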

Logs:

11/29/2023 06:05:31: INFO: Parsing input files...
11/29/2023 06:05:31: INFO: Raw reads are provided
11/29/2023 06:05:31: INFO: Start alignment...
11/29/2023 22:24:36: INFO: Sort and index BAM...
11/29/2023 22:41:39: INFO: First alignment finished in 16 hours 36 minutes 8 seconds
11/29/2023 22:41:39: INFO: Detecting SVs from BAM file...
11/29/2023 23:00:59: INFO: SV detection finished in 19 minutes 19 seconds
11/29/2023 23:00:59: INFO: Parse structural variant VCF...
11/29/2023 23:02:33: INFO: Perform local assembly of non-reference TE loci...
11/30/2023 00:58:32: INFO: Local assembly finished in 1 hours 55 minutes 58 seconds
11/30/2023 00:58:32: INFO: Annotate contigs...
11/30/2023 01:02:07: INFO: Estimating allele frequency...
11/30/2023 01:02:46: INFO: Perform local realignment...
11/30/2023 01:12:33: INFO: Local realignment finished in 9 minutes 46 seconds
11/30/2023 01:13:58: INFO: Allele frequency estimation finished in 11 minutes 11 seconds
11/30/2023 03:25:49: INFO: Map contigs to reference...
11/30/2023 04:10:01: INFO: Write output...
11/30/2023 04:10:09: INFO: TELR finished in 22 hours 4 minutes 37 seconds
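
The log above shows that the first alignment accounts for roughly 16.5 of the 22 hours. A small sketch for turning such a log into per-step gaps follows; the helper names are hypothetical, and the timestamp format is inferred from the lines shown here rather than from TELR documentation.

```python
from datetime import datetime

# Timestamp layout inferred from lines like "11/29/2023 06:05:31: INFO: ..."
LOG_FORMAT = "%m/%d/%Y %H:%M:%S"


def parse_log(text):
    """Return (timestamp, message) pairs for every INFO line."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if ": INFO: " not in line:
            continue
        stamp, message = line.split(": INFO: ", 1)
        entries.append((datetime.strptime(stamp, LOG_FORMAT), message))
    return entries


def gaps(entries):
    """Elapsed time between each message and the one that follows it."""
    return [
        (message, next_time - time)
        for (time, message), (next_time, _) in zip(entries, entries[1:])
    ]


sample = """\
11/29/2023 06:05:31: INFO: Start alignment...
11/29/2023 22:24:36: INFO: Sort and index BAM...
11/29/2023 22:41:39: INFO: First alignment finished in 16 hours 36 minutes 8 seconds
"""
step_gaps = gaps(parse_log(sample))
for message, delta in step_gaps:
    print(f"{delta}  {message}")
```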

Additional Information:

  • RepeatMaskerLib.embl: 216 total sequences.
  • Custom Repeat Library

Notes:

  • Warning: Unknown content encountered in the process output.

This issue is hindering the progress of the project. Any assistance or guidance in resolving this matter would be greatly appreciated.

@shunhuahan
Contributor

shunhuahan commented Dec 6, 2023

Hi @prakashnarayanan98,

Thanks for reporting the error. A few things:

  1. Can you describe the input FASTQ data, the library, and the reference genome? We have tested TELR on Drosophila melanogaster datasets but not on other species.
  2. The "assembly failed" message means the assembler could not produce contigs. Based on your command line, I think you are using the default wtdbg2 assembler. Can you try telr -i read.fastq -l library.fasta -r reference.fasta --assembler flye --polisher flye and see whether switching to flye for assembly and polishing helps?

Thanks,
Shunhua
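
For anyone scripting reruns, the command suggested above can be assembled programmatically. This is only a sketch: build_telr_command is a hypothetical helper, and it uses no flags beyond the ones that appear in this thread (-i, -l, -r, --assembler, --polisher).

```python
import shlex


def build_telr_command(reads, library, reference, assembler=None, polisher=None):
    """Build the argument list for a TELR run using flags quoted in this thread."""
    cmd = ["telr", "-i", reads, "-l", library, "-r", reference]
    if assembler:
        cmd += ["--assembler", assembler]
    if polisher:
        cmd += ["--polisher", polisher]
    return cmd


cmd = build_telr_command(
    "read.fastq", "library.fasta", "reference.fasta",
    assembler="flye", polisher="flye",
)
print(shlex.join(cmd))

# To launch the run (requires TELR to be installed and on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Keeping the arguments as a list avoids shell quoting problems if the file paths contain spaces.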

@prakashnarayanan98
Author

FASTQ Data:

  • Species: Drosophila simulans
  • Fastq Data Type: Non-coding RNA

Library:
chakraborty_simulans_TE

  • Example Header:
    >TcMar-Mariner:MARINA
    gataagtccccggtctgacacatagatggcgtcgctagtatta
    

Reference Genome:

  • Genome: dsim-all-chromosome-r2.02
  • Example Header:
    >Scf_2L type=golden_path_region; loc=Scf_2L:1..23539531; ID=Scf_2L; dbxref=GB:CM002910; MD5=4db334c02c86dfa856dc1a48c595acf1; length=23539531; release=r2.02; species=Dsim;
    TTTGTGCAGTTAGAGTGGGCGTGGCAACATGTGTCAATAAACCTACGCTGCGTCTATGTCTCAAAATCTGTACGCTGAAT
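
Nothing in this thread establishes what triggers the RepeatMasker "unknown stuff" warning. As a generic precaution only, the library and reference FASTA files can be checked for a few common formatting problems. The sketch below is an assumption-laden example: check_fasta_lines is a hypothetical helper, and the set of accepted characters (IUPAC nucleotide codes) is my choice, not something RepeatMasker or TELR specifies here.

```python
def check_fasta_lines(lines):
    """Yield (line_number, problem) for a few common FASTA formatting issues."""
    # Assumption: sequences should contain only IUPAC nucleotide codes.
    allowed = set("ACGTUNRYSWKMBDHVacgtunryswkmbdhv")
    seen_header = False
    for number, line in enumerate(lines, start=1):
        if line.endswith("\r\n") or line.endswith("\r"):
            yield number, "carriage return (Windows line ending)"
        text = line.rstrip("\r\n")
        if not text.strip():
            yield number, "blank line"
        elif text.startswith(">"):
            seen_header = True
            if len(text) == 1:
                yield number, "empty header"
        else:
            if not seen_header:
                yield number, "sequence before first header"
            unexpected = set(text) - allowed
            if unexpected:
                yield number, "unexpected characters: " + "".join(sorted(unexpected))


sample = [
    ">TcMar-Mariner:MARINA\n",
    "gataagtccccggtctgacacatag\n",
    "\n",
    ">te2\r\n",
    "ACGT<\n",
]
problems = list(check_fasta_lines(sample))
for number, problem in problems:
    print(number, problem)
```

When reading from disk, open the file with newline="" (for example, open(path, newline="")) so that carriage returns are not silently translated before the check sees them.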
    

@shunhuahan
Contributor

Thanks @prakashnarayanan98 for providing this info.

We haven't yet tested TELR extensively on D. simulans data, so there is no guarantee that the entire workflow will be issue-free for this species. Did you get successful assemblies for most insertion candidates? If assembly fails for only a small subset of the non-reference TE insertions, you can look into rescuing those assemblies ad hoc. Below are files you can use for this purpose.

If you provide --keep_files when running TELR, all intermediate files will be kept under <output_dir>/intermediate_files.

  • Sniffles VCF with all read IDs associated with each insertion candidate locus is available at /intermediate_files/reads.vcf
  • Raw reads and assembly intermediate files are available at /intermediate_files/sv_reads. Assembly results for all candidate loci are available at /intermediate_files/contig_assembly. They can be used to diagnose assembly errors and test your own assembly strategies.
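
As a starting point for that diagnosis, the files in the contig_assembly directory can be split into empty and non-empty ones. This is a rough sketch under stated assumptions: split_by_size is a hypothetical helper, the demo file names are invented, and whether a failed assembly leaves an empty file or no file at all is not stated in this thread.

```python
import tempfile
from pathlib import Path


def split_by_size(directory):
    """Return (non_empty, empty) lists of file names directly under directory."""
    non_empty, empty = [], []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        target = non_empty if path.stat().st_size > 0 else empty
        target.append(path.name)
    return non_empty, empty


# Demonstration on a throwaway directory; the file names are hypothetical
# and do not reflect how TELR actually names its per-locus outputs.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "locus_a.fa").write_text(">contig\nACGT\n")
    (Path(tmp) / "locus_b.fa").write_text("")
    ok, failed = split_by_size(tmp)

print(ok, failed)
```

On a real run, pass the path to the contig_assembly directory under intermediate_files instead of the temporary directory.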
