Annotation steps are repeated for each sample after preprocessing with --make_annotations #124

Open
craftor18 opened this issue May 6, 2024 · 7 comments
Comments

@craftor18

craftor18 commented May 6, 2024

Hi, I am using the --make_annotations preprocessing step to run TE detection on multiple samples. After the --make_annotations run finished, I used -1 and -2 to add a sample and found that the annotation step is repeated for relocate2 and other steps, which means RepeatMasker is run again for every sample. Could you help me? Below is the command for my annotation step:

nohup python3 ~/software/mcclintock/mcclintock.py -c ref/TElib.fa -r ref/C.auratus.chromosome_20210819.fasta -p 80 -o output_template_all/   --serial --keep_intermediate all  --make_annotations > logs/make_annotation.log &

Then I added a sample and resumed the run:

nohup python3 ~/software/mcclintock/mcclintock.py -r ref/C.auratus.chromosome_20210819.fasta -c ref/TElib.fa -1 input_fastq/bc17_1.fastq.gz -2 input_fastq/bc17_2.fastq.gz -p 40 -m relocate,TEMP,ngs_te_mapper  -o ./output_template_all/  --resume    > logs/bc17_template.log &

I find that the RepeatMasker and bwa index steps were re-run. Could you please tell me why? Thanks.
Best wishes!

@craftor18
Author

In addition, I find that if I pass the TSV and GFF files directly as input and create a new folder for the sample's TE detection, mcclintock.py keeps running and never proceeds to the next step.

@craftor18
Author

craftor18 commented May 6, 2024

-rw-rw-r-- 1 zengy zengy 1.2K May  7 00:06 bc17.log
(mcclintock) zengy@LiusServer:/data/zengy/reseq_C.auratus/work/mcclintock/logs$ cat bc17.log
SETUP            checking fasta: /data/zengy/reseq_C.auratus/work/mcclintock/ref/C.auratus.chromosome_20210819.fasta
SETUP            checking fastq: /data/zengy/reseq_C.auratus/work/mcclintock/input_fastq/bc17_1.fastq.gz
SETUP            checking fastq: /data/zengy/reseq_C.auratus/work/mcclintock/input_fastq/bc17_2.fastq.gz
SETUP            checking fasta: /data/zengy/reseq_C.auratus/work/mcclintock/ref/TElib.fa
SETUP            McClintock Version: 702acb4baacf53c732df84b9678490b8ea199495
SETUP            Checking config files to ensure previous intermediate files are compatible with this run
Job counts:
        count   jobs
        1       index_reference_genome
        1       make_ref_te_bed
        1       make_reference_fasta
        1       make_te_annotations
        1       map_reads
        1       median_insert_size
        1       ngs_te_mapper_post
        1       ngs_te_mapper_run
        1       process_temp
        1       reference_2bit
        1       relocaTE_consensus
        1       relocaTE_post
        1       relocaTE_ref_gff
        1       relocaTE_run
        1       run_temp
        1       sam_to_bam
        1       setup_reads
        1       summary_report
        1       telocate_taxonomy
        19
PROCESSING       formatting the name of consensus TE fasta headers for compatibility with relocaTE
PROCESSING       relocaTE consensus fasta created
(mcclintock) zengy@LiusServer:/data/zengy/reseq_C.auratus/work/mcclintock/logs$ cat make_annotation.log
SETUP            checking fasta: /data/zengy/reseq_C.auratus/work/mcclintock/ref/C.auratus.chromosome_20210819.fasta
SETUP            checking fasta: /data/zengy/reseq_C.auratus/work/mcclintock/ref/TElib.fa
SETUP            McClintock Version: 702acb4baacf53c732df84b9678490b8ea199495
Job counts:
        count   jobs
        1       make_consensus_fasta
        1       make_reference_fasta
        1       make_te_annotations
        3
PROCESSING       making consensus fasta
PROCESSING       consensus fasta created
PROCESSING       making reference fasta
PROCESSING       reference fasta created
PROCESSING       making reference TE annotations
PROCESSING       no reference TEs provided... finding reference TEs with RepeatMasker &> /data/zengy/reseq_C.auratus/work/mcclintock/output_template_all/logs/20240506.163954.7526138/processing.log
PROCESSING       reference TE annotations created

As shown above, the --make_annotations step and the resumed sample step run the same processing, even though the --make_annotations run had already created all the files I need.
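One way to make the overlap explicit is to compare the "Job counts" tables of the two logs. The following is a minimal Python sketch, not part of McClintock: the function name `parse_job_counts` is hypothetical, and the log text is copied from the two logs pasted above.

```python
import re


def parse_job_counts(log_text):
    """Return {job_name: count} from the 'Job counts:' table in a log."""
    jobs = {}
    in_table = False
    for line in log_text.splitlines():
        stripped = line.strip()
        if stripped == "Job counts:":
            in_table = True
            continue
        if not in_table:
            continue
        match = re.fullmatch(r"(\d+)\s+(\S+)", stripped)
        if match:
            jobs[match.group(2)] = int(match.group(1))
        elif stripped.startswith("count"):
            continue  # header row of the table
        else:
            in_table = False  # the bare total (or any other line) ends the table
    return jobs


# Job tables copied from make_annotation.log and bc17.log in this thread.
annotation_log = """Job counts:
        count   jobs
        1       make_consensus_fasta
        1       make_reference_fasta
        1       make_te_annotations
        3
"""
resume_log = """Job counts:
        count   jobs
        1       index_reference_genome
        1       make_ref_te_bed
        1       make_reference_fasta
        1       make_te_annotations
        1       map_reads
        1       median_insert_size
        1       ngs_te_mapper_post
        1       ngs_te_mapper_run
        1       process_temp
        1       reference_2bit
        1       relocaTE_consensus
        1       relocaTE_post
        1       relocaTE_ref_gff
        1       relocaTE_run
        1       run_temp
        1       sam_to_bam
        1       setup_reads
        1       summary_report
        1       telocate_taxonomy
        19
"""

# Jobs scheduled in both runs, i.e. work the resumed run plans to do again.
repeated = sorted(set(parse_job_counts(annotation_log)) & set(parse_job_counts(resume_log)))
print(repeated)  # ['make_reference_fasta', 'make_te_annotations']
```

For these two logs the overlap is make_reference_fasta and make_te_annotations. The sketch only reports which jobs were scheduled twice; it does not explain why they were rescheduled.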

@cbergman
Member

cbergman commented May 9, 2024

Hi @craftor18

Could you try simplifying your initial --make_annotations execution and using full paths to your directories, e.g.

nohup python3 ~/software/mcclintock/mcclintock.py -r /full/path/to/ref/C.auratus.chromosome_20210819.fasta -c /full/path/to/ref/TElib.fa -p 80 -o /full/path/to/output_template_all/ --make_annotations > /full/path/to/logs/make_annotation.log &
nohup python3 ~/software/mcclintock/mcclintock.py -r /full/path/to/ref/C.auratus.chromosome_20210819.fasta -c /full/path/to/ref/TElib.fa -1 /full/path/to/input_fastq/bc17_1.fastq.gz -2 /full/path/to/input_fastq/bc17_2.fastq.gz -p 40 -m relocate,TEMP,ngs_te_mapper -o /full/path/to/output_template_all/ --resume > /full/path/to/logs/bc17_template.log &

If this doesn't work, can you upload the complete make_annotation.log and bc17_template.log files?

Thanks,
Casey
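If typing the full paths by hand is error-prone, they can be generated. The following is a rough Python sketch that assumes the same relative layout as in the original commands; the helpers `absolute` and `build_command` are hypothetical, and the script only prints the command rather than running it.

```python
import shlex
from pathlib import Path


def absolute(path_text):
    """Expand ~ and return an absolute path; the file does not need to exist."""
    return str(Path(path_text).expanduser().resolve())


def build_command(script, reference, consensus, out_dir, extra_args=()):
    """Assemble a mcclintock.py call in which every path argument is absolute."""
    return [
        "python3", absolute(script),
        "-r", absolute(reference),
        "-c", absolute(consensus),
        "-o", absolute(out_dir),
        *extra_args,
    ]


# Relative paths taken from the commands in the first comment.
cmd = build_command(
    "~/software/mcclintock/mcclintock.py",
    "ref/C.auratus.chromosome_20210819.fasta",
    "ref/TElib.fa",
    "output_template_all/",
    ["-p", "80", "--make_annotations"],
)
print(shlex.join(cmd))  # paste the printed line into the shell, adding nohup and redirection as needed
```

Relative paths are resolved against the directory the script is started from, so it should be run from the same working directory as the original commands.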

@craftor18
Author

Thanks for answering. I'll try full paths, but I do not think it is a path problem. I used --make_annotations to generate an output directory and then used that directory to resume a run for a sample, but it reran the RepeatMasker step to generate the annotation, so I deleted it. Maybe I should try another version of McClintock. Could you please tell me which version I should use: the release version, the master version, or the latency fix version? I am currently on the master version, but I use the mcclintock.py from the latency fix version.
Best wishes

@craftor18
Author

craftor18 commented May 11, 2024

Hello, I've tried another way to prepare my GFF and TSV files. I used EDTA to make the GFF file and then some commands to build my input GFF and TSV files. The GFF is:

(mcclintock) zengy@LiusServer:/data/zengy/reseq_C.auratus/work/non_ref_TE/ref_genome$ head test.gff
LG01    EDTA    Mutator_TIR_transposon  10618026        10621335        .       .       .       ID=TE_struc_145;Name=TE_00000001;Classification=DNA/DTM;Sequence_ontology=SO:0002280;Identity=1;Method=structural;TSD=TGCAGCTGCA_TGCAGCTGCA_100.0;TIR=GCAACTTGCG_CGCAAGTTGC
LG01    EDTA    Mutator_TIR_transposon  15685590        15686152        4775    -       .       ID=TE_homo_272562;Name=TE_00000001;Classification=DNA/DTM;Sequence_ontology=SO:0002280;Identity=0.989;Method=homology
LG01    EDTA    Mutator_TIR_transposon  15688968        15689443        3968    +       .       ID=TE_homo_272566;Name=TE_00000001;Classification=DNA/DTM;Sequence_ontology=SO:0002280;Identity=0.957;Method=homology
LG01    EDTA    Mutator_TIR_transposon  15831079        15831563        3946    -       .       ID=TE_homo_272818;Name=TE_00000001;Classification=DNA/DTM;Sequence_ontology=SO:0002280;Identity=0.953;Method=homology
LG01    EDTA    Mutator_TIR_transposon  15839967        15840529        4754    +       .       ID=TE_homo_272827;Name=TE_00000001;Classification=DNA/DTM;Sequence_ontology=SO:0002280;Identity=0.98;Method=homology
LG01    EDTA    Mutator_TIR_transposon  20330106        20330666        4640    +       .       ID=TE_homo_279690;Name=TE_00000001;Classification=DNA/DTM;Sequence_ontology=SO:0002280;Identity=0.982;Method=homology
LG01    EDTA    Mutator_TIR_transposon  20334906        20335461        4277    +       .       ID=TE_homo_279697;Name=TE_00000001;Classification=DNA/DTM;Sequence_ontology=SO:0002280;Identity=0.949;Method=homology
LG02    EDTA    Mutator_TIR_transposon  11438649        11439221        4686    -       .       ID=TE_homo_390467;Name=TE_00000001;Classification=DNA/DTM;Sequence_ontology=SO:0002280;Identity=0.97;Method=homology
LG02    EDTA    Mutator_TIR_transposon  11439222        11439368        1167    +       .       ID=TE_homo_390468;Name=TE_00000001;Classification=DNA/DTM;Sequence_ontology=SO:0002280;Identity=0.952;Method=homology
LG02    EDTA    Mutator_TIR_transposon  23939864        23940375        3663    +       .       ID=TE_homo_408718;Name=TE_00000001;Classification=DNA/DTM;Sequence_ontology=SO:0002280;Identity=0.967;Method=homology
(mcclintock) zengy@LiusServer:/data/zengy/reseq_C.auratus/work/non_ref_TE/ref_genome$

and tsv is:

(mcclintock) zengy@LiusServer:/data/zengy/reseq_C.auratus/work/non_ref_TE/ref_genome$ head test.tsv
TE_struc_145    TE_00000001
TE_homo_272562  TE_00000001
TE_homo_272566  TE_00000001
TE_homo_272818  TE_00000001
TE_homo_272827  TE_00000001
TE_homo_279690  TE_00000001
TE_homo_279697  TE_00000001
TE_homo_390467  TE_00000001
TE_homo_390468  TE_00000001
TE_homo_408718  TE_00000001

My run command (working directory: /data/zengy/reseq_C.auratus/work/non_ref_TE) is:

nohup python3 ~/software/mcclintock/mcclintock.py -r ./ref_genome/C.auratus.chromosome_20210819.fasta -c ./ref_genome/test.fa -1 ./00_input_fastq_datas/bc17_1.fastq.gz -2 ./00_input_fastq_datas/bc17_2.fastq.gz -p 8 -m relocate2,temp2,ngs_te_mapper2 -o ./06_mcclintock/ --sample_name bc17 -g ./ref_genome/test.gff -t ./ref_genome/test.tsv > mcclintock_bc17.log &

Why does its log look like this?

(mcclintock) zengy@LiusServer:/data/zengy/reseq_C.auratus/work/non_ref_TE$ cat mcclintock_bc17.log
SETUP            checking fasta: /data/zengy/reseq_C.auratus/work/non_ref_TE/ref_genome/C.auratus.chromosome_20210819.fasta
SETUP            checking fastq: /data/zengy/reseq_C.auratus/work/non_ref_TE/00_input_fastq_datas/bc17_1.fastq.gz
SETUP            checking fastq: /data/zengy/reseq_C.auratus/work/non_ref_TE/00_input_fastq_datas/bc17_2.fastq.gz
SETUP            checking fasta: /data/zengy/reseq_C.auratus/work/non_ref_TE/ref_genome/test.fa
SETUP            checking locations gff: /data/zengy/reseq_C.auratus/work/non_ref_TE/ref_genome/test.gff
SETUP            checking taxonomy TSV: /data/zengy/reseq_C.auratus/work/non_ref_TE/ref_genome/test.tsv
SETUP            McClintock Version: 702acb4baacf53c732df84b9678490b8ea199495
Job counts:
        count   jobs
        1       index_reference_genome
        1       make_consensus_fasta
        1       make_ref_te_bed
        1       make_reference_fasta
        1       make_te_annotations
        1       map_reads
        1       median_insert_size
        1       ngs_te_mapper2_post
        1       ngs_te_mapper2_pre
        1       ngs_te_mapper2_run
        1       process_temp2
        1       reference_2bit
        1       relocaTE2_post
        1       relocaTE2_run
        1       repeatmask
        1       run_temp2
        1       sam_to_bam
        1       setup_reads
        1       summary_report
        1       telocate_taxonomy
        20
PROCESSING       making consensus fasta
PROCESSING       consensus fasta created
PROCESSING       making reference fasta
PROCESSING       reference fasta created
PROCESSING       creating 2bit file from reference genome fasta &> /data/zengy/reseq_C.auratus/work/non_ref_TE/06_mcclintock/logs/20240511.205850.2061338/processing.log
PROCESSING       reference 2bit file created
Failed to solve scheduling problem with ILP solver. Falling back to greedy solver.Run Snakemake with --verbose to see the full solver output for debugging the problem.

This is not really an error, and the program is still running. But I have supplied a GFF and a TSV file, so why does the program still run RepeatMasker to regenerate the annotation files?
Below are the processes that are currently running:

top - 21:07:45 up 2 days, 10:09,  2 users,  load average: 5.35, 4.50, 4.60
Tasks: 817 total,   6 running, 811 sleeping,   0 stopped,   0 zombie
%Cpu(s):  5.0 us,  1.3 sy,  0.0 ni, 93.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 128393.1 total,   3840.5 free,  10551.6 used, 115048.4 buff/cache
MiB Swap:   8192.0 total,   7804.0 free,    388.0 used. 117841.5 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
4037324 zengy     20   0  992472 581176   2308 R 100.0   0.4   8:03.51 bwa index /data/zengy/reseq_C.auratus/work/non_ref_TE/06_mcclintock/C.auratus.chromosome_20210819/genome_fasta/C.auratus.chromosome_20210819.fasta
4040162 zengy     20   0    3560   1812   1400 R  99.3   0.0   3:30.79 gzip -cd /data/zengy/reseq_C.auratus/work/non_ref_TE/00_input_fastq_datas/bc17_2.fastq.gz
4046688 zengy     20   0  514828 489860   2824 R  30.9   0.4   0:00.94 perl /home/zengy/software/mcclintock/install/envs/conda/46211ea1/share/RepeatMasker/RepeatMasker -pa 3 -lib /data/zengy/reseq_C.auratus/work/non_ref_TE/06_mcclintock/C.auratus.chromosome_2+
4046699 zengy     20   0  514828 490736   3696 R  18.8   0.4   0:00.57 perl /home/zengy/software/mcclintock/install/envs/conda/46211ea1/share/RepeatMasker/RepeatMasker -pa 3 -lib /data/zengy/reseq_C.auratus/work/non_ref_TE/06_mcclintock/C.auratus.chromosome_2+
4037240 zengy     20   0  527052 504164   4904 S  12.5   0.4   3:21.62 perl /home/zengy/software/mcclintock/install/envs/conda/46211ea1/share/RepeatMasker/RepeatMasker -pa 3 -lib /data/zengy/reseq_C.auratus/work/non_ref_TE/06_mcclintock/C.auratus.chromosome_2+
4046716 zengy     20   0  526512 502416   3696 R   4.6   0.4   0:00.14 perl /home/zengy/software/mcclintock/install/envs/conda/46211ea1/share/RepeatMasker/RepeatMasker -pa 3 -lib /data/zengy/reseq_C.auratus/work/non_ref_TE/06_mcclintock/C.auratus.chromosome_2+

Can you explain it? Thank you very much!
Best wishes
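Before looking further, it may be worth ruling out a mismatch between the two input files. The following is a minimal Python sketch of such a check. It assumes that the first TSV column is expected to match a unique ID= attribute in the GFF (my reading of the McClintock input description, so please verify), and the helper names are hypothetical. The sample lines are taken from test.gff and test.tsv above, with tab separators assumed.

```python
def gff_feature_ids(gff_lines):
    """Return the ID= attribute of every feature line (column 9 of GFF3)."""
    ids = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and GFF comments/directives
        attributes = line.rstrip("\n").split("\t")[8]
        fields = dict(item.split("=", 1) for item in attributes.split(";") if "=" in item)
        ids.append(fields.get("ID"))
    return ids


def check_taxonomy(gff_lines, tsv_lines):
    """List GFF features with a missing/duplicate ID or no row in the taxonomy TSV."""
    taxonomy = {}
    for line in tsv_lines:
        if line.strip():
            te_id, family = line.rstrip("\n").split("\t")[:2]
            taxonomy[te_id] = family
    problems, seen = [], set()
    for te_id in gff_feature_ids(gff_lines):
        if te_id is None:
            problems.append("feature without ID attribute")
            continue
        if te_id in seen:
            problems.append(f"duplicate ID: {te_id}")
        seen.add(te_id)
        if te_id not in taxonomy:
            problems.append(f"missing from TSV: {te_id}")
    return problems


# Two records copied from test.gff and test.tsv in this comment.
gff = [
    "LG01\tEDTA\tMutator_TIR_transposon\t15685590\t15686152\t4775\t-\t.\t"
    "ID=TE_homo_272562;Name=TE_00000001;Classification=DNA/DTM;"
    "Sequence_ontology=SO:0002280;Identity=0.989;Method=homology",
    "LG01\tEDTA\tMutator_TIR_transposon\t15688968\t15689443\t3968\t+\t.\t"
    "ID=TE_homo_272566;Name=TE_00000001;Classification=DNA/DTM;"
    "Sequence_ontology=SO:0002280;Identity=0.957;Method=homology",
]
tsv = ["TE_homo_272562\tTE_00000001", "TE_homo_272566\tTE_00000001"]

print(check_taxonomy(gff, tsv))      # []
print(check_taxonomy(gff, tsv[:1]))  # ['missing from TSV: TE_homo_272566']
```

For the real files, the lists can be read with `open("test.gff").readlines()` and `open("test.tsv").readlines()`. An empty result only means the IDs are consistent; it does not explain the repeatmask job in the log.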

@craftor18
Author

It seems my input GFF and TSV files only work for the ngs_te_mapper2 method.

@craftor18
Author

I also find that when the TSV and GFF files contain a large number of TE family lines, parsing the parameters takes a very long time.
