Creation of ${GTF_FILE_BASE}.gene.bed and ${GTF_FILE_BASE}.exon.bed needs to operate off column matching #58

Open
bryce-turner opened this issue Dec 10, 2021 · 1 comment
@bryce-turner

# Create a bed file for the start and stop for each gene
awk -F '[\t"]' '$1 !~ /^#/ { if (a[$10] == "" ) { a[$10] = $1 ; b[$10] = $4 ; c[$10] = $5 ; next } ;
if ($4 < b[$10]) { b[$10] = $4 } ;
if ($5 > c[$10]) { c[$10] = $5 }
} END {
for (i in a) {
OFS = "\t" ; print a[i], b[i], c[i], i
}
}' ${GTF_FILE} | sort -k1,1V -k2,2n -k3,3n > ${GTF_FILE_BASE}.gene.bed
# Create a bed file for the start and stop of each exon for each gene
awk -F '[\t"]' '$1 !~ /^#/ { if ($3 == "exon") { OFS = "\t" ; print $1, $4, $5, $10 }}' ${GTF_FILE} | sort -k1,1V -k2,2n -k3,3n > ${GTF_FILE_BASE}.exon.bed

We grab the value in field 10 (i.e. $10, after splitting on tabs and double quotes), but this is the gene_id value only when gene_id is the first attribute on the line. For example, we have the following for CanFam3.1 Ensembl 98:

X	ensembl	gene	21422	25435	.	+	.	gene_source "ensembl"; gene_biotype "lncRNA"; transcript_id "ENSCAFG00000039510"; transcript_name "ENSCAFG00000039510"; gene_version "2"; gene_name "ENSCAFG00000039510"; gene_id "ENSCAFG00000039510"
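
For reference, here is a quick check of what the current split yields on that record. It is only a sketch: the record is the one above with tabs written as \t, and the variable names record and field10 are made up for illustration.

```shell
# The CanFam3.1 record from above, with tabs written as \t
record=$(printf 'X\tensembl\tgene\t21422\t25435\t.\t+\t.\tgene_source "ensembl"; gene_biotype "lncRNA"; transcript_id "ENSCAFG00000039510"; transcript_name "ENSCAFG00000039510"; gene_version "2"; gene_name "ENSCAFG00000039510"; gene_id "ENSCAFG00000039510"')

# Same field separator as the current commands: split on tabs and double quotes
field10=$(printf '%s\n' "$record" | awk -F '[\t"]' '{ print $10 }')
echo "$field10"
```

This prints ensembl, the gene_source value, rather than ENSCAFG00000039510.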

Instead, we might be able to find the field that holds the gene_id key and take the field right after it. For example:

awk -F '[\t"]' '$1 !~ /^#/ {
        # With this separator the attribute keys sit in fields 9, 11, 13, ... (e.g. "; gene_id ")
        # and each quoted value is the following field, so find the key and take the next field
        g = "" ; for (i = 9; i < NF; i += 2) { if ($i ~ /(^|[; ])gene_id $/) { g = $(i+1) ; break } }
        if (g == "") { next } ;
        if (a[g] == "") { a[g] = $1 ; b[g] = $4 ; c[g] = $5 ; next } ;
        if ($4 < b[g]) { b[g] = $4 } ;
        if ($5 > c[g]) { c[g] = $5 }
} END {
for (i in a) {
        OFS = "\t" ; print a[i], b[i], c[i], i
}
}' ${GTF_FILE} | sort -k1,1V -k2,2n -k3,3n > ${GTF_FILE_BASE}.gene.bed
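
The exon.bed command would need the same treatment. A rough sketch of that, looking gene_id up by key instead of by fixed position, might look like the following. The two exon records, the GENE_A/GENE_B/TX_A identifiers, the exon_bed variable and the gene_id() helper are all made up for illustration, and sort -V assumes GNU sort, as the existing commands already do.

```shell
# Two made-up exon records: gene_id is the first attribute in one, the last in the other
GTF_FILE=$(mktemp)
printf '1\ttest\texon\t100\t200\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A";\n' > "$GTF_FILE"
printf '1\ttest\texon\t300\t400\t.\t+\t.\tgene_source "test"; gene_id "GENE_B";\n' >> "$GTF_FILE"

exon_bed=$(awk -F '[\t"]' '
  # Hypothetical helper: return the value following the gene_id key, or "" if the key is absent.
  # Keys sit in fields 9, 11, 13, ... when splitting on tabs and double quotes.
  function gene_id(    i) {
    for (i = 9; i < NF; i += 2) { if ($i ~ /(^|[; ])gene_id $/) { return $(i + 1) } }
    return ""
  }
  BEGIN { OFS = "\t" }
  $1 !~ /^#/ && $3 == "exon" { g = gene_id() ; if (g != "") { print $1, $4, $5, g } }
' "$GTF_FILE" | sort -k1,1V -k2,2n -k3,3n)

printf '%s\n' "$exon_bed"
rm -f "$GTF_FILE"
```

Both records end up with their own gene_id in the fourth column, regardless of where gene_id appears in the attribute list.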
@bryce-turner bryce-turner self-assigned this Dec 10, 2021
@PedalheadPHX

This should really be based on a program that can read a GTF. Maybe look at whether we can adopt the container we were working to sort out; it has lots of functions to parse GFF and GTF files that might simplify this. Otherwise, you are correct: we need to discover the position.
