Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bioframe.overlap RAM error #219

Open
talipzengin opened this issue Aug 3, 2024 · 1 comment
Open

Bioframe.overlap RAM error #219

talipzengin opened this issue Aug 3, 2024 · 1 comment

Comments

@talipzengin
Copy link

talipzengin commented Aug 3, 2024

Hi. I have a gene table (genomic locations of genes, 63086 rows × 4 columns) and a CNV table including genomic locations of Copy Number Variables (799505 rows × 5 columns; after segment mean filtering: 507673 rows × 6 columns). I want to determine the genes that have variable copy number. Bioframe.overlap gave RAM error ("Your session crashed after using all available RAM.") for large table dimensions in Colab notebook (Colab RAM is 12.7 GB).

I have tried different dimensions and the results:
CNV table: 799505 rows × 5 columns
Gene table (gene_coord): 63086 rows × 4 columns => RAM error

CNV table: 507673 rows × 6 columns
Gene table (gene_coord): 63086 rows × 4 columns => 23018739 rows × 10 columns

CNV table: 507673 rows × 6 columns
Gene table (gene_coord_expanded): 63086 rows × 6 columns => RAM error (It should have returned 23018739 rows × 12 columns.)

My code:

import pandas
data_url = "https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-BRCA.cnv.tsv.gz"
cnv_data = pandas.read_csv(data_url, compression='gzip', header=0, sep='\t', quotechar='"')
cnv_data.columns = ['sample', 'chrom', 'start', 'end', 'value']
cnv_data['chrom'] = 'chr'+cnv_data['chrom'].astype(str)
cnv_data = cnv_data[(cnv_data['value'] < -0.2) | (cnv_data['value'] > 0.2)]
#cnv_data = cnv_data[abs(cnv_data['value']) > 0.2]
cnv_data['aberration'] = cnv_data['value'].apply(lambda x: 'deletion' if x < -0.2 else ('amplification' if x > 0.2 else 'none'))
cnv_data

gene_coord = bioframe.read_table(
    "https://ftp.ensembl.org/pub/release-112/gtf/homo_sapiens/Homo_sapiens.GRCh38.112.gtf.gz",
    schema="gtf", skiprows=5)
gene_coord = gene_coord[gene_coord['feature'] == 'gene']
gene_coord['chrom'] = 'chr'+gene_coord['chrom'].astype(str)
gene_coord = gene_coord[['chrom', 'start', 'end', 'attributes']]
chrs_to_keep = [f'chr{i}' for i in range(1, 23)] + ['chrX', 'chrY', 'chrMT']
gene_coord = gene_coord[gene_coord['chrom'].isin(chrs_to_keep)]
gene_coord

import pandas as pd
import re
def parse_attributes(attributes):
    attr_dict = {}
    matches = re.findall(r'(\w+)\s*"([^"]*)"', attributes)
    for key, value in matches:
        attr_dict[key] = value
    return attr_dict
attributes_df = gene_coord['attributes'].apply(parse_attributes).apply(pd.Series)
gene_coord_expanded = pd.concat([gene_coord, attributes_df], axis=1).drop(columns=['attributes'])
gene_coord_expanded.drop(columns=['gene_version', 'gene_source'], inplace=True)
gene_coord_expanded

import itertools
import numpy
import matplotlib
import matplotlib.pyplot as pyplot
import pandas
import bioframe
import bioframe.vis
cnv_genes = bioframe.overlap(cnv_data, gene_coord_expanded, how='inner', suffixes=('_cnv','_genes'))
@Phlya
Copy link
Member

Phlya commented Aug 5, 2024

Hi Talip, after talking to other core devs it seems like the size of the final table is too large for the Colab environment.
You can try reducing the table size e.g. by taking one or a few chromosomes.
(for the purposes of the tutorials that is more than enough)

Also another option is to use the argument return_input=False, so you'd only get the indices of overlapping intervals. That should have a much lower memory footprint.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants