Suggestion for updating the closest function. #163

Open
WANGchuang715 opened this issue Jun 7, 2023 · 6 comments

Comments

@WANGchuang715

I suggest adding an option for distance-based filtering to the closest function. Currently, closest only lets you choose how many nearest intervals to report, via the k parameter. An option to filter intervals by a maximum distance would give more flexibility and control when selecting intervals by proximity.

@agalitsyna
Member

Hi @WANGchuang715, have you considered filtering the dataframe by the distance column after applying closest with return_distance=True?

@WANGchuang715
Author

> Hi @WANGchuang715, have you considered filtering the dataframe by the distance column after applying closest with return_distance=True?

That is what I do now. However, I cannot tell in advance what value of k is appropriate, so I have to choose a large k, which is inelegant and time-consuming.

@gfudenberg
Member

Hi Wang, if you are interested in many features in df2 around df1, perhaps

bf.overlap(df1, bf.expand(df2, pad=MAX_DIST))

is what you are looking for, rather than a bf.closest operation?

@WANGchuang715
Author

> Hi Wang, if you are interested in many features in df2 around df1, perhaps
>
> bf.overlap(df1, bf.expand(df2, pad=MAX_DIST))
>
> is what you are looking for, rather than a bf.closest operation?

I understand the approach you describe, but closest fits my requirements better. I am using it to find the cis-mRNAs of lncRNAs, so I need to distinguish upstream from downstream neighbors within a certain distance and to detect any direct overlap. closest currently meets my basic needs, and I hope you will still consider my suggestion.

@nvictus
Member

nvictus commented Jun 8, 2023

Can you formulate the problem more precisely?

You mention that you are "unable to determine the appropriate value for k", so it sounds to me like what you really want is to make what is known as a "ball query" of some radius around lncRNAs (differentiating by strand, etc.)? That is, you want to catch all cis-mRNAs up to some given maximum distance away from each lncRNA in a particular direction.

Regardless of how this functionality might be exposed, the task I just described would make more sense as an extension of the overlap algorithm, which is a type of ball query algorithm, rather than of the closest algorithm, which is a nearest-neighbors algorithm. Am I understanding your goal correctly?
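The distinction between the two query types can be sketched in plain Python. This is a toy illustration on half-open (start, end) intervals from a single chromosome, not how bioframe implements either operation. The helper names `gap`, `side`, `k_nearest`, and `ball_query` are hypothetical, and `side` assumes a plus-strand orientation where lower coordinates are upstream.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open (start, end) on one chromosome


def gap(a: Interval, b: Interval) -> int:
    """Distance between two intervals; 0 if they overlap or touch."""
    return max(0, b[0] - a[1], a[0] - b[1])


def side(query: Interval, target: Interval) -> str:
    """Position of target relative to query, assuming a plus-strand query."""
    if target[1] <= query[0]:
        return "upstream"
    if target[0] >= query[1]:
        return "downstream"
    return "overlap"


def k_nearest(query: Interval, targets: List[Interval], k: int) -> List[Interval]:
    """Nearest-neighbors query: the k closest targets, however far away."""
    return sorted(targets, key=lambda t: gap(query, t))[:k]


def ball_query(query: Interval, targets: List[Interval], radius: int) -> List[Interval]:
    """Ball query: every target within radius, however many there are."""
    return [t for t in targets if gap(query, t) <= radius]


query = (5000, 6000)
targets = [(3000, 4500), (6500, 7000), (9000, 9500)]

# k fixes the count: the third result is 3000 bp away but is still returned.
print(k_nearest(query, targets, k=3))

# radius fixes the distance: only the two targets within 1000 bp are returned.
for t in ball_query(query, targets, radius=1000):
    print(t, side(query, t), gap(query, t))
```

A nearest-neighbors query bounds the number of results and leaves the distance open, while a ball query bounds the distance and leaves the number of results open, which is why a fixed k cannot be chosen reliably for the second task.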

@WANGchuang715
Author

> Can you formulate the problem more precisely?
>
> You mention that you are "unable to determine the appropriate value for k", so it sounds to me like what you really want is to make what is known as a "ball query" of some radius around lncRNAs (differentiating by strand, etc.)? That is, you want to catch all cis-mRNAs up to some given maximum distance away from each lncRNA in a particular direction.
>
> Regardless of how this functionality might be exposed, the task I just described would make more sense as an extension of the overlap algorithm, which is a type of ball query algorithm, rather than of the closest algorithm, which is a nearest-neighbors algorithm. Am I understanding your goal correctly?

I think these are two different filtering dimensions. Currently, closest selects the k nearest ranges without considering how far away they are. What I want is to select the N ranges that lie within a certain distance. I believe both filtering approaches are needed in practice.
This is how I currently meet my requirement with the closest function, which is actually quite convenient:

overlap = bf.closest(lnc, mRNA, suffixes=('_lncRNA', '_mRNA'), k=20, ignore_upstream=True, ignore_downstream=True).dropna()
upstream = bf.closest(lnc, mRNA, suffixes=('_lncRNA', '_mRNA'), k=20, ignore_overlaps=True, ignore_downstream=True).dropna()
downstream = bf.closest(lnc, mRNA, suffixes=('_lncRNA', '_mRNA'), k=20, ignore_overlaps=True, ignore_upstream=True).dropna()
upstream = upstream[upstream['distance'] <= max_distance]
downstream = downstream[downstream['distance'] <= max_distance]
