Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

group mutate revamp #109

Open
wants to merge 10 commits into
base: devel
Choose a base branch
from
Open

Conversation

ppaxisa
Copy link
Member

@ppaxisa ppaxisa commented Apr 6, 2024

When operating on an ungrouped object, mutate will use methods from {plyranges} but when operating on a grouped object, mutate will rely on vanilla tidyverse by converting the object to a tibble. The result is then coerced back to DataFrame to update the metadata of the object. Of note, core columns can be used in this framework to create new metadata columns.

@mikelove
Copy link
Member

mikelove commented Apr 7, 2024

thanks @ppaxisa!

I'm just doing a little package cleanup today, I plan to merge in your PR on Wednesday (i'm out of office Mon and Tue)

@sa-lee
Copy link
Collaborator

sa-lee commented Apr 7, 2024

Is this a breaking change when a user tries to mutate with a column that is an S4 class (one of the original reasons I didn't use base tidyverse for grouped mutate in plyranges, and why it was slow)? I would prefer an implementation that does allow for S4 columns but understand that maybe this is a rare use case and we are willing to make tradeoffs here. @lawremi do you have thoughts on this?

@ppaxisa
Copy link
Member Author

ppaxisa commented Apr 8, 2024

I have not formally tested that but yes, I believe this would be a breaking change if attempting to use S4 columns on grouped object (S4 columns would still work on ungrouped object).

@mikelove
Copy link
Member

mikelove commented Apr 8, 2024

I’ll can do testing on Wednesday (and add unit tests for this case). Maybe there is a solution so we deliver speed when we can but not break on S4. Ranges derived from TxDb and EnsDb are common and have S4 columns

@mikelove
Copy link
Member

@sa-lee Pierre-Paul and I are considering:

Upon group_by + mutate:

  • If metadata cols contains any S4 variable, do the original behavior but show a message that a faster operation is possible without S4 variables (either through select or downgrading)
  • Provide a convenience function for downgrading S4 columns
  • Provide the faster dplyr-based mutate only when S4 columns are not present

Thoughts?

@sa-lee
Copy link
Collaborator

sa-lee commented Apr 11, 2024

That sounds like a good approach to me. There are vectorised methods for a lot of S4 List classes that group_by() + summarize() takes advantage of, so maybe also worth looking into that code as well to see if anything can be adapted over.

@mikelove
Copy link
Member

Had some things pop up so didn’t get a chance to implement anything but plan on it later this week or next.

Thinking this will go to next Bioc cycle which is fine — important to get this working well, given the broad impact.

@ppaxisa
Copy link
Member Author

ppaxisa commented Apr 18, 2024

If it helps: can you give me some pointers to try and write the convenience function that checks if there are S4 columns in the DataFrame? Not sure where to start but I can spend some time to move forward on this.

@mikelove
Copy link
Member

@mikelove
Copy link
Member

I have time again to work on this (really!). I'll report back by end of week.

@mikelove
Copy link
Member

mikelove commented May 3, 2024

So I think just a check here that there are no S4 columns is all we need. I can build out the rest.

I was trying to follow the logic of this function and got a little lost. Should it just be an if / else block? Because any(core) and any(!core) aren't complements.

https://github.com/ImmuneAxisa/plyranges/blob/mutate_grp_revamp/R/dplyr-mutate.R#L88-L125

Another ask: because the code went through many iterations, could you make a new branch with your final changes, and can you mark the regions with comments, e.g. # PPA grouped mutate speedup, 2024

@mikelove
Copy link
Member

mikelove commented May 3, 2024

> d <- DataFrame(x=1:10, y=11:20, z=IntegerList(as.list(1:10)))
> lapply(d, isS4)
$x
[1] FALSE

$y
[1] FALSE

$z
[1] TRUE

> any(sapply(d, isS4))
[1] TRUE
> 

@ppaxisa
Copy link
Member Author

ppaxisa commented May 7, 2024

Ok I'll review the decision tree for which functions to use, comment that, and create a new branch. This might take a little while, I have limited bandwidth the next 3 weeks. I guess I'll have to close this pull request and submit a new one?

@mikelove
Copy link
Member

mikelove commented May 7, 2024

That would be awesome. No time pressure and thanks again for your effort on this.

You can just leave this PR open, and make a new one once you get to it?

@sa-lee
Copy link
Collaborator

sa-lee commented May 8, 2024

Just chiming into say I'm happy to review an PRs moving forward.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants