Dataframe datamodule #2403

Open
wants to merge 1 commit into
base: main


@manuelkonrad manuelkonrad commented Nov 2, 2024

📝 Description

  • This PR adds the Dataframe datamodule, which is instantiated directly from a pandas DataFrame. It is an alternative to the Folder datamodule for custom datasets where the labels are not encoded in the directory structure. This is useful when labels are refined regularly, or when sub-sampling large datasets without copying or moving files.
  • The datamodule also includes a from_file constructor, which loads the data from any tabular file format supported by pandas. The file format is passed as an argument, but I could also add explicit constructors such as from_csv or from_parquet.
  • As an alternative to Dataframe, I could rename the datamodule to Tabular to avoid confusion with pandas' DataFrame class.
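
To illustrate the motivation above, here is a minimal pandas-only sketch. It is not the code in this PR: the column names `image_path` and `label` and the helper `load_samples` are hypothetical, chosen only to show how labels kept in a table can be refined and sub-sampled without touching files, and how a format argument could select a pandas reader.

```python
import io

import pandas as pd

# Hypothetical schema: the column names used by the PR may differ.
samples = pd.DataFrame(
    {
        "image_path": ["img/001.png", "img/002.png", "img/003.png", "img/004.png"],
        "label": ["normal", "normal", "abnormal", "normal"],
    }
)

# Refine a label in the table; no file is moved or renamed on disk.
samples.loc[samples["image_path"] == "img/002.png", "label"] = "abnormal"

# Sub-sample by selecting rows (one per label); files stay where they are.
subset = samples.groupby("label").sample(n=1, random_state=0).reset_index(drop=True)


def load_samples(source, file_format: str = "csv", **kwargs) -> pd.DataFrame:
    """Hypothetical helper: call the pandas reader named read_<file_format>."""
    reader = getattr(pd, f"read_{file_format}", None)
    if reader is None:
        raise ValueError(f"pandas has no reader for format {file_format!r}")
    return reader(source, **kwargs)


# Round-trip through CSV text to exercise the file-based path.
csv_text = samples.to_csv(index=False)
reloaded = load_samples(io.StringIO(csv_text), file_format="csv")
```

Dispatching on a format string keeps a single entry point, whereas explicit constructors such as from_csv or from_parquet would make the supported formats visible in the API; either choice fits the same table-based design.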

✨ Changes

Select what type of change your PR is:

  • 🐞 Bug fix (non-breaking change which fixes an issue)
  • 🔨 Refactor (non-breaking change which refactors the code base)
  • 🚀 New feature (non-breaking change which adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📚 Documentation update
  • 🔒 Security update

✅ Checklist

Before you submit your pull request, please make sure you have completed the following steps:

  • 📋 I have summarized my changes in the CHANGELOG and followed the guidelines for my type of change (skip for minor changes, documentation updates, and test enhancements).
  • 📚 I have made the necessary updates to the documentation (if applicable).
  • 🧪 I have written tests that support my changes and prove that my fix is effective or my feature works (if applicable).

For more information about code review checklists, see the Code Review Checklist.

Signed-off-by: Manuel Konrad <84141230+manuelkonrad@users.noreply.github.com>
@manuelkonrad manuelkonrad force-pushed the feature/dataframe-datamodule branch from 498d143 to d00c6b7 Compare November 23, 2024 16:18
@manuelkonrad
Author

@samet-akcay @ashwinvaidya17 @djdameln

Hi there 👋 Any comments on this PR? This addition would greatly increase the flexibility of custom datasets and it does not introduce breaking changes. Any feedback would be appreciated, thanks!

@samet-akcay
Contributor

@manuelkonrad, thanks for the reminder. I'll prioritise this tomorrow.
