
feat: Introduce InstructionFinetuningDataRepository #1033

Open · NickyHavoc wants to merge 3 commits into main from instruction-finetuning-data-repository

Conversation

NickyHavoc
Collaborator

Description

No description.

Before Merging

  • Review the code changes
    • Unused print / comments / TODOs
    • Missing docstrings for functions that should have them
    • Consistent variable names
    • ...
  • Update changelog.md if necessary
  • Commit messages should contain a semantic label and the ticket number
    • Consider squashing if this is not the case

@NickyHavoc force-pushed the instruction-finetuning-data-repository branch 10 times, most recently from bef8a61 to f5855ce on October 1, 2024 10:09
@NickyHavoc force-pushed the instruction-finetuning-data-repository branch from f696773 to de46f5e on October 2, 2024 12:45
@LisaBM force-pushed the instruction-finetuning-data-repository branch 5 times, most recently from 5365b05 to df81dd2 on October 2, 2024 14:54
@NickyHavoc force-pushed the instruction-finetuning-data-repository branch 2 times, most recently from d59a31e to 8f5a8c8 on October 14, 2024 12:29
@NiklasKoehneckeAA changed the title from "WIP: InstructionFinetuningDataRepository" to "feat: Introduce InstructionFinetuningDataRepository" on Oct 14, 2024
self.engine = create_engine(database_url)
self.Session = sessionmaker(bind=self.engine)

Base.metadata.create_all(self.engine)
Contributor

If our database model changes, do we plan to support users with database migrations?

Collaborator Author

Yes... What do you think that would imply for now?

Collaborator Author

I also don't want to overcomplicate this for now. I'd rather have people use it and then worry later about something like migration.
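For context, a minimal sketch of what supporting migrations could eventually imply, using Alembic (the migration tool commonly paired with SQLAlchemy); the revision id, table, and column names here are hypothetical:

```python
# Hypothetical Alembic revision, as generated by `alembic revision --autogenerate`
# after a model change. Table and column names are illustrative only.
from alembic import op
import sqlalchemy as sa

revision = "0001_add_source_column"
down_revision = None


def upgrade() -> None:
    # Apply the schema change to existing databases.
    op.add_column(
        "instruction_finetuning_samples",
        sa.Column("source", sa.String(), nullable=True),
    )


def downgrade() -> None:
    # Revert the schema change.
    op.drop_column("instruction_finetuning_samples", "source")
```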

self,
database_url: str,
) -> None:
self.engine = create_engine(database_url)
Contributor

Can we have connection pooling? https://docs.sqlalchemy.org/en/20/core/pooling.html#connection-pool-configuration If multiple users connect to the database, it can hit the connection limit.

Collaborator Author

Will do.

Contributor

I don't know if it also makes sense to use postgresql+asyncpg so the code doesn't block, but maybe this is a topic for later.

Collaborator Author

Yeah, sounds like a topic for later. For now, I just introduced pooling as in the docs.
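For reference, a minimal sketch of pool configuration as described in the linked SQLAlchemy docs; the specific values are illustrative, not necessarily what the PR settled on:

```python
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

database_url = "postgresql://user:password@localhost:5432/db"  # placeholder

engine = create_engine(
    database_url,
    pool_size=5,        # connections kept open in the pool
    max_overflow=10,    # extra connections allowed under load
    pool_timeout=30,    # seconds to wait for a free connection
    pool_recycle=3600,  # periodically recycle connections to avoid stale ones
)
Session = sessionmaker(bind=engine)
```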

.filter(InstructionFinetuningSample_.id.in_(ids))
.all()
)
for db_sample in db_samples:
Contributor

Should we implement pagination here, using limit and offset?

Collaborator Author

Done.
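For readers outside the diff, a sketch of what limit/offset pagination over this query might look like; the `samples_page` helper and its `page`/`page_size` parameters are hypothetical, not the PR's actual API:

```python
from typing import Sequence


def samples_page(session, ids: Sequence[str], page: int, page_size: int = 100):
    """Fetch one page of samples by id (hypothetical helper)."""
    return (
        session.query(InstructionFinetuningSample_)
        .filter(InstructionFinetuningSample_.id.in_(ids))
        .order_by(InstructionFinetuningSample_.id)  # stable ordering across pages
        .limit(page_size)
        .offset(page * page_size)
        .all()
    )
```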

WIP: implement initial interface

WIP: minimal working implementation

WIP: store multiple samples for postgres repo

WIP: poetry lock, linting

WIP: actually running poetry lock

WIP: separate functions for single and batch storing

WIP: test sample validations

WIP: `InstructionFinetuningDataHandler`

WIP: Support filtering

WIP: linting

feat: `FileInstructionFinetuningDataRepository`

WIP: user-facing functions

poetry install
temp commit

bugfix in samples_with_filter

poetry update
@NickyHavoc force-pushed the instruction-finetuning-data-repository branch from 588d547 to 03272e8 on October 16, 2024 09:22
@NickyHavoc force-pushed the instruction-finetuning-data-repository branch from 03272e8 to 61a2457 on October 16, 2024 09:22
@azayz requested a review from mveleci on October 17, 2024 08:06
Contributor

@SebastianNiehusAA left a comment

Thank you for the PR.
Mainly, I would like some more docstrings (see comments), but overall it's looking good.
Also, please add an entry to the CHANGELOG.md

def to_finetuning_sample(
self, messages: Sequence[Message]
) -> Sequence[FinetuningMessage]:
"""Abstract function allowing a user to what the model's finetuning samples should look like.
Contributor

Suggested change
"""Abstract function allowing a user to what the model's finetuning samples should look like.
"""Abstract function allowing a user to define what the model's finetuning samples should look like.

def to_llama_3_finetuning_sample(
messages: Sequence[Message], eot_token: str
) -> Sequence[FinetuningMessage]:
"""Turn a sequence of messages into a finetuning train sample using the llama-3 format.
Contributor

Suggested change
"""Turn a sequence of messages into a finetuning train sample using the llama-3 format.
"""Turn a sequence of messages into a finetuning training sample using the llama-3 format.

Contributor

Seeing that it's called "train" in most places of the code anyway, we might also just leave this as "train". Don't have a strong opinion here.
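For readers outside the diff, a sketch of what this function might do under the llama-3 chat template; the `FinetuningMessage` fields shown here (`content`, `has_loss`) are assumptions, and the PR's actual implementation may differ:

```python
from typing import Literal, Sequence

from pydantic import BaseModel


class Message(BaseModel, frozen=True):
    role: Literal["system", "user", "assistant"]
    content: str


class FinetuningMessage(BaseModel, frozen=True):
    """A single chat turn, formatted for finetuning (hypothetical shape)."""

    content: str
    has_loss: bool  # assumption: compute training loss only on assistant turns


def to_llama_3_finetuning_sample(
    messages: Sequence[Message], eot_token: str
) -> Sequence[FinetuningMessage]:
    """Turn a sequence of messages into a finetuning training sample using the llama-3 format."""
    return [
        FinetuningMessage(
            content=(
                f"<|start_header_id|>{message.role}<|end_header_id|>\n\n"
                f"{message.content}{eot_token}"
            ),
            has_loss=message.role == "assistant",
        )
        for message in messages
    ]
```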

role: Literal["system", "user", "assistant"]
content: str


class FinetuningMessage(BaseModel, frozen=True):
Contributor

Please add a brief docstring

Contributor

Please add docstrings to the classes and methods of this file

@@ -1,3 +1,3 @@
 #!/usr/bin/env -S bash -eu -o pipefail

-TQDM_DISABLE=1 poetry run pytest -n 10
+TQDM_DISABLE=1 poetry run pytest -n 10 -s
Contributor

Do we really need the extra verbosity of the '-s'?

Collaborator Author

Can't remember putting it there, removing...

n = 5
samples = [
InstructionFinetuningSample.from_raw_sample(raw_instruction_finetuning_sample)
for _ in range(n)
Contributor

This should store more than n samples, since we want to check that the head method properly stops after retrieving n entries.

Contributor

Please add docstrings to methods

n = 5
samples = [
InstructionFinetuningSample.from_raw_sample(raw_instruction_finetuning_sample)
for _ in range(n)
Contributor

The test should add more than n samples since we want to check if head properly stops after n

Collaborator Author

Agreed, will do.
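Concretely, the suggested fix might look like the following; the `repository` fixture and the `store_samples`/`head` names are assumptions based on the surrounding discussion:

```python
def test_head_stops_after_n(repository, raw_instruction_finetuning_sample) -> None:
    n = 5
    # Store more than n samples so the test can catch head() over-fetching.
    samples = [
        InstructionFinetuningSample.from_raw_sample(raw_instruction_finetuning_sample)
        for _ in range(n + 3)
    ]
    repository.store_samples(samples)

    assert len(list(repository.head(n))) == n
```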
