-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Introduce InstructionFinetuningDataRepository
#1033
base: main
Are you sure you want to change the base?
Conversation
bef8a61
to
f5855ce
Compare
f696773
to
de46f5e
Compare
5365b05
to
df81dd2
Compare
d59a31e
to
8f5a8c8
Compare
InstructionFinetuningDataRepository
InstructionFinetuningDataRepository
self.engine = create_engine(database_url) | ||
self.Session = sessionmaker(bind=self.engine) | ||
|
||
Base.metadata.create_all(self.engine) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If our database model change, do we plan to support the user with database migrations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes... What would you think that would imply for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also don't want to overcomplicate this for now. I'd rather have people use it and then worry later about something like migration.
self, | ||
database_url: str, | ||
) -> None: | ||
self.engine = create_engine(database_url) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have connection pooling? https://docs.sqlalchemy.org/en/20/core/pooling.html#connection-pool-configuration if multiple users connect to the database it can reach a limit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not know if also make sense to use the postgresql+asyncpg. In order to not block the code, but maybe this is a topic for later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, sounds like for later. For now, I just introduced pooling as in the docs.
.filter(InstructionFinetuningSample_.id.in_(ids)) | ||
.all() | ||
) | ||
for db_sample in db_samples: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should do implement pagination here? using limit and offset
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
WIP: implement initial interface WIP: minimal working implementation WIP: store multiple samples for postgres repo WIP: poetry lock, linting WIP: actually running poetry lock WIP: seperate functions for single and batch storing WIP: test sample validations WIP: `InstructionFinetuningDataHandler` WIP: Support filtering WIP: linting feat: `FileInstructionFinetuningDataRepository` WIP: user-facing functions poetry install
temp commit bugfix in samples_with_filter poetry update
588d547
to
03272e8
Compare
…ingDataRepository` poetry lock
03272e8
to
61a2457
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR.
Mainly, I would like some more docstrings (see comments), but overall it's looking good.
Also, please add an entry to the CHANGELOG.md
def to_finetuning_sample( | ||
self, messages: Sequence[Message] | ||
) -> Sequence[FinetuningMessage]: | ||
"""Abstract function allowing a user to what the model's finetuning samples should look like. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"""Abstract function allowing a user to what the model's finetuning samples should look like. | |
"""Abstract function allowing a user to define what the model's finetuning samples should look like. |
def to_llama_3_finetuning_sample( | ||
messages: Sequence[Message], eot_token: str | ||
) -> Sequence[FinetuningMessage]: | ||
"""Turn a sequence of messages into a finetuning train sample using the llama-3 format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"""Turn a sequence of messages into a finetuning train sample using the llama-3 format. | |
"""Turn a sequence of messages into a finetuning training sample using the llama-3 format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seeing that it's called "train" in most places of the code anyway, we might also just leave this as "train". Don't have a strong opinion here.
role: Literal["system", "user", "assistant"] | ||
content: str | ||
|
||
|
||
class FinetuningMessage(BaseModel, frozen=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a brief docstring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add docstrings to the classes and methods of this file
@@ -1,3 +1,3 @@ | |||
#!/usr/bin/env -S bash -eu -o pipefail | |||
|
|||
TQDM_DISABLE=1 poetry run pytest -n 10 | |||
TQDM_DISABLE=1 poetry run pytest -n 10 -s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need the extra verbosity of the '-s'?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't remember putting it there, removing...
n = 5 | ||
samples = [ | ||
InstructionFinetuningSample.from_raw_sample(raw_instruction_finetuning_sample) | ||
for _ in range(n) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should store more then n samples since we want to check if the hean method is properly stopping after retrieving n entries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add docstrings to methods
n = 5 | ||
samples = [ | ||
InstructionFinetuningSample.from_raw_sample(raw_instruction_finetuning_sample) | ||
for _ in range(n) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test should add more than n samples since we want to check if head properly stops after n
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, will do.
Description
No description.
Before Merging
changelog.md
if necessary