
feat: Introduce InstructionFinetuningDataRepository #1033

Open · NickyHavoc wants to merge 3 commits into main from instruction-finetuning-data-repository

Conversation

NickyHavoc
Collaborator

Description

No description.

Before Merging

  • Review the code changes
    • Unused print / comments / TODOs
    • Missing docstrings for functions that should have them
    • Consistent variable names
    • ...
  • Update changelog.md if necessary
  • Commit messages should contain a semantic label and the ticket number
    • Consider squashing if this is not the case

@NickyHavoc force-pushed the instruction-finetuning-data-repository branch 10 times, most recently from bef8a61 to f5855ce on October 1, 2024 10:09
@NickyHavoc force-pushed the instruction-finetuning-data-repository branch from f696773 to de46f5e on October 2, 2024 12:45
@LisaBM force-pushed the instruction-finetuning-data-repository branch 5 times, most recently from 5365b05 to df81dd2 on October 2, 2024 14:54
@NickyHavoc force-pushed the instruction-finetuning-data-repository branch 2 times, most recently from d59a31e to 8f5a8c8 on October 14, 2024 12:29
@NiklasKoehneckeAA changed the title from "WIP: InstructionFinetuningDataRepository" to "feat: Introduce InstructionFinetuningDataRepository" on Oct 14, 2024
self.engine = create_engine(database_url)
self.Session = sessionmaker(bind=self.engine)

Base.metadata.create_all(self.engine)
Contributor

If our database model changes, do we plan to support users with database migrations?

Collaborator Author

Yes... What do you think that would imply for now?

Collaborator Author

I also don't want to overcomplicate this for now. I'd rather have people use it and then worry later about something like migration.
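For context, a minimal sketch of what supporting migrations could eventually imply, using Alembic (the migration tool commonly paired with SQLAlchemy); the revision id, table, and column names here are hypothetical:

```python
# Hypothetical Alembic revision, as generated by `alembic revision --autogenerate`
# after a model change. Table and column names are illustrative only.
from alembic import op
import sqlalchemy as sa

revision = "0001_add_source_column"
down_revision = None


def upgrade() -> None:
    # Apply the schema change to existing databases.
    op.add_column(
        "instruction_finetuning_samples",
        sa.Column("source", sa.String(), nullable=True),
    )


def downgrade() -> None:
    # Revert the schema change.
    op.drop_column("instruction_finetuning_samples", "source")
```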

self,
database_url: str,
) -> None:
self.engine = create_engine(database_url)
Contributor

Can we have connection pooling? https://docs.sqlalchemy.org/en/20/core/pooling.html#connection-pool-configuration If multiple users connect to the database, it can hit the connection limit.

Collaborator Author

Will do.

Contributor

I don't know if it also makes sense to use postgresql+asyncpg so the code doesn't block, but maybe this is a topic for later.

Collaborator Author

Yeah, sounds like a topic for later. For now, I just introduced pooling as in the docs.
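For reference, a minimal sketch of pool configuration as described in the linked SQLAlchemy docs; the specific values are illustrative, not necessarily what the PR settled on:

```python
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

database_url = "postgresql://user:password@localhost:5432/db"  # placeholder

engine = create_engine(
    database_url,
    pool_size=5,        # connections kept open in the pool
    max_overflow=10,    # extra connections allowed under load
    pool_timeout=30,    # seconds to wait for a free connection
    pool_recycle=3600,  # periodically recycle connections to avoid stale ones
)
Session = sessionmaker(bind=engine)
```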

.filter(InstructionFinetuningSample_.id.in_(ids))
.all()
)
for db_sample in db_samples:
Contributor

Should we implement pagination here, using limit and offset?

Collaborator Author

Done.
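For readers outside the diff, a sketch of what limit/offset pagination over this query might look like; the `samples_page` helper and its `page`/`page_size` parameters are hypothetical, not the PR's actual API:

```python
from typing import Sequence


def samples_page(session, ids: Sequence[str], page: int, page_size: int = 100):
    """Fetch one page of samples by id (hypothetical helper)."""
    return (
        session.query(InstructionFinetuningSample_)
        .filter(InstructionFinetuningSample_.id.in_(ids))
        .order_by(InstructionFinetuningSample_.id)  # stable ordering across pages
        .limit(page_size)
        .offset(page * page_size)
        .all()
    )
```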

WIP: implement initial interface

WIP: minimal working implementation

WIP: store multiple samples for postgres repo

WIP: poetry lock, linting

WIP: actually running poetry lock

WIP: separate functions for single and batch storing

WIP: test sample validations

WIP: `InstructionFinetuningDataHandler`

WIP: Support filtering

WIP: linting

feat: `FileInstructionFinetuningDataRepository`

WIP: user-facing functions

poetry install
temp commit

bugfix in samples_with_filter

poetry update
@NickyHavoc force-pushed the instruction-finetuning-data-repository branch from 588d547 to 03272e8 on October 16, 2024 09:22
@NickyHavoc force-pushed the instruction-finetuning-data-repository branch from 03272e8 to 61a2457 on October 16, 2024 09:22
@azayz requested a review from mveleci on October 17, 2024 08:06
Contributor

@SebastianNiehusAA left a comment

Thank you for the PR.
Mainly, I would like some more docstrings (see comments), but overall it's looking good.
Also, please add an entry to the CHANGELOG.md

def to_finetuning_sample(
self, messages: Sequence[Message]
) -> Sequence[FinetuningMessage]:
"""Abstract function allowing a user to what the model's finetuning samples should look like.
Contributor

Suggested change
"""Abstract function allowing a user to what the model's finetuning samples should look like.
"""Abstract function allowing a user to define what the model's finetuning samples should look like.

def to_llama_3_finetuning_sample(
messages: Sequence[Message], eot_token: str
) -> Sequence[FinetuningMessage]:
"""Turn a sequence of messages into a finetuning train sample using the llama-3 format.
Contributor

Suggested change
"""Turn a sequence of messages into a finetuning train sample using the llama-3 format.
"""Turn a sequence of messages into a finetuning training sample using the llama-3 format.

Contributor

Seeing that it's called "train" in most places of the code anyway, we might also just leave this as "train". Don't have a strong opinion here.
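For readers outside the diff, a sketch of what this function might do under the llama-3 chat template; the `FinetuningMessage` fields shown here (`content`, `has_loss`) are assumptions, and the PR's actual implementation may differ:

```python
from typing import Literal, Sequence

from pydantic import BaseModel


class Message(BaseModel, frozen=True):
    role: Literal["system", "user", "assistant"]
    content: str


class FinetuningMessage(BaseModel, frozen=True):
    """A single chat turn, formatted for finetuning (hypothetical shape)."""

    content: str
    has_loss: bool  # assumption: compute training loss only on assistant turns


def to_llama_3_finetuning_sample(
    messages: Sequence[Message], eot_token: str
) -> Sequence[FinetuningMessage]:
    """Turn a sequence of messages into a finetuning training sample using the llama-3 format."""
    return [
        FinetuningMessage(
            content=(
                f"<|start_header_id|>{message.role}<|end_header_id|>\n\n"
                f"{message.content}{eot_token}"
            ),
            has_loss=message.role == "assistant",
        )
        for message in messages
    ]
```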

role: Literal["system", "user", "assistant"]
content: str


class FinetuningMessage(BaseModel, frozen=True):
Contributor

Please add a brief docstring

Contributor

Please add docstrings to the classes and methods of this file

@@ -1,3 +1,3 @@
 #!/usr/bin/env -S bash -eu -o pipefail

-TQDM_DISABLE=1 poetry run pytest -n 10
+TQDM_DISABLE=1 poetry run pytest -n 10 -s
Contributor

Do we really need the extra verbosity of the '-s'?

Collaborator Author

Can't remember putting it there, removing...

n = 5
samples = [
InstructionFinetuningSample.from_raw_sample(raw_instruction_finetuning_sample)
for _ in range(n)
Contributor

This should store more than n samples, since we want to check that the head method properly stops after retrieving n entries.

Contributor

Please add docstrings to methods

n = 5
samples = [
InstructionFinetuningSample.from_raw_sample(raw_instruction_finetuning_sample)
for _ in range(n)
Contributor

The test should add more than n samples since we want to check if head properly stops after n

Collaborator Author

Agreed, will do.
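Concretely, the suggested fix might look like the following; the `repository` fixture and the `store_samples`/`head` names are assumptions based on the surrounding discussion:

```python
def test_head_stops_after_n(repository, raw_instruction_finetuning_sample) -> None:
    n = 5
    # Store more than n samples so the test can catch head() over-fetching.
    samples = [
        InstructionFinetuningSample.from_raw_sample(raw_instruction_finetuning_sample)
        for _ in range(n + 3)
    ]
    repository.store_samples(samples)

    assert len(list(repository.head(n))) == n
```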
