New Dataset and OCR variational pipeline #18

Open · wants to merge 60 commits into base: main
Conversation

J-Dymond (Collaborator):

This branch includes:

  • src/arc_spice/data/multieurlex_utils.py

    • Code for loading and preprocessing the MultiEURLEX dataset.
  • src/arc_spice/variational_pipelines/RTC_variational_pipeline.py

    • The variational pipeline, providing clean_inference and variational_inference functionality, as well as calculating some confidence metrics on the outputs.
  • src/arc_spice/variational_pipelines/dropout_utils.py

    • Utility functions for performing MC dropout (a sketch of one such utility follows the commit summaries below).
  • src/arc_spice/eval/classification_error.py and src/arc_spice/eval/translation_error.py

    • Helper functions for calculating errors and uncertainties for the two tasks.
  • scripts/variational_RTC_example.py

    • A barebones script with example usage of the variational pipeline.

Commit summaries (truncated):
…n by default, also a function to change the dropout setting at runtime
…e uncertainty. TODO: calibrate these confidences
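Since the runtime dropout toggle comes up in both the file list and the commit summaries above, here is a minimal sketch of what a utility like set_dropout (imported in the example script reviewed below) might look like. The body is an assumption for illustration, not the actual contents of dropout_utils.py:

import torch.nn as nn

def set_dropout(model: nn.Module, dropout_flag: bool) -> None:
    # Switch only the Dropout modules into train mode so that MC dropout
    # stays active at inference time; other layers keep their eval behaviour.
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train(mode=dropout_flag)

Called with dropout_flag=True before the stochastic forward passes and False afterwards, this gives the off-by-default, toggle-at-runtime behaviour described above.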
@J-Dymond J-Dymond linked an issue Nov 13, 2024 that may be closed by this pull request
@J-Dymond J-Dymond requested review from eddableheath and lannelin and removed request for eddableheath November 14, 2024 10:44
@lannelin (Collaborator) left a comment:

leaving a partial review before I jump into meetings.

Overall looks good, I've added some comments requesting some small changes.

.gitignore Outdated


# project related
data/Taxi1500*
Collaborator:

presumably not needed now?

Collaborator (Author):

No you're right, we can remove this

pyproject.toml Outdated
@@ -34,7 +34,11 @@ dependencies = [
"numpy",
"sentencepiece",
"librosa",
"soundfile"
"soundfile",
Collaborator:

is this necessary? we're not working with audio

Collaborator (Author):

This was left over from the transcription stuff, but yeah, we can get rid of it. Though it will probably be in all of our virtual environments now, so might it get added again?

requirements.txt Outdated
Collaborator:

I think this can probably be removed and we can rely on the pyproject? not sure what best practice is

Collaborator (Author):

I'm also not sure, happy to go with what you both think is most appropriate.

Collaborator:

I'd remove it.



if __name__ == "__main__":
    RTC_pars = {
Collaborator:

snake_case makes it clear when something is a variable vs a class, I'd change this to rtc_params

Collaborator (Author):

will do

# change huggingface cache to be in project dir rather than user home
export HF_HOME="/bask/projects/v/vjgo8416-spice/hf_cache"

# TODO: script uses relative path to project home so must be run from home, fix
Collaborator:

outstanding TODO - I think a simple fix in the data loading, will comment separately.
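One hedged option for that fix (the names here are hypothetical, not taken from the actual data-loading code): resolve paths from the module's own location rather than the current working directory, e.g.

from pathlib import Path

# Repository root inferred from this file's location instead of the CWD,
# so scripts no longer have to be launched from the project home.
# parents[3] assumes this lives at src/arc_spice/data/multieurlex_utils.py.
PROJECT_ROOT = Path(__file__).resolve().parents[3]
DATA_DIR = PROJECT_ROOT / "data"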

def __init__(self, model_pars, data_pars) -> None:

    self.pars = model_pars
    self.OCR = pipeline(
Collaborator:

-> self.ocr


return translation

return self.translator(text)[0]["translation_text"]
Collaborator:

is this [0] relying on the fact that text is a str and never a list of strings? if so, maybe add a guard to check it is a str

Collaborator (Author):

Yes, I think this assumes a batch size of 1
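A possible guard, sketched against the call shown above (the method name, signature, and error message are assumptions):

def translate(self, text: str) -> str:
    # The [0] relies on a single string input (batch size 1);
    # fail loudly if a list of strings is passed instead.
    if not isinstance(text, str):
        raise TypeError(f"translate expects a single string, got {type(text).__name__}")
    return self.translator(text)[0]["translation_text"]

The str annotations would also cover the separate type-hints comment below.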

Collaborator:

type hints

recognition = self.OCR(x)
self.results["OCR"] = recognition["text"]
self.results["translation"] = self.translate(recognition["text"])
self.results["classification"] = self.classify(self.results["translation"])
Collaborator:

why store results rather than simply return?

Collaborator (Author):

Since this is the clean pipeline I'm not sure; either way, this is deprecated now, so perhaps it's best to remove it?

Collaborator:

what's the purpose of this file? will we ever use it and not the variational one? should the variational one extend this rather than be completely separate?

Collaborator (Author):

As above, I can remove this now; the variational one has the same functionality when used with clean_inference().

Collaborator (Author):

removed

Collaborator:

I think we'll need to link to license under CC-BY-4.0, maybe add a README to this folder?

Collaborator (Author):

added README.md
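For reference, a sketch of the kind of note such a README might carry (the wording and folder scope are assumptions; only the CC BY 4.0 link is the standard one):

# Data

The contents of this folder are redistributed under the Creative Commons
Attribution 4.0 licence (CC BY 4.0):
https://creativecommons.org/licenses/by/4.0/
Please credit the original dataset authors when using these files.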

set_dropout,
)

# From huggingface page with model:
Collaborator:

remove?

Collaborator (Author):

Yeah, this was left over from old versions of this file. Removed.

new_values[run] = self.var_output[step][run][metric]
new_var_dict[step][metric] = new_values
# overwrite the existing output dictionary
self.var_output = new_var_dict
Collaborator:

why not return?

Collaborator:

same for other fns

Collaborator (Author):

No particular reason, we could maybe refactor this in a separate issue? Or I can change this before we merge?

Collaborator:

I don't think it's a massive change so let's just make it now.

Collaborator (Author):

will do this now
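For reference, a hedged sketch of the return-based version (assuming, from the indexing in the snippet above, that var_output maps each pipeline step to a list of per-run metric dictionaries):

def stack_variational_outputs(self, var_output: dict) -> dict:
    # Transpose run-major output (one metric dict per run) into
    # metric-major lists, returning the result rather than
    # overwriting self.var_output in place.
    new_var_dict: dict = {}
    for step, runs in var_output.items():
        new_var_dict[step] = {
            metric: [run_output[metric] for run_output in runs]
            for metric in runs[0]
        }
    return new_var_dict

The annotations here would also address the type-hints comment on this method further down.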

@lannelin (Collaborator) left a comment:

Functionality looks good.

I resolved the merge conflict for .gitignore as it was stopping the CI checks from running. Those are failing at the moment, so could you take a look at why, @J-Dymond? I suspect you haven't got pre-commit installed locally when you're committing. If you haven't, try:

pip install -e ".[dev]" # from project dir, installs with dev deps
pre-commit install
pre-commit run --all-files

@@ -259,26 +256,26 @@ def stack_translator_sentence_metrics(
]
return stacked

- def stack_variational_outputs(self):
+ def stack_variational_outputs(self, var_output):
Collaborator:

type hints


@J-Dymond J-Dymond linked an issue Nov 15, 2024 that may be closed by this pull request
Development

Successfully merging this pull request may close these issues.

Develop OCR-translate-topic pipeline taxi500 dataset