New Dataset and OCR variational pipeline #18

Open · wants to merge 60 commits into base: main
Conversation

J-Dymond (Collaborator):

This branch includes:

  • src/arc_spice/data/multieurlex_utils.py

    • Code for loading and preprocessing the MultiEURLEX dataset.
  • src/arc_spice/variational_pipelines/RTC_variational_pipeline.py

    • The variational pipeline, providing clean_inference and variational_inference functionality, as well as calculating some confidence metrics on the outputs.
  • src/arc_spice/variational_pipelines/dropout_utils.py

    • Utility functions for performing MC dropout (a sketch of one such utility follows the commit summaries below).
  • src/arc_spice/eval/classification_error.py and src/arc_spice/eval/translation_error.py

    • Helper functions for calculating errors and uncertainties for the two tasks.
  • scripts/variational_RTC_example.py

    • A barebones script with example usage of the variational pipeline.

Commit summaries (truncated):
…n by default, also a function to change the dropout setting at runtime
…e uncertainty. TODO: calibrate these confidences
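Since the runtime dropout toggle comes up in both the file list and the commit summaries above, here is a minimal sketch of what a utility like set_dropout (imported in the example script reviewed below) might look like. The body is an assumption for illustration, not the actual contents of dropout_utils.py:

import torch.nn as nn

def set_dropout(model: nn.Module, dropout_flag: bool) -> None:
    # Switch only the Dropout modules into train mode so that MC dropout
    # stays active at inference time; other layers keep their eval behaviour.
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train(mode=dropout_flag)

Called with dropout_flag=True before the stochastic forward passes and False afterwards, this gives the off-by-default, toggle-at-runtime behaviour described above.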
@J-Dymond J-Dymond linked an issue Nov 13, 2024 that may be closed by this pull request
@J-Dymond J-Dymond requested review from eddableheath and lannelin and removed request for eddableheath November 14, 2024 10:44
@lannelin (Collaborator) left a comment:

leaving a partial review before I jump into meetings.

Overall looks good, I've added some comments requesting some small changes.

.gitignore Outdated


# project related
data/Taxi1500*
Collaborator:

presumably not needed now?

Collaborator (Author):

No you're right, we can remove this

pyproject.toml Outdated
@@ -34,7 +34,11 @@ dependencies = [
"numpy",
"sentencepiece",
"librosa",
"soundfile"
"soundfile",
Collaborator:

is this necessary? we're not working with audio

Collaborator (Author):

This was left over from the transcription stuff, but yeah, we can get rid of it. Though it will probably be in all of our virtual environments now, so might it get added again?

requirements.txt Outdated
Collaborator:

I think this can probably be removed and we can rely on the pyproject? not sure what best practice is

Collaborator (Author):

I'm also not sure, happy to go with what you both think is most appropriate.

Collaborator:

I'd remove it.



if __name__ == "__main__":
    RTC_pars = {
Collaborator:

snake_case makes it clear when something is a variable vs a class, I'd change this to rtc_params

Collaborator (Author):

will do

# change huggingface cache to be in project dir rather than user home
export HF_HOME="/bask/projects/v/vjgo8416-spice/hf_cache"

# TODO: script uses relative path to project home so must be run from home, fix
Collaborator:

outstanding TODO - I think a simple fix in the data loading, will comment separately.
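One hedged option for that fix (the names here are hypothetical, not taken from the actual data-loading code): resolve paths from the module's own location rather than the current working directory, e.g.

from pathlib import Path

# Repository root inferred from this file's location instead of the CWD,
# so scripts no longer have to be launched from the project home.
# parents[3] assumes this lives at src/arc_spice/data/multieurlex_utils.py.
PROJECT_ROOT = Path(__file__).resolve().parents[3]
DATA_DIR = PROJECT_ROOT / "data"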

def __init__(self, model_pars, data_pars) -> None:

    self.pars = model_pars
    self.OCR = pipeline(
Collaborator:

-> self.ocr


return translation

return self.translator(text)[0]["translation_text"]
Collaborator:

is this [0] relying on the fact that text is a str and never a list of strings? if so, maybe add a guard to check it is a str

Collaborator (Author):

Yes, I think this assumes a batch size of 1
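A possible guard, sketched against the call shown above (the method name, signature, and error message are assumptions):

def translate(self, text: str) -> str:
    # The [0] relies on a single string input (batch size 1);
    # fail loudly if a list of strings is passed instead.
    if not isinstance(text, str):
        raise TypeError(f"translate expects a single string, got {type(text).__name__}")
    return self.translator(text)[0]["translation_text"]

The str annotations would also cover the separate type-hints comment below.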

Collaborator:

type hints

recognition = self.OCR(x)
self.results["OCR"] = recognition["text"]
self.results["translation"] = self.translate(recognition["text"])
self.results["classification"] = self.classify(self.results["translation"])
Collaborator:

why store results rather than simply return?

Collaborator (Author):

Since this is the clean pipeline I'm not sure; either way, this is deprecated now, so perhaps it's best to remove it?

Collaborator:

what's the purpose of this file? will we ever use it and not the variational one? should the variational one extend this rather than be completely separate?

Collaborator (Author):

As above, I can remove this now; the variational one has the same functionality when used with clean_inference().

Collaborator (Author):

removed

Collaborator:

I think we'll need to link to license under CC-BY-4.0, maybe add a README to this folder?

Collaborator (Author):

added README.md
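For reference, a sketch of the kind of note such a README might carry (the wording and folder scope are assumptions; only the CC BY 4.0 link is the standard one):

# Data

The contents of this folder are redistributed under the Creative Commons
Attribution 4.0 licence (CC BY 4.0):
https://creativecommons.org/licenses/by/4.0/
Please credit the original dataset authors when using these files.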

set_dropout,
)

# From huggingface page with model:
Collaborator:

remove?

Collaborator (Author):

Yeah, this was left over from old versions of this file. Removed.

new_values[run] = self.var_output[step][run][metric]
new_var_dict[step][metric] = new_values
# overwrite the existing output dictionary
self.var_output = new_var_dict
Collaborator:

why not return?

Collaborator:

same for other fns

Collaborator (Author):

No particular reason, we could maybe refactor this in a separate issue? Or I can change this before we merge?

Collaborator:

I don't think it's a massive change so let's just make it now.

Collaborator (Author):

will do this now
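For reference, a hedged sketch of the return-based version (assuming, from the indexing in the snippet above, that var_output maps each pipeline step to a list of per-run metric dictionaries):

def stack_variational_outputs(self, var_output: dict) -> dict:
    # Transpose run-major output (one metric dict per run) into
    # metric-major lists, returning the result rather than
    # overwriting self.var_output in place.
    new_var_dict: dict = {}
    for step, runs in var_output.items():
        new_var_dict[step] = {
            metric: [run_output[metric] for run_output in runs]
            for metric in runs[0]
        }
    return new_var_dict

The annotations here would also address the type-hints comment on this method further down.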

@lannelin (Collaborator) left a comment:

Functionality looks good.

I resolved the merge conflict for .gitignore as it was stopping the CI checks from running. Those are failing at the moment, so could you take a look at why, @J-Dymond? I suspect you haven't got pre-commit installed locally when you're committing. If you haven't, try:

pip install -e ".[dev]" # from project dir, installs with dev deps
pre-commit install
pre-commit run --all-files

@@ -259,26 +256,26 @@ def stack_translator_sentence_metrics(
]
return stacked

- def stack_variational_outputs(self):
+ def stack_variational_outputs(self, var_output):
Collaborator:

type hints


@J-Dymond J-Dymond linked an issue Nov 15, 2024 that may be closed by this pull request
Development

Successfully merging this pull request may close these issues.

Develop OCR-translate-topic pipeline taxi500 dataset