-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New Dataset and OCR variational pipeline #18
base: main
Are you sure you want to change the base?
Conversation
…e script/pipeline
…n by default, also a function to change the dropout setting at runtime
…e uncertainty. TODO: calibrate these confidences
…cannot stack classification outputs
…r classification, also added some tests to output the question to a text file
…r classification, also added some tests to output the question to a text file
…into 8-taxi500-dataset
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leaving a partial review before I jump into meetings.
Overall looks good, I've added some comments requesting some small changes.
.gitignore
Outdated
|
||
|
||
# project related | ||
data/Taxi1500* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
presumably not needed now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No you're right, we can remove this
pyproject.toml
Outdated
@@ -34,7 +34,11 @@ dependencies = [ | |||
"numpy", | |||
"sentencepiece", | |||
"librosa", | |||
"soundfile" | |||
"soundfile", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this necessary? we're not working with audio
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was left over from the transcription stuff, but yeah we can get rid of this, though it will probably be in all of our virtual environments now, so might it get added again?
requirements.txt
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can probably be removed and we can rely on the pyproject? not sure what best practice is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also not sure, happy to go with what you both think is most appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'd remove
scripts/variational_RTC_example.py
Outdated
|
||
|
||
if __name__ == "__main__": | ||
RTC_pars = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
snake_case makes it clear when something is a variable vs a class, I'd change this to rtc_params
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do
# change huggingface cache to be in project dir rather than user home | ||
export HF_HOME="/bask/projects/v/vjgo8416-spice/hf_cache" | ||
|
||
# TODO: script uses relative path to project home so must be run from home, fix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
outstanding TODO - I think a simple fix in the data loading, will comment separately.
def __init__(self, model_pars, data_pars) -> None: | ||
|
||
self.pars = model_pars | ||
self.OCR = pipeline( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-> self.ocr
|
||
return translation | ||
|
||
return self.translator(text)[0]["translation_text"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this [0] relying on the fact that text is a str and never a list of strings? if so, maybe add a guard to check it is a str
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think this assumes a batch size of 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
type hints
recognition = self.OCR(x) | ||
self.results["OCR"] = recognition["text"] | ||
self.results["translation"] = self.translate(recognition["text"]) | ||
self.results["classification"] = self.classify(self.results["translation"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why store results rather than simply return?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is the clean pipeline I'm not sure, either way this is deprecated now so perhaps it is best to remove?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the purpose of this file? will we ever use it and not the variational one? should the variational one extend this rather than be completely separate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above, I can remove this now, the variational one has the same functionality when used with clean_inference()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we'll need to link to license under CC-BY-4.0, maybe add a README to this folder?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added README.md
set_dropout, | ||
) | ||
|
||
# From huggingface page with model: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this was left over from old versions of this file. Removed.
new_values[run] = self.var_output[step][run][metric] | ||
new_var_dict[step][metric] = new_values | ||
# overwrite the existing output dictionary | ||
self.var_output = new_var_dict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not return?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same for other fns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No particular reason, we could maybe refactor this in a separate issue? Or I can change this before we merge?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's a massive change so let's just make it now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do this now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Functionality looks good.
I resolved the merge conflict for .gitignore as it was stopping the CI checks from running. Those are failing at the moment so could you take a look at why @J-Dymond ? I suspect you haven't got pre-commit installed locally when you're committing up. If you haven't, try:
pip install -e ".[dev]" # from project dir, installs with dev deps
pre-commit install
pre-commit run --all-files
@@ -259,26 +256,26 @@ def stack_translator_sentence_metrics( | |||
] | |||
return stacked | |||
|
|||
def stack_variational_outputs(self): | |||
def stack_variational_outputs(self, var_output): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
type hints
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
This branch includes:
src/arc_spice/data/multieurlex_utils.py
src/arc_spice/variational_pipelines/RTC_variational_pipeline.py
clean_inference
andvariational_inference
functionality. As well as calculating some confidence metrics on the outputs.src/arc_spice/variational_pipelines/dropout_utils.py
src/arc_spice/eval/classification_error.py
andsrc/arc_spice/eval/translation_error.py
scripts/variational_RTC_example.py