Proposal: Report non-fatal errors from the WebNN timeline #778

Open
a-sully opened this issue Nov 5, 2024 · 3 comments

@a-sully
Contributor

a-sully commented Nov 5, 2024

The Problem (see #477)

Our current method for surfacing dispatch() errors is to "lose" the MLContext. As I mentioned in #754 (comment), I don't think it makes sense for this to be the only option for surfacing errors from dispatch():

I don't think we can assume every failed dispatch() results in a lost MLContext, especially considering platforms where an MLContext is not so closely tied to a single GPU

Losing the MLContext is a very heavy-handed failure mode. In the current Chromium implementation (which, to be fair, is an implementation detail), this means killing the GPU process, which impacts the system well beyond the scope of WebNN. I don't think the MLContext is always the right blast radius for a dispatch() error.

There is also no way whatsoever to surface an error from writeTensor()!

State of the World

Here are examples of how I've observed dispatch() fail in the current Chromium implementation:

  1. The process executing the graph OOMs and crashes
    • blowing away the MLContext may indeed be the only option
  2. Some resource allocation fails while executing the graph, and execution aborts gracefully. Some thoughts on how to react:
    • it may be reasonable to blow away the entire MLContext, e.g. if you assume an OOM is imminent
    • it may be reasonable to just blow away the MLGraph, e.g. if you suspect the graph may always be too large to ever execute successfully. This would free up some memory and possibly prevent an otherwise imminent OOM
    • it may be reasonable to assume this issue is transitory
  3. Graph execution fails due to a runtime error inherent in running the compiled graph in the current environment, meaning that executing this graph will always fail. Ideally this type of failure would be surfaced earlier during MLGraphBuilder.build(), but unfortunately this is not always the case. This is currently the most common failure mode for Chromium's CoreML backend. Some thoughts on how to react:
    • blowing away the entire MLContext is not a useful option
    • it may be reasonable to blow away the MLGraph, especially if you're confident it will never execute successfully
    • it's possible - though it seems unlikely - this issue is transitory
    • you may not know whether you're actually in case 4
  4. Graph execution fails due to a runtime error caused by the specific graph inputs and outputs. From what I can tell, this is always(?) due to issues with the user agent implementation. For example, TFLite does not support negative or OOB indices for gather ops (see Add "implementation consideration" about how out-of-bound indices of Gather/Scatter should be handled #486), and so Chromium's TFLite backend must address this, which it does not (yet); see the sketch after this list. Some thoughts on how to react:
    • blowing away the entire MLContext is not a useful option
    • it may be reasonable to assume this issue is transitory...
    • ...though it may also be reasonable to assume that the website may attempt to dispatch() with similar inputs, in which case you could end up in this scenario again soon. It may be reasonable to preemptively blow away the MLGraph
    • you may not know whether you're actually in case 3
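
To make case 4 concrete, here is a rough sketch (not from the issue itself) of a graph that builds fine but could fail at dispatch() time on a backend that mishandles out-of-bounds gather indices. The shapes, values, and descriptor fields are illustrative, and the tensor objects are assumed to have been created to match.

// Hypothetical case-4 repro: build() succeeds, but the runtime inputs trigger
// a backend error. Shapes, values, and descriptor fields are illustrative.
const builder = new MLGraphBuilder(context);
const data = builder.input('data', {dataType: 'float32', shape: [4]});
const indices = builder.input('indices', {dataType: 'int32', shape: [1]});
const output = builder.gather(data, indices);
const graph = await builder.build({'output': output});

// Assume `dataTensor`, `indicesTensor`, and `outputTensor` were created with
// context.createTensor() using matching descriptors.
context.writeTensor(dataTensor, new Float32Array([1, 2, 3, 4]));
context.writeTensor(indicesTensor, new Int32Array([7]));  // out of bounds for shape [4]

// build() succeeded, yet on a backend that neither clamps nor rejects OOB
// indices (see #486) this dispatch hits a runtime error.
context.dispatch(graph, {'data': dataTensor, 'indices': indicesTensor},
                 {'output': outputTensor});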

Observations

  • In each of cases 2, 3, and 4, a mechanism to signal failure without escalating by destroying the MLContext (or the entire GPU process) would be useful
  • Ideally case 3 would not exist. It is unfortunate that frameworks like CoreML "successfully" compile graphs which will never run, though user agents should also do more to cover for these bugs
    • One example we've observed in Chromium is behavioral differences at runtime (including dispatch() failures) depending on whether the GPU process is sandboxed. In this case it seems like a bug in the framework: the graph compiler is not aware of user agent sandboxing and incorrectly assumes (without verifying) certain resources are accessible
    • In the long run, I expect frameworks to address some of these bugs...
    • ...but some of these "bugs" are unavoidable. The problem of resources being (assumed to be) available during compilation not being available during graph inference is a generic TOCTOU issue...
    • ...and since some frameworks and drivers are only updated with the OS, in practice I expect this to be an issue we'll have to contend with for quite a long time, even considering user agent workarounds
    • More sophisticated techniques to work around these bugs come with drawbacks. For example, the user agent could execute the compiled graph with dummy inputs to probe for runtime errors. However, this may be expensive and the dummy inputs may not even exercise the problematic code path(s); a graph containing the where operator, for instance, may never hit the affected branch(es).
  • Eventually, case 4 should not exist. This requires all WebNN operators to be well-defined and implementations to work around the quirks of the underlying platform however necessary to comply with these specs. The WebNN spec and Chromium implementation are immature, but I expect we'll get there
  • Even if the classes of errors described in cases 3 and 4 are eliminated, case 2 errors are more or less impossible to prevent
  • Blowing away the MLGraph is a reasonable (though not strictly necessary) response to cases 2, 3, and 4
  • Failures are cascading
    // If this dispatch fails...
    context.dispatch(graph, inputs, {'output': intermediateTensor});
    
    // Any operations which depend on `intermediateTensor` should also fail.
    context.dispatch(graph, {'input': intermediateTensor}, outputs);
  • Tracking down the source of cascading failures is challenging
  • Failures only matter if they're observable
    • If a tree falls in a forest...
    • If a dispatch() fails but its output tensors are never read back...
    • If a dispatch() fails but its output tensors are later overwritten by new data...
  • Results of operations on the WebNN timeline are observable to script only via a limited set of async APIs (see the sketch after this list):
    • readTensor()
    • (eventually) importExternalBuffer()
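
A minimal sketch of the last point, assuming the MLContext.lost promise and readTensor() as currently specified: a failed dispatch() is silent on its own, and script only finds out if the context is lost or a later read-back rejects.

// The only signals script can observe today: a lost context...
context.lost.then(info => console.warn(`MLContext lost: ${info.message}`));

// ...or a rejected read-back. This dispatch may fail without any signal...
context.dispatch(graph, inputs, {'output': outputTensor});

// ...and the failure, if it is observable at all, only surfaces here, with no
// indication of which dispatch() or writeTensor() was responsible.
context.readTensor(outputTensor)
    .then(buffer => { /* use the result */ })
    .catch(error => console.warn(error.message));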

Proposal

  • If an operation on the WebNN timeline (writeTensor(), dispatch()) catastrophically fails, continue to lose the MLContext
  • If the failure is not catastrophic, the affected objects (usually MLTensors, though possibly also an MLGraph, TBD) are put into an errored state
  • An object's errored state may be reset if it is the output of a successful operation
    • e.g. writeTensor() writes new data
  • Any operations which take an errored object as an input will propagate this error to their outputs
  • Any promise-bearing operations which take an errored object as an input will reject the promise
  • Use labels to improve debuggability by attributing failures to specific operations

Example:

// If this dispatch fails, `tensorA` is put into an errored state.
context.dispatch(graph1, inputs, {'out': tensorA}, {label: 'foo'});

// An operation dependent on `tensorA` will fail.
context.dispatch(graph2, {'in': tensorA}, {'out': tensorB}, {label: 'bar'});

// This promise will reject with an implementation-defined message which at
// minimum mentions 'foo'.
context.readTensor(tensorA)
    .catch(error => ...);

// This promise will reject with an implementation-defined message which at
// minimum mentions 'bar' (though perhaps also points back to 'foo').
context.readTensor(tensorB)
    .catch(error => ...);

// Clears the errored state of `tensorA` if the write is successful.
// (`newData` is a placeholder for the fresh data being written.)
context.writeTensor(tensorA, newData);

Open Questions

  • In the example above, should graph1 be put into an errored state, too?
    • Or only if the user agent believes graph1 will always fail to execute?
  • Do we need a more structured format for reporting errors?
    • I think rejecting the promise with an implementation-defined error message should be sufficient, at least for now. User agents are welcome to make this error message as detailed as they like.
  • How will errors be reported with a sync importExternalBuffer() method?
    • I'm tentatively hoping GPUError scopes will be able to handle this case (see the hypothetical sketch after this list)
  • Should createBuffer() be made synchronous and use this error reporting mechanism?
  • This proposal does not include a way to query for this errored state on the affected objects from script (e.g. MLTensor.error), since the errored state exists on the WebNN timeline. Is that sufficient?
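
For illustration only, here is roughly what the GPUError-scope idea could look like; none of these MLContext methods exist today, and importExternalBuffer() is itself hypothetical.

// Hypothetical sketch: an MLContext mirroring WebGPU's error scopes.
context.pushErrorScope('validation');                 // assumed API, not specced
const buffer = context.importExternalBuffer(tensor);  // assumed sync import API
context.popErrorScope().then(error => {               // assumed API, not specced
  if (error) {
    console.warn(`importExternalBuffer failed: ${error.message}`);
  }
});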

Tentative IDL:

dictionary MLObjectDescriptorBase {
  USVString label = "";
};

// Add labels to operations on the WebNN timeline.
void dispatch(
    MLGraph graph,
    MLNamedTensors inputs,
    MLNamedTensors outputs,
    optional MLObjectDescriptorBase options = {});

void writeTensor(
    MLTensor tensor,
    AllowSharedBufferSource inputData,
    optional MLObjectDescriptorBase options = {});

// Add labels to objects which may be used on the WebNN timeline.
partial dictionary MLContextOptions : MLObjectDescriptorBase {}

partial dictionary MLTensorDescriptor : MLObjectDescriptorBase {}

partial interface MLGraphBuilder {
  // To label the resulting MLGraph.
  Promise<MLGraph> build(
    MLNamedOperands outputs,
    optional MLObjectDescriptorBase options = {});
};
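
To illustrate how these labels might fit together end to end, here is a sketch; the createContext()/createTensor() descriptor details are assumed, and builder, outputs, inputs, and inputData stand in for real values.

// Label objects at creation time...
const context = await navigator.ml.createContext({label: 'my-context'});
const tensor = await context.createTensor(
    {dataType: 'float32', shape: [2, 2], label: 'activations'});
const graph = await builder.build(outputs, {label: 'encoder'});

// ...and label individual operations on the WebNN timeline.
context.writeTensor(tensor, inputData, {label: 'upload-activations'});
context.dispatch(graph, inputs, {'out': tensor}, {label: 'encoder-pass-1'});

// A later rejection can then name 'encoder-pass-1' (and perhaps also
// 'encoder' or 'activations') in its implementation-defined message.
context.readTensor(tensor).catch(error => console.warn(error.message));
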
@bbernhar

Thanks @a-sully for the proposal.

A couple of thoughts.

Allowing web developers to identify WebNN objects by name sounds like a good idea. However, I think we could make this even more useful by assigning labels to all created WebNN objects so that they appear consistently across all operations (similar to WebGPU).

Which WebNN backend is expected to fail after build() but before execution? Seems undesirable. Even if we capture errors occurring between the building and dispatch phases, there should be some guarantee for the web developer about which state is affected before they handle it.

@a-sully
Contributor Author

a-sully commented Nov 15, 2024

Allowing web developers to identify WebNN objects by name sounds like a good idea. However, I think we could make this even more useful by assigning labels to all created WebNN objects so that they appear consistently across all operations (similar to WebGPU).

That's more or less what I've proposed :) See MLObjectDescriptorBase (and its usages) in the Tentative IDL section.

Which WebNN backend is expected to fail after build() but before execution? Seems undesirable.

The bigger problem we're seeing right now is backends failing during graph execution. That being said, there's a class of failures where an inconsistency in system state (or assumed system state, in the example below) between build() and dispatch() leads to failures such that build() succeeds and dispatch() will always fail. From the Observations section:

...but some of these "bugs" are unavoidable. The problem of resources being (assumed to be) available during compilation not being available during graph inference is a generic TOCTOU issue...

I agree it's undesirable, but I argue that it's unavoidable:

  • Even if the classes of errors described in cases 3 and 4 are eliminated, case 2 errors are more or less impossible to prevent

Even if we capture errors occurring between the building and dispatch phases, there should be some guarantee for the web developer about which state is affected before they handle it.

Could you elaborate on what you mean by this?

@RafaelCintron
Collaborator

@a-sully, thank you very much for putting this together.

Re: labeling objects. I am always in favor of giving web developers a way to label objects and use those labels in subsequent diagnostic output or errors flagged by the browser. Should we derive MLTensor and MLGraph off of MLObjectBase as well?

In the example above, should graph1 be put into an errored state, too?

For the scenario you outlined where build succeeds but dispatch fails, is the failure a product of the input being bad or the graph being bad? Would failing dispatches subsequently succeed if you used an input with different values, or is the input object doomed to fail no matter what graph you use it with? Knowing this would inform which object we should put into an error state, or propagate error state to.

When the errors happen, are they recoverable by the web developer retrying some or all of the previous steps they took to get to that point? What guidance should we provide as to what they should try next?
