Proposal: Report non-fatal errors from the WebNN timeline #778
Comments
Thanks @a-sully for the proposal. A couple of thoughts. Allowing web developers to identify WebNN objects by name sounds like a good idea. However, I think we could make this even more useful by assigning labels to all created WebNN objects so that they appear consistently across all operations (similar to WebGPU). Which WebNN backend is expected to fail after build() but before execution? Seems undesirable. Even if we capture errors occurring between the building and dispatch phases, there should be some guarantee for the web developer about which state is affected before they handle it.
That's more or less what I've proposed :) See
The bigger problem we're seeing right now is backends failing during graph execution. That being said, there's a class of failures where an inconsistency in system state (or assumed system state, in the example below) between
I agree it's undesirable, but I argue that it's unavoidable:
Could you elaborate on what you mean by this?
@a-sully, thank you very much for putting this together. Re: labeling objects. I am always in favor of giving web developers a way to label objects and of using those labels in subsequent diagnostic output or in errors flagged by the browser. Should we derive
For the scenario you outlined where build succeeds but dispatch fails, is the failure a product of the input being bad or the graph being bad? Would failing dispatches subsequently succeed if you used an input with different values, or is the input object doomed to fail no matter what graph you use it with? Knowing this would inform which object we should put into an error state, or propagate the error state to. When the errors happen, are they recoverable by retrying some or all of the steps the developer took to get to that point? What guidance should we provide about what to try next?
The Problem (see #477)
Our current method for surfacing dispatch() errors is to "lose" the MLContext. As I mentioned in #754 (comment), I don't think it makes sense for this to be the only option for surfacing errors from dispatch():
Losing the MLContext is a very heavy-handed failure mode. In the current Chromium implementation (which, to be fair, is an implementation detail), this means killing the GPU process, which impacts the system well beyond the scope of WebNN. I don't think the MLContext is always the right blast radius for a dispatch() error.
There is also no way whatsoever to surface an error from writeTensor()!
State of the World
Here are examples of how I've observed dispatch() fail in the current Chromium implementation:
... MLContext may indeed be the only option
... MLContext e.g. if you assume an OOM is imminent,
... MLGraph e.g. if you suspect the graph may always be too large to ever execute successfully. This would free up some memory and possibly prevent an otherwise imminent OOM
... MLGraphBuilder.build(), but unfortunately this is not always the case. This is currently the most common failure mode for Chromium's CoreML backend. Some thoughts on how to react:
... MLContext is not a useful option
... MLGraph, especially if you're confident it will never execute successfully
... gather ops (see Add "implementation consideration" about how out-of-bound indices of Gather/Scatter should be handled #486), and so Chromium's TFLite backend must address this, which it does not (yet). Some thoughts on how to react:
... MLContext is not a useful option
... dispatch() with similar inputs, in which case you could end up in this scenario again soon. It may be reasonable to preemptively blow away the MLGraph
Observations
... MLContext (or the entire GPU process) would be useful
... dispatch() failures) depending on whether the GPU process is sandboxed. In this case it seems like a bug in the framework: the graph compiler is not aware of user agent sandboxing and incorrectly assumes (without verifying) certain resources are accessible
... where operator may fail to hit the affected branch(es).
... MLGraph is a reasonable (though not strictly necessary) response to examples 2, 3, and 4
... dispatch() fails but its output tensors are never read back...
... dispatch() fails but its output tensors are later overwritten by new data...
... readTensor()
... importExternalBuffer()
Proposal
... writeTensor(), dispatch()) catastrophically fails, continue to lose the MLContext
... MLTensors, though possibly also an MLGraph, TBD) are put into an errored state
... writeTensor() writes new data
Example:
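As an illustration only (this is not the proposal's own example), the behavior described above can be sketched with a small in-memory model in TypeScript. Every name here (SketchContext, SketchTensor, SketchGraph, SketchResult, doubleNonNegative) is hypothetical and none of it is the real WebNN API. Two details are assumptions rather than something the proposal states: that an errored tensor is cleared whenever it is overwritten, whether by writeTensor() or by a later successful dispatch(), and that the error becomes visible to script when the tensor is read.

```typescript
// Hypothetical model only: the Sketch* names are NOT the WebNN API.
type SketchResult =
  | { ok: true; data: number[] }
  | { ok: false; fatal: boolean; message: string };

type SketchGraph = (input: number[]) => SketchResult;

class SketchTensor {
  // Non-null while the tensor is in the errored state.
  error: string | null = null;
  constructor(public data: number[]) {}
}

class SketchContext {
  lost = false;

  // Assumption: writing new data replaces the contents, so any earlier
  // error no longer describes the tensor and is cleared.
  writeTensor(tensor: SketchTensor, data: number[]): void {
    tensor.data = data.slice();
    tensor.error = null;
  }

  dispatch(graph: SketchGraph, input: SketchTensor, output: SketchTensor): void {
    if (this.lost) throw new Error("context is lost");
    const result = graph(input.data);
    if (result.ok) {
      output.data = result.data;
      output.error = null;
      return;
    }
    // Catastrophic failures still lose the whole context; non-fatal
    // failures only mark the output tensor as errored.
    if (result.fatal) this.lost = true;
    output.error = result.message;
  }

  // Assumption: reading is where the error becomes observable to script.
  readTensor(tensor: SketchTensor): number[] {
    if (tensor.error !== null) throw new Error(tensor.error);
    return tensor.data.slice();
  }
}

// Stand-in graph that fails non-fatally on negative values, loosely
// modelled on the out-of-bounds gather case.
const doubleNonNegative: SketchGraph = (input) =>
  input.some((x) => x < 0)
    ? { ok: false, fatal: false, message: "index out of bounds" }
    : { ok: true, data: input.map((x) => x * 2) };

const ctx = new SketchContext();
const inputTensor = new SketchTensor([1, -1]);
const outputTensor = new SketchTensor([]);

ctx.dispatch(doubleNonNegative, inputTensor, outputTensor);
console.log(outputTensor.error, ctx.lost); // index out of bounds false

ctx.writeTensor(inputTensor, [1, 2]);
ctx.dispatch(doubleNonNegative, inputTensor, outputTensor);
console.log(ctx.readTensor(outputTensor)); // [ 2, 4 ]
```

In this model the context survives the first failed dispatch(), and the second dispatch() succeeds once the input has been rewritten, which is the narrower blast radius the proposal is after.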
Open Questions
... graph1 be put into an errored state, too?
... graph1 will always fail to execute?
... importExternalBuffer() method?
... GPUError scopes will be able to handle this case
... createBuffer() be made synchronous and use this error reporting mechanism?
... MLTensor.error), since the errored state exists on the WebNN timeline. Is that sufficient?
Tentative IDL: