Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify that exporter timeout settings must be positive #4283

Open
jack-berg opened this issue Nov 5, 2024 · 8 comments · May be fixed by #4331
Open

Clarify that exporter timeout settings must be positive #4283

jack-berg opened this issue Nov 5, 2024 · 8 comments · May be fixed by #4331
Assignees
Labels
spec:protocol Related to the specification/protocol directory triage:accepted:ready-with-sponsor Ready to be implemented and has a specification sponsor assigned

Comments

@jack-berg
Copy link
Member

Related to: open-telemetry/opentelemetry-java#6850

The env var configuration interface defines a variety of options related to exporter timeout settings:

  • OTEL_EXPORTER_OTLP_TIMEOUT
  • OTEL_EXPORTER_ZIPKIN_TIMEOUT
  • OTEL_BSP_EXPORT_TIMEOUT
  • OTEL_BLRP_EXPORT_TIMEOUT
  • OTEL_METRIC_EXPORT_TIMEOUT

While the spec states that:

Any value that represents a duration, for example a timeout, MUST be an integer representing a number of milliseconds. The value is non-negative - if a negative value is provided, the implementation MUST generate a warning, gracefully ignore the setting and use the default value if it is defined.

Implying that zero is a valid duration in addition to positive values. However, there is no explicit mention of zero, and how to interpret it.

Does zero mean indefinite? If so, that's a really important piece of information for implementers.
Does zero actually mean zero, and represent a degenerate case which is valid but always ends up with timeout? This seems useless.

I think the more likely case is that the spec overlooked zero for exporter timeouts.

I propose we clarify that zero is an invalid duration for all exporter timeout settings, and characterize this as a bugfix. Any language implementation which doesn't validate exporter timeout at all, or accepts zero as valid can fix their implementation as a bugfix. If a language implementation is insistent that switching from no validation to validating that timeout is positive is a breaking change (I disagree but this is a hypothetical), then they can continue without adding the new validation. This would be similar to how we changed the default protocol from grpc to http/protobuf, carving out an exception for backwards compatibility.

@jack-berg jack-berg added the spec:protocol Related to the specification/protocol directory label Nov 5, 2024
@trask trask added the triage:deciding:community-feedback Open to community discussion. If the community can provide sufficient reasoning, it may be accepted label Nov 5, 2024
@trask
Copy link
Member

trask commented Nov 6, 2024

I'm kind of used to timeout 0 meaning "never timeout", e.g.

https://linux.die.net/man/7/socket

If the timeout is set to zero (the default) then the operation will never timeout.

That said, I'm not sure there's a practical use case for supporting "never timeout" in OpenTelemetry pipelines(?)

I could see "retry forever" when retrying from disk, but I think that would be a "retry setting" as opposed to a "timeout setting".

@github-actions github-actions bot added the triage:followup Needs follow up during triage label Nov 20, 2024
@svrnm svrnm added triage:deciding:tc-inbox Needs attention from the TC in order to move forward and removed triage:deciding:community-feedback Open to community discussion. If the community can provide sufficient reasoning, it may be accepted triage:followup Needs follow up during triage labels Nov 25, 2024
@svrnm
Copy link
Member

svrnm commented Nov 25, 2024

@open-telemetry/technical-committee PTAL

@MrAlias
Copy link
Contributor

MrAlias commented Nov 25, 2024

I'm kind of used to timeout 0 meaning "never timeout"

This is how Go interprets a value of zero for many (maybe all?) of these timeouts.

@tigrannajaryan
Copy link
Member

If we can't agree on semantics of 0 vs -1 for "never timeout", here is an alternate: don't support it.

The point of "never timeout" logic is that typically you handle the timeouts yourself and have some other means to interrupt whatever operation is waiting on that "forever timeout". "Never timeout" does not mean wait until the heat death of the universe.

Given the above, is there a practical application of "never timeout" values for any of the Otel config settings? AFAIK, we don't provide any other means to interrupt it, so what's the point of waiting forever (until end of the process, I assume)?

For the hypothetical unknown use case where an actual forever timeout is needed I am guessing the largest acceptable number in milliseconds is enough (assuming 32bits signed, you get 50 years. I want to see a process that runs longer than that).

@pellared
Copy link
Member

I do not say that using -1 for "never timeout" is good.
Maybe OTel .NET would agree to handle 0 as "never timeout". However, it may be considered as a breaking change.
CC @open-telemetry/dotnet-maintainers

However, I do agree with @tigrannajaryan that we should have a real use case when such setting is needed. Supporting unbounded values can be considered unsafe.

@yurishkuro
Copy link
Member

+1 to change the spec to require positive values only

@jack-berg
Copy link
Member Author

However, it may be considered as a breaking change.

I mentioned in the issue how I think we should handle this from a compatibility standpoint:

I propose we clarify that zero is an invalid duration for all exporter timeout settings, and characterize this as a bugfix. Any language implementation which doesn't validate exporter timeout at all, or accepts zero as valid can fix their implementation as a bugfix. If a language implementation is insistent that switching from no validation to validating that timeout is positive is a breaking change (I disagree but this is a hypothetical), then they can continue without adding the new validation. This would be similar to how we changed the default protocol from grpc to http/protobuf, carving out an exception for backwards compatibility.

@jack-berg jack-berg added the triage:accepted:ready-with-sponsor Ready to be implemented and has a specification sponsor assigned label Dec 4, 2024
@jack-berg jack-berg self-assigned this Dec 4, 2024
@jack-berg jack-berg removed the triage:deciding:tc-inbox Needs attention from the TC in order to move forward label Dec 4, 2024
carlosalberto added a commit that referenced this issue Dec 21, 2024
Related to #4283.

A
[comment](#4331 (comment))
adding a "type" column to each env var, but didn't feel appropriate to
extend scope of #4331.

---------

Co-authored-by: Carlos Alberto Cortez <calberto.cortez@gmail.com>
Co-authored-by: Reiley Yang <reyang@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
spec:protocol Related to the specification/protocol directory triage:accepted:ready-with-sponsor Ready to be implemented and has a specification sponsor assigned
Projects
Status: No status
Development

Successfully merging a pull request may close this issue.

7 participants