
💡 [REQUEST] - Better logging and tracing #475

Open
1 task done
lvaylet opened this issue May 21, 2024 · 4 comments · May be fixed by #479
lvaylet commented May 21, 2024

Summary

Following up on #441, it appears that some of the telemetry required to troubleshoot intermittent issues might be missing. Should we take this opportunity to rethink the metrics/logs/traces collected by the SLO Generator?

Basic Example

I am a huge fan of Chapter 4 in the excellent Zero to Production in Rust. The whole chapter is about Telemetry. The author starts with basic logging, then attaches Request IDs to every log (so he can correlate entries that show up in a random order in the logging service), then ultimately decides to use traces to track individual requests (to get the context automatically, without adding it explicitly). I feel like the same principle can be applied to each request to the SLO Generator API, or to each request to a backend/exporter. Traces could replace or extend the existing logs, and make troubleshooting much easier without having to enable the (very verbose) Debug mode with DEBUG=1.

This is also a great opportunity to migrate to a vendor-agnostic stack like OpenTelemetry for metrics, logs and traces, with all of this data exported to stdout/stderr and/or to the OpenTelemetry Collector over the OpenTelemetry Protocol (OTLP). On GCP, Cloud Run supports sidecars for such a model, and the OpenTelemetry Collector can easily export to Cloud Operations.
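
For illustration, here is a minimal sketch of what that could look like in Python, assuming the opentelemetry-sdk and opentelemetry-exporter-otlp packages are available; the handle_api_request function and the span/attribute names are hypothetical, not existing SLO Generator code:

# Configure the SDK to export spans to stdout and to a local Collector over OTLP,
# then wrap each API request in a root span so the context propagates automatically.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "slo-generator"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("slo_generator")

def handle_api_request(request):  # hypothetical API request handler
    # One root span per request replaces a hand-rolled request ID.
    with tracer.start_as_current_span("compute_slo_report") as span:
        span.set_attribute("slo.request", str(request))
        ...  # call backends/exporters here; child spans inherit the trace context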

Screenshots

No response

Drawbacks

Might require a significant rework, as well as the approval of existing users who rely on the logs themselves or on log-based metrics extracted from the log entries with regular expressions.

Unresolved questions

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
lvaylet self-assigned this May 21, 2024
lvaylet commented May 22, 2024

Traces have the concept of 'span events' to log structured data. For example, the number of good and bad events of a given SLI computation could be saved as trace events for automatic correlation with the API request itself.

More details in the OpenTelemetry documentation: https://hatch.pypa.io/dev/config/environment/advanced/
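
As a rough sketch (assuming a tracer provider is already configured, and using hypothetical span/attribute names):

from opentelemetry import trace

tracer = trace.get_tracer("slo_generator")

def compute_sli(window):  # hypothetical SLI computation
    with tracer.start_as_current_span("compute_sli") as span:
        good, bad = 950, 50  # placeholder counts returned by a backend query
        # Record the counts as a span event so they are automatically
        # correlated with the API request that triggered the computation.
        span.add_event("sli_counts", attributes={"events.good": good, "events.bad": bad})
        return good / (good + bad)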

lvaylet commented May 22, 2024

The OpenTelemetry documentation details how to instrument Python code with traces, spans, events, links and attributes: https://opentelemetry.io/docs/languages/python/instrumentation/

Next steps:

  1. Identify the golden path of a request to the SLO Generator API,
  2. Instrument this golden path on a local version of the SLO Generator, first at a very high level,
  3. Launch an OpenTelemetry Collector on the side (as a Docker container),
  4. Confirm the traces show up as expected in Cloud Trace, with the expected level of detail,
  5. Add granularity (= nested spans) by instrumenting deeper and deeper layers of the golden path, as sketched below.
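
For step 5, a rough sketch of what nested spans could look like (the helper functions below are placeholders, not the actual SLO Generator code):

from opentelemetry import trace

tracer = trace.get_tracer("slo_generator")

# Placeholder helpers standing in for the real layers of the golden path.
def load_config(request): return {"slo": "availability"}
def query_backend(config): return 0.999
def export_report(sli, config): print(f"SLI={sli}")

def run_compute(request):
    # The root span covers the API call; child spans add granularity
    # for the deeper layers.
    with tracer.start_as_current_span("run_compute"):
        config = load_config(request)
        with tracer.start_as_current_span("query_backend"):
            sli = query_backend(config)
        with tracer.start_as_current_span("export_report"):
            export_report(sli, config)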

lvaylet commented May 24, 2024

Automatic instrumentation works great out of the box and provides good granularity, as long as the Python packages involved are themselves instrumented. For example, wrapping the API server with:

opentelemetry-instrument \
    --traces_exporter console,otlp \
    --service_name slo-generator \
    --exporter_otlp_traces_endpoint "localhost:4317" \
    --exporter_otlp_traces_insecure true \
    slo-generator api --target=run_compute --signature-type=http -c samples/config.yaml

exports the following spans to Cloud Trace:

[Screenshots: the resulting spans displayed in Cloud Trace]

Mix automatic and manual instrumentation for more granularity?
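
A possible sketch of the mixed approach: when the process is launched with opentelemetry-instrument, the agent already configures the SDK, so our own code only needs to acquire a tracer and open child spans inside the automatically created ones (the function and span names below are hypothetical):

from opentelemetry import trace

# No SDK setup needed here: opentelemetry-instrument configures the provider.
tracer = trace.get_tracer("slo_generator")

def compute_report(config):  # hypothetical function on the golden path
    # This span becomes a child of the auto-instrumented HTTP/Flask span.
    with tracer.start_as_current_span("compute_report") as span:
        span.set_attribute("slo.config", str(config))
        ...  # existing computation, now visible as its own span in Cloud Trace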

lvaylet commented May 25, 2024

TODO:

  • Correlate traces and logs automatically.
  • Use Sampling to optimize ingestion costs?
  • Test the OpenTelemetry Python packages dedicated to popular libraries like Flask, unless they are installed automatically by opentelemetry-bootstrap -a install, as documented here?
  • Instrument our own functions to get more granularity on top of auto-instrumentation. Add attributes and events where it makes sense. For example, the numbers of good and bad events can be recorded as events on the query spans.
  • Make OpenTelemetry instrumentation (and exporting to the OpenTelemetry Collector) optional for backward compatibility, without any error message for users who decide not to use it. For example with an environment variable? See the sketch after this list.
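
On that last point, a rough sketch of opt-in tracing guarded by a hypothetical SEND_TRACES_TO_OTLP environment variable, with head sampling thrown in for the ingestion-cost question:

import os

from opentelemetry import trace

def maybe_setup_tracing():
    # When the variable is unset, no provider is registered and
    # trace.get_tracer() keeps returning a no-op tracer: existing users
    # see no new behaviour and no error messages.
    if os.environ.get("SEND_TRACES_TO_OTLP", "").lower() not in ("1", "true", "yes"):
        return

    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

    provider = TracerProvider(
        resource=Resource.create({"service.name": "slo-generator"}),
        sampler=TraceIdRatioBased(0.1),  # arbitrary ratio, to be tuned
    )
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
    trace.set_tracer_provider(provider)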

lvaylet linked pull request #479 on May 30, 2024 that will close this issue