
💡 [REQUEST] - Better logging and tracing #475

Open
1 task done
lvaylet opened this issue May 21, 2024 · 4 comments · May be fixed by #479
lvaylet commented May 21, 2024

Summary

Following up on #441, it appears that some of the telemetry required to troubleshoot intermittent issues might be missing. Should we take this opportunity to rethink the metrics/logs/traces collected by the SLO Generator?

Basic Example

I am a huge fan of Chapter 4 in the excellent Zero to Production in Rust. The whole chapter is about Telemetry. The author starts with basic logging, then attaches Request IDs to every log (so he can correlate entries that show up in a random order in the logging service), then ultimately decides to use traces to track individual requests (to get the context automatically, without adding it explicitly). I feel like the same principle can be applied to each request to the SLO Generator API, or to each request to a backend/exporter. Traces could replace or extend the existing logs, and make troubleshooting much easier without having to enable the (very verbose) Debug mode with DEBUG=1.

This is also a great opportunity to migrate to a vendor-agnostic stack like OpenTelemetry for metrics, logs and traces, with all of this data exported to stdout/stderr and/or to the OpenTelemetry Collector over the OpenTelemetry Protocol (OTLP). On GCP, Cloud Run supports sidecars for such a model, and the OpenTelemetry Collector can easily export to Cloud Operations.
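
For illustration, here is a minimal sketch of what that could look like in Python, assuming the opentelemetry-sdk and opentelemetry-exporter-otlp packages are available; the handle_api_request function and the span/attribute names are hypothetical, not existing SLO Generator code:

# Configure the SDK to export spans to stdout and to a local Collector over OTLP,
# then wrap each API request in a root span so the context propagates automatically.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "slo-generator"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("slo_generator")

def handle_api_request(request):  # hypothetical API request handler
    # One root span per request replaces a hand-rolled request ID.
    with tracer.start_as_current_span("compute_slo_report") as span:
        span.set_attribute("slo.request", str(request))
        ...  # call backends/exporters here; child spans inherit the trace context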

Screenshots

No response

Drawbacks

Might require a significant rework, as well as the approval of existing users who rely on the logs themselves or on log-based metrics extracted from the log entries with regular expressions.

Unresolved questions

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
lvaylet self-assigned this May 21, 2024
lvaylet commented May 22, 2024

Traces have the concept of 'span events' to log structured data. For example, the number of good and bad events of a given SLI computation could be saved as trace events for automatic correlation with the API request itself.

More details in the OpenTelemetry documentation: https://hatch.pypa.io/dev/config/environment/advanced/
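
As a rough sketch (assuming a tracer provider is already configured, and using hypothetical span/attribute names):

from opentelemetry import trace

tracer = trace.get_tracer("slo_generator")

def compute_sli(window):  # hypothetical SLI computation
    with tracer.start_as_current_span("compute_sli") as span:
        good, bad = 950, 50  # placeholder counts returned by a backend query
        # Record the counts as a span event so they are automatically
        # correlated with the API request that triggered the computation.
        span.add_event("sli_counts", attributes={"events.good": good, "events.bad": bad})
        return good / (good + bad)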

lvaylet commented May 22, 2024

The OpenTelemetry documentation details how to instrument Python code with traces, spans, events, links and attributes: https://opentelemetry.io/docs/languages/python/instrumentation/

Next steps:

  1. Identify the golden path of a request to the SLO Generator API,
  2. Instrument this golden path on a local version of the SLO Generator, first at a very high level,
  3. Launch an OpenTelemetry Collector on the side (as a Docker container),
  4. Confirm the traces show up as expected in Cloud Trace, with the expected level of detail,
  5. Add granularity (= nested spans) by instrumenting deeper and deeper layers of the golden path, as sketched below.
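
For step 5, a rough sketch of what nested spans could look like (the helper functions below are placeholders, not the actual SLO Generator code):

from opentelemetry import trace

tracer = trace.get_tracer("slo_generator")

# Placeholder helpers standing in for the real layers of the golden path.
def load_config(request): return {"slo": "availability"}
def query_backend(config): return 0.999
def export_report(sli, config): print(f"SLI={sli}")

def run_compute(request):
    # The root span covers the API call; child spans add granularity
    # for the deeper layers.
    with tracer.start_as_current_span("run_compute"):
        config = load_config(request)
        with tracer.start_as_current_span("query_backend"):
            sli = query_backend(config)
        with tracer.start_as_current_span("export_report"):
            export_report(sli, config)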

lvaylet commented May 24, 2024

Automatic instrumentation works great out of the box and provides good granularity, as long as the Python packages involved are themselves instrumented. For example, wrapping the API server with:

opentelemetry-instrument \
    --traces_exporter console,otlp \
    --service_name slo-generator \
    --exporter_otlp_traces_endpoint "localhost:4317" \
    --exporter_otlp_traces_insecure true \
    slo-generator api --target=run_compute --signature-type=http -c samples/config.yaml

exports the following spans to Cloud Trace:

[Screenshots: the resulting spans displayed in Cloud Trace]

Mix automatic and manual instrumentation for more granularity?
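
A possible sketch of the mixed approach: when the process is launched with opentelemetry-instrument, the agent already configures the SDK, so our own code only needs to acquire a tracer and open child spans inside the automatically created ones (the function and span names below are hypothetical):

from opentelemetry import trace

# No SDK setup needed here: opentelemetry-instrument configures the provider.
tracer = trace.get_tracer("slo_generator")

def compute_report(config):  # hypothetical function on the golden path
    # This span becomes a child of the auto-instrumented HTTP/Flask span.
    with tracer.start_as_current_span("compute_report") as span:
        span.set_attribute("slo.config", str(config))
        ...  # existing computation, now visible as its own span in Cloud Trace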

lvaylet commented May 25, 2024

TODO:

  • Correlate traces and logs automatically.
  • Use Sampling to optimize ingestion costs?
  • Test the OpenTelemetry Python packages dedicated to popular libraries like Flask, unless they are installed automatically by opentelemetry-bootstrap -a install, as documented here?
  • Instrument our own functions to get more granularity on top of auto-instrumentation. Add attributes and events where it makes sense. For example, the numbers of good and bad events can be recorded as events on the query spans.
  • Make OpenTelemetry instrumentation (and exporting to the OpenTelemetry Collector) optional for backward compatibility, without any error message for users who decide not to use it. For example with an environment variable? See the sketch after this list.
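
On that last point, a rough sketch of opt-in tracing guarded by a hypothetical SEND_TRACES_TO_OTLP environment variable, with head sampling thrown in for the ingestion-cost question:

import os

from opentelemetry import trace

def maybe_setup_tracing():
    # When the variable is unset, no provider is registered and
    # trace.get_tracer() keeps returning a no-op tracer: existing users
    # see no new behaviour and no error messages.
    if os.environ.get("SEND_TRACES_TO_OTLP", "").lower() not in ("1", "true", "yes"):
        return

    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

    provider = TracerProvider(
        resource=Resource.create({"service.name": "slo-generator"}),
        sampler=TraceIdRatioBased(0.1),  # arbitrary ratio, to be tuned
    )
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
    trace.set_tracer_provider(provider)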

lvaylet linked pull request #479 on May 30, 2024 that will close this issue