Unified semantic conventions for tasks, workflows, pipelines, jobs #1688

Open
svrnm opened this issue Dec 16, 2024 · 3 comments
Labels
area:new, enhancement (New feature or request), experts needed (This issue or pull request is outside an area where general approvers feel they can approve), triage:needs-triage

Comments

svrnm (Member) commented Dec 16, 2024

Area(s)

area:new

Is your change request related to a problem? Please describe.

While reviewing the AI Agent Span Semantic Convention I commented that the definitions of ai_agent.workflow.* and ai_agent.task.* should not be unique to that domain, since there are tasks & workflows that can be modeled similarly outside of the ai_agent scope. The same is true for the existing experimental CICD pipeline attributes. Other examples might be cron jobs, business processes, and build tools like make or goyek, which defines its own goyek.flow.* tasks in https://pkg.go.dev/github.com/goyek/x/otelgoyek#example-package (thanks to @pellared for pointing me to goyek).

Describe the solution you'd like

The solution I would like to see is a unified set of attributes that describe such a "workflow", and then have only the unique attributes in the related specifications, similar to what we have today for the HTTP SemConv, where server.* or url.* attributes are used where applicable.

So instead of the current proposals, the future may look like the following (a rough instrumentation sketch follows these lists):

  • cicd.pipeline.name => workflow.name
  • cicd.pipeline.run.id => workflow.run.id
  • cicd.pipeline.task.name => workflow.task.name
  • cicd.pipeline.task.run.id => workflow.task.run.id
  • cicd.pipeline.task.run.url.full => workflow.task.url.full (or just url.full if it is a "task span"?)
  • cicd.pipeline.task.type => workflow.task.type
  • ...

and

  • ai_agent.workflow.name => workflow.name
  • ai_agent.task.name => workflow.task.name
  • ai_agent.task.output => workflow.task.output
  • ...

and

  • goyek.flow.output => workflow.output
  • ...
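As a sketch of what this could look like in instrumentation code, here are two task spans, one from a CICD pipeline and one from an AI agent, both carrying the same shared workflow.* attributes. This uses the OpenTelemetry Python tracing API; the span names and attribute values are invented examples, and the workflow.* attributes are the proposed names, not existing conventions.

from opentelemetry import trace

tracer = trace.get_tracer("unified-workflow-example")

# CICD task span: shared workflow.* attributes plus a domain-specific attribute.
with tracer.start_as_current_span(
    "build",
    attributes={
        "workflow.name": "release-pipeline",   # instead of cicd.pipeline.name
        "workflow.task.name": "build",         # instead of cicd.pipeline.task.name
        "cicd.pipeline.task.run.url.full": "https://ci.example.com/run/42",  # stays domain-specific
    },
):
    ...  # run the build step

# AI agent task span: the same shared attributes, different domain-specific ones.
with tracer.start_as_current_span(
    "summarize-document",
    attributes={
        "workflow.name": "research-agent",            # instead of ai_agent.workflow.name
        "workflow.task.name": "summarize-document",   # instead of ai_agent.task.name
    },
):
    ...  # run the agent task

Both spans can then be queried and visualized by the same workflow.* attributes, while the domain-specific attributes stay in their own namespaces.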

Both @open-telemetry/semconv-genai-approvers and @open-telemetry/semconv-cicd-approvers are very active working groups pushing semantic conventions forward, and it would be great to see some broader thinking about "workflows" (or however this should be named in a unified way).

Describe alternatives you've considered

No response

Additional context

Note that this may also require making progress on the long-standing question of "long running 'spans'"1:

open-telemetry/opentelemetry-specification#373, open-telemetry/opentelemetry-specification#2692, and other previous issues touched upon this topic, but so far there is no solution for long-running (++minutes, hours, days) or even "infinite" spans.

Footnotes

  1. the term "span" might not be correct in this context.

svrnm added the enhancement, experts needed, and triage:needs-triage labels on Dec 16, 2024
trask (Member) commented Dec 16, 2024

I think(?) this is related: open-telemetry/opentelemetry-java-instrumentation#12830

cc @cb645j

christophe-kamphaus-jemmic (Contributor) commented:

Related issue for long-running spans: #1648

svrnm (Member, Author) commented Dec 19, 2024

Thanks for the feedback here and during the SIG call. Especially @lmolkova's statement on not over-generalizing made me think more about this topic.

Before digging into that, let's ignore the "long running" issue for now; as shared in #1648, this is a spec issue and not a semantic convention issue. For the sake of simplicity, let's assume that a workflow run with its sub-tasks can be represented by a trace, with the run being the parent and the tasks the child spans.

Let me begin with where I am coming from: in my previous job as a solution engineer for APM I created lots of dashboards and visualizations, and a thing that annoyed me was that I had to do the same work again and again because things carried slightly different names, which required me to rework a lot. That's why I am a huge fan of the semantic conventions: they create consistency!

So, the leading question I had in mind is: how can I visualize workflow telemetry in my preferred visualization tool consistently?

To answer that question, let's take a look at how workflows are visualized across different solutions today.

What all workflows have in common is that they can be represented by a graph where the vertices represent tasks and the edges the dependencies between tasks. There are "start vertices" and "end vertices", and a run is a flow through that graph from a start to an end.

From an observability standpoint, what I want to understand at a high level is:

  • How many runs through my workflow are successful? How many fail?
    • If runs fail, which tasks have the most errors and are the root cause of this?
  • How long do my runs through my workflow take? How many are slow?
    • If my runs slow down, or if I want to speed them up, which tasks take the longest on average?

These questions are independent of the workflow I am looking at. This is how I would troubleshoot a CICD workflow, an AI Agent workflow, a business workflow, a build workflow, etc.

What is different is when I want to drill into a specific task and investigate further why there are errors, or why that task may be slow.

So, similar to a service map, a way to represent them for monitoring is something like the following:

flowchart TD
    style A fill:#0f0
    style B fill:#ff0
    style C fill:#0f0
    style D fill:#0f0
    style E fill:#0f0
    style F fill:#f00
    style G fill:#ff0
    subgraph "WF1, sr: 62/100, art: 4min"
        A[Setup<br>0 errors/min<br>15s AVG runtime] -->|100 runs/min| B(Step 1<br>2 errors/min<br>3min AVG runtime)
        B -->|98 runs/min| C(Step 2<br>0 errors/min<br>1.3s AVG runtime)
        C -->|40 runs/min| D[Step 3a<br>0 errors/min<br>12s AVG runtime]
        C -->|20 runs/min| E[Step 3b<br>0 errors/min<br>17s AVG runtime]
        C -->|40 runs/min| F[Step 3c<br>0 errors/min<br>3.7min AVG runtime]
        D --->|40 runs/min| G
        E --->|20 runs/min| G
        F --->|2 runs/min| G(Tear Down<br>2 errors/min<br>13s AVG runtime)
    end

If different workflows do not share a common set of attributes, engineers building visualizations either have to rebuild the same thing over and over again, or they have to add logic to select the right attributes depending on the workflow type they are looking at.

This means, from my point of view, there are a few common attributes:

  • workflow.name (which should also be the span name of the parent span)
  • workflow.id (which can be used to uniquely identify the workflow if names are equal)
  • workflow.task.name (which should also be the span name of the task span)
  • workflow.task.id (which can be used to uniquely identify the task within the workflow if names are equal)
  • workflow.task.run.id or workflow.task.flow.id which uniquely identifies the current run.

If available, the *.id attributes should be taken from the workflow tool, such that they can be linked across the observability solution and the workflow tool.
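Putting the span structure and these attributes together, a minimal sketch with the OpenTelemetry Python tracing API could look like the following (all names and ids are invented examples, and the workflow.* attributes are the proposed names, not existing conventions):

from opentelemetry import trace

tracer = trace.get_tracer("workflow-run-example")

# Parent span: the workflow run, with workflow.name as the span name.
with tracer.start_as_current_span(
    "nightly-build",
    attributes={
        "workflow.name": "nightly-build",
        "workflow.id": "wf-1234",            # taken from the workflow tool, if available
    },
):
    # Child span: one task of the run, with workflow.task.name as the span name.
    with tracer.start_as_current_span(
        "compile",
        attributes={
            "workflow.task.name": "compile",
            "workflow.task.id": "task-1",        # unique within the workflow
            "workflow.task.run.id": "run-5678",  # identifies this particular run of the task
        },
    ):
        ...  # task work happens here

With this shape, the visualization above only needs to group by workflow.name and workflow.task.name, regardless of the workflow tool.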

For metrics, there are also some common examples (a sketch of creating these instruments follows the list):

  • workflow.run.duration (similar to http.server.request.duration)
  • workflow.failed_runs
  • workflow.active_runs (similar to http.server.active_requests)
  • workflow.task.run.duration
  • workflow.task.active_runs
  • workflow.task.failed_runs
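A minimal sketch of how these instruments might be created with the OpenTelemetry Python metrics API; the metric names are the proposed ones from the list above, while the instrument kinds, units, and attribute values are my assumptions:

from opentelemetry import metrics

meter = metrics.get_meter("workflow-metrics-example")

run_duration = meter.create_histogram(
    name="workflow.run.duration",
    unit="s",
    description="Duration of workflow runs",
)
failed_runs = meter.create_counter(
    name="workflow.failed_runs",
    unit="{run}",
    description="Number of failed workflow runs",
)
active_runs = meter.create_up_down_counter(
    name="workflow.active_runs",
    unit="{run}",
    description="Number of workflow runs currently in progress",
)

# Usage during a single (failed) run; attribute values are invented examples.
active_runs.add(1, attributes={"workflow.name": "nightly-build"})
run_duration.record(240.0, attributes={"workflow.name": "nightly-build"})
failed_runs.add(1, attributes={"workflow.name": "nightly-build"})
active_runs.add(-1, attributes={"workflow.name": "nightly-build"})

Whether active runs are better modeled as an up-down counter or derived from spans is an open question; the instrument choices above only mirror http.server.active_requests and http.server.request.duration.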

Additionally, the following existing attributes may be used:

  • error.*
  • exception.*

There might be other common attributes, but these are the ones that would enable such a unified visualization. Other attributes would not be common, e.g. cicd.pipeline.task.run.url.full, since they are domain-specific or not available in all workflows (a tool like make, for example, has no concept of such a URL). I also now think that url.full would not be correct here, since this URL is not for a network call between services but for drilling deeper into the tool and its representation of that run.
