-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(parser.prometheusremotewrite, serializer.prometheusremotewrite): Native histogram support end-to-end #16121
base: master
Are you sure you want to change the base?
Conversation
Thanks so much for the pull request! |
!signed-cla |
498d7d6
to
0be4c14
Compare
0be4c14
to
f9d5ac0
Compare
Download PR build artifacts for linux_amd64.tar.gz, darwin_arm64.tar.gz, and windows_amd64.zip. 📦 Click here to get additional PR build artifactsArtifact URLs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Reimirno thanks for your contribution! I do have some comments in the code. Additionally, please split the PR into one part for the parser and one part for the serializer so we do one thing per PR!
@@ -16,6 +16,9 @@ additional configuration options for Prometheus Remote Write Samples. | |||
|
|||
## Data format to consume. | |||
data_format = "prometheusremotewrite" | |||
|
|||
## Whether to parse a native histogram into one Telegraf metric | |||
keep_native_histograms_atomic = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please use metric_version
similar to what is done in the prometheus parser?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
switching metric_version from prometheus parser seems to have more implications other than affecting how histogram is parsed.
Now, on the prometheusremotewrite side, it seems the current implementation is equivalent to v2. I could emulate a v1 here - where the atomic parsing of both classic and native would be implemented.
@@ -9,13 +9,16 @@ import ( | |||
"github.com/prometheus/common/model" | |||
"github.com/prometheus/prometheus/prompb" | |||
|
|||
"github.com/gogo/protobuf/proto" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need this instead of the upstream and well maintained github.com/golang/protobuf
package?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is what prometheus uses to serialize their protobuf in remote write.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just found out the rationale for them to use this protobuf lib: prometheus/prometheus#14668.
It seem that they do have plan to migrate away from this and to a more maintained lib. So we should probably just do that.
// If keeping it atomic, we serialize the histogram into one single Telegraf metric | ||
// For now we keep the histogram as a serialized proto | ||
// Another option is to convert it to multi-field Telegraf metric | ||
serialized, err := proto.Marshal(&hp) | ||
if err != nil { | ||
return nil, fmt.Errorf("failed to marshal histogram: %w", err) | ||
} | ||
fields := map[string]any{ | ||
metricName: string(serialized), | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You really should go for the multi-field
option as you cannot do anything in Telegraf with the serialized format but only can pass this through to the prometheusremotewrite
serializer...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally agree, I wanted to do that too, but it's actually not straightforward.
Prometheus native histogram protobuf has quite a few fields being array, an even array of struct. There is no good way of representing that in Telegraf metric field (Value data type: Float | Integer | UInteger | String | Boolean)
(Same thing goes with otel exponential histogram)
Maybe I am missing something - could use some advice :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In particular, the problematic fields are:
-
count
(andzero_count
):oneof
in protobuf
This exist to hande both inthistogram and floathistogram. We can break up into two fieldscountInt
orcountFloat
and potentially add a flagisFloatHistogram
to achieve information lossless conversion. Or we can convert all to floathistogram, and only store acount
float. This is what the current implementation in Telegraf prometheusremotewrite parser is doing. -
negative_counts
negative_deltas
etc: int or float array.
We break down to index-suffixed fields likenegative_counts_0=... negative_counts_1=...
-
negative_spans
andpositive_spans
: array ofBucketSpan
which has two fieldsoffset
andlength
.
We can break down to index-and-field-suffixed fields likenegative_spans_0_offset=xxx negative_spans_0_length=xxx ...
Summary
Add a configuration field to
prometheusremotewrite
parser to change its behavior when parsing Prometheus native histogram.Previously
prometheusremotewrite
parser parses a native histogram into multiple Telegraf metric (the conversion is not lossless and is irreversible); now it parses it into one single Telegraf metric. Correspondingly, add a code path toprometheusremotewrite
serializer to be able to handle this single Telegraf "native histogram" metric and convert back to Prometheus native histogram. NOTE: logic that handles classic histogram is orthogonal and not changed.This treatment of native histograms is able to preserve its benefits in terms of correctness guarantee (atomicity, no correctness issue due to write batching) and its performance gain (low cardinality, sparse data structure). More importantly, this means Telegraf can supports native histogram end-to-end: ingest a native histogram and write out a native histogram.
Detailed rationale, including a diagram, please see #16120
Test
Previously, using a
prometheusremotewrite
HTTP listener v2 input to accept prometheus remote write, and aprometheusremotewrite
HTTP output, Telegraf ingests prometheus native histogram but writes out several counters, as if this were a classic histogram.In the query here, a native histogram get translated into several "bucket" metrics (but without
_bucket
suffix and with a<metric_name>_le
tag.Now, after this PR:
This native histogram now gets correctly written out as a native histogram.
I can add unit test / integration test when the community reaches some agreement on this implementation.
Checklist
Related issues
resolves #16120