Not sure why `response_result_data_Hardware_HDD_used_by_node` is used as a label in prometheus metrics #61

pruneau628 · 2023-04-05T15:09:12Z

It artificially creates different time series. Is there a rationale for this?

WadeBarnes · 2023-04-05T15:57:06Z

There are a number of text fields returned by the call to validator-info that we want to display. Prometheus does not handle text, so adding the text fields to labels allows the text to be accessed in Prometheus queries. This was done when we were initially evaluating Prometheus and Influx.
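To illustrate the approach described here, the following is a minimal sketch of how a text field can ride along as a label on a sample whose numeric value is a constant 1. The metric name `node_info` and the label names are hypothetical and are not taken from this project; only the escaping rules (backslash, double quote, newline) follow the Prometheus text exposition format.

```python
def escape_label_value(value: str) -> str:
    """Escape a label value for the Prometheus text exposition format."""
    # Backslash must be escaped first so later escapes are not doubled.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def info_metric_line(name: str, labels: dict[str, str]) -> str:
    """Render one sample whose value is always 1; the text lives in the labels."""
    rendered = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{rendered}}} 1"


# Hypothetical example: slow-changing text attributes exposed as labels.
print(info_metric_line("node_info", {"os_version": "Ubuntu 20.04", "ip": "10.0.0.1"}))
```

This pattern works reasonably well when the label values rarely change, which is the case for attributes like OS version.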

pruneau628 · 2023-04-05T17:18:47Z

I do understand the purpose, and I can get behind including attributes that seldom change, like IP configuration or OS version, but disk space usage changes a bit too often to qualify as a label. It also changes randomly, which makes it even worse as a label.
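The concern can be sketched in a few lines. Prometheus identifies a time series by its metric name plus its full label set, so each new disk usage reading stored as a label value starts a new series. The metric and label names below are hypothetical, and the readings are made-up sample values in the same format as the field under discussion.

```python
def series_key(name, labels):
    """A series is identified by its metric name plus its complete label set."""
    return (name, tuple(sorted(labels.items())))


def count_series(samples):
    """Count the distinct series produced by a list of (name, labels) samples."""
    return len({series_key(name, labels) for name, labels in samples})


# Made-up readings taken at three successive scrapes.
readings = ["1037 MBs", "1041 MBs", "1044 MBs"]

# Disk usage stored as a label: every new reading creates a new series.
as_label = [("node_info", {"node": "n1", "hdd_used": r}) for r in readings]

# Disk usage stored as the sample value: the label set never changes.
as_value = [("node_hdd_used_bytes", {"node": "n1"}) for _ in readings]

print(count_series(as_label))  # 3
print(count_series(as_value))  # 1
```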

WadeBarnes · 2023-04-06T12:02:32Z

It would be better to handle text fields through Influx and remove these labels from Prometheus, which brings up a question. Do we use Prometheus and Influx, or just Influx? Prometheus makes collecting and graphing metrics super easy, but it does not handle text. Influx is more difficult when it comes to queries, but supports text and metrics. Or is there another solution? @ConnorBarnes88, do you have any thoughts on this?

pruneau628 · 2023-04-06T12:25:19Z

From a restricted point of view, `response_result_data_Hardware_HDD_used_by_node` is actually a number and should be treated as such.
Removing Prometheus from the equation creates a void with respect to raising alarms.
It could also make sense to use this monitoring stack to gather metrics from agents and other parts of the ecosystem, and Prometheus is still a good candidate for that.

Now, I have the feeling I'm missing the big picture here.

WadeBarnes · 2023-04-06T12:49:06Z

> From a restricted point of view, `response_result_data_Hardware_HDD_used_by_node` is actually a number and should be treated as such.

The issue is that we receive it as text in human-readable form, where the number could be in MBs, GBs, etc.:

"Hardware": {
  "HDD_used_by_node": "1037 MBs"
},

Converting that into an actual number will require some transformation at the Telegraf level. Converting it to a number would allow us to remove it as a label and stop treating it as text.
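As a rough sketch of the transformation, the logic could look like the following. This is plain Python meant to show the parsing only; it is not Telegraf configuration, and how it would map onto a Telegraf processor is not shown. The function name is hypothetical, and the unit factors are an assumption: the thread does not say whether "MBs" means multiples of 1024 or of 1000, so binary multiples are assumed here.

```python
import re

# Assumption: units are binary multiples (1 MB = 1024 * 1024 bytes).
_UNIT_FACTORS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}

# Matches strings such as "1037 MBs" or "2 GBs": a number, a unit, an optional "s".
_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([KMGT]?B)s?\s*$", re.IGNORECASE)


def parse_size_to_bytes(text: str) -> int:
    """Convert a human-readable size string into a number of bytes."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognized size string: {text!r}")
    number, unit = match.groups()
    return int(float(number) * _UNIT_FACTORS[unit.upper()])


print(parse_size_to_bytes("1037 MBs"))  # 1087373312
```

Normalizing everything to bytes keeps the resulting metric comparable across nodes regardless of which unit each node happens to report.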

> Removing Prometheus from the equation creates a void with respect to raising alarms.
>
> It could also make sense to use this monitoring stack to gather metrics from agents and other parts of the ecosystem, and Prometheus is still a good candidate for that.

Agreed

> Now, I have the feeling I'm missing the big picture here.

I don't think you are.

pruneau628 · 2023-04-06T15:17:41Z

Thanks for your trust @WadeBarnes.
I have been looking into my own assertion about Prometheus, and it actually looks like the latest InfluxDB OSS version, 2.7.0, supports alarms as well.
If Prometheus is still perceived as useless, this is probably the way to go.
But that's just from reading the docs, not experimentation yet.

ConnorBarnes88 · 2023-04-06T15:50:54Z

Regarding all of this: there is no particular reason why `"Hardware": { "HDD_used_by_node": "1037 MBs" }` is being used as a label for Prometheus. It very well could, and for consistency should, be handled by Influx if it remains in text form. It is likely possible to convert the HDD usage to a numeric value (it is currently a text value by default) through Telegraf, assuming it will let us handle MBs and GBs, or through a different workaround.

From my knowledge, or I'd say lack thereof :), Prometheus is great for time series data and alert handling. It does have its faults and isn't great at handling text, if at all in most instances. Influx can do the same and then some, but I'm not currently aware of its alerting capabilities. Having access to if/else as needed is handy, and it's great with text. However, it's a little complex for something Prometheus can do so simply.

Others have mentioned tools like Loki, which I haven't looked into myself. I believe it's pretty flexible on all fronts from what little I've heard.

I'm open to feedback and suggestions. I'm definitely not the most knowledgeable on this subject, but I'm learning as I go.

Thanks,
Connor Barnes

pruneau628 · 2023-04-06T19:59:46Z

Wow, I was not expecting a simple question to devolve into a full-on architecture review.
I'm definitely interested in contributing, but not sure how to go about it.

WadeBarnes · 2023-04-11T13:17:16Z

lol, you picked the right (or wrong) question @pruneau628. For this metric I think we could do a conversion so Prometheus can consume it more easily. However, there are many other data elements from the validator-info calls that we're interested in and that are in text form. So we still have a few choices to make: stick with a mix of Prometheus and Influx and use each for its strengths, or switch to Influx completely. Both choices introduce a certain level of complexity; the question is which will be easier to maintain in the long run.

Perhaps you could assist with the decisions by doing a bit more investigation into the capabilities of Influx 2.x.

pruneau628 · 2023-04-11T19:01:01Z

Ok, we are doing just that, and will report on the progress as soon as possible.

GuillaumeBourque-QC · 2024-08-06T21:11:26Z

Hello all,

I'm very new to all this, but if I may add to the discussion, why not use Grafana for the alerting part? Having run a couple of InfluxDB servers in production with Telegraf and Grafana for about 5 years without a single issue, I would say that I don't see the complexity in using these 3 products (I have never run Prometheus in production, and it looks more complicated to us ;-)).

But I may have missed the purpose of the alarms in InfluxDB that you are referring to.

@WadeBarnes can you shed some light here? TIA

GuillaumeBourque-QC · 2024-09-26T19:32:19Z

Hello @WadeBarnes, for this project, which is not too complex, I don't see the value in storing all metrics in both Prometheus and InfluxDB.

I would make Prometheus and Alertmanager optional, since I was able to convert all dashboards from Prometheus to InfluxDB only and all alerts to Grafana alerting. This will be pushed as a PR once my code cleanup is done.

pruneau628 mentioned this issue Apr 11, 2023

Given the capabilities of InfluxDb 2.7+, investigate wether Prometheus is still required #62

Open
