Additional recommended alerts #1135

Open
2 tasks done
ravindk89 opened this issue Feb 20, 2024 · 9 comments

Comments

@ravindk89
Collaborator

ravindk89 commented Feb 20, 2024

Summary

From an internal discussion, we should expand the alerting page to include the following list of recommended metrics:

| Metric | Description |
|---|---|
| minio_node_drive_free_bytes | Total storage available on a drive. |
| minio_node_drive_free_inodes | Total free inodes. |
| minio_node_drive_latency_us | Average last minute latency in µs for drive API storage operations. |
| minio_node_drive_offline_total | Total drives offline in this node. |
| minio_node_drive_online_total | Total drives online in this node. |
| minio_node_drive_total | Total drives in this node. |
| minio_node_drive_total_bytes | Total storage on a drive. |
| minio_node_drive_used_bytes | Total storage used on a drive. |
| minio_node_drive_errors_timeout | Total number of drive timeout errors since server start |
| minio_node_drive_errors_availability | Total number of drive I/O errors, permission denied and timeouts since server start |
| minio_node_drive_io_waiting | Total number I/O operations waiting on drive |

There are a lot of metrics here and the page already has some examples, so I'm thinking we can use a tab setup, something like

| Example Alerts | Recommended Alerts |

to help constrain the default length of the procedure.

Goals

List the in-scope goals

  • Add alert examples matching the metrics above
  • Possibly tab out or otherwise organize page for readability
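
As a rough illustration of what alerts on these metrics would check, here is a minimal Python sketch that parses Prometheus text-format samples and flags low free space and offline drives. The sample values are copied from the play.min.io node endpoint output quoted later in this thread. The 10% free-space threshold and the helper names `parse` and `check_drives` are assumptions made for illustration; they are not recommended values and not part of MinIO or Prometheus.

```python
import re

# One sample per line: name, optional {labels}, value.
LINE = re.compile(r'^(\w+)(?:\{(.*)\})?\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')

# Values copied from the node endpoint output in this thread.
SAMPLE = """\
minio_node_drive_free_bytes{drive="/disk1/data",server="play.min.io:9000"} 3.9700221952e+10
minio_node_drive_total_bytes{drive="/disk1/data",server="play.min.io:9000"} 4.292870144e+10
minio_node_drive_free_bytes{drive="/disk2/data",server="play.min.io:9000"} 4.0129953792e+10
minio_node_drive_total_bytes{drive="/disk2/data",server="play.min.io:9000"} 4.292870144e+10
minio_node_drive_offline_total{server="play.min.io:9000"} 0
"""

def parse(text):
    """Yield (metric_name, labels, value) for every sample line in text."""
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            labels = dict(LABEL.findall(match.group(2) or ""))
            yield match.group(1), labels, float(match.group(3))

def check_drives(text, min_free_ratio=0.10):
    """Return alert messages; min_free_ratio is an illustrative threshold."""
    free, total, alerts = {}, {}, []
    for name, labels, value in parse(text):
        if name == "minio_node_drive_free_bytes":
            free[labels["drive"]] = value
        elif name == "minio_node_drive_total_bytes":
            total[labels["drive"]] = value
        elif name == "minio_node_drive_offline_total" and value > 0:
            alerts.append(f"{int(value)} drive(s) offline on {labels.get('server')}")
    # Compare free space against capacity for drives that report both.
    for drive, free_bytes in sorted(free.items()):
        if drive in total and free_bytes / total[drive] < min_free_ratio:
            alerts.append(f"{drive}: free space below {min_free_ratio:.0%}")
    return alerts

print(check_drives(SAMPLE))
```

In an actual Prometheus setup the same comparisons would live in alerting rule expressions rather than in a script; the sketch only shows which metrics each check combines.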

Non-Goals

Extensive testing of Prometheus + Alert Manager w/ the above metrics

Additional context
Add any other context or screenshots about the feature request here.

@ravindk89
Collaborator Author

@kannappanr some assistance:

curl --retry 10 -L -X GET https://play.min.io/minio/v2/metrics/cluster | grep -E '^minio_[\s a-z _]*_drive'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  
minio_cluster_drive_offline_total{server="play.min.io:9000"} 0
minio_cluster_drive_online_total{server="play.min.io:9000"} 4
minio_cluster_drive_total{server="play.min.io:9000"} 4
minio_cluster_health_erasure_set_healing_drives{pool="0",server="play.min.io:9000",set="0"} 0
minio_cluster_health_erasure_set_online_drives{pool="0",server="play.min.io:9000",set="0"} 4

Most of the recommended list as discussed does not appear in cluster metrics.

They do appear for the node endpoint:

minio_node_drive_errors_availability{drive="/disk1/data",server="play.min.io:9000"} 0
minio_node_drive_errors_availability{drive="/disk2/data",server="play.min.io:9000"} 0
minio_node_drive_errors_availability{drive="/disk3/data",server="play.min.io:9000"} 0
minio_node_drive_errors_availability{drive="/disk4/data",server="play.min.io:9000"} 0
minio_node_drive_errors_timeout{drive="/disk1/data",server="play.min.io:9000"} 0
minio_node_drive_errors_timeout{drive="/disk2/data",server="play.min.io:9000"} 0
minio_node_drive_errors_timeout{drive="/disk3/data",server="play.min.io:9000"} 0
minio_node_drive_errors_timeout{drive="/disk4/data",server="play.min.io:9000"} 0
minio_node_drive_free_bytes{drive="/disk1/data",server="play.min.io:9000"} 3.9700221952e+10
minio_node_drive_free_bytes{drive="/disk2/data",server="play.min.io:9000"} 4.0129953792e+10
minio_node_drive_free_bytes{drive="/disk3/data",server="play.min.io:9000"} 4.0129642496e+10
minio_node_drive_free_bytes{drive="/disk4/data",server="play.min.io:9000"} 4.013072384e+10
minio_node_drive_free_inodes{drive="/disk1/data",server="play.min.io:9000"} 2.0950584e+07
minio_node_drive_free_inodes{drive="/disk2/data",server="play.min.io:9000"} 2.0950777e+07
minio_node_drive_free_inodes{drive="/disk3/data",server="play.min.io:9000"} 2.0950777e+07
minio_node_drive_free_inodes{drive="/disk4/data",server="play.min.io:9000"} 2.0950777e+07
minio_node_drive_io_waiting{drive="/disk1/data",server="play.min.io:9000"} 0
minio_node_drive_io_waiting{drive="/disk2/data",server="play.min.io:9000"} 0
minio_node_drive_io_waiting{drive="/disk3/data",server="play.min.io:9000"} 0
minio_node_drive_io_waiting{drive="/disk4/data",server="play.min.io:9000"} 0
minio_node_drive_latency_us{api="storage.CreateFile",drive="/disk1/data",server="play.min.io:9000"} 3600
minio_node_drive_latency_us{api="storage.CreateFile",drive="/disk2/data",server="play.min.io:9000"} 3868
minio_node_drive_latency_us{api="storage.CreateFile",drive="/disk3/data",server="play.min.io:9000"} 3454
minio_node_drive_latency_us{api="storage.CreateFile",drive="/disk4/data",server="play.min.io:9000"} 4263
minio_node_drive_latency_us{api="storage.Delete",drive="/disk1/data",server="play.min.io:9000"} 35
minio_node_drive_latency_us{api="storage.Delete",drive="/disk2/data",server="play.min.io:9000"} 34
minio_node_drive_latency_us{api="storage.Delete",drive="/disk3/data",server="play.min.io:9000"} 32
minio_node_drive_latency_us{api="storage.Delete",drive="/disk4/data",server="play.min.io:9000"} 45
minio_node_drive_latency_us{api="storage.DiskInfo",drive="/disk1/data",server="play.min.io:9000"} 30
minio_node_drive_latency_us{api="storage.DiskInfo",drive="/disk2/data",server="play.min.io:9000"} 38
minio_node_drive_latency_us{api="storage.DiskInfo",drive="/disk3/data",server="play.min.io:9000"} 25
minio_node_drive_latency_us{api="storage.DiskInfo",drive="/disk4/data",server="play.min.io:9000"} 39
minio_node_drive_latency_us{api="storage.ListVols",drive="/disk1/data",server="play.min.io:9000"} 1000
minio_node_drive_latency_us{api="storage.ListVols",drive="/disk2/data",server="play.min.io:9000"} 615
minio_node_drive_latency_us{api="storage.ListVols",drive="/disk3/data",server="play.min.io:9000"} 643
minio_node_drive_latency_us{api="storage.ListVols",drive="/disk4/data",server="play.min.io:9000"} 2280
minio_node_drive_latency_us{api="storage.ReadFileStream",drive="/disk2/data",server="play.min.io:9000"} 58
minio_node_drive_latency_us{api="storage.ReadFileStream",drive="/disk3/data",server="play.min.io:9000"} 64
minio_node_drive_latency_us{api="storage.ReadXL",drive="/disk1/data",server="play.min.io:9000"} 58
minio_node_drive_latency_us{api="storage.ReadXL",drive="/disk2/data",server="play.min.io:9000"} 60
minio_node_drive_latency_us{api="storage.ReadXL",drive="/disk3/data",server="play.min.io:9000"} 49
minio_node_drive_latency_us{api="storage.ReadXL",drive="/disk4/data",server="play.min.io:9000"} 71
minio_node_drive_latency_us{api="storage.RenameData",drive="/disk1/data",server="play.min.io:9000"} 802
minio_node_drive_latency_us{api="storage.RenameData",drive="/disk2/data",server="play.min.io:9000"} 1039
minio_node_drive_latency_us{api="storage.RenameData",drive="/disk3/data",server="play.min.io:9000"} 868
minio_node_drive_latency_us{api="storage.RenameData",drive="/disk4/data",server="play.min.io:9000"} 1075
minio_node_drive_latency_us{api="storage.StatVol",drive="/disk1/data",server="play.min.io:9000"} 41
minio_node_drive_latency_us{api="storage.StatVol",drive="/disk2/data",server="play.min.io:9000"} 60
minio_node_drive_latency_us{api="storage.StatVol",drive="/disk3/data",server="play.min.io:9000"} 20
minio_node_drive_latency_us{api="storage.StatVol",drive="/disk4/data",server="play.min.io:9000"} 33
minio_node_drive_latency_us{api="storage.WalkDir",drive="/disk1/data",server="play.min.io:9000"} 234
minio_node_drive_latency_us{api="storage.WalkDir",drive="/disk2/data",server="play.min.io:9000"} 329
minio_node_drive_latency_us{api="storage.WalkDir",drive="/disk3/data",server="play.min.io:9000"} 465
minio_node_drive_latency_us{api="storage.WalkDir",drive="/disk4/data",server="play.min.io:9000"} 632
minio_node_drive_offline_total{server="play.min.io:9000"} 0
minio_node_drive_online_total{server="play.min.io:9000"} 4
minio_node_drive_total{server="play.min.io:9000"} 4
minio_node_drive_total_bytes{drive="/disk1/data",server="play.min.io:9000"} 4.292870144e+10
minio_node_drive_total_bytes{drive="/disk2/data",server="play.min.io:9000"} 4.292870144e+10
minio_node_drive_total_bytes{drive="/disk3/data",server="play.min.io:9000"} 4.292870144e+10
minio_node_drive_total_bytes{drive="/disk4/data",server="play.min.io:9000"} 4.292870144e+10
minio_node_drive_used_bytes{drive="/disk1/data",server="play.min.io:9000"} 3.228479488e+09
minio_node_drive_used_bytes{drive="/disk2/data",server="play.min.io:9000"} 2.798747648e+09
minio_node_drive_used_bytes{drive="/disk3/data",server="play.min.io:9000"} 2.799058944e+09
minio_node_drive_used_bytes{drive="/disk4/data",server="play.min.io:9000"} 2.7979776e+09

We had previously discussed de-emphasizing the node-level metrics because they should be included in the cluster endpoint as a rollup - is this a bug? cc/ @donatello @shtripat as I think you both have some experience here
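
The comparison above can be repeated mechanically. The following is a minimal sketch, assuming the output of both endpoints has been saved as text; it lists node drive metrics that have no cluster metric with the same suffix. Matching by name suffix is a heuristic chosen for this example, so a roll-up published under a different name would still be reported as missing. The function names are hypothetical, and the sample lines are a shortened copy of the output above.

```python
import re

# A sample line: metric name, optional {labels}, then a value.
SAMPLE_LINE = re.compile(r'^(minio_\w+)(?:\{.*\})?\s+\S+$')

CLUSTER = """\
minio_cluster_drive_offline_total{server="play.min.io:9000"} 0
minio_cluster_drive_online_total{server="play.min.io:9000"} 4
minio_cluster_drive_total{server="play.min.io:9000"} 4
"""

NODE = """\
minio_node_drive_free_bytes{drive="/disk1/data",server="play.min.io:9000"} 3.9700221952e+10
minio_node_drive_io_waiting{drive="/disk1/data",server="play.min.io:9000"} 0
minio_node_drive_offline_total{server="play.min.io:9000"} 0
minio_node_drive_online_total{server="play.min.io:9000"} 4
minio_node_drive_total{server="play.min.io:9000"} 4
"""

def metric_names(text):
    """Return the set of metric names found in text-format output."""
    names = set()
    for line in text.splitlines():
        match = SAMPLE_LINE.match(line.strip())
        if match:
            names.add(match.group(1))
    return names

def suffixes(names, prefix):
    """Strip prefix from the names that carry it; ignore the rest."""
    return {name[len(prefix):] for name in names if name.startswith(prefix)}

def missing_rollups(cluster_text, node_text):
    """Node metrics with no cluster metric of the same suffix (heuristic)."""
    cluster = suffixes(metric_names(cluster_text), "minio_cluster_")
    node = suffixes(metric_names(node_text), "minio_node_")
    return sorted("minio_node_" + suffix for suffix in node - cluster)

for name in missing_rollups(CLUSTER, NODE):
    print(name)
```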

@ravindk89
Collaborator Author

https://github.com/minio/minio/blob/master/docs/metrics/prometheus/list.md#drive-metrics

Basically, very few of these seem to roll up properly.

djwfyi added a commit that referenced this issue Mar 8, 2024
Partially addresses #1135

To consider:
I added the tabs as part of step 3 of the procedure, but we might want
to consider having a recommended alerts section separate from the
procedure, perhaps above the "Dashboards" heading. Let me know your
thoughts.
@djwfyi djwfyi assigned ravindk89 and unassigned djwfyi Mar 19, 2024
@bh4t
Contributor

bh4t commented Apr 9, 2024

@kannappanr can you please assist here?

@ravindk89
Collaborator Author

This might be somewhat resolved with metrics v3, but until customers have had enough time to move onto v3, we will need to maintain both:

  • Recommended alerts for metrics v2
  • Recommended alerts for metrics v3

We will also need fixups to ensure that node-level metrics are rolled up appropriately.

@allanrogerr
Contributor

On metrics v3:

These node metrics do not roll up to any cluster metrics:

  • Total used inodes on a drive
  • Total free inodes on a drive
  • Total inodes available on a drive
  • Average last minute latency in µs for drive API storage operations
  • Total timeout errors on a drive
  • Total availability errors (I/O errors, timeouts) on a drive
  • Total waiting I/O operations on a drive

Node metric "Total storage available on a drive in bytes" rolls up to cluster metrics:

	Total cluster usable storage capacity in bytes
	Total cluster raw storage capacity in bytes

Node metric "Total storage free on a drive in bytes" rolls up to cluster metrics:

	Total cluster usable storage free in bytes
	Total cluster raw storage free in bytes

Node metric "Total storage used on a drive in bytes" rolls up to cluster metric:

	Total cluster usage in bytes

Node metric "Count of offline drives" rolls up to cluster metric:

	Count of offline drives in the cluster

Node metric "Count of online drives" rolls up to cluster metric:

	Count of online drives in the cluster

Node metric "Count of all drives" rolls up to cluster metric:

	Count of all drives in the cluster
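
To make the idea of a roll-up concrete, here is a minimal Python sketch that sums per-drive samples into one total per metric. It assumes a raw total is simply the sum of the per-drive values, which this thread does not confirm, and it ignores usable capacity, which would also depend on erasure-coding parity. The metric names and values are the v2 node samples quoted earlier in this thread, since the v3 metrics are only given by description here; `roll_up` is a hypothetical helper.

```python
import re
from collections import defaultdict

# A sample line: metric name, optional {labels}, then a value.
SAMPLE_LINE = re.compile(r'^(\w+)(?:\{.*\})?\s+(\S+)$')

# Per-drive samples copied from the v2 node endpoint output in this thread.
NODE_OUTPUT = """\
minio_node_drive_free_bytes{drive="/disk1/data",server="play.min.io:9000"} 3.9700221952e+10
minio_node_drive_free_bytes{drive="/disk2/data",server="play.min.io:9000"} 4.0129953792e+10
minio_node_drive_free_bytes{drive="/disk3/data",server="play.min.io:9000"} 4.0129642496e+10
minio_node_drive_free_bytes{drive="/disk4/data",server="play.min.io:9000"} 4.013072384e+10
minio_node_drive_total_bytes{drive="/disk1/data",server="play.min.io:9000"} 4.292870144e+10
minio_node_drive_total_bytes{drive="/disk2/data",server="play.min.io:9000"} 4.292870144e+10
minio_node_drive_total_bytes{drive="/disk3/data",server="play.min.io:9000"} 4.292870144e+10
minio_node_drive_total_bytes{drive="/disk4/data",server="play.min.io:9000"} 4.292870144e+10
minio_node_drive_used_bytes{drive="/disk1/data",server="play.min.io:9000"} 3.228479488e+09
minio_node_drive_used_bytes{drive="/disk2/data",server="play.min.io:9000"} 2.798747648e+09
minio_node_drive_used_bytes{drive="/disk3/data",server="play.min.io:9000"} 2.799058944e+09
minio_node_drive_used_bytes{drive="/disk4/data",server="play.min.io:9000"} 2.7979776e+09
"""

def roll_up(text, wanted):
    """Sum the values of every sample whose metric name is in wanted."""
    totals = defaultdict(float)
    for line in text.splitlines():
        match = SAMPLE_LINE.match(line.strip())
        if match and match.group(1) in wanted:
            totals[match.group(1)] += float(match.group(2))
    return dict(totals)

totals = roll_up(NODE_OUTPUT, {
    "minio_node_drive_free_bytes",
    "minio_node_drive_total_bytes",
    "minio_node_drive_used_bytes",
})
for name, value in sorted(totals.items()):
    print(name, int(value))
```

For this sample the summed free and used bytes add up to the summed total bytes, which is a quick sanity check on the parsing.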

@ravindk89
Collaborator Author

@kannappanr @anjalshireesh was there still progress on addressing the metrics v2 rollups above, or should we just proceed with documenting the node-level ones for now?

Otherwise we can just focus on the cluster rollups that do work and drop the rest until v3 stabilizes.

@ravindk89 ravindk89 removed their assignment May 10, 2024
@feorlen
Collaborator

feorlen commented Jun 13, 2024

Re: v2 rollup, a customer reported that these metrics were "missing" after an upgrade because they are now found under minio/v2/metrics/node:

minio_cluster_replication_link_offline_duration_seconds
minio_cluster_replication_link_online
minio_cluster_replication_current_active_workers
minio_cluster_replication_current_link_latency_ms
minio_cluster_replication_recent_backlog_count
minio_cluster_replication_last_minute_queued_count
minio_cluster_replication_credential_errors
minio_cluster_replication_current_transfer_rate
minio_cluster_replication_last_minute_queued_bytes
minio_cluster_replication_max_queued_count

@ravindk89
Collaborator Author

@kannappanr @anjalshireesh are we generally going to leave metrics v2 as-is for now then, and focus on metrics v3? Our attempt to document the recommended alerts gets flaky because we do not list the /node metrics at all, since historically those have not been recommended for use.

@feorlen
Collaborator

feorlen commented Jun 14, 2024

see also minio/minio#19932
