fix grafana dashboard
Signed-off-by: fangfenghuang <15068704759@163.com>
fangfenghuang committed Sep 23, 2024
1 parent cd5b74f commit 15916ce
Showing 3 changed files with 13 additions and 1,166 deletions.
16 changes: 7 additions & 9 deletions docs/dashboard.md
@@ -1,20 +1,18 @@
## Grafana Dashboard
# hami-vgpu-dashboard

- You can load this dashboard json file [gpu-dashboard.json](./gpu-dashboard.json)
- You can find the hami-vgpu-dashboard here: [https://grafana.com/grafana/dashboards/21833-hami-vgpu-dashboard](https://grafana.com/grafana/dashboards/21833-hami-vgpu-dashboard)

- This dashboard also includes some NVIDIA DCGM metrics:
- This dashboard also includes some [NVIDIA DCGM metrics](https://github.com/NVIDIA/dcgm-exporter); deploy dcgm-exporter with: `kubectl create -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/dcgm-exporter.yaml`

[dcgm-exporter](https://github.com/NVIDIA/dcgm-exporter) deploy:`kubectl create -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/dcgm-exporter.yaml`

- use this prometheus custom metric configure:
- add prometheus custom metric configuration:

```yaml
- job_name: 'kubernetes-vgpu-exporter'
- job_name: 'kubernetes-hami-exporter'
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- source_labels: [__meta_kubernetes_endpoints_name]
regex: vgpu-device-plugin-monitor
regex: hami-.*
replacement: $1
action: keep
- source_labels: [__meta_kubernetes_pod_node_name]
@@ -47,7 +45,7 @@
action: replace
```
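A note on the `regex` change above: Prometheus anchors relabel regexes at both ends, so with `action: keep` the new pattern `hami-.*` keeps only endpoints whose name starts with `hami-`. The sketch below previews that behaviour with `grep`. The endpoint names are hypothetical, chosen only for illustration, and `grep -E` stands in for Prometheus's RE2 matcher, which behaves the same for a pattern this simple.

```bash
#!/usr/bin/env bash
# Hypothetical endpoint names (assumptions, not taken from a real cluster).
names="hami-device-plugin-monitor
hami-scheduler
vgpu-device-plugin-monitor
my-hami-service"

# Emulate Prometheus's implicit anchoring by wrapping the pattern in ^(...)$.
kept=$(printf '%s\n' "$names" | grep -E '^(hami-.*)$')

# Prints only the names that the keep rule would retain.
printf '%s\n' "$kept"
```

Only the two names beginning with `hami-` are printed. `my-hami-service` is dropped because the match must start at the first character of the name.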
- reload Prometheus
- reload Prometheus:
```bash
curl -XPOST http://{promethuesServer}:{port}/-/reload
13 changes: 6 additions & 7 deletions docs/dashboard_cn.md
@@ -1,19 +1,18 @@
## Grafana Dashboard
# hami-vgpu-dashboard

- You can import this [gpu-dashboard.json](./gpu-dashboard.json) into Grafana
- This dashboard also includes some NVIDIA DCGM metrics:
- You can find the hami-vgpu-dashboard here: [https://grafana.com/grafana/dashboards/21833-hami-vgpu-dashboard](https://grafana.com/grafana/dashboards/21833-hami-vgpu-dashboard)

[dcgm-exporter](https://github.com/NVIDIA/dcgm-exporter) deployment: `kubectl create -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/dcgm-exporter.yaml`
- This dashboard also includes some [NVIDIA DCGM metrics](https://github.com/NVIDIA/dcgm-exporter); deploy dcgm-exporter with: `kubectl create -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/dcgm-exporter.yaml`

- Add the custom Prometheus metric configuration:

```yaml
- job_name: 'kubernetes-vgpu-exporter'
- job_name: 'kubernetes-hami-exporter'
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- source_labels: [__meta_kubernetes_endpoints_name]
regex: vgpu-device-plugin-monitor
regex: hami-.*
replacement: $1
action: keep
- source_labels: [__meta_kubernetes_pod_node_name]
@@ -46,7 +45,7 @@
action: replace
```
- Load the Prometheus configuration:
- Hot-reload the Prometheus configuration:
```bash
curl -XPOST http://{promethuesServer}:{port}/-/reload