- Install and configure helm and helmfile (including configuring kubectl context for your cluster).
- If using local storage for persistence, set up a storage class on your cluster that can handle dynamic persistent volumes. We use Rancher's local-path-provisioner by default.
- Label nodes for scheduling monitoring pods if using affinity:
  ```console
  kubectl label node/your-node 'aistore.nvidia.com/role_monitoring=true'
  ```
- Create an environment for your deployment based on the values files for either the default (everything) or the external deployment (no Grafana/Loki).
- Update the values for your deployment environment.
- Export any required environment variables, e.g. if bundling Grafana:
  ```console
  export GRAFANA_PASSWORD=<password>
  ```
- Run `helmfile sync` or `helmfile --environment <your-env> sync`.
- Access Grafana from an external machine.
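The steps above can be condensed into a single console sketch (the node name, environment name, and password are placeholders):

```console
# Label the node that should host monitoring pods (only if using affinity)
kubectl label node/your-node 'aistore.nvidia.com/role_monitoring=true'

# Provide required secrets (only if bundling Grafana)
export GRAFANA_PASSWORD=<password>

# Sync all charts defined in the helmfile for your environment
helmfile --environment <your-env> sync
```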
With the proper values configured, all tools should automatically sync and provide data in the Grafana dashboard.
Most chart values are set in the source charts or in the `values.yaml.gotmpl` file in each chart's directory. To configure a specific deployment, create an environment file and either replace `default.yaml` in the helmfile or create a new environment.
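As a sketch of the second option, a new environment can be declared in the helmfile alongside the default one (the environment name and values file below are hypothetical; match them to this repo's layout):

```yaml
# helmfile.yaml fragment -- hypothetical environment entry
environments:
  default:
    values:
      - default.yaml
  my-env:                # hypothetical environment name
    values:
      - my-env.yaml      # your copy of default.yaml with edited values
```

It can then be deployed with `helmfile --environment my-env sync`.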
- Grafana admin user login:
  ```console
  export GRAFANA_PASSWORD=<password>
  ```
For setting the `securityContext`, specify details of a non-root user (typically UID >= 1000). To identify existing non-root users, use the following command:

```console
awk -F: '$3 >= 1000 {print $1}' /etc/passwd
```

You can use one of these existing non-root users or create a new one. To obtain the UID and group ID (GID) of a user, run:

```console
id [username]
```

Then update your deployment environment file with the user's UID and GID by setting the `runAsUser`, `runAsGroup`, and `fsGroup` fields under `securityContext`.
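For example, if `id myuser` reports UID 1001 and GID 1001, the corresponding environment file entry might look like the following (the exact key nesting depends on the chart's values layout):

```yaml
# Deployment environment file fragment -- key nesting is illustrative
securityContext:
  runAsUser: 1001
  runAsGroup: 1001
  fsGroup: 1001
```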
AlertManager supports various receivers, which you can configure as needed. We include a Slack alert in our config file in `kube-prom/alertmanager_config`, but more can be added. Refer to the Prometheus Alerting Configuration documentation for details on each receiver's config.
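As a rough sketch, a Slack receiver in the standard Alertmanager configuration format looks like this (the webhook URL and channel are placeholders; see the bundled config file for the actual setup):

```yaml
# Alertmanager configuration fragment (standard Prometheus Alertmanager format)
route:
  receiver: slack-notifications
receivers:
  - name: slack-notifications
    slack_configs:
      - api_url: https://hooks.slack.com/services/<webhook-id>  # placeholder
        channel: '#alerts'
        send_resolved: true
```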
To monitor AIS, create PodMonitor definitions. You can find an AIS `PodMonitor` definition in `ais_podmonitors.yaml`, which is automatically applied after syncing the kube-prometheus chart. If using HTTPS for AIS, be sure to update the `PodMonitor` definition with the appropriate configs for scheme and TLS (an example is provided in the definition). When applied, the monitors configure Prometheus to scrape metrics from AIStore's proxy and target pods individually every 30 seconds.
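For illustration, a minimal `PodMonitor` in the prometheus-operator CRD format might look like the sketch below; the name, label selector, and port name are assumptions, so consult `ais_podmonitors.yaml` for the actual definition:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: ais-target            # hypothetical name
spec:
  selector:
    matchLabels:
      app: ais-target         # assumed pod label; check your deployment
  podMetricsEndpoints:
    - port: metrics           # assumed container port name
      interval: 30s
      scheme: https           # only when AIS serves HTTPS
      tlsConfig:
        insecureSkipVerify: true   # or configure proper CA verification
```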
The web services for Prometheus and Grafana are not directly accessible from outside the cluster. Options include changing the service type to `NodePort` or using port-forwarding. Use the `kube-prometheus-stack-prometheus` service for Prometheus and `kube-prometheus-stack-grafana` for Grafana. Below are instructions for Grafana.
- Configure access from the host into the pod by using ONE of the following:
  - Port-forward:
    ```console
    kubectl port-forward --namespace monitoring service/kube-prometheus-stack-grafana 3000:80
    ```
  - Patch the service to use `NodePort`:
    ```console
    kubectl patch svc kube-prometheus-stack-grafana -n monitoring -p '{"spec": {"type": "NodePort"}}'
    ```
- Create a separate NodePort or LoadBalancer service: k8s docs
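If you patch the service to `NodePort`, Kubernetes assigns the external port automatically; one way to look it up is:

```console
kubectl get svc kube-prometheus-stack-grafana -n monitoring \
  -o jsonpath='{.spec.ports[0].nodePort}'
```

Then browse to `http://<node-ip>:<node-port>`.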
- If needed, use an ssh tunnel to access the k8s host:
  ```console
  ssh -L <port>:localhost:<port> <user-name>@<ip-or-host-name>
  ```
  and view `localhost:<port>`.
For Grafana, log in with the admin user and the password set with the `GRAFANA_PASSWORD` environment variable.
- **Promtail**
  - Ships K8s logs to Loki
  - Main docs
  - Chart source
- **kube-prometheus-stack**
  - Prometheus stack
  - Includes
- **Loki**
  - Log storage and search
  - Main docs
  - Chart source
  - Additional values options
- **Grafana**
  - Visualization, dashboards, log search, metrics, etc.
  - https://grafana.com/
  - Chart source