Improve k8s infra chart components description and explain presets (#938

)
SigNoz · Nov 4, 2024 · 43e53fe · 43e53fe
1 parent 18e0697
commit 43e53fe
Show file tree

Hide file tree

Showing 2 changed files with 251 additions and 6 deletions.
diff --git a/constants/docsSideNav.ts b/constants/docsSideNav.ts
@@ -772,7 +772,7 @@ const docsSideNav = [
       },
       {
         type: 'doc',
-        route: '/docs/metrics-management/k8s-deployment-override',
+        route: '/docs/metrics-management/k8s-infra-otel-config',
         label: 'Configure k8s-infra otelDeployment to collect metrics from receivers',
       },
     ],

diff --git a/data/docs/metrics-management/k8s-deployment-override.mdx b/data/docs/metrics-management/k8s-deployment-override.mdx
@@ -1,15 +1,17 @@
 ---
 date: 2024-10-28
-title: Configure k8s-infra otelDeployment to collect metrics from receivers
-id: k8s-deployment-override
+title: k8s-infra otelDeployment and otelAgent configuration
+id: k8s-infra-otel-config
 ---
 
 The [k8s-infra chart](https://signoz.io/docs/tutorial/kubernetes-infra-metrics/) runs two kind of otel collectors:
 
 - otelAgent: This collector is deployed as a daemonset and collects metrics/logs from the pods on the node. One agent is deployed per node.
 - otelDeployment: This collector is deployed as a deployment and collects metrics/logs of the cluster. One deployment is deployed per cluster.
 
-The `otelDeployment` collector is configured to collect cluster metrics and events by default. There are some use cases where you might want to collect metrics from components other than the ones that are being collected by default. For example, you might want to collect metrics from a redis server or a postgres database. This page describes how to override the default configuration of the `otelDeployment` collector to collect metrics from additional components. The override-values.yaml file contains the configurations that you can override.
+## otelDeployment
+
+The `otelDeployment` collector is configured to collect cluster metrics and events by default. There are some use cases where you might want to collect metrics from components other than the ones that are being collected by default. For example, you might want to collect metrics from a redis server or a postgres database. This page describes how to override the default configuration of the `otelDeployment` collector to collect metrics from additional components. The `override-values.yaml` file contains the configurations that you can override.
 
 ### Example 1: Configuring a redis receiver running inside a namespace "redis-ns" with the name "redis-server" and port 6379
 
@@ -64,6 +66,249 @@ otelDeployment:
 
 The above examples are simple and show how to configure the `otelDeployment` collector to collect metrics from a single component. However, in a real-world scenario, you might have to configure the receiver with a username and password and other configurations.
 
-## Why `otelDeployment`?
+## otelAgent
+
+The `otelAgent` collector is deployed as a daemonset, it results in duplicate metrics if the same component is running on multiple nodes and the metrics are collected from all the nodes. The `otelDeployment` collector solves this problem by collecting metrics from the cluster and not from the individual nodes.
+
+## Essential Presets
+
+This guide explains how to configure the OpenTelemetry Collector presets in SigNoz. These presets control what telemetry data you collect and how you collect it from your Kubernetes cluster.
+
+### 1. OTLP Exporter
+```yaml
+presets:
+  otlpExporter:
+    enabled: true  # Must be enabled for data to reach SigNoz
+```
+**What it does**: Enables sending telemetry data to SigNoz backend. This is required for SigNoz to receive any data.
+
+### 2. Host Metrics Collection
+```yaml
+presets:
+  hostMetrics:
+    enabled: true
+    collectionInterval: 30s  # Adjust based on your monitoring needs
+    scrapers:
+      cpu: {}        # Collects CPU usage, time, and utilization metrics
+      load: {}       # Collects system load averages
+      memory: {}     # Collects memory usage metrics
+      disk:          # Collects disk I/O metrics
+        exclude:
+          devices:   # Regex patterns to exclude specific devices
+            - ^ram\d+$
+            - ^loop\d+$
+            # Add custom device exclusions here
+      filesystem:    # Collects filesystem metrics
+        exclude_fs_types:
+          fs_types:
+            - tmpfs
+            - squashfs
+            # Add filesystem types to exclude
+      network:       # Collects network metrics
+        exclude:
+          interfaces:
+            - ^veth.*$    # Excludes virtual ethernet devices
+            - ^docker.*$  # Excludes docker interfaces
+            # Add network interfaces to exclude
+```
+**Key points**:
+- Adjust `collectionInterval` based on your monitoring granularity needs
+- Customize device exclusions to reduce noise from irrelevant storage devices
+- Configure network interface exclusions to focus on relevant network metrics
+
+### 3. Container Log Collection
+```yaml
+presets:
+  logsCollection:
+    enabled: true
+    startAt: beginning      # Options: beginning or end
+    includeFilePath: true   # Adds file path as metadata
+    includeFileName: false  # Adds file name as metadata
+    
+    # Exclusion configuration
+    blacklist:
+      enabled: true
+      signozLogs: true     # Excludes SigNoz's own logs
+      namespaces:
+        - kube-system      # Add namespaces to exclude
+      pods:
+        - hotrod          # Add pod names to exclude
+        - locust
+      containers: []       # Add container names to exclude
+    
+    # Inclusion configuration (overrides blacklist if enabled)
+    whitelist:
+      enabled: false
+      signozLogs: true
+      namespaces: []      # Only collect logs from these namespaces
+      pods: []            # Only collect logs from these pods
+      containers: []      # Only collect logs from these containers
+```
+**Best practices**:
+- Use blacklist for excluding noisy system pods
+- Enable whitelist when you need to monitor specific applications only
+- Consider disk usage implications when enabling file path/name inclusion
+
+### 4. Kubernetes Metrics Collection
+```yaml
+presets:
+  kubeletMetrics:
+    enabled: true
+    collectionInterval: 30s
+    authType: serviceAccount    # Authentication method
+    endpoint: ${env:K8S_HOST_IP}:10250
+    insecureSkipVerify: true   # Set to false in production
+    
+    # Metrics configuration
+    metrics:
+      k8s.pod.cpu_limit_utilization:
+        enabled: true
+      k8s.pod.memory_limit_utilization:
+        enabled: true
+      # Enable other metrics as needed
+```
+**Important settings**:
+- Set appropriate `collectionInterval` based on cluster size
+- Configure `insecureSkipVerify: false` in production environments
+- Enable specific metrics based on monitoring requirements
+
+### 5. Kubernetes Metadata Enrichment
+```yaml
+presets:
+  kubernetesAttributes:
+    enabled: true
+    passthrough: false    # Set true to disable k8s API calls
+    
+    # Control which metadata to collect
+    extractMetadatas:
+      - k8s.namespace.name
+      - k8s.deployment.name
+      - k8s.pod.name
+      - k8s.node.name
+      # Add or remove metadata fields
+    
+    # Pod association configuration
+    podAssociation:
+      - sources:
+        - from: resource_attribute
+          name: k8s.pod.ip
+```
+**Configuration tips**:
+- Enable only required metadata fields to optimize performance
+- Use `passthrough: true` in very large clusters to reduce API load
+- Configure pod association based on your networking setup
+
+### 6. Cluster-level Metrics
+```yaml
+presets:
+  clusterMetrics:
+    enabled: true
+    collectionInterval: 30s
+    
+    # Node conditions to monitor
+    nodeConditionsToReport:
+      - Ready
+      - MemoryPressure
+      - DiskPressure
+      # Add conditions based on monitoring needs
+    
+    # Resource types to monitor
+    allocatableTypesToReport:
+      - cpu
+      - memory
+      # - storage    # Uncomment if needed
+```
+**When to use**:
+- Enable for cluster health monitoring
+- Useful for capacity planning and resource optimization
+- Essential for multi-node cluster monitoring
+
+### 7. Kubernetes Events
+```yaml
+presets:
+  k8sEvents:
+    enabled: true
+    authType: serviceAccount
+    namespaces: []    # Empty for all namespaces, or specify list
+```
+**Usage scenarios**:
+- Enable for debugging and audit trails
+- Monitor specific namespaces by listing them
+- Useful for compliance and security monitoring
+
+## Common Deployment Scenarios
+
+### Production Cluster with Full Monitoring
+```yaml
+presets:
+  otlpExporter:
+    enabled: true
+  hostMetrics:
+    enabled: true
+    collectionInterval: 30s
+  logsCollection:
+    enabled: true
+    blacklist:
+      enabled: true
+      namespaces: 
+        - kube-system
+  kubeletMetrics:
+    enabled: true
+    insecureSkipVerify: false
+  kubernetesAttributes:
+    enabled: true
+  clusterMetrics:
+    enabled: true
+  k8sEvents:
+    enabled: true
+```
+
+### Resource-Constrained Environment
+```yaml
+presets:
+  otlpExporter:
+    enabled: true
+  hostMetrics:
+    enabled: true
+    collectionInterval: 60s
+  logsCollection:
+    enabled: true
+    whitelist:
+      enabled: true
+      namespaces: 
+        - production
+  kubeletMetrics:
+    enabled: true
+    collectionInterval: 60s
+  kubernetesAttributes:
+    enabled: true
+    passthrough: true
+```
+
+### Development Environment
+```yaml
+presets:
+  otlpExporter:
+    enabled: true
+  hostMetrics:
+    enabled: true
+    collectionInterval: 30s
+  logsCollection:
+    enabled: true
+  kubeletMetrics:
+    enabled: true
+  kubernetesAttributes:
+    enabled: true
+```
+
+## Resource Requirements
+
+Recommended resource allocations based on preset combinations:
+
+| Configuration | CPU Request | Memory Request | CPU Limit | Memory Limit |
+|--------------|-------------|----------------|-----------|--------------|
+| Minimal | 100m | 256Mi | 200m | 512Mi |
+| Standard | 200m | 512Mi | 500m | 1Gi |
+| Full | 500m | 1Gi | 1000m | 2Gi |
 
-The `otelAgent` collector is deployed as a daemonset, it results in duplicate metrics if the same component is running on multiple nodes. The `otelDeployment` collector solves this problem by collecting metrics from the cluster and not from the individual nodes.
+Adjust these values based on your cluster size and monitoring requirements.