[Helm Chart] Loki chart version 6.18.0 app version 3.2.0 -- Ruler not sending any alerts to alertmanager. #14798

Open · gyoza opened this issue Nov 6, 2024 · 2 comments

gyoza commented Nov 6, 2024

Describe the bug
Loki ruler alerts are not firing with the latest loki Helm chart running in Distributed mode.
For the record, the loki-distributed Helm chart is able to process these rules and alert on them with no issues.

To Reproduce
Steps to reproduce the behavior:

  1. Started Loki Helm chart 6.18.0, app version 3.2.0
  2. Started Promtail Helm chart 6.16.5, app version 3.0.0
  3. Query: (sum(count_over_time({job=~".+", env="env"} |~ ".+level=warn.+"[30s]) >= 1))

Expected behavior
Alerts sent for these queries

Environment:

  • Infrastructure: Kubernetes 1.30
  • Deployment tool: helm

Screenshots, Promtail config, or terminal output
Here is the output of all the pods we have running from the loki Helm chart:

loki-index-gateway-0                                        1/1     Running   0               98m
loki-index-gateway-1                                        1/1     Running   0               98m
loki-ingester-0                                             1/1     Running   0               96m
loki-ingester-1                                             1/1     Running   0               98m
loki-pattern-ingester-0                                     1/1     Running   0               97m
loki-pattern-ingester-1                                     1/1     Running   0               98m
loki-querier-79fbbcfd9-gwwvh                                1/1     Running   0               98m
loki-querier-79fbbcfd9-tdmf2                                1/1     Running   0               98m
loki-query-frontend-6f8c7d6f4c-km5gv                        1/1     Running   0               98m
loki-query-frontend-6f8c7d6f4c-tk2vn                        1/1     Running   0               98m
loki-query-scheduler-9d897d6d7-cqvhc                        1/1     Running   0               98m
loki-results-cache-0                                        2/2     Running   0               6d
loki-rollout-operator-6c46745f4-82lfm                       1/1     Running   0               6d
loki-ruler-0                                                1/1     Running   0               97m
loki-ruler-1                                                1/1     Running   0               95m

Here is the current deployment from helm ls:

loki <namespace> 72 2024-11-06 10:58:15.215559 -0700 MST deployed loki-6.18.0 3.2.0

Ruler configuration:

  rulerConfig:
    alertmanager_client:
      basic_auth_username: ${BASIC_AUTH_USERNAME}
      basic_auth_password: ${BASIC_AUTH_PASSWORD}
    alertmanager_url: http://${ALERTMANAGER_URL}:9093
    external_url: https://${EXTERNAL_URL}
    enable_alertmanager_v2: true
    enable_api: true
    enable_sharding: true
    evaluation:
      mode: remote
      query_frontend:
        address: dns:///loki-query-frontend.<namespace>.svc.cluster.local:9095
    ring:
      kvstore:
        store: memberlist
    rule_path: /tmp/loki
    storage:
      local:
        directory: /etc/loki/rules
      type: local

The rule in question has been set up as /etc/loki/rules/fake/fake; changing the extensionless file to .txt or .yaml changes nothing:

groups:
    - name: alerts
      rules:
        - alert: test-new-ruler
          expr: |-
            (sum(count_over_time({job=~".+", env="env"} |~ ".+level=warn.+"[30s]) >= 1))
          labels:
            severity: critical
            slack_channel: slackchannel
            source: loki
          annotations:
            message: Testing Ruler!
            summary: Testing Ruler!
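
For reference, with storage.type: local the ruler loads one subdirectory per tenant under storage.local.directory, so the layout here is:

  /etc/loki/rules/
    fake/
      fake    # the rule group above; the "fake" tenant directory matches org_id=fake in the ruler-sourced queries below
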

logcli through the gateway returns data for this particular query_range:

# logcli query '(sum(count_over_time({job=~".+", env="env"} |~ ".+level=warn.+"[30s]) >= 1))'
https://<LOKIGATEWAY>/loki/api/v1/query_range?direction=BACKWARD&end=1730915766298050000&limit=30&query=%28sum%28count_over_time%28%7Bjob%3D~%22.%2B%22%2C+env%3D%22env%22%7D+%7C~+%22.%2Blevel%3Dwarn.%2B%22%5B30s%5D%29+%3E%3D+1%29%29&start=1730912166298050000
[
  {
    "metric": {},
    "values": [
      [
        1730912162,
        "468"
      ],
      [
        etc etc

The query frontend reports status=200, while the querier reports status=500 and total_bytes=0B for both attempts to evaluate the alert rule. Both of these requests have source=ruler:

loki-query-frontend-6f8c7d6f4c-tk2vn query-frontend level=info ts=2024-11-06T18:08:29.077144951Z caller=metrics.go:223 component=frontend org_id=fake latency=fast query="sum((count_over_time({job=~\".+\", env=\"env\"} |~ \".+level=warn.+\"[30s]) >= 1))" query_hash=2994804900 query_type=metric range_type=instant length=0s start_delta=8.609486ms end_delta=8.609665ms step=0s duration=6.470908ms status=200 limit=100 returned_lines=0 throughput=0B total_bytes=0B total_bytes_structured_metadata=0B lines_per_second=0 total_lines=0 post_filter_lines=0 total_entries=0 store_chunks_download_time=0s queue_time=104.108µs splits=0 shards=1 query_referenced_structured_metadata=false pipeline_wrapper_filtered_lines=0 chunk_refs_fetch_time=1.244394ms cache_chunk_req=0 cache_chunk_hit=0 cache_chunk_bytes_stored=0 cache_chunk_bytes_fetched=0 cache_chunk_download_time=0s cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_stats_results_req=0 cache_stats_results_hit=0 cache_stats_results_download_time=0s cache_volume_results_req=0 cache_volume_results_hit=0 cache_volume_results_download_time=0s cache_result_req=0 cache_result_hit=0 cache_result_download_time=0s cache_result_query_length_served=0s ingester_chunk_refs=0 ingester_chunk_downloaded=0 ingester_chunk_matches=0 ingester_requests=2 ingester_chunk_head_bytes=0B ingester_chunk_compressed_bytes=0B ingester_chunk_decompressed_bytes=0B ingester_post_filter_lines=0 congestion_control_latency=0s index_total_chunks=0 index_post_bloom_filter_chunks=0 index_bloom_filter_ratio=0.00 index_shard_resolver_duration=0s source=ruler disable_pipeline_wrappers=false
loki-querier-79fbbcfd9-gwwvh querier level=info ts=2024-11-06T18:24:29.07610384Z caller=metrics.go:223 component=querier org_id=fake latency=fast query="count_over_time({job=~\".+\", env=\"env\"} |~ \".+level=warn.+\"[30s])" query_hash=3177363205 query_type=metric range_type=instant length=0s start_delta=7.581802ms end_delta=7.581993ms step=0s duration=1.516845ms status=500 limit=100 returned_lines=0 throughput=0B total_bytes=0B total_bytes_structured_metadata=0B lines_per_second=0 total_lines=0 post_filter_lines=0 total_entries=0 store_chunks_download_time=0s queue_time=122.273µs splits=0 shards=0 query_referenced_structured_metadata=false pipeline_wrapper_filtered_lines=0 chunk_refs_fetch_time=1.277523ms cache_chunk_req=0 cache_chunk_hit=0 cache_chunk_bytes_stored=0 cache_chunk_bytes_fetched=0 cache_chunk_download_time=0s cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_stats_results_req=0 cache_stats_results_hit=0 cache_stats_results_download_time=0s cache_volume_results_req=0 cache_volume_results_hit=0 cache_volume_results_download_time=0s cache_result_req=0 cache_result_hit=0 cache_result_download_time=0s cache_result_query_length_served=0s ingester_chunk_refs=0 ingester_chunk_downloaded=0 ingester_chunk_matches=0 ingester_requests=2 ingester_chunk_head_bytes=0B ingester_chunk_compressed_bytes=0B ingester_chunk_decompressed_bytes=0B ingester_post_filter_lines=0 congestion_control_latency=0s index_total_chunks=0 index_post_bloom_filter_chunks=0 index_bloom_filter_ratio=0.00 index_shard_resolver_duration=0s source=ruler disable_pipeline_wrappers=false
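
The metrics.go lines don't include the underlying error behind the 500s; I can bump the log level on the queriers to capture it if that helps (sketch, assuming the chart's loki.structuredConfig passthrough and no conflicting server overrides):

  loki:
    structuredConfig:
      server:
        log_level: debug    # surfaces the error behind the 500s returned for source=ruler queries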

The same query run via logcli through the gateway address logs total_bytes=340MB:

loki-query-frontend-6f8c7d6f4c-tk2vn query-frontend level=info ts=2024-11-06T18:16:49.286222937Z caller=metrics.go:223 component=frontend org_id=admin latency=fast query="(sum(count_over_time({job=~\".+\", env=\"env\"} |~ \".+level=warn.+\"[30s]) >= 1))" query_hash=3193268194 query_type=metric range_type=range length=1h0m0s start_delta=1h0m7.083396751s end_delta=7.083396887s step=14s duration=6.727327043s status=200 limit=30 returned_lines=0 throughput=50MB total_bytes=340MB total_bytes_structured_metadata=32MB lines_per_second=198368 total_lines=1334493 post_filter_lines=124150 total_entries=1 store_chunks_download_time=2.158175607s queue_time=110.478µs splits=2 shards=1 query_referenced_structured_metadata=false pipeline_wrapper_filtered_lines=0 chunk_refs_fetch_time=3.539578ms cache_chunk_req=116 cache_chunk_hit=116 cache_chunk_bytes_stored=0 cache_chunk_bytes_fetched=27267931 cache_chunk_download_time=2.156608317s cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_stats_results_req=0 cache_stats_results_hit=0 cache_stats_results_download_time=0s cache_volume_results_req=0 cache_volume_results_hit=0 cache_volume_results_download_time=0s cache_result_req=1 cache_result_hit=1 cache_result_download_time=752.692µs cache_result_query_length_served=29m38s ingester_chunk_refs=0 ingester_chunk_downloaded=0 ingester_chunk_matches=311 ingester_requests=2 ingester_chunk_head_bytes=16MB ingester_chunk_compressed_bytes=28MB ingester_chunk_decompressed_bytes=228MB ingester_post_filter_lines=85669 congestion_control_latency=0s index_total_chunks=0 index_post_bloom_filter_chunks=0 index_bloom_filter_ratio=0.00 index_shard_resolver_duration=0s disable_pipeline_wrappers=false

Additionally, I thought it was a querier issue, so I've experimented with frontend_address and scheduler_address in the frontend_worker settings (sketched below), which produced the same results.
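
Roughly what I tried in the frontend_worker block looked like this (addresses are illustrative; only one of the two was set at a time):

  loki:
    structuredConfig:
      frontend_worker:
        # variant 1: workers pull queries from the query frontend directly
        frontend_address: loki-query-frontend.<namespace>.svc.cluster.local:9095
        # variant 2 (tried separately): workers pull queries from the query scheduler
        # scheduler_address: loki-query-scheduler.<namespace>.svc.cluster.local:9095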

I've been trying to debug this issue for over a week now and nothing has come of it. Looking for some ideas as to what could be going on here. Thank you for your time!


gyoza commented Nov 20, 2024

Happy to add any further debugging if needed. Thanks!


lpowalka commented Dec 10, 2024

For us the problem was related to using a multi-tenant setup. In that case the ruler has to be proxied (and augmented with extra headers) to the query frontend to allow it to obtain log data from all the tenants (rough sketch below). Otherwise the rule expressions keep returning empty results in the ruler, even though the query is shown to succeed.

EDIT:
What's important to add is that this setup worked for us before Loki 3.1. Something must have changed that now requires the ruler to evaluate queries via an additional proxy.
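
Roughly, "proxied and augmented with extra headers" can look like the sketch below: a small nginx gRPC proxy that the ruler's evaluation.query_frontend.address points at, which injects a multi-tenant X-Scope-OrgID header before forwarding to the query frontend. Tenant IDs and service names are placeholders, cross-tenant queries also need multi_tenant_queries_enabled in limits_config, and this is not necessarily our exact setup:

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: loki-ruler-query-proxy
  data:
    nginx.conf: |
      events {}
      http {
        server {
          listen 9095 http2;    # point ruler evaluation.query_frontend.address here
          location / {
            # query across all tenants the ruler needs to see
            grpc_set_header X-Scope-OrgID "tenant-a|tenant-b";
            grpc_pass grpc://loki-query-frontend.<namespace>.svc.cluster.local:9095;
          }
        }
      }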
