index node rollout strategy? #1072

autata · 2024-09-12T19:27:59Z


I'm curious what sort of rollout strategy Slack uses for index nodes.

We do a one-at-a-time rollout, but since each index node restart kicks off a corresponding recovery task, IO load on Kafka spikes. I wonder if there is a better strategy.

Does Slack have a mechanism or configuration that waits for the recovery nodes to re-index and push a segment before the next index node is restarted? Or maybe something else?
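The pacing asked about here can be expressed as a polling gate: after restarting one indexer, repeatedly run a check until it succeeds or a deadline passes, and only then move on to the next node. The sketch below is hypothetical. `wait_until` is an illustrative name, and the check command you pass in (for example, one that confirms a recovery task has published its segment) is an assumption, since this thread does not describe any such command.

```bash
#!/bin/bash
# wait_until TIMEOUT_SECS INTERVAL_SECS CMD [ARGS...]
# Hypothetical helper: re-run CMD every INTERVAL_SECS seconds until it exits 0.
# Returns 0 as soon as CMD succeeds, or 1 if TIMEOUT_SECS elapse first.
wait_until() {
  local timeout=$1 interval=$2
  shift 2
  # SECONDS is bash's built-in count of seconds since the shell started
  local deadline=$(( SECONDS + timeout ))
  until "$@"; do
    if (( SECONDS >= deadline )); then
      echo "timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
  return 0
}
```

In a rollout loop this might look like `wait_until 1800 30 recovery_done "$pod"`, where `recovery_done` stands in for whatever check fits your setup (hypothetical, not a KalDB command).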

Answered by bryanlb


autata (Author) · 2024-09-12T19:28:13Z

@bryanlb


bryanlb (Maintainer) · 2024-09-30T17:38:42Z

@autata for indexers we have a deploy script that deletes up to 10% of pods in parallel and then waits two minutes before continuing on to the next phase. It looks something like the following:

#!/bin/bash
# Rolling restart of indexer pods: delete them one by one in the background, pausing
# whenever more than 10% of the replicas are outside the Running state.
# Requires cluster, app and replicas to be set (see the environment below).
: "${cluster:?cluster is not set}" "${app:?app is not set}" "${replicas:?replicas is not set}"

# Count $app pods whose STATUS (3rd column of `kubectl get pods`) is / is not Running.
count_running() {
  kubectl --context="$cluster" get pods -l app="$app" --no-headers | awk '$3 == "Running" { n++ } END { print n + 0 }'
}
count_not_running() {
  kubectl --context="$cluster" get pods -l app="$app" --no-headers | awk '$3 != "Running" { n++ } END { print n + 0 }'
}

pods="$(kubectl --context="$cluster" get pods -l app="$app" -o name | cut -d '/' -f 2)"
# delete 10% of the pods in parallel (integer division rounds down)
max_not_running=$(( replicas / 10 ))
for pod in $pods; do
  kubectl --context="$cluster" delete pod "$pod" &
  sleep 5

  not_running_pods="$(count_not_running)"
  while [[ $not_running_pods -gt $max_not_running ]]; do
    echo "Pausing till $not_running_pods pods get back to Running state before deploying more"
    # wait 2 mins before doing the next iteration
    sleep 120
    not_running_pods="$(count_not_running)"
  done
done

# make sure every backgrounded delete has returned
wait

running_pods="$(count_running)"
while [[ $running_pods -lt $replicas ]]; do
  remaining_replicas=$(( replicas - running_pods ))
  echo "running_pods=$running_pods. Waiting for $remaining_replicas pods to reach Running state before completing deploy"
  sleep 10
  running_pods="$(count_running)"
done

environment:
  app: kaldb-index-example-dev
  cluster: logging-dev-iad
  replicas: 5
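The throttle in this script comes down to one comparison: keep deleting while the number of pods outside Running is at most replicas/10. Because bash arithmetic truncates, the example environment (replicas: 5) gives a limit of 0, so that deploy effectively waits for each indexer to be Running again before deleting the next one. The function below isolates that check so it can be tried without a cluster. `may_delete_more` is a hypothetical name, and the sketch assumes the default `kubectl get pods --no-headers` layout, where STATUS is the third column.

```bash
#!/bin/bash
# may_delete_more REPLICAS
# Hypothetical helper: reads `kubectl get pods --no-headers` rows on stdin and
# succeeds if another pod may be deleted, i.e. if the number of pods whose
# STATUS is not "Running" is at most REPLICAS/10 (rounded down).
may_delete_more() {
  local replicas=$1
  local max_not_running=$(( replicas / 10 ))
  local not_running
  # STATUS is the 3rd whitespace-separated column in the default output
  not_running=$(awk '$3 != "Running" { n++ } END { print n + 0 }')
  (( not_running <= max_not_running ))
}
```

For example, `kubectl --context="$cluster" get pods -l app="$app" --no-headers | may_delete_more "$replicas"` would answer the same question the script's pause loop asks; feeding it canned rows instead shows how the limit behaves for a given replica count.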
