Guidance / recommended configurations for running Astra at scale #795
-
Starting a discussion on running kaldb at scale. Can you share insight into how Slack runs this open source project internally (which configuration params do you keep as default and which do you override)? How do you pick the number of each node type?
Replies: 3 comments 5 replies
-
@bryanlb Another configuration question:
But it leads me to a question about the ideal relationship between the Kafka topic retention and the indexer's chunk configuration. The way I see it, since the Kafka topic is a WAL, the ideal configuration would be for any partition to retain more data than a single chunk covers. Is this how Slack configures the Kafka retention? AFAIK, Kafka topics can have retention by bytes or by time on the topic as a whole. Can you share the Kafka WAL retention configuration and the corresponding chunk settings?
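For illustration, the relevant topic-level knobs are retention.bytes (which Kafka enforces per partition) and retention.ms. A hypothetical example, shown as plain key/value (in practice these are set via kafka-configs or whatever topic tooling you use); the values are made up, not Slack's:

```yaml
# Illustrative per-topic retention overrides for the WAL topic (values made up).
# retention.bytes applies per partition; retention.ms is a time bound.
retention.bytes: 20000000000   # ~20GB per partition, comfortably above a 15GB chunk
retention.ms: 86400000         # 24 hours
```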
-
I'm still working to build out a complete recommendation list, but here's a snapshot of what we're currently using for some of the more critical configs. The CPU range indicates the Kubernetes request/limit settings.
**Index** (r5d.24xlarge)
- cpu: 2-5
- memory: 32GB
- jvm: 6GB
- localdisk: 90Gi
- maxBytesPerChunk: 15000000000 # 15GB
- Scaled to 4MB/s per indexer; a 40MB/s cluster would be 10 nodes (rough sizing sketch below)
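To make the Index numbers above concrete, here is a rough sketch of how they might map onto a Kubernetes container spec, together with the back-of-the-envelope sizing math. The use of JAVA_OPTS for the heap and the assumption that the memory request equals the limit are mine for illustration, not confirmed from Slack's manifests.

```yaml
# Hypothetical resource block for an Astra index pod, using the numbers above.
resources:
  requests:
    cpu: "2"          # lower end of the 2-5 CPU range
    memory: 32Gi
  limits:
    cpu: "5"          # upper end of the 2-5 CPU range
    memory: 32Gi      # assumes memory request == limit; only the CPU range was stated
env:
  - name: JAVA_OPTS   # assumed mechanism for passing the 6GB heap
    value: "-Xms6g -Xmx6g"
# Back-of-the-envelope sizing at the quoted rates:
#   40 MB/s cluster ingest / 4 MB/s per indexer  = 10 indexers
#   15 GB maxBytesPerChunk / 4 MB/s per indexer ~= 3,750 s, so each indexer rolls
#   a chunk roughly every ~62 minutes (ignoring compression and overhead)
```

Presumably the gap between the 32GB of container memory and the 6GB heap is deliberate headroom for off-heap memory and the OS page cache.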
**Recovery** (r5d.24xlarge)
- cpu: 2-5
- memory: 24GB
- jvm: 20GB
- localdisk: 100Gi
- Autoscaled on CPU > 60%, min 2 nodes (example autoscaler sketched below)
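A minimal sketch of what "autoscaled on CPU > 60%, min 2 nodes" could look like as a Kubernetes HorizontalPodAutoscaler, assuming the recovery nodes run as a Deployment; the object names and the maxReplicas ceiling are made up for illustration:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: astra-recovery            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: astra-recovery          # hypothetical Deployment
  minReplicas: 2                  # matches "min 2 nodes"
  maxReplicas: 10                 # illustrative ceiling; not stated in the thread
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60  # matches "CPU > 60%"
```

With the Resource metric, utilization is measured against the pod's CPU request, so 60% here means 60% of the 2-CPU request rather than of the 5-CPU limit.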
**Manager** (m5.24xlarge)
- cpu: 0.5-2
- memory: 12GB
- jvm: 8GB
- 1 instance per cluster
**Query** (m5.24xlarge)
- cpu: 1-4
- memory: 32GB
- jvm: 28GB
- requestTimeout: 60s
- Scaled to 3-10 nodes, depending on query load
**Cache** (i3en.24xlarge)