"Performance tuning" docs should be changed to prevent users from having performance-related incidents #2017

gdubicki · 2024-07-14T10:52:37Z

What happened?

We followed the "Performance tuning" docs to optimize our cluster, and afterwards the cluster's performance became very, very bad.

In particular, the average write time increased from ~500 ms to ~3000 ms (about 6x higher), while the 95th percentile increased from ~2500 ms to ~17500 ms (about 7x higher)!
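The ratios quoted above can be checked with a few lines of Python. This is only a sanity check of the arithmetic, using the approximate figures from this report; the function name `slowdown` is illustrative and not part of any tool mentioned here.

```python
def slowdown(before_ms: float, after_ms: float) -> float:
    """Return how many times larger the 'after' latency is than the 'before' one."""
    return after_ms / before_ms


# Approximate write latencies quoted in this report, in milliseconds.
avg_ratio = slowdown(500, 3000)
p95_ratio = slowdown(2500, 17500)

print(f"average: {avg_ratio:.0f}x, p95: {p95_ratio:.0f}x")  # average: 6x, p95: 7x
```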

The read times have been affected as well, although less painfully.

Please see the screenshot from our metrics. The "optimizations" were applied shortly before 21:00 there.

This led to a multi-day performance incident that took hours to revert.

We don't want anyone else to run into this problem ever again.

What did you expect to happen?

The Performance tuning docs should begin with a prominent warning callout stating that in some cases the tuning works very badly and can degrade performance, and that it is hard to revert. Basically, you have to restart your nodes, and in some cases (for example, when using local SSDs in GKE) doing that while keeping the data requires a hacky procedure.

I am talking about this style warning:

What perftune.py does should become easily revertible (see "perftune.py should allow rollback / revert the changes it made", scylladb/seastar#2350) if "Performance tuning" is to become a non-experimental feature. Preferably, the Scylla Operator should then support running it in revert mode.

How can we reproduce it (as minimally and precisely as possible)?

Read the docs https://operator.docs.scylladb.com/stable/performance.html

Scylla Operator version

1.13.0

Kubernetes platform name and version

$ kubectl version
Client Version: v1.29.6
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.5-gke.1192000

Please attach the must-gather archive.

I don't think it matters here, because the perftune.py performance reduction itself is being looked into in scylladb/seastar#2350, but here it is, if needed:

scylla-operator-must-gather-9zqx6hqsn6zb.zip

This is from after the revert of the perftune settings. For more info about what perftune did, please see scylladb/seastar#2350.

Anything else we need to know?

See also scylladb/seastar#2350 for some more info about our setup.


tnozicka · 2024-07-15T06:34:34Z

Please attach the must-gather archive.

Doesn't apply

The template specifically says it is required, and it helps us identify the platform and settings that led to the case you describe. We can't commit cycles to debugging your issue when you don't invest time in collecting the data.

https://operator.docs.scylladb.com/stable/support/must-gather.html

gdubicki · 2024-07-15T06:45:41Z

Please attach the must-gather archive.

Doesn't apply

The template specifically says it is required, and it helps us identify the platform and settings that led to the case you describe. We can't commit cycles to debugging your issue when you don't invest time in collecting the data.

https://operator.docs.scylladb.com/stable/support/must-gather.html

Got it, sorry. Attached!

zimnx · 2024-07-15T07:29:04Z

For GCP local SSDs, it is important to disable the disk writeback cache; otherwise the disks cannot handle the optimized load. The Operator only does this when Scylla Enterprise is used as the utility image. If you have a license, switch the image in scyllaoperatorconfig.spec.scyllaUtilsImage to the Enterprise one.
Otherwise, change the cache settings manually on all your nodes.
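A minimal sketch of the manual route, assuming the nodes expose the standard Linux sysfs attribute `/sys/block/<device>/queue/write_cache`, which accepts `write back` or `write through`. The helper name `set_write_through`, its `sysfs_root` parameter, and the device name `nvme0n1` are hypothetical and chosen only for illustration; this is not necessarily how the Operator or perftune.py applies the setting.

```python
from pathlib import Path


def set_write_through(devices, sysfs_root="/sys"):
    """Switch each listed block device to 'write through' cache mode.

    Returns a dict mapping each device name to the mode read back afterwards.
    `sysfs_root` is a hypothetical parameter so the function can be tried
    against a scratch directory instead of the real /sys.
    """
    results = {}
    for dev in devices:
        attr = Path(sysfs_root) / "block" / dev / "queue" / "write_cache"
        # The kernel reports "write back" or "write through" in this file.
        attr.write_text("write through\n")
        results[dev] = attr.read_text().strip()
    return results


# Example call (needs root on the node; "nvme0n1" is an assumed device name,
# list the real ones with `lsblk`):
#   set_write_through(["nvme0n1"])
```

Values written to sysfs do not survive a reboot, so a node that restarts would need the setting applied again, and it has to be done on every node that hosts Scylla data.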

gdubicki · 2024-07-15T07:32:00Z

Thanks @zimnx!

But do you mean we should both apply perftune and disable the disk writeback cache for the tuning to work as expected?

zimnx · 2024-07-15T07:35:13Z

yes

gdubicki · 2024-07-16T10:32:20Z

In my opinion, once scylladb/seastar#2350 is implemented, the "experimental" note and the "needs testing in non-production and is not easy to revert" warning could be removed.

But the main problem here has been resolved, so this issue can remain closed.

Thank you for your understanding and your help, @tnozicka and @zimnx!

mykaul · 2024-07-16T11:01:55Z

@gdubicki - can you clarify if the issue was indeed the writeback cache?

gdubicki added the kind/bug label on Jul 14, 2024

scylla-operator-bot added the needs-priority label on Jul 14, 2024

gdubicki added three commits to gdubicki/scylla-operator that referenced this issue on Jul 14, 2024 (8d11f10, 6a2f138, e1a6f57), each titled "Make warning about using performance tuning more explicit" and marked "partially fixes scylladb#2017"

gdubicki mentioned this issue on Jul 14, 2024 in "Make warning about using performance tuning more explicit" #2018 (merged)

scylla-operator-bot closed this as completed in #2018 (c95b49f) on Jul 16, 2024
