"Performance tuning" docs should be changed to prevent users from having performance-related incidents #2017

Closed
gdubicki opened this issue Jul 14, 2024 · 7 comments · Fixed by #2018
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-priority Indicates a PR lacks a `priority/foo` label and requires one.

Comments

@gdubicki
Contributor

gdubicki commented Jul 14, 2024

What happened?

We followed the "Performance tuning" docs to optimize our cluster, and afterwards the cluster's performance degraded severely.

In particular, average write times increased from ~500ms to ~3000ms (about 6x higher), while the 95th percentile increased from ~2500ms to ~17500ms (about 7x higher)!

Read times were affected as well, although less severely.

Please see the screenshot from our metrics. The "optimizations" were applied shortly before 21:00 here.

Screenshot 2024-07-13 at 10 59 10

This led to a multi-day performance incident, and reverting the changes took hours.

We don't want anyone else to run into this problem ever again.

What did you expect to happen?

  1. The Performance tuning docs should start with a prominent callout warning stating that in some cases the tuning works very badly, possibly breaking performance, and that it is hard to revert (basically you have to restart your nodes, which in some cases, e.g. when using local SSDs in GKE, requires a hacky procedure to keep the data).

I am talking about this style of warning:

Screenshot 2024-07-14 at 12 43 29
  2. What perftune.py does should become easily revertible (see scylladb/seastar#2350, "perftune.py should allow rollback / revert the changes it made") if "Performance tuning" is to become a non-experimental feature. Preferably, the Scylla Operator should then support running it in revert mode.

How can we reproduce it (as minimally and precisely as possible)?

Read the docs https://operator.docs.scylladb.com/stable/performance.html

Scylla Operator version

1.13.0

Kubernetes platform name and version

$ kubectl version
Client Version: v1.29.6
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.5-gke.1192000

Please attach the must-gather archive.

I don't think it matters here, because the perftune.py performance reduction itself is being looked into in scylladb/seastar#2350, but here it is, if needed:

scylla-operator-must-gather-9zqx6hqsn6zb.zip

This is from after the revert of the perftune settings. For more info about what perftune did, please see scylladb/seastar#2350.

Anything else we need to know?

See also scylladb/seastar#2350 for some more info about our setup.

@gdubicki gdubicki added the kind/bug Categorizes issue or PR as related to a bug. label Jul 14, 2024
@scylla-operator-bot scylla-operator-bot bot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Jul 14, 2024
gdubicki added a commit to gdubicki/scylla-operator that referenced this issue Jul 14, 2024
@tnozicka
Member

Please attach the must-gather archive.

Doesn't apply

The template specifically says it is required, and it helps us identify the platform and settings that led to the case you describe. We can't commit cycles to debug your issue when you don't invest time into collecting the data.

https://operator.docs.scylladb.com/stable/support/must-gather.html

@gdubicki
Contributor Author

Please attach the must-gather archive.

Doesn't apply

The template specifically says it is required, and it helps us identify the platform and settings that led to the case you describe. We can't commit cycles to debug your issue when you don't invest time into collecting the data.

https://operator.docs.scylladb.com/stable/support/must-gather.html

Got it, sorry. Attached!

@zimnx
Collaborator

zimnx commented Jul 15, 2024

For GCP local SSDs it's important to disable the disk writeback cache; otherwise the disks cannot handle the optimized load. The Operator only does this when Scylla Enterprise is used as the utility image. If you have a license, switch the image in scyllaoperatorconfig.spec.scyllaUtilsImage to the Enterprise one.
Alternatively, change the cache settings manually on all your nodes.
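
A minimal sketch of the manual route, assuming the nodes run Linux and expose the standard `/sys/block/<device>/queue/write_cache` sysfs attribute. The helper names and the `SYSFS_ROOT` variable are hypothetical, introduced only for illustration; this is not necessarily what the Operator or perftune.py does internally, and it should be tried outside production first.

```shell
#!/bin/sh
# Sketch only: inspect and change the block-layer write cache mode via sysfs.
# SYSFS_ROOT is a hypothetical override so the helpers can be tried on a scratch directory.
SYSFS_ROOT="${SYSFS_ROOT:-/sys/block}"

# Print "<device>: <mode>" for every block device that exposes write_cache.
show_write_cache() {
  for f in "$SYSFS_ROOT"/*/queue/write_cache; do
    [ -r "$f" ] || continue
    dev=${f#"$SYSFS_ROOT"/}
    dev=${dev%%/*}
    printf '%s: %s\n' "$dev" "$(cat "$f")"
  done
}

# Tell the kernel to treat one device (for example nvme0n1) as "write through"
# instead of "write back". Needs root on a real node; the setting is not
# persistent across reboots.
set_write_through() {
  printf 'write through\n' > "$SYSFS_ROOT/$1/queue/write_cache"
}
```

On a node, `show_write_cache` lists the current mode per device and `set_write_through nvme0n1` changes a single device; the device name is only an example and has to be repeated for every local SSD on every node.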

@gdubicki
Contributor Author

Thanks @zimnx!

But do you mean we should both apply perftune and disable the disk writeback cache for the tuning to work as expected?

@zimnx
Collaborator

zimnx commented Jul 15, 2024

yes

@gdubicki
Contributor Author

gdubicki commented Jul 16, 2024

In my opinion, if scylladb/seastar#2350 were implemented, the "experimental" info and the "needs testing in non-production and is not easy to revert" warning could be removed.

But the main problem here has been resolved, so this issue can remain closed.

Thank you for the understanding and your help, @tnozicka and @zimnx!

@mykaul
Contributor

mykaul commented Jul 16, 2024

@gdubicki - can you clarify if the issue was indeed the writeback cache?


4 participants