
Add metrics for received etcd calls #270

Open
gberche-orange opened this issue Jan 17, 2024 · 16 comments
@gberche-orange

Expected behavior

As a kine user
In order to diagnose performance issues such as chatty SQL for a standby kine instance in a multi-server cluster
I need to measure incoming etcd traffic received by each kine instance

Current behavior

Workaround

Measure client-side etcd calls from the K8s API server; however, because calls are load-balanced across kine instances, these client-side metrics cannot show what each instance actually receives.

@brandond
Member

A PR would be welcome. It is unlikely that we will be able to prioritize full parity with etcd's server metrics.

@gberche-orange
Author

Thanks for your prompt response. My team is unfortunately unable to contribute golang features at this time. Partial, incomplete metrics would likely still be useful.

@brandond
Member

diagnose performance issues such as chatty SQL for a standby kine instance in a multi-server cluster

If you're just trying to figure out whether a "standby" instance is being used, wouldn't the SQL call metrics be a good proxy for client traffic? There will always be a base level of chatter, but fewer client requests will still directly correlate with fewer SQL calls.

@gberche-orange
Author

From manually peeking at netstat connections in the kine container, we suspect there is no incoming client traffic into the "standby" instance, yet the SQL call metrics are still pretty high. Server-side etcd metrics would be a robust way to get the truth.

As a workaround, we're trying to infer etcd client traffic from container network traffic, using the metric container_network_receive_bytes_total (cumulative bytes received, excluding eth0; see google/cadvisor#2615). In other words, we try to infer incoming etcd traffic into kine from the measured traffic on the different interfaces. But this is fragile too. See a sample graph below; the kine standby is on server-1.
[screenshot: per-interface container network traffic graph]

We're cross-checking with Cilium flow metrics to confirm the standby kine pod is not receiving ingress connections that it shouldn't.

Note that our setup is somewhat complex; here is a draft sketch.

  • shared-services is a large k8s cluster where workloads run. It uses a remote kine as its backing store.
  • ha-data-store is another k8s cluster with a native etcd backing store. It is smaller and dedicated to providing a scalable multi-AZ kine cluster backed by a cnpg db.

[diagram: shared-services and ha-data-store clusters]

Our constraint is that cross-AZ bandwidth is limited in our environment (represented as the green wireguard bars), so we try to favor local-AZ traffic affinity.

@brandond
Member

brandond commented Jan 17, 2024

I will note that kine doesn't really have a concept of being in standby. Even if there are no client connections, it will continue polling the datastore, converting rows into events (even if they don't get sent to any watchers), processing event TTLs, attempting to perform compaction, and so on.

The only thing that won't happen on the "standby" nodes is any one-off queries related to direct resource retrieval or modification - basically anything that flows through the "append" or "list" nodes in the diagram at https://github.com/k3s-io/kine/blob/master/docs/flow.md. Everything in the poll loop at the bottom runs at all times.

With that in mind, you might consider changing your deployment to favor a different architecture, or working on some optimizations that disable the poll loop when kine has no clients.
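As a rough picture of what "disable the poll loop when kine has no clients" could mean, here is a minimal, hypothetical watcher-count gate in Go; the names are illustrative, not kine's API, and kine's internal TTL and compaction work also relies on the event feed, so a real change would need to account for that.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// watcherGate tracks connected watch clients so a poll loop could be
// paused when the count reaches zero. Hypothetical sketch only.
type watcherGate struct{ n atomic.Int64 }

func (g *watcherGate) add()    { g.n.Add(1) }
func (g *watcherGate) remove() { g.n.Add(-1) }

// shouldPoll reports whether at least one watch client is connected.
func (g *watcherGate) shouldPoll() bool { return g.n.Load() > 0 }

func main() {
	var g watcherGate
	fmt.Println("poll with no clients:", g.shouldPoll())  // false
	g.add()
	fmt.Println("poll with one client:", g.shouldPoll())  // true
	g.remove()
	fmt.Println("poll after disconnect:", g.shouldPoll()) // false
}
```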

@gberche-orange
Author

gberche-orange commented Jan 17, 2024

Thanks for highlighting the background work performed by all kine instances. We'll take more time trying to better understand it.

We initially focused on the compactor, as we noticed a high volume of related SQL queries, although we can't yet measure their cumulative bandwidth usage. We were considering staggering the start of the kine processes in order to avoid concurrent execution of the compactor (as a workaround for turning it off with #230). Do you think that could help reduce the overall background activity?

Once we confirm no etcd requests hit the "standby" kine instance, we may need to analyze the background activity further and see whether it could come from unusual/unoptimized usage patterns of the k8s API. Here are some thoughts on measuring the background activity.

With that in mind, you might consider changing your deployment to favor a different architecture,

We are considering asking the k8s scheduler to colocate a single kine instance with the single active pg process using pod affinity rules. When the pg process fails over, we'll need to reschedule the kine pod. The descheduler project looked appealing, but support for descheduling pods violating podAffinity is pending; see kubernetes-sigs/descheduler#1286. Any other ideas?

I am wary, however, of relying too much on the k8s scheduler during recovery from AZ outages, as I fear the failover time will be long. We have to test and measure that failover time.

@brandond
Member

brandond commented Jan 17, 2024

We were considering spreading the start of kine processes in order to avoid concurrent execution of the compactor

Only one node will actually compact; the others will skip compaction due to the compact_rev_key value changing between runs. I wouldn't worry too much about that.
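The "only one node compacts" behavior can be pictured as an optimistic compare-and-set on the compact revision. Below is an in-memory Go illustration of that pattern, with made-up names; it is not kine's actual SQL, just the guard logic.

```go
package main

import (
	"fmt"
	"sync"
)

// store mimics a single compact_rev_key row: compaction only proceeds
// if the revision still matches what this node read earlier, so
// concurrent compactors skip instead of duplicating work.
type store struct {
	mu         sync.Mutex
	compactRev int64
}

// tryCompact performs a compare-and-set on the compact revision and
// reports whether this node won the right to compact.
func (s *store) tryCompact(seenRev, newRev int64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.compactRev != seenRev {
		return false // another node compacted first; skip this run
	}
	s.compactRev = newRev
	return true
}

func main() {
	s := &store{compactRev: 100}
	// Two nodes both read rev 100; only the first CAS wins.
	fmt.Println("node A compacts:", s.tryCompact(100, 200)) // true
	fmt.Println("node B compacts:", s.tryCompact(100, 200)) // false
}
```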

Let me take a look at what metrics etcd provides, and see how much work it would be to add some initial ones to kine.

We are considering asking the k8s scheduler to colocate a single kine with the single pg active process using pod affinity rules. When the pg process will fail over, we'll need to reschedule the kine pod.

That sounds like more of a job for a custom operator, as opposed to something that the default scheduler can handle. If you want to control placement of a pod based on the state of a system that is opaque to Kubernetes (i.e., the primary pg cluster member), that is operator territory.

@gberche-orange
Author

gberche-orange commented Jan 18, 2024

Let me take a look at what metrics etcd provides, and see how much work it would be to add some initial ones to kine.

Thanks a lot.

With that in mind, you might consider [...] working on some optimizations that disable the poll loop when kine has no clients.

That sounds like the best option, coupled with having one kine container colocated with each pg container, regardless of whether it is a primary or a read-only replica. This architecture does not rely on the k8s scheduler to provide HA and failover, and should improve uptime in the face of network partitions or AZ loss.

Our early attempt at colocating kine in the postgresql pod showed that the kine process exited early when connecting to a read-only pg replica. With the default container start command, this resulted in failure of the whole pod, including postgresql.

I'm considering options for having the "standby" kine instance wait for the backing store to be available before accepting incoming traffic. Here are some thoughts:

  • In the container start command, use a watchdog mechanism (a simple shell retry loop or a full-fledged tool) to periodically restart the failed kine process, possibly filtering out some error logs to reduce log pollution.
  • In kine, add a new feature flag to retry periodically instead of exiting immediately when the backing store is not available.

@brandond
Member

I did a quick scan over the etcd metrics; I think these are the ones that we could theoretically support:

server/mvcc/metrics.go:	rangeCounter = prometheus.NewCounter(
server/mvcc/metrics.go:	putCounter = prometheus.NewCounter(
server/mvcc/metrics.go:	deleteCounter = prometheus.NewCounter(
server/mvcc/metrics.go:	txnCounter = prometheus.NewCounter(
server/mvcc/metrics.go:	keysGauge = prometheus.NewGauge(
server/mvcc/metrics.go:	totalPutSizeGauge = prometheus.NewGauge(

server/mvcc/metrics.go:	watchStreamGauge = prometheus.NewGauge(
server/mvcc/metrics.go:	watcherGauge = prometheus.NewGauge(
server/mvcc/metrics.go:	totalEventsCounter = prometheus.NewCounter(

server/mvcc/metrics.go:	dbCompactionTotalMs = prometheus.NewHistogram(
server/mvcc/metrics.go:	dbCompactionLast = prometheus.NewGauge(
server/mvcc/metrics.go:	dbCompactionKeysCounter = prometheus.NewCounter(
server/mvcc/metrics.go:	dbTotalSize = prometheus.NewGaugeFunc(prometheus.GaugeOpts{

server/mvcc/metrics.go:	currentRev = prometheus.NewGaugeFunc(prometheus.GaugeOpts{
server/mvcc/metrics.go:	compactRev = prometheus.NewGaugeFunc(prometheus.GaugeOpts{

server/lease/metrics.go:	leaseGranted = prometheus.NewCounter(prometheus.CounterOpts{
server/lease/metrics.go:	leaseRevoked = prometheus.NewCounter(prometheus.CounterOpts{
server/lease/metrics.go:	leaseRenewed = prometheus.NewCounter(prometheus.CounterOpts{
server/lease/metrics.go:	leaseTotalTTLs = prometheus.NewHistogram(
server/etcdserver/metrics.go:	leaseExpired = prometheus.NewCounter(prometheus.CounterOpts{

server/etcdserver/metrics.go:	currentVersion = prometheus.NewGaugeVec(prometheus.GaugeOpts{
server/etcdserver/metrics.go:	currentGoVersion = prometheus.NewGaugeVec(prometheus.GaugeOpts{
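For illustration, here is a dependency-free Go sketch of what the first group of counters (rangeCounter, putCounter, deleteCounter, txnCounter) would track in kine. It uses sync/atomic in place of prometheus/client_golang purely to keep the example self-contained; the type and method names are illustrative.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// callCounters is a stand-in for the Prometheus counters etcd exposes
// for received calls; the fields mirror the metric list above.
type callCounters struct {
	ranges, puts, deletes, txns atomic.Int64
}

// observe bumps the counter matching the incoming etcd RPC method.
func (c *callCounters) observe(method string) {
	switch method {
	case "Range":
		c.ranges.Add(1)
	case "Put":
		c.puts.Add(1)
	case "DeleteRange":
		c.deletes.Add(1)
	case "Txn":
		c.txns.Add(1)
	}
}

func main() {
	var c callCounters
	for _, m := range []string{"Range", "Range", "Put", "Txn"} {
		c.observe(m)
	}
	fmt.Printf("range=%d put=%d delete=%d txn=%d\n",
		c.ranges.Load(), c.puts.Load(), c.deletes.Load(), c.txns.Load())
	// range=2 put=1 delete=0 txn=1
}
```

In a real implementation these would be registered as Prometheus counters and incremented from kine's gRPC handlers, so each instance reports the etcd traffic it actually receives.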

@brandond brandond self-assigned this Jan 25, 2024
@gberche-orange
Author

Awesome, thanks Brad, that would be very useful!

@gberche-orange
Author

Re: metrics.

I suspect that the loop (which you kindly highlighted in #270 (comment)), fetching all row contents, accounts for a significant proportion of the kine-to-db traffic we're observing:

rows, err := s.d.After(s.ctx, "%", last, pollBatchSize)

What metrics could help measure the impact of this loop and help focus potential future optimization efforts?

@brandond
Member

The SQL queries would be captured in the existing SQL metrics. totalEventsCounter will capture the events (rows) produced by this polling.

Having the poll loop not run while there are no watch clients connected may not be an achievable goal. There are some internal maintenance/cleanup routines (mostly around TTL handling) that rely on having a constant feed of events from the cluster. Compaction will also attempt to run at intervals even if there are no clients. It'll take some experimentation to see what we can do; kine was not designed for active/standby use.
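The relationship between polled rows and a totalEventsCounter-style metric can be sketched as follows. This is a stdlib-only illustration with made-up names; the fetch function stands in for the s.d.After(...) call quoted above, and real kine code tracks revisions from row ids rather than row counts.

```go
package main

import "fmt"

// pollOnce sketches one iteration of a poll loop, bumping an events
// counter by the number of rows converted to events. Illustrative only.
func pollOnce(fetch func(last int64) []string, last int64, events *int64) int64 {
	rows := fetch(last)
	*events += int64(len(rows)) // what a totalEventsCounter would record
	return last + int64(len(rows))
}

func main() {
	// Simulate three polls returning 3, 1, and 0 new rows.
	batches := [][]string{{"a", "b", "c"}, {"d"}, {}}
	i := 0
	fetch := func(_ int64) []string {
		b := batches[i%len(batches)]
		i++
		return b
	}
	var events int64
	last := int64(0)
	for range batches {
		last = pollOnce(fetch, last, &events)
	}
	fmt.Println("events:", events) // events: 4
}
```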

@gberche-orange
Author

Thanks again for your feedback!

totalEventsCounter will capture the events (rows) produced by this polling.

This new metric would be very useful thanks !

I'm suspecting that the loop, fetching all rows content, is a significant proportion of the kine-to-db traffic we're observing

I still need to refine my analysis w.r.t. suspected top traffic contributors. As I don't yet have an automated mapping from pg statement stats to kine functions, I currently need to perform the analysis from pg statement log samples. Here is a copy of the pg statement stats, sorted by number of rows returned. $3 does not always appear in pg logs in my sample; it seems to be $3 = 'compact_rev_key', apparently hinting at compactor queries.

[screenshot: pg statement stats sorted by rows returned]

[screenshot: pg statement stats, continued]

The queryId=7735331401850268309 associated with the top statement corresponds, in pg log_statement=all output, to the following, with $1 iterating over many keys.

 detail:parameters: $1 = '/registry/kustomize.toolkit.fluxcd.io/kustomizations/sf-c9dee50c-5c6e-44af-81bb-9d7ece98ca8b/instance-redis-c9dee50c-5c6e-44af-81bb-9d7ece98ca8b', $2 = 'f' message: execute st[680/1883]
284:
                SELECT *
                FROM (
                        SELECT (
                SELECT MAX(rkv.id) AS id
                FROM kine AS rkv), (
                SELECT MAX(crkv.prev_revision) AS prev_revision
                FROM kine AS crkv
                WHERE crkv.name = 'compact_rev_key'), kv.id AS theid, kv.name, kv.created, kv.deleted, kv.create_revision, kv.prev_revision, kv.lease, kv.value, kv.old_value
                        FROM kine AS kv
                        JOIN (
                                SELECT MAX(mkv.id) AS id
                                FROM kine AS mkv
                                WHERE
                                        mkv.name LIKE $1
                                GROUP BY mkv.name) AS maxkv
                                ON maxkv.id = kv.id
                        WHERE
                                kv.deleted = 0 OR
                                $2
                ) AS lkv
                ORDER BY lkv.theid ASC

It'll take some experimentation to see what we can do; kine was not designed for active/standby use.

Yes, the alternative for us is to have only a single active kine process at a given time, and to have the "standby" kine instance wait for the backing store to be available before accepting incoming traffic and starting its compactor and poll loop (see #270 (comment)). I have on my list to prototype the container start command retry loop.

@brandond
Member

brandond commented Jan 30, 2024

$3 does not always appear in pg logs in my sample, and seems to be $3 = 'compact_rev_key', apparently rather hinting to compactor queries

This is not an event poll or compact; the query you're digging into here is a list query. You'll see the same thing with different values for the mkv.name LIKE $1 parameter depending on what's being listed. If this is your top query, then most of your activity is coming from clients listing things. You might look at the parameters, and see if there is some particular resource type that is driving a lot of traffic.

You'll see SELECT MAX(rkv.id) AS id FROM kine AS rkv and SELECT MAX(crkv.prev_revision) AS prev_revision FROM kine AS crkv WHERE crkv.name = 'compact_rev_key' in many queries, as etcd is required to return the current and compact revision along with the response.

@oadekoya

oadekoya commented Apr 8, 2024

@brandond any update on adding those etcd client metrics?

@brandond
Member

brandond commented Apr 8, 2024

Nope. You'll see a PR linked here when one is available.
