Rebalances can lead to wrong lag being reported #1308

Open · nachogiljaldo opened this issue Jul 8, 2024 · 2 comments
Describe the bug

When commits are not immediate and/or some events have a relatively high processing time, and the topic sustains low traffic, lag can be reported even though all events were processed and committed.

Kafka Version

3.6.x

To Reproduce

This test reproduces the behavior:
https://github.com/nachogiljaldo/kafka-go/blob/do_not_commit_offset_of_not_owned_partitions/reader_test.go#L1184

The situation is as follows (see the sketch after this list):

  • consumer A has partitions 1 and 2
  • consumer A receives msg 1 for partition 2
  • consumer B appears and gets assigned partition 1
  • consumer B receives msgs 1 and 2 and commits them
  • consumer A finishes consuming msg 1 and commits it → the committed offset goes back to 1, but consumer B does not know that, so it keeps fetching from offset 2
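
For context, here is a minimal sketch of a consumer setup that opens this window. The broker address, topic name, and group ID are assumptions for illustration; the delayed commit plus the slow handler leave room for a rebalance between fetch and commit:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/segmentio/kafka-go"
)

func main() {
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers:        []string{"localhost:9092"}, // assumption: local broker
		GroupID:        "lag-repro",                // assumption: group name
		Topic:          "events",                   // assumption: two-partition topic
		CommitInterval: 5 * time.Second,            // commits are not immediate
	})
	defer r.Close()

	for {
		msg, err := r.FetchMessage(context.Background())
		if err != nil {
			log.Fatal(err)
		}
		// Simulate a high processing time; a rebalance that moves this
		// partition to another consumer can happen during this sleep.
		time.Sleep(10 * time.Second)
		// This commit may now target a partition the reader no longer owns.
		if err := r.CommitMessages(context.Background(), msg); err != nil {
			log.Printf("commit failed: %v", err)
		}
	}
}
```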

Expected Behavior

There are two things I would expect:

  • committing a message for a partition that is not assigned to the current consumer fails (see the sketch after this list)
  • if that happens, we re-read all events that have now become uncommitted, but that does not happen either
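
To illustrate the first expectation, a sketch of what an ownership check could look like, using kafka-go's consumer-group `Generation.Assignments`; `ownsPartition` is a hypothetical helper, not an existing API:

```go
import "github.com/segmentio/kafka-go"

// ownsPartition reports whether this generation still owns the given
// topic/partition. Generation.Assignments maps each topic to the list of
// partitions assigned to this member. Hypothetical helper for illustration.
func ownsPartition(gen *kafka.Generation, topic string, partition int) bool {
	for _, a := range gen.Assignments[topic] {
		if a.ID == partition {
			return true
		}
	}
	return false
}
```

With a guard like this in the commit path, a commit for a partition that has moved to another consumer could fail (or be skipped) instead of silently rewinding the committed offset.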

Observed Behavior

We never re-read the missing message, which leads to fake lag being reported on topics with little traffic.

@nachogiljaldo (Author)

An alternative solution would be that if the offset goes back in time (because of another consumer), the consumer that is assigned to the partition receives that event again, so it has the chance to reprocess and acknowledge it. That option seems better to me because it should be free of race conditions.

@nachogiljaldo (Author)

OK, I believe I found the culprit, and the solution could come from two places.

There are two problems, the way I see it:
a) the reader does not verify the partitions that are about to be committed against the generation's Assignments (see the sketch below). This opens the door to a race condition between the consumer that has just received a partition and commits still pending from the old generation.
b) that would not be a big deal if the offset were not cached at the connection level. But because the connection keeps the last offset it has seen, when the committed offset "goes back in time" due to (a), the connection is not aware of it and does not re-read the messages, which then appear as lag until new events are produced, processed, and committed.
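
A sketch of what a fix for (a) could look like: filter the offsets to commit against the generation's Assignments so that offsets for unowned partitions are dropped instead of being committed. `commitOwnedOnly` is a hypothetical helper, not the library's actual fix:

```go
import "github.com/segmentio/kafka-go"

// commitOwnedOnly drops offsets for partitions this generation no longer
// owns before committing, closing the race described in (a).
// Hypothetical helper for illustration.
func commitOwnedOnly(gen *kafka.Generation, offsets map[string]map[int]int64) error {
	filtered := make(map[string]map[int]int64)
	for topic, byPartition := range offsets {
		owned := make(map[int]bool, len(gen.Assignments[topic]))
		for _, a := range gen.Assignments[topic] {
			owned[a.ID] = true
		}
		for partition, offset := range byPartition {
			if !owned[partition] {
				continue // partition moved to another consumer; skip its offset
			}
			if filtered[topic] == nil {
				filtered[topic] = make(map[int]int64)
			}
			filtered[topic][partition] = offset
		}
	}
	return gen.CommitOffsets(filtered)
}
```

Filtering (or failing) at commit time only addresses (a); fixing (b) would additionally require the connection to drop its cached offset when partition ownership changes, so the messages can be re-read.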

nachogiljaldo added a commit to nachogiljaldo/kafka-go that referenced this issue Nov 15, 2024