Reliability issue during failures to write to KDB #10

jarkaxi · 2024-05-24T08:21:49Z

Hi,

I've been reading through your code to see how it works. I did spot an issue with the reliability of the process and wanted to highlight this.

In https://github.com/DataIntellectTech/kdb-chronicle-queue/blob/main/Adapter/src/main/java/uk/co/aquaq/kdb/adapter/chronicle/ChronicleToKdbAdapter.java#L215 it does the following:

read messages from chronicle queue and accumulate in kdb envelope
try send to KDB
if this fails now reset the queue state back to where it was before we were reading
if application crashes between 2 and 3 you now have a gap in the messages and no way of detecting that gap

Best,

Jark

jarkaxi · 2024-05-31T08:24:07Z

To add feedback from Chronicle around ideal setup:

The mechanism I prefer is one of the below:
A common use case is for a service to read from a queue and integrate with an external system e.g. a tick DB. For this purpose, you should ensure that the AbstractEvent.eventId or eventTime is written to the tick DB in this case. Now when the integration service is restarted, it can query the tick DB for the last eventTime that it has stored, and start replaying from there/filter events before then.

Alternatively you can use a Chronicle Queue named tailer (StartFromStrategy=NAMED) to record where you were up to in reading the queue. However, this will only work if you can guarantee that events have been committed to the external system if your method reader returns without an exception - see below.

If your integration service is reading from a Chronicle Queue using either try (DocumentContext dc = tailer.readingDocument()) { … or with a method reader, then throwing an exception from inside the try block, or from the method reader method, will rollback the read from the queue. The next time you try to read from the tailer you will get the same message. This provides a simple re-try mechanism if the external service is transiently unavailable.

The first mechanism could be used to fix this adapter but you would have to add eventTime to your KDB rows.
The second one works well if you send each row individually to kdb, which is not super-efficient, but can work fine

BGillenDI · 2024-06-05T10:04:12Z

Hi @jarkaxi,

We'll be picking this up from today and will assign someone to look into it immediately and hopefully have an update for you soon

Thanks for spotting and the leg work with Chronicle also

Brien

BGillenDI · 2024-06-16T22:55:24Z

Hi @jarkaxi Tin has merged a fix on this last week, I've left the issue open in the hopes you get a chance to take a look before we resolve

jarkaxi · 2024-06-25T08:23:57Z

Hi @BGillenDI,

That looks like it should work, might be worth noting that you've changed the messaging guarantee form "exactly once" to "at least once" - e.g. KDB could potentially receive the same message more than once - I'm assuming that doesn't have an implication normally?

this is in the case that trySend will send the message to KDB but then it crashes straight after KDB has received & confirmed reception of the message

Tin-Pui · 2024-06-26T13:20:42Z

Hi @jarkaxi, thanks for pointing this out. You are correct, the fix effectively removes a high chance (higher with larger kdb.envelope.size) that a crash causes dropped messages, but it introduces a low chance (lower with larger kdb.envelope.size) that a crash causes duplicate messages in KDB. We can include the index from the ChronicleQueue that can be sent to KDB optionally as a configuration to help detect this if that is preferable? It would require a schema change in KDB for it to work.

jarkaxi · 2024-07-15T13:02:08Z

Hi @Tin-Pui ,

Apologies for the late reply, I believe what you're suggesting is the preferred way of doing this and what Chronicle have advised. Storing the EventTime (which is guaranteed unique per event stream) with the object and retrying from the latest point that you have successfully written is best.

On start-up you then would read from the back of the queue backwards to the last eventTime written and then start replaying from there.

Best,

Jark

BGillenDI assigned Tin-Pui Jun 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reliability issue during failures to write to KDB #10

Reliability issue during failures to write to KDB #10

jarkaxi commented May 24, 2024

jarkaxi commented May 31, 2024

BGillenDI commented Jun 5, 2024

BGillenDI commented Jun 16, 2024

jarkaxi commented Jun 25, 2024

Tin-Pui commented Jun 26, 2024

jarkaxi commented Jul 15, 2024

Reliability issue during failures to write to KDB #10

Reliability issue during failures to write to KDB #10

Comments

jarkaxi commented May 24, 2024

jarkaxi commented May 31, 2024

BGillenDI commented Jun 5, 2024

BGillenDI commented Jun 16, 2024

jarkaxi commented Jun 25, 2024

Tin-Pui commented Jun 26, 2024

jarkaxi commented Jul 15, 2024