Reliability issue during failures to write to KDB #10

Open
jarkaxi opened this issue May 24, 2024 · 6 comments
@jarkaxi

jarkaxi commented May 24, 2024

Hi,

I've been reading through your code to see how it works. I did spot an issue with the reliability of the process and wanted to highlight this.

In https://github.com/DataIntellectTech/kdb-chronicle-queue/blob/main/Adapter/src/main/java/uk/co/aquaq/kdb/adapter/chronicle/ChronicleToKdbAdapter.java#L215 it does the following:

  1. Read messages from the Chronicle queue and accumulate them in the KDB envelope.
  2. Try to send the envelope to KDB.
  3. If the send fails, reset the queue state back to where it was before reading.
  4. If the application crashes between steps 2 and 3, you now have a gap in the messages and no way of detecting that gap.
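
To make the failure window concrete, here is a minimal in-memory sketch. It is not the adapter's code: `CrashWindowDemo`, `PersistentCursor`, and `runBatch` are hypothetical names, and the sketch assumes the read position is persisted as soon as messages are read, which is what would make a crash before the reset lossy.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the read -> send -> reset sequence; not the adapter's code. */
class CrashWindowDemo {

    /** Stand-in for a read position that survives a restart (assumption). */
    static final class PersistentCursor {
        int position = 0;
    }

    /**
     * Runs one batch and returns the messages that actually reached the sink.
     * If crashBeforeReset is true, the process "dies" after the failed send
     * and before the cursor is reset (between steps 2 and 3).
     */
    static List<String> runBatch(List<String> queue, PersistentCursor cursor,
                                 int batchSize, boolean sendSucceeds,
                                 boolean crashBeforeReset) {
        int start = cursor.position;
        List<String> envelope = new ArrayList<>();
        // Step 1: read and accumulate; the persisted cursor moves with every read.
        while (cursor.position < queue.size() && envelope.size() < batchSize) {
            envelope.add(queue.get(cursor.position));
            cursor.position++;
        }
        // Step 2: try to send.
        if (sendSucceeds) {
            return envelope;
        }
        // Crash window: the send failed but the reset never ran.
        if (crashBeforeReset) {
            return new ArrayList<>();
        }
        // Step 3: reset the cursor so the batch is read again next time.
        cursor.position = start;
        return new ArrayList<>();
    }

    public static void main(String[] args) {
        List<String> queue = List.of("m0", "m1", "m2", "m3");
        PersistentCursor cursor = new PersistentCursor();
        runBatch(queue, cursor, 2, false, true);   // failed send, crash before reset
        List<String> afterRestart = runBatch(queue, cursor, 2, true, false);
        System.out.println(afterRestart);          // m0 and m1 were never delivered
    }
}
```

In this model the restart resumes at the persisted position, so the first batch is silently skipped and nothing downstream records that it is missing.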

Best,

Jark

@jarkaxi
Author

jarkaxi commented May 31, 2024

To add feedback from Chronicle around ideal setup:

The mechanism I prefer is one of the below:
A common use case is for a service to read from a queue and integrate with an external system, e.g. a tick DB. In this case, you should ensure that the AbstractEvent.eventId or eventTime is written to the tick DB. When the integration service is restarted, it can then query the tick DB for the last eventTime that it has stored and start replaying from there, or filter out events before then.

Alternatively you can use a Chronicle Queue named tailer (StartFromStrategy=NAMED) to record where you were up to in reading the queue. However, this will only work if you can guarantee that events have been committed to the external system if your method reader returns without an exception - see below.

If your integration service is reading from a Chronicle Queue using either try (DocumentContext dc = tailer.readingDocument()) { … or with a method reader, then throwing an exception from inside the try block, or from the method reader method, will roll back the read from the queue. The next time you try to read from the tailer you will get the same message. This provides a simple retry mechanism if the external service is transiently unavailable.
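
The retry behaviour described above can be modelled in a few lines. This is only an in-memory sketch of the semantics, not Chronicle Queue's API: `RollbackTailerDemo` and `readOne` are hypothetical names, and a real integration would use the tailer or method reader mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical in-memory model of "an exception rolls back the read". */
class RollbackTailerDemo {
    private final List<String> queue;
    private int index = 0;

    RollbackTailerDemo(List<String> queue) {
        this.queue = queue;
    }

    /**
     * Hands the next message to the handler. The index only advances if the
     * handler returns normally; if it throws, the same message is offered
     * again on the next call. Returns true if a message was consumed.
     */
    boolean readOne(Consumer<String> handler) {
        if (index >= queue.size()) {
            return false; // nothing left to read
        }
        try {
            handler.accept(queue.get(index));
        } catch (RuntimeException e) {
            return false; // rolled back: index unchanged
        }
        index++;
        return true;
    }

    int index() {
        return index;
    }

    public static void main(String[] args) {
        RollbackTailerDemo tailer = new RollbackTailerDemo(List.of("m0", "m1"));
        List<String> sink = new ArrayList<>();
        // First attempt: the external system is "down", so the handler throws.
        tailer.readOne(m -> { throw new IllegalStateException("sink unavailable"); });
        // Second attempt: the same message is offered again and now succeeds.
        tailer.readOne(sink::add);
        System.out.println(sink + " index=" + tailer.index());
    }
}
```

Note that this only gives a clean retry when one message maps to one write; with batching, the position of the whole batch has to be handled the same way.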

The first mechanism could be used to fix this adapter, but you would have to add eventTime to your KDB rows.
The second one works well if you send each row individually to kdb, which is not super-efficient but can work fine.

@BGillenDI

Hi @jarkaxi,

We'll be picking this up from today and will assign someone to look into it immediately. Hopefully we'll have an update for you soon.

Thanks for spotting this, and for the legwork with Chronicle.

Brien

@BGillenDI

Hi @jarkaxi, Tin merged a fix for this last week. I've left the issue open in the hope that you get a chance to take a look before we resolve it.

@jarkaxi
Author

jarkaxi commented Jun 25, 2024

Hi @BGillenDI,

That looks like it should work. It might be worth noting that you've changed the messaging guarantee from "exactly once" to "at least once", i.e. KDB could potentially receive the same message more than once. I'm assuming that doesn't normally have any implications?

This would happen if trySend sends the message to KDB, but the application then crashes straight after KDB has received and confirmed reception of the message.

@Tin-Pui
Contributor

Tin-Pui commented Jun 26, 2024

Hi @jarkaxi, thanks for pointing this out. You are correct: the fix effectively removes a high chance (higher with larger kdb.envelope.size) that a crash causes dropped messages, but it introduces a low chance (lower with larger kdb.envelope.size) that a crash causes duplicate messages in KDB. If preferable, we could add a configuration option to send the ChronicleQueue index to KDB along with each message, to help detect duplicates. It would require a schema change in KDB for it to work.
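
As a rough illustration of how a stored index could be used to detect replays, here is a minimal in-memory sketch. `IndexedSinkDemo` and `Row` are hypothetical names, not part of the adapter or of KDB, and the sketch assumes each row carries a queue index that increases in read order.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sink that uses a queue index stored with each row to spot replays. */
class IndexedSinkDemo {

    /** A row tagged with the queue index it was read from (assumed increasing). */
    static final class Row {
        final long queueIndex;
        final String payload;

        Row(long queueIndex, String payload) {
            this.queueIndex = queueIndex;
            this.payload = payload;
        }
    }

    private final List<Row> stored = new ArrayList<>();
    private long lastIndex = -1;   // nothing stored yet
    private int duplicatesSeen = 0;

    /** Stores new rows; rows at or below the last stored index are counted and skipped. */
    void accept(List<Row> envelope) {
        for (Row row : envelope) {
            if (row.queueIndex <= lastIndex) {
                duplicatesSeen++;
                continue;
            }
            stored.add(row);
            lastIndex = row.queueIndex;
        }
    }

    int storedCount() {
        return stored.size();
    }

    int duplicatesSeen() {
        return duplicatesSeen;
    }

    public static void main(String[] args) {
        IndexedSinkDemo sink = new IndexedSinkDemo();
        List<Row> envelope = List.of(new Row(0, "a"), new Row(1, "b"));
        sink.accept(envelope);
        sink.accept(envelope); // replayed after a crash that followed a successful send
        System.out.println(sink.storedCount() + " stored, "
                + sink.duplicatesSeen() + " duplicates");
    }
}
```

Whether duplicates are skipped on write, as here, or only flagged for later clean-up is a design choice; either way the index column is what makes the replay visible.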

@jarkaxi
Author

jarkaxi commented Jul 15, 2024

Hi @Tin-Pui ,

Apologies for the late reply. I believe what you're suggesting is the preferred way of doing this, and it is what Chronicle have advised. Storing the eventTime (which is guaranteed unique per event stream) with the object, and retrying from the latest point that you have successfully written, is best.

On start-up you would then read from the back of the queue backwards to the last eventTime written, and start replaying from there.
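
A minimal sketch of that start-up scan, using an in-memory list in place of the queue. `StartupReplayDemo`, `Event`, and `resumePosition` are hypothetical names, and the sketch assumes event times are unique and increase in queue order.

```java
import java.util.List;

/** Hypothetical start-up recovery: find where to resume from the last stored eventTime. */
class StartupReplayDemo {

    /** Stand-in for a queued event carrying its eventTime. */
    static final class Event {
        final long eventTime;
        final String payload;

        Event(long eventTime, String payload) {
            this.eventTime = eventTime;
            this.payload = payload;
        }
    }

    /**
     * Scans from the back of the queue towards the front until it reaches an
     * event that is not newer than lastStoredEventTime, and returns the
     * position of the first event that still needs to be replayed.
     */
    static int resumePosition(List<Event> queue, long lastStoredEventTime) {
        int position = queue.size();
        while (position > 0 && queue.get(position - 1).eventTime > lastStoredEventTime) {
            position--;
        }
        return position;
    }

    public static void main(String[] args) {
        List<Event> queue = List.of(
                new Event(10, "a"), new Event(20, "b"),
                new Event(30, "c"), new Event(40, "d"));
        long lastStored = 20; // in practice this would be queried from the external store
        int from = resumePosition(queue, lastStored);
        for (Event e : queue.subList(from, queue.size())) {
            System.out.println("replay " + e.eventTime + " " + e.payload);
        }
    }
}
```

Scanning from the back keeps the start-up cost proportional to the number of unsent events rather than to the size of the queue.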

Best,

Jark


3 participants