How do we debug why "timeout" issues are happening #303

Open
namanchikara opened this issue May 4, 2023 · 1 comment

namanchikara commented May 4, 2023

Hi @lni, first of all, thanks for contributing Dragonboat to the OSS world! We're currently using Dragonboat in an application that needs to be highly scalable. We have 5 nodes (4-core CPU, 16 GB RAM each), and we're using BadgerDB for the state machine.

To make the system fail-safe and recoverable, we're testing it under different failure scenarios. In one such scenario, we bring one of the 5 nodes down while transactions are flowing in (around 2k TPS), wait for a minute or so, and then bring that node back up. If my understanding is correct, this node is supposed to recover by identifying who the leader is and how far behind the leader it is. If no snapshot has been created so far, the leader will send the logs to this node; otherwise it will send the snapshot.
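
The catch-up rule described above can be sketched as follows. This is only an illustration of how a Raft leader typically chooses between replaying log entries and sending a snapshot, not Dragonboat's actual code. `chooseCatchUp`, `catchUpAction` and the parameter names are hypothetical.

```go
package main

// catchUpAction describes what a leader sends to a lagging follower.
type catchUpAction string

const (
	sendEntries  catchUpAction = "entries"
	sendSnapshot catchUpAction = "snapshot"
	upToDate     catchUpAction = "up-to-date"
)

// chooseCatchUp is a hypothetical helper, not a Dragonboat API.
//
//	followerNext   - the next log index the follower needs
//	firstAvailable - the oldest index still present in the leader's log
//	                 (older entries were compacted into a snapshot)
//	lastIndex      - the leader's last log index
func chooseCatchUp(followerNext, firstAvailable, lastIndex uint64) catchUpAction {
	switch {
	case followerNext > lastIndex:
		// Nothing left to send.
		return upToDate
	case followerNext < firstAvailable:
		// The entries the follower needs no longer exist in the log,
		// so only a snapshot can bring it up to date.
		return sendSnapshot
	default:
		// The needed entries are still in the log and can be replayed.
		return sendEntries
	}
}
```

Under this assumption, a node that was down for a minute at 2k TPS is missing on the order of 120,000 entries, so whether it receives entries or a snapshot depends on how aggressively the log has been compacted in the meantime.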

What we're currently seeing in this scenario is the following message in the application logs of the restarted node:

error timeout: shard is not ready

What's more interesting is if we:

  1. Start the run (2k TPS)
  2. Bring down one node
  3. Stop the transactions after a while
  4. Bring the node back up
  5. Start the transactions after the node is back up

then the node is able to recover and process the transactions.

The only difference is step 3. If we don't stop the transactions before bringing the node back up (which would be the production use case), we see the timeout issues. I hope this conveys what we're trying to do, and that you can give us some pointers on how to debug it further.

From the logs, the error also seems to come from the SyncPropose method. We have info-level logging enabled for Dragonboat and only error-level logging for our application and BadgerDB. Please let me know if you need any more info from our end.
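
One way to see whether the error is transient is to wrap the blocking propose call in a bounded retry and log how many attempts it takes. The sketch below uses only the standard library. `proposeFunc` is a hypothetical stand-in for a closure around the real SyncPropose call, and `errNotReady` is a hypothetical sentinel for whatever error value the application actually receives; neither is a Dragonboat identifier.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errNotReady is a hypothetical sentinel for the "shard is not ready" error.
var errNotReady = errors.New("shard is not ready")

// proposeFunc is a hypothetical wrapper around a blocking propose call.
type proposeFunc func(ctx context.Context, cmd []byte) error

// proposeWithRetry calls fn with a per-attempt timeout. While fn reports
// errNotReady it waits and retries, doubling the wait up to maxBackoff.
// It stops on success, on any other error, or when ctx ends, and returns
// the number of attempts made.
func proposeWithRetry(ctx context.Context, fn proposeFunc, cmd []byte,
	perAttempt, backoff, maxBackoff time.Duration) (int, error) {
	attempts := 0
	for {
		attempts++
		attemptCtx, cancel := context.WithTimeout(ctx, perAttempt)
		err := fn(attemptCtx, cmd)
		cancel()
		if err == nil || !errors.Is(err, errNotReady) {
			return attempts, err
		}
		select {
		case <-ctx.Done():
			return attempts, fmt.Errorf("gave up after %d attempts: %w", attempts, err)
		case <-time.After(backoff):
		}
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}

// simulateRecovery runs the helper against a fake proposer that is
// "not ready" for its first two calls and then accepts the proposal.
func simulateRecovery() (int, error) {
	calls := 0
	fake := func(ctx context.Context, cmd []byte) error {
		calls++
		if calls <= 2 {
			return errNotReady
		}
		return nil
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	return proposeWithRetry(ctx, fake, []byte("set k v"),
		100*time.Millisecond, time.Millisecond, 10*time.Millisecond)
}
```

If the attempt count stays small, the node is simply being used before it has finished recovering. If the retries run until the outer deadline, the recovery itself is stuck or too slow under load.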

lni (Owner) commented Jun 5, 2023

> If my understanding is correct, this node is supposed to recover by identifying who the leader is and how far behind the leader it is. If no snapshot has been created so far, the leader will send the logs to this node; otherwise it will send the snapshot.

You are correct.

> error timeout: shard is not ready

Did that message go away after a while? My understanding is that you were trying to use the recovered node while it was still in the process of being recovered (not ready yet).

You may want to check why recovery is taking longer than expected; slow state machine recovery is one possible cause.
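
To answer the "did it go away after a while" question with a number, the restarted node can be polled until it first responds and the elapsed time recorded. This is a minimal sketch under stated assumptions: `readyProbe` is a hypothetical function type, and what the probe actually does (for example a cheap read or proposal against the recovering node) is left to the application.

```go
package main

import (
	"context"
	"time"
)

// readyProbe is a hypothetical check that reports whether the recovering
// node can currently serve a request.
type readyProbe func(ctx context.Context) bool

// timeUntilReady polls probe every interval. It returns the elapsed time and
// true once probe succeeds, or the elapsed time and false if ctx ends first.
func timeUntilReady(ctx context.Context, probe readyProbe, interval time.Duration) (time.Duration, bool) {
	start := time.Now()
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if probe(ctx) {
			return time.Since(start), true
		}
		select {
		case <-ctx.Done():
			return time.Since(start), false
		case <-ticker.C:
		}
	}
}

// simulateProbe runs the helper against a fake probe that starts
// reporting ready on its fourth call, and returns the number of polls.
func simulateProbe() (int, bool) {
	polls := 0
	probe := func(ctx context.Context) bool {
		polls++
		return polls >= 4
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	_, ok := timeUntilReady(ctx, probe, time.Millisecond)
	return polls, ok
}
```

Comparing the measured time with and without the 2k TPS load (steps 3 and 5 of the working scenario) would show whether recovery never completes under load or just takes longer than the proposal timeout.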
