
Transactions getting rejected and going into a loop #96

Open
gautamdhameja opened this issue May 30, 2024 · 8 comments
gautamdhameja commented May 30, 2024

When sending a large number of batched (batch_all) transactions to the Mythos parachain (tried on westend and a private testnet), after some time the transactions start getting rejected by the chain and their status goes into a loop. The test used 800,000 new accounts (not existing on chain), divided into batches of 500 to 1000 balances.transfer_keep_alive() calls each. A burst of 20 batches was sent at a time, and a new burst was sent once all transactions from the previous burst were finalized.
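The load pattern described above can be sketched as a simple batching helper. This is only an illustration of the test shape; the function names are not from the actual test tooling.

```python
# Hypothetical sketch of the load pattern: split a large account list into
# batch_all-sized chunks, then group the chunks into bursts of 20 batches.

def chunk(items, size):
    """Split items into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def make_bursts(accounts, batch_size=500, burst_size=20):
    """Each burst holds up to `burst_size` batches; each batch holds up to
    `batch_size` transfer targets for a single batch_all call."""
    batches = chunk(accounts, batch_size)
    return chunk(batches, burst_size)

# 800,000 accounts in batches of 500 -> 1,600 batches -> 80 bursts of 20.
bursts = make_bursts(list(range(800_000)))
```

In the test, the sender would submit one burst, wait until every transaction in it is finalized, and only then send the next burst.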

See screenshot below:

[screenshot]
In addition to the transaction status getting into a loop, the following was also observed:

  • The block times had spikes and were not consistent. Some spikes were up to 80 seconds.
  • The nonces returned by the node were inconsistent too.
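One common client-side mitigation for inconsistent on-chain nonces is to track the next nonce locally and only resync with the node on failure. A minimal sketch, assuming a caller-supplied query for the on-chain nonce; the helper itself is hypothetical and not part of the test setup:

```python
class LocalNonceTracker:
    """Track the next nonce per account locally, so that bursts of
    transactions are not signed with stale on-chain nonces."""

    def __init__(self, fetch_onchain_nonce):
        # fetch_onchain_nonce(account) -> int, e.g. from system account info
        self._fetch = fetch_onchain_nonce
        self._next = {}

    def next_nonce(self, account):
        # First use: seed from the chain; afterwards increment locally.
        if account not in self._next:
            self._next[account] = self._fetch(account)
        nonce = self._next[account]
        self._next[account] = nonce + 1
        return nonce

    def reset(self, account):
        # Call after a dropped or invalid tx to resync with the chain.
        self._next.pop(account, None)
```

With this, nonces within a burst stay monotonic regardless of what the node momentarily reports.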

The issue could not be reproduced on a local zombienet, but it reproduces consistently on westend.

PS: The bug was initially reported by the Mythical team on May 22. The core dev team at Parity has been looking into it since then. This issue is for tracking the next steps.

@gautamdhameja

Update: After initial testing and investigation, it was found that if the block execution time is reduced to 1s instead of the original 2s, the issue is not observed on westend. However, this reduces the chain's throughput and should only be considered a workaround while the exact cause is identified. Further testing is needed to confirm that the workaround actually works.
The Mythical team was provided with a new docker image and a runtime on May 29 to test this on their private testnet. Further observations are awaited.
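The cost of the workaround can be estimated with back-of-envelope arithmetic: at a fixed block rate, halving the per-block execution budget roughly halves the number of transactions that fit in a block. The 2 ms per-transfer figure below is an assumed illustrative value, not a measurement:

```python
def max_tx_per_block(exec_budget_s, avg_tx_exec_ms):
    """Rough upper bound on transactions per block, given the block's
    execution-time budget and an average per-transaction execution time."""
    return int(exec_budget_s * 1000 // avg_tx_exec_ms)

# Assuming a hypothetical 2 ms average per transfer:
full = max_tx_per_block(2.0, 2)  # 2 s budget
half = max_tx_per_block(1.0, 2)  # 1 s budget (the workaround)
```

So at the same block rate, the 1s budget carries half the throughput of the 2s budget, which is why this can only be a temporary measure.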

@gautamdhameja

Update: A fix was deployed on westend and a few rounds of testing were done. The network seems more stable after the fix and the loop issue is not seen anymore. Transactions still go into a retracted status, but that is due to the tx pool implementation and will be addressed via the new tx pool implementation. Next steps: testing to be done with the new tx pool.
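Because a transaction can move to a retracted status and later be re-included in another block, client code should treat retraction as transient rather than terminal. A sketch of that folding logic, with status names modeled on Substrate's transaction status events; the helper itself is hypothetical:

```python
# Terminal statuses: once one of these arrives, the tx's fate is decided.
TERMINAL = {"Finalized", "FinalityTimeout", "Usurped", "Dropped", "Invalid"}

def resolve_status(events):
    """Fold a stream of status events into an outcome, treating Retracted
    as transient: the tx may later be re-included in another block."""
    last = None
    for status in events:
        if status in TERMINAL:
            return status
        last = status  # a later InBlock simply supersedes a Retracted
    return last  # stream ended without a terminal status

resolve_status(["Ready", "InBlock", "Retracted", "InBlock", "Finalized"])
# -> "Finalized"
```

A watcher built this way does not report failure on retraction; it keeps waiting until a terminal status (or a timeout of its own) arrives.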

@gautamdhameja

Update: This PR fixes one part of the issue, a performance problem with events - paritytech/polkadot-sdk#1223

The other part of the issue is related to the tx pool implementation, which is being addressed here - paritytech/polkadot-sdk#4639

Both of these fixes are currently deployed on the westend-mythos network, but they still have some rough edges to address before we can call this completely fixed. Sometimes transactions get stuck in the pool. The core dev team is looking into it.

With these updates, the issues with transactions going into a loop and getting into retracted status seem to be resolved. However, the following is still being observed:

  • There are still a lot of empty blocks even in the new pool
  • The block times are not stable

bkchr commented Jun 6, 2024

  • There are still a lot of empty blocks even in the new pool

Yeah, that is related to a missing optimization that I already briefly discussed with @michalkucharczyk.

  • The block times are not stable

We looked into this as well. It looks like the Kagome validator (testing of an alternative Polkadot node implementation) is involved when the parachain is not progressing. I will discuss disabling them for a moment so we can test again!


michalkucharczyk commented Jun 17, 2024

Sometimes transactions are getting stuck in the pool.

This seems to be fixed. I was not able to reproduce it with token-dropper-tx-prepare/token-dropper-send-txs utils (~70 iterations during weekend).
Fix is on the following branches:


gautamdhameja commented Jun 17, 2024

Thanks, @michalkucharczyk. That is really good progress.

I think, in this case, we should only call something fixed when it is merged into the SDK main branch. We can't run a live parachain on unmerged branches. But we can keep this issue updated with the progress made.

@michalkucharczyk

Thanks, @michalkucharczyk. That is really good progress.

I think, in this case, we should only call something fixed when it is merged in the SDK main branch. We can't run a live parachain using unmerged branches. But we can keep this issue updated with the progress made.

Yeah, sure. I meant fixed in new-txpool, which is not yet merged.


michalkucharczyk commented Jun 19, 2024

There are still a lot of empty blocks even in the new pool

I took a look into the problem of empty blocks.

Empty blocks occur in two cases:

  • block-builder builds upon a block that is not at the tip of the fork. Sample log (BLOCK53f02):
 substrate: [Parachain] 🏆 Imported #51 (BLOCK50f01 → BLOCK51f01)
 txpool: [Parachain] insert_new_view: retracted_views: [BLOCK46f01, BLOCK49f01, BLOCK47f01, BLOCK48f01, BLOCK50f01]
 txpool: [Parachain] maintain: txs:(0, 100) views:[1;[(51, 46, 0)]] event:NewBestBlock { hash: BLOCK51f01, tree_route: None }  took:14.434356ms
 txpool: [Parachain] finalize_route: retracted_views: [BLOCK49f01, BLOCK47f01, BLOCK48f01, BLOCK50f01]
 txpool: [Parachain] maintain: txs:(0, 91) views:[1;[(51, 46, 0)]] event:Finalized { hash: BLOCK46f01, tree_route: [] }  took:1.515293ms
 basic-authorship: [Parachain] Attempting to push transactions from the pool at BLOCK51f01.
 substrate: [Parachain] 🏆 Imported #52 (BLOCK51f01 → BLOCK52f02)
 txpool: [Parachain] insert_new_view: retracted_views: [BLOCK51f01, BLOCK49f01, BLOCK47f01, BLOCK48f01, BLOCK50f01]
 txpool: [Parachain] maintain: txs:(0, 100) views:[1;[(52, 46, 0)]] event:NewBestBlock { hash: BLOCK52f02, tree_route: None }  took:14.365586ms
 txpool: [Parachain] finalize_route: retracted_views: [BLOCK51f01, BLOCK49f01, BLOCK48f01, BLOCK50f01]
 txpool: [Parachain] maintain: txs:(0, 91) views:[1;[(52, 46, 0)]] event:Finalized { hash: BLOCK47f01, tree_route: [] }  took:603.967µs
 substrate: [Parachain] 🏆 Imported #53 (BLOCK52f02 → BLOCK53f02)
 txpool: [Parachain] insert_new_view: retracted_views: [BLOCK52f02, BLOCK51f01, BLOCK49f01, BLOCK48f01, BLOCK50f01]
 txpool: [Parachain] maintain: txs:(0, 100) views:[1;[(53, 46, 0)]] event:NewBestBlock { hash: BLOCK53f02, tree_route: None }  took:14.812874ms
 txpool: [Parachain] finalize_route: retracted_views: [BLOCK52f02, BLOCK51f01, BLOCK49f01, BLOCK50f01]
 txpool: [Parachain] maintain: txs:(0, 91) views:[1;[(53, 46, 0)]] event:Finalized { hash: BLOCK48f01, tree_route: [] }  took:1.298803ms
 basic-authorship: [Parachain] Attempting to push transactions from the pool at BLOCK53f02.
 substrate: [Parachain] 🏆 Imported #54 (BLOCK53f02 → BLOCK54f01)
 txpool: [Parachain] insert_new_view: retracted_views: [BLOCK52f02, BLOCK51f01, BLOCK49f01, BLOCK53f02, BLOCK50f01]
 txpool: [Parachain] maintain: txs:(0, 100) views:[1;[(54, 46, 0)]] event:NewBestBlock { hash: BLOCK54f01, tree_route: None }  took:13.6745ms
 txpool: [Parachain] fatp::ready_at BLOCK53f02 (retracted:true)
 basic-authorship: [Parachain] Attempting to push transactions from the pool at BLOCK53f02.

This can be easily handled, and the fix is already on the branch.
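The first case can be modeled with a toy version of the pool's per-fork views: when ready_at is asked about a block whose view was already retracted, the pool can fall back to that fork's retained view instead of producing an empty block. This is a deliberate simplification of the actual fork-aware pool, with illustrative names:

```python
class ToyViewStore:
    """Toy model of the fork-aware pool: one 'view' (a list of ready
    transactions) per block hash."""

    def __init__(self):
        self._views = {}      # block hash -> active view
        self._retracted = {}  # block hash -> view retained after retraction

    def insert_view(self, block_hash, ready_txs):
        self._views[block_hash] = ready_txs

    def retract_view(self, block_hash):
        # Keep the retracted view around instead of discarding it.
        view = self._views.pop(block_hash, None)
        if view is not None:
            self._retracted[block_hash] = view

    def ready_at(self, block_hash):
        # Preferred path: an active view exists for this block.
        if block_hash in self._views:
            return self._views[block_hash]
        # Fallback: the block-builder asked about a retracted fork tip
        # (as in the log above), so reuse that fork's retained view
        # rather than returning nothing and building an empty block.
        return self._retracted.get(block_hash, [])
```

In the log above, the builder calls ready_at for BLOCK53f02 after its view was retracted; with the fallback, that call still yields the fork's ready transactions.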

  • block-builder builds upon a block that was not notified to the tx-pool. This seems to happen when a new relay-chain epoch starts. Sample log (BLOCK100f02):
 substrate: [Parachain] 🏆 Imported #100 (BLOCK99f02 → BLOCK100f01)
 txpool: [Parachain] insert_new_view: retracted_views: [BLOCK99f02, BLOCK96f01, BLOCK97f01, BLOCK95f02, BLOCK98f01]
 txpool: [Parachain] maintain: txs:(0, 100) views:[1;[(100, 58, 0)]] event:NewBestBlock { hash: BLOCK100f01, tree_route: None }  took:13.878662ms
 txpool: [Parachain] finalize_route: retracted_views: [BLOCK99f02, BLOCK96f01, BLOCK97f01, BLOCK98f01]
 txpool: [Parachain] maintain: txs:(0, 93) views:[1;[(100, 58, 0)]] event:Finalized { hash: BLOCK95f02, tree_route: [] }  took:822.076µs
 substrate: [Parachain] 🏆 Imported #101 (BLOCK100f01 → BLOCK101f02)
 txpool: [Parachain] insert_new_view: retracted_views: [BLOCK99f02, BLOCK96f01, BLOCK97f01, BLOCK100f01, BLOCK98f01]
 txpool: [Parachain] maintain: txs:(0, 100) views:[1;[(101, 58, 0)]] event:NewBestBlock { hash: BLOCK101f02, tree_route: None }  took:13.561724ms
 txpool: [Parachain] finalize_route: retracted_views: [BLOCK99f02, BLOCK97f01, BLOCK100f01, BLOCK98f01]
 txpool: [Parachain] maintain: txs:(0, 93) views:[1;[(101, 58, 0)]] event:Finalized { hash: BLOCK96f01, tree_route: [] }  took:877.636µs
 babe: [Relaychain] 👶 New epoch 2 launching at block BLOCK201 (block slot 286454547 >= start slot 286454547).
 txpool: [Parachain] fatp::ready_at BLOCK99f02 (retracted:true)
 basic-authorship: [Parachain] Attempting to push transactions from the pool at BLOCK99f02.
 substrate: [Parachain] 🆕 Imported #100 (BLOCK99f02 → BLOCK100f02)
 sync: [Parachain] Reannouncing block BLOCK100f02 is_best: false
 txpool: [Parachain] fatp::ready_at BLOCK100f02 pending keys: [SERI, BLOCK1f02, BLOCK100f02]
 basic-authorship: [Parachain] Timeout fired waiting for transaction pool at block #100 (BLOCK100f02). Proceeding with production.
 substrate: [Parachain] 🆕 Imported #101 (BLOCK100f02 → BLOCK101f03)

I see two options here:

  • notify all imported blocks to the tx-pool,
  • implement a hack: for an unknown block, return some set of transactions from the older txpool views (which I don't really like, because we will have stale transactions in the log, which may be misleading during log analysis). But we would need this either way as the fallback for all other potential cases. [edit: I briefly discussed this with Basti; we can filter out the transactions that are in blocks not known to the txpool, and implement something like a light-maintain process in ready_at.]
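The light-maintain idea from the edit above can be sketched as a pure filtering step: start from an older view's ready set and drop the transactions already included in the blocks on the path to the unknown block. Names are illustrative; the real implementation lives in the fork-aware pool:

```python
def light_ready_at(older_view, included_on_path):
    """'Light maintain' sketch for a block unknown to the pool: reuse an
    older view's ready transactions, minus those already included in the
    blocks between that view and the unknown block."""
    included = set(included_on_path)
    return [tx for tx in older_view if tx not in included]

# Older view had a, b, c ready; block(s) on the path already included b.
light_ready_at(["a", "b", "c"], ["b"])  # -> ["a", "c"]
```

Compared with returning the stale view unfiltered, this avoids re-proposing transactions that are already on chain, at the cost of scanning the bodies of the intermediate blocks.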
