fix: resolve indexer infinite loop #82

pirtleshell · 2024-10-21T21:03:17Z

on very slow drives or when run with limited resources, a node can have a delay between the block being saved and the block_results being saved. if the block exists but the block_results do not, an infinite loop occurs: the indexer repeatedly requests the block and block_results until both exist. the lack of delay between requests can further constrain the node's resources and result in many calls for block_results before they are committed.

this commit extends the wait condition to cover any error that occurs during indexing. previously, if the indexer failed to find the block_results, it bombarded the node with requests for them without backing off. with this change, errors trigger a wait. after waiting for either a new block or the timeout, the block results are more likely to exist.

pirtleshell · 2024-10-21T21:33:59Z

i discovered this bug when attempting to sync a node that had a very very slow drive (unwarmed & from a snapshot). after the chain service started, the node output 1000s of errors like

ERR failed to fetch block result err="could not find results for height #12268540" height=12268541 indexer=evm module=server

i first confirmed my understanding... stopping the node & starting it again, i saw block 12268540 sync, and then 1000s of the same error occurred, but for block 12268541. this means that after a little time, the block results manage to get committed. the problem is just that the indexer assumes the block results always exist if the block does. however, cometbft does not store the block and its results at the same time. it saves the new block from peers and recognizes that as a new height, then the app processes the block and passes back the results to be saved.

when the drive is slow (the save of block results has high latency) or the node has limited resources (the app's processing has high latency), there is a window of time in which the block is saved but the results are not.

before this commit, the evm indexer would inundate the node with requests that fail (further limiting the resources the app has to process the block & for cometbft to save the results). now, if an error occurs during indexing, the indexer will wait either until a new block is received (at which point the previous block's results are guaranteed to be saved) or until a timeout elapses (1 minute).

the already-existing wait-for-new-blocks loop is used for the error condition
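
to illustrate the behavior described above, here is a minimal, self-contained sketch of an indexing loop that waits on error. it is not the ethermint code: `fakeNode`, `blockResults`, `indexUpTo`, and `errResultsNotFound` are hypothetical names made up for this example, and the short timeout stands in for the real 1 minute one.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errResultsNotFound is a hypothetical stand-in for the
// "could not find results for height" error seen in the logs.
var errResultsNotFound = errors.New("could not find results for height")

// fakeNode is a hypothetical stand-in for the node being indexed.
// It fails the first failuresLeft block_results requests, simulating the
// window in which a block is saved but its results are not.
type fakeNode struct {
	latest       int64 // latest block height known to the node
	failuresLeft int   // number of requests that will still fail
	calls        int   // total block_results requests received
}

func (n *fakeNode) blockResults(height int64) error {
	n.calls++
	if n.failuresLeft > 0 {
		n.failuresLeft--
		return fmt.Errorf("%w #%d", errResultsNotFound, height)
	}
	return nil
}

// indexUpTo indexes every height after last up to the node's latest height.
// On error it waits for a new-block signal or the timeout before retrying,
// instead of immediately requesting the same results again.
func indexUpTo(n *fakeNode, last int64, newBlock <-chan struct{}, timeout time.Duration) (int64, int) {
	waits := 0
	for last < n.latest {
		if err := n.blockResults(last + 1); err != nil {
			waits++
			select {
			case <-newBlock: // a new block arrived; earlier results should be saved
			case <-time.After(timeout): // give up waiting and retry
			}
			continue
		}
		last++
	}
	return last, waits
}

func main() {
	node := &fakeNode{latest: 3, failuresLeft: 1}
	last, waits := indexUpTo(node, 0, make(chan struct{}), 10*time.Millisecond)
	fmt.Println(last, waits, node.calls)
}
```

in this sketch a single missing result costs one failed request and one wait, rather than an unbounded stream of failing requests.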

i installed this on the box i was experiencing the problem on. instead of infinitely looping the erroring queries, it failed once, waited, and then continued successfully from that point forward

nddeluca

Thank you for this fix! Could we upstream to main branch as well?

pirtleshell requested review from DracoLi, nddeluca, rhuairahrighairidh, drklee3, evgeniy-scherbina and karzak October 21, 2024 21:14

use non-generic error variable name

105f1cf

nddeluca approved these changes Oct 22, 2024

evgeniy-scherbina approved these changes Oct 22, 2024

pirtleshell marked this pull request as ready for review October 22, 2024 20:03

pirtleshell merged commit e3cbae3 into kava/release/v0.26.x Oct 22, 2024
15 checks passed

pirtleshell deleted the rp-indexer-infinte-loop branch October 22, 2024 20:23

pirtleshell mentioned this pull request Oct 22, 2024

deps: bump ethermint to fix indexer infinite loop Kava-Labs/kava#2038

Merged

1 task
