bug(sequence_test): Data race condition when committing BaseApp.setState in gosdk test #1903

Open · Unique-Divine opened this issue May 31, 2024 · 2 comments · Fixed by #1904 or #1913

Unique-Divine commented May 31, 2024

Summary

A data race was slipping past some CI checks and causing inconsistent results
across otherwise successful runs. It occurred in the following test, which
broadcasts many transactions in rapid succession and expects each to succeed
and be queryable by tx hash.

Reading the error logs, I found that:

  1. The write operation flagged by the race detector is the setState method of
    the BaseApp from the Cosmos-SDK.
  2. The call stack shows that the write happens during the commit phase of the
    blockchain consensus process.
  3. The conflicting read occurs in the CreateQueryContext method of the BaseApp.
  4. The call stack indicates the racing read happens while the BaseApp is
    handling a gRPC query request for the "x/auth" module.
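
To make the failure mode concrete, below is a minimal, hypothetical Go sketch of the same access pattern (this is not BaseApp code; the type and method names only mirror the ones named above). One goroutine swaps a shared state pointer the way setState does during Commit, while another dereferences it the way CreateQueryContext does while serving a gRPC query. Running it with go test -race reports the unsynchronized write/read pair.

package racesketch

import (
	"sync"
	"testing"
)

// state stands in for the per-mode state that the real BaseApp caches.
type state struct{ height int64 }

// app stands in for BaseApp: checkState is written on commit and read by queries.
type app struct {
	checkState *state
}

// setState mirrors the writer side: Commit installs a fresh state.
func (a *app) setState(h int64) {
	a.checkState = &state{height: h}
}

// createQueryContext mirrors the reader side: a query handler dereferences
// the current state without holding any lock.
func (a *app) createQueryContext() int64 {
	return a.checkState.height
}

// TestRaceSketch triggers the write/read overlap; run with `go test -race`.
func TestRaceSketch(t *testing.T) {
	a := &app{checkState: &state{}}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // "consensus" goroutine committing blocks
		defer wg.Done()
		for h := int64(1); h <= 1000; h++ {
			a.setState(h)
		}
	}()
	go func() { // "gRPC" goroutine serving queries concurrently
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			_ = a.createQueryContext()
		}
	}()
	wg.Wait()
}

The usual remedies for this shape of race are to guard the shared pointer with a sync.RWMutex or to hand each query its own immutable snapshot of the state.
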
The failing test:
package gosdk_test

import (
	"encoding/json"

	"github.com/MakeNowJust/heredoc/v2"
	cmtcoretypes "github.com/cometbft/cometbft/rpc/core/types"

	"github.com/NibiruChain/nibiru/gosdk"
)

// TestSequenceExpectations validates the behavior of account sequence numbers
// and transaction finalization in a blockchain network. It ensures that sequence
// numbers increment correctly with each transaction and that transactions can be
// queried successfully after the blocks are completed.
func (s *TestSuite) TestSequenceExpectations() {
	t := s.T()
	t.Log("Get sequence and block")
	// Go to next block
	_, err := s.network.WaitForNextBlockVerbose()
	s.Require().NoError(err)

	accAddr := s.val.Address
	getLatestAccNums := func() gosdk.AccountNumbers {
		accNums, err := s.nibiruSdk.GetAccountNumbers(accAddr.String())
		s.NoError(err)
		return accNums
	}
	seq := getLatestAccNums().Sequence

	t.Logf("starting sequence %v should not change from waiting a block", seq)
	s.NoError(s.network.WaitForNextBlock())
	newSeq := getLatestAccNums().Sequence
	s.EqualValues(seq, newSeq)

	t.Log("broadcast msg n times, expect sequence += n")
	numTxs := uint64(5)
	seqs := []uint64{}
	txResults := make(map[string]*cmtcoretypes.ResultTx)
	for broadcastCount := uint64(0); broadcastCount < numTxs; broadcastCount++ {
		s.NoError(s.network.WaitForNextBlock()) // Ensure block increment

		from, _, _, msgSend := s.msgSendVars()
		txResp, err := s.nibiruSdk.BroadcastMsgsGrpcWithSeq(
			from,
			seq+broadcastCount,
			msgSend,
		)
		s.NoError(err)
		txHashHex := s.AssertTxResponseSuccess(txResp)

		s.T().Log(heredoc.Docf(
			`Query for tx %v should fail b/c it's the same block and finalization 
			cannot have possibly occurred yet.`, broadcastCount))
		txResult, err := s.nibiruSdk.TxByHash(txHashHex)
		jsonBz, _ := json.MarshalIndent(txResp, "", "  ")
		s.Assert().Errorf(err, "txResp: %s", jsonBz)

		txResults[txHashHex] = txResult
		seqs = append(seqs, getLatestAccNums().Sequence)
	}

	s.T().Log("expect sequence += n")
	newNewSeq := getLatestAccNums().Sequence
	txResultsJson, _ := json.MarshalIndent(txResults, "", "  ")
	s.EqualValuesf(int(seq+numTxs-1), int(newNewSeq), "seqs: %v\ntxResults: %s", seqs, txResultsJson)

	s.T().Log("After the blocks are completed, tx queries by hash should work.")
	for times := 0; times < 2; times++ {
		s.NoError(s.network.WaitForNextBlock())
	}

	s.T().Log("Query each tx by hash (successfully)")
	for txHashHex := range txResults {
		_, err := s.nibiruSdk.TxByHash(txHashHex)
		s.NoError(err)
	}
}

Ref: Failed GitHub Action on 2024-05-31
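
To try to reproduce this locally, the suite can be run under Go's race detector. The package path below is an assumption based on the test's gosdk_test package; adjust as needed:

go test -race -count=1 -v ./gosdk/...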

Unique-Divine commented May 31, 2024

2024-05-31: Some findings on commit e7e708d

Looking through the commit history, the problem seems to surface after the following version changes:

  • Cosmos-SDK v0.47.10 -> v0.47.11
  • CometBFT v0.37.4 -> v0.37.5

Based purely on this, I tried an experiment where I ran:

just clean-cache && go mod tidy
just test-coverage-integration 2>&1
# and pipe the output to Neovim to read/debug

Test Results

Note that these results were inconclusive: the race condition does not appear to be tied to the version increments of the Cosmos-SDK or CometBFT alone. I re-ran the matrix below 3 times with a fresh cache each time and got inconsistent results.

  • ✅ cosmos-sdk v0.47.10, cometbft v0.37.4
  • ❌ cosmos-sdk v0.47.11, cometbft v0.37.5
  • ❌ cosmos-sdk v0.47.10, cometbft v0.37.5
  • ❌ cosmos-sdk v0.47.11, cometbft v0.37.4 (using require)
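
For reference, a given cell of this matrix can be pinned by editing the module requirements before re-running the integration tests. The module paths below are the upstream ones; the exact flow in this repo (e.g. any existing replace directives in go.mod) may differ:

# Example: pin the cosmos-sdk v0.47.10 + cometbft v0.37.5 cell
go mod edit -require=github.com/cosmos/cosmos-sdk@v0.47.10
go mod edit -require=github.com/cometbft/cometbft@v0.37.5
go mod tidy
just clean-cache && just test-coverage-integration 2>&1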

Unique-Divine commented

This is still happening pretty frequently.

Example: https://github.com/NibiruChain/nibiru/actions/runs/9425664163/job/25967637912
