Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

found and fixed a bug in raft integration tests #4596

Merged
merged 1 commit into from
Jan 31, 2024

Conversation

pfi79
Copy link
Contributor

@pfi79 pfi79 commented Jan 8, 2024

Often fails this test

Failure occurs when waiting for the orderer to delete a channel

Here is an explanation: why they added the removal of the channel with orderer

I'll describe what's going on:

  • the orderer is in a constant cycle periodically performing the function
  • send a command to remove channel. It passes the following functions serveRemove, RemoveChannel (r.lock monopole is taken in this function), Halt (the actual removal of the channel is expected here)
  • let's return to the main loop, this is where the actual deletion should take place, it wants to execute the function SwitchFollowerToChain, but she can't, she needs a r.lock monopole.

==================================

The error in the raft tests is due to the struggle for monopoly control of r.lock in the SwitchChainToFollower and RemoveChannel functions.

RemoveChannel grabs monopoly control and waits for the channel to terminate. During termination, SwitchChainToFollower tries to grab monopoly control and fails.

Here is my variant of solving the error. If someone suggests a more correct and reddish way, it would be great.

@pfi79 pfi79 force-pushed the fix-raft-error branch 9 times, most recently from e18f9b6 to c9b660d Compare January 8, 2024 21:00
@pfi79 pfi79 marked this pull request as ready for review January 8, 2024 21:01
@pfi79 pfi79 requested a review from a team as a code owner January 8, 2024 21:01
@pfi79 pfi79 force-pushed the fix-raft-error branch 4 times, most recently from dc4ebca to e90788b Compare January 10, 2024 05:34
@yacovm
Copy link
Contributor

yacovm commented Jan 12, 2024

can you perhaps write a test that reproduces this problem?

@yacovm
Copy link
Contributor

yacovm commented Jan 12, 2024

@tock-ibm do you want to take a look at this?

@pfi79
Copy link
Contributor Author

pfi79 commented Jan 12, 2024

can you perhaps write a test that reproduces this problem?

I was wondering how to do it in a manageable way. I couldn't do it with integration tests. It's a complete asynchronous interaction.
Ideally, I need to suspend the peer in a certain place, send a command to remove the channel and continue the peer.

Here is a test test often crashes. For the last 2-3 months all errors in raft integration tests are the same error.

For example:

https://github.com/hyperledger/fabric/actions/runs/7493668256/job/20399967195
https://github.com/hyperledger/fabric/actions/runs/7483402876/job/20368656381
https://github.com/hyperledger/fabric/actions/runs/7462441029/job/20304890927

@pfi79
Copy link
Contributor Author

pfi79 commented Jan 12, 2024

My solution works, but it's not pretty, I don't like it. But it fixes the error.

@pfi79
Copy link
Contributor Author

pfi79 commented Jan 12, 2024

Here is an example of logs with normal operation

[e][OrdererOrg.orderer1] 2024-01-11 19:50:21.851 UTC 00ac INFO [orderer.common.follower] pullAfterJoin -> Pulling after join channel=testchannel
[e][OrdererOrg.orderer1] 2024-01-11 19:50:21.855 UTC 00ad INFO [orderer.common.follower] pullAfterJoin -> Pulled after join channel=testchannel
[e][OrdererOrg.orderer1] 2024-01-11 19:50:21.855 UTC 00ae INFO [orderer.common.follower] pull -> Block pulling finished successfully, going to switch from follower to a consensus.Chain channel=testchannel
[e][OrdererOrg.orderer1] 2024-01-11 19:50:21.855 UTC 00af INFO [orderer.common.follower] halt -> Stopped channel=testchannel

Here are the logs at the time of the error. After that orderer1 does nothing else. Channel removal is in progress (halt -> Stopped)

[e][OrdererOrg.orderer1] 2024-01-11 19:50:22.351 UTC 00ca INFO [orderer.common.follower] pullAfterJoin -> Pulling after join channel=testchannel
STEP: Removing channel from the first orderer 01/11/24 19:50:22.352
[e][OrdererOrg.orderer1] 2024-01-11 19:50:22.354 UTC 00cb INFO [orderer.common.follower] halt -> Stopped channel=testchannel
[e][OrdererOrg.orderer1] 2024-01-11 19:50:22.356 UTC 00cc INFO [orderer.common.follower] pullAfterJoin -> Pulled after join channel=testchannel
[e][OrdererOrg.orderer1] 2024-01-11 19:50:22.356 UTC 00cd INFO [orderer.common.follower] pull -> Block pulling finished successfully, going to switch from follower to a consensus.Chain channel=testchannel

@pfi79 pfi79 force-pushed the fix-raft-error branch 3 times, most recently from 24ddb80 to 88e1110 Compare January 14, 2024 21:26
@pfi79 pfi79 changed the title find and fix error found and fixed a bug in raft integration tests Jan 15, 2024
orderer/common/follower/follower_chain.go Outdated Show resolved Hide resolved
orderer/common/follower/follower_chain.go Outdated Show resolved Hide resolved
@pfi79 pfi79 force-pushed the fix-raft-error branch 2 times, most recently from e3e60a5 to 4062018 Compare January 22, 2024 08:18
Signed-off-by: Fedor Partanskiy <pfi79@mail.ru>
@C0rWin C0rWin merged commit 17e0a83 into hyperledger:main Jan 31, 2024
14 checks passed
@pfi79 pfi79 deleted the fix-raft-error branch January 31, 2024 20:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants