-
Notifications
You must be signed in to change notification settings - Fork 8.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix inconsistent state between WAL and saved Snapshot #3584
base: release-2.2
Are you sure you want to change the base?
Conversation
@@ -80,7 +80,7 @@ func TestOpenWAL(t *testing.T) { | |||
for i := 0; i < 10; i++ { | |||
store.Store( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand how this test demonstrates the problem you describe.
Can you reproduce the problem in a unit test?
@@ -395,3 +395,82 @@ func TestApplyOutOfDateSnapshot(t *testing.T) { | |||
assertFileCount(t, 12, 1) | |||
}) | |||
} | |||
|
|||
func TestAbortWhenWritingSnapshot(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I add the unit test to reproduce the problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks but I meant a unit test that uses an instance of a Fabric Raft chain, in orderer/consensus/etcdraft/chain.go
, not a unit test that uses pure etcd.io/raft packages.
We need to be sure that a Fabric Raft chain instance can encounter the bespoken problem, so that:
- We know we really have a problem, because it might be that Fabric sidesteps this problem via some mechanism and etcd has this problem.
- If the problem occurs in a later point due to a code change, a unit test will notify us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem can be reproduced with the following steps:
-
Orderer A has not participated in consensus for a while.
-
Other orderers generate new blocks in consensus.
-
Other orderers generate a new snapshot.
-
Orderer A back to normal and receives the new snapshot from other orderers.
-
Orderer A persists the new snapshot but crashes before calling
rs.saveSnap(snapshot)
. -
Orderer A restarts.
@Param-S @jiangyaoguo what is your opinion on this? |
What's the progress for this PR? |
if err := rs.wal.Save(hardstate, entries); err != nil { | ||
return err | ||
} | ||
|
||
if !raft.IsEmptySnap(snapshot) { | ||
if err := rs.saveSnap(snapshot); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, we need to swap the order of writing the snapshot entries and snapshot file(needs change in saveSnap) as it is done in etcdserver https://github.com/etcd-io/etcd/blob/6c2f5dc78af6b6970d48cecaac515c58a91efca8/server/storage/storage.go#L66
Able to recreate the issue by terminating the orderer process just after saving walsnap entries and before saving snap file. func (rs *RaftStorage) saveSnap(snap raftpb.Snapshot) error {
terminate the process here
I will continue to go through the PR changes more and update here. |
What's the progress? @Param-S |
I could not spend time on this last week. I will work on this next couple of days & confirm. |
What's the progress for this PR? |
I'm sorry, but we're just too busy to review it. |
@Mergifyio rebase |
Signed-off-by: zghh <1069308575@qq.com>
…it test to reproduce the problem. Signed-off-by: zghh <1069308575@qq.com>
✅ Branch has been successfully rebased |
Trying to get checks re-triggered... let me try to Close and Re-open. |
What's the progress for this PR? @Param-S |
This one has been stale for some time, what is the status of it? A few questions that may help to clarify the severity: Why was this opened against release-2.2 instead of main branch? Is the problem resolved on main branch and release-2.5 already given the upgrade of etcdraft in those branches? What is the overall impact after the problem occurs? The Description says a "panic error occurs after the restart". Will the panic occur after every subsequent restart? Is there any resolution of the problem, or the orderer node must be abandoned and a new orderer node created to replace it? Has this problem been observed in practice? |
Type of change
Description
It is a bug on etcd.
When an orderer has not participated in the consensus for some time and crashes during the process of writing the latest snapshot to the file, a panic error occurs after the restart.
This bug has been fixed in etcd, but not in fabric.
This PR fixes it into the fabric.