Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(resharding): Skip old ShardUIds in get_new_flat_storage_head() #12505

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

marcelo-gonzalez
Copy link
Contributor

This function finds the chunk in the last final block by looking for the one with the same shard ID as one of the shards in the current block's epoch. But if there is a skipped chunk in a child shard in the first block after a resharding, then the old chunk in the child's shard_index will be the parent's chunk from the previous epoch. Then this shard ID comparison doesn't work and we get an error. Fix it by comparing the current ShardUID to the ShardUID of the chunk, and returning None if they don't match.

This function finds the chunk in the last final block by looking for
the one with the same shard ID as one of the shards in the current
block's epoch. But if there is a skipped chunk in a child shard in the
first block after a resharding, then the old chunk in the child's
shard_index will be the parent's chunk from the previous epoch.  Then
this shard ID comparison doesn't work and we get an error. Fix it by
comparing the current ShardUID to the ShardUID of the chunk, and
returning None if they don't match.
@marcelo-gonzalez
Copy link
Contributor Author

This can be triggered by cargo test -p integration-tests --features nightly -- --exact --nocapture test_loop::tests::resharding_v3::test_resharding_v3_shard_shuffling if you run it with this diff:

diff --git a/chain/chain/src/chain.rs b/chain/chain/src/chain.rs
index e949fdbb5..b62312fd2 100644
--- a/chain/chain/src/chain.rs
+++ b/chain/chain/src/chain.rs
@@ -791,27 +791,7 @@ impl Chain {
             me,
             &prev_hash,
         )?;
-        let prev_block = self.get_block(&prev_hash)?;
 
-        if prev_block.chunks().len() != epoch_first_block.chunks().len()
-            && !shards_to_state_sync.is_empty()
-        {
-            // Currently, the state sync algorithm assumes that the number of chunks do not change
-            // between the epoch being synced to and the last epoch.
-            // For example, if shard layout changes at the beginning of epoch T, validators
-            // will not be able to sync states at epoch T for epoch T+1
-            // Fortunately, since all validators track all shards for now, this error will not be
-            // triggered in live yet
-            // Instead of propagating the error, we simply log the error here because the error
-            // do not affect processing blocks for this epoch. However, when the next epoch comes,
-            // the validator will not have the states ready so it will halt.
-            error!(
-                "Cannot download states for epoch {:?} because sharding just changed. I'm {:?}",
-                epoch_first_block.header().epoch_id(),
-                me
-            );
-            debug_assert!(false);
-        }
         if shards_to_state_sync.is_empty() {
             Ok(None)
         } else {
diff --git a/integration-tests/src/test_loop/tests/resharding_v3.rs b/integration-tests/src/test_loop/tests/resharding_v3.rs
index 537672fc0..1376f3ca0 100644
--- a/integration-tests/src/test_loop/tests/resharding_v3.rs
+++ b/integration-tests/src/test_loop/tests/resharding_v3.rs
@@ -604,7 +604,6 @@ fn test_resharding_v3_double_sign_resharding_block() {
 
 // TODO(resharding): fix nearcore and un-ignore this test
 #[test]
-#[ignore]
 fn test_resharding_v3_shard_shuffling() {
     let params = TestReshardingParameters::new()
         .shuffle_shard_assignment()

Then if you grep for ERROR in the logs you'll see a lot of this:

ERROR catchup:run_catchup: near_chain::chain: Error processing block during catch up, retrying

The error there isn't printed, but this is what's causing it. Then if you run the same test with the same diff on top of this PR, the test will pass (although it still fails with some scary Transaction overwrites itself: Store Update error if you change the run_until success_condition to return true after the epoch height == 6 :/)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant