Flake e2e: ACL Logging for NetworkPolicy when the namespace's ACL logging annotation is updated #4392

flavio-fernandes · 2024-05-24T13:53:24Z

Which jobs are flaking?

ACL Logging for NetworkPolicy when the namespace's ACL logging annotation is updated

[It] the ACL logs are updated accordingly

.../ovn-kubernetes/test/e2e/acl_logging.go:121

Which tests are flaking?

• [FAILED] [21.581 seconds]
ACL Logging for NetworkPolicy when the namespace's ACL logging annotation is updated [It] the ACL logs are updated accordingly
/home/vagrant/ovn-kubernetes/test/e2e/acl_logging.go:121

  [FAILED] May 24 06:09:25.765: Timed out after 15.000s.
  Expected
      <bool>: false
  to be true
  In [It] at: /home/vagrant/ovn-kubernetes/test/e2e/acl_logging.go:130 @ 05/24/24 06:09:25.765

Since when has it been flaking?

Sorry, I don't know.

Reason for failure (if possible)

The test does this:

ovn-kubernetes/test/e2e/acl_logging.go

Lines 104 to 130 in 49e1aea

    BeforeEach(func() {
    	By("poking some more...")
    	clientPod := pods[pokerPodIndex]
    	pokedPod := pods[pokedPodIndex]
    	framework.Logf(
    		"Poke pod %s (on node %s) from pod %s (on node %s)",
    		pokedPod.GetName(),
    		pokedPod.Spec.NodeName,
    		clientPod.GetName(),
    		clientPod.Spec.NodeName)
    	Expect(
    		pokePod(fr, clientPod.GetName(), pokedPod.Status.PodIP)).To(HaveOccurred(),
    		"traffic should be blocked since we only use a deny all traffic policy")
    })

    It("the ACL logs are updated accordingly", func() {
    	clientPodScheduledPodName := pods[pokerPodIndex].Spec.NodeName
    	composedPolicyNameRegex := fmt.Sprintf("NP:%s:%s", nsName, egressDefaultDenySuffix)
    	Eventually(func() (bool, error) {
    		return assertACLLogs(
    			clientPodScheduledPodName,
    			composedPolicyNameRegex,
    			denyACLVerdict,
    			updatedAllowACLLogSeverity)
    	}, maxPokeRetries*pokeInterval, pokeInterval).Should(BeTrue())
    })

1. It calls setNamespaceACLLogSeverity to update the namespace's ACL logging annotation.
2. It pokes the pods to generate the log.
3. It loops, waiting for the logs to be seen in ovn-controller.

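The issue is that there is no delay between setting the ACL log severity and poking. On a slow VM, OVN
may take some time to be fully configured with the new severity, and that may only happen after the poke
has already taken place. In that case the expected log entry is never generated and the wait times out.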

A proposed solution is to repeat the poke on every iteration of the wait loop, so that the expected log
entry is generated once OVN has applied the new setting.
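The race and the proposed fix can be sketched in a small, self-contained Go program. This is only an illustration, not the actual change made to acl_logging.go: fakeOVN, waitForLog, and the tick-based timing are hypothetical stand-ins for the cluster, the Eventually loop, and the poll interval.

```go
package main

import "fmt"

// fakeOVN is a hypothetical stand-in for the cluster. The updated ACL log
// severity only takes effect after ticksUntilReady polling intervals.
type fakeOVN struct {
	ticksUntilReady int
	logged          bool
}

// poke simulates the blocked traffic. It only produces the expected log
// entry if the new severity has already been applied.
func (f *fakeOVN) poke() {
	if f.ticksUntilReady <= 0 {
		f.logged = true
	}
}

// hasLog plays the role of assertACLLogs in the real test.
func (f *fakeOVN) hasLog() bool { return f.logged }

// tick simulates one polling interval elapsing.
func (f *fakeOVN) tick() {
	if f.ticksUntilReady > 0 {
		f.ticksUntilReady--
	}
}

// waitForLog polls check up to retries times, calling sleep between
// attempts. If poke is non-nil, it is invoked before every check.
func waitForLog(poke func(), check func() bool, sleep func(), retries int) bool {
	for i := 0; i < retries; i++ {
		if poke != nil {
			poke()
		}
		if check() {
			return true
		}
		sleep()
	}
	return false
}

func main() {
	// Current ordering: poke once up front, then only poll for the log.
	slow := &fakeOVN{ticksUntilReady: 3}
	slow.poke()
	fmt.Println("poke once, then wait:", waitForLog(nil, slow.hasLog, slow.tick, 10))

	// Proposed ordering: poke again on every polling iteration.
	slow = &fakeOVN{ticksUntilReady: 3}
	fmt.Println("poke while waiting:", waitForLog(slow.poke, slow.hasLog, slow.tick, 10))
}
```

In this model the first call reports false because the only poke happens before the configuration is applied, while the second reports true because a later poke lands after it. In the real test, the equivalent change would presumably move the pokePod call inside the function passed to Eventually.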

Anything else we need to know?

It is a race in the test. I have found the issue and will be making a PR for it shortly. :)

To reproduce, these are the steps I took:

# Bring up cluster using kind.sh or kind-helm.sh

# It may be interesting to open a secondary shell and look at ovn-controller log.
# This particular test creates acl_logging on ovn-worker2

$ docker exec ovn-worker2 tail -F /var/log/openvswitch/ovn-controller.log
 
# on another shell, run this test in a loop. It should get the failure after a few
# loops:

$ cd test/e2e && \
  while : ; do \
  go test -v . -ginkgo.v \
  -ginkgo.focus 'the\sACL\slogs\sare\supdated\saccordingly' \
  -ginkgo.flake-attempts 1 -provider skeleton \
  -kubeconfig ${KUBECONFIG} --num-nodes=2 || break ; \
  echo --- ; done


Fixes waiting for ACL logging in a test where the namespace's ACL logging level is updated.

To reproduce, use these steps:

cd test/e2e && \
  while : ; do \
  go test -v . -ginkgo.v \
  -ginkgo.focus 'the\sACL\slogs\sare\supdated\saccordingly' \
  -ginkgo.flake-attempts 1 -provider skeleton \
  -kubeconfig ${KUBECONFIG} --num-nodes=2 || break ; \
  echo --- ; done

Fixes: ovn-org#4392
Signed-off-by: Flavio Fernandes <ffernandes@nvidia.com>

flavio-fernandes added the kind/ci-flake Flakes seen in CI label May 24, 2024

flavio-fernandes self-assigned this May 24, 2024

flavio-fernandes added the area/e2e-testing label May 24, 2024

flavio-fernandes mentioned this issue May 24, 2024

e2e flake fix: Fixes race in e2e ACL logging #4393

Merged

girishmg closed this as completed in 0a68530 Jul 1, 2024

girishmg closed this as completed in #4393 Jul 1, 2024
