pod-to-internet-connectivity installation check sometimes failed in CI #6432

Open
tnqn opened this issue Jun 12, 2024 · 4 comments
Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test.

Comments

@tnqn
Member

tnqn commented Jun 12, 2024

Describe the bug

I saw the failure at least twice:
https://github.com/antrea-io/antrea/actions/runs/9461915489/job/26063912880?pr=6427

[kind-kind] -------------------------------------------------------------------------------------------
[kind-kind] Running test: pod-to-internet-connectivity
[kind-kind] -------------------------------------------------------------------------------------------
[kind-kind] Validating connectivity from Pod antrea-test-1m2pj/test-client-84c4bb5558-wswd6 to the world (google.com)...
[kind-kind] /agnhost command '/agnhost connect google.com:80 --timeout=3s' failed: command terminated with exit code 1
[kind-kind] /agnhost stderr: TIMEOUT
[kind-kind] Test pod-to-internet-connectivity failed: Pod antrea-test-1m2pj/test-client-84c4bb5558-wswd6 was not able to connect to google.com: command terminated with exit code 1

agnhost connect only tries to establish a TCP connection, so I'm not sure why 3s is not enough.

If it keeps failing, maybe we should consider increasing the timeout or switching to another address, such as a GitHub service.

@tnqn tnqn added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Jun 12, 2024
@antoninbas
Contributor

For what it's worth, I think I saw it fail even back when we had the timeout set to 5s instead of 3s.
We know it's not a DNS issue, as the error message would be different.
It could be some firewall on the Google side. I agree with you, api.github.com may be a good option.
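
To make the DNS-versus-timeout distinction concrete, here is a minimal sketch that buckets the stderr of agnhost connect by failure type. TIMEOUT is the string seen in the CI log above; the DNS, REFUSED and OTHER prefixes are what the agnhost README appears to document and should be treated as assumptions. classify_connect_error is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: map the stderr of `agnhost connect` to a failure class.
# Only TIMEOUT is confirmed by the CI log in this issue; the other prefixes
# are assumed from the agnhost documentation.
classify_connect_error() {
    case "$1" in
        TIMEOUT*) echo "timeout" ;;
        DNS*)     echo "dns" ;;
        REFUSED*) echo "refused" ;;
        "")       echo "none" ;;
        *)        echo "other" ;;
    esac
}

# Example: capture only stderr from a single attempt and classify it.
# err=$(/agnhost connect google.com:80 --timeout=3s 2>&1 >/dev/null)
# classify_connect_error "$err"
```

If the assumption about the prefixes holds, a resolution failure would land in the "dns" bucket rather than "timeout", which matches the reasoning that the CI failure is not a plain DNS error.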

Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 11, 2024
@Dyanngg Dyanngg removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 17, 2024
@Dyanngg
Contributor

Dyanngg commented Sep 17, 2024

I wrote a simple bash script and ran it in a local agnhost pod.

#!/bin/bash
timeout_duration=3s
iterations=50

usage() {
    echo "Usage: $0 [--timeout <duration>] [--iterations <count>]"
    echo "  --timeout <duration>  Specify the timeout for each connection attempt, e.g. 3s (default: 3s)"
    echo "  --iterations <count>  Specify the number of times to execute the command (default: 50)"
    exit 1
}

while [[ $# -gt 0 ]]; do
    case $1 in
        --timeout)
            # Require a value; otherwise "shift 2" fails and the loop never ends.
            [[ $# -ge 2 ]] || usage
            timeout_duration=$2
            shift 2
            ;;
        --iterations)
            [[ $# -ge 2 ]] || usage
            iterations=$2
            shift 2
            ;;
        *)
            usage
            ;;
    esac
done

google_failure_count=0
github_failure_count=0

for ((i=1; i<=iterations; i++))
do
    # agnhost prints TIMEOUT on stderr and exits non-zero when the
    # TCP connection is not established within the timeout.
    if ! /agnhost connect google.com:80 --timeout="$timeout_duration"; then
        ((google_failure_count++))
        echo "google.com t/o at try $i"
    fi

    if ! /agnhost connect api.github.com:80 --timeout="$timeout_duration"; then
        ((github_failure_count++))
        echo "api.github.com t/o at try $i"
    fi
done

echo "Number of google.com failures: $google_failure_count"
echo "Number of api.github.com failures: $github_failure_count"

Some outputs:

bash-5.0# ./test.sh --timeout 3s --iterations 25
google.com t/o at try 1
TIMEOUT
Number of google.com failures: 1
Number of api.github.com failures: 0

bash-5.0# ./test.sh --timeout 2s --iterations 25
Number of google.com failures: 0
Number of api.github.com failures: 0

bash-5.0# ./test.sh --timeout 1s --iterations 50
TIMEOUT
google.com t/o at try 1
TIMEOUT
google.com t/o at try 37
Number of google.com failures: 2
Number of api.github.com failures: 0

bash-5.0# ./test.sh --timeout 1s --iterations 200
Number of google.com failures: 0
Number of api.github.com failures: 0

bash-5.0# ./test.sh --timeout 1s --iterations 500
TIMEOUT
api.github.com t/o at try 126
Number of google.com failures: 0
Number of api.github.com failures: 1

bash-5.0# ./test.sh --timeout 2s --iterations 1000
TIMEOUT
google.com t/o at try 1
Number of google.com failures: 1
Number of api.github.com failures: 0

bash-5.0# ./test.sh --timeout 1s --iterations 1000
TIMEOUT
google.com t/o at try 1
TIMEOUT
api.github.com t/o at try 533
TIMEOUT
google.com t/o at try 534
TIMEOUT
api.github.com t/o at try 534
TIMEOUT
api.github.com t/o at try 739
TIMEOUT
google.com t/o at try 923
TIMEOUT
api.github.com t/o at try 923
TIMEOUT
api.github.com t/o at try 961
Number of google.com failures: 3
Number of api.github.com failures: 5

In general, api.github.com does seem to be more stable than google.com, except when the timeout is very small. However, what I clearly observed is that the first attempt to connect to google.com has a really high chance of failure, which IMO is most likely DNS-related. We could switch to api.github.com for this test, but adding a second try for agnhost connect would also likely fix this issue.
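
To separate first-attempt failures from later ones, the script output can be post-processed. The following is a minimal sketch, assuming the output format shown above ("<host> t/o at try <n>"); summarize_failures is a hypothetical helper name, not part of the original script.

```shell
#!/bin/sh
# Hypothetical helper: read the test script's output on stdin and, per host,
# count failures on the first try separately from failures on later tries.
summarize_failures() {
    awk '
        / t\/o at try / {
            host = $1          # first field is the target host
            try = $NF          # last field is the attempt number
            if (try == 1) first[host]++; else later[host]++
            seen[host] = 1
        }
        END {
            for (h in seen)
                printf "%s first_try=%d later=%d\n", h, first[h], later[h]
        }
    ' | sort
}

# Example usage (stderr carries the TIMEOUT lines, so it is discarded):
# ./test.sh --timeout 1s --iterations 1000 2>/dev/null | summarize_failures
```

Applied to the 1s / 1000-iteration run above, this reports first_try=1 later=2 for google.com and first_try=0 later=5 for api.github.com.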

@Dyanngg Dyanngg self-assigned this Sep 17, 2024
@tnqn
Member Author

tnqn commented Sep 20, 2024

We could switch to api.github.com for this test, but adding a second try for agnhost connect would also likely fix this issue

I think we should use the more stable target, as the purpose of the check is to detect connectivity issues; adding retries may hide some intermittent issues.
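
As an illustration of that approach, here is a minimal sketch of a single-attempt check against a configurable target. TARGET, CONNECT_TIMEOUT, CONNECT_CMD and check_internet_connectivity are hypothetical names used only for this example; the actual check is part of Antrea's installation checks and is not this script.

```shell
#!/bin/sh
# Sketch of a single-attempt connectivity check (all names are hypothetical).
TARGET="${TARGET:-api.github.com:80}"
CONNECT_TIMEOUT="${CONNECT_TIMEOUT:-3s}"
CONNECT_CMD="${CONNECT_CMD:-/agnhost connect}"

check_internet_connectivity() {
    # One attempt only: a failure is reported instead of being retried,
    # so intermittent connectivity problems stay visible.
    if $CONNECT_CMD "$TARGET" --timeout="$CONNECT_TIMEOUT"; then
        echo "connected to $TARGET"
    else
        echo "failed to connect to $TARGET" >&2
        return 1
    fi
}
```

Keeping the target in a variable makes it cheap to switch hosts again later without touching the check logic.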
