Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docker inspect --bootstrap ignores an error and exits successfully when one of two nodes fails #2730

Open
3 tasks done
moleus opened this issue Oct 10, 2024 · 0 comments
Open
3 tasks done
Labels
kind/bug Something isn't working status/triage

Comments

@moleus
Copy link
Contributor

moleus commented Oct 10, 2024

Contributing guidelines

I've found a bug and checked that ...

  • ... the documentation does not mention anything about my problem
  • ... there are no open or closed issues that are related to my problem

Description

In our CI we setup buildx builder with 2 nodes - arm64 and amd64 to build multi-arch docker images. CI runs docker inspect --bootstrap to check that both of them are up and ready. When one node fails (because of a timeout) the inspect command ignores it, prints the summary and completes with exit code 0.

Expected behaviour

docker inspect --bootstrap should exit with non-zero exit-code when one of the nodes fails to start. When command fails it's possible to catch a non-zero status code and retry bootstraping the nodes.

Actual behaviour

When single node fails the driver.Boot returns an error which is stored in the channel errCh. printer.Wait() never returns an error. So err is nil and len(errCh) == 1. So the last return statement is executed with true, nil. For more details see my gist with minimal reproducible example.

buildx/builder/builder.go

Lines 178 to 206 in 65c4756

baseCtx := ctx
eg, _ := errgroup.WithContext(ctx)
errCh := make(chan error, len(toBoot))
for _, idx := range toBoot {
func(idx int) {
eg.Go(func() error {
pw := progress.WithPrefix(printer, b.NodeGroup.Nodes[idx].Name, len(toBoot) > 1)
_, err := driver.Boot(ctx, baseCtx, b.nodes[idx].Driver, pw)
if err != nil {
b.nodes[idx].Err = err
errCh <- err
}
return nil
})
}(idx)
}
err = eg.Wait()
close(errCh)
err1 := printer.Wait()
if err == nil {
err = err1
}
if err == nil && len(errCh) == len(toBoot) {
return false, <-errCh
}
return true, err
}

printer.Wait():

func (p *Printer) Wait() error {
p.closeOnce.Do(func() {
close(p.status)
<-p.done
})
return p.err
}

pw.err is assigned here

pw.warnings, pw.err = d.UpdateFrom(ctx, pw.status)

but UpdateFrom never sets the error, except the ctx.Done case.

func (d Display) UpdateFrom(ctx context.Context, ch chan *client.SolveStatus) ([]client.VertexWarning, error) {

Buildx version

v0.14.0

Docker info

No response

Builders list

Configuration

Here is the link to this gist with code to reproduce bug in the (b *Builder) Boot() function:
https://gist.github.com/moleus/6bf1ad37f47dd86ed12f4c4345601e22

Build logs

+ docker buildx inspect --bootstrap
#1 [arm64-demo-jenkins-job-name internal] booting buildkit
#1 waiting for 1 pods to be ready
#1 ...
#2 [amd64-demo-jenkins-job-name internal] booting buildkit
#2 ...
#1 [arm64-demo-jenkins-job-name internal] booting buildkit
#1 waiting for 1 pods to be ready 93.3s done
#1 DONE 93.4s
#2 [amd64-demo-jenkins-job-name internal] booting buildkit
#2 waiting for 1 pods to be ready 109.5s done
#2 ERROR: expected 1 replicas to be ready, got 0
------
 > [amd64-demo-jenkins-job-name internal] booting buildkit:
------
Name:          back-demo-buildx        <- *comment* this should not be printed. Should exit with 1 
Driver:        kubernetes
Last Activity: 2024-10-10 09:56:43 +0000 UTC
Nodes:
Name:                  amd64-demo-jenkins-job-name
Endpoint:              kubernetes:///back-demo-buildx?deployment=amd64-demo-jenkins-job-name&kubeconfig=
Driver Options:        loadbalance="random" namespace="demo-namespace" ...
Status:                inactive
BuildKit daemon flags: --oci-worker-gc-keepstorage 8000 --allow-insecure-entitlement=network.host
Platforms:             linux/amd64*
Name:                  arm64-demo-jenkins-job-name
Endpoint:              kubernetes:///back-demo-buildx?deployment=arm64-demo-jenkins-job-name&kubeconfig=
Driver Options:        loadbalance="random" namespace="demo-namespace" ...
Status:                inactive
BuildKit daemon flags: --oci-worker-gc-keepstorage 8000 --allow-insecure-entitlement=network.host
Platforms:             linux/arm64*

Additional info

I suggest replacing

	if err == nil && len(errCh) == len(toBoot) {  // errCh contains only one error -> condition fails
		return false, <-errCh
	}
	return true, err  // returns true, nil

with

	if err == nil && len(errCh) > 0 {
		return false, <-errCh
	}
	return true, err  // returns true, nil

so inspect fails when one of the nodes fails to setup

@moleus moleus changed the title docker inspect --bootstrap ignores an error when one of two nodes fails and exits successfully docker inspect --bootstrap ignores an error and exits successfully when one of two nodes fails Oct 10, 2024
@tonistiigi tonistiigi added the kind/bug Something isn't working label Oct 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Something isn't working status/triage
Projects
None yet
Development

No branches or pull requests

2 participants