[BUG] Deployment error #1176

Open
rumart opened this issue Feb 26, 2024 · 31 comments
Labels
bug Something isn't working

Comments

@rumart

rumart commented Feb 26, 2024

Describe the bug
The VEBA deployment doesn't finish and throws an error when deploying the RabbitMQ cluster

To Reproduce
Steps to reproduce the behavior:
I've deployed the OVA as described in the docs
Waited for around 20 minutes, but none of the web endpoints work (Connection refused)

Expected behavior
The deployment to finish and the endpoints to work

Screenshots
Screenshot of bootstrap-debug.log

image

Version (please complete the following information):

  • VEBA Form Factor: Appliance
  • VEBA Version: v.0.8.0

Additional context
When troubleshooting I saw that the deployment stopped in what seems to be setup-05-knative.sh script.

I commented out scripts 1 through 4 in setup.sh and reran setup.sh

After a short while the script stopped with this message:
image

Checked the setup-05-knative.sh script and found that the VEBA_BOM_FILE variable is defined after it is used in the file

image

The ytt command on line 44 uses $VEBA_BOM_FILE, but the variable is first defined on line 51.

I moved that line above line 44 and reran setup.sh

Now the deployment could finish and I can access the web endpoints
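For anyone re-running setup-05-knative.sh on its own, the failure above is what a shell running with `set -u` produces when a variable is read before it is assigned. A minimal sketch of a guard follows. It is not the project's fix, and the default path is a hypothetical placeholder, since the real BOM location is not shown in this thread.

```shell
set -eu

# Hypothetical default; the real BOM path used by the appliance is not shown here.
DEFAULT_BOM_FILE="/root/config/veba-bom.json"

# Print the caller's VEBA_BOM_FILE if it is set and non-empty, else the default.
# The ${VAR:-fallback} form is safe under `set -u` even when VAR is unset.
resolve_bom_file() {
  printf '%s\n' "${VEBA_BOM_FILE:-$DEFAULT_BOM_FILE}"
}

VEBA_BOM_FILE="$(resolve_bom_file)"
export VEBA_BOM_FILE
echo "Using BOM file: ${VEBA_BOM_FILE}"
```

Placing a guard like this near the top of a script makes it runnable in isolation without changing behaviour when an earlier script has already exported the variable.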

@rumart rumart added the bug Something isn't working label Feb 26, 2024

Howdy 🖐   rumart ! Thank you for your interest in this project. We value your feedback and will respond soon.

@rumart
Author

rumart commented Feb 26, 2024

Here's a screenshot of kubectl get pods -A before re-running the setup file

image

@rguske
Contributor

rguske commented Feb 26, 2024

Hi @rumart, the VEBA_BOM_FILE variable is already set in setup-04-kubernetes.sh for the first time - HERE.
I can see on your screenshot that the installation didn't finish successfully. For example, the vmware-sources namespace is missing. We've faced this issue before, and it should actually have been fixed with #1170.
We have to dig into it.

@rumart
Author

rumart commented Feb 26, 2024

Yeah, so when I comment out setup-04 it doesn't pick up on the BOM variable, but nevertheless, since it gets defined in setup-05 could it just be moved up a bit? Or should it be removed altogether?

Thanks for looking into it

@rguske
Contributor

rguske commented Feb 26, 2024

I don't think that the issue is caused by not setting the VEBA_BOM_FILE variable. We have the suspicion that it is timing-related. Have you tried deploying it again? To what kind of environment are you deploying VEBA?

@rumart
Author

rumart commented Feb 26, 2024

I agree, the VEBA_BOM_FILE issue is because I've re-run the script without running the setup-04 which sets it the first time. Was more thinking of fixing that setup-05 file separately..

Anyways, I'm running it on a small home lab vSAN cluster. Have tried redeploying a few times, all stopping on the same error message.

I'll try to run it on a different env later tonight to see if that changes anything

@rumart
Author

rumart commented Feb 26, 2024

I've tried on a single ESXi host not running anything else, storage on NVME. I've added more CPU and RAM to the appliance. Still errors out on the same step

I ssh'd to the appliance as soon as it was available and tailed the bootstrap-debug.log. The error failed calling webhook "defaulting.webhook.rabbitmq.eventing.knative.dev" happens after just a couple of minutes. As far as I understand there's a 10 minute timeout on most of the commands?

@rguske
Contributor

rguske commented Feb 27, 2024

IIRC, the 10 minutes are the default for the kubectl wait command if you don't specify --timeout separately. I really wonder about this issue. I deployed it in my homelab (2-node vSAN cluster) as well and it worked like a charm. Anyway, like I said, William had this issue before as well but reordering the command executions did the trick. When I have time, I'll try to add another wait condition to the script(s)(if necessary!). Thanks @rumart

@lamw
Contributor

lamw commented Feb 27, 2024

I suspect that the current "wait" conditions are actually passing, unless you log in and it looks to be waiting for the default 10m as mentioned by Robert. If it truly is a timing issue, we can always enhance the OVF properties to make that customizable, but I'm not sure that's actually the case, and we may need some other wait condition. If we can debug this further, Robert, then we can spin up a custom build to verify for @rumart

@jm66

jm66 commented Apr 23, 2024

Just as @rumart, first error I got:

Error from server (InternalError): error when creating "/root/config/knative/rabbit.yaml": Internal error occurred: failed calling webhook "defaulting.webhook.rabbitmq.eventing.knative.dev": failed to call webhook: Post "https://rabbitmq-broker-webhook.knative-eventing.svc:443/defaulting?timeout=2s": dial tcp 10.109.26.244:443: connect: connection refused

Second try, I increased the timeout value and kept going.

Third try stumbled upon the following:

/root/setup/setup-05-knative.sh: line 44: VEBA_BOM_FILE: unbound variable

Which I had to work around to keep the installation going.

@rguske
Contributor

rguske commented Jun 12, 2024

@rumart I owe you a deep apology for not getting back to you earlier. Would you be open to troubleshooting your issue further? I've just added another wait condition to the setup-05-knative.sh script and have built a new appliance (test) version. I'd love to follow the deployment in your test environment. Maybe we could run a Zoom session?
What really helps to get started is the following approach:

  • deploy a new VEBA instance but do not power it on
  • take a snapshot
  • open a terminal window and use a terminal multiplexer such as tmux
  • create two terminal windows
  • power on VEBA and, as soon as it has its IP configured, connect to it via ssh in both windows
  • run tail -f /var/log/bootstrap-debug.log in one window and watch kubectl get pods -A in the other
  • use the snapshot to reset VEBA whenever necessary, for example to add a new command to one of the scripts (but you need to be very fast when adding a new command 😉 )

From there you can perfectly follow the progress.
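The two-window setup described above could be scripted roughly as follows. This is only a sketch: logging in as `root`, the session name `veba-debug`, and the example address are assumptions, while the log path and the `watch kubectl get pods -A` command are taken from the steps above.

```shell
set -eu

# Build the command for each pane. `ssh -t` allocates a terminal, which
# `watch` needs in order to redraw its output.
log_cmd()  { printf 'ssh root@%s tail -f /var/log/bootstrap-debug.log\n' "$1"; }
pods_cmd() { printf 'ssh -t root@%s watch kubectl get pods -A\n' "$1"; }

# Open a detached tmux session with two side-by-side panes, then attach to it.
# Arguments: appliance address, optional session name (default: veba-debug).
start_veba_debug_session() {
  host="$1"
  session="${2:-veba-debug}"
  tmux new-session -d -s "$session" "$(log_cmd "$host")"
  tmux split-window -h -t "$session" "$(pods_cmd "$host")"
  tmux attach-session -t "$session"
}

# Usage (replace the address with the one assigned to the appliance):
#   start_veba_debug_session 192.0.2.10
```

Key-based ssh authentication makes this smoother, because each pane otherwise stops at its own password prompt.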

image

The new build can be downloaded for testing purposes here: DOWNLOAD

@rumart
Author

rumart commented Jun 13, 2024

Thanks @rguske. I've been busy with other things so haven't had the time myself.
I'm very interested in troubleshooting further and get this up and running.

@rguske
Contributor

rguske commented Jun 13, 2024

> Thanks @rguske. I've been busy with other things so haven't had the time myself. I'm very interested in troubleshooting further and get this up and running.

Sure, just let me know when you have the time and ping me on Discord or Slack (CNCF Workspace). Looking forward to finding the root cause.

@rumart
Author

rumart commented Jun 13, 2024 via email

@rguske
Contributor

rguske commented Jun 13, 2024

> Seems I cannot download the testversion..

I've authorized you now 👍🏻

@benwa

benwa commented Jun 14, 2024

Just to add in, yesterday, we were on vCenter 7.0.3 and I was able to deploy. Today, after an update to vCenter 8.0.2, I get the same error as @rumart.

rabbitmqcluster.rabbitmq.com/veba-rabbit created
Error from server (InternalError): error when creating "/root/config/knative/rabbit.yaml": Internal error occurred: failed calling webhook "defaulting.webhook.rabbitmq.eventing.knative.dev": failed to call webhook: Post "https://rabbitmq-broker-webhook.knative-eventing.svc:443/defaulting?timeout=2s": dial tcp 10.98.98.40:443: connect: connection refused

@rguske
Contributor

rguske commented Jun 17, 2024

Thanks a lot for your input @benwa. I don't think this issue is related to the vSphere version, since the first "real" interaction with the vCenter Server is at line 22 in script 06, when the VSphereSource gets created. It really seems to be a timing issue. I'm still trying to find out which component probably needs a dedicated wait condition.

@benwa

benwa commented Jun 25, 2024

Welp, I redownloaded the ova from the Flings site and ran a checksum. It was different. Redeployed and I'm all good now.
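A checksum comparison like the one mentioned here can be wrapped in a small helper. The sketch below assumes SHA-256 and the `sha256sum` utility; the thread does not say which algorithm was used, and the file name and digest in the usage line are placeholders.

```shell
set -eu

# Return success if the SHA-256 digest of file $1 equals the expected hex digest $2.
verify_sha256() {
  actual="$(sha256sum "$1" | awk '{print $1}')"
  if [ "$actual" = "$2" ]; then
    echo "OK: $1"
  else
    echo "MISMATCH: $1 (got ${actual})" >&2
    return 1
  fi
}

# Usage (placeholders, not real values):
#   verify_sha256 veba.ova "<digest published alongside the download>"
```

The comparison is case-sensitive, so the expected digest should be given in lowercase, as `sha256sum` prints it.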

@lamw lamw closed this as completed Jun 26, 2024
@rumart
Author

rumart commented Jun 26, 2024 via email

@rguske rguske reopened this Jun 28, 2024
@rguske
Contributor

rguske commented Jun 28, 2024

Issue still exists.

@rguske
Contributor

rguske commented Jun 28, 2024

@rumart I've now added a sleep 30 to setup-05-knative.sh. I haven't found the problematic part yet. Could you give this version a try? DOWNLOAD.
Screenshot 2024-06-28 at 21 26 42
Thy
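As an alternative to a fixed sleep, the failing step could in principle be retried until the webhook accepts connections. The helper below is a generic sketch, not what was added to setup-05-knative.sh; the function name and the retry counts in the usage line are illustrative, and the manifest path is the one from the error message in this thread.

```shell
set -eu

# Run a command until it succeeds, at most $1 times, sleeping $2 seconds
# between attempts. Remaining arguments are the command to run.
retry() {
  attempts="$1"; delay="$2"; shift 2
  n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after ${n} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${n} failed, retrying in ${delay}s" >&2
    n=$((n + 1))
    sleep "$delay"
  done
}

# Usage on the appliance:
#   retry 12 5 kubectl apply -f /root/config/knative/rabbit.yaml
```

Because `kubectl apply` is declarative, repeating it after a transient "connection refused" from the webhook should be harmless. A retry loop also adapts to slow and fast hosts alike, where a fixed pause has to be tuned for the slowest one.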

@rumart
Author

rumart commented Jun 29, 2024

Now I'm able to deploy successfully. Tested several times without issues

@rguske
Contributor

rguske commented Jun 30, 2024

> Now I'm able to deploy successfully. Tested several times without issues

Interesting! Thanks a lot for verifying, Rudi. However, I will try to narrow it down. There must be a different way.
We'd really appreciate it if you'd be open to testing further builds. Thy :)

@royiversen78

First time VEBA user eager to get this working, but I'm also experiencing this issue
VEBA 0.8.0
vCenter 8.0.3

/var/log/bootstrap-debug.log

Error from server (InternalError): error when creating "/root/config/knative/rabbit.yaml": Internal error occurred: failed calling webhook "defaulting.webhook.rabbitmq.eventing.knative.dev": failed to call webhook: Post "https://rabbitmq-broker-webhook.knative-eventing.svc:443/defaulting?timeout=2s": dial tcp 10.105.248.31:443: connect: connection refused

@rguske
Contributor

rguske commented Jul 19, 2024

> First time VEBA user eager to get this working, but I'm also experiencing this issue
>
> VEBA 0.8.0
>
> vCenter 8.0.3
>
> /var/log/bootstrap-debug.log
>
> Error from server (InternalError): error when creating "/root/config/knative/rabbit.yaml": Internal error occurred: failed calling webhook "defaulting.webhook.rabbitmq.eventing.knative.dev": failed to call webhook: Post "https://rabbitmq-broker-webhook.knative-eventing.svc:443/defaulting?timeout=2s": dial tcp 10.105.248.31:443: connect: connection refused

Thanks for reporting it. Could you please try the version provided in this comment HERE? Thy

@royiversen78

> Thanks for reporting it. Could you please try the version provided in this comment HERE? Thy

That link doesn't work anymore. Google Drive says:

Sorry, the file you have requested does not exist.

Make sure that you have the correct URL and the file exists.

@rguske
Contributor

rguske commented Jul 29, 2024

I will provide a new link in a bit. I was on vacation and back on the issue now. The issue looks similar to what is described here: https://cert-manager.io/docs/troubleshooting/webhook/

So, it looks to me that the Kubernetes API server is trying to call the rabbitmq-broker-webhook when we are installing the RabbitMQ cluster via kubectl apply -f ${RABBITMQ_CONFIG}.

Even though the following is included in our script, which should ensure that everything is in READY state:

kubectl wait --for=condition=available deploy/rabbitmq-broker-webhook --timeout=${KUBECTL_WAIT} -n knative-eventing
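One possible explanation, offered here only as a hypothesis, is that the Deployment can report Available slightly before the Service in front of it accepts connections on its ClusterIP. A sketch of an additional check is to poll the Service's Endpoints until at least one ready address is listed. The Service name and namespace come from the webhook URL in the error message; the function name and the polling values in the usage line are illustrative.

```shell
set -eu

# Succeed once the named Service has at least one ready endpoint address.
# Arguments: service name, namespace, max polls, seconds between polls.
wait_for_endpoints() {
  svc="$1"; ns="$2"; max_polls="$3"; delay="$4"
  i=0
  while [ "$i" -lt "$max_polls" ]; do
    # .subsets[*].addresses lists only ready addresses; empty output means none yet.
    ips="$(kubectl get endpoints "$svc" -n "$ns" \
      -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null || true)"
    if [ -n "$ips" ]; then
      echo "endpoints for ${svc}: ${ips}"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "no ready endpoints for ${svc} in namespace ${ns}" >&2
  return 1
}

# Usage, before applying the RabbitMQ manifest:
#   wait_for_endpoints rabbitmq-broker-webhook knative-eventing 60 2
```

Even with ready endpoints, the node's proxy rules may take a moment to catch up, so this check narrows the window rather than closing it completely.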

@rguske
Contributor

rguske commented Jul 30, 2024

@royiversen78 use this LINK temporarily.

@royiversen78

> @royiversen78 use this LINK temporarily.

I'm getting the same issue with this version

Error from server (InternalError): error when creating "/root/config/knative/rabbit.yaml": Internal error occurred: failed calling webhook "defaulting.webhook.rabbitmq.eventing.knative.dev": failed to call webhook: Post "https://rabbitmq-broker-webhook.knative-eventing.svc:443/defaulting?timeout=2s": dial tcp 10.108.11.231:443: connect: connection refused

lamw added a commit that referenced this issue Oct 27, 2024
Add a pause to the 05-knative.sh as a workaround for #1176
@rguske
Contributor

rguske commented Oct 27, 2024

@rumart @royiversen78 we added a pause to the installation to ensure service dependencies and availabilities.
Changes just got merged. #1268

If you'd like to test its functionality, please DM me (preferred on CNCF Slack) and I will provide you a download link to the OVA.
Thanks

@dahmanator

The 15 second sleep fix worked for me. It was a battle to get the OVA rebuilt with the fix, but once the rebuilt OVA was used, the VEBA completed first boot configuration successfully.


7 participants