Container networking on Kubernetes broken after Server 2022 July 2024 / KB5040437 (OS Build 20348.2582) update #516

avin3sh · 2024-07-17T19:06:34Z

Describe the bug
Pod networking breaks after installing the July CU on Windows Server 2022. For example, ping microsoft.com from within the container returns "General failure". The pod is also not reachable from other pods or through a Service.

Uninstalling KB5040437 fixes the issue.

To Reproduce

1. Set up a Windows worker with the Calico VXLAN CNI provider (https://docs.tigera.io/calico/latest/getting-started/kubernetes/windows-calico/manual-install/standard)
2. Install the July cumulative update on the Server 2022 worker
3. Exec into a running container on the worker
4. ping or curl any address
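
The last two steps can be sketched from any machine with kubectl access. This is a minimal sketch, not part of the original report: the pod name and namespace are hypothetical placeholders, and it assumes the container image ships ping and curl.exe (Windows Server Core does).

```powershell
# Hypothetical names: point these at a pod running on the patched Windows worker.
$pod = 'my-windows-pod'
$ns  = 'default'

# On an affected node, ping is expected to report "General failure".
kubectl exec -n $ns $pod -- ping -n 2 microsoft.com
"ping exit code: $LASTEXITCODE"

# Same check over HTTPS; -m 10 caps the attempt at 10 seconds, -o NUL discards the body.
kubectl exec -n $ns $pod -- curl.exe -sS -m 10 -o NUL https://www.microsoft.com
"curl exit code: $LASTEXITCODE"
```

A non-zero exit code from either command on a patched node, and zero on an unpatched one, matches the behaviour described above.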

Expected behavior

The pod should be able to reach external networks and should be reachable from other pods.

Configuration:

Edition: Windows Server
Base Image being used: Windows Server Core
Container engine: containerd
Container Engine version: 1.6.31

/label Windows on Kubernetes


avin3sh · 2024-07-20T04:03:42Z

+@jsturtevant who saw this in kubernetes/test-infra#33042

ntrappe-msft · 2024-07-22T23:59:56Z

@grcusanz / @kestratt have either of you seen this Issue popping up for you?

zylxjtu · 2024-07-23T16:39:49Z

I'm having issues in my AKS testing with the latest images. It looks to be related.

avin3sh · 2024-07-25T12:58:30Z

Folks over at Calico pointed back to this issue, suggesting the problem isn't on Calico's end (projectcalico/calico#9019 (comment)).

Currently our workers can't have up-to-date security patches because of this. I noticed the ADO label, so I am hoping we will have some update soon 🤞

kysu1313 · 2024-07-25T18:52:31Z

We are having this exact same issue with our Windows deployments, which use mcr.microsoft.com/dotnet/framework/aspnet:4.8.1-windowsservercore-ltsc2022.

Any update or suggestion on a fix would be greatly appreciated.

ntrappe-msft · 2024-07-31T18:28:06Z

@kysu1313 Do you have the Windows patch KB5040437 installed too?
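
For anyone wanting to answer this on a node, a minimal PowerShell sketch follows. It assumes it is run on the Windows worker itself. Get-HotFix may not list the KB once a later cumulative update has superseded it, so the OS build (20348.2582 or higher for the July CU) is printed as a cross-check.

```powershell
# Check whether the July 2024 cumulative update (KB5040437) is installed on this node.
# -ErrorAction SilentlyContinue suppresses the error raised when the KB is not found.
$kb = 'KB5040437'
$hotfix = Get-HotFix -Id $kb -ErrorAction SilentlyContinue

if ($hotfix) {
    "{0} is installed (InstalledOn: {1})" -f $kb, $hotfix.InstalledOn
} else {
    "$kb was not found on this node"
}

# Cross-check against the OS build from the issue title (20348.2582).
# UBR is the update build revision stored next to CurrentBuild.
$cv = Get-ItemProperty -Path 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion'
"OS build: {0}.{1}" -f $cv.CurrentBuild, $cv.UBR
```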

orest-gulman · 2024-07-31T21:24:51Z

I have the same issue. After uninstalling KB5040437, network connectivity is restored.

avin3sh · 2024-08-01T06:24:44Z

@ntrappe-msft Can you confirm whether we can expect a fix in the August CUs? We have been holding off upgrading to the July CU, but leaving our cluster unpatched for two or more months as a result raises security concerns.

ntrappe-msft · 2024-08-01T18:16:31Z

@avin3sh We're getting this assigned to an engineer right now. Once we do that, they can inform everyone of what the timeline looks like.

Nova-Logic · 2024-08-09T08:24:31Z

Three weeks have passed, and not only do we not have a fix for a bug that makes Windows containers unusable, we don't even have a timeline. That looks very strange.

ntrappe-msft · 2024-08-09T18:36:00Z

@Nova-Logic Sorry for the delay, we know this is a big blocker. We've switched it to a new engineer and should have an update to provide next week.

vemsec · 2024-08-12T15:47:53Z

@avin3sh how are you uninstalling KB5040437? I received this error when attempting to uninstall via wusa /uninstall /kb:5040437 /norestart:
"Security Update for Microsoft Windows (KB5040437) is required by your computer and cannot be uninstalled."

I also get the General failure ping errors on a fully updated Windows Server 2019. Calico seems completely broken for Windows in general right now.

avin3sh · 2024-08-13T20:31:39Z

No mention of this issue in today's patches. I am guessing this was not addressed?

davidgiga1993 · 2024-08-14T05:24:56Z

How is this still not fixed? We can't update any of our Windows nodes, as the patch can't even be uninstalled.

avin3sh · 2024-08-14T09:18:47Z

I just tried and can confirm the August patch (KB5041160) does not fix the issue. The patch contains Important CVEs, which leaves our cluster potentially vulnerable if not patched. @ntrappe-msft I appreciate that an engineer is already assigned to this issue, but is it possible for us to get some update on the fix?

avin3sh · 2024-08-16T11:11:07Z

We are coming to the end of another week. Can we please have the update we were promised?

> We've switched it to a new engineer and should have an update to provide next week.

ntrappe-msft · 2024-08-16T21:26:19Z

Unfortunately, I don't have news to share yet of a fix. We're waiting on a response from the engineer assigned. We'll bump this Issue up in priority.

davidgiga1993 · 2024-08-21T12:35:55Z

Any update? At least a rough estimate / schedule? Currently k8s Windows container networking is simply broken and not usable. We will soon be forced to terminate all our Windows nodes, as we can't patch them anymore due to this issue.

beedle2017 · 2024-08-22T09:44:06Z

We are a large customer of Windows Containers and are deeply concerned that this issue remains unresolved.

Neither the July nor August security updates even acknowledge this issue under the "Known issues in this update" section.

We are curious what criteria a Containers issue must meet to warrant expedited support and official mention in monthly updates. Does "everything about container networking is broken after July" not meet these criteria?

The support on this problem so far has raised several internal questions about stability of Windows Containers as a platform. The way Microsoft handles this problem will dictate how seriously we would be able to take Windows Containers for any initiatives going forward.

Nova-Logic · 2024-08-22T11:36:57Z

It's really sad, but I believe we should admit this:
1) Since a fix is still not available, it seems Microsoft doesn't have sufficient resources to support the product and continue its development.
2) Windows containers are not, and will not be, a production-grade solution. The release of CUs that broke container networking is clear evidence that Microsoft did not test those CUs with Windows containers (or did not test them properly, relying on the assumption that if a container starts, all is OK).
3) Those who relied on it should migrate to PowerShell DSC, Terraform, or both because of point 2.

It's hard to ruin a product's reputation more than Microsoft did: release a CU that broke container networking and then ghost the customers for more than a month. MS didn't even bother to write about the issue under known problems (or it's possible MS still isn't fully aware of the problem).

We (I mean the community) can check whether Microsoft cares about this product by spreading this insane story across dev/devops/tech bloggers and watching MS's reaction.

avin3sh · 2024-08-25T10:19:43Z

As we head into another week, do we have any new update? As we inch closer to next month's patches, the growing uncertainty about the fix means we will have to force the hosts to update anyway and look at some alternative for hosting the workloads. We can't leave the Windows workers unpatched for three months in a row.

All of this tedious extra work can be avoided, or at least planned better, if there is some transparency on how the Windows Containers team is planning to tackle this issue.

If this issue is affecting even the official sig-windows Kubernetes e2e tests, not prioritizing this problem paints a very bad picture of Windows Containers as a product, for both existing and potential future customers.

I tried some experimentation with Docker Swarm with overlay networking but couldn't reproduce this specific scenario, which seems to suggest the issue might be specific to the encapsulation mode or to ACLs on HNS Endpoints. But again, my guess is as good as anyone else's, and without some insight into the issue from the product team, it is difficult to even think of a workaround.

Nova-Logic · 2024-08-27T12:12:10Z

27 August, still no fix

jwilsonCX · 2024-08-27T16:20:59Z

I apologize for my ignorance, but I'd really appreciate it if someone here in the community could clarify the nature and scope of this issue for me.

My understanding from the thread above is that Microsoft's July update for Windows Server 2022 has somehow borked networking for Windows pods/containers deployed to Kubernetes nodes running that version of Windows Server. However, do we know the extent to which the various local/cloud flavours of Kubernetes environment(s) might be affected? For example, has anyone observed this same behaviour when using the latest versions of the Amazon "Kubernetes optimized AMIs" in EKS, or similar counterparts in AKS?

As for what might be causing the issue, I wonder if there is a potential for some underlying dependency issue with the [versions of the] tools used to build the Windows container images themselves? For example, the version/patching of the Windows base image that the container is built from?

Regardless, the apparent lack of any cogent response from Microsoft is definitely... disquieting.

jwilsonCX · 2024-08-28T18:44:46Z

Hi @grcusanz, are you in a position to better describe the exact nature and scope of the problem as you understand it at this time? For example, is it limited to HNS implementations as some have posited above, or is CNI impacted too?

JamesKehr · 2024-08-30T15:57:21Z

Hi everyone, please follow these steps and comment to let me know if it resolves the issue with the July or August update installed.

1. Open regedit (Registry Editor).
2. Go to: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\hns\State
3. Add or update the following value to the State key:

   Name : FwPerfImprovementChange
   Type : DWORD
   Value : 0

4. [required] Reboot or restart the HNS (Host Networking Service) service.
5. Test

CAUTION! Network connectivity will be lost to all containers on the node during an HNS restart! Container networking should automatically recover. Please report back if you have a different experience.

Nova-Logic · 2024-08-30T16:27:49Z

@JamesKehr at the moment it looks like it helped. I will continue testing over the weekend and post a follow-up on Monday.

JamesKehr · 2024-08-30T17:01:59Z

> @JamesKehr at this moment looks like it helped, would continue testing on this weekend and post follow-up on Monday

Thank you for the confirmation, @Nova-Logic! Please let me know if the status changes.

grcusanz · 2024-08-30T23:03:57Z

Thanks James for identifying and sharing the workaround! The initial fix that caused this was implemented to resolve a customer issue with Calico network policy at scale. It shipped in April, disabled by default in Windows, but was enabled by default for AKS nodes. There were no issues that we were aware of with this fix in AKS. Following our standard process, this then became enabled in Windows by default in July. James's workaround is the first step; we're now investigating the root cause of why this fix broke networking in July and will report back here when we have more info, and again when we have a permanent fix available.

avin3sh · 2024-09-02T10:28:16Z

Thanks for sharing the background @grcusanz.

> There were no issues that we were aware of with this fix in AKS. Following our standard process, this then became enabled in Windows by default in July

This seems to suggest there are gaps somewhere in the test/release process. Given the scale of the effect a simple change like this had, would the team be open to covering the various common configurations mentioned in this issue, since these seem to be popular with Windows Containers customers besides the standard AKS setup with Azure CNI? It looks like networking tests covering Calico VXLAN/overlay might have helped identify this problem early on and prevented the change from going into the monthly patches.

doctorpangloss · 2024-09-08T18:21:01Z

> Hi everyone, please follow these steps and comment to let me know if it resolves the issue with the July or August update installed.
>
> 1. Open regedit (Registry Editor).
> 2. Go to: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\hns\State
> 3. Add or update the following value to the State key:
>    Name : FwPerfImprovementChange
>    Type : DWORD
>    Value : 0
> 4. Reboot [required].
> 5. Test

Can I do this before updating, or will this be overwritten by the update?

wech71 · 2024-09-09T06:58:19Z

> Can I do this before updating, or will this be overwritten by the update?

As the key does not exist after updating, I strongly suspect it will not be overwritten. So yes, I guess.

But to be sure, just check after updating if it still is 0 🤷
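
A minimal PowerShell sketch of that check, assuming it is run on the worker after the update has been installed. The path and value name are the ones given in the workaround steps.

```powershell
# Read FwPerfImprovementChange back after patching to confirm it is still 0.
$path = 'HKLM:\SYSTEM\CurrentControlSet\Services\hns\State'
$name = 'FwPerfImprovementChange'

# Get-ItemProperty errors if the key or value is absent; suppress that and test for $null.
$item = Get-ItemProperty -Path $path -Name $name -ErrorAction SilentlyContinue

if ($null -eq $item) {
    "$name is not set"
} elseif ($item.$name -eq 0) {
    "$name is 0; the workaround is still in place"
} else {
    "$name is $($item.$name); the workaround is not in effect"
}
```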

JamesKehr · 2024-09-09T15:19:07Z

@doctorpangloss you can safely add the registry value prior to updating. The default value applies only when the registry value is not present. A present reg value will always take precedence over the default value.

@wech71 Spot on!

JamesKehr · 2024-09-09T15:25:37Z

I updated the steps to include a no reboot option. The registry value is read during the start of the HNS service. Restarting the HNS service will cause the reg value change to be read and container networking will be rebuilt.

CAUTION! Network connectivity will be lost to all containers on the node during the HNS restart! Container networking should automatically recover. Please report back if you have a different experience.

aaabdallah · 2024-09-14T09:04:17Z

Thank you @JamesKehr for the workaround.
If a PowerShell equivalent is helpful to others, here it is (note that the first command may fail if the key already exists, but with no harm). Of course, the third command will forcibly reboot the computer.

New-Item -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\hns\State'

New-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\hns\State' -Name 'FwPerfImprovementChange' -Value 0 -PropertyType DWord

Restart-Computer -Force
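
A variant of the commands above, sketched under the assumption that you prefer the no-reboot option described earlier in the thread: it tolerates an existing key or value and restarts the HNS service rather than the machine. Run it from an elevated session on the node. Container networking on that node drops while HNS restarts.

```powershell
# Sketch: apply the FwPerfImprovementChange workaround without a reboot.
$path = 'HKLM:\SYSTEM\CurrentControlSet\Services\hns\State'
$name = 'FwPerfImprovementChange'

# Create the State key only if it is missing. New-Item -Force is avoided here
# because it can replace an existing key along with its values.
if (-not (Test-Path -Path $path)) {
    New-Item -Path $path | Out-Null
}

# -Force lets New-ItemProperty overwrite the value if it already exists.
New-ItemProperty -Path $path -Name $name -Value 0 -PropertyType DWord -Force | Out-Null

# HNS reads the value when the service starts, so restart it to pick up the change.
Restart-Service -Name hns -Force
```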

JamesKehr · 2024-09-17T17:00:00Z

@aaabdallah Thank you for confirmation and the PowerShell commands!

tomasvaclavik · 2024-09-23T08:21:47Z

Hello,
I've applied your registry fix and can confirm it worked until the worker nodes crashed (due to technical problems, not Kubernetes-related). After that, all started pods had their network inaccessible again, and I had to drain the node and reboot to fix it.
I can confirm that after a dirty reboot of a worker node (without draining first), networking is broken on pods running on that node (unless they are terminated and started on another node by the cluster). Drain + reboot + uncordon fixes it.
The registry value stays set.
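
The drain + reboot + uncordon sequence can be sketched as follows, run from a machine with kubectl access. The node name is a hypothetical placeholder, and the reboot itself is left as a manual step because how you restart a node depends on your environment.

```powershell
# Hypothetical node name; substitute your own.
$node = 'win-worker-1'

# Evict pods first so they are rescheduled elsewhere rather than restarted in place.
# --delete-emptydir-data allows eviction of pods using emptyDir volumes (their data is lost).
kubectl drain $node --ignore-daemonsets --delete-emptydir-data
if ($LASTEXITCODE -ne 0) { throw "drain failed for $node" }

# Reboot the node by whatever means you normally use, then continue.
Read-Host "Reboot $node now, then press Enter once it has come back up"

# Wait until the node reports Ready, then allow scheduling on it again.
kubectl wait --for=condition=Ready "node/$node" --timeout=15m
kubectl uncordon $node
```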

vincent-guillemette-fr · 2024-09-25T09:59:36Z

Hello,
Affected too, the workaround helped, looking forward to a definitive fix.

avin3sh · 2024-10-24T07:38:15Z

October CU confirms that this is fixed:

[Containers (known issue)] Fixed: Container networking on Kubernetes might not work as you expect. Containers fail to reach external networks or communicate between pods. It might affect you when you use Calico to set up container networking on development or production instances. If affected, containers will not connect to the internet. The host’s firewall also blocks network traffic. When you ping external addresses, like ‘microsoft.com,’ you might get a general failure error message.

Is it safe to not proactively apply the workaround when adding new worker nodes or rebuilding existing ones ?

ntrappe-msft · 2024-10-24T20:58:44Z

> Is it safe to not proactively apply the workaround when adding new worker nodes or rebuilding existing ones ?

Hi, thanks for asking a follow-up question. We're currently waiting on a response from the responsible team.

Use master hashrel build for Win FVs. Use k8s and kind versions from metadata.mk in Win FVs. Extract latest KUBE_VERSION from az images to use in capz cluster (as they might not exactly match the versions from metadata.mk). Bump capz versions. Add node IP bootstrapping on k8s v1.29+ (as kubelet no longer sets node IPs on external cloud-providers). Change generated ssh/scp helpers to use full node IPs. Enable felix debug logging and collect pod logs at the end of tests. Add more logging on powershell commands in windows policy_test.go. Add workaround for microsoft/Windows-Containers#516 to CAPZ Win FVs. Disable Felix CAPZ Windows FVs temporarily.

microsoft-github-policy-service · 2024-11-25T15:15:40Z

This issue has been open for 30 days with no updates.
no assignees, please provide an update or close this issue.

avin3sh added the bug (Something isn't working) and triage (New and needs attention) labels Jul 17, 2024

ntrappe-msft added the Windows on Kubernetes (Windows Containers using Kubernetes) and 🔖 ADO (Has corresponding ADO item) labels Jul 23, 2024

This was referenced Jul 24, 2024

Windows CNI broken after latest EKS image update projectcalico/calico#9043

Closed

Container networking broken after Windows Server 2022 July 2024 / KB5040437 (OS Build 20348.2582) update projectcalico/calico#9019

Closed

ntrappe-msft removed the triage (New and needs attention) label Aug 1, 2024

jsturtevant linked a pull request Aug 30, 2024 that will close this issue

Apply registry key to fix WS2022 networking kubernetes-sigs/windows-testing#466

Open

adrianm-msft added the Networking (Connectivity and network infrastructure) label Sep 4, 2024

haodeon mentioned this issue Sep 12, 2024

[Question] Windows node networking Azure/AKS-Edge#177

Open

Ponmuthu-hub mentioned this issue Sep 13, 2024

Liveness Probe on CSI SMB pod was restarting multiple times on Windows Server 2022 Node kubernetes-csi/csi-driver-smb#846

Closed

This was referenced Sep 24, 2024

Windows Pods IPs Unreachable projectcalico/calico#9272

Closed

Windows node traffic interruptions projectcalico/calico#9219

Closed

ntrappe-msft removed the P0 (Needs attention ASAP) label Oct 2, 2024

Breee mentioned this issue Oct 14, 2024

Networking issues with calico on windows nodes - no internet connectivity kubernetes-sigs/sig-windows-tools#378

Open

coutinhop added a commit to coutinhop/calico that referenced this issue Nov 8, 2024

Add workaround for microsoft/Windows-Containers#516

526917e

coutinhop mentioned this issue Nov 8, 2024

Disable CAPZ Win FV tests temporarily projectcalico/calico#9421

Merged

3 tasks
