Failed to attach bpf program to pod interface #8897
Never seen that libbpf error. Might be that there is no qdisc set up at the device?
It is weird to see policy programs attached 2x. Also the [ie] vs [EI] is strange, might be some upgrade issue 🤔 What version did you upgrade from?
Given the lowercase, the error message comes from https://github.com/torvalds/linux/blob/92e5605a199efbaee59fb19e15d6cc2103a04ec2/net/sched/cls_api.c#L2316 Are the failing pods those that existed before the upgrade, or does it happen to new pods as well?
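The kernel returns that message when a filter change or delete targets a priority/protocol pair that no longer exists on the interface. A quick way to see which filter priorities are actually present is to parse `tc filter show dev <iface> ingress`. A minimal sketch (the sample line mimics iproute2's output shape for a Calico preamble program; the exact format can vary by iproute2 version):

```python
import re

def parse_tc_filters(tc_output: str):
    """Extract (pref, handle) pairs from `tc filter show dev <iface> ingress` output."""
    filters = []
    for line in tc_output.splitlines():
        m = re.search(r"pref (\d+) .*? handle (0x[0-9a-f]+)", line)
        if m:
            filters.append((int(m.group(1)), m.group(2)))
    return filters

# Assumed output shape; check against your iproute2 version:
sample = ("filter protocol all pref 32768 bpf chain 0 handle 0x1 "
          "cali_tc_preambl:[195661] direct-action not_in_hw id 195661")
print(parse_tc_filters(sample))  # [(32768, '0x1')]
```

If the pref Felix logs for the old program does not appear in this list, the replace operation will fail with exactly the error above.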
We upgraded from Calico 3.25.0 (applied the new manifest without deleting the old version).
It also happens with new pods (statefulset recreated with the same pod name).
I restarted the statefulset pod (the network connection in the pod is normal):

```
NAMESPACE   NAME                  READY   STATUS    RESTARTS   AGE     IP             NODE
airflow     newfmm-db-airflow-0   2/2     Running   0          6m10s   10.222.47.32   stage1-6495d69d65xsx98l-7s8td
```

```
❯ k exec -it newfmm-db-airflow-0 sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Defaulted container "postgres" out of: postgres, prometheus-postgres-exporter
# curl -k https://kubernetes.default
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403
}
```

```
root@stage1-6495d69d65xsx98l-7s8td:/# ip ro get 10.222.47.32
10.222.47.32 dev calieca2ba1ed43 src 10.220.1.11 uid 0
    cache
root@stage1-6495d69d65xsx98l-7s8td:/# bpftool net | grep calieca2
calieca2ba1ed43(3984) clsact/ingress cali_tc_preambl:[195661] id 195661
calieca2ba1ed43(3984) clsact/egress cali_tc_preambl:[195660] id 195660
```

After restarting postgresql, we will watch for when it crashes again and collect more information.
After 4 days, a container in the pod transitioned to an unhealthy state.

```
❯ k -n airflow get pods -o wide newfmm-db-airflow-0
NAME                  READY   STATUS    RESTARTS   AGE    IP             NODE                            NOMINATED NODE   READINESS GATES
newfmm-db-airflow-0   1/2     Running   0          4d5h   10.222.47.32   stage1-6495d69d65xsx98l-7s8td   <none>           <none>
```

Some diagnostics:

```
root@stage1-6495d69d65xsx98l-7s8td:/# ip ro get 10.222.47.32
10.222.47.32 dev calieca2ba1ed43 src 10.220.1.11 uid 0
    cache
```
```
root@stage1-6495d69d65xsx98l-7s8td:/# bpftool prog | grep calid2b
# empty :(
root@stage1-6495d69d65xsx98l-7s8td:/# bpftool net | grep calieca2
calieca2ba1ed43(3984) clsact/ingress cali_tc_preambl:[400338] id 400338
calieca2ba1ed43(3984) clsact/egress cali_tc_preambl:[400339] id 400339
[root@stage1-6495d69d65xsx98l-7s8td /]# tc -s qdisc show dev calieca2ba1ed43
qdisc noqueue 0: root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc clsact ffff: parent ffff:fff1
 Sent 191201052976 bytes 376857785 pkt (dropped 53362, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
```
The count of dropped packets increases every second.

Logs from calico-node at debug level (more logs in the attachment):

2024-06-23T18:22:34.654135139+03:00 stdout F 2024-06-23 15:22:34.654 [DEBUG][58967] felix/bpf_ep_mgr.go 1430: Ignoring interface that doesn't match the host data/l3 interface regex iface="calieca2ba1ed43"
2024-06-23T18:22:34.658959833+03:00 stdout F 2024-06-23 15:22:34.658 [DEBUG][58967] felix/attach.go 353: Already have a clsact qdisc on this interface iface="calieca2ba1ed43"
2024-06-23T18:22:34.65897197+03:00 stdout F 2024-06-23 15:22:34.658 [DEBUG][58967] felix/bpf_ep_mgr.go 1896: readiness: 0 iface="calieca2ba1ed43"
2024-06-23T18:22:34.659075818+03:00 stdout F 2024-06-23 15:22:34.658 [DEBUG][58967] felix/bpf_ep_mgr.go 2742: ensureProgramAttached 4 hook=ingress iface="calieca2ba1ed43" type="workload"
2024-06-23T18:22:34.659079655+03:00 stdout F 2024-06-23 15:22:34.658 [DEBUG][58967] felix/map.go 130: needs generic object file from_wep_fib_no_log_co-re.o AttachType=hook.AttachType{Hook:0, Family:4, Type:"workload", LogLevel:"off", FIB:true, ToHostDrop:false, DSR:false}
2024-06-23T18:22:34.659081934+03:00 stdout F 2024-06-23 15:22:34.658 [DEBUG][58967] felix/bpf_ep_mgr.go 1896: readiness: 0 iface="calieca2ba1ed43"
2024-06-23T18:22:34.659166532+03:00 stdout F 2024-06-23 15:22:34.659 [DEBUG][58967] felix/map.go 133: generic object file already loaded from_wep_fib_no_log_co-re.o layout=hook.Layout{0:39, 2:40, 3:41, 4:42, 6:43, 7:44}
2024-06-23T18:22:34.659169858+03:00 stdout F 2024-06-23 15:22:34.659 [DEBUG][58967] felix/bpf_ep_mgr.go 2742: ensureProgramAttached 4 hook=egress iface="calieca2ba1ed43" type="workload"
2024-06-23T18:22:34.659171724+03:00 stdout F 2024-06-23 15:22:34.659 [DEBUG][58967] felix/syscall.go 66: UpdateMapEntry(28, [62 0 0 0], [99 0 0 0])
2024-06-23T18:22:34.65917393+03:00 stdout F 2024-06-23 15:22:34.659 [DEBUG][58967] felix/map.go 130: needs generic object file to_wep_no_log_co-re.o AttachType=hook.AttachType{Hook:1, Family:4, Type:"workload", LogLevel:"off", FIB:true, ToHostDrop:false, DSR:false}
2024-06-23T18:22:34.65921784+03:00 stdout F 2024-06-23 15:22:34.659 [DEBUG][58967] felix/syscall.go 119: Map metadata fd=0x1c mapInfo=&maps.MapInfo{Type:3, KeySize:4, ValueSize:4, MaxEntries:10000}
2024-06-23T18:22:34.65933278+03:00 stdout F 2024-06-23 15:22:34.659 [DEBUG][58967] felix/map.go 133: generic object file already loaded to_wep_no_log_co-re.o layout=hook.Layout{0:45, 2:46, 3:47, 4:48, 6:49, 7:50}
2024-06-23T18:22:34.659336321+03:00 stdout F 2024-06-23 15:22:34.659 [DEBUG][58967] felix/syscall.go 66: UpdateMapEntry(28, [63 0 0 0], [99 0 0 0])
2024-06-23T18:22:34.659417834+03:00 stdout F 2024-06-23 15:22:34.659 [DEBUG][58967] felix/syscall.go 119: Map metadata fd=0x1c mapInfo=&maps.MapInfo{Type:3, KeySize:4, ValueSize:4, MaxEntries:10000}
2024-06-23T18:22:34.664465985+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 291: Found old calico program prog=tc.attachedProg{pref:"32768", handle:"0x1"}
2024-06-23T18:22:34.664703578+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 291: Found old calico program prog=tc.attachedProg{pref:"32768", handle:"0x1"}
2024-06-23T18:22:34.664710538+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 104: Pinning map cali_jump2 k 4 v 4
2024-06-23T18:22:34.66471307+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 104: Pinning map cali_progs2 k 4 v 4
2024-06-23T18:22:34.664794065+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 104: Pinning map cali_v4_arp2 k 8 v 12
2024-06-23T18:22:34.664797995+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 104: Pinning map cali_v4_ct3 k 16 v 88
2024-06-23T18:22:34.664800341+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 104: Pinning map cali_v4_nat_fe3 k 16 v 20
2024-06-23T18:22:34.66492409+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 104: Pinning map cali_v4_nat_be k 8 v 8
2024-06-23T18:22:34.664934394+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 104: Pinning map cali_v4_nat_aff k 16 v 16
2024-06-23T18:22:34.664936359+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 104: Pinning map cali_jump2 k 4 v 4
2024-06-23T18:22:34.664938181+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 104: Pinning map cali_state4 k 4 v 464
2024-06-23T18:22:34.664979423+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 104: Pinning map cali_progs2 k 4 v 4
2024-06-23T18:22:34.664989684+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 452: ConfigureProgram HookLayout=hook.Layout{0:39, 2:40, 3:41, 4:42, 6:43, 7:44}
2024-06-23T18:22:34.665120085+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 104: Pinning map cali_v4_arp2 k 8 v 12
2024-06-23T18:22:34.665124979+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 104: Pinning map cali_v4_ct3 k 16 v 88
2024-06-23T18:22:34.665140656+03:00 stdout F 2024-06-23 15:22:34.665 [DEBUG][58967] felix/attach.go 104: Pinning map cali_v4_nat_fe3 k 16 v 20
2024-06-23T18:22:34.665142986+03:00 stdout F 2024-06-23 15:22:34.665 [DEBUG][58967] felix/attach.go 104: Pinning map cali_v4_nat_be k 8 v 8
2024-06-23T18:22:34.665278386+03:00 stdout F 2024-06-23 15:22:34.665 [DEBUG][58967] felix/attach.go 104: Pinning map cali_v4_nat_aff k 16 v 16
2024-06-23T18:22:34.665282781+03:00 stdout F 2024-06-23 15:22:34.665 [DEBUG][58967] felix/attach.go 104: Pinning map cali_state4 k 4 v 464
2024-06-23T18:22:34.665401168+03:00 stdout F 2024-06-23 15:22:34.665 [DEBUG][58967] felix/attach.go 452: ConfigureProgram HookLayout=hook.Layout{0:45, 2:46, 3:47, 4:48, 6:49, 7:50}
2024-06-23T18:22:34.666112544+03:00 stdout F libbpf: Kernel error message: Filter with specified priority/protocol not found
2024-06-23T18:22:34.666293187+03:00 stdout F 2024-06-23 15:22:34.666 [WARNING][58967] felix/attach.go 171: Failed to attach to TC section cali_tc_preamble attachPoint=&tc.AttachPoint{AttachPoint:bpf.AttachPoint{Hook:0, PolicyIdx:62, Iface:"calieca2ba1ed43", LogLevel:"off"}, LogFilte
r:"", LogFilterIdx:-1, Type:"workload", ToOrFrom:"from", HookLayout:hook.Layout{0:39, 2:40, 3:41, 4:42, 6:43, 7:44}, HostIP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xa, 0xdc, 0x1, 0xb}, HostTunnelIP:net.IP(nil), IntfIP:net.IP{0xa9, 0xfe, 0x1, 0x1}, FIB:true, ToHostDrop:false, DSR:false, DSROptoutCIDRs:false, TunnelMTU:0x5aa, VXLANPort:0x12bf, WgPort:0x0, ExtToServiceConnmark:0x0, PSNATStart:0x4e20, PSNATEnd:0x752f, IPv6Enabled:false, RPFEnforceOption:0x2, NATin:0x0, NATout:0x0, UDPOnly:false}
2024-06-23T18:22:34.667112806+03:00 stdout F libbpf: Kernel error message: Filter with specified priority/protocol not found
2024-06-23T18:22:34.66730246+03:00 stdout F 2024-06-23 15:22:34.667 [WARNING][58967] felix/attach.go 171: Failed to attach to TC section cali_tc_preamble attachPoint=&tc.AttachPoint{AttachPoint:bpf.AttachPoint{Hook:1, PolicyIdx:63, Iface:"calieca2ba1ed43", LogLevel:"off"}, LogFilter:"", LogFilterIdx:-1, Type:"workload", ToOrFrom:"to", HookLayout:hook.Layout{0:45, 2:46, 3:47, 4:48, 6:49, 7:50}, HostIP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xa, 0xdc, 0x1, 0xb}, HostTunnelIP:net.IP(nil), IntfIP:net.IP{0xa9, 0xfe, 0x1, 0x1}, FIB:true, ToHostDrop:false, DSR:false, DSROptoutCIDRs:false, TunnelMTU:0x5aa, VXLANPort:0x12bf, WgPort:0x0, ExtToServiceConnmark:0x0, PSNATStart:0x4e20, PSNATEnd:0x752f, IPv6Enabled:false, RPFEnforceOption:0x2, NATin:0x0, NATout:0x0, UDPOnly:false}
2024-06-23T18:22:34.667438952+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/delta_tracker.go 125: Set bpfMap="cali_iface" k={ifIndex: 3984} v={flags: workload,ready, XDPPolicy: -1, IngressPolicy: 63, EgressPolicy: 62, IngressFilter: -1, EgressFilter: -1, name: calieca2ba1ed43}
2024-06-23T18:22:34.667445162+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/delta_tracker.go 142: Set: Key in dataplane already, ignoring. bpfMap="cali_iface"
2024-06-23T18:22:34.667515079+03:00 stdout F 2024-06-23 15:22:34.667 [WARNING][58967] felix/bpf_ep_mgr.go 1675: Failed to apply policy to endpoint, leaving it dirty error=attaching program to wep: error attaching tc program invalid argument name="calieca2ba1ed43" wepID=orchestrator_id:"k8s" workload_id:"airflow/newfmm-db-airflow-0" endpoint_id:"eth0"
2024-06-23T18:22:34.667528735+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/caching_map.go 153: Applying deletions to DP map. name="cali_iface"
2024-06-23T18:22:34.667533857+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/caching_map.go 129: Applying updates to DP map. name="cali_iface"
2024-06-23T18:22:34.667626354+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/bpf_ep_mgr.go 1327: CompleteDeferredWork done.
2024-06-23T18:22:34.667631888+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/endpoint_mgr.go 445: Reporting endpoint status. dirtyEndpoints=set.Set{}
2024-06-23T18:22:34.667792509+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/ipsets.go 276: Asked to resync with the dataplane on next update. family="inet"
2024-06-23T18:22:34.667796401+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/ipsets.go 215: Asked to resync with the dataplane on next update.
2024-06-23T18:22:34.667798229+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/wireguard.go 656: Wireguard is not enabled, skipping sync ipVersion=0x4
2024-06-23T18:22:34.668229581+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/ipsets.go 342: Resyncing ipsets with dataplane. family="inet"
2024-06-23T18:22:34.668234483+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/delta_tracker.go 227: Loading cache of dataplane state. ipsetFamily="inet"
2024-06-23T18:22:34.668237291+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/delta_tracker.go 291: Updated dataplane state. desiredUpdates=0 inDataplaneNotDesired=0 ipsetFamily="inet" totalNumInDP=0
2024-06-23T18:22:34.668239276+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/ipsets.go 243: Doing full resync of BPF IP sets map bpf family="inet"
2024-06-23T18:22:34.668246501+03:00 stdout F 2024-06-23 15:22:34.668 [DEBUG][58967] felix/syscall.go 119: Map metadata fd=0xe mapInfo=&maps.MapInfo{Type:11, KeySize:20, ValueSize:4, MaxEntries:1048576}
2024-06-23T18:22:34.669262736+03:00 stdout F 2024-06-23 15:22:34.669 [DEBUG][58967] felix/ipsets.go 406: List of ipsets: [] family="inet"
2024-06-23T18:22:34.669269369+03:00 stdout F 2024-06-23 15:22:34.669 [DEBUG][58967] felix/delta_tracker.go 227: Loading cache of dataplane state.
2024-06-23T18:22:34.669271576+03:00 stdout F 2024-06-23 15:22:34.669 [DEBUG][58967] felix/delta_tracker.go 291: Updated dataplane state. desiredUpdates=1 inDataplaneNotDesired=0 totalNumInDP=0
2024-06-23T18:22:34.669343478+03:00 stdout F 2024-06-23 15:22:34.669 [DEBUG][58967] felix/delta_tracker.go 227: Loading cache of dataplane state.
2024-06-23T18:22:34.669347348+03:00 stdout F 2024-06-23 15:22:34.669 [DEBUG][58967] felix/delta_tracker.go 291: Updated dataplane state. desiredUpdates=12 inDataplaneNotDesired=0 totalNumInDP=0
This looks like programs are attached. I will have a look at the logs.
Sorry, this is my mistake.
```
root@stage1-6495d69d65xsx98l-tkf2b:/# ip ro get 10.222.59.94
10.222.59.94 dev cali58c64a0c8be src 10.220.1.3 uid 0
    cache
root@stage1-6495d69d65xsx98l-tkf2b:/# bpftool net | grep cali58c64a0c8be
cali58c64a0c8be(2817) clsact/ingress cali_tc_preambl:[9162962] id 9162962
cali58c64a0c8be(2817) clsact/egress cali_tc_preambl:[9162963] id 9162963
root@stage1-6495d69d65xsx98l-tkf2b:/# bpftool prog | grep cali58
6479986: sched_cls name p4i_cali58c64a0 tag 0dbf0d10b02282f0
6479987: sched_cls name p4e_cali58c64a0 tag b1d839878b133ee8
```
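A healthy interface shows both the preamble programs (via `bpftool net`) and the per-interface policy programs, named `p4i_`/`p4e_` plus a truncated interface name, in `bpftool prog`; on a faulty interface the policy programs are missing. A sketch of that check (the 11-character truncation is an assumption derived from the 15-character BPF program name limit, and it matches the `p4i_cali58c64a0` names above):

```python
def policy_progs_present(bpftool_prog_out: str, iface: str) -> dict:
    """Check whether Calico-style per-interface policy programs
    (p4i_ ingress / p4e_ egress prefixes) appear in `bpftool prog` output."""
    short = iface[:11]  # name budget left after the 4-char 'p4i_'/'p4e_' prefix
    return {
        "ingress": f"p4i_{short}" in bpftool_prog_out,
        "egress": f"p4e_{short}" in bpftool_prog_out,
    }

healthy = ("6479986: sched_cls name p4i_cali58c64a0 tag 0dbf0d10b02282f0\n"
           "6479987: sched_cls name p4e_cali58c64a0 tag b1d839878b133ee8\n")
print(policy_progs_present(healthy, "cali58c64a0c8be"))
# {'ingress': True, 'egress': True}
```

Running the same check against the earlier `bpftool prog | grep calid2b` output (which was empty) would report both programs missing.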
The problem also exists with controller and node version v3.27.3.
We're seeing the same problem on k8s 1.29 + kernel 6.5 + Calico 3.27.2. If this is at all helpful:
We were able to reproduce the problem on v3.27.4 with Kubernetes 1.29.
The revision of the pod gets updated, and so does the revision of the endpoint.
@legolego621 what was the annotation that you patched?
Any annotation. For example, I added test=value1...n.
Editing the pod labels works as well.
The problem manifests itself after about a day, when the pod is patched 150 times per minute.
We did this artificially to reproduce the problem.

29.07.2024, 13:53, "Shaun Crampton":
> > when the pod is patched 150 times per minute
>
> What causes the pod to be patched 150 times per minute (or is that just your test to repro the issue)? Calico shouldn't do that, just want to check it's something that you're doing outside Calico.
There are a few things that use annotations to track internal state; one of them is Patroni for HA PostgreSQL.
Yep, missed your earlier comment, thanks. I think @sridhartigera has found the root cause now and he's working on a fix. Each time we attach a program, the kernel uses a new priority number and eventually it runs out of priorities.
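The priority exhaustion can be sketched with a toy model. This is a deliberately simplified illustration of tc's automatic priority selection (`tcf_auto_prio` in net/sched/cls_api.c): when no pref is given, the kernel picks one just below the current minimum, starting from 0xC000 on an empty hook, so every leaked re-attach burns one slot out of a 16-bit space.

```python
def next_auto_pref(existing_prefs) -> int:
    """Toy model of tc's automatic filter priority selection.
    With no pref requested, the kernel picks one just below the
    current minimum; priorities are 16-bit, so repeated
    attach-without-cleanup eventually exhausts them."""
    if not existing_prefs:
        return 0xC000  # kernel's default starting priority (49152)
    pref = min(existing_prefs) - 1
    if pref <= 0:
        raise RuntimeError("out of tc filter priorities")
    return pref

# Each re-attach that leaks the old filter consumes one priority:
prefs = []
for _ in range(3):
    prefs.append(next_auto_pref(prefs))
print(prefs)  # [49152, 49151, 49150]
```

At ~150 patches per minute, the roughly 49k usable priorities below the default run out on the order of a few hours to a day, which lines up with the "after about a day" observation above.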
Context
After upgrading Calico to version 3.27.2 in clusters with PostgreSQL managed by Patroni and deployed by the Zalando operator, we started to see errors in the pods with the DB.
At the same time we see errors in calico-node. These are the errors:
If we check the logs before those errors, we can see that BPF programs were updated for that interface every 10 seconds over a long period of time.
The interface cali58c64a0c8be lacks some BPF programs:
calid2e5dc9 - a healthy interface
cali58c64a0 - a faulty interface
Using calico-bpf, we see the following information:
Expected Behavior
Those errors should not happen and the connection should stay stable.
Current Behavior
After the events described in the context, the pod is not able to establish a connection to anything anymore.
Possible Solution
Steps to Reproduce (for bugs)
Your Environment
Felix configuration