Calico node crashing without error message on Raspberry Pi 4 connected with wireless wlan0 #8819

tkislan · 2024-05-14T11:57:27Z

apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  # Configures Calico networking.
  calicoNetwork:
    # Note: The ipPools section cannot be modified post-install.
    ipPools:
      - blockSize: 26
        cidr: 10.244.0.0/16
        encapsulation: VXLANCrossSubnet
        natOutgoing: Enabled
        nodeSelector: all()
    nodeAddressAutodetectionV4:
      interface: tun0

I'm using an OpenVPN network to connect edge devices to a master node running in the cloud.
An Intel NUC on the same network as the problematic Raspberry Pi works as expected.

ip addr output

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether dc:a6:32:9f:c1:27 brd ff:ff:ff:ff:ff:ff
3: wlan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether dc:a6:32:9f:c1:28 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.23/24 metric 600 brd 192.168.1.255 scope global dynamic wlan0
       valid_lft 11832sec preferred_lft 11832sec
4: tun0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN group default qlen 500
    link/none
    inet 10.8.0.7/24 scope global tun0
       valid_lft forever preferred_lft forever
5: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:02:f2:d8:dc brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
9: tunl0@NONE: <NOARP,UP,LOWER_UP> mtu 1480 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ipip 0.0.0.0 brd 0.0.0.0
    inet 10.244.210.192/32 scope global tunl0
       valid_lft forever preferred_lft forever
50: calib9ebbc1fedc@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1480 qdisc noqueue state UP group default qlen 1000
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-d27bd61c-6107-d514-2f04-31a40d632e19
54: cali82b6a9674c3@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1480 qdisc noqueue state UP group default qlen 1000
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-b6010d39-9714-044e-86b7-7d308d8f310c

The Ethernet port is not used. Calico should use the tun0 interface, selected through address autodetection; wlan0 is the interface that is connected to the internet.
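
One way to confirm which address autodetection picked is to compare the tun0 address on the node with the address Calico recorded for it. The sketch below is a hypothetical helper (`iface_ipv4` is not part of Calico or iproute2) that pulls an interface's IPv4 address out of `ip -o -4 addr show` output; it assumes the one-line-per-address format that the `-o` flag produces.

```shell
# iface_ipv4 (hypothetical helper): read `ip -o -4 addr show` output on stdin
# and print the IPv4 address, without the prefix length, of the interface
# named in $1. Assumes fields: "<idx>: <ifname> inet <addr>/<len> ...".
iface_ipv4() {
  awk -v ifname="$1" '$2 == ifname && $3 == "inet" {
    split($4, parts, "/")
    print parts[1]
  }'
}

# Usage on the node:
#   ip -o -4 addr show | iface_ipv4 tun0
```

The result can then be compared against the `projectcalico.org/IPv4Address` annotation that calico-node normally sets on the Kubernetes Node object, for example with `kubectl get node <node-name> -o jsonpath='{.metadata.annotations.projectcalico\.org/IPv4Address}'` (the node name is a placeholder). If the annotation shows the wlan0 address instead of the tun0 one, autodetection is not behaving as configured.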

No logs indicate any kind of error; calico-node just ends up in the Completed state and gets restarted.
Other pods fail DNS resolution, probably because the kube-proxy pod is crashing as well.
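
A container reported as Completed exited with code 0, so the recorded exit code of the previous calico-node container can help tell a clean shutdown apart from a kill. A minimal sketch, assuming the usual convention that exit codes above 128 encode a signal number; `explain_exit_code` is a hypothetical helper, not a kubectl feature, and the `calico-system` namespace is assumed because the install uses the Tigera operator.

```shell
# explain_exit_code (hypothetical helper): describe a container exit code.
# Codes above 128 conventionally mean "killed by signal (code - 128)".
explain_exit_code() {
  code="$1"
  if [ "$code" -eq 0 ]; then
    echo "clean exit"
  elif [ "$code" -gt 128 ]; then
    echo "terminated by signal $((code - 128))"
  else
    echo "error exit $code"
  fi
}

# Usage (pod name is a placeholder):
#   code=$(kubectl -n calico-system get pod <calico-node-pod> \
#     -o jsonpath='{.status.containerStatuses[?(@.name=="calico-node")].lastState.terminated.exitCode}')
#   explain_exit_code "$code"
```

An exit code of 0 is consistent with the container being asked to stop and shutting down cleanly, while 137 would point at SIGKILL.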

What is very suspicious, though, is that calico-node logs multiple lines with EndpointId=eth0, which doesn't make sense because eth0 is disabled and not used.

logs:
calico-node-describe.txt
calico-node.log
csi-node-driver.log
kube-proxy-describe.txt
kube-proxy.log

Expected Behavior

Current Behavior

Endless CrashLoopBackOff, no pods working on the node

Possible Solution

Steps to Reproduce (for bugs)

Install Calico with the Tigera operator
Run kubeadm join on a Raspberry Pi connected via the wlan0 interface

Context

Your Environment

Calico version: quay.io/tigera/operator:v1.32.7 docker.io/calico/cni:v3.27.3
Orchestrator version (e.g. kubernetes, mesos, rkt): Kubernetes bootstrapped with kubeadm
Operating System and version: Ubuntu 24.04 LTS
Link to your project (optional):


tomastigera · 2024-05-14T16:42:09Z

@tkislan could it be related to #8726 ?

tkislan · 2024-05-14T17:05:10Z

Unloading the kernel module and restarting the pods doesn't seem to have helped.

  Warning  Unhealthy       20s (x2 over 21s)  kubelet            Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused
  Warning  Unhealthy       16s                kubelet            Readiness probe failed: 2024-05-14 16:56:55.526 [INFO][243] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.8.0.13,10.8.0.1,10.8.0.6

At least here it seems it's getting killed because of the health check.
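
The readiness message lists the peers BIRD could not establish sessions with. Since BGP runs over TCP port 179, a quick reachability check across the VPN can help narrow things down. The sketch below only extracts the peer addresses from a probe message on stdin; `bgp_peers_from_probe` is a hypothetical helper, and the message format is assumed to match the line quoted above.

```shell
# bgp_peers_from_probe (hypothetical helper): read probe output on stdin and
# print each peer from "BGP not established with a,b,c" on its own line.
bgp_peers_from_probe() {
  sed -n 's/.*BGP not established with \([0-9.,]*\).*/\1/p' | tr ',' '\n'
}

# Usage on the affected node (pod name is a placeholder; assumes nc is installed):
#   kubectl -n calico-system describe pod <calico-node-pod> \
#     | bgp_peers_from_probe | sort -u \
#     | while read -r peer; do
#         if nc -z -w 3 "$peer" 179; then echo "$peer: port 179 reachable"
#         else echo "$peer: port 179 unreachable"; fi
#       done
```

An unreachable port would suggest the VPN or a firewall is blocking BGP between nodes; a reachable port would point back at BIRD or the pod lifecycle instead.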

# ls -l /var/run/calico
total 0
srw-rw---- 1 root root  0 May 14 19:00 bird.ctl
srw-rw---- 1 root root  0 May 14 19:00 bird6.ctl
drwx------ 2 root root 40 May 14 18:56 cgroup
-rw------- 1 root root  0 May 14 18:56 ipam.lock

But the socket files exist on the host.

# ./calicoctl node checksystem
Checking kernel version...
		6.8.0-1004-raspi    					OK
Checking kernel modules...
		nf_conntrack_netlink					OK
		xt_addrtype         					OK
		xt_icmp             					OK
		ip_set              					OK
		ip6_tables          					OK
		ip_tables           					OK
		ipt_rpfilter        					OK
		xt_mark             					OK
		xt_multiport        					OK
		vfio-pci            					OK
		xt_bpf              					OK
		ipt_REJECT          					OK
		xt_rpfilter         					OK
		ipt_set             					OK
		xt_icmp6            					OK
		ipt_ipvs            					OK
		xt_conntrack        					OK
		xt_set              					OK
		xt_u32              					OK
System meets minimum system requirements to run Calico!

Let me know what other information I can provide. I'm really desperate here; I've been trying to figure this out for the past 3 days.

coutinhop · 2024-06-04T16:35:53Z

@tkislan could you please enable debug logging (by setting logSeverityScreen to Debug in the default FelixConfiguration), and see if that gives us more info?

kubectl patch felixconfiguration default --type merge --patch='{"spec":{"logSeverityScreen":"Debug"}}'
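
Debug logging is verbose, so it is usually worth switching back once the logs are captured. A small sketch that builds the same merge patch for any level; `felix_log_patch` is a hypothetical helper, and Info is assumed to be the level to return to.

```shell
# felix_log_patch (hypothetical helper): print the JSON merge patch that sets
# logSeverityScreen in a FelixConfiguration to the level given in $1.
felix_log_patch() {
  printf '{"spec":{"logSeverityScreen":"%s"}}' "$1"
}

# Usage:
#   kubectl patch felixconfiguration default --type merge --patch="$(felix_log_patch Debug)"
#   ... reproduce the restart and save the calico-node logs ...
#   kubectl patch felixconfiguration default --type merge --patch="$(felix_log_patch Info)"
```

Because calico-node is restarting, `kubectl logs --previous` on the pod may be needed to read the output of the container that just exited.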

caseydavenport · 2024-06-04T16:38:44Z

> but what is very suspicious is, that there are multiple logs in calico-node, with EndpointId=eth0, which doesn't make sense, because it is disabled and not used

This is referring to the endpoint name within the container, not the host's eth0, so I think this is OK and a red herring.

Typically, when calico/node just stops without any indication, it's due to kubelet or something external to Calico shutting us down for some reason.

Looking at the logs, it appears that calico/node is reporting that it is "live", so this is unlikely to be due to the liveness probe.

I think you may want to look at the kubelet or container runtime logs here to see if either of those suggest they are terminating the calico/node pod.
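
As a rough starting point for that search, the kubelet and runtime journals can be narrowed to lines that mention calico-node together with a termination-related keyword. This is only a sketch: `calico_node_events` is a hypothetical helper, the keyword list is a guess rather than an exhaustive set of kubelet messages, and the unit names assume a systemd host running kubelet with containerd.

```shell
# calico_node_events (hypothetical helper): read log lines on stdin and keep
# those that mention calico-node AND one of a few assumed keywords.
calico_node_events() {
  grep -i 'calico-node' | grep -iE 'kill|probe|sandbox|oom|evict'
}

# Usage on the affected node:
#   journalctl -u kubelet --since "1 hour ago" --no-pager | calico_node_events
#   journalctl -u containerd --since "1 hour ago" --no-pager | calico_node_events
```

Lines about killing the container or recreating the pod sandbox would indicate that something outside Calico is stopping calico/node.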

caseydavenport · 2024-06-18T16:22:28Z

Any news on this issue? Did you get a chance to look at the kubelet / runtime logs to see if either is killing Calico?

tomastigera added the kind/support label May 14, 2024
