Debugging K8s nginx ingress webhook timeouts
failed calling webhook (wait what?)
If you've found this article because you're 10 pages into your Google search trying to work out why you're seeing:
Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io"
you can benefit from my suffering.
It's your container network
There are a few reasons why you can't reach the webhook, and almost all of them come down to pods not being able to reach other pods or services. If you're getting a server error or invalid data back from the webhook, this post isn't for you.
You can stop reading now if that's all you wanted to know. Otherwise, read on for a quick way to validate that your CNI is broken.
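Before blaming the network, it's worth confirming the admission webhook's service actually exists and has endpoints. The namespace and service name below assume a stock ingress-nginx deployment; adjust them to match yours:

```shell
# Does the admission webhook service exist, and does it have endpoints?
# (names assume the standard ingress-nginx manifests -- yours may differ)
kubectl -n ingress-nginx get svc ingress-nginx-controller-admission
kubectl -n ingress-nginx get endpoints ingress-nginx-controller-admission
```

If the endpoints list is empty, the controller pod isn't ready, and no amount of CNI debugging will help.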
Check pod to pod and pod to service networking
Add a pod from which you can test your CNI:
# dnsutils.yml
apiVersion: v1
kind: Pod
metadata:
  name: dnsutils
  namespace: default
spec:
  containers:
    - name: dnsutils
      image: gcr.io/kubernetes-e2e-test-images/dnsutils:1.3
      command:
        - sleep
        - "3600"
      imagePullPolicy: IfNotPresent
  restartPolicy: Always
k8s-master:~$ kubectl apply -f dnsutils.yml
Find a pod running on a different node from the dnsutils pod.
k8s-master:~$ kubectl get pods -o wide
NAME                         READY   STATUS    RESTARTS   AGE   IP           NODE
dnsutils                     1/1     Running   0          42h   10.244.2.4   k8s-w-2
hello-pod                    1/1     Running   0          42h   10.244.1.4   k8s-w-1
hostnames-85c858bb76-65vkw   1/1     Running   0          41h   10.244.3.4   k8s-w-3
hostnames-85c858bb76-78fzn   1/1     Running   0          41h   10.244.1.5   k8s-w-1
hostnames-85c858bb76-lcvbn   1/1     Running   0          41h   10.244.3.3   k8s-w-3
I'll ping hello-pod, which is on a different node:
k8s-master:~$ kubectl exec dnsutils -- ping 10.244.1.4
PING 10.244.1.4 (10.244.1.4): 56 data bytes
64 bytes from 10.244.1.4: seq=0 ttl=63 time=1.624 ms
If that didn't work, you can skip the next test: your CNI is broken and you need to find out why.
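When pod-to-pod traffic fails, a quick first stop is the CNI's own pods and their logs on the affected nodes. As a sketch, assuming a flannel deployment in kube-system (the label and namespace vary by CNI and version):

```shell
# Check the CNI daemonset pods are running on every node, then read
# their recent logs (label/namespace assume flannel; adjust for your CNI)
kubectl -n kube-system get pods -l app=flannel -o wide
kubectl -n kube-system logs -l app=flannel --tail=50
```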
Next, let's check that you can reach a service from a pod. I have a service which targets my hello-pod.
k8s-master:~$ kubectl get svc
NAME             TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
test-hello-svc   ClusterIP   10.108.88.156   <none>        80/TCP    41h
A wget to that service will confirm whether my pods can reach it.
# can we reach the IP?
k8s-master:~$ kubectl exec dnsutils -- wget -qO- 10.108.88.156
<html><body><p>Hello there!</p></body></html>
# can we look up the svc via coredns?
k8s-master:~$ kubectl exec dnsutils -- wget -qO- test-hello-svc
<html><body><p>Hello there!</p></body></html>
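If the service IP works but the name lookup fails, test CoreDNS directly with the service's fully qualified name. This sketch uses the service from above and assumes the default cluster domain, cluster.local:

```shell
# Resolve the short name and the FQDN via the cluster DNS
# (FQDN assumes the default "cluster.local" cluster domain)
kubectl exec dnsutils -- nslookup test-hello-svc
kubectl exec dnsutils -- nslookup test-hello-svc.default.svc.cluster.local
```

If the FQDN resolves but the short name doesn't, look at the search domains in the pod's /etc/resolv.conf rather than at CoreDNS itself.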
My container network is broken, now what?
Troubleshooting your container network will depend on which CNI you use, so I won't attempt to document all possible options. If you're committed to identifying the issue and don't want to just redeploy your Kubernetes cluster, then tcpdump is your friend. Some tips for your analysis journey:
- You'll be doing your troubleshooting on worker nodes. The fewer containers they host, the easier this will be.
- Depending on your CNI, packets pass through several virtual interfaces on their way out of your node. Make sure you have a rough idea of the topology, otherwise you won't make sense of all the interfaces you'll see on the node.
- When inspecting traffic with tcpdump, consider the following: GENEVE-based CNI plugins like ovn-kubernetes use UDP port 6081.
k8s-w-1:~$ sudo tcpdump -ni brens18 "udp port 6081"
- VXLAN-based CNI plugins may use the older standard port of UDP 8472 (flannel uses this) or the newer port of UDP 4789, so check for both if you aren't sure.
k8s-w-1:~$ sudo tcpdump -ni ens18 "udp port 8472 || udp port 4789"
- iptables rules added by Kubernetes, your container runtime, and your CNI should be reviewed. If you have an otherwise quiet node, you can run a ping from a container and watch the counters increment. This might help you identify a broken rule.
# '-Z' clears counters
# '-v' shows counters
# '-n' avoids reverse DNS lookups for rules
# '-L' with no chain specified lists all chains
k8s-w-1:~$ watch sudo iptables -L -n -v -Z
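To build that rough picture of a node's interface topology, the ip tool can show which interfaces are bridges, veth pairs, and tunnel endpoints; a minimal sketch, run on the worker node:

```shell
# '-d' adds driver-level detail, so each interface reports its type
# (bridge, veth, vxlan, geneve, ...)
ip -d link show
# Narrow the listing down to just tunnel interfaces, if any exist
ip -d link show type vxlan
ip -d link show type geneve
```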
Leave a comment if you have any other tips for quick analysis of a broken k8s cluster.