Debugging K8s nginx ingress webhook timeouts

failed calling webhook (wait what?)

If you've found this article because you're 10 pages into your Google search trying to work out why you're seeing:

Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io"

you can benefit from my suffering.
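Before anything else, it helps to see where Kubernetes is actually sending those admission requests. Here's a minimal check, assuming a standard ingress-nginx install; the webhook, namespace and service names below may differ in yours.

# list validating webhooks and find the ingress-nginx one
k8s-master:~$ kubectl get validatingwebhookconfigurations
# see which service the webhook points at (name assumes a default install)
k8s-master:~$ kubectl get validatingwebhookconfiguration ingress-nginx-admission -o yaml
# confirm that service exists and has endpoints
k8s-master:~$ kubectl -n ingress-nginx get svc ingress-nginx-controller-admission
k8s-master:~$ kubectl -n ingress-nginx get endpoints ingress-nginx-controller-admission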

It's your container network

There are a few reasons why you can't reach the webhook, and almost all of them come down to pods not being able to reach other pods or services. If you're getting a server error or invalid data back from the webhook, this post isn't for you.
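To tell those cases apart, look at the tail end of the error message (a network problem typically shows up as a timeout, e.g. "context deadline exceeded" or an i/o timeout dialing the service) and at the controller's own logs. The namespace and deployment name below assume a default ingress-nginx install:

k8s-master:~$ kubectl -n ingress-nginx logs deploy/ingress-nginx-controller --tail=100 | grep -i admission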

You can stop reading now if that's all you wanted to know. Otherwise, read on for a quick way to validate that your CNI is broken.

Check pod to pod and pod to service networking

Add a pod you can use to test your CNI:

# dnsutils.yml
apiVersion: v1
kind: Pod
metadata:
  name: dnsutils
  namespace: default
spec:
  containers:
  - name: dnsutils
    image: gcr.io/kubernetes-e2e-test-images/dnsutils:1.3
    command:
      - sleep
      - "3600"
    imagePullPolicy: IfNotPresent
  restartPolicy: Always

k8s-master:~$ kubectl apply -f dnsutils.yml
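Wait for it to come up before testing:

# block until the pod is Ready (or give up after a minute)
k8s-master:~$ kubectl wait --for=condition=Ready pod/dnsutils --timeout=60s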

Find a pod running on a different node to the dnsutils pod.

k8s-master:~$ kubectl get pods -o wide
NAME                                     READY   STATUS    RESTARTS   AGE   IP            NODE
dnsutils                                 1/1     Running   0          42h   10.244.2.4    k8s-w-2
hello-pod                                1/1     Running   0          42h   10.244.1.4    k8s-w-1
hostnames-85c858bb76-65vkw               1/1     Running   0          41h   10.244.3.4    k8s-w-3
hostnames-85c858bb76-78fzn               1/1     Running   0          41h   10.244.1.5    k8s-w-1
hostnames-85c858bb76-lcvbn               1/1     Running   0          41h   10.244.3.3    k8s-w-3

I'll ping hello-pod, which is on a different node:

k8s-master:~$ kubectl exec dnsutils -- ping 10.244.1.4
PING 10.244.1.4 (10.244.1.4): 56 data bytes
64 bytes from 10.244.1.4: seq=0 ttl=63 time=1.624 ms

If that didn't work, you can skip the next test: your CNI is broken and you need to find out why.
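If the ping failed, a quick first check is whether your CNI pods are even healthy. The grep pattern below is only a guess at common plugin names; substitute whatever you run:

# CNI plugins normally run as a DaemonSet in kube-system (or their own namespace)
k8s-master:~$ kubectl get daemonsets -A
k8s-master:~$ kubectl -n kube-system get pods -o wide | grep -iE 'flannel|calico|cilium|weave'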

Next, let's check that you can reach a service from a pod. I have a service which targets my hello-pod.

k8s-master:~$ kubectl get svc
NAME                        TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
test-hello-svc              ClusterIP   10.108.88.156    <none>        80/TCP              41h

A wget to that service will confirm whether my pods can reach it.

# can we reach the IP?
k8s-master:~$ kubectl exec dnsutils -- wget -qO- 10.108.88.156
<html><body><p>Hello there!</p></body></html>
# can we look up the svc via coredns?
k8s-master:~$ kubectl exec dnsutils -- wget -qO- test-hello-svc
<html><body><p>Hello there!</p></body></html>
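If both of those work, the last hop worth testing is the admission webhook service itself. dnsutils doesn't ship curl, so this uses a throwaway curl pod; the service name and namespace assume a default ingress-nginx install:

# a quick response (even an HTTP error) means the webhook is reachable;
# a hang followed by a timeout points back at pod-to-service networking
k8s-master:~$ kubectl run curltest --rm -it --restart=Never --image=curlimages/curl -- \
    curl -k -m 5 https://ingress-nginx-controller-admission.ingress-nginx.svc:443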

My container network is broken, now what?

Troubleshooting your container network will depend on which CNI you use, so I won't attempt to document all possible options. If you're committed to identifying the issue and don't wish to just redeploy your Kubernetes cluster, then tcpdump is your friend. Some tips for your analysis:

  • You'll be doing your troubleshooting on the worker nodes. The fewer containers they host, the easier this will be.
  • Depending on your CNI, packets pass through several virtual interfaces on their way out of your node. Make sure you have a rough idea of the topology, otherwise you won't make sense of all the interfaces you'll see on the node (see the interface commands after this list).
  • When inspecting traffic with tcpdump, consider the following:
    • GENEVE-based CNI plugins like ovn-kubernetes use udp port 6081
    k8s-w-1:~$ sudo tcpdump -ni brens18 "udp port 6081"

    • VXLAN-based CNI plugins may use the older standard port of udp port 8472 (flannel uses this) or the newer port of udp port 4789. So check for both if you aren't sure.
    k8s-w-1:~$ sudo tcpdump -ni ens18 "udp port 8472 || udp port 4789"
    
  • iptables rules added by Kubernetes, your container runtime and your CNI should be reviewed. If you have an otherwise quiet node, you can run a ping from a container and watch the counters increment. This might help you identify a broken rule.
# '-Z' clears counters
# '-v' shows counters
# '-n' avoids reverse DNS for rules
# '-L' with no chain specified lists all rules
k8s-w-1:~$ watch sudo iptables -L -n -v -Z
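For the "rough idea of the topology" mentioned above, the stock iproute2 tools on the worker node are usually enough. Interface names like cni0, flannel.1 or genev_sys_6081 depend entirely on your CNI, so treat the output as a map rather than a checklist:

# list interfaces with driver details (look for bridge, veth, vxlan and geneve types)
k8s-w-1:~$ ip -d link show
# see which interface the pod CIDR routes point at
k8s-w-1:~$ ip route
# list the veths attached to the CNI bridge, if your CNI uses one
k8s-w-1:~$ bridge link show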

Leave a comment if you have any other tips for quick analysis of a broken k8s cluster.