Kubernetes configuration/administration: common issues and questions

kubelet GC deletes unused Docker images/containers very frequently

Symptoms:
kubelet logs show messages such as:
kubelet[5105]: I0428 15:55:45.440934 5105 image_gc_manager.go:305] [imageGCManager]: Disk usage on image filesystem is at 86% which is over the high threshold (85%). Trying to free 4572454912 bytes down to the low threshold (80%).
Solution:
Edit /var/lib/kubelet/config.yaml and set less aggressive thresholds for the kubelet image GC; a configuration sketch follows.
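A minimal sketch of the relevant KubeletConfiguration fields; the threshold values below are illustrative (the defaults are 85 and 80, as the log message above shows) and should be adapted to your disk capacity:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# image GC is triggered when disk usage on the image filesystem exceeds this percentage
imageGCHighThresholdPercent: 95
# image GC frees space until disk usage drops below this percentage
imageGCLowThresholdPercent: 90

Restart the kubelet afterwards (sudo systemctl restart kubelet) so the new thresholds are taken into account.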

kubectl keys/certificates location

Using kubectl requires a kubeconfig file (or any kube config file) that specifies user(s), cluster(s), and context(s).
The important data in it are the certificates and the key.
CA, certificate, and key file locations:
/var/lib/kubelet/pki (may be outdated)
/etc/kubernetes/pki/apiserver-kubelet-client.crt and apiserver-kubelet-client.key
To know the most recent crt/key, check the files used by the kubectl process.

The CA, certificate, and key in base64-encoded form may be found in two locations:
/etc/kubernetes/admin.conf or /etc/kubernetes/kubelet.conf (that information should be up to date)
~/.kube/config (that information may be outdated if the Kubernetes certificates were renewed)
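To check which certificate a kubeconfig actually embeds, a hedged one-liner (assuming the client-certificate-data key sits on a single line, as kubectl writes it):

# extract the base64 value, decode it, and print the certificate subject and expiration date
grep client-certificate-data ~/.kube/config | awk '{print $2}' | base64 -d | openssl x509 -noout -subject -enddate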

kubelet configuration is missing

Symptoms:
kubelet fails to start with the following error messages:
janv. 15 14:09:12 david-Virtual-Machine kubelet[15420]: F0115 14:09:12.570333 15420 server.go:198] failed to load Kubelet config file /var/lib/kubelet/config.yaml, error f
janv. 15 14:09:12 david-Virtual-Machine systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a
janv. 15 14:09:12 david-Virtual-Machine systemd[1]: kubelet.service: Failed with result 'exit-code'.
Solution:
– copy /var/lib/kubelet/config.yaml from a node where kubelet works to the current node (see the sketch below)
– or reinitialize the cluster: kubeadm init
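A hedged sketch of the copy, assuming SSH access to a healthy node (HEALTHY_NODE is a placeholder):

# copy the kubelet configuration from a node where kubelet works
scp root@HEALTHY_NODE:/var/lib/kubelet/config.yaml /var/lib/kubelet/config.yaml
# restart kubelet to pick up the restored configuration
sudo systemctl restart kubelet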

kubernetes certificates have expired

Client certificates generated by kubeadm expire after 1 year.
Symptoms:
kubectl fails to execute commands with the error:

Unable to authenticate the request due to an error

Solution: identify the expired certificates and renew them.
Certificates are stored in /etc/kubernetes/pki/.

– To identify their expiration dates:
find /etc/kubernetes/pki/ -name "*.crt" -exec sh -c 'echo {} && openssl x509 -in {} -noout -text | grep "Not After"' \;
Or more simply with the kubeadm command:
kubeadm alpha certs check-expiration

– To renew all certificates:
With fairly recent versions (end of 2019) of Kubernetes:
kubeadm alpha certs renew all
This makes the certificates expire in 1 year (the same default as during the Kubernetes installation).
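A quick way to confirm the renewal took effect, e.g. on the API server certificate stored in the pki folder mentioned above:

openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -enddate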

kubernetes certificates configuration incorrect

Symptoms:
kubectl fails to execute some commands with the error:

error: You must be logged in to the server (Unauthorized)

kubectl may succeed for some commands, such as kubectl get pods, but fail for others, such as kubectl logs fooPod.
Cause:
– the client certificates were renewed. Renewal generates new certificates, but the kubectl client configuration still uses the old certificate data.

Solution: update the user's kubectl client configuration with the new certificate configuration.
We need to retrieve the new values of the client-certificate-data and client-key-data YAML keys, either from /etc/kubernetes/kubelet.conf or from /etc/kubernetes/admin.conf.
Then we replace the stale values of these two keys in the client configuration located in ~/.kube/config (a sketch follows).
No restart is required in theory, but sometimes… see below.
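A hedged sketch for retrieving the fresh values (admin.conf is usually readable by root only):

# print the up-to-date base64 values; paste them over the stale ones in ~/.kube/config
sudo grep -E 'client-certificate-data|client-key-data' /etc/kubernetes/admin.conf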

kubernetes system pods don't take certificate renewals into account

Symptoms:
The certificates were renewed, with valid "Not Before" and "Not After" dates, but kubectl keeps producing errors.
1) kubectl fails to execute some commands with the error:

error: You must be logged in to the server (Unauthorized)

kubectl may succeed for some commands, such as kubectl get pods, but fail for others, such as kubectl logs fooPod.
2) the api-server logs show an endless loop of error messages such as:

authentication.go:104] Unable to authenticate the request due to an error: x509: certificate has expired or is not yet valid
authentication.go:104] Unable to authenticate the request due to an error: x509: certificate has expired or is not yet valid
authentication.go:104] Unable to authenticate the request due to an error: x509: certificate has expired or is not yet valid

Other system pods, such as kube-controller-manager, show similar messages.

Cause: the kube-apiserver pod does not take the certificate renewal into account.
Solution: delete the kube-apiserver pod on the master node so that Kubernetes recreates it, or kill its Docker containers directly (see the sketch below).
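A hedged sketch of the direct kill, assuming Docker is the container runtime; since kube-apiserver is a static pod, kubelet recreates the containers automatically:

# locate the kube-apiserver containers, then kill them; CONTAINER_ID comes from the first command's output
docker ps | grep kube-apiserver
docker kill CONTAINER_ID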

Cause: other system pods do not take the certificate renewal into account.
Solution: restart the docker service (a command follows).
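On a systemd-based machine:

sudo systemctl restart docker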

Cause: the system date is incorrect.
Solution: set the system date correctly (see the worker node date section below).

Master node issue: the network is not set up

Symptoms:
kubectl get/describe nodes indicates that the master is not ready, with the following error message:
Ready False KubeletNotReady runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

Solution: apply the deployment of the network (CNI) pod; an example follows.
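For example, with Flannel as the network plugin (any CNI plugin does; the URL below was the canonical Flannel manifest at the time of writing):

kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml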

Node issue: etcd fails at startup

Symptoms:
– any kubectl command fails, with no error message
– kubelet is running but shows error messages at startup related to etcd and kube-api
docker ps -a shows that etcd as well as kube-api-server are started and exit with an error code, in an endless loop.
etcd logs show error messages such as:
kubernetes etcdmain: listen tcp FOO-NODE-IP:2380: bind: cannot assign requested address
kube-api-server logs show error messages such as: ….

Solution (in order of application):
– check that the etcd yaml deployment for k8s doesn't use a URL/hostname distinct from the node IP/resolved hostname. If the node has a dynamically allocated IP, the issue lies there.
To know whether the issue is a host/node mismatch, compare the IP returned by ifconfig | grep -5 eth0 with the IPs/hostnames specified in /etc/kubernetes/manifests/etcd.yaml for these attributes (non-exhaustive; a grep sketch follows the list):
--advertise-client-urls
--initial-advertise-peer-urls
--listen-client-urls
--listen-peer-urls
As for --initial-cluster, its value should be the node hostname.
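A hedged sketch of that comparison:

# IP of the node as seen by the OS
ifconfig | grep -5 eth0
# addresses declared in the etcd static pod manifest
grep -E 'advertise-client-urls|initial-advertise-peer-urls|listen-client-urls|listen-peer-urls|initial-cluster' /etc/kubernetes/manifests/etcd.yaml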

– if that does not work, reset the node configuration (e.g. kubeadm reset, then initialize/join again).

Master node issue: the certificate part of the kube config in $HOME/.kube/config is not updated/correct

Symptoms:
On the master node, I started the cluster without error, but when I query the cluster (from the master) I get the following error message:
Unable to connect to the server: x509: certificate signed by unknown authority.
In that case, most kubectl get queries return that error (e.g. kubectl get nodes or kubectl get pods).

Solution (in order of application): check the certificate correctness

– use the alpha certs command to verify date validity:
sudo kubeadm alpha certs check-expiration

– check the whole certificate (certificate-authority-data) in $HOME/.kube/config:
echo CERTIFICATE_VALUE | base64 -d | openssl x509 -text -noout

– If all of that is OK, the problem is probably that the certificates are valid date-wise but don't match those stored in the pki folder of /etc/kubernetes.
We can check that by decoding the certificate part stored in the .kube/config file and comparing that value with the content of /etc/kubernetes/pki/ca.crt:
echo CERTIFICATE_VALUE_IN_HOME_KUBE_CONFIG_FILE | base64 -d | diff - /etc/kubernetes/pki/ca.crt
If the contents differ, we should update the .kube/config file by applying step 4) above: "Follow instructions provided by the kubeadm init command output" (recalled below).
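For reference, the instructions printed by kubeadm init are typically:

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config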

Worker node issue: join with an incorrect cluster URL

Symptoms:
On a node machine, when I try to join a cluster, I get the following error message:
request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers).

Solution (in order of application): check the provided URL's correctness and connectivity

If you have access to the master node:
– retrieve the master URL with kubectl cluster-info and compare it to the URL provided in the join subcommand.

On the worker node:

– ping the API Server IP address

– curl the API Server URL to check that the server responds. We use an insecure connection (that doesn't matter here); if everything is fine, a 403 should be returned:
curl -v --insecure https://API_SERVER_IP:6443
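If the URL checks out, a reminder of the shape of the join command; TOKEN and HASH are placeholders that the master can regenerate with kubeadm token create --print-join-command:

kubeadm join API_SERVER_IP:6443 --token TOKEN --discovery-token-ca-cert-hash sha256:HASH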

Worker node issue: incorrect date on the worker OS

Symptoms:
On a node machine, I try to join a cluster; although I checked that the API server certificate is valid, I get the following error message while connecting to the API server:
certificate has expired or is not yet valid.

Solution (in order of application): check the system date on the candidate worker OS:
timedatectl

If the date/time is not correct, apply the following on the candidate worker OS:
– Set the system clock to the current datetime: ntpd -g -q

– Set the hardware clock from the system clock: hwclock -wu
– Restart the OS
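On systemd-based machines, a hedged alternative that enables continuous NTP synchronization instead of a one-shot adjustment:

sudo timedatectl set-ntp true
# check that the system clock is now reported as synchronized
timedatectl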

No kube client context configured on the machine

Symptoms:
kubectl anyCommand prints the error message: "The connection to the server localhost:8080 was refused - did you specify the right host or port?"

Solution (in order of application):
– Check whether the client config is empty: kubectl config view
– Copy the master's client config to the machine where kubectl fails:

mkdir ~/.kube
vim ~/.kube/config
and paste the content of the config file from the master node into the node's config file (or see the scp alternative below).
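A hedged alternative that avoids manual copy/paste, assuming SSH access to the master (MASTER_NODE is a placeholder):

mkdir -p ~/.kube
scp root@MASTER_NODE:/etc/kubernetes/admin.conf ~/.kube/config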

Worker node issue seen from kubectl (master): node offline or node iptables incorrectly configured

Symptoms:
Some kubectl commands performed from the master node against resources deployed on a specific node (for example: getting logs) return errors such as:
Error: 'dial tcp 192.168.0.5:3000: connect: no route to host'
where 192.168.0.5 is the node IP and 3000 a port (bound by kubelet or any k8s app) that should be open on the node.
Solution (in order of application):
– Ensure that the host is up and reachable from the master node.
Pinging the IP may help.
– If the host is up and reachable from the master node, the problem is probably that the node's iptables rules block the input connection.
Look for a REJECT rule with such a pattern:
iptables -L | grep "reject-with icmp-host-prohibited"
If one is found, delete the matching rules:

sudo iptables -D  INPUT -j REJECT --reject-with icmp-host-prohibited
sudo iptables -D  FORWARD -j REJECT --reject-with icmp-host-prohibited

To make the changes persistent after a reboot, we can store them in /etc/sysconfig/iptables and restore them at startup with /sbin/iptables-restore:

# save changes
iptables-save > /etc/sysconfig/iptables
# ip6tables is not necessarily required if we didn't change it
ip6tables-save > /etc/sysconfig/ip6tables

# load changes: add this command to the machine startup script
/sbin/iptables-restore < /etc/sysconfig/iptables

Or, more simply, use the iptables service, which handles the loading at startup for us: /sbin/service iptables save
