28 February 2018

Debugging your Kubernetes cluster

Whether you are a developer or an operations person, Kubernetes is an integral part of your daily routine if you are working with a microservice architecture. Therefore, it is always preferable to do all your research and development on a test cluster before rolling anything out to an enterprise environment.

If you have not set up your own cluster yet, here is an article that will walk you through launching your own k8s cluster in different ways.

There are many possible failures one can encounter in a k8s cluster, so here is a basic approach one can follow to debug them.

 1. If a node is going into NotReady state, as below

 $ kubectl get nodes
   NAME          STATUS   ROLES    AGE    VERSION
   controlPlane  Ready    master   192d   v1.12.10+1.0.15.el7
   worker1       Ready    <none>   192d   v1.12.10+1.0.14.el7
   worker2       NotReady <none>   192d   v1.12.10+1.0.14.el7

 # check if any of your cluster components are unhealthy or have failed
 $ kubectl get pods -n kube-system
   NAME                                    READY   STATUS    RESTARTS   AGE
   calico-kube-controllers-bc8f7d57-p5vhk  1/1     Running   1          192d
   calico-node-9xtfr                       1/1     NodeLost  1          192d
   calico-node-tpjz9                       1/1     Running   1          192d
   calico-node-vh766                       1/1     Running   1          192d
   coredns-bb49df795-9fn9g                 1/1     Running   1          192d
   coredns-bb49df795-qq6cm                 1/1     Running   1          192d
   etcd-bld09758002                        1/1     Running   1          192d
   kube-apiserver-bld09758002              1/1     Running   1          192d
   kube-controller-manager-bld09758002     1/1     Running   1          192d
   kube-proxy-57n8h                        1/1     NodeLost  1          192d
   kube-proxy-gvbkh                        1/1     Running   1          192d
   kube-proxy-tzknm                        1/1     Running   1          192d
   kube-scheduler-bld09758002              1/1     Running   1          192d

 # describe the node to check what is causing the issue
 $ kubectl describe node nodeName
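   for example, to look at just the node conditions of the NotReady node (worker2 from the output above):
 $ kubectl describe node worker2 | grep -A 12 Conditions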
 
 # ssh to node and ensure kubelet/docker services are running
 $ sudo systemctl status kubelet
 $ sudo systemctl status docker

 # troubleshoot the services in depth if they are not running
 $ sudo journalctl -u kubelet
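   for example, to follow only the recent kubelet logs (the time window below is just an example):
 $ sudo journalctl -u kubelet --since "30 min ago" -f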

 # if the services keep failing frequently, try reloading the systemd daemon
 $ sudo systemctl daemon-reload
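   and then restart the services, for example:
 $ sudo systemctl restart docker
 $ sudo systemctl restart kubelet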


 
 Things to remember - in the kube-system namespace of every cluster:

 kube-apiserver            1 pod per master
 kube-controller-manager   1 pod per master
 kube-scheduler            1 pod per master
 etcd                      1 pod per master

 coredns                   2 pods per cluster, can run on any master/worker

 calico (network plugin)   1 pod on every node, master & worker (enables networking between pods)

 kube-proxy                1 pod on every node, master & worker (enables networking between nodes)
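
 To see which node each of these pods is actually running on in your cluster:

 $ kubectl get pods -n kube-system -o wide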

Some common failures


 # If kubeadm fails with
   [ERROR CRI]: container runtime is not running:
 $ rm /etc/containerd/config.toml
 $ systemctl restart containerd
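   and then confirm the runtime came back up, for example:
 $ systemctl status containerd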

 # If the kubelet service fails with a status like -
   Active: activating (auto-restart) (Result: exit-code) since Wed 2019-06-28 1
   first check the "sudo journalctl -xeu kubelet" logs; if they say - 
            failed to load kubelet config file
   check the drop-in file that kubelet loads while activating the service

   $ sudo systemctl status kubelet

   if it is not loading /etc/systemd/system/kubelet.service.d, correct it,
   or remove any other file that it is attempting to load, then reload the daemon and restart the service.

   $ systemctl daemon-reload
   $ systemctl restart kubelet

  also ensure kubelet, kubeadm, kubernetes-cni, kubernetes-cni-plugins & kubectl are of the same minor version
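   a quick way to compare the installed versions (output format can vary slightly between releases):
 $ kubelet --version
 $ kubeadm version -o short
 $ kubectl version --client --short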

 # If it fails because of an older/unsupported docker version, update docker. Package names by version:
   docker.io      (used for older versions, 1.10.x)
   docker-engine  (used before 1.13.x)
   docker-ce      (used for higher versions, since 17.03)
 $ apt-get install docker-engine    # or docker-ce / docker.io, depending on the version you need
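   afterwards, confirm the docker daemon version, for example:
 $ docker version --format '{{.Server.Version}}'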

 # If it fails with 
   Unable to connect to the server: net/http: 
   request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
 $ kubectl version   # check the kubectl client and server versions - they shouldn't mismatch

 # If it fails with
   [WARNING Service-Kubelet]: kubelet service is not enabled, 
   please run 'systemctl enable kubelet.service'
 $ systemctl enable kubelet.service

 # If it fails with
   [ERROR Swap]: running with swap on is not supported. Please disable swap.
 $ swapoff -a
 $ sed -i '/ swap / s/^/#/' /etc/fstab
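   to verify swap is now off (this should print nothing):
 $ swapon --show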

 # If it fails with
   [ERROR NumCPU]: the number of available CPUs 1 is less than the required 2
   run the command with the flag --ignore-preflight-errors=NumCPU
   this will simply skip the check. Please note, this is OK to use in dev/test but probably not in production. 

 # Run again 
 $ kubeadm init --apiserver-advertise-address=MasterIP \ 
   --pod-network-cidr=192.168.0.0/16 \ 
   --ignore-preflight-errors=NumCPU

 # If you get the error
   The connection to the server localhost:8080 was refused - did you specify the right host or port?
   or running 'kubectl version' prints only the client version and shows the above error for the server version -
   you get this error when there is no context configured in your client
 $ kubectl config view
   apiVersion: v1
   clusters: []
   contexts: []
   current-context: ""
   kind: Config
   preferences: {}
   users: []

   you might have forgotten to run the commands below
   
 $ mkdir -p $HOME/.kube
 $ sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
 $ sudo chown $(id -u):$(id -g) $HOME/.kube/config
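   then confirm the context is picked up, for example:
 $ kubectl config current-context
 $ kubectl get nodes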

 # Make a note - to keep the k8s cluster functional, the following containers should be up & running at all times
 - kube-controller-manager, kube-scheduler, etcd, kube-apiserver, kube-proxy + the network containers (flannel/calico)

 # Some magic commands -
   systemctl stop kubelet
   systemctl stop docker
   iptables --flush
   iptables -t nat --flush
   systemctl start kubelet
   systemctl start docker

   journalctl -xeu kubelet | less

 # Also, ensure the following ports are open on the masterNode -
   sudo firewall-cmd --permanent --add-port=6443/tcp
   sudo firewall-cmd --permanent --add-port=2379-2380/tcp
   sudo firewall-cmd --permanent --add-port=10250/tcp
   sudo firewall-cmd --permanent --add-port=10251/tcp
   sudo firewall-cmd --permanent --add-port=10252/tcp
   sudo firewall-cmd --permanent --add-port=10255/tcp
   sudo firewall-cmd --reload

   Ensure the following ports are open on the workerNode -
   sudo firewall-cmd --permanent --add-port=10251/tcp
   sudo firewall-cmd --permanent --add-port=10255/tcp
   sudo firewall-cmd --reload

 # to check whether bridge traffic is flowing through the firewall - both should return 1

   cat /proc/sys/net/bridge/bridge-nf-call-iptables
   cat /proc/sys/net/bridge/bridge-nf-call-ip6tables
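
   if either of these returns 0, you can enable it like below (a quick sketch - make it persistent via a file under /etc/sysctl.d/):

   sudo modprobe br_netfilter
   sudo sysctl net.bridge.bridge-nf-call-iptables=1
   sudo sysctl net.bridge.bridge-nf-call-ip6tables=1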

 # if a namespace is stuck in the Terminating state

   kubectl get ns mynamespace -o json > ns.json

   edit the ns.json file and empty the spec.finalizers array - remove the "kubernetes" entry so that it reads []

   kubectl proxy

   open another terminal and run the command below

   curl -k -H "Content-Type: application/json" -X PUT --data-binary @ns.json http://127.0.0.1:8001/api/v1/namespaces/mynamespace/finalize

   boom - the namespace should be gone by now
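
   on newer kubectl releases the same finalize call can be made without kubectl proxy - a sketch, assuming your kubectl supports the --raw flag:

   kubectl replace --raw "/api/v1/namespaces/mynamespace/finalize" -f ./ns.json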


if you want to be able to restore your cluster to a previously working state, take a backup of the etcd data directory, which acts as the cluster database holding all the cluster resource data and lives under /var/lib/etcd


fix the issue, restore the database, and then restart the etcd container.
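
if your cluster was built with kubeadm, a sketch of taking and restoring an etcd snapshot could look like below (the certificate paths are the kubeadm defaults and the snapshot path is just an example - adjust both for your setup):

   sudo ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
     --cacert=/etc/kubernetes/pki/etcd/ca.crt \
     --cert=/etc/kubernetes/pki/etcd/server.crt \
     --key=/etc/kubernetes/pki/etcd/server.key \
     snapshot save /var/backups/etcd-snapshot.db

   # later, restore the snapshot into a fresh data directory and point etcd at it
   sudo ETCDCTL_API=3 etcdctl snapshot restore /var/backups/etcd-snapshot.db \
     --data-dir=/var/lib/etcd-restore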

Troubleshooting your application deployed on your cluster

Apart from running kubectl logs, you should investigate what is preventing your application from running; you are probably hitting one of the following - 
  • The image can't be pulled
  • there's a missing secret or volume
  • no space in the cluster to schedule the workload
  • taints or affinity rules preventing the pod from being scheduled
in such cases we have a magic command (one worth getting tattooed, since there is no short form for it) that may help to investigate the issue


   kubectl get events --sort-by=.metadata.creationTimestamp -A
 
   kubectl top node
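
   To narrow things down further, you can filter warning events in the application's namespace and describe the failing pod (myNamespace and myPod below are placeholders):

   kubectl get events -n myNamespace --field-selector type=Warning
   kubectl describe pod myPod -n myNamespace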



happy troubleshooting...
