Briefing
During a routine check of our official EOS nodes on the main-net, we found that one full node had been evicted and could not be launched again. In effect, the node was shut down and not serving.
Symptom
Status of the Pod
kubectl get po -o wide
kubectl describe pod <pod-name>
One funny detail stood out in the pod events:
0/2 nodes are available: 1 node(s) had disk pressure, 1 node(s) had taints that the pod didn't tolerate.
Note: EOS9CAT currently runs 1 x master node (where pods are not tolerated because of the master taint) + 1 x worker node.
Reason
The kubelet needs to preserve node stability when available compute resources are low. This is especially important when dealing with incompressible compute resources, such as memory or disk space. If such resources are exhausted, nodes become unstable.
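When those thresholds are crossed, the kubelet starts evicting pods, which is exactly the disk-pressure condition we hit. As a rough sketch (not our exact settings, these are close to the kubelet defaults), the disk-related thresholds live in the KubeletConfiguration under evictionHard:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "100Mi"   # evict when free memory drops below 100Mi
  nodefs.available: "10%"     # evict when the node filesystem falls below 10% free
  imagefs.available: "15%"    # evict when the image filesystem falls below 15% free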
Running 2 x full nodes + 1 x BP node on one server appears to have exhausted the disk I/O (even on SSDs).
According to EOS9CAT's monitoring, now that the EOS main-net has been live for more than a month, more and more transactions are being packed into the blocks.
The nodes now require much higher network bandwidth and disk I/O than we expected before the EOS launch day.
Workaround
1. Enable the master node to hold the pods
Reference: Creating a single master cluster with kubeadm - Master Isolation
kubectl taint nodes --all node-role.kubernetes.io/master-
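To double-check that the master taint is gone (the node name is illustrative):
kubectl describe node <your-master-name> | grep Taints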
2. Re-configure the Persistent Volume for the master node
Change the NFS server address so that it points to the master node
nfs:
  # FIXME: use the right IP
  server: <nfs server ip address>
  path: "path/to/folder"
kubectl create -f <pv.yaml>
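For context, a minimal PersistentVolume manifest wrapping that nfs section could look like the sketch below; the name, capacity, and access mode are assumptions rather than our production values:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: eos-node-pv                      # hypothetical name
spec:
  capacity:
    storage: 500Gi                       # assumed size, adjust to the chain data
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  nfs:
    # FIXME: use the right IP
    server: <nfs server ip address>
    path: "path/to/folder"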
3. Transfer the snapshot to the master node
- stop the running node
- copy all the blocks/ and state/ folders to the node's data folder on the master node
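A rough sketch of that transfer, assuming the full node runs as a deployment named eos-fullnode and that the paths below are placeholders:
# scale down so nodeos stops cleanly before copying (name is hypothetical)
kubectl scale deployment eos-fullnode --replicas=0
# copy the chain data from the old NFS export to the master node's folder
rsync -a /old/nfs/path/blocks/ /path/to/folder/blocks/
rsync -a /old/nfs/path/state/ /path/to/folder/state/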
4. Label each node and add the nodeSelector into the pod yaml file
Reference: Assign Pods to Nodes
kubectl label nodes <your-master-name> role=master
kubectl label nodes <your-node1-name> role=node1
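You can verify that the labels were applied with:
kubectl get nodes --show-labels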
Add the {.spec.nodeSelector} into the deployment yaml file:
nodeSelector:
  role: master
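In context, the selector sits under the pod template spec of the deployment; the names and image below are placeholders, not our actual manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: eos-fullnode                     # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: eos-fullnode
  template:
    metadata:
      labels:
        app: eos-fullnode
    spec:
      nodeSelector:
        role: master                     # schedule this pod on the labeled master
      containers:
        - name: nodeos
          image: <your-eos-image>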
5. Deploy the deployment in Kubernetes
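Applying the manifest and checking where the pod landed:
kubectl apply -f <deployment.yaml>
kubectl get po -o wide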
6. The I/O charts from each node after the change:
- master node
- node1 node
Conclusion
- Fully utilizing the resources of the bare-metal server gives us more stability.
- A good PV (PersistentVolume) design can improve the overall performance of pod fail-over.
- EOS synchronization requires robust network bandwidth and disk I/O, especially when several nodes share the resources of one bare-metal server.
Contact/About us
If you are an advanced blockchain user, feel free to use any of those tools that you are comfortable with.
If you like what we do and believe in EOS9CAT, vote for eosninecatbp! We are waiting for your support. Have a question? Send us an email or visit our website.
FOLLOW US on Facebook, Telegram, Medium, SteemIt, Github, Meetup EOS9CAT, Reddit, Twitter, and LinkedIn.