
kubernetes – kubelet losing contact leaves PODs stuck in not ready


This week we hit a k8s bug in production; the analysis of the whole problem follows.

Problem

Restarting the network on a kubelet node left every POD on that node in Ready: False, i.e. unable to serve traffic.

The k8s version in production is 1.15.5. At first I was reading a different, newer 1.15 release whose code structure is quite different, so I dug around for ages without figuring anything out...

Node taints

A node normally carries no taints; once it loses contact, two taints come into play:

  • node.kubernetes.io/not-ready: Node is not ready. This corresponds to the NodeCondition Ready being “False”.

  • node.kubernetes.io/unreachable: Node is unreachable from the node controller. This corresponds to the NodeCondition Ready being “Unknown”.

When these node taints are updated, the node's corresponding Condition is updated along with them (here we only care about NodeReady):

	taintKeyToNodeConditionMap = map[string]v1.NodeConditionType{
		schedulerapi.TaintNodeNotReady:           v1.NodeReady,
		schedulerapi.TaintNodeUnreachable:        v1.NodeReady,
		schedulerapi.TaintNodeNetworkUnavailable: v1.NodeNetworkUnavailable,
		schedulerapi.TaintNodeMemoryPressure:     v1.NodeMemoryPressure,
		schedulerapi.TaintNodeDiskPressure:       v1.NodeDiskPressure,
		schedulerapi.TaintNodePIDPressure:        v1.NodePIDPressure,
	}

(pkg/controller/nodelifecycle/node_lifecycle_controller.go:107)

Put simply: node.kubernetes.io/not-ready goes together with setting the NodeReady Condition to False, and node.kubernetes.io/unreachable goes together with setting it to Unknown. Both carry the NoExecute effect, i.e. the PODs on the node are to be evicted immediately.
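To make the two taints tangible, here is a minimal sketch I put together (my own illustration, not the controller's source), assuming only the standard k8s.io/api/core/v1 types; the variable names mirror the NotReadyTaintTemplate/UnreachableTaintTemplate seen in the code excerpts further down:

// A minimal sketch: the two NoExecute taints the lifecycle controller works with.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

var (
	// Applied when the NodeReady condition is "False".
	notReadyTaintTemplate = &v1.Taint{
		Key:    "node.kubernetes.io/not-ready",
		Effect: v1.TaintEffectNoExecute,
	}
	// Applied when the NodeReady condition is "Unknown".
	unreachableTaintTemplate = &v1.Taint{
		Key:    "node.kubernetes.io/unreachable",
		Effect: v1.TaintEffectNoExecute,
	}
)

func main() {
	// NoExecute means pods without a matching toleration are evicted immediately.
	for _, t := range []*v1.Taint{notReadyTaintTemplate, unreachableTaintTemplate} {
		fmt.Printf("%s:%s\n", t.Key, t.Effect)
	}
}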

node lifecycle controller

The node lifecycle controller inside the Controller Manager periodically checks the state of every node; the default interval is 5 seconds.

	// Incorporate the results of node health signal pushed from kubelet to master.
	go wait.Until(func() {
		if err := nc.monitorNodeHealth(); err != nil {
			klog.Errorf("Error monitoring node health: %v", err)
		}
	}, nc.nodeMonitorPeriod, stopCh)

The lifecycle controller already watches node changes in etcd and syncs their state into memory; the timer above periodically scans that in-memory node information and acts on it.

The function's comment already gives away some key information:

// monitorNodeHealth verifies node health are constantly updated by kubelet, and
// if not, post "NodeReady==ConditionUnknown".
// For nodes who are not ready or not reachable for a long period of time.
// This function will taint them if TaintBasedEvictions feature was enabled.
// Otherwise, it would evict it directly.
func (nc *Controller) monitorNodeHealth() error {
	// We are listing nodes from local cache as we can tolerate some small delays
	// comparing to state from etcd and there is eventual consistency anyway.

A quick summary:

1) The lifecycle controller keeps an in-memory copy of the node data in etcd, synced in real time; that data is what the kubelet updates.

2) Normally the kubelet updates the node object periodically, so the lifecycle controller passively picks up those changes (think of it as a heartbeat).

3) But if the kubelet stops reporting node status for a while, the lifecycle controller's periodic check marks the node as Unknown.

4) If the Unknown state persists long enough, the lifecycle controller taints the node and evicts the PODs on it (a toy sketch of this cycle follows the list).
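Here is that toy sketch (my own illustration, not the controller's code): scan the cached nodes on a timer, mark stale ones Unknown, and taint the ones that stay Unknown on the next pass.

// A toy sketch of the cycle described in 1)-4) above.
package main

import (
	"fmt"
	"time"
)

type cachedNode struct {
	name          string
	lastHeartbeat time.Time // updated whenever kubelet posts node status
	ready         string    // "True", "False" or "Unknown"
	tainted       bool
}

func monitorOnce(nodes []*cachedNode, gracePeriod time.Duration, now time.Time) {
	for _, n := range nodes {
		if now.After(n.lastHeartbeat.Add(gracePeriod)) {
			if n.ready == "Unknown" {
				n.tainted = true // a later pass: node is still Unknown, add the NoExecute taint
			}
			n.ready = "Unknown" // first pass: heartbeat is stale, mark the node Unknown
		} else {
			n.ready = "True"
			n.tainted = false
		}
	}
}

func main() {
	node := &cachedNode{name: "10.42.118.62", lastHeartbeat: time.Now().Add(-time.Minute), ready: "True"}
	for i := 0; i < 2; i++ {
		monitorOnce([]*cachedNode{node}, 40*time.Second, time.Now())
		fmt.Printf("round %d: ready=%s tainted=%v\n", i+1, node.ready, node.tainted)
	}
}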

So how long does a node have to go without an update before it counts as Unknown? There is a comment in the lifecycle code about this:

	// Controller will not proactively sync node health, but will monitor node
	// health signal updated from kubelet. There are 2 kinds of node healthiness
	// signals: NodeStatus and NodeLease. NodeLease signal is generated only when
	// NodeLease feature is enabled. If it doesn't receive update for this amount
	// of time, it will start posting "NodeReady==ConditionUnknown". The amount of
	// time before which Controller start evicting pods is controlled via flag
	// 'pod-eviction-timeout'.
	// Note: be cautious when changing the constant, it must work with
	// nodeStatusUpdateFrequency in kubelet and renewInterval in NodeLease
	// controller. The node health signal update frequency is the minimal of the
	// two.
	// There are several constraints:
	// 1. nodeMonitorGracePeriod must be N times more than the node health signal
	//    update frequency, where N means number of retries allowed for kubelet to
	//    post node status/lease. It is pointless to make nodeMonitorGracePeriod
	//    be less than the node health signal update frequency, since there will
	//    only be fresh values from Kubelet at an interval of node health signal
	//    update frequency. The constant must be less than podEvictionTimeout.
	// 2. nodeMonitorGracePeriod can't be too large for user experience - larger
	//    value takes longer for user to see up-to-date node health.
	nodeMonitorGracePeriod time.Duration

This value (nodeMonitorGracePeriod) defaults to 40 seconds.
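A small sketch of the constraint spelled out in the comment above, using what I understand to be the upstream defaults (the kubelet heartbeat interval and pod-eviction-timeout values here are assumptions, not taken from this incident): the 40s grace period allows roughly four missed 10s heartbeats and stays well below the 5m eviction timeout.

// Sketch only: the timing relationship between the three knobs.
package main

import (
	"fmt"
	"time"
)

const (
	nodeStatusUpdateFrequency = 10 * time.Second // kubelet: how often node status is posted (assumed default)
	nodeMonitorGracePeriod    = 40 * time.Second // controller-manager: staleness threshold
	podEvictionTimeout        = 5 * time.Minute  // controller-manager: --pod-eviction-timeout (assumed default)
)

func main() {
	retries := nodeMonitorGracePeriod / nodeStatusUpdateFrequency
	fmt.Printf("kubelet gets %d missed heartbeats before the node goes Unknown\n", retries)
	fmt.Println(nodeMonitorGracePeriod < podEvictionTimeout) // required by the comment above
}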

When the network on the kubelet's node was restarted, the apiserver logged the disconnection:

02:12:57.866468   10697 writers.go:163] apiserver was unable to write a JSON response: http2: stream closed 

The timestamp is 02:12:57.

A while later, in the controller-manager log (node lifecycle controller), we can see the lifecycle controller change the node's status:

I0510 02:13:17.033201 384937 controller_utils.go:180] Recording status change NodeNotReady event message for node 10.42.118.62 

The timestamp is 02:13:17, 20 seconds after 02:12:57. The 40-second window is measured from the kubelet's last successful status update, so by the time the lifecycle controller scanned the node it found the heartbeat had been stale for 40 seconds.

Once there has been no heartbeat for more than 40 seconds ("Kubelet stopped posting node status"), the tryUpdateNodeHealth method updates the node's Conditions to Unknown (the excerpt below shows the NodeReady part):

	if nc.now().After(savedNodeHealth.probeTimestamp.Add(gracePeriod)) {
		// NodeReady condition or lease was last set longer ago than gracePeriod, so
		// update it to Unknown (regardless of its current value) in the master.
		// (branch for a node that has never reported a Ready condition elided)
		klog.V(4).Infof("node %v hasn't been updated for %+v. Last ready condition is: %+v",
			node.Name, nc.now().Time.Sub(savedNodeHealth.probeTimestamp.Time), observedReadyCondition)
		if observedReadyCondition.Status != v1.ConditionUnknown {
			currentReadyCondition.Status = v1.ConditionUnknown
			currentReadyCondition.Reason = "NodeStatusUnknown"
			currentReadyCondition.Message = "Kubelet stopped posting node status."
			// LastProbeTime is the last time we heard from kubelet.
			currentReadyCondition.LastHeartbeatTime = observedReadyCondition.LastHeartbeatTime
			currentReadyCondition.LastTransitionTime = nc.now()
		}
	}

After this update, currentReadyCondition is Unknown, i.e. the node is considered lost.
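Just to make the effect visible, here is a hedged sketch (my own, not from the article) of how one could read back the Ready condition that was just rewritten, assuming a client-go release contemporary with k8s 1.15 (the pre-context Get signature); the kubeconfig path and node name are placeholders.

// Sketch only: inspect a node's Ready condition with client-go.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config") // placeholder path
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	node, err := client.CoreV1().Nodes().Get("10.42.118.62", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	for _, cond := range node.Status.Conditions {
		if cond.Type == v1.NodeReady {
			// After the controller's update this shows Status=Unknown,
			// Reason=NodeStatusUnknown, with the old LastHeartbeatTime preserved.
			fmt.Printf("Ready=%s reason=%s heartbeat=%v\n", cond.Status, cond.Reason, cond.LastHeartbeatTime)
		}
	}
}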

Next, since the node's state has just transitioned from ready: True to not-True (Unknown in this case), the lost node gets a round of handling:

			// Report node event.
			if currentReadyCondition.Status != v1.ConditionTrue && observedReadyCondition.Status == v1.ConditionTrue {
				nodeutil.RecordNodeStatusChange(nc.recorder, node, "NodeNotReady")
				if err = nodeutil.MarkAllPodsNotReady(nc.kubeClient, node); err != nil {
					utilruntime.HandleError(fmt.Errorf("Unable to mark all pods NotReady on node %v: %v", node.Name, err))
				}
			}

1) A NodeNotReady Event is recorded.

2) The Ready Condition of every POD on the node is set to False, which is why we see the following logs (a rough sketch of this marking step follows the logs):

May 10 02:13:17 10-19-143-199 hyperkube[384937]: I0510 02:13:17.033353 384937 event.go:258] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"10.42.118.62", UID:"f41af641-3f22-45a6-bff8-49945179eb78", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeNotReady' Node 10.42.118.62 status is now: NodeNotReady 
May 10 02:13:17 10-19-143-199 hyperkube[384937]: I0510 02:13:17.033228 384937 controller_utils.go:124] Update ready status of pods on node [10.42.118.62]
May 10 02:13:17 10-19-143-199 hyperkube[384937]: I0510 02:13:17.184202 384937 controller_utils.go:144] Updating ready status of pod app-api-smzdm-com-64f9fbd859-mrp6k to false
May 10 02:13:17 10-19-143-199 hyperkube[384937]: I0510 02:13:17.188499 384937 controller_utils.go:144] Updating ready status of pod bannerservice-smzdm-com-58476c8f4d-ct5h4 to false
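For intuition, what the marking step boils down to can be sketched on a bare v1.Pod object (my simplification, not nodeutil.MarkAllPodsNotReady itself; the real helper lists the pods on the node and writes the changed status back to the apiserver):

// A rough sketch: force a pod's Ready condition to False.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func markPodNotReady(pod *v1.Pod) {
	for i, cond := range pod.Status.Conditions {
		if cond.Type == v1.PodReady {
			pod.Status.Conditions[i].Status = v1.ConditionFalse
		}
	}
}

func main() {
	pod := &v1.Pod{
		Status: v1.PodStatus{
			Conditions: []v1.PodCondition{{Type: v1.PodReady, Status: v1.ConditionTrue}},
		},
	}
	markPodNotReady(pod)
	fmt.Println(pod.Status.Conditions[0].Status) // False
}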

Five seconds after this round of checks, the next monitor round starts. By now the node's state is already Unknown, so a different branch is taken and the node gets tainted:

if observedReadyCondition.Status == v1.ConditionUnknown {
	if nc.useTaintBasedEvictions {
		// We want to update the taint straight away if Node is already tainted with the UnreachableTaint
		if taintutils.TaintExists(node.Spec.Taints, NotReadyTaintTemplate) {
			taintToAdd := *UnreachableTaintTemplate
			if !nodeutil.SwapNodeControllerTaint(nc.kubeClient, []*v1.Taint{&taintToAdd}, []*v1.Taint{NotReadyTaintTemplate}, node) {
				klog.Errorf("Failed to instantly swap UnreachableTaint to NotReadyTaint. Will try again in the next cycle.")
			}
		} else if nc.markNodeForTainting(node) {
			klog.V(2).Infof("Node %v is unresponsive as of %v. Adding it to the Taint queue.",
				node.Name,
				decisionTimestamp,
			)
		}

That is, a NoExecute taint is put on the node (node.kubernetes.io/unreachable when the Ready condition is Unknown, node.kubernetes.io/not-ready when it is False), with the intent of evicting the PODs from the node immediately.

The corresponding log output also matches this expectation:

May 10 02:13:22 10-19-143-199 hyperkube[384937]: I0510 02:13:22.509023  384937 node_lifecycle_controller.go:759] Node 10.42.118.62 is unresponsive as of 2020-05-10 02:13:22.508996764 +0800 CST m=+1658350.928182036. Adding it to the Taint queue.
May 10 02:13:22 10-19-143-199 hyperkube[384937]: I0510 02:13:22.509776  384937 node_lifecycle_controller.go:759] Node 10.42.163.43 is unresponsive as of 2020-05-10 02:13:22.509768722 +0800 CST m=+1658350.928953992. Adding it to the Taint queue.

The timestamp is 02:13:22, roughly 5 seconds after the previous monitor round at 02:13:17.

Another 5 seconds later, the monitor cycle runs again, sees that the kubelet has since updated the node's heartbeat (synced in passively), and therefore starts removing the taints:

May 10 02:13:27 10-19-143-199 hyperkube[384937]: I0510 02:13:27.511828  384937 node_lifecycle_controller.go:784] Node 10.42.163.43 is healthy again, removing all taints
May 10 02:13:32 10-19-143-199 hyperkube[384937]: I0510 02:13:32.523365  384937 node_lifecycle_controller.go:784] Node 10.42.118.62 is healthy again, removing all taints

The timestamps are 02:13:27 and 02:13:32 (02:13:32 for the node in question, 10.42.118.62), and the code is:

if observedReadyCondition.Status == v1.ConditionTrue {
	if nc.useTaintBasedEvictions {
		removed, err := nc.markNodeAsReachable(node)
		if err != nil {
			klog.Errorf("Failed to remove taints from node %v. Will retry in next iteration.", node.Name)
		}
		if removed {
			klog.V(2).Infof("Node %s is healthy again, removing all taints", node.Name)
		}
	} else {
		if nc.cancelPodEviction(node) {
			klog.V(2).Infof("Node %s is ready again, cancelled pod eviction", node.Name)
		}
	}
}

1) markNodeAsReachable removes the taints from the node and logs "healthy again" (a rough sketch of what that amounts to follows the list).

2) But the PODs on the node are never set back to ready: True.
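As a rough idea of what "removing all taints" amounts to here (my own sketch, not markNodeAsReachable; the real method removes the taints through the apiserver), the lifecycle NoExecute taint keys simply disappear from node.Spec.Taints:

// Sketch only: strip the two lifecycle NoExecute taints from a node's spec.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func removeLifecycleTaints(node *v1.Node) {
	kept := node.Spec.Taints[:0]
	for _, t := range node.Spec.Taints {
		if t.Key == "node.kubernetes.io/not-ready" || t.Key == "node.kubernetes.io/unreachable" {
			continue // drop the lifecycle controller's NoExecute taints
		}
		kept = append(kept, t)
	}
	node.Spec.Taints = kept
}

func main() {
	node := &v1.Node{Spec: v1.NodeSpec{Taints: []v1.Taint{
		{Key: "node.kubernetes.io/unreachable", Effect: v1.TaintEffectNoExecute},
		{Key: "dedicated", Value: "gpu", Effect: v1.TaintEffectNoSchedule},
	}}}
	removeLifecycleTaints(node)
	fmt.Println(len(node.Spec.Taints)) // 1: only the unrelated taint remains
}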

Further analysis

By rights, the node had carried the NoExecute taint, even if it was removed shortly afterwards; that change should still have been observed and should have triggered eviction of the PODs.

And indeed, if the PODs had been evicted, they would never have been left sitting at ready: False. So why did no eviction happen?

Looking at the pod's tolerations, I found that the pod has a toleration window for the NoExecute taints above:

    tolerations:
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300

In other words, the pod tolerates the taints for 5 minutes, which is why the PODs were not evicted. That is the whole story.
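To make the "tolerated for 5 minutes" point concrete, here is a hedged sketch using the Toleration.ToleratesTaint helper from k8s.io/api/core/v1: the taint matches the toleration, so eviction is deferred by tolerationSeconds (300s) instead of happening immediately, and since the taint was removed within a few seconds the eviction never fired.

// Sketch only: the toleration matches the NoExecute taint, deferring eviction.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func main() {
	seconds := int64(300)
	toleration := v1.Toleration{
		Key:               "node.kubernetes.io/not-ready",
		Operator:          v1.TolerationOpExists,
		Effect:            v1.TaintEffectNoExecute,
		TolerationSeconds: &seconds,
	}
	taint := v1.Taint{
		Key:    "node.kubernetes.io/not-ready",
		Effect: v1.TaintEffectNoExecute,
	}
	if toleration.ToleratesTaint(&taint) {
		// Eviction would only happen TolerationSeconds after the taint appeared;
		// the taint disappeared after ~5s, so the PODs stayed put.
		fmt.Printf("tolerated for %ds, eviction deferred\n", *toleration.TolerationSeconds)
	}
}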
