k8s部署的kafka集群,其中两个kafka节点频繁crashbackoff

▓千年祇园 卐 发表于: 2023-09-15   最后更新时间: 2023-09-18 10:04:06   551 游览

在一个由3个节点组成的kafka集群里,有2个kafka节点一直重启失败,下面是其日志

[2023-09-15 06:16:30,240] INFO [Log partition=poc_alarm_queue-2, dir=/bitnami/kafka/data] Recovering unflushed segment 0 (kafka.log.Log)
[2023-09-15 06:16:30,240] INFO [Log partition=poc_alarm_queue-2, dir=/bitnami/kafka/data] Loading producer state till offset 0 with message format version 2 (kafka.log.Log)
[2023-09-15 06:16:31,981] INFO Terminating process due to signal SIGTERM (org.apache.kafka.common.utils.LoggingSignalHandler)
[2023-09-15 06:16:31,983] INFO [KafkaServer id=1] shutting down (kafka.server.KafkaServer)
[2023-09-15 06:16:31,987] ERROR [KafkaServer id=1] Fatal error during KafkaServer shutdown. (kafka.server.KafkaServer)
java.lang.IllegalStateException: Kafka server is still starting up, cannot shut down!
        at kafka.server.KafkaServer.shutdown(KafkaServer.scala:602)
        at kafka.server.KafkaServerStartable.shutdown(KafkaServerStartable.scala:54)
        at kafka.Kafka$.$anonfun$main$3(Kafka.scala:80)
        at kafka.utils.Exit$.$anonfun$addShutdownHook$1(Exit.scala:38)
        at java.base/java.lang.Thread.run(Thread.java:834)
[2023-09-15 06:16:31,990] ERROR Halting Kafka. (kafka.server.KafkaServerStartable)

此外,kafka集群连接zk集群,zk集群看起来是正常的,用zkServer.sh status查看3个节点基本ok, 不过也有其客户端连接超时和失败的日志信息:

2023-09-15 06:13:03,026 [myid:3] - WARN  [NIOWorkerThread-2:NIOServerCnxn@364] - Unexpected exception
EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /192.168.4.12:31708, session = 0x3000004f3460053
        at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163)
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326)
        at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
        at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
2023-09-15 06:13:06,011 [myid:3] - INFO  [CommitProcessor:3:LeaderSessionTracker@104] - Committing global session 0x10000043eed0061
2023-09-15 06:13:06,188 [myid:3] - INFO  [CommitProcessor:3:LeaderSessionTracker@104] - Committing global session 0x2000003f1530054
2023-09-15 06:13:08,440 [myid:3] - INFO  [SessionTracker:ZooKeeperServer@610] - Expiring session 0x10000043eed0060, timeout of 18000ms exceeded
2023-09-15 06:13:08,440 [myid:3] - INFO  [RequestThrottler:QuorumZooKeeperServer@159] - Submitting global closeSession request for session 0x10000043eed0060
2023-09-15 06:13:20,441 [myid:3] - INFO  [SessionTracker:ZooKeeperServer@610] - Expiring session 0x3000004f3460053, timeout of 18000ms exceeded
2023-09-15 06:13:20,441 [myid:3] - INFO  [RequestThrottler:QuorumZooKeeperServer@159] - Submitting global closeSession request for session 0x3000004f3460053
2023-09-15 06:13:38,441 [myid:3] - INFO  [SessionTracker:ZooKeeperServer@610] - Expiring session 0x2000003f1530053, timeout of 18000ms exceeded
2023-09-15 06:13:38,441 [myid:3] - INFO  [RequestThrottler:QuorumZooKeeperServer@159] - Submitting global closeSession request for session 0x2000003f1530053
2023-09-15 06:13:50,441 [myid:3] - INFO  [SessionTracker:ZooKeeperServer@610] - Expiring session 0x2000003f1530054, timeout of 18000ms exceeded
2023-09-15 06:13:50,441 [myid:3] - INFO  [RequestThrottler:QuorumZooKeeperServer@159] - Submitting global closeSession request for session 0x2000003f1530054
2023-09-15 06:14:04,748 [myid:3] - INFO  [CommitProcessor:3:LeaderSessionTracker@104] - Committing global session 0x3000004f3460054
2023-09-15 06:14:04,791 [myid:3] - INFO  [RequestThrottler:QuorumZooKeeperServer@159] - Submitting global closeSession request for session 0x3000004f3460054
2023-09-15 06:14:04,903 [myid:3] - INFO  [CommitProcessor:3:LeaderSessionTracker@104] - Committing global session 0x2000003f1530055
2023-09-15 06:14:28,637 [myid:3] - INFO  [CommitProcessor:3:LeaderSessionTracker@104] - Committing global session 0x10000043eed0062
2023-09-15 06:14:28,807 [myid:3] - INFO  [CommitProcessor:3:LeaderSessionTracker@104] - Committing global session 0x3000004f3460055
2023-09-15 06:14:48,441 [myid:3] - INFO  [SessionTracker:ZooKeeperServer@610] - Expiring session 0x2000003f1530055, timeout of 18000ms exceeded
2023-09-15 06:14:48,441 [myid:3] - INFO  [RequestThrottler:QuorumZooKeeperServer@159] - Submitting global closeSession request for session 0x2000003f1530055
2023-09-15 06:14:53,043 [myid:3] - WARN  [NIOWorkerThread-1:NIOServerCnxn@364] - Unexpected exception
EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /192.168.4.12:60734, session = 0x3000004f3460055
        at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163)
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326)
        at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
        at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
2023-09-15 06:14:56,129 [myid:3] - INFO  [CommitProcessor:3:LeaderSessionTracker@104] - Committing global session 0x2000003f1530056
2023-09-15 06:14:56,306 [myid:3] - INFO  [CommitProcessor:3:LeaderSessionTracker@104] - Committing global session 0x2000003f1530057
2023-09-15 06:15:06,441 [myid:3] - INFO  [SessionTracker:ZooKeeperServer@610] - Expiring session 0x3000004f3460055, timeout of 18000ms exceeded
2023-09-15 06:15:06,441 [myid:3] - INFO  [RequestThrottler:QuorumZooKeeperServer@159] - Submitting global closeSession request for session 0x3000004f3460055
2023-09-15 06:15:40,441 [myid:3] - INFO  [SessionTracker:ZooKeeperServer@610] - Expiring session 0x2000003f1530057, timeout of 18000ms exceeded
2023-09-15 06:15:40,441 [myid:3] - INFO  [RequestThrottler:QuorumZooKeeperServer@159] - Submitting global closeSession request for session 0x2000003f1530057

请问这种情况一般如何定位解决?

发表于 2023-09-15
添加评论

收到了停机信号,kafka还在初始化中,导致关闭失败。

排查思路:

基于k8s安装的kafka需要注意容器健康探测验证,否则一直被k8s认为不健康而重启。

  • initialDelaySeconds: 120
    告诉 kubelet 在第一次执行 probe 之前要的等待120秒钟

参考:kubernetes配置Liveness和Readiness探针

你的答案

查看kafka相关的其他问题或提一个您自己的问题