Kafka keeps disconnecting from ZooKeeper abnormally

地球人 posted on 2021-07-08 · Last updated 2021-07-15 11:50:58

We run 3 ZooKeeper nodes and 4 Kafka brokers. About 7 days after the brokers started, the Kafka logs began showing warnings like the following:

WARN Client session timed out, have not heard from server in 72318ms for sessionid 0x27a5c2e2cfb0001 (org.apache.zookeeper.ClientCnxn)
Kafka then reconnected to ZooKeeper on its own, but after roughly 10 hours the session disconnect/reconnect cycle grew more and more frequent, until the session finally expired completely:
[2021-07-07 14:12:12,805] WARN Unable to reconnect to ZooKeeper service, session 0x37a5c2eeb6a0008 has expired (org.apache.zookeeper.ClientCnxn)
[2021-07-07 14:12:12,805] INFO zookeeper state changed (Expired) (org.I0Itec.zkclient.ZkClient)
[2021-07-07 14:12:12,805] INFO Unable to reconnect to ZooKeeper service, session 0x37a5c2eeb6a0008 has expired, closing socket connection (org.apache.zookeeper.ClientCnxn)
...
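Not from the thread, but a common mitigation for session expiry caused by long pauses or a slow network is to raise the broker-side ZooKeeper session timeout (Kafka 1.0's default is 6000 ms; the 18000 ms value below is an illustrative assumption, and it must stay within the ZK server's maxSessionTimeout, which defaults to 20×tickTime, i.e. 40 s with the default tickTime of 2000 ms):

```shell
# Illustrative sketch: append a longer ZK session timeout to the broker config.
mkdir -p config
cat >> config/server.properties <<'EOF'
zookeeper.session.timeout.ms=18000
zookeeper.connection.timeout.ms=18000
EOF
grep 'zookeeper' config/server.properties
```

A restart of the broker is required for the new timeout to take effect.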

The Kafka process is still running, but this broker is no longer listed under /brokers/ids.
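You can confirm which broker ids are registered by querying ZooKeeper directly; a sketch using the zookeeper-shell.sh tool that ships with Kafka (XX.XX.XX.XX is a placeholder for one of your ZK nodes):

```shell
# List the broker ids currently registered in ZooKeeper; a healthy 4-broker
# cluster would show all four ids here.
bin/zookeeper-shell.sh XX.XX.XX.XX:2181 ls /brokers/ids
```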

Kafka topic information:

[BEGIN] 2021/7/8 15:52:39
[root@kafka2 kafka]# bin/kafka-topics.sh --describe --zookeeper XX.XX.XX.XX:2181,XX.XX.XX.XX:2181,XX.XX.XX.XX:2181
Topic:ED    PartitionCount:4    ReplicationFactor:1    Configs:
    Topic: ED    Partition: 0    Leader: 1    Replicas: 1    Isr: 1
    Topic: ED    Partition: 1    Leader: 3    Replicas: 3    Isr: 3
    Topic: ED    Partition: 2    Leader: 0    Replicas: 0    Isr: 0
    Topic: ED    Partition: 3    Leader: 2    Replicas: 2    Isr: 2
Topic:__consumer_offsets    PartitionCount:200    ReplicationFactor:1    Configs:segment.bytes=104857600,cleanup.policy=compact,compression.type=producer
    Topic: __consumer_offsets    Partition: 0    Leader: 1    Replicas: 1    Isr: 1
    ...
    Topic: __consumer_offsets    Partition: 93    Leader: 2    Replicas: 2    Isr: 2
    Topic: __consumer_offsets    Partition: 94    Leader: 3    Replicas: 3    Isr: 3
    Topic: __consumer_offsets    Partition: 95    Leader: 0    Replicas: 0    Isr: 0
    Topic: __consumer_offsets    Partition: 96    Leader: 2    Replicas: 2    Isr: 2
    Topic: __consumer_offsets    Partition: 97    Leader: 3    Replicas: 3    Isr: 3
    ...
    Topic: __consumer_offsets    Partition: 197    Leader: 0    Replicas: 0    Isr: 0
    Topic: __consumer_offsets    Partition: 198    Leader: 2    Replicas: 2    Isr: 2
    Topic: __consumer_offsets    Partition: 199    Leader: 3    Replicas: 3    Isr: 3

[END] 2021/7/8 15:52:51

The Kafka log level is configured as follows:

log4j.rootLogger=INFO, stdout, kafkaAppender

log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=[%d] %p %m (%c)%n

log4j.appender.kafkaAppender=org.apache.log4j.DailyRollingFileAppender
log4j.appender.kafkaAppender.DatePattern='.'yyyy-MM-dd-HH
log4j.appender.kafkaAppender.File=${kafka.logs.dir}/server.log
log4j.appender.kafkaAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.kafkaAppender.layout.ConversionPattern=[%d] %p %m (%c)%n

log4j.appender.stateChangeAppender=org.apache.log4j.DailyRollingFileAppender
log4j.appender.stateChangeAppender.DatePattern='.'yyyy-MM-dd-HH
log4j.appender.stateChangeAppender.File=${kafka.logs.dir}/state-change.log
log4j.appender.stateChangeAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.stateChangeAppender.layout.ConversionPattern=[%d] %p %m (%c)%n

log4j.appender.requestAppender=org.apache.log4j.DailyRollingFileAppender
log4j.appender.requestAppender.DatePattern='.'yyyy-MM-dd-HH
log4j.appender.requestAppender.File=${kafka.logs.dir}/kafka-request.log
log4j.appender.requestAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.requestAppender.layout.ConversionPattern=[%d] %p %m (%c)%n

There are no ERROR-level messages.

Only this one of the 4 Kafka brokers reports the reconnect problem. The logs also contain a large number of other warnings, for example:

WARN Attempting to send response via channel for which there is no open connection, connection id XXXXXXXX:9092-XXXXXXXX:47415-114379938 (kafka.network.Processor)

WARN Received a PartitionLeaderEpoch assignment for an epoch < latestEpoch. This implies messages have arrived out of order. New: {epoch:0, offset:116331733}, Current: {epoch:24994, offset:114475406} for Partition: XXXXXX-3 (kafka.server.epoch.LeaderEpochFileCache)

All topics in this deployment have only a single replica.
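With a single replica, any broker dropping out of the cluster makes its partitions unavailable. If that is a concern, the replication factor can be raised with a partition reassignment; a minimal sketch (the topic, partition, and broker ids below are illustrative, not taken from your cluster):

```shell
# Write a reassignment plan that gives partition 0 of topic ED a second replica
# on broker 2 (keeping its current replica on broker 1).
cat > increase-rf.json <<'EOF'
{"version":1,"partitions":[{"topic":"ED","partition":0,"replicas":[1,2]}]}
EOF
# Apply it against a live cluster (XX.XX.XX.XX is a placeholder ZK address):
# bin/kafka-reassign-partitions.sh --zookeeper XX.XX.XX.XX:2181 \
#   --reassignment-json-file increase-rf.json --execute
```

Each partition you want protected needs its own entry in the plan.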

Today we hit Kafka stopping service again; checking the GC logs revealed:

2021-07-15T09:49:41.106+0800: 571718.919: [Full GC (Allocation Failure)  31G->29G(32G), 67.2720092 secs]
   [Eden: 0.0B(1632.0M)->0.0B(1632.0M) Survivors: 0.0B->0.0B Heap: 32.0G(32.0G)->29.9G(32.0G)], [Metaspace: 31574K->31440K(32768K)]
 [Times: user=124.42 sys=0.00, real=67.27 secs]
2021-07-15T09:50:48.380+0800: 571786.192: [GC concurrent-mark-reset-for-overflow]
2021-07-15T09:50:48.380+0800: 571786.192: [GC concurrent-mark-abort]

So Kafka ran out of memory, but the root cause of the overflow is still under investigation.
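The Full GC above is very likely the direct cause of the session expiry: the JVM was frozen for about 67 seconds of real time, during which the broker could not send heartbeats, so ZooKeeper expired the session and removed the broker from /brokers/ids. A back-of-the-envelope check (6000 ms is the Kafka 1.0.0 default for zookeeper.session.timeout.ms, assuming it was not overridden):

```shell
# A stop-the-world pause longer than the ZK session timeout guarantees expiry:
# the broker cannot send heartbeats while the JVM is frozen.
GC_PAUSE_MS=67272          # "real=67.27 secs" from the Full GC line above
SESSION_TIMEOUT_MS=6000    # assumed default zookeeper.session.timeout.ms (Kafka 1.0.0)
if [ "$GC_PAUSE_MS" -gt "$SESSION_TIMEOUT_MS" ]; then
  echo "GC pause exceeds session timeout: session will expire"
fi
```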



半兽人 replied 22 days ago

From the information you provided, your Kafka cluster is healthy and all partitions are serving normally.
You said "this broker is no longer in /brokers/ids",
but judging from your cluster status, it is still providing service.

Warnings like this between ZooKeeper and Kafka (or between ZooKeeper and Kafka clients) are very common, so what you need to watch for is:

  • If the warnings start flooding the log, that deserves attention (usually a ZooKeeper version incompatible with your Kafka)
  • Whether the problematic node has hit a resource bottleneck (I/O, network, disk, CPU, etc.)
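The resource-bottleneck check in the bullets above can be done quickly on the affected broker host. A sketch with standard Linux tools (iostat and sar come from the sysstat package; the log-dir path is a placeholder assumption):

```shell
top -b -n 1 | head -n 15        # overall CPU and memory pressure
iostat -x 1 3                   # per-device I/O utilization and await times
sar -n DEV 1 3                  # per-interface network throughput
df -h /data/kafka-logs          # free space on the Kafka log dir (path assumed)
```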
地球人 -> 半兽人, 22 days ago

The service recovered this morning after I restarted Kafka; before the manual restart, the broker was not in /brokers/ids.
The ZooKeeper version on site is zookeeper.version=3.4.9-1757313,
and the Kafka version is kafka_2.11-1.0.0.
Which ZooKeeper version is recommended for this Kafka release?

地球人 -> 半兽人, 15 days ago

Hi expert, today I found Kafka running out of memory. Any pointers on where to start troubleshooting?

半兽人 -> 地球人, 15 days ago

When Kafka itself reports out-of-memory, it is usually because Kafka's default parameters were changed (for example, increasing the number of messages accepted, cache sizes, and so on).
Increasing the JVM heap that Kafka starts with should resolve it.
See: https://www.orchome.com/10089
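The advice above is usually applied by setting KAFKA_HEAP_OPTS before invoking the start script; a sketch (the 8 GB figure is purely illustrative, size it to your own workload):

```shell
# kafka-server-start.sh honors KAFKA_HEAP_OPTS if it is already set in the
# environment, instead of using its built-in default heap size.
export KAFKA_HEAP_OPTS="-Xms8g -Xmx8g"
echo "$KAFKA_HEAP_OPTS"
# bin/kafka-server-start.sh -daemon config/server.properties
```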
