A Kafka broker keeps syncing partitions and its I/O is saturated

祖晓晖   Posted: 2022-09-24   Last updated: 2022-09-24 19:33:22   636 views

Kafka 2.12-2.3.1 (the Scala 2.12 build of Kafka 2.3.1)
9 VMs, each with 16 cores, 32 GB RAM, and 12 TB of storage
One disk per host

Configuration:

broker.id=5
num.network.threads=8
transaction.state.log.replication.factor=3
num.partitions=6
offsets.topic.replication.factor=3
default.replication.factor=3
num.io.threads=8

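One broker suddenly lost contact with the cluster and has been resyncing its out-of-sync partitions ever since. It has been syncing for a full day, and the ZooKeeper instance that Kafka uses on the same host also reports that it cannot serve requests: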

This ZooKeeper instance is not currently serving requests

Commands on that broker's host run very slowly, state-change.log is being written to continuously, and server.log contains the following exceptions and warnings:

WARN Attempting to send response via channel for which there is no open connection, connection id xxx
WARN [ReplicaFetcher replicaId=5, leaderId=1, fetcherId=0] Error in response for fetch request
WARN [ReplicaFetcher replicaId=5, leaderId=9, fetcherId=0] Partition xxx marked as failed (kafka.server.ReplicaFetcherThread)
Shrinking ISR from 2,5 to 5. Leader: (highWatermark: 0, endOffset: 0). Out of sync replicas: (brokerId: 2, endOffset: -1). (kafka.cluster.Partition)
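
When many partitions are affected, it can help to count which brokers are being dropped from the ISR. The following is a minimal sketch, assuming the message format matches the "Shrinking ISR" line quoted above; the helper name `dropped_replicas` is made up for illustration and is not part of Kafka.

```python
import re

# Matches the ISR shrink message in the form quoted above (Kafka 2.3.x).
SHRINK_RE = re.compile(r"Shrinking ISR from (\d+(?:,\d+)*) to (\d+(?:,\d+)*)\.")

def dropped_replicas(line):
    """Return the sorted broker ids removed from the ISR, or None if the line does not match."""
    m = SHRINK_RE.search(line)
    if m is None:
        return None
    before = {int(x) for x in m.group(1).split(",")}
    after = {int(x) for x in m.group(2).split(",")}
    return sorted(before - after)

line = ("Shrinking ISR from 2,5 to 5. Leader: (highWatermark: 0, endOffset: 0). "
        "Out of sync replicas: (brokerId: 2, endOffset: -1). (kafka.cluster.Partition)")
print(dropped_replicas(line))  # [2]
```

For the line quoted above, broker 2 is the replica that left the ISR and broker 5 is the one that stayed in it.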

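My initial suspicion is that, with only one disk per host, the I/O subsystem is saturated and the disk is the bottleneck. After I shut this broker down, ZooKeeper recovered and the I/O load returned to normal.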

Monitoring charts:

[3 screenshots of monitoring charts]

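If I stop all applications, wait until every partition has finished syncing, and then start the applications again, will the cluster recover?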

If that does not work, I plan to provision a larger cluster with multiple disks per host.

Posted on 2022-09-24

This ZooKeeper instance is not currently serving requests

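This error appears when the ZooKeeper ensemble has lost quorum: only one node is left, or fewer than a majority of the nodes are still running.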
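
The majority rule can be written as a one-line check. This is only an illustration: the thread does not say how many ZooKeeper nodes there are, so the 3-node ensemble below is an assumption, and `has_quorum` is a made-up helper name.

```python
def has_quorum(ensemble_size, alive):
    """ZooKeeper serves requests only while a strict majority of the ensemble is up."""
    return alive >= ensemble_size // 2 + 1

# Assumed 3-node ensemble: show how many live nodes are enough.
for alive in range(4):
    print(alive, has_quorum(3, alive))
```

Exactly half is not enough: a 4-node ensemble with 2 live nodes has no quorum either.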

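Has the throughput at this volume really hit a bottleneck? Once the broker comes back, its partition leaders have already moved to other brokers, so in theory it only carries replication traffic. If you think that still reaches the I/O limit, consider throttling replication traffic in Kafka.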
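
One way to read the throttling suggestion is Kafka's replication throttle (KIP-73), which is set with `kafka-configs.sh`. The sketch below only builds and prints the command lines, it does not run them. The ZooKeeper address `zk1:2181`, the topic name `my-topic`, and the 50 MiB/s rate are placeholders, not values from this thread, and `throttle_commands` is a made-up helper.

```python
import shlex

def throttle_commands(zk, broker_id, topic, bytes_per_sec):
    """Build kafka-configs.sh invocations that cap replication traffic (KIP-73 settings)."""
    base = ["bin/kafka-configs.sh", "--zookeeper", zk, "--alter"]
    # Broker-level rate limits, in bytes per second.
    rate = (f"leader.replication.throttled.rate={bytes_per_sec},"
            f"follower.replication.throttled.rate={bytes_per_sec}")
    # Topic-level lists of throttled replicas; '*' means every replica of the topic.
    replicas = ("leader.replication.throttled.replicas=*,"
                "follower.replication.throttled.replicas=*")
    return [
        base + ["--entity-type", "brokers", "--entity-name", str(broker_id),
                "--add-config", rate],
        base + ["--entity-type", "topics", "--entity-name", topic,
                "--add-config", replicas],
    ]

# Placeholder values; broker id 5 is the one from the question.
for cmd in throttle_commands("zk1:2181", 5, "my-topic", 50 * 1024 * 1024):
    print(" ".join(shlex.quote(part) for part in cmd))
```

The rate only applies to replicas listed in the topic-level `*.replication.throttled.replicas` settings, so both commands are needed. The follower rate on the recovering broker limits what it fetches, and the leader rate limits what a broker serves to throttled followers. Once the broker has caught up, the settings should be removed again with `--delete-config`, otherwise normal replication stays capped.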
