Kafka reports "too many open files" and shuts down?

micmouse521, posted 2018-08-20, last updated 2021-09-16 10:46:48, 20,176 views

Server information:

  1. Kafka version: kafka_2.11-0.10.1.0
  2. JVM version: java version "1.7.0_67"
  3. Kafka's maximum open-file limit: 65535

The Kafka configuration is as follows:

broker.id=23
listeners=PLAINTEXT://192.168.1.100:9092
num.network.threads=33
num.io.threads=64
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
socket.request.max.bytes=104857600
log.dirs=/data/disk01/,/data/disk02/,/data/disk03/,/data/disk04/,/data/disk05/,/data/disk06/,/data/disk07/,/data/disk08/,/data/disk09/,/data/disk10/,/data/disk11/,/data/disk12/
num.partitions=20
num.recovery.threads.per.data.dir=10
log.flush.interval.messages=20000
log.flush.interval.ms=10000
log.flush.scheduler.interval.ms=2000
log.retention.hours=24
log.roll.hours=24
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
log.cleanup.interval.mins=10
log.cleaner.enable=false
message.max.bytes=20000000
auto.create.topics.enable=false
log.index.interval.bytes=4096
log.index.size.max.bytes=10485760
replica.fetch.max.bytes=104857600
replica.fetch.wait.max.ms=500
replica.high.watermark.checkpoint.interval.ms=5000
replica.socket.timeout.ms=30000
replica.socket.receive.buffer.bytes=65536
replica.lag.time.max.ms=10000
auto.leader.rebalance.enable=true
delete.topic.enable=true
log.message.format.version=0.8.2
zookeeper.connect=192.168.1.101:2181,192.168.1.102:2181,192.168.1.103:2181,192.168.1.104:2181,192.168.1.105:2181
zookeeper.connection.timeout.ms=6000
zookeeper.sync.time.ms=2000

Problem description:

Recently our production Kafka cluster has repeatedly had brokers shut down because of "too many open files". Each incident affects exactly one broker, but not always the same one, and the logs contain large numbers of errors:

  1. Before the "too many open files" errors there are many INFO messages like the following; the ISR shrinks from several replicas down to just the failing broker itself:

     Shrinking ISR for partition [some_topic,13] from 23,17 to 17
     Shrinking ISR for partition [some_topic2,12] from 2223,17 to 17
     Shrinking ISR for partition [some_topic3,11] from 2023,17 to 17
     Shrinking ISR for partition [some_topic4,12] from 1925,17 to 17
    
  2. Before the "too many open files" errors there are also many errors like:

     [2018-08-20 07:57:09,602] ERROR [ReplicaFetcherThread-0-18], Error for partition [some_topic2,18] to broker 18:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
     [2018-08-20 07:57:09,602] ERROR [ReplicaFetcherThread-0-18], Error for partition [some_topic1,5] to broker 18:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
     [2018-08-20 07:57:09,602] ERROR [ReplicaFetcherThread-0-18], Error for partition [some_topic19,9] to broker 18:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
     [2018-08-20 07:57:09,602] ERROR [ReplicaFetcherThread-0-18], Error for partition [some_topic1,9] to broker 18:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
    
  3. Then large numbers of "too many open files" errors:

     [2018-08-20 07:57:10,355] ERROR Error while accepting connection (kafka.network.Acceptor)
     java.io.IOException: Too many open files
             at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
             at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:241)
             at kafka.network.Acceptor.accept(SocketServer.scala:326)
             at kafka.network.Acceptor.run(SocketServer.scala:269)
             at java.lang.Thread.run(Thread.java:745)
     [2018-08-20 07:57:10,355] ERROR Error while accepting connection (kafka.network.Acceptor)
     java.io.IOException: Too many open files
             at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
             at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:241)
             at kafka.network.Acceptor.accept(SocketServer.scala:326)
             at kafka.network.Acceptor.run(SocketServer.scala:269)
             at java.lang.Thread.run(Thread.java:745)
     [2018-08-20 07:57:10,355] ERROR Error while accepting connection (kafka.network.Acceptor)
     java.io.IOException: Too many open files
    

Initial analysis:

  1. Checked the filesystem (smartctl); the filesystem is healthy.
  2. Monitored Kafka's open file count and found a large number of sockets stuck in CLOSE_WAIT: roughly 1,000 client IPs have CLOSE_WAIT connections, each IP with more than 500 connections, and about 50 of those IPs have more than 1,000 connections each.

The problem has been solved; it was a Kafka bug:
https://github.com/apache/kafka/pull/2116
https://issues.apache.org/jira/browse/KAFKA-4362
Upgrading to kafka_2.11-0.10.1.1 resolves it.

[2018-10-09 11:10:00,304] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:33414-440063 (kafka.network.Processor)
[2018-10-09 11:35:00,105] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:56600-440875 (kafka.network.Processor)
[2018-10-09 12:05:00,098] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:54642-441768 (kafka.network.Processor)
[2018-10-09 13:25:00,266] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:37534-444049 (kafka.network.Processor)
[2018-10-09 14:18:37,630] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:34310-445956 (kafka.network.Processor)
[2018-10-09 14:25:00,190] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:39004-446107 (kafka.network.Processor)
[2018-10-09 15:40:02,721] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:54002-448589 (kafka.network.Processor)
[2018-10-09 15:50:00,153] WARN Attempting to send response via channel for which there is no open connection, connection id xxx.xxx.xxx.xxx:9092-xxx.xxx.xxx.xxx:34168-448891 (kafka.network.Processor)
ighack -> 小无名者, 4 years ago

I'm seeing these warnings too. Did you ever solve it?

冰海落花 -> ighack, 3 years ago

Was this warning ever resolved?

OP, has the problem been solved?

Hi, has the problem been solved? I've recently run into it as well.

I'm on the CDH distribution (CDK 3.0.0, which corresponds to Kafka 0.11.0) and have been hitting this problem frequently lately too. Do I need to upgrade to 1.0?

Upgrading to kafka_2.11-0.10.1.1 resolves it.

Hi, has the problem been solved? I've recently run into it as well.

micmouse521 -> 星空, 4 years ago

The problem has been solved; it was a Kafka bug:
https://github.com/apache/kafka/pull/2116
https://issues.apache.org/jira/browse/KAFKA-4362
Upgrading to kafka_2.11-0.10.1.1 resolves it.

星空 -> micmouse521, 4 years ago

My version is kafka_2.11-0.10.1.0.

星空 -> 星空, 4 years ago

Could you leave your contact info and help me take a look?

1. Increase the system's open-file limit (ulimit).
2. A client normally keeps just a single long-lived connection; it looks like your connection-creation code may be sitting inside a loop, check that (see the sketch after this list).
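
To illustrate point 2: a minimal Java sketch (not code from this thread; the broker address and topic are placeholders) of keeping one long-lived producer per process instead of constructing a new producer for every send.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SharedProducer {
    // One long-lived producer per process: it maintains a small pool of
    // broker connections instead of opening a new socket for every send.
    private static final KafkaProducer<String, String> PRODUCER = createProducer();

    private static KafkaProducer<String, String> createProducer() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "192.168.1.100:9092");   // placeholder broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        return new KafkaProducer<>(props);
    }

    public static void send(String topic, String value) {
        // Reuse the shared instance; do not call new KafkaProducer(...) inside
        // a send loop, or every iteration leaves extra sockets behind on the broker.
        PRODUCER.send(new ProducerRecord<>(topic, value));
    }

    public static void close() {
        PRODUCER.close();   // close once, on application shutdown
    }
}

Creating producers in a loop and never closing them is one common way for a broker to accumulate leaked connections and file descriptors.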

micmouse521 -> 半兽人, 5 years ago

Thanks. The open-file limit has been raised twice:

  1. Another node that hit the error was raised to 65350 and still reports "too many open files".
  2. There is no duplication; the count comes from lsof -p pid | wc -l.
半兽人 -> micmouse521, 5 years ago

Still failing after the adjustment?
If the open-file limit has been raised, check that it actually applies to the user that starts Kafka.

Each IP has 500 connections?

micmouse521 -> 半兽人, 5 years ago

lrwx------ 1 kafka kafka 64 Aug 20 07:45 189987 -> socket:[3517032353]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189988 -> socket:[3517032354]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189989 -> socket:[3517032355]
lrwx------ 1 kafka kafka 64 Aug 20 07:39 18999 -> socket:[3516466241]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189990 -> socket:[3517032356]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189991 -> socket:[3517032357]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189992 -> socket:[3517032358]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189993 -> socket:[3517032359]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189994 -> socket:[3517032360]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189995 -> socket:[3517032361]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189996 -> socket:[3517032362]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189997 -> socket:[3517032363]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189998 -> socket:[3517032364]
lrwx------ 1 kafka kafka 64 Aug 20 07:45 189999 -> socket:[3517032365]
micmouse521 -> 半兽人, 5 years ago

Yes. Each IP has more than 1,000 connections in TIME_WAIT.
Counting the files under /proc/pid/fd also gives a very large number.

micmouse521 -> 半兽人, 5 years ago

Partial output of netstat -aupt:

tcp     1481      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.168.25.42:35271    CLOSE_WAIT  126050/java
tcp      770      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.24.19:54740    CLOSE_WAIT  126050/java
tcp      232      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.168.110.10:18666 CLOSE_WAIT  126050/java
tcp      509      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.32.73:52718    CLOSE_WAIT  126050/java
tcp      204      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.31.79:23604    CLOSE_WAIT  126050/java
tcp      377      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.32.115:49461   CLOSE_WAIT  126050/java
tcp      958      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.18.44:46041    CLOSE_WAIT  -
tcp      877      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.26.27:32511    CLOSE_WAIT  126050/java
tcp      226      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.39.26.80:63607    CLOSE_WAIT  126050/java
tcp    30207      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:172.21.31.56:53851   CLOSE_WAIT  126050/java
tcp      854      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.168.34.56:12484    CLOSE_WAIT  126050/java
tcp      447      0 kafka023.hadoo:XmlIpcRegSvc ::ffff:192.168.13.25:32355    CLOSE_WAIT  126050/java
半兽人 -> micmouse521, 5 years ago

Upgrade Java to 1.8 and turn the firewall off.
Also watch a specific connection on one machine and see whether it gets dropped automatically after a while.

micmouse521 -> 半兽人, 5 years ago

Thanks, the firewall is already off.
Thanks for the ideas.

半兽人 -> micmouse521, 5 years ago

Did you find the cause?

micmouse521 -> 半兽人, 5 years ago

Not yet. It looks similar to the issue below, so it's probably a Kafka bug:
https://issues.apache.org/jira/browse/KAFKA-4477

kafka020
[2018-08-24 08:39:04,559] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@1b37b18 (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 23 was disconnected before the response was read
kafka025
[2018-08-24 08:39:04,582] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@544c05bb (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 23 was disconnected before the response was read
kafka024
[2018-08-24 08:32:24,253] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@7059a711 (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 23 was disconnected before the response was read
kafka019
[2018-08-24 08:39:04,566] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@37f0b8a1 (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 23 was disconnected before the response was read
kafka016
[2018-08-24 08:39:04,571] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@7af80ad4 (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 23 was disconnected before the response was read
kafka017
[2018-08-24 08:39:04,572] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@3caa6934 (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 23 was disconnected before the response was read
kafka018
[2018-08-24 08:39:04,583] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@659c8509 (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 23 was disconnected before the response was read
kafka022
[2018-08-24 08:32:24,088] WARN [ReplicaFetcherThread-0-23], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@74478c1b (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 23 was disconnected before the response was read
半兽人 -> micmouse521, 5 years ago

Can you reproduce the scenario?

micmouse521 -> 半兽人, 5 years ago

It now reoccurs about every two days; we monitor the open-file count and restart the affected node.
I don't know how to reproduce it manually.

半兽人 -> micmouse521, 5 years ago

What about the earlier suggestion of monitoring whether a specific connection from a given machine gets reconnected?
Also, I'd recommend upgrading Kafka; I've already run into two disaster-level problems with the 0.10 series.

micmouse521 -> 半兽人, 5 years ago

We only monitor Kafka itself; the producer side is not monitored.
Could you elaborate?

半兽人 -> micmouse521, 5 years ago

1. Your clients open a lot of connections to the cluster for consuming and producing. Watch a few client-to-Kafka connections: normally they are long-lived and don't change, so check periodically whether any connections are churning.
2. The error details aren't on this computer. One case was consumer clients failing in large numbers when committing; the other was consumer groups rebalancing continuously.

micmouse521 -> 半兽人, 5 years ago

We added monitoring of the clients.
The problem occurred again recently.
The monitoring shows that client-side connection counts do not change noticeably and are all in ESTABLISHED state, but the affected Kafka server has a large number of connections in CLOSE_WAIT.
Comparing with normal operation, the connection count during the incident is roughly 7 to 8 times higher, and according to the monitoring it climbs from about 20,000 to 30,000 within about two minutes.

My read is that a client fails to reach the server, drops the connection and reconnects, and after a few retries it finally gets through, but the server never releases the earlier failed connections. (CLOSE_WAIT means the client has already closed its end and the broker process has not yet closed its socket, which is consistent with a file-descriptor leak on the broker side.)

Has the problem been solved?

No; it's most likely a version issue.
We set up connection monitoring that automatically restarts a node once it exceeds 30,000 connections (roughly along the lines of the sketch below).
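
For reference, a rough Java sketch of that kind of watchdog, assuming a Linux host and that the broker PID and the 30,000 threshold are passed as arguments; counting the entries under /proc/<pid>/fd gives roughly the same figure as lsof -p pid | wc -l. The alert/restart action itself is left out.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class FdWatchdog {
    public static void main(String[] args) throws IOException, InterruptedException {
        long pid = Long.parseLong(args[0]);        // broker process id
        long threshold = Long.parseLong(args[1]);  // e.g. 30000
        Path fdDir = Paths.get("/proc/" + pid + "/fd");

        while (true) {
            long open;
            try (Stream<Path> fds = Files.list(fdDir)) {
                open = fds.count();                // one entry per open file descriptor
            }
            if (open > threshold) {
                // Hook your alerting or restart logic in here.
                System.out.println("open fds " + open + " exceeded " + threshold);
            }
            Thread.sleep(60_000);                  // check once a minute
        }
    }
}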

Our nodes can't just be restarted; there are real-time jobs running on them, which is a problem in itself. Following 半兽人's advice, we're going to try upgrading the client.

I see. Ours is mainly log traffic, so a restart causes a message backlog.
Upgrade the client? Could you explain in more detail?

It feels like the problem is not on the client side; it's the server that never closes the connections.

Well, my version is already 2.11_1.0.0 and I still don't have much to go on. I've tuned reconnect_backoff_ms, max_poll_records, and heartbeat_interval_ms, but the problem keeps showing up.

This cluster also has another problem: one consumer group keeps rebalancing, once every 2 to 4 minutes, which is far too frequent.
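
Those parameter names look like the kafka-python spellings; in the Java client the equivalents are reconnect.backoff.ms, max.poll.records and heartbeat.interval.ms. As an illustrative sketch only (values are examples, not taken from this thread), this is the kind of consumer tuning usually tried when a group rebalances because the poll loop is too slow:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TunedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "192.168.1.100:9092");   // placeholder broker address
        props.put("group.id", "demo-group");                    // placeholder group id
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        // Illustrative values: fewer records per poll, a generous poll interval,
        // heartbeat at roughly one third of the session timeout.
        props.put("max.poll.records", "100");
        props.put("max.poll.interval.ms", "300000");
        props.put("session.timeout.ms", "30000");
        props.put("heartbeat.interval.ms", "10000");
        props.put("reconnect.backoff.ms", "1000");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("some_topic"));  // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                records.forEach(r -> System.out.println(r.value()));
            }
        }
    }
}

If the group still rebalances every few minutes, the usual suspects are a poll loop that exceeds max.poll.interval.ms, session timeouts shorter than GC pauses, or consumers that keep restarting; none of that, however, explains the broker-side CLOSE_WAIT buildup discussed above.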

呆木头 -> 半兽人, 5 years ago

We're on Kafka 2.1.0 and see something similar:
the rest of the cluster can't reach one of the brokers and its traffic drops to zero,
but the faulty node is never kicked out; a command-line describe of the topics shows every ISR online, while the machine's file-handle count keeps growing.
Only restarting the node brings it back.

Hi, has this problem been solved?

Hi, has this problem been solved?
