In which later version was the KAFKA-7870 bug fixed?

Jacky   Posted: 2019-09-03   Last updated: 2019-09-03

Our production Kafka cluster hit this bug. The symptoms: the cluster looks healthy and the ISR is complete, yet the partition leader accepts neither writes nor reads, and the partition followers fail to replicate from it. Memory consumption on the affected node eventually grows very high; restarting that single node restores normal operation. After checking every log, the only anomaly is the replication error below on the follower; nothing unusual shows up anywhere else (a sketch for cross-checking the partition metadata follows the log).

[2019-09-02 10:03:30,051] WARN [ReplicaFetcher replicaId=8, leaderId=2, fetcherId=0] 
Error in response for fetch request (type=FetchRequest, replicaId=8, maxWait=500, minBytes=1, maxBytes=10485760, 
fetchData={
PortrayAys-10=(offset=0, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[4]), 
MMS-Metric-1=(offset=749142728, logStartOffset=749142728, maxBytes=1048576, currentLeaderEpoch=Optional[0]), 
dialtest-2=(offset=15316723, logStartOffset=15316723, maxBytes=1048576, currentLeaderEpoch=Optional[4]), 
DetectoCPU-3=(offset=0, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[2]), 
OneMinBL-17=(offset=0, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[4]), 
MetricRoute-0=(offset=0, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[2]), 
MetricBaseData-25=(offset=649178, logStartOffset=649178, maxBytes=1048576, currentLeaderEpoch=Optional[0]), 
Argus-RawData-8=(offset=20386963624, logStartOffset=20386403279, maxBytes=1048576, currentLeaderEpoch=Optional[0]), 
MetricBaseData-15=(offset=648652, logStartOffset=648652, maxBytes=1048576, currentLeaderEpoch=Optional[0]), 
PortrayAys-20=(offset=0, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[2]), 
NewProxyBaseData-5=(offset=0, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[2]), 
Detector-To-ES-8=(offset=31787842, logStartOffset=31787842, maxBytes=1048576, currentLeaderEpoch=Optional[4]), 
AIAnomaly-2=(offset=5628, logStartOffset=5628, maxBytes=1048576, currentLeaderEpoch=Optional[2]), 
AIPortray-22=(offset=0, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[4])}, 
isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=2013330601, epoch=INITIAL)) (kafka.server.ReplicaFetcherThread)
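
If you run into the same symptom, one cross-check is to compare what the metadata claims against actual partition behaviour. A minimal sketch, assuming a 2.x Java client and a hypothetical bootstrap address, that prints leader and ISR for one of the topics from the log above:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class IsrCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // hypothetical address
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin
                    .describeTopics(Collections.singletonList("Argus-RawData"))
                    .all().get()
                    .get("Argus-RawData");
            // With this bug the ISR can look complete here even though the
            // leader has stopped serving produce and fetch requests.
            desc.partitions().forEach(p ->
                    System.out.printf("partition=%d leader=%s isr=%s%n",
                            p.partition(), p.leader(), p.isr()));
        }
    }
}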

https://issues.apache.org/jira/browse/KAFKA-7870




    • I plan to upgrade straight from 2.1.0 to one of 2.2.0 or 2.1.1. Will upgrading this way have any impact? How is the compatibility between these versions?

        • kafka_2.11-0.8.2.1
          kafka_2.10-0.10.2.0
          Are these two versions compatible with each other? We started with kafka_2.11-0.8.2.1 and later switched to kafka_2.10-0.10.2.0, but the consumer side still depends on kafka_2.11-0.8.2.1, and offset commits sometimes fail. Could that version mismatch be the cause?

            • Hi, one partition of the default __consumer_offsets topic contains the entries below. Can we take this to mean that the messages at offset=728 and offset=729 were consumed normally?
              [ems,ems-otchs-topic,0]::[OffsetMetadata[728,NO_METADATA],CommitTime 1568284275547,ExpirationTime 1568370675547]
              [ems,ems-otchs-topic,0]::[OffsetMetadata[729,NO_METADATA],CommitTime 1568284276561,ExpirationTime 1568370676561]
              Thanks!

                • kafka-simple-consumer-shell.sh --topic __consumer_offsets --partition 5 --broker-list IP:PORT --formatter "kafka.coordinator.GroupMetadataManager\$OffsetsMessageFormatter"
                  I decoded them with the command above. Strictly speaking, these are not log entries; they are records from the __consumer_offsets data files.
                  So how do we tell whether consumption succeeded? Right now the server side shows no errors, but the consumer reports an error when it commits the offset after consuming — see the sketch below.
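
                  A commit record in __consumer_offsets only proves the commit reached the broker, not that the message was processed without error, so success ultimately has to be judged in the consumer itself. To read back what the group last committed, here is a minimal sketch assuming the new Java consumer API (0.9+); the group id ems and the topic are taken from the records above, and the broker address is hypothetical:

                  import java.util.Properties;
                  import org.apache.kafka.clients.consumer.KafkaConsumer;
                  import org.apache.kafka.clients.consumer.OffsetAndMetadata;
                  import org.apache.kafka.common.TopicPartition;

                  public class CommittedOffsetCheck {
                      public static void main(String[] args) {
                          Properties props = new Properties();
                          props.put("bootstrap.servers", "broker1:9092"); // hypothetical address
                          props.put("group.id", "ems");                   // group id from the records above
                          props.put("key.deserializer",
                                  "org.apache.kafka.common.serialization.StringDeserializer");
                          props.put("value.deserializer",
                                  "org.apache.kafka.common.serialization.StringDeserializer");
                          try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                              TopicPartition tp = new TopicPartition("ems-otchs-topic", 0);
                              // committed() returns the group's last committed offset for the
                              // partition, or null if nothing has been committed yet.
                              OffsetAndMetadata committed = consumer.committed(tp);
                              System.out.println("last committed: " + committed);
                          }
                      }
                  }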

                    • Up through 0.8.x, consumers stored their offsets in ZooKeeper; since 0.9, offsets are stored by default in the internal __consumer_offsets topic (a configuration sketch follows below).

                      As for the offset-commit error, please post the error message.
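
                      Since you are on the old high-level consumer, here is a minimal sketch, assuming the 0.8.2+/0.10 javaapi and a hypothetical ZooKeeper address, of keeping its offsets in Kafka rather than ZooKeeper:

                      import java.util.Properties;
                      import kafka.consumer.Consumer;
                      import kafka.consumer.ConsumerConfig;
                      import kafka.javaapi.consumer.ConsumerConnector;

                      public class OldConsumerOffsetStorage {
                          public static void main(String[] args) {
                              Properties props = new Properties();
                              props.put("zookeeper.connect", "zk1:2181"); // hypothetical address
                              props.put("group.id", "ems");               // group id from this thread
                              // Store offsets in the internal __consumer_offsets topic instead of ZooKeeper.
                              props.put("offsets.storage", "kafka");
                              // During a migration, also write offsets to ZooKeeper so older tooling keeps working.
                              props.put("dual.commit.enabled", "true");
                              ConsumerConnector connector =
                                      Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
                              connector.commitOffsets(); // the commit goes to the group's offset manager broker
                              connector.shutdown();
                          }
                      }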

                        • <2019-09-12 18:12:22,542>[TRACE] kafka.consumer.ZookeeperConsumerConnector - [ems_SH-L08013-1568281779941-bdab7903], OffsetMap: Map([ems-otchs-topic,0] -> [OffsetMetadata[728,NO_METADATA],CommitTime -1,ExpirationTime -1])
                          <2019-09-12 18:12:22,543>[DEBUG] kafka.consumer.ZookeeperConsumerConnector - [ems_SH-L08013-1568281779941-bdab7903], Connected to offset manager 30.79.78.25:34548.
                          <2019-09-12 18:12:22,543>[TRACE] kafka.network.RequestOrResponseSend - 75 bytes written.
                          <2019-09-12 18:12:22,543>[ERROR] kafka.consumer.ZookeeperConsumerConnector - [ems_SH-L08013-1568281779941-bdab7903], Error while committing offsets.
                          java.io.EOFException
                                   at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:83)
                                   at kafka.network.BlockingChannel.readCompletely(BlockingChannel.scala:129)
                                   at kafka.network.BlockingChannel.receive(BlockingChannel.scala:120)
                                   at kafka.consumer.ZookeeperConsumerConnector.liftedTree2$1(ZookeeperConsumerConnector.scala:355)
                                   at kafka.consumer.ZookeeperConsumerConnector.commitOffsets(ZookeeperConsumerConnector.scala:352)
                                   at kafka.consumer.ZookeeperConsumerConnector.commitOffsets(ZookeeperConsumerConnector.scala:332)
                                   at kafka.javaapi.consumer.ZookeeperConsumerConnector.commitOffsets(ZookeeperConsumerConnector.scala:108)
                                   at com.paic.mercury.esb.kafka.KafkaConsumer.commitOffsets(KafkaConsumer.java:38)
                                   at com.paic.mercury.esb.kafka.consumer.KafkaMessageFetcherPool.commitOffsets(KafkaMessageFetcherPool.java:47)
                                   at com.paic.mercury.esb.kafka.consumer.KafkaMessageConsumer.doWork(KafkaMessageConsumer.java:94)
                                   at com.paic.mercury.esb.kafka.consumer.KafkaMessageConsumer.run(KafkaMessageConsumer.java:44)
                                   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                                   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                                   at java.lang.Thread.run(Thread.java:748)
                          

                          Only the consumer side reports the error; there is nothing on the server side. See the retry sketch below.
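
                          The java.io.EOFException above only says the broker closed the offsets channel while a response was pending; the old consumer is expected to reconnect and retry. A minimal retry sketch, assuming the 0.8.2+ javaapi where commitOffsets(boolean retryOnFailure) is available:

                          import kafka.javaapi.consumer.ConsumerConnector;

                          public class CommitRetry {
                              // With retryOnFailure=true the connector re-locates the offset manager,
                              // reconnects, and retries the commit itself, backing off between attempts
                              // (governed by offsets.channel.backoff.ms), so a transient EOFException
                              // on the channel is absorbed instead of failing the commit.
                              static void commit(ConsumerConnector connector) {
                                  connector.commitOffsets(true);
                              }
                          }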

                            • The Kafka versions have now been made identical. I then noticed the ZooKeeper versions differ; I'm not sure whether that matters, so I've changed the consumer's ZooKeeper dependency to match the server's and will keep observing.

                                • Hi, could you explain what these two parameters do? The descriptions I found online are not very clear to me.
                                  offsets.channel.backoff.ms (default 1000): the backoff period before reconnecting the offsets channel, or before retrying a failed offset fetch/commit request.
                                  offsets.channel.socket.timeout.ms (default 10000): the socket timeout when reading the response to an offset fetch/commit request; it also applies to the ConsumerMetadata request used to look up the offset manager (see the sketch below).
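
                                  Put differently: the first controls how long the consumer backs off before reconnecting or retrying, the second how long it waits on any single response. A minimal sketch of setting both on the old high-level consumer, using the defaults quoted above and a hypothetical ZooKeeper address:

                                  import java.util.Properties;
                                  import kafka.consumer.Consumer;
                                  import kafka.consumer.ConsumerConfig;
                                  import kafka.javaapi.consumer.ConsumerConnector;

                                  public class OffsetsChannelTuning {
                                      public static void main(String[] args) {
                                          Properties props = new Properties();
                                          props.put("zookeeper.connect", "zk1:2181"); // hypothetical address
                                          props.put("group.id", "ems");
                                          // Wait 1000 ms before reconnecting the offsets channel or
                                          // retrying a failed offset fetch/commit request.
                                          props.put("offsets.channel.backoff.ms", "1000");
                                          // Fail a pending offset fetch/commit (and the offset-manager
                                          // lookup) if no response arrives within 10000 ms.
                                          props.put("offsets.channel.socket.timeout.ms", "10000");
                                          ConsumerConnector connector =
                                                  Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
                                          connector.shutdown();
                                      }
                                  }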