Is it normal for a Kafka message to take 400-600 ms from production to consumption?

Posted by 剑枫寒 on 2019-10-28, last updated 2019-10-28

Problem description

The end-to-end flow from producing a Kafka message to consuming it is as follows:

1. Time when the producer puts the message into its local buffer: 1572228409758 (epoch ms)
2. Network latency for the message to be successfully delivered to the broker, taken from the event callback: 2 ms

LOG-7-RECV: [thrd:IP:9292/bootstrap]: IP:9292/1: Received ProduceResponse (v7, 53 bytes, CorrId 2, rtt 2.48ms)

3. Time when the Kafka consumer receives the message: 1572228410256 (epoch ms)
Total elapsed time for the whole process: 498 ms

The Kafka cluster (broker) configuration is as follows:

broker.id=1
port=9292
host.name=xxx
listeners=PLAINTEXT://xxx:9292
advertised.listeners=PLAINTEXT://xxx:9292

auto.create.topics.enable = false
auto.leader.rebalance.enable=true

# Replication configurations
num.replica.fetchers=4
replica.fetch.max.bytes=1048576
replica.fetch.wait.max.ms=500
replica.high.watermark.checkpoint.interval.ms=5000
replica.socket.timeout.ms=30000
replica.socket.receive.buffer.bytes=65536
replica.lag.time.max.ms=10000

# Log configuration
num.partitions=8
num.recovery.threads.per.data.dir=1
message.max.bytes=1000000
log.index.interval.bytes=4096
log.index.size.max.bytes=10485760
log.retention.hours=24
log.retention.bytes=5000000000
log.retention.check.interval.ms=300000
log.flush.interval.ms=10000
log.flush.interval.messages=20000
log.flush.scheduler.interval.ms=2000
log.roll.hours=6
log.segment.bytes=2000000000
log.dirs=/data/data1/svckafka/logdata,/data/data2/svckafka/logdata,/data/data3/svckafka/logdata,/data/data4/svckafka/logdata,/data/data5/svckafka/logdata,/data/data6/svckafka/logdata


# ZK configuration
zookeeper.connect=xxx:2181,xxx:2181,xxx:2181,xxx:2181,xxx:2181
zookeeper.connection.timeout.ms=24000
zookeeper.sync.time.ms=2000

# Socket server configuration
num.io.threads=12
num.network.threads=20

socket.receive.buffer.bytes=1048576
socket.send.buffer.bytes=1048576

The Kafka producer configuration is as follows:

statistics.interval.ms: 5000
queue.buffering.max.ms: 500
debug: msg,protocol
request.required.acks: 1
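(The property names here, together with the dr_cb callback discussed further down, suggest the librdkafka C/C++ client. For reference, a minimal sketch of a producer that applies these settings and timestamps both the produce call and the delivery report; the broker address xxx:9292, the topic name "test" and the payload are placeholders I added, not values from this post.)

#include <librdkafka/rdkafkacpp.h>
#include <chrono>
#include <iostream>
#include <string>

static int64_t now_ms() {
  return std::chrono::duration_cast<std::chrono::milliseconds>(
      std::chrono::system_clock::now().time_since_epoch()).count();
}

// Delivery report callback: fires after the broker has acknowledged the message,
// but it is only invoked from inside producer->poll() / flush().
class LatencyDrCb : public RdKafka::DeliveryReportCb {
 public:
  void dr_cb(RdKafka::Message &message) override {
    std::cout << "dr_cb time: " << now_ms() << " ms, result: "
              << RdKafka::err2str(message.err()) << std::endl;
  }
};

int main() {
  std::string errstr;
  RdKafka::Conf *conf = RdKafka::Conf::create(RdKafka::Conf::CONF_GLOBAL);

  conf->set("bootstrap.servers", "xxx:9292", errstr);   // placeholder broker list
  conf->set("statistics.interval.ms", "5000", errstr);
  conf->set("queue.buffering.max.ms", "500", errstr);   // alias of linger.ms
  conf->set("request.required.acks", "1", errstr);

  LatencyDrCb dr_cb;
  conf->set("dr_cb", &dr_cb, errstr);

  RdKafka::Producer *producer = RdKafka::Producer::create(conf, errstr);

  std::string payload = "test message";
  std::cout << "produce time: " << now_ms() << " ms" << std::endl;
  producer->produce("test", RdKafka::Topic::PARTITION_UA,
                    RdKafka::Producer::RK_MSG_COPY,
                    const_cast<char *>(payload.data()), payload.size(),
                    nullptr, 0, 0, nullptr);

  producer->flush(10000);   // waits for the ack and serves the dr_cb
  delete producer;
  delete conf;
  return 0;
}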

The Kafka consumer configuration is as follows:

statistics.interval.ms: 5000
debug: consumer,security,broker
enable.partition.eof: true
max.partition.fetch.bytes: 10240000
auto.offset.reset: latest
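(Correspondingly, a minimal consumer sketch that applies these settings and stamps the receive time. bootstrap.servers, group.id and the topic name are placeholders I had to add; a group.id is mandatory for the high-level consumer but is not shown in the post.)

#include <librdkafka/rdkafkacpp.h>
#include <chrono>
#include <iostream>
#include <string>

static int64_t now_ms() {
  return std::chrono::duration_cast<std::chrono::milliseconds>(
      std::chrono::system_clock::now().time_since_epoch()).count();
}

int main() {
  std::string errstr;
  RdKafka::Conf *conf = RdKafka::Conf::create(RdKafka::Conf::CONF_GLOBAL);

  conf->set("bootstrap.servers", "xxx:9292", errstr);   // placeholder broker list
  conf->set("group.id", "latency-test", errstr);        // placeholder, required by the consumer
  conf->set("statistics.interval.ms", "5000", errstr);
  conf->set("enable.partition.eof", "true", errstr);
  conf->set("max.partition.fetch.bytes", "10240000", errstr);
  conf->set("auto.offset.reset", "latest", errstr);

  RdKafka::KafkaConsumer *consumer = RdKafka::KafkaConsumer::create(conf, errstr);
  consumer->subscribe({"test"});                         // placeholder topic

  while (true) {
    RdKafka::Message *msg = consumer->consume(1000);
    if (msg->err() == RdKafka::ERR_NO_ERROR) {
      // This is the "consumer receive time" used in the latency comparison.
      std::cout << "consume time: " << now_ms() << " ms, "
                << msg->len() << " bytes" << std::endl;
    }
    delete msg;
  }
}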



  • The producer, the Kafka cluster, and the consumer are all in the same machine room.

    • Is there any pressure on the network? You can try Kafka's built-in benchmarking command:

      bin/kafka-producer-perf-test.sh --topic test --num-records 100 --record-size 1 --throughput 100  --producer-props bootstrap.servers=localhost:9092
      
        • Test scenario 1 (columns: total number of messages with w = 10,000, record size in bytes, max messages per second where -1 means unlimited, throughput, per-record and overall latency, total elapsed time):

           1w 1000 -1(无限制) 14367.816092 records/sec (13.70 MB/sec) 108.62 ms avg latency, 285.00 ms max latency, 99 ms 50th, 205 ms 95th, 209 ms 99th, 211 ms 99.9th. 1.4s
           10w 1000 -1 47687.172151 records/sec (45.48 MB/sec) 380.84 ms avg latency, 905.00 ms max latency, 355 ms 50th, 834 ms 95th, 889 ms 99th, 904 ms 99.9th. 3.68s
           100w 1000 -1 74090.538638 records/sec (70.66 MB/sec) 408.31 ms avg latency, 1213.00 ms max latency, 346 ms 50th, 972 ms 95th, 1054 ms 99th, 1205 ms 99.9th. 15.1s
           500w 1000 -1  80029.450838 records/sec (76.32 MB/sec) 403.19 ms avg latency, 1406.00 ms max latency, 283 ms 50th, 1071 ms 95th, 1229 ms 99th, 1351 ms 99.9th. 63.42s
          
                • Could this be caused by the setting queue.buffering.max.ms: 500? batch.num.messages defaults to 10000, so if fewer than 10000 messages have accumulated in the producer's local queue, does the producer wait 500 ms before sending them to the broker?

                    • Test scenario
                      linger.ms   producer-send-time   dr_cb-time   consumer-recv-time   deliver-latency(ms)   produce-to-consume(ms)
                      10 1572404702849 1572404703497 1572404702906 648 57
                      50 1572405989275 1572405989658 1572405989368 383 93
                      100 1572406268934 1572406269603 1572406269083 669 149
                      200 1572406433124 1572406434254 1572406433372 1130 248
                      300 1572406602121 1572406602595 1572406602469 474 348
                      500 1572406767063 1572406767954 1572406767610 891 547
                      1000 1572406906579 1572406908399 1572406907637 1820 1058
                      5000 1572407077313 1572407083045 1572407082377 5732 5064

                      A rough takeaway from the test: the time until the consumer receives the message grows as linger.ms increases. But one thing is still unclear to me: dr_cb is the callback fired after the message has reached the broker, so why does it consistently take 400 to 1000 ms? The producer's message reaches the broker very quickly, yet the acknowledgment takes far longer to come back.

                        • dr_cb is specific to the C++ client. Going by your description it is indeed the post-broker callback, so my rough guess is: the producer "sends" very quickly, but the message is actually still sitting in the local buffer and is not shipped to Kafka right away (governed by linger.ms and batch.size), and that is what produces the 400-1000 ms.

                            • linger.ms   producer-send-time   dr_cb-time   consumer-recv-time   deliver-latency(ms)   produce-to-consume(ms)
                              10 1572404702849 1572404703497 1572404702906 648 57

                              Looking at just the first row (the later rows show the same pattern): the producer send time is 1572404702849 and the consumer receive time is 1572404702906, a latency of 57 ms, so the message was already on the broker by then. Yet the dr_cb time is 1572404703497; even if we assume the delivery callback can only return after the consumer has received the message, that still leaves a gap of 591 ms in between.
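                              For what it's worth, one librdkafka detail that matters when reading the dr_cb column: the delivery report is not handed over at the instant the broker's acknowledgment arrives; librdkafka queues it internally and only invokes dr_cb the next time the application calls producer->poll() (or flush()). If the producing thread polls infrequently, the measured dr_cb timestamp lags the real ack by however long the thread waited before polling. A minimal sketch of keeping the callbacks prompt, assuming a producer configured as in the sketches above:

                              #include <librdkafka/rdkafkacpp.h>

                              // Serve any queued delivery reports without blocking. Calling this right
                              // after every produce() (and periodically while idle) keeps the timestamp
                              // taken inside dr_cb close to the moment the broker actually acked.
                              void serve_delivery_reports(RdKafka::Producer *producer) {
                                producer->poll(0);
                              }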