Kafka Monitoring

6.6 Monitoring

Kafka uses Yammer Metrics for metrics reporting in the server. The Java clients use Kafka Metrics, a built-in metrics registry that minimizes transitive dependencies pulled into client applications. Both expose metrics via JMX and can be configured to report stats through pluggable stats reporters that hook into your monitoring system.

All Kafka rate metrics have a corresponding cumulative count metric with the suffix -total. For example, records-consumed-rate has a corresponding metric named records-consumed-total.
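
The relationship between a rate metric and its -total counterpart can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and it assumes you already have two readings of a -total metric taken a known number of seconds apart.

```python
def total_metric_name(rate_metric: str) -> str:
    """Return the cumulative-count counterpart of a Kafka rate metric name."""
    suffix = "-rate"
    if not rate_metric.endswith(suffix):
        raise ValueError(f"not a rate metric: {rate_metric!r}")
    return rate_metric[: -len(suffix)] + "-total"


def average_rate(prev_total: float, curr_total: float, interval_seconds: float) -> float:
    """Average per-second rate between two samples of a -total metric."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if curr_total < prev_total:
        # A cumulative counter only goes down when the process restarts.
        raise ValueError("counter went backwards; the process probably restarted")
    return (curr_total - prev_total) / interval_seconds
```

Deriving the rate from the cumulative count lets the monitoring system choose its own averaging window instead of relying on the window used by the client.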

The easiest way to see the available metrics is to fire up jconsole and point it at a running Kafka client or server; this will allow browsing all metrics with JMX.

Security Considerations for Remote Monitoring using JMX

Apache Kafka disables remote JMX by default. To enable remote monitoring, set the environment variable JMX_PORT for processes started using the CLI, or set the standard Java system properties to enable remote JMX programmatically. When enabling remote JMX in production, you must also enable security so that unauthorized users cannot monitor or control your broker or application, or the platform they run on. Note that authentication is disabled for JMX by default in Kafka. For production deployments, override the security configs by setting the environment variable KAFKA_JMX_OPTS for processes started using the CLI, or by setting the appropriate Java system properties. See Monitoring and Management Using JMX Technology for details on securing JMX.
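
As a rough sketch of what an overridden security configuration might look like, the helper below assembles a value for KAFKA_JMX_OPTS from the standard JVM remote-JMX system properties. The helper name and the file paths are hypothetical; the right combination of properties depends on your JVM and environment, so treat this as a starting point, not a vetted production setup.

```python
def build_kafka_jmx_opts(password_file: str, access_file: str, ssl: bool = True) -> str:
    """Assemble JVM flags that turn on JMX authentication (and optionally SSL)."""
    props = {
        # Flag without a value: enables the JMX agent.
        "com.sun.management.jmxremote": None,
        # Kafka's default turns authentication off; force it on.
        "com.sun.management.jmxremote.authenticate": "true",
        "com.sun.management.jmxremote.ssl": "true" if ssl else "false",
        # Hypothetical file locations supplied by the caller.
        "com.sun.management.jmxremote.password.file": password_file,
        "com.sun.management.jmxremote.access.file": access_file,
    }
    parts = []
    for key, value in props.items():
        parts.append(f"-D{key}" if value is None else f"-D{key}={value}")
    return " ".join(parts)


if __name__ == "__main__":
    # Print a value that could be exported as KAFKA_JMX_OPTS before starting
    # the broker with the CLI scripts (together with JMX_PORT).
    print(build_kafka_jmx_opts("/etc/kafka/jmx.password", "/etc/kafka/jmx.access"))
```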

The metrics are described below:

| Description | MBean name | Normal value |
| --- | --- | --- |
| Message in rate | kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec | |
| Byte in rate from clients | kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec | |
| Byte in rate from other brokers | kafka.server:type=BrokerTopicMetrics,name=ReplicationBytesInPerSec | |
| Request rate | kafka.network:type=RequestMetrics,name=RequestsPerSec,request={Produce\|FetchConsumer\|FetchFollower} | |
| Error rate | kafka.network:type=RequestMetrics,name=ErrorsPerSec,request=([-.\w]+),error=([-.\w]+) | Number of errors in responses counted per-request-type, per-error-code. If a response contains multiple errors, all are counted. error=NONE indicates successful responses. |
| Request size in bytes | kafka.network:type=RequestMetrics,name=RequestBytes,request=([-.\w]+) | Size of requests for each request type. |
| Temporary memory size in bytes | kafka.network:type=RequestMetrics,name=TemporaryMemoryBytes,request={Produce\|Fetch} | Temporary memory used for message format conversions and decompression. |
| Message conversion time | kafka.network:type=RequestMetrics,name=MessageConversionsTimeMs,request={Produce\|Fetch} | Time in milliseconds spent on message format conversions. |
| Message conversion rate | kafka.server:type=BrokerTopicMetrics,name={Produce\|Fetch}MessageConversionsPerSec,topic=([-.\w]+) | Number of records which required message format conversion. |
| Byte out rate to clients | kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec | |
| Byte out rate to other brokers | kafka.server:type=BrokerTopicMetrics,name=ReplicationBytesOutPerSec | |
| Message validation failure rate due to no key specified for compacted topic | kafka.server:type=BrokerTopicMetrics,name=NoKeyCompactedTopicRecordsPerSec | |
| Message validation failure rate due to invalid magic number | kafka.server:type=BrokerTopicMetrics,name=InvalidMagicNumberRecordsPerSec | |
| Message validation failure rate due to incorrect crc checksum | kafka.server:type=BrokerTopicMetrics,name=InvalidMessageCrcRecordsPerSec | |
| Message validation failure rate due to non-continuous offset or sequence number in batch | kafka.server:type=BrokerTopicMetrics,name=InvalidOffsetOrSequenceRecordsPerSec | |
| Log flush rate and time | kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs | |
| # of under replicated partitions (\|ISR\| < \|all replicas\|) | kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions | 0 |
| # of under minIsr partitions (\|ISR\| < min.insync.replicas) | kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount | 0 |
| # of at minIsr partitions (\|ISR\| = min.insync.replicas) | kafka.server:type=ReplicaManager,name=AtMinIsrPartitionCount | 0 |
| # of offline log directories | kafka.log:type=LogManager,name=OfflineLogDirectoryCount | 0 |
| Is controller active on broker | kafka.controller:type=KafkaController,name=ActiveControllerCount | only one broker in the cluster should have 1 |
| Leader election rate | kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs | non-zero when there are broker failures |
| Unclean leader election rate | kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec | 0 |
| Pending topic deletes | kafka.controller:type=KafkaController,name=TopicsToDeleteCount | |
| Pending replica deletes | kafka.controller:type=KafkaController,name=ReplicasToDeleteCount | |
| Ineligible pending topic deletes | kafka.controller:type=KafkaController,name=TopicsIneligibleToDeleteCount | |
| Ineligible pending replica deletes | kafka.controller:type=KafkaController,name=ReplicasIneligibleToDeleteCount | |
| Partition counts | kafka.server:type=ReplicaManager,name=PartitionCount | mostly even across brokers |
| Leader replica counts | kafka.server:type=ReplicaManager,name=LeaderCount | mostly even across brokers |
| ISR shrink rate | kafka.server:type=ReplicaManager,name=IsrShrinksPerSec | If a broker goes down, ISR for some of the partitions will shrink. When that broker is up again, ISR will be expanded once the replicas are fully caught up. Other than that, the expected value for both ISR shrink rate and expansion rate is 0. |
| ISR expansion rate | kafka.server:type=ReplicaManager,name=IsrExpandsPerSec | See above |
| Max lag in messages between follower and leader replicas | kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica | lag should be proportional to the maximum batch size of a produce request. |
| Lag in messages per follower replica | kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=([-.\w]+),topic=([-.\w]+),partition=([0-9]+) | lag should be proportional to the maximum batch size of a produce request. |
| Requests waiting in the producer purgatory | kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Produce | non-zero if acks=-1 is used |
| Requests waiting in the fetch purgatory | kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Fetch | size depends on fetch.wait.max.ms in the consumer |
| Request total time | kafka.network:type=RequestMetrics,name=TotalTimeMs,request={Produce\|FetchConsumer\|FetchFollower} | broken into queue, local, remote and response send time |
| Time the request waits in the request queue | kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request={Produce\|FetchConsumer\|FetchFollower} | |
| Time the request is processed at the leader | kafka.network:type=RequestMetrics,name=LocalTimeMs,request={Produce\|FetchConsumer\|FetchFollower} | |
| Time the request waits for the follower | kafka.network:type=RequestMetrics,name=RemoteTimeMs,request={Produce\|FetchConsumer\|FetchFollower} | non-zero for produce requests when acks=-1 |
| Time the request waits in the response queue | kafka.network:type=RequestMetrics,name=ResponseQueueTimeMs,request={Produce\|FetchConsumer\|FetchFollower} | |
| Time to send the response | kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request={Produce\|FetchConsumer\|FetchFollower} | |
| Number of messages the consumer lags behind the producer by. Published by the consumer, not broker. | kafka.consumer:type=consumer-fetch-manager-metrics,client-id={client-id} Attribute: records-lag-max | |
| The average fraction of time the network processors are idle | kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent | between 0 and 1, ideally > 0.3 |
| The number of connections disconnected on a processor due to a client not re-authenticating and then using the connection beyond its expiration time for anything other than re-authentication | kafka.server:type=socket-server-metrics,listener=[SASL_PLAINTEXT\|SASL_SSL],networkProcessor=<#>,name=expired-connections-killed-count | ideally 0 when re-authentication is enabled, implying there are no longer any older, pre-2.2.0 clients connecting to this (listener, processor) combination |
| The total number of connections disconnected, across all processors, due to a client not re-authenticating and then using the connection beyond its expiration time for anything other than re-authentication | kafka.network:type=SocketServer,name=ExpiredConnectionsKilledCount | ideally 0 when re-authentication is enabled, implying there are no longer any older, pre-2.2.0 clients connecting to this broker |
| The average fraction of time the request handler threads are idle | kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent | between 0 and 1, ideally > 0.3 |
| Bandwidth quota metrics per (user, client-id), user or client-id | kafka.server:type={Produce\|Fetch},user=([-.\w]+),client-id=([-.\w]+) | Two attributes. throttle-time indicates the amount of time in ms the client was throttled. Ideally = 0. byte-rate indicates the data produce/consume rate of the client in bytes/sec. For (user, client-id) quotas, both user and client-id are specified. If per-client-id quota is applied to the client, user is not specified. If per-user quota is applied, client-id is not specified. |
| Request quota metrics per (user, client-id), user or client-id | kafka.server:type=Request,user=([-.\w]+),client-id=([-.\w]+) | Two attributes. throttle-time indicates the amount of time in ms the client was throttled. Ideally = 0. request-time indicates the percentage of time spent in broker network and I/O threads to process requests from client group. For (user, client-id) quotas, both user and client-id are specified. If per-client-id quota is applied to the client, user is not specified. If per-user quota is applied, client-id is not specified. |
| Requests exempt from throttling | kafka.server:type=Request | exempt-throttle-time indicates the percentage of time spent in broker network and I/O threads to process requests that are exempt from throttling. |
| ZooKeeper client request latency | kafka.server:type=ZooKeeperClientMetrics,name=ZooKeeperRequestLatencyMs | Latency in milliseconds for ZooKeeper requests from broker. |
| ZooKeeper connection status | kafka.server:type=SessionExpireListener,name=SessionState | Connection status of broker's ZooKeeper session which may be one of Disconnected\|SyncConnected\|AuthFailed\|ConnectedReadOnly\|SaslAuthenticated\|Expired. |
| Max time to load group metadata | kafka.server:type=group-coordinator-metrics,name=partition-load-time-max | maximum time, in milliseconds, it took to load offsets and group metadata from the consumer offset partitions loaded in the last 30 seconds |
| Avg time to load group metadata | kafka.server:type=group-coordinator-metrics,name=partition-load-time-avg | average time, in milliseconds, it took to load offsets and group metadata from the consumer offset partitions loaded in the last 30 seconds |
| Max time to load transaction metadata | kafka.server:type=transaction-coordinator-metrics,name=partition-load-time-max | maximum time, in milliseconds, it took to load transaction metadata from the consumer offset partitions loaded in the last 30 seconds |
| Avg time to load transaction metadata | kafka.server:type=transaction-coordinator-metrics,name=partition-load-time-avg | average time, in milliseconds, it took to load transaction metadata from the consumer offset partitions loaded in the last 30 seconds |
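
The "normal value" column lends itself to simple automated checks. The sketch below is one possible way to encode a few of those expectations. It assumes the metric values have already been collected into plain dictionaries keyed by the `name=` part of the MBean; how they are collected, and the function names, are hypothetical.

```python
# Metrics whose normal value in the table is 0.
EXPECT_ZERO = (
    "UnderReplicatedPartitions",
    "UnderMinIsrPartitionCount",
    "AtMinIsrPartitionCount",
    "OfflineLogDirectoryCount",
    "UncleanLeaderElectionsPerSec",
)
# Idle fractions: between 0 and 1, ideally above 0.3.
EXPECT_IDLE_ABOVE = (
    "NetworkProcessorAvgIdlePercent",
    "RequestHandlerAvgIdlePercent",
)
IDLE_FLOOR = 0.3


def check_broker(sample):
    """Return a list of human-readable problems for one broker's sample."""
    problems = []
    for name in EXPECT_ZERO:
        value = sample.get(name, 0)
        if value != 0:
            problems.append(f"{name}={value}, expected 0")
    for name in EXPECT_IDLE_ABOVE:
        value = sample.get(name)
        if value is not None and value <= IDLE_FLOOR:
            problems.append(f"{name}={value}, ideally > {IDLE_FLOOR}")
    return problems


def check_cluster(samples_by_broker):
    """Check every broker, plus the cluster-wide rule of exactly one active controller."""
    problems = {}
    for broker, sample in samples_by_broker.items():
        found = check_broker(sample)
        if found:
            problems[broker] = found
    active = sum(s.get("ActiveControllerCount", 0) for s in samples_by_broker.values())
    if active != 1:
        problems["cluster"] = [f"ActiveControllerCount sums to {active}, expected 1"]
    return problems
```

Metrics that are only transiently non-zero, such as the ISR shrink and expansion rates, are left out on purpose because they need a time window, not a single sample.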

Common monitoring metrics for producer/consumer/connect

The following metrics are available on producer/consumer/connector instances. For specific metrics, please see the following sections.

| Metric/attribute name | Description | MBean name |
| --- | --- | --- |
| connection-close-rate | Connections closed per second in the window. | kafka.[producer\|consumer\|connect]:type=[producer\|consumer\|connect]-metrics,client-id=([-.\w]+) |
| connection-creation-rate | New connections established per second in the window. | kafka.[producer\|consumer\|connect]:type=[producer\|consumer\|connect]-metrics,client-id=([-.\w]+) |
| network-io-rate | The average number of network operations (reads or writes) on all connections per second. | kafka.[producer\|consumer\|connect]:type=[producer\|consumer\|connect]-metrics,client-id=([-.\w]+) |
| outgoing-byte-rate | The average number of outgoing bytes sent per second to all servers. | kafka.[producer\|consumer\|connect]:type=[producer\|consumer\|connect]-metrics,client-id=([-.\w]+) |
| request-rate | The average number of requests sent per second. | kafka.[producer\|consumer\|connect]:type=[producer\|consumer\|connect]-metrics,client-id=([-.\w]+) |
| request-size-avg | The average size of all requests in the window. | kafka.[producer\|consumer\|connect]:type=[producer\|consumer\|connect]-metrics,client-id=([-.\w]+) |
| request-size-max | The maximum size of any request sent in the window. | kafka.[producer\|consumer\|connect]:type=[producer\|consumer\|connect]-metrics,client-id=([-.\w]+) |
| incoming-byte-rate | Bytes/second read off all sockets. | kafka.[producer\|consumer\|connect]:type=[producer\|consumer\|connect]-metrics,client-id=([-.\w]+) |
| response-rate | Responses received per second. | kafka.[producer\|consumer\|connect]:type=[producer\|consumer\|connect]-metrics,client-id=([-.\w]+) |
| select-rate | Number of times the I/O layer checked for new I/O to perform per second. | kafka.[producer\|consumer\|connect]:type=[producer\|consumer\|connect]-metrics,client-id=([-.\w]+) |
| io-wait-time-ns-avg | The average length of time the I/O thread spent waiting for a socket ready for reads or writes in nanoseconds. | kafka.[producer\|consumer\|connect]:type=[producer\|consumer\|connect]-metrics,client-id=([-.\w]+) |
| io-wait-ratio | The fraction of time the I/O thread spent waiting. | kafka.[producer\|consumer\|connect]:type=[producer\|consumer\|connect]-metrics,client-id=([-.\w]+) |
| io-time-ns-avg | The average length of time for I/O per select call in nanoseconds. | kafka.[producer\|consumer\|connect]:type=[producer\|consumer\|connect]-metrics,client-id=([-.\w]+) |
| io-ratio | The fraction of time the I/O thread spent doing I/O. | kafka.[producer\|consumer\|connect]:type=[producer\|consumer\|connect]-metrics,client-id=([-.\w]+) |
| connection-count | The current number of active connections. | kafka.[producer\|consumer\|connect]:type=[producer\|consumer\|connect]-metrics,client-id=([-.\w]+) |
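
The MBean name column is a pattern, not a literal name: the `client-id` part is filled in with whatever client id the running client uses. As an illustration, the following sketch turns that pattern into a regular expression and extracts the client id from a concrete MBean name. The function name is hypothetical, and it assumes the name appears unquoted, exactly in the form shown in the table.

```python
import re

# The domain (producer/consumer/connect) must match the type prefix, hence \1.
CLIENT_MBEAN = re.compile(
    r"kafka\.(producer|consumer|connect):type=\1-metrics,client-id=([-.\w]+)"
)


def parse_client_mbean(name):
    """Split a client-level MBean name into its parts, or return None if it does not match."""
    match = CLIENT_MBEAN.fullmatch(name)
    if match is None:
        return None
    kind, client_id = match.groups()
    return {
        "domain": f"kafka.{kind}",
        "type": f"{kind}-metrics",
        "client_id": client_id,
    }
```

Listing the MBeans of a running client and passing each name through a function like this is one way to discover which client ids are actually in use.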

Common per-broker metrics for producer/consumer/connect

The following metrics are available on producer/consumer/connector instances. For specific metrics, please see the following sections.

| Metric/attribute name | Description | MBean name |
| --- | --- | --- |
| outgoing-byte-rate | The average number of outgoing bytes sent per second for a node. | kafka.[producer\|consumer\|connect]:type=[consumer\|producer\|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+) |
| request-rate | The average number of requests sent per second for a node. | kafka.[producer\|consumer\|connect]:type=[consumer\|producer\|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+) |
| request-size-avg | The average size of all requests in the window for a node. | kafka.[producer\|consumer\|connect]:type=[consumer\|producer\|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+) |
| request-size-max | The maximum size of any request sent in the window for a node. | kafka.[producer\|consumer\|connect]:type=[consumer\|producer\|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+) |
| incoming-byte-rate | The average number of bytes received per second for a node. | kafka.[producer\|consumer\|connect]:type=[consumer\|producer\|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+) |
| request-latency-avg | The average request latency in ms for a node. | kafka.[producer\|consumer\|connect]:type=[consumer\|producer\|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+) |
| request-latency-max | The maximum request latency in ms for a node. | kafka.[producer\|consumer\|connect]:type=[consumer\|producer\|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+) |
| response-rate | Responses received per second for a node. | kafka.[producer\|consumer\|connect]:type=[consumer\|producer\|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+) |

Producer monitoring

The following metrics are available on producer instances.

| Metric/attribute name | Description | MBean name |
| --- | --- | --- |
| waiting-threads | The number of user threads blocked waiting for buffer memory to enqueue their records. | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
| buffer-total-bytes | The maximum amount of buffer memory the client can use (whether or not it is currently used). | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
| buffer-available-bytes | The total amount of buffer memory that is not being used (either unallocated or in the free list). | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
| bufferpool-wait-time | The fraction of time an appender waits for space allocation. | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
| batch-size-avg | The average number of bytes sent per partition per-request. | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
| batch-size-max | The max number of bytes sent per partition per-request. | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
| compression-rate-avg | The average compression rate of record batches. | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
| record-queue-time-avg | The average time in ms record batches spent in the record accumulator. | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
| record-queue-time-max | The maximum time in ms record batches spent in the record accumulator. | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
| request-latency-avg | The average request latency in ms. | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
| request-latency-max | The maximum request latency in ms. | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
| record-send-rate | The average number of records sent per second. | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
| records-per-request-avg | The average number of records per request. | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
| record-retry-rate | The average per-second number of retried record sends. | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
| record-error-rate | The average per-second number of record sends that resulted in errors. | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
| record-size-max | The maximum record size. | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
| record-size-avg | The average record size. | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
| requests-in-flight | The current number of in-flight requests awaiting a response. | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
| metadata-age | The age in seconds of the current producer metadata being used. | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
| record-send-rate | The average number of records sent per second for a topic. | kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+),topic=([-.\w]+) |
| byte-rate | The average number of bytes sent per second for a topic. | kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+),topic=([-.\w]+) |
| compression-rate | The average compression rate of record batches for a topic. | kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+),topic=([-.\w]+) |
| record-retry-rate | The average per-second number of retried record sends for a topic. | kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+),topic=([-.\w]+) |
| record-error-rate | The average per-second number of record sends that resulted in errors for a topic. | kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+),topic=([-.\w]+) |
| produce-throttle-time-max | The maximum time in ms a request was throttled by a broker. | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
| produce-throttle-time-avg | The average time in ms a request was throttled by a broker. | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
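
Some of these attributes are easier to alert on when combined. For example, buffer-total-bytes and buffer-available-bytes together give the fraction of producer buffer memory currently in use. The helper below is a hypothetical sketch of that derived value; it is not a metric Kafka itself exposes.

```python
def buffer_utilization(buffer_total_bytes: float, buffer_available_bytes: float) -> float:
    """Fraction of the producer's buffer memory currently in use (0.0 to 1.0)."""
    if buffer_total_bytes <= 0:
        raise ValueError("buffer_total_bytes must be positive")
    used = buffer_total_bytes - buffer_available_bytes
    return used / buffer_total_bytes
```

A value that stays close to 1.0 while waiting-threads is non-zero suggests that user threads are blocking on buffer space.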

New consumer monitoring

The following metrics are available on new consumer instances.

Consumer Group Metrics

| Metric/attribute name | Description | MBean name |
| --- | --- | --- |
| commit-latency-avg | The average time taken for a commit request | kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+) |
| commit-latency-max | The max time taken for a commit request | kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+) |
| commit-rate | The number of commit calls per second | kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+) |
| assigned-partitions | The number of partitions currently assigned to this consumer | kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+) |
| heartbeat-response-time-max | The max time taken to receive a response to a heartbeat request | kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+) |
| heartbeat-rate | The average number of heartbeats per second | kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+) |
| join-time-avg | The average time taken for a group rejoin | kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+) |
| join-time-max | The max time taken for a group rejoin | kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+) |
| join-rate | The number of group joins per second | kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+) |
| sync-time-avg | The average time taken for a group sync | kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+) |
| sync-time-max | The max time taken for a group sync | kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+) |
| sync-rate | The number of group syncs per second | kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+) |
| last-heartbeat-seconds-ago | The number of seconds since the last coordinator heartbeat | kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+) |

Consumer Fetch Metrics

| Metric/attribute name | Description | MBean name |
| --- | --- | --- |
| fetch-size-avg | The average number of bytes fetched per request | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+) |
| fetch-size-max | The maximum number of bytes fetched per request | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+) |
| bytes-consumed-rate | The average number of bytes consumed per second | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+) |
| records-per-request-avg | The average number of records in each request | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+) |
| records-consumed-rate | The average number of records consumed per second | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+) |
| fetch-latency-avg | The average time taken for a fetch request | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+) |
| fetch-latency-max | The max time taken for a fetch request | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+) |
| fetch-rate | The number of fetch requests per second | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+) |
| records-lag-max | The maximum lag in terms of number of records for any partition in this window | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+) |
| fetch-throttle-time-avg | The average throttle time in ms | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+) |
| fetch-throttle-time-max | The maximum throttle time in ms | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+) |

Topic-level Fetch Metrics

| Metric/attribute name | Description | MBean name |
| --- | --- | --- |
| fetch-size-avg | The average number of bytes fetched per request for a specific topic. | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),topic=([-.\w]+) |
| fetch-size-max | The maximum number of bytes fetched per request for a specific topic. | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),topic=([-.\w]+) |
| bytes-consumed-rate | The average number of bytes consumed per second for a specific topic. | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),topic=([-.\w]+) |
| records-per-request-avg | The average number of records in each request for a specific topic. | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),topic=([-.\w]+) |
| records-consumed-rate | The average number of records consumed per second for a specific topic. | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),topic=([-.\w]+) |

Others

We recommend monitoring GC time and other JVM stats, along with server stats such as CPU utilization and I/O service time. On the client side, we recommend monitoring the message/byte rate (global and per topic) and request rate/size/time. On the consumer side, also monitor the max lag in messages among all partitions and the min fetch request rate. For a consumer to keep up, max lag needs to be less than a threshold and min fetch rate needs to be larger than 0.
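
The keep-up rule in the last sentence can be written down directly. The sketch below assumes the per-partition lag values and the fetch request rates have already been sampled from the consumer metrics; the function name and the threshold are hypothetical and should be tuned per application.

```python
def consumer_keeping_up(lag_by_partition, fetch_rates, lag_threshold):
    """True when max lag is below the threshold and the min fetch rate is above 0."""
    max_lag = max(lag_by_partition.values(), default=0)
    # With no fetch-rate samples at all, treat the consumer as not fetching.
    min_fetch_rate = min(fetch_rates, default=0.0)
    return max_lag < lag_threshold and min_fetch_rate > 0
```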

Audit

The final alerting we do is on the correctness of the data delivery. We audit that every message that is sent is consumed by all consumers and measure the lag for this to occur. For important topics we alert if a certain completeness is not achieved in a certain time period. The details of this are discussed in KAFKA-260.
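
To make "completeness" concrete, here is a simplified illustration that compares per-time-bucket counts of sent and consumed messages. It is not the design discussed in KAFKA-260. The bucket labels, function names, and the assumption that such counts are available are all hypothetical.

```python
def completeness_by_bucket(sent, consumed):
    """Fraction of sent messages that were consumed, per time bucket."""
    result = {}
    for bucket, sent_count in sent.items():
        if sent_count == 0:
            continue  # nothing was sent, so there is nothing to audit
        # Cap at the sent count so duplicate deliveries cannot exceed 100%.
        seen = min(consumed.get(bucket, 0), sent_count)
        result[bucket] = seen / sent_count
    return result


def incomplete_buckets(sent, consumed, required=1.0):
    """Buckets whose completeness is below the required fraction, sorted by label."""
    fractions = completeness_by_bucket(sent, consumed)
    return sorted(bucket for bucket, frac in fractions.items() if frac < required)
```

An alert for an important topic would then fire when a bucket older than the allowed delay still appears in the result of `incomplete_buckets`.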

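luo 2 years ago

Has anyone here set up monitoring for the Kafka page cache hit ratio?

半兽人 -> luo 2 years ago

How would Kafka have a hit ratio? It is not like Redis.

luo -> 半兽人 2 years ago

I saw this metric mentioned in the official documentation: hitRatio-avg: The average cache hit ratio defined as the ratio of cache read hits over the total cache read requests.

https://kafka.apache.org/24/documentation.html#kafka_streams_cache_monitoring

That one is for Kafka Streams. Are you using it?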

黄永杰 3 years ago

Has anyone seen a dashboard show very large values for kafka_server_zookeeperclientmetrics_zookeeperrequestlatencyms?
The Kafka metrics show 17000+ for the 0.999 quantile, and the 0.99 quantile is also very high.
Does the original poster understand what this metric means?

半兽人 -> 黄永杰 3 years ago

Judging by the name, it is the latency (in milliseconds) of Kafka's requests to ZooKeeper. The larger the value, the higher the latency. That is just the literal meaning.

黄永杰 -> 黄永杰 3 years ago

This is the format in which the Kafka JMX metrics show this metric:

kafka_server_zookeeperclientmetrics_zookeeperrequestlatencyms{quantile="0.50"} 1.0
kafka_server_zookeeperclientmetrics_zookeeperrequestlatencyms{quantile="0.75"} 1.0
kafka_server_zookeeperclientmetrics_zookeeperrequestlatencyms{quantile="0.95"} 4.0
kafka_server_zookeeperclientmetrics_zookeeperrequestlatencyms{quantile="0.98"} 14587.7
kafka_server_zookeeperclientmetrics_zookeeperrequestlatencyms{quantile="0.99"} 17068.0
kafka_server_zookeeperclientmetrics_zookeeperrequestlatencyms{quantile="0.999",} 17068.0

An article I read describes quantiles like this: if the 0.9-quantile is 120, then 90% of all sampled values are below 120.

https://cloud.tencent.com/developer/news/319419

So it seems this cannot simply be read as a latency value...

https://grafana.com/grafana/dashboards/11962
This Prometheus dashboard computes the latency directly with sum(kafka_server_zookeeperclientmetrics_zookeeperrequestlatencyms{job=\"$job\",instance=~\"$broker\"})by(instance),
which does not seem right.

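李东 3 years ago

Does anyone have an example of monitoring a Kafka cluster with jmxtrans, writing the metrics to InfluxDB, and displaying them in Grafana? I am mainly looking for the JMX JSON and the Grafana dashboard JSON. Thanks!

半兽人 -> 李东 3 years ago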

https://grafana.com/grafana/dashboards/?search=kafka

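喵帕斯~ 6 years ago

How can I monitor when rebalances happen, how many times they occur, and similar information?

木木＆很呆 6 years ago

Why can't I find the kafka.consumer classes in JConsole?

走过你的风。 -> 木木＆很呆 6 years ago

Have you enabled JMX on your consumer?

Clive H -> 木木＆很呆 4 years ago

+1. A Baidu search says it is a version issue, for example that 2.2 removed kafka.consumer. I still have not found how to get the information under kafka.consumer.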

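鹰击长空 8 years ago

I have two questions for the experts here:

Is it feasible to monitor the maximum producer response wait time for each topic?
request-latency-max
The maximum request latency in ms.
```
kafka.producer:type=producer-metrics,client-id=([-.\w]+)
```
If I want to monitor this metric, what is the client-id when accessing it from a program, and how do I obtain it? Thanks.

半兽人 -> 鹰击长空 8 years ago

The consumer group ID you specified.

Explanation of client-id: an id string passed to the server when making requests. Its purpose is to let the server's request logging record this [logical application name], so that the source of requests can be tracked beyond just ip/port.

鹰击长空 -> 半兽人 8 years ago

Thank you for the explanation. In a real application, how do I obtain this client-id programmatically?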
