Kafka relies heavily on the filesystem for storing and caching messages. There is a general perception that "disks are slow" which makes people skeptical that a persistent structure can offer competitive performance. In fact disks are both much slower and much faster than people expect depending on how they are used; and a properly designed disk structure can often be as fast as the network.
The key fact about disk performance is that the throughput of hard drives has been diverging from the latency of a disk seek for the last decade. As a result the performance of linear writes on a JBOD configuration with six 7200rpm SATA RAID-5 array is about 600MB/sec but the performance of random writes is only about 100k/sec—a difference of over 6000X. These linear reads and writes are the most predictable of all usage patterns, and are heavily optimized by the operating system. A modern operating system provides read-ahead and write-behind techniques that prefetch data in large block multiples and group smaller logical writes into large physical writes. A further discussion of this issue can be found in this ACM Queue article; they actually find that sequential disk access can in some cases be faster than random memory access!
一个有关磁盘性能的关键事实是：在过去的十年，磁盘驱动器的吞吐量跟磁盘寻道的延迟是相背离的。结果就是：JBOD配置的6个7200rpm SATA RAID-5 的磁盘阵列上线性写的速度大概是600M/秒，但是随机写的速度只有100K/秒，两者相差将近6000倍。线性读写在大多数应用场景下是可以预测的，因此，现代的操作系统提供了预读和写技术，将多个大块预取数据，并将较小的写入组合成一个大的物理写。更多的讨论可以在ACM Queue Artical中找到，他们发现，对磁盘的线性读在有些情况下可以比内存的随机访问要更快。
To compensate for this performance divergence modern operating systems have become increasingly aggressive in their use of main memory for disk caching. A modern OS will happily divert all free memory to disk caching with little performance penalty when the memory is reclaimed. All disk reads and writes will go through this unified cache. This feature cannot easily be turned off without using direct I/O, so even if a process maintains an in-process cache of the data, this data will likely be duplicated in OS pagecache, effectively storing everything twice.
Furthermore we are building on top of the JVM, and anyone who has spent any time with Java memory usage knows two things:
The memory overhead of objects is very high, often doubling the size of the data stored (or worse).
Java garbage collection becomes increasingly fiddly and slow as the in-heap data increases.
As a result of these factors using the filesystem and relying on pagecache is superior to maintaining an in-memory cache or other structure—we at least double the available cache by having automatic access to all free memory, and likely double again by storing a compact byte structure rather than individual objects. Doing so will result in a cache of up to 28-30GB on a 32GB machine without GC penalties. Furthermore this cache will stay warm even if the service is restarted, whereas the in-process cache will need to be rebuilt in memory (which for a 10GB cache may take 10 minutes) or else it will need to start with a completely cold cache (which likely means terrible initial performance). This also greatly simplifies the code as all logic for maintaining coherency between the cache and filesystem is now in the OS, which tends to do so more efficiently and more correctly than one-off in-process attempts. If your disk usage favors linear reads then read-ahead is effectively pre-populating this cache with useful data on each disk read.
由于这些原因，使用文件系统并依赖pagecache（页缓存）将优于维护一个内存缓存（或者其他结构） - 我们通过自动访问所有可用的内存使得可用的内存至少提高一倍。并可能通过存储压缩字节结构再次提高一倍。这将使得32G机器上高达28-32GB的缓存，并无需GC。此外，即使服务重新启动，该缓存保持可用，而进程内的缓存则需要在内存中重建（10GB缓存需要10分钟），否则将需要启动完全冷却的缓存（这意味着可怕的初始化性能）。这也大大简化了代码，因为在缓存和文件系统之间维持的一致性的所有逻辑现在都在OS中，这比一次性进程更加有效和更正确。如果你的磁盘支持线性的读取，那么预读取将有效地将每个磁盘中有用的数据预填充此缓存。
This suggests a design which is very simple: rather than maintain as much as possible in-memory and flush it all out to the filesystem in a panic when we run out of space, we invert that. All data is immediately written to a persistent log on the filesystem without necessarily flushing to disk. In effect this just means that it is transferred into the kernel's pagecache.
This style of pagecache-centric design is described in an article on the design of Varnish here (along with a healthy dose of arrogance).
常数时间就足够了（Constant Time Suffices）
The persistent data structure used in messaging systems are often a per-consumer queue with an associated BTree or other general-purpose random access data structures to maintain metadata about messages. BTrees are the most versatile data structure available, and make it possible to support a wide variety of transactional and non-transactional semantics in the messaging system. They do come with a fairly high cost, though: Btree operations are O(log N). Normally O(log N) is considered essentially equivalent to constant time, but this is not true for disk operations. Disk seeks come at 10 ms a pop, and each disk can do only one seek at a time so parallelism is limited. Hence even a handful of disk seeks leads to very high overhead. Since storage systems mix very fast cached operations with very slow physical disk operations, the observed performance of tree structures is often superlinear as data increases with fixed cache--i.e. doubling your data makes things much worse then twice as slow.
在消息系统中使用的持久数据结构常常具有相关联的BTree或其他通过随机访问数据结构的每个消费者队列，以维护关于消息的元数据。BTrees是可用的最通用的数据结构，可以在消息系统中支持各种各样的事务和非事务性语义。尽管，Btree的操作是O(log N)，但它们的成本相当高。通常O(log N)O(log N)基本上等同于恒定时间，但是磁盘操作不是这样，磁盘寻找在10ms的pop，每个磁盘一次只能做一次寻找，所以并行性受限制。因此，即使是少量的磁盘搜索导致非常高的开销。由于存储系统将非常快速的缓存操作与非常慢的物理磁盘操作相结合，因为数据随固定缓存而增加，所以观察到的树结构的性能通常是超线性的。- 即，你的数据翻倍则使得事情慢两倍还多。
Intuitively a persistent queue could be built on simple reads and appends to files as is commonly the case with logging solutions. This structure has the advantage that all operations are O(1) and reads do not block writes or each other. This has obvious performance advantages since the performance is completely decoupled from the data size—one server can now take full advantage of a number of cheap, low-rotational speed 1+TB SATA drives. Though they have poor seek performance, these drives have acceptable performance for large reads and writes and come at 1/3 the price and 3x the capacity.
直观上，持久队列可以建立在简单的读取和附加到文件上，就像日志解决方案的情况一样。 这种结构的优点是所有操作都是O(1)，并且读取不会阻塞写入或彼此。 这具有明显的性能优势，因为性能与数据大小完全分离 - 服务器现在可以充分利用这点，低转速 1+TB SATA驱动器。虽然这些驱动器的搜索性能不佳，但是对于大量的读写而言，这些驱动器具有可接受的性能，并且价格是1/3，能力为3倍。
Having access to virtually unlimited disk space without any performance penalty means that we can provide some features not usually found in a messaging system. For example, in Kafka, instead of attempting to deleting messages as soon as they are consumed, we can retain messages for a relative long period (say a week). This leads to a great deal of flexibility for consumers, as we will describe.
事实上，无需任何性能损失就可以访问几乎无限制的磁盘空间，这意味着我们可以提供一般消息传递系统无法提供的特性。 例如，在Kafka中，消息被消费后不是立马被删除，我们可以保留消息相对较长的时间（例如一个星期）。 这将为消费者带来很大的灵活性。