Kafka is a distributed messaging system for collecting and delivering high volumes of log data with low latency

stream processing: Flink

JMS messaging system offers rich set of delivery guarantees, which are overkill for collecting log data. LinkedIn is okay with lossing some data.

Log = segment files of 1GB

Broker keeps in-memory sorted list of offsets (indexing)

avoid explictly caching messages in memory but instead relies on underlying filesystem's page cache.

Stateless broker

Partitioning

consumer groups == load balancing concept in DDIA

of partitions >> # of consumers

Kafka uses Fixed number of partition as the solution.

Every consumer owns 1 partition ⇒ avoid locking and state maintainance