https://s3-us-west-2.amazonaws.com/secure.notion-static.com/1f12f77f-074e-4d5e-b5e0-ade7be399534/Untitled.png

Kafka is a distributed messaging system for collecting and delivering high volumes of log data with low latency

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/1d3954ba-9c37-448c-86a9-631d011270e9/Untitled.png

stream processing: Flink

JMS messaging system offers rich set of delivery guarantees, which are overkill for collecting log data. LinkedIn is okay with lossing some data.

Log = segment files of 1GB

Broker keeps in-memory sorted list of offsets (indexing)

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/35bb14ec-4401-48e9-81c8-de130c2c04c2/Untitled.png

avoid explictly caching messages in memory but instead relies on underlying filesystem's page cache.

Stateless broker

Partitioning

consumer groups == load balancing concept in DDIA

of partitions >> # of consumers

Kafka uses Fixed number of partition as the solution.

Every consumer owns 1 partition ⇒ avoid locking and state maintainance