Introduction
Logs provide analytics and insights into user interactions, system metrics, and the operations of a service. Recently, applications have also been using these logs to build user-facing features such as recommendations and advertisements. Managing this log data in real time is challenging because the volume of logs recorded is much higher than that of the application data itself. Originally, log data was put to use by scraping log files; this was later replaced by specialised log aggregators. Kafka is a distributed, low-latency messaging system that lets applications subscribe to events and receive messages in real time, while also making the data available offline for later use.
Traditional Systems
The messaging systems that preceded Kafka were not adequate for log processing. They were not designed for distribution or high throughput; instead they centred on delivery guarantees and acknowledgements, which matter for traditional messaging but are overkill for handling logs. They also assumed messages would be consumed almost immediately and could not hold large backlogs of logs for offline or periodic consumption.
How Other Companies Handle Log Data
1. Facebook: Scribe
- Frontend machines send log data to Scribe machines over sockets.
- Scribe machines aggregate log entries and periodically dump them to HDFS or an NFS device.
2. Yahoo: Data Highway Project
- Machines aggregate events from clients and create “minute” files.
- These files are then added to HDFS.
3. Cloudera: Flume
- A log aggregator that supports extensible “pipes” and “sinks”.
- Makes streaming log data very flexible.
- Offers integrated distributed support.
Kafka's Architecture
Kafka's architecture has a few basic components:
- topic: a stream of messages of a certain type/category.
- message: a payload of bytes, which can be encoded in any format.
- producer: a process that publishes messages to a topic.
- consumer: a process that consumes the published messages, either in real time or later.
- brokers: a set of servers on which the published messages are stored.
To consume messages for a topic, a consumer creates one or more message streams for that topic. Published messages are distributed evenly amongst these streams, and each stream provides an iterator interface that the consumer uses to process the messages. A special characteristic of the stream iterator is that it never terminates: if there are no more messages, it blocks until a new one arrives. In addition, a topic is divided into multiple partitions spread across the brokers, so that load is distributed in a clustered deployment.
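The original paper describes a stream-iterator API; the sketch below shows the equivalent behaviour with today's Kafka Java client, where poll() blocks (up to a timeout) until messages arrive and then hands back a batch to process. The broker address, group name, and topic name are placeholders, not values from the paper.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ClickConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("group.id", "click-processors");           // consumer group (discussed later)
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("user-clicks"));       // placeholder topic name
            while (true) {
                // Like the paper's never-terminating iterator, poll() blocks (up to the
                // timeout) when there are no messages, and returns new ones as they arrive.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                                      record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```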
How are individual partitions made efficient?
Each partition of a topic corresponds to a logical log, implemented physically as a set of segment files of roughly the same size. When a new message is published, the broker appends it to the last segment file. For better performance, messages are flushed to disk in batches, and a message is only exposed to consumers after it has been flushed. Messages in Kafka have no explicit message ID; instead, each message is addressed by its logical offset in the log. Consumers read messages from a partition in order, and the ID of the next message is computed by adding the length of the current message to its own ID. Note that Kafka only guarantees ordering for messages within the same partition; it does not guarantee ordering across partitions.

Message requests from consumers are called pull requests and can fetch multiple messages at once, improving performance. Kafka does not cache messages itself; it relies on the operating system's page cache, which avoids double buffering and garbage-collection overhead. It also uses the sendfile API on UNIX-like systems to speed up sending a local file to a remote socket. Brokers are stateless: they do not track whether a consumer has consumed a given message, and instead rely on a time-based retention policy to delete old messages. This also enables a rewind capability, where an application can deliberately go back and re-consume messages that would normally be consumed in order.
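To make the offset arithmetic described above concrete, here is a small sketch: a message's ID is its byte offset in the partition log, so the next ID is the current ID plus the size of the current message as stored on disk. The per-message sizes below are made-up values, and any framing overhead is simply folded into the size rather than modelling Kafka's exact on-disk format.

```java
/**
 * Illustration of logical-offset addressing: a message's ID is its byte
 * offset in the partition's log, so the ID of the next message is the
 * current ID plus the size of the current message on disk.
 */
public final class OffsetMath {
    static long nextOffset(long currentOffset, long messageSizeOnDisk) {
        return currentOffset + messageSizeOnDisk;
    }

    public static void main(String[] args) {
        long offset = 0;
        long[] messageSizes = {212, 198, 205};   // hypothetical message sizes in bytes
        for (long size : messageSizes) {
            System.out.printf("message at offset %d, next message at offset %d%n",
                              offset, nextOffset(offset, size));
            offset = nextOffset(offset, size);
        }
    }
}
```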
Kafka in a distributed environment
It is important to understand how producers, consumers, and brokers interact in a distributed environment, since that is how Kafka is used in large-scale applications. Producers have it easy: they can publish messages either to a randomly selected partition or to one chosen by a partitioning function that uses a partitioning key to determine the partition (a keyed-producer sketch follows the list below). A consumer group is a set of consumers that jointly consume a set of subscribed topics, and each message is delivered to exactly one member of the group. Two rules keep the messages divided evenly without heavy coordination:
- A partition is consumed by only one consumer within a group at a time, which avoids per-message coordination overhead.
- Consumers coordinate amongst themselves to decide which consumer owns which partition. This is done using Zookeeper.
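Below is a minimal sketch of publishing with a partitioning key using the modern Java client. When a record carries a key, the default partitioner hashes it to pick a partition, so all messages with the same key land on the same partition and are therefore seen in order by a single consumer of the group. The topic, key, value, and broker address are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ClickProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // "user-42" acts as the partitioning key: the default partitioner hashes it,
            // so all events for this user go to the same partition of "user-clicks".
            producer.send(new ProducerRecord<>("user-clicks", "user-42", "clicked:home"));
            producer.flush();   // make sure the batched message actually goes out
        }
    }
}
```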
Kafka's Delivery Guarantees
In other systems, guaranteeing that a message is delivered exactly once usually requires a two-phase commit protocol. Kafka takes a different approach and aims for at-least-once delivery: a consumer may receive the same message more than once, typically after a failure, and must handle de-duplication itself. Kafka uses CRC checksums to ensure that the logs are free of corrupted messages.
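Since duplicates are the application's problem under at-least-once delivery, a common pattern is to de-duplicate on the consumer side. The sketch below assumes the application treats a message's (partition, offset) pair as its identity and remembers what it has already processed; the in-memory set is purely illustrative, and a real application would persist this state alongside its output.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Sketch of consumer-side de-duplication under at-least-once delivery.
 * After a failure, the same (partition, offset) may be delivered again;
 * the application remembers what it has processed and skips repeats.
 */
public class DedupingHandler {
    private final Set<String> processed = new HashSet<>();

    public void handle(int partition, long offset, String value) {
        String messageId = partition + "-" + offset;   // unique within a topic
        if (!processed.add(messageId)) {
            return;   // duplicate caused by a redelivery; ignore it
        }
        System.out.println("processing " + value);
    }
}
```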
Kafka at LinkedIn
The image above shows a skeleton overview of how things run at LinkedIn with Kafka in the picture. Frontend services generate log events and publish them to brokers in batches, with a load balancer distributing the publish requests. Real-time services consume these logs directly from the brokers. A separate Kafka cluster is deployed in a different datacenter, close to the Hadoop cluster and data warehouses, for offline analysis of the data, which is performed by periodically executed jobs. Additionally, LinkedIn runs an auditing system to verify that no data was lost along the way. Avro is used as the data serialisation protocol: the ID of the Avro schema and the serialised bytes are included in the payload of each message.
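That payload framing is straightforward to sketch: a schema identifier followed by the Avro-serialised bytes, so a consumer can look up the schema before decoding. The 4-byte integer ID used below is an assumption made for illustration; the paper only states that the payload contains the schema ID and the serialised bytes, not the exact encoding.

```java
import java.nio.ByteBuffer;

/**
 * Sketch of a payload that carries an Avro schema id followed by the
 * Avro-serialised bytes. The 4-byte integer id is an illustrative choice.
 */
public final class AvroPayload {
    static byte[] frame(int schemaId, byte[] avroBytes) {
        return ByteBuffer.allocate(4 + avroBytes.length)
                         .putInt(schemaId)      // schema id first...
                         .put(avroBytes)        // ...then the serialised record
                         .array();
    }

    static int readSchemaId(byte[] payload) {
        // The consumer reads the id and looks up the matching Avro schema
        // before deserialising the rest of the payload.
        return ByteBuffer.wrap(payload).getInt();
    }
}
```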
Experimental Results
The team at LinkedIn compared the performance of Kafka with Apache ActiveMQ and RabbitMQ using two Linux machines (8 2GHz cores, 16GB memory, RAID 10, 1Gb network link). One of the machines acted as the broker while the other was either the producer or consumer.
Producer Test: Each system's broker was set to flush messages asynchronously, and a single producer published 10 million messages of around 200 bytes each. Kafka achieved 50,000 and 400,000 messages per second for batch sizes of 1 and 50 respectively, performing better than both ActiveMQ and RabbitMQ. Kafka's high throughput comes from its non-blocking design, efficient storage format, and batching.
Consumer Test: A single consumer retrieved 10 million messages, with each system configured to prefetch up to 1000 messages (roughly 200 KB) at a time. Kafka consumed 22,000 messages per second, about four times more than ActiveMQ and RabbitMQ, benefiting from its efficient storage format, reduced transmission overhead, and minimal broker-side state.
Overall, Kafka performed better in both tests. This shows that even though the other systems offer more features, Kafka's specialised design can deliver much better performance for log-processing workloads.
References and Additional Resources
Interesting twitter thread on how Kafka was used at Slack: https://x.com/bdkozlovski/status/1807068344052543634
Processing trillions of Kafka messages at Walmart: https://medium.com/walmartglobaltech/reliably-processing-trillions-of-kafka-messages-per-day-23494f553ef9
Kafka documentation: https://kafka.apache.org/documentation