Date: Apr 25, 2023

Fundamentals of Kafka

🚧
In this post, let’s understand some critical fundamental concepts pertaining to Kafka. If you feel you are already well-versed in them, feel free to skip this post.

Welcome to this article on Kafka fundamentals! Kafka is a powerful tool for building distributed systems that can handle large volumes of data. Whether you're working with streaming data, event-driven architectures, or real-time analytics, Kafka provides a reliable, scalable, and fault-tolerant messaging platform that can help you achieve your goals.

This post will cover the basics of Kafka, mostly its architecture and key concepts. We'll also explore some of the most common Kafka features and components, such as producers, consumers, topics, and partitions. By the end of this post, you should have a good understanding of how Kafka works and of some clever design decisions that make Kafka so great. We’ll explore the various generic as well as some really interesting use cases in the next post.

So, whether you're a software engineer, data scientist, or systems architect, read on to learn more about the fundamentals of Kafka!

Kafka Architecture

Kafka is a distributed streaming platform that is designed to handle large volumes of data in a fault-tolerant and scalable manner. The architecture of Kafka is based on the publisher-subscriber model, where producers publish messages to Kafka topics, and consumers subscribe to these topics to consume the messages.

Following are the main components of Kafka Architecture:

Topics

Kafka topics are the channels through which messages are published and consumed in the Kafka architecture. They are like tables in databases.

Functionality: A Kafka topic is a named stream of records. Each record is a key-value pair that represents a message. The messages are produced by Kafka producers and consumed by Kafka consumers.

👉🏻
Kafka topics can contain any kind of message in any format, and the sequence of all these messages is called a data stream.

Partitioning: A Kafka topic can be partitioned across multiple Kafka brokers in a Kafka cluster. Each partition is an ordered and immutable sequence of records. The number of partitions in a topic determines the parallelism of message processing, and the partitioning scheme determines the distribution of data across the cluster.

Each partition in a Kafka topic is replicated across multiple brokers to provide fault-tolerance and availability. One broker acts as the leader for a partition, and the others act as followers. The leader is responsible for handling read and write requests for the partition, while the followers replicate the data from the leader.

In short: the number of partitions determines the parallelism of message processing, while the number of replicas determines the level of fault tolerance.

Kafka topics have a retention policy that determines how long messages are retained in the topic. The retention policy can be based on time or size, and it can be configured on a per-topic basis.
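
To make this concrete, here is a minimal sketch that creates a topic with an explicit partition count, replication factor, and per-topic retention period using Kafka's Java AdminClient. The topic name "orders", the broker address, and the specific numbers are placeholder choices for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions for parallelism, replication factor 3 for fault tolerance
            NewTopic topic = new NewTopic("orders", 6, (short) 3);
            // Per-topic time-based retention: keep messages for 7 days
            topic.configs(Map.of("retention.ms", "604800000"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```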

Topics can be spread across multiple Kafka brokers to handle changes in workload. As brokers are added to or removed from the cluster, partitions can be reassigned across the brokers to maintain an even distribution of data.

These topics can be secured using access control lists (ACLs) and encryption. ACLs can be used to restrict access to topics, and encryption can be used to secure the data in transit and at rest.

🔥
Kafka Offset: The offset is an integer value that Kafka adds to each message as it is written into a partition. Each message in a given partition has a unique offset.

Brokers

Kafka brokers are the core components of the Kafka architecture. They are responsible for storing and managing the messages that are published to Kafka topics. These are the machines that actually run the Kafka server software. Each broker in a Kafka cluster stores one or more partitions of a Kafka topic.

Brokers work together in a cluster to provide fault-tolerance and scalability. The brokers communicate with each other to keep the data consistent across the cluster. The brokers are also responsible for handling requests from producers and consumers. An ensemble of Kafka brokers working together is called a Kafka cluster. A cluster may contain just a single broker, or three, or potentially hundreds of brokers.

🔥
Companies like Netflix and Uber run hundreds or thousands of Kafka brokers to handle their data.

The advanced internals of how Kafka stores data will be discussed in upcoming articles.

Each broker in a Kafka cluster has a unique identifier called the broker ID, a number assigned to the broker when it starts up. Other brokers in the cluster use this ID to communicate with it.

Kafka brokers use replication to provide fault-tolerance. As described earlier, each partition is replicated across multiple brokers, with the leader handling read and write requests for the partition and the followers replicating its data.

Brokers store messages on disk, providing durability and persistence even if the broker fails. Messages are retained on disk for a configurable amount of time, determined by the retention policy.

Producers

Kafka producers are the components responsible for publishing messages to Kafka topics. Each message is a key-value pair that represents a record.

When a producer sends a message to a topic, it can optionally specify a partition key. If no partition key is specified, the producer uses a default partitioning strategy to distribute messages across all partitions in a round-robin fashion. If a partition key is specified, the producer uses a hash-based partitioning strategy to map the key to a specific partition.
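
Both strategies are visible in the producer API. Below is a minimal sketch using the Java client; the topic "orders" and the key "customer-42" are made-up values for illustration.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PartitioningExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keyed record: the key is hashed, so "customer-42" always maps
            // to the same partition, preserving per-key ordering.
            producer.send(new ProducerRecord<>("orders", "customer-42", "order created"));

            // Unkeyed record: the producer distributes these across partitions.
            producer.send(new ProducerRecord<>("orders", "heartbeat"));
        }
    }
}
```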

The producer can also specify the delivery semantics for the messages it publishes. Kafka supports three delivery semantics, with a configuration sketch after the list:

  • At most once: Messages are sent without any guarantee of delivery. In this mode, the producer does not wait for acknowledgements from the brokers before sending the next message.
  • At least once: Messages are sent with a guarantee that they will be delivered at least once. In this mode, the producer waits for acknowledgements from the brokers before sending the next message. If an acknowledgement is not received, the producer retries sending the message until it receives an acknowledgement.
  • Exactly once: Messages are sent with a guarantee that they will be delivered exactly once. This mode requires coordination between the producer and consumer, as well as support for transactional semantics.
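
As a rough sketch, these semantics map onto the Java producer configuration as follows. The transactional id is a placeholder, and exactly-once additionally requires driving the producer through its transactional API (initTransactions, beginTransaction, commitTransaction).

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class DeliverySemanticsConfigs {
    // At most once: fire-and-forget, the producer does not wait for acks.
    static Properties atMostOnce() {
        Properties p = new Properties();
        p.put(ProducerConfig.ACKS_CONFIG, "0");
        return p;
    }

    // At least once: wait for acknowledgement and retry on failure
    // (duplicates are possible if an acknowledgement is lost).
    static Properties atLeastOnce() {
        Properties p = new Properties();
        p.put(ProducerConfig.ACKS_CONFIG, "all");
        p.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        return p;
    }

    // Exactly once (producer side): idempotent writes plus a transactional id.
    static Properties exactlyOnce() {
        Properties p = new Properties();
        p.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        p.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "orders-txn-1");
        return p;
    }
}
```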

The producer can also compress messages before sending them to Kafka. Kafka supports several compression codecs, including GZIP, Snappy, and LZ4.

Asynchronous Operations: Kafka producers can send messages asynchronously, allowing them to continue processing without waiting for acknowledgements from the brokers. This can improve throughput and reduce latency, but it also increases the risk of message loss if the producer crashes before the messages are acknowledged.
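
Here is a hedged sketch of an asynchronous send with a completion callback, which also shows the compression setting mentioned above. The topic, key, and value are illustrative.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AsyncProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Compress message batches before sending them to the broker
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send() is asynchronous: it returns immediately, and the callback
            // fires once the broker acknowledges (or rejects) the batch.
            producer.send(new ProducerRecord<>("orders", "customer-42", "order created"),
                (metadata, exception) -> {
                    if (exception != null) {
                        exception.printStackTrace(); // message may be lost; handle/retry here
                    } else {
                        System.out.printf("Written to partition %d at offset %d%n",
                            metadata.partition(), metadata.offset());
                    }
                });
        } // close() flushes any buffered, unsent messages
    }
}
```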

Consumers

Kafka consumers are the client components responsible for consuming messages from Kafka topics.

When a consumer joins a consumer group, it is assigned one or more partitions to consume messages from. The partition assignment is managed by a group coordinator, which is a Kafka broker responsible for coordinating the assignment of partitions to consumers.

How do the consumers know there is a message to be consumed? They read messages from Kafka by periodically polling the Kafka brokers for new records. The frequency of polling can be controlled using the consumer configuration settings.

Consumers maintain their position in the topic using offsets. They can commit their offsets periodically to ensure that they do not reprocess messages they have already read. Kafka provides both automatic and manual offset management options. A consumer always reads data from a lower offset to a higher offset and cannot read data backwards (due to how Apache Kafka and its clients are implemented).
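
Putting the last few paragraphs together, here is a minimal sketch of a polling consumer with manual offset commits, using the Java client; the topic and group names are placeholders.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Manual offset management: commit only after processing succeeds
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                // Periodically poll the brokers for new records
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                        record.partition(), record.offset(), record.key(), record.value());
                }
                // Commit processed offsets so this group does not reprocess them
                consumer.commitSync();
            }
        }
    }
}
```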

⚠️
If the consumer consumes data from more than one partition, the message order is not guaranteed across multiple partitions because they are consumed simultaneously, but the message read order is still guaranteed within each individual partition.

Since consumers can process messages in parallel within a consumer group, they allow for high throughput. Each partition in a topic is assigned to only one consumer within a consumer group to ensure that messages are processed in order within a partition.

If a consumer leaves or joins a consumer group, or if the partition assignment changes for any other reason, Kafka triggers a rebalancing operation to redistribute the partitions across the consumers in the group.

Consumer Group

Kafka consumer groups are a way of scaling Kafka consumers to handle high volumes of messages by collectively consuming them from one or more Kafka topics. Each consumer within the group reads messages from a subset of the partitions in the topic.

Kafka consumer groups also provide load balancing of message processing across multiple consumers. When a new consumer joins the group, or an existing consumer leaves the group, Kafka automatically rebalances the partition assignments to ensure that each consumer is processing roughly the same number of messages.

Each consumer group maintains its own set of offsets, allowing consumers to work independently and avoiding duplicate processing of messages.

👉🏻
Kafka brokers use an internal topic named __consumer_offsets that keeps track of what messages a given consumer group last successfully processed.
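
If you want to inspect what a group has committed, the Java AdminClient can read these offsets back. A small sketch, reusing the hypothetical "order-processors" group from the consumer example above:

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class GroupOffsetsExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Read the offsets most recently committed by the group
            Map<TopicPartition, OffsetAndMetadata> offsets =
                admin.listConsumerGroupOffsets("order-processors")
                     .partitionsToOffsetAndMetadata().get();
            offsets.forEach((tp, om) ->
                System.out.printf("%s -> offset %d%n", tp, om.offset()));
        }
    }
}
```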

These groups help provide fault tolerance for message processing. If a consumer fails or is shut down, the partition assignment is rebalanced automatically to ensure that the remaining consumers continue processing messages without interruption.

⚠️
Each topic partition is only assigned to one consumer within a consumer group, but a consumer from a consumer group can be assigned multiple partitions. So there is a one-to-many mapping between consumers and partitions.

How Does a Connection to Kafka Happen?

A client that wants to send or receive messages from the Kafka cluster may connect to any broker in the cluster. Every broker in the cluster has metadata about all the other brokers and will help the client connect to them as well, and therefore any broker in the cluster is also called a bootstrap server.

The bootstrap server returns metadata to the client, consisting of a list of all the brokers in the cluster. Then, when required, the client knows exactly which broker to connect to in order to send or receive data, and can accurately find which brokers hold the relevant topic partition.

It is common for a Kafka client to reference at least two bootstrap servers in its connection URL, so that if one of them is unavailable, the other can still respond to the connection request. This means that Kafka clients (and developers/DevOps) do not need to know every single hostname of every broker in a Kafka cluster; they only need to reference two or three in the connection string for clients.
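
In the Java clients this is a single comma-separated setting; a sketch with placeholder hostnames:

```java
import java.util.Properties;
import org.apache.kafka.clients.CommonClientConfigs;

public class BootstrapConfig {
    static Properties clientProps() {
        Properties props = new Properties();
        // Two or three brokers are enough; whichever answers first returns
        // metadata describing every broker in the cluster.
        props.put(CommonClientConfigs.BOOTSTRAP_SERVERS_CONFIG,
            "broker1.example.com:9092,broker2.example.com:9092");
        return props;
    }
}
```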

Kafka Message Anatomy

  • Key. Key is optional in the Kafka message and it can be null. A key may be a string, number, or any object and then the key is serialised into binary format.
  • Value. The value represents the content of the message and can also be null. The value format is arbitrary and is then also serialised into binary format.
  • Compression Type. Kafka messages may be compressed. The compression type can be specified as part of the message. Options are none, gzip, lz4, snappy, and zstd.
  • Headers. There can be a list of optional Kafka message headers in the form of key-value pairs. It is common to add headers to specify metadata about the message, especially for tracing.
  • Partition + Offset. Once a message is sent into a Kafka topic, it receives a partition number and an offset id. The combination of topic+partition+offset uniquely identifies the message.
  • Timestamp. A timestamp is added either by the user or the system in the message.
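
Most of these fields are visible when constructing a record with the Java client. A sketch; the topic, key, value, and header below are made-up values.

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MessageAnatomyExample {
    public static void main(String[] args) {
        // topic, partition (null = let the partitioner decide), timestamp
        // (null = assigned automatically), key, value
        ProducerRecord<String, String> record =
            new ProducerRecord<>("orders", null, null, "customer-42", "order created");

        // Optional headers carry metadata about the message, e.g. for tracing
        record.headers().add("trace-id", "abc-123".getBytes(StandardCharsets.UTF_8));

        // The partition number and offset are assigned on write and come back
        // in the RecordMetadata returned by producer.send().
    }
}
```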

Ending Notes

Kafka is a great piece of software with tons of capabilities, and it can be used in a wide variety of use cases. Kafka fits great into modern-day distributed systems because it is distributed by design. It was originally created at LinkedIn and is now an Apache Software Foundation project, with major contributions from Confluent. It is used by top tech companies like Uber, Netflix, Activision, Spotify, Slack, Pinterest, and Coursera. We looked into the core concepts of Kafka to get started. There are tons of other things about Kafka that haven’t even been touched on in this post, but we will surely discuss them in upcoming posts.

Until then… Peace ✌🏻