Kafka ‐ Deep Dive Notes from Hello Interview Article

Jagrit edited this page Sep 22, 2024 · 3 revisions

Intro

  1. Apache Kafka is an open-source distributed event streaming platform.
  2. It can be used as a message queue or as a stream processing system.
  3. It is used by 80% of Fortune 100 companies.
  4. It delivers high performance, scalability, and durability.
  5. It is designed to handle vast volumes of data in real time, so that no message is lost and each piece of data is processed.

Motivating Example

It's the World Cup and we run a website that provides real-time statistics on the matches. Each time a goal is scored, a player is booked, or a substitution is made, we want to update our website with the latest information.

  • Events are put into the queue
  • Producers push these events to the queue
  • Consumers pull these events from the queue

The problem arises when 1000 matches take place at the same time and a single queue is no longer enough. We then need to distribute the events across multiple queues.

We need to scale the system by adding more servers to distribute our queue. But how do we ensure that the events are still processed in order? If we were to randomly distribute the events across the servers, we would have a mess on our hands. Goals would be scored before the match even started, and players would be booked for fouls they haven't committed yet.

  • One way to distribute the events is by game, so that all the events of one game go to the same queue (a logical distribution of events). This is one of the fundamental ideas behind Kafka: messages sent and received through Kafka require a user-specified distribution strategy.
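The idea of routing by game can be sketched in a few lines of Python. This is only an illustration, not Kafka's own code: the `pick_queue` helper, the match IDs, and the use of CRC32 as the hash are all assumptions made for the example (Kafka's default partitioner hashes the key with a different function).

```python
import zlib


def pick_queue(match_id: str, num_queues: int) -> int:
    """Map a match ID to a queue index deterministically (hypothetical helper)."""
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(match_id.encode("utf-8")) % num_queues


# Hypothetical events, listed in the order they happened.
events = [
    ("match-17", "kickoff"),
    ("match-42", "kickoff"),
    ("match-17", "goal"),
    ("match-42", "yellow card"),
    ("match-17", "substitution"),
]

queues = [[] for _ in range(4)]
for match_id, event in events:
    # Every event of the same match lands in the same queue, in arrival order.
    queues[pick_queue(match_id, len(queues))].append((match_id, event))
```

Because the queue is chosen from the match ID alone, the events of one match can never be split across queues, so their relative order is preserved even though different matches are spread over several servers.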

How do we handle the consumer side? It's still overwhelmed. It is easy enough to add more consumers, but how do we make sure that each event is only processed once?

Consumer Groups

We can group consumers together into what Kafka calls a consumer group. With consumer groups, each event is guaranteed to only be processed by one consumer in the group.
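One way to picture how a group avoids double processing is to give each queue (partition) exactly one owner inside the group. The sketch below uses a simple round-robin split; the `assign_partitions` function and the consumer names are hypothetical, and Kafka's real assignment strategies are configurable and more involved than this.

```python
def assign_partitions(partitions, consumers):
    """Round-robin split: each partition gets exactly one owner in the group."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(sorted(partitions)):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Four partitions shared by a group of two consumers (names are made up).
assignment = assign_partitions([0, 1, 2, 3], ["consumer-a", "consumer-b"])
```

Since no partition appears under two consumers, an event stored in a given partition is only ever handed to one member of the group.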

Basic Terminology and Architecture

  • A Kafka Cluster has multiple Brokers
  • Each Broker is an individual server that holds the data
  • Each Broker holds multiple Partitions
  • Each Partition is an ordered, immutable sequence of messages that is continually appended to
  • Topic is a logical grouping of partitions
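The terms above can be modelled as a toy data structure. This is a rough in-memory sketch under simplifying assumptions (no brokers, no replication, no persistence); the `Partition` and `Topic` classes and the topic name are invented for illustration and are not part of any Kafka client.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """An ordered, append-only sequence of messages."""
    log: list = field(default_factory=list)

    def append(self, message) -> int:
        # Messages are only ever added at the end; the index is the offset.
        self.log.append(message)
        return len(self.log) - 1

    def read(self, offset: int) -> list:
        # Reading does not remove anything from the log.
        return self.log[offset:]


@dataclass
class Topic:
    """A logical grouping of partitions."""
    name: str
    partitions: list


topic = Topic("match-events", [Partition() for _ in range(3)])
first = topic.partitions[0].append("kickoff")
second = topic.partitions[0].append("goal")
```

Here `first` and `second` are the positions of the two messages within partition 0, and `topic.partitions[0].read(1)` returns everything from the second message onward.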

Topic

Topics are the way you publish and subscribe to data in Kafka. When you publish a message, you publish it to a topic, and when you consume a message, you consume it from a topic. Topics are always multi-producer; that is, a topic can have zero, one, or many producers that write data to it.

Topic vs Partition

A topic is a logical grouping of messages. A partition is a physical grouping of messages. A topic can have multiple partitions, and each partition can be on a different broker. Topics are just a way to organize your data, while partitions are a way to scale your data.

Producer vs Consumer

Producers are the ones who write data to topics, and consumers are the ones who read data from topics.

Kafka as Message Queue or a Stream

  • In a message queue, consumers read messages from the queue and then acknowledge that they have processed the message.
  • In a stream, consumers read messages from the stream and then process them, but they don't acknowledge that they have processed the message.
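The contrast between the two styles can be illustrated with plain Python containers. This is a simplified picture, assuming that acknowledging a queued message removes it while a stream keeps its messages and lets each reader track its own position; the variable names are made up for the example.

```python
from collections import deque

# Queue style: the message is removed once the consumer acknowledges it.
queue = deque(["kickoff", "goal"])
message = queue[0]   # read the next message
queue.popleft()      # acknowledge: the message is gone from the queue

# Stream style: the log is kept; a reader only advances its own position.
log = ["kickoff", "goal"]
position = 0
seen = []
while position < len(log):
    seen.append(log[position])
    position += 1

# A second reader can still start from the beginning of the same log.
replayed = log[0:]
```

After this runs, the queue holds only the unacknowledged message, while the stream's log is unchanged and can be read again in full.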