Kafka - Deep Dive Notes from Hello Interview Article
- Apache Kafka is an open-source distributed event streaming platform.
- Used as a message queue or as a stream processing system.
- Used by 80% of Fortune 100 companies.
- Delivers high performance, scalability, and durability.
- Handles vast volumes of data in real time, ensuring no message is lost and every piece of data is processed.
It's the World Cup and we run a website that provides real-time statistics on the matches. Each time a goal is scored, a player is booked, or a substitution is made, we want to update our website with the latest information.
- Events are put into the queue
- Producers push these events to the queue
- Consumers pull these events from the queue
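The flow above can be sketched with a minimal in-memory queue (an illustration of the push/pull roles only, not the real Kafka client API; the event fields are made up):

```python
from collections import deque

# A minimal in-memory event queue: producers append at the tail,
# consumers pop from the head (FIFO).
queue = deque()

def produce(event):
    """Producer: push an event onto the tail of the queue."""
    queue.append(event)

def consume():
    """Consumer: pull the oldest event from the head of the queue."""
    return queue.popleft() if queue else None

produce({"match": "ARG-FRA", "type": "goal", "minute": 23})
produce({"match": "ARG-FRA", "type": "substitution", "minute": 41})
print(consume())  # the goal comes out first: FIFO order
```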
The problem arises when 1,000 matches occur at the same time and one queue is not enough, so we need to distribute the events across multiple queues.
We need to scale the system by adding more servers to distribute our queue. But how do we ensure that the events are still processed in order? If we were to randomly distribute the events across the servers, we would have a mess on our hands. Goals would be scored before the match even started, and players would be booked for fouls they haven't committed yet.
- One way to distribute the events is by game: all the events of one game go to the same queue - a logical distribution of events. This is one of the fundamental ideas behind Kafka: messages sent and received through Kafka require a user-specified distribution strategy.
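Key-based distribution can be sketched as follows: hashing the match id picks the queue (partition), so every event for one match lands in the same place and stays in order. Kafka's default partitioner actually hashes the message key with murmur2; crc32 stands in for it in this sketch, and the match ids are made up.

```python
import zlib

NUM_PARTITIONS = 3

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a message key to a partition index (crc32 stand-in for
    Kafka's murmur2-based default partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

partitions = [[] for _ in range(NUM_PARTITIONS)]
events = [
    ("ARG-FRA", "kickoff"),
    ("BRA-GER", "kickoff"),
    ("ARG-FRA", "goal"),
    ("ARG-FRA", "substitution"),
]
for match_id, event in events:
    # Same key -> same partition, so per-match order is preserved.
    partitions[partition_for(match_id)].append((match_id, event))
```

Because the hash is deterministic, every `ARG-FRA` event ends up on one partition in the order it was produced, while other matches may land elsewhere.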
How do we handle the consumer side? It is still overwhelmed. It is easy enough to add more consumers, but how do we make sure that each event is only processed once?
We can group consumers together into what Kafka calls a consumer group. With consumer groups, each event is guaranteed to only be processed by one consumer in the group.
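The guarantee can be sketched as a partition assignment: each partition is owned by exactly one consumer in the group, so no event is processed twice within the group. Simple round-robin assignment stands in here for Kafka's real rebalance protocol.

```python
def assign_partitions(partitions, consumers):
    """Round-robin assignment: every partition goes to exactly one
    consumer in the group (a simplified stand-in for Kafka's
    rebalance protocol)."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

assignment = assign_partitions([0, 1, 2, 3], ["consumer-a", "consumer-b"])
print(assignment)  # {'consumer-a': [0, 2], 'consumer-b': [1, 3]}
```

Since each partition appears in exactly one consumer's list, each event is read by only one member of the group.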
- A Kafka cluster has multiple brokers
- Each broker is an individual server that holds the data
- Each broker has multiple partitions
- Each partition is an ordered, immutable sequence of messages that is continually appended to
- A topic is a logical grouping of partitions
Topics are the way you publish and subscribe to data in Kafka. When you publish a message, you publish it to a topic, and when you consume a message, you consume it from a topic. Topics are always multi-producer; that is, a topic can have zero, one, or many producers that write data to it.
A topic is a logical grouping of messages. A partition is a physical grouping of messages. A topic can have multiple partitions, and each partition can be on a different broker. Topics are just a way to organize your data, while partitions are a way to scale your data.
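The "ordered, immutable sequence" of a partition can be modeled as an append-only log in which each message receives the next offset in sequence (a conceptual sketch, not the broker's on-disk format):

```python
class Partition:
    """Append-only log: messages are only ever appended to the end,
    and each gets a monotonically increasing offset."""

    def __init__(self):
        self._log = []

    def append(self, message) -> int:
        """Append a message and return its offset."""
        self._log.append(message)
        return len(self._log) - 1

    def read(self, offset: int):
        """Messages are never modified or removed; any offset can be
        re-read at any time."""
        return self._log[offset]

p = Partition()
p.append("kickoff")  # offset 0
p.append("goal")     # offset 1
```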
Producers are the ones who write data to topics, and consumers are the ones who read data from topics.
- In a message queue, consumers read messages from the queue and then acknowledge that they have processed each one; once acknowledged, a message is typically removed.
- In a stream, consumers read messages without acknowledging them individually; instead, each consumer tracks its own position (offset) in the stream, and the messages remain available for replay.
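The stream model above can be sketched by contrast: the log is never consumed away; each reader simply advances its own offset, so the same data can be re-read from any point. (A simplified sketch; real Kafka consumers commit offsets back to the broker.)

```python
# The log is shared and immutable; each consumer owns only its offset.
log = ["kickoff", "goal", "booking"]

def read_from(log, offset):
    """Stream-style read: return all messages from `offset` onward
    plus the new offset. Nothing is removed from the log."""
    return log[offset:], len(log)

first_pass, committed = read_from(log, 0)
replay, _ = read_from(log, 0)  # a second consumer (or a replay) starts at 0
```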