Data Engineering – Apache Kafka (아파치 카프카)

Apache Kafka is a viable alternative to a traditional messaging system. It began as an internal system built at LinkedIn to handle roughly 1.4 trillion messages per day. Today it is an open-source data streaming platform that can meet a wide range of business requirements.

What Exactly Is Apache Kafka?

Apache Kafka is a platform for collecting, processing, storing, and integrating data at scale. Data integration, distributed logging, and stream processing are just a few of its many applications. To fully understand Kafka, we first need to understand what an "event streaming platform" is. Before we discuss Kafka's architecture or its main components, let's talk about what an event is. This will help explain how Kafka stores events, how events enter and leave the system, and how to analyze event streams once they have been stored.

Kafka persists all received data to disk, and replicates that data across the brokers in a Kafka cluster to protect it from loss. Several things make Kafka fast. First, it is deliberately simple, without many bells and whistles. Another reason is that Apache Kafka does not assign unique identifiers to messages; it relies on the time at which a message was received. Kafka also does not keep track of which consumers have read a given topic or seen a particular message; consumers are responsible for tracking that themselves. When fetching data, you simply specify an offset, and the records are returned in order, beginning at that offset.
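The offset model described above can be sketched with a toy in-memory log (illustrative only; `TopicLog` is a hypothetical name, not part of any Kafka client library). The log keeps no per-consumer state; each consumer remembers its own offset and resumes reading from it:

```python
from dataclasses import dataclass, field

@dataclass
class TopicLog:
    """Toy model of a single Kafka partition: an append-only record list."""
    records: list = field(default_factory=list)

    def append(self, value):
        self.records.append(value)
        return len(self.records) - 1  # the new record's offset

    def read_from(self, offset):
        """Return all records at or after `offset`, in order.
        The log itself tracks no consumers; callers keep their own offsets."""
        return self.records[offset:]

log = TopicLog()
for msg in ["a", "b", "c", "d"]:
    log.append(msg)

# A consumer stores its own position and resumes from it.
consumer_offset = 2
print(log.read_from(consumer_offset))  # ['c', 'd']
```

The key point mirrored here is that reads are just "give me everything from offset N, in order" — the broker never needs to remember who has read what.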

Apache Kafka has three primary purposes for its users:

  • Publish and subscribe to streams of records
  • Durably store streams of records in the order in which they were created
  • Process streams of records in real time
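These three purposes can be sketched together with a toy in-memory bus (an illustrative model under assumed names — `MiniBus`, `publish`, `subscribe` are not the real Kafka client API):

```python
from collections import defaultdict

class MiniBus:
    """Toy publish/subscribe bus: each topic is an ordered,
    append-only list of records, mirroring Kafka's storage order."""
    def __init__(self):
        self.topics = defaultdict(list)

    def publish(self, topic, record):
        self.topics[topic].append(record)

    def subscribe(self, topic, offset=0):
        """Yield records for `topic` in the order they were published."""
        yield from self.topics[topic][offset:]

bus = MiniBus()
bus.publish("clicks", {"user": 1})
bus.publish("clicks", {"user": 2})

# "Processing" the stream in order as it is read back:
users = [r["user"] for r in bus.subscribe("clicks")]
print(users)  # [1, 2]
```

Publication order is preserved, so any subscriber reading from offset 0 reconstructs the exact sequence of events.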

Apache Kafka is used primarily to build real-time streaming data pipelines and applications that react to streams of data. It combines storage, messaging, and stream processing to enable the storage and analysis of both real-time and historical data.

Apache Kafka Architecture

Kafka is commonly used with Storm, HBase, and Spark to handle real-time streaming data. It can deliver very high volumes of messages to a Hadoop cluster, regardless of industry or use case. Taking a close look at its ecosystem can help us better understand how it works.


It contains four main APIs:

·        Producer API:

This API allows applications to publish a stream of records to one or more topics.

·        Consumer API:

Using the Consumer API, applications can subscribe to one or more topics and process the stream of records delivered to them.

·        Streams API:

This API consumes input from one or more topics and produces output to one or more topics, transforming input streams into output streams.
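The input-to-output transformation can be illustrated with a toy generator (a sketch of the idea only — `transform_stream` is a hypothetical name, not the actual Kafka Streams API, which is a Java library):

```python
def transform_stream(input_records, fn):
    """Toy analogue of a Streams map() operation: read records from an
    input stream, apply a transformation, and emit an output stream."""
    for record in input_records:
        yield fn(record)

# Example: an input topic of raw strings becomes an output topic of
# normalized, upper-cased strings.
output = list(transform_stream(["hello", "kafka"], str.upper))
print(output)  # ['HELLO', 'KAFKA']
```

Because the transformation is applied lazily, record by record, it matches the streaming model: records flow from input topic to output topic as they arrive, rather than in batches.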

·        Connector API:

This API provides reusable producers and consumers that can connect Kafka topics to existing applications and data systems.

Components and Description

·        Broker

To keep load balanced, a Kafka cluster usually consists of multiple brokers. Kafka brokers use ZooKeeper to maintain their cluster state. Each Apache Kafka broker can handle hundreds of thousands of reads and writes per second, and each broker can manage terabytes of messages without a performance penalty. ZooKeeper is also used to perform Kafka broker leader election.
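The replication idea mentioned earlier — a leader accepting writes and followers copying them so data survives a broker failure — can be sketched as follows (a toy model under assumed names; real Kafka replication involves in-sync replica sets and acknowledgment settings not shown here):

```python
class Broker:
    """Toy broker holding one partition's log as a plain list."""
    def __init__(self, name):
        self.name = name
        self.log = []

def replicate(leader, followers, record):
    """Leader appends the record, then each follower copies it, so the
    record survives the loss of any single broker."""
    leader.log.append(record)
    for follower in followers:
        follower.log.append(record)

leader = Broker("broker-1")
followers = [Broker("broker-2"), Broker("broker-3")]
replicate(leader, followers, "order-created")

# Every broker now holds an identical copy of the record.
print([b.log for b in [leader] + followers])
```

If the leader fails, any follower with a complete copy of the log can take over — which is where the leader election mentioned above comes in.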

Reference: engkimbs blog