Streaming real-time data from Kafka 3.7.0 to Flink 1.18.1 for processing

3 min read · Mar 7, 2024

Over the past few years, Apache Kafka has emerged as the leading standard for streaming data. Today Kafka is ubiquitous, having been adopted by at least 80% of the Fortune 100. This widespread adoption is attributed to Kafka’s architecture, which goes far beyond basic messaging. Its versatility makes it exceptionally well suited to streaming data at vast, internet scale while providing the fault tolerance and data consistency that mission-critical applications depend on.

Flink is a high-throughput, unified batch and stream processing engine, renowned for its capability to handle continuous data streams at scale. It integrates with Kafka and offers robust support for exactly-once semantics, so that each event is reflected in the results exactly once, even when system failures occur. This makes Flink a natural choice as a stream processor for Kafka. While Apache Flink enjoys significant success and popularity as a tool for real-time data processing, finding sufficient resources and up-to-date examples for learning it can be challenging.

In this article, I will guide you step by step through integrating Kafka 3.7.0 (the Scala 2.13 build, kafka_2.13-3.7.0) with Flink 1.18.1 to consume data from a topic and process it within Flink on a single-node cluster. Ubuntu 22.04 LTS is used as the operating system on the cluster.

Assumptions :-

The system has a minimum of 8 GB RAM and 250 GB SSD along with Ubuntu-22.04.2 amd64 as the operating system.
OpenJDK 11 is installed and the JAVA_HOME environment variable is configured.
Python 3 or Python 2 along with Perl 5 is available on the system.
A single-node Apache Kafka 3.7.0 cluster is up and running with Apache ZooKeeper 3.5.6. (Please read here how to set up a Kafka cluster)
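The Java-related assumption can be checked from code before going further. The following is only a minimal sketch and not part of the original setup: the class name PrerequisiteCheck and its methods are hypothetical, and it verifies nothing beyond the running JVM's feature version and the presence of JAVA_HOME.

```java
// Hypothetical helper, not from the article: checks the Java-related assumptions.
final class PrerequisiteCheck {

    // The article assumes OpenJDK 11, so only feature version 11 counts as a match here.
    static boolean matchesAssumedJava(int featureVersion) {
        return featureVersion == 11;
    }

    // JAVA_HOME counts as configured when it is present and not blank.
    static boolean isJavaHomeSet(String javaHome) {
        return javaHome != null && !javaHome.isBlank();
    }

    public static void main(String[] args) {
        int feature = Runtime.version().feature();
        System.out.println("Java feature version: " + feature
                + (matchesAssumedJava(feature) ? " (matches)" : " (article assumes 11)"));
        System.out.println("JAVA_HOME set: " + isJavaHomeSet(System.getenv("JAVA_HOME")));
    }
}
```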

Installation and starting of Flink-1.18.1:-

The binary distribution of Flink 1.18.1 can be downloaded from https://www.apache.org/dyn/closer.lua/flink/flink-1.18.1/flink-1.18.1-bin-scala_2.12.tgz
Extract the archive flink-1.18.1-bin-scala_2.12.tgz on the terminal using $ tar -xvzf flink-1.18.1-bin-scala_2.12.tgz. After successful extraction, the directory flink-1.18.1 is created. Please make sure that the bin/, conf/ and examples/ directories are available inside it.
Navigate to the flink-1.18.1 directory in the terminal and execute $ ./bin/start-cluster.sh to start the single-node Flink cluster.
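The check that bin/, conf/ and examples/ exist after extraction can also be scripted. Below is a rough sketch, assuming the archive was extracted into the current working directory; the class FlinkLayoutCheck and its method names are made up for illustration and are not part of Flink.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not from the article: verifies the extracted Flink layout.
final class FlinkLayoutCheck {

    // Sub-directories the article says must exist after extraction.
    static final List<String> EXPECTED_DIRS = List.of("bin", "conf", "examples");

    // Returns the expected sub-directories that are missing under flinkHome.
    static List<String> missingDirs(Path flinkHome) {
        return EXPECTED_DIRS.stream()
                .filter(name -> !Files.isDirectory(flinkHome.resolve(name)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Assumption: the path is passed as the first argument, else ./flink-1.18.1 is used.
        Path flinkHome = Path.of(args.length > 0 ? args[0] : "flink-1.18.1");
        List<String> missing = missingDirs(flinkHome);
        System.out.println(missing.isEmpty()
                ? "Flink layout looks complete under " + flinkHome
                : "Missing directories under " + flinkHome + ": " + missing);
    }
}
```

An empty result from missingDirs means all three directories were found and the cluster scripts under bin/ should be available.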

Once the cluster is running, we can use Flink’s web UI to monitor the status of the cluster and running jobs by opening http://localhost:8081 in a browser.
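The same port also serves Flink's REST API, so the cluster status can be checked without a browser. The sketch below is an illustration rather than part of the original walkthrough: it assumes the cluster runs on localhost with the default port 8081 and queries the /overview endpoint using the JDK's built-in HTTP client. The class name FlinkClusterProbe is hypothetical.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper, not from the article: asks the Flink REST API for a cluster overview.
final class FlinkClusterProbe {

    // Builds the URI of the cluster overview endpoint for the given host and port.
    static URI overviewUri(String host, int port) {
        return URI.create("http://" + host + ":" + port + "/overview");
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Assumption: single-node cluster on this machine, default REST port 8081.
        HttpRequest request = HttpRequest.newBuilder(overviewUri("localhost", 8081)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        // A 200 status with a JSON body indicates the cluster is reachable.
        System.out.println("HTTP " + response.statusCode());
        System.out.println(response.body());
    }
}
```

If the cluster is not running, the call to send fails with a connection error, which is itself a quick signal that start-cluster.sh did not succeed.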

The Flink cluster can be stopped by executing $ ./bin/stop-cluster.sh from the flink-1.18.1 directory.

List of dependent jars :-

The following jars should be included on the classpath/build file

I’ve created a basic Java program using Eclipse IDE 2023-12 to continuously consume messages within Flink from a Kafka topic. Dummy string messages are published to the topic using Kafka’s built-in kafka-console-producer.sh script. When a message arrives in the Flink engine, no real data transformation is applied to it. Instead, a fixed string is simply prepended to each message and the result is printed for verification, confirming that messages are continuously streamed to Flink.

package com.dataview.flink;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import com.dataview.flink.util.IKafkaConstants;


public class readFromKafkaTopic {
 public static void main(String[] args) throws Exception {
  // Entry point for building and running the streaming job
  StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();

  // Kafka source: reads the topic from the earliest offset and
  // deserializes each record value as a plain String
  KafkaSource<String> source = KafkaSource.<String>builder()
       .setBootstrapServers(IKafkaConstants.KAFKA_BROKERS)
       .setTopics(IKafkaConstants.FIRST_TOPIC_NAME)
       .setGroupId(IKafkaConstants.GROUP_ID_CONFIG)
       .setStartingOffsets(OffsetsInitializer.earliest())
       .setValueOnlyDeserializer(new SimpleStringSchema())
       .build();

  // No event-time processing is needed here, so no watermarks are generated
  DataStream<String> messageStream = see.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source");

  // Redistribute records evenly across parallel subtasks, prefix each message and print it
  messageStream.rebalance().map(new MapFunction<String, String>() {
   private static final long serialVersionUID = -6867736771747690202L;

   @Override
   public String map(String value) throws Exception {
    return "Kafka and Flink says: " + value;
   }
  }).print();

  // The pipeline is only defined above; execute() submits and runs the job
  see.execute();
 }

}
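The listing imports com.dataview.flink.util.IKafkaConstants, which is not shown in the article. A minimal sketch of what such a constants interface might look like follows. Every value in it is an assumed placeholder (a local broker on Kafka's default port 9092, an invented topic name and group id), not the author's actual configuration. In the real project the interface would also need to be declared public and placed in the com.dataview.flink.util package.

```java
// Hypothetical reconstruction: the article's project keeps this interface in
// package com.dataview.flink.util; all values below are assumed placeholders.
interface IKafkaConstants {
    // Assumption: a single local broker listening on Kafka's default port.
    String KAFKA_BROKERS = "localhost:9092";

    // Assumption: placeholder topic name; use the topic you publish messages to.
    String FIRST_TOPIC_NAME = "first-topic";

    // Assumption: placeholder consumer group id for the Flink job.
    String GROUP_ID_CONFIG = "flink-consumer-group";
}
```

Fields declared in a Java interface are implicitly public, static and final, which is why they can be referenced as IKafkaConstants.KAFKA_BROKERS without any modifiers.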

The entire execution has been screen recorded. If interested, you can watch it here.

I hope you enjoyed reading this. Please stay tuned for another upcoming article where I will explain how to stream messages/data from Flink to a Kafka topic. Please like and share if you feel this write-up is valuable.


Written by Gautam Goswami
