Introduction

Apache Kafka is a distributed event streaming platform built for high throughput, fault tolerance, and low latency. One of its greatest strengths is its ability to integrate with a multitude of other data systems, both as a source and as a sink, and that is where Kafka Connect comes in.

Kafka Connect is a framework that provides scalable and resilient integration between Kafka and other data systems. Thanks to its pluggable connector architecture, it can move data into and out of Kafka without you writing custom integration code. This blog post explores how to put Kafka Connect to work, bridging the gap between Kafka and your other data systems.

Kafka Connect Architecture

Before we delve into code, it’s crucial to understand Kafka Connect’s architecture. It operates in two modes: Standalone and Distributed.

  • Standalone mode runs a single worker process and is ideal for development and testing.
  • Distributed mode runs multiple cooperating workers and is used for production, as it provides scalability and fault tolerance.

1. Starting Kafka Connect

To start Kafka Connect, you’ll need to specify either standalone or distributed mode. Here’s how to start it in standalone mode:

Bash
connect-standalone.sh connect-standalone.properties file-source.properties file-sink.properties

In this command, connect-standalone.properties specifies the Kafka Connect worker configuration, while file-source.properties and file-sink.properties specify the configurations for the individual connectors.

2. Configuration for Standalone Mode

Let’s consider an example of a standalone mode configuration:

Properties
# connect-standalone.properties
bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=true
value.converter.schemas.enable=true
# Standalone mode keeps source connector offsets in a local file
offset.storage.file.filename=/tmp/connect.offsets

In this configuration, we define the Kafka bootstrap servers, the JSON converters used for record keys and values, and the local file in which standalone mode tracks source connector offsets.

Source Connectors

Source connectors import data from external systems into Kafka. They are responsible for splitting the incoming data into records, which the Connect framework then writes to Kafka topics.

3. Configuring a File Source Connector

The following configuration describes a simple file source connector:

Properties
# file-source.properties
name=local-file-source
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
file=test.txt
topic=test-topic

This connector reads test.txt line by line, producing each line as a record to test-topic, and keeps watching the file for new lines.
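
To generate some input, append a few lines to the source file; the connector picks them up as they are written:

Bash
echo "hello kafka connect" >> test.txt
echo "another line" >> test.txt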

4. Verifying Source Connector

You can verify whether data has been produced to Kafka using a console consumer:

Bash
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test-topic --from-beginning
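
Because the worker uses the JsonConverter with schemas enabled, each line arrives wrapped in a schema/payload envelope. The output should look roughly like the following sketch (exact formatting may vary):

JSON
{"schema":{"type":"string","optional":false},"payload":"hello kafka connect"}
{"schema":{"type":"string","optional":false},"payload":"another line"}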

Sink Connectors

Sink connectors export data from Kafka to another system. They read records from Kafka topics and convert them into a format suitable for the destination system.

5. Configuring a File Sink Connector

Here’s a basic configuration for a file sink connector:

Properties
# file-sink.properties
name=local-file-sink
connector.class=org.apache.kafka.connect.file.FileStreamSinkConnector
tasks.max=1
file=test.sink.txt
topics=test-topic

This connector reads from test-topic in Kafka and writes to test.sink.txt.

6. Verifying Sink Connector

The sink file test.sink.txt should now contain the records consumed from test-topic.
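
A quick way to confirm this is to watch the sink file as records arrive:

Bash
tail -f test.sink.txt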

Distributed Mode

For scalable and fault-tolerant production deployments, Kafka Connect should be run in distributed mode.

7. Starting Kafka Connect in Distributed Mode

Bash
connect-distributed.sh connect-distributed.properties

This command starts Kafka Connect in distributed mode, using the configuration specified in connect-distributed.properties.
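
A minimal connect-distributed.properties looks roughly like the sketch below. The group.id identifies the Connect cluster, and the three internal topics store connector configurations, offsets, and statuses; the topic names and replication factors here are illustrative and should be adjusted for your cluster:

Properties
# connect-distributed.properties (minimal sketch)
bootstrap.servers=localhost:9092
group.id=connect-cluster
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
# Internal topics used by the Connect cluster
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status
# Single-broker development setup; use a higher replication factor in production
config.storage.replication.factor=1
offset.storage.replication.factor=1
status.storage.replication.factor=1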

8. Configuring a Connector in Distributed Mode

Connectors in distributed mode are configured via a REST API. The following curl command configures a file source connector:

Bash
curl -X POST -H "Content-Type: application/json" \
  --data '{
    "name": "distributed-file-source",
    "config": {
      "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
      "tasks.max": "1",
      "file": "test.txt",
      "topic": "test-topic-distributed"
    }
  }' \
  http://localhost:8083/connectors
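
If the request succeeds, the REST API echoes back the connector's configuration. You can also list all registered connectors to confirm it was created:

Bash
curl http://localhost:8083/connectors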

9. Checking Connector Status

You can check the status of a connector using the Connect REST API:

Bash
curl http://localhost:8083/connectors/distributed-file-source/status
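
The response is a JSON document reporting the state of the connector and each of its tasks. For a healthy connector it looks roughly like this (a sketch of the typical shape, not verbatim output):

JSON
{
  "name": "distributed-file-source",
  "connector": { "state": "RUNNING", "worker_id": "127.0.0.1:8083" },
  "tasks": [ { "id": 0, "state": "RUNNING", "worker_id": "127.0.0.1:8083" } ]
}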

10. Deleting a Connector

If you need to delete a connector, you can do so with the following command:

Bash
curl -X DELETE http://localhost:8083/connectors/distributed-file-source

Conclusion

Kafka Connect represents a robust, scalable solution to link Kafka with other data systems. With its standardized API, pre-built connectors, and distributed nature, it makes data ingestion and export tasks manageable and efficient.

Understanding Kafka Connect is a significant step towards leveraging Kafka’s full potential. The ability to connect to a variety of data sources and sinks turns Kafka into a central hub for your real-time data pipelines. The examples in this blog post should help you get started with configuring your own connectors in both standalone and distributed modes. Happy streaming!