Introduction
Apache Kafka is a distributed event streaming platform built for high throughput, fault tolerance, and low latency. One of its most remarkable features is its ability to integrate with a multitude of other data systems, both as a source and as a sink. That’s where Kafka Connect comes in.
Kafka Connect is a framework that provides scalable and resilient integration between Kafka and other data systems. With its pluggable connector architecture, it can transport data into and out of Kafka without writing additional code. This blog post explores how to leverage Kafka Connect’s power, effectively bridging the gap between Kafka and your other data systems.
Kafka Connect Architecture
Before we delve into code, it’s crucial to understand Kafka Connect’s architecture. It operates in two modes: Standalone and Distributed.
- Standalone mode runs a single worker process and is ideal for development and testing, while
- Distributed mode runs multiple cooperating workers and is used in production, as it provides scalability and fault tolerance.
1. Starting Kafka Connect
To start Kafka Connect, you’ll need to specify either standalone or distributed mode. Here’s how to start it in standalone mode:
connect-standalone.sh connect-standalone.properties file-source.properties file-sink.properties
In this command, connect-standalone.properties specifies the Kafka Connect worker (runtime) configuration, while file-source.properties and file-sink.properties specify the configurations for the individual connectors.
2. Configuration for Standalone Mode
Let’s consider an example of a standalone mode configuration:
# connect-standalone.properties
bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=true
value.converter.schemas.enable=true
# Standalone mode requires a local file for storing source connector offsets
offset.storage.file.filename=/tmp/connect.offsets
In this configuration, we define the Kafka bootstrap servers, the converters used to serialize record keys and values, and the local file in which standalone mode stores source connector offsets.
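With schemas enabled, the JsonConverter wraps every key and value in an envelope that carries the schema alongside the payload. For a plain text line read by the file source connector, a value on the topic looks roughly like this (illustrative output, not copied from a running cluster):
{"schema":{"type":"string","optional":false},"payload":"hello kafka connect"}
Setting schemas.enable=false drops the envelope and stores only the raw payload, which is often preferable when downstream consumers do not need schema information.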
Source Connectors
Source connectors import data from another system into Kafka. They are responsible for splitting the data into records and writing them to Kafka.
3. Configuring a File Source Connector
The following configuration describes a simple file source connector:
# file-source.properties
name=local-file-source
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
file=test.txt
topic=test-topic
This connector reads lines from test.txt and writes them as records to the test-topic topic in Kafka.
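The connector tails the file, so new lines are picked up as they are appended. For example, assuming the worker was started from the directory containing test.txt:
echo "hello kafka connect" >> test.txt
echo "a second record" >> test.txt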
4. Verifying Source Connector
You can verify whether data has been produced to Kafka using a console consumer:
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test-topic --from-beginning
Sink Connectors
Sink connectors export data from Kafka to another system. They read records from Kafka topics and convert them into a format suitable for the destination system.
5. Configuring a File Sink Connector
Here’s a basic configuration for a file sink connector:
# file-sink.properties
name=local-file-sink
connector.class=org.apache.kafka.connect.file.FileStreamSinkConnector
tasks.max=1
file=test.sink.txt
topics=test-topic
This connector reads records from test-topic in Kafka and writes them to test.sink.txt.
6. Verifying Sink Connector
The sink file test.sink.txt should now contain the records consumed from test-topic.
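A quick way to confirm this, assuming the worker’s working directory contains the file:
cat test.sink.txt
Each line in the file corresponds to one record read from the topic.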
Distributed Mode
For scalable and fault-tolerant production deployments, Kafka Connect should be run in distributed mode.
7. Starting Kafka Connect in Distributed Mode
connect-distributed.sh connect-distributed.properties
This command starts Kafka Connect in distributed mode, using the worker configuration specified in connect-distributed.properties.
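The distributed worker configuration looks much like the standalone one, except that offsets, connector configs, and status are stored in Kafka topics rather than a local file. A minimal sketch follows; the topic names are conventional defaults and the replication factor of 1 is only suitable for a single-broker test cluster (use at least 3 in production):
# connect-distributed.properties
bootstrap.servers=localhost:9092
group.id=connect-cluster
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=true
value.converter.schemas.enable=true
# Internal topics used by the Connect cluster to store offsets, configs, and status
offset.storage.topic=connect-offsets
offset.storage.replication.factor=1
config.storage.topic=connect-configs
config.storage.replication.factor=1
status.storage.topic=connect-status
status.storage.replication.factor=1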
8. Configuring a Connector in Distributed Mode
Connectors in distributed mode are configured via the Connect REST API rather than properties files passed on the command line. The following curl command creates a file source connector:
curl -X POST -H "Content-Type: application/json" --data '{"name": "distributed-file-source", "config": {"connector.class":"org.apache.kafka.connect.file.FileStreamSourceConnector", "tasks.max":"1", "file":"test.txt", "topic":"test-topic-distributed"}}' http://localhost:8083/connectors
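For repeatable deployments it is often convenient to keep the connector configuration in a file and use the idempotent PUT endpoint, which creates the connector if it does not exist and updates it otherwise. A sketch, assuming the configuration object is saved in a hypothetical file-source.json:
# file-source.json contains only the "config" object:
# {"connector.class":"org.apache.kafka.connect.file.FileStreamSourceConnector", "tasks.max":"1", "file":"test.txt", "topic":"test-topic-distributed"}
curl -X PUT -H "Content-Type: application/json" --data @file-source.json http://localhost:8083/connectors/distributed-file-source/config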
9. Checking Connector Status
You can check the status of a connector using the Connect REST API:
curl http://localhost:8083/connectors/distributed-file-source/status
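A healthy connector returns a response along these lines (worker IDs and formatting will differ in your environment):
{"name":"distributed-file-source","connector":{"state":"RUNNING","worker_id":"127.0.0.1:8083"},"tasks":[{"id":0,"state":"RUNNING","worker_id":"127.0.0.1:8083"}],"type":"source"}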
10. Deleting a Connector
If you need to delete a connector, you can do so with the following command:
curl -X DELETE http://localhost:8083/connectors/distributed-file-source
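A few other REST endpoints are useful for day-to-day operations:
# List all connectors in the cluster
curl http://localhost:8083/connectors
# Pause and resume a connector
curl -X PUT http://localhost:8083/connectors/distributed-file-source/pause
curl -X PUT http://localhost:8083/connectors/distributed-file-source/resume
# Restart a connector instance
curl -X POST http://localhost:8083/connectors/distributed-file-source/restart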
Conclusion
Kafka Connect represents a robust, scalable solution to link Kafka with other data systems. With its standardized API, pre-built connectors, and distributed nature, it makes data ingestion and export tasks manageable and efficient.
Understanding Kafka Connect is a significant step towards leveraging Kafka’s full potential. The ability to connect to a variety of data sources and sinks turns Kafka into a central hub for your real-time data pipelines. The examples provided in this blog post should help you get started with creating your own connectors in both standalone and distributed modes. Happy streaming!