├── Important-Links └── README.md ├── KSQL └── KSQL.txt ├── Kafka CLI- Cheat Sheet └── README.md ├── Kafka Connectors └── Kafka-Connectors.txt ├── Kafka Consumer API └── KafkaConsumerTest.java ├── Kafka Consumer └── Consumer.txt ├── Kafka Producer API └── KafkaProducerTest.java ├── Kafka Producer ├── CustomKafkaProducer.Java ├── KafkaProducerMain.java ├── README.md └── log4j.properties ├── Kafka-CLI └── commands.txt ├── Kafka-Core └── Kafka-Core.txt ├── Kafka-Streams └── Stream-Notes.txt ├── Schema-Registry ├── Avro Schema │ └── Customer.avro ├── CustomKafkaProducer.java ├── KafkaProducerMain.java ├── Schema-Registry.txt └── log4j.properties ├── Set-UP README.md └── azure-pipelines.yml /Important-Links/README.md: -------------------------------------------------------------------------------- 1 | Links: 2 | Kafka-rest https://docs.confluent.io/current/kafka-rest/index.html 3 | kafka-rest https://docs.confluent.io/current/kafka-rest/docs/index.html 4 | 5 | kafka-rest https://docs.confluent.io/current/kafka-rest/quickstart.html 6 | 7 | schema registry https://docs.confluent.io/current/schema-registry/schema_registry_tutorial.html 8 | 9 | Kafka-connectors https://www.baeldung.com/kafka-connectors-guide 10 | 11 | Kafka-connect https://data-flair.training/blogs/kafka-connect/amp/ 12 | 13 | Kafka-connect-avro https://docs.confluent.io/current/cp-docker-images/docs/tutorials/connect-avro-jdbc.html 14 | 15 | kafka-connect https://www.confluent.io/blog/simplest-useful-kafka-connect-data-pipeline-world-thereabouts-part-1/ 16 | 17 | kafka-connect https://docs.confluent.io/current/cp-docker-images/docs/tutorials/connect-avro-jdbc.html 18 | 19 | Kafka-streams https://www.confluent.io/stream-processing-cookbook/ 20 | 21 | Kafka-streams https://docs.confluent.io/current/streams/index.html 22 | 23 | KSQL https://docs.confluent.io/current/ksql/docs/faq.html 24 | 25 | KSQL https://docs.confluent.io/current/ksql/docs/tutorials/index.html 26 | 27 | KSQL https://www.michael-noll.com/blog/2018/04/05/of-stream-and-tables-in-kafka-and-stream-processing-part1/ 28 | 29 | Kafka Security https://medium.com/@stephane.maarek/introduction-to-apache-kafka-security-c8951d410adf 30 | 31 | Kafka-control-center https://docs.confluent.io/current/tutorials/cp-demo/docs/index.html 32 | 33 | Video tutorials: 34 | Kafka schema registry and rest-proxy - https://youtu.be/5fjw62LGYNg 35 | 36 | Avro: schema evolution - By Stephane Maarek - https://youtu.be/SZX9DM_gyOE 37 | 38 | Avro producer - By Stephane Maarek https://youtu.be/_6HTHH1NCK0 39 | 40 | Kafka Connect Architecture - By Stephane Maarek https://youtu.be/YOGN7qr2nSE 41 | 42 | Kafka Connect Concepts - By Stephane Maarek https://youtu.be/BUv1IgWm-gQ 43 | 44 | Kafka Connect Distributed architecture - By Stephane Maarek https://youtu.be/52HXoxthRs0 45 | 46 | Kafka-connect https://www.youtube.com/playlist?list=PLt1SIbA8guutTlfh0J7bGboW_Iplm6O_B 47 | 48 | KSQL https://www.youtube.com/watch?v=ExEWJVjj-RA&list=PLa7VYi0yPIH2eX8q3mPpZAn3qCS1eDX8W 49 | 50 | kafka-streams explained by Neha https://www.youtube.com/watch?v=A9KQufewd-s&feature=youtu.be 51 | 52 | kafka-streams https://www.youtube.com/watch?v=Z3JKCLG3VP4&list=PLa7VYi0yPIH1vDclVOB49xUruBAWkOCZD 53 | 54 | kafka-streams https://youtu.be/Z3JKCLG3VP4 55 | 56 | kafka-streams https://youtu.be/LxxeXI1mPKo 57 | 58 | 59 | kafka-streams https://youtu.be/-y2ALVkU5Bc 60 | 61 | kafka-streams - By Stephane Maarek https://youtu.be/wPw3tb_dl70 62 | 63 | kafka-streams - By Tim berglund https://youtu.be/7JYEEx7SBuE 64 | 65 | kafka-streams - 
By Tim berglund https://youtu.be/3kJgYIkAeHs 66 | 67 | Architecture and Advanced Concepts 68 | 69 | Exactly once semantics : https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/ 70 | 71 | Kafka partition https://medium.com/@anyili0928/what-i-have-learned-from-kafka-partition-assignment-strategy-799fdf15d3ab 72 | 73 | kafka stateful DSL : https://kafka.apache.org/20/documentation/streams/developer-guide/dsl-api.html#stateful-transformations 74 | Real-time use-cases & Blog's 75 | 76 | https://www.confluent.io/stream-processing-cookbook/ https://www.confluent.io/blog/ 77 | 78 | Alternate links for preparation 79 | 80 | http://lahotisolutions.blogspot.com/2019/03/apache-kafka-notes.html https://www.quora.com/How-do-I-prepare-for-Kafka-certification-confluent-1 81 | 82 | Sources 83 | Content in this github is sourced from following locations 84 | 85 | Confluent documentation 86 | Apache documentation 87 | Kafka definitive guide Book 88 | -------------------------------------------------------------------------------- /KSQL/KSQL.txt: -------------------------------------------------------------------------------- 1 | Kafka ksql 2 | Steps to connect to ksql running with docker. 3 | docker exec -it ksql-server /bin/bash 4 | comamnd to data generate to ksql stream topic 5 | docker exec -it ksql-datagen bash 6 | ksql-datagen \ 7 | bootstrap-server=broker:29092 \ 8 | quickstart=pageviews \ 9 | format=delimited \ 10 | topic=pageviews \ 11 | maxInterval=500 12 | ksql-datagen schema=./userprofile.avro format=json topic=USERPROFILE key=userid maxInterval=5000 iterations=100ootstrap-server=broker:29092 13 | 14 | KSQL will execute in 2 modes 15 | 16 | Interactive and headless 17 | is rest API supported in headless mode? 18 | 19 | No. rest API is not supported in headless way. 20 | 21 | what is KTable? 22 | 23 | what is GlobalKTable? 24 | 25 | Ktable vs GlobalKTable? 26 | 27 | Type of join supported by KSQL? 28 | 29 | output types in joins 30 | 31 | join between a stream and a stream will return a new stream 32 | join between a stream and a table will return a stream 33 | join between a table and a table will return a table 34 | 35 | How to terminate a query 36 | 37 | terminate query "query_name" 38 | How to run a ksql script from cli 39 | 40 | run script "./path/to/file.ksql" 41 | How to print a topic from beginning 42 | 43 | print 'topicname' from beginning 44 | 45 | How to stream from beginning 46 | 47 | set 'auto.offset.reset'='earliest'; 48 | select * from stream; 49 | Create a new stream from a stream 50 | 51 | create stream user_profile_pretty as select firstname || ' ' || ucase( lastname) || ' from ' || countrycode || ' has a rating of ' || cast(rating as varchar) || ' stars. ' || case when rating < 2.5 then 'Poor' when rating between 2.5 and 4.2 then 'Good'else 'Excellent'end as description from userprofile; 52 | How to set infinite retention in kafka 53 | -------------------------------------------------------------------------------- /Kafka CLI- Cheat Sheet/README.md: -------------------------------------------------------------------------------- 1 | https://medium.com/@TimvanBaarsen/apache-kafka-cli-commands-cheat-sheet-a6f06eac01b 2 | -------------------------------------------------------------------------------- /Kafka Connectors/Kafka-Connectors.txt: -------------------------------------------------------------------------------- 1 | what is a worker? 2 | 3 | A single java process, it can be standalone or disctributed mode. 
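To make the worker/connector/task split concrete, here is a minimal sketch of a source connector class, the kind of plugin a worker process loads and runs. The class names (FileLineSourceConnector, FileLineSourceTask) are made up for illustration; the SourceConnector/SourceTask API is the standard org.apache.kafka.connect one. Note how taskConfigs(maxTasks) is where a connector decides how to split its work, which ties into the tasks.max discussion further below.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.source.SourceConnector;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

// Hypothetical connector, used only to illustrate the Connect plugin API.
public class FileLineSourceConnector extends SourceConnector {

    private Map<String, String> configProps;

    @Override
    public void start(Map<String, String> props) {
        this.configProps = props; // configuration handed to the worker (properties file or REST submission)
    }

    @Override
    public Class<? extends Task> taskClass() {
        return FileLineSourceTask.class;
    }

    @Override
    public List<Map<String, String>> taskConfigs(int maxTasks) {
        // The worker passes tasks.max here; the connector decides how many task configs to hand back.
        List<Map<String, String>> configs = new ArrayList<>();
        for (int i = 0; i < maxTasks; i++) {
            configs.add(new HashMap<>(configProps));
        }
        return configs;
    }

    @Override
    public void stop() { }

    @Override
    public ConfigDef config() {
        return new ConfigDef(); // real connectors declare and validate their settings here
    }

    @Override
    public String version() {
        return "0.1.0";
    }

    // Minimal task skeleton so the sketch is self-contained.
    public static class FileLineSourceTask extends SourceTask {
        @Override
        public void start(Map<String, String> props) { }

        @Override
        public List<SourceRecord> poll() throws InterruptedException {
            return null; // a real task would read from the source and return SourceRecords
        }

        @Override
        public void stop() { }

        @Override
        public String version() {
            return "0.1.0";
        }
    }
}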
4 | 5 | Modes to run kafka connect server 6 | 7 | Standalone mode 8 | Distributed mode - recommended for production use. 9 | What is standalone Mode 10 | 11 | A single process runs your connectors and tasks. 12 | Configration is bundled with your process 13 | Very easy to started with, useful for development and testing. 14 | Not fault tolerant, no scalability, hard to monitor 15 | What is Distributed mode 16 | 17 | Multiple workers run your connetors and tasks. 18 | Configuration is submitted using REST API. 19 | Easy to scale and fault tolerant(rebalance in case a worker dies) 20 | useful for production deployment of connectors. 21 | java memory property to control java heap size 22 | 23 | export KAFKA_HEAP_OPTS="-Xms256M -Xmx2G" 24 | https://stackoverflow.com/questions/50621962/how-to-set-kafka-connect-connector-and-tasks-jvm-heap-size 25 | What is a task and how connector will break them into tasks? 26 | 27 | What is task.maxs property? 28 | 29 | Defined the no of tasks that we want to run in parallel. 30 | For source connector we usually keep it 1(one). 31 | For sink connectors this property will be set to higher? 32 | -------------------------------------------------------------------------------- /Kafka Consumer API/KafkaConsumerTest.java: -------------------------------------------------------------------------------- 1 | import java.time.Duration; 2 | import java.util.Collections; 3 | import java.util.Properties; 4 | import java.util.concurrent.ExecutionException; 5 | 6 | import org.apache.kafka.clients.consumer.ConsumerConfig; 7 | import org.apache.kafka.clients.consumer.ConsumerRecord; 8 | import org.apache.kafka.clients.consumer.ConsumerRecords; 9 | import org.apache.kafka.clients.consumer.KafkaConsumer; 10 | import org.apache.kafka.common.serialization.StringDeserializer; 11 | 12 | public class KafkaConsumerTest { 13 | 14 | public static void main(String[] args) throws InterruptedException, ExecutionException{ 15 | //Create consumer property 16 | String bootstrapServer = "localhost:9092"; 17 | String groupId = "my-first-consumer-group"; 18 | String topicName = "my-first-topic"; 19 | 20 | Properties properties = new Properties(); 21 | properties.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServer); 22 | properties.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName()); 23 | properties.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName()); 24 | properties.setProperty(ConsumerConfig.GROUP_ID_CONFIG, groupId); 25 | properties.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); 26 | properties.setProperty(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); 27 | 28 | //Create consumer 29 | KafkaConsumer consumer = new KafkaConsumer<>(properties); 30 | 31 | //Subscribe consumer to topic(s) 32 | consumer.subscribe(Collections.singleton(topicName)); 33 | 34 | 35 | //Poll for new data 36 | while(true){ 37 | ConsumerRecords records = consumer.poll(Duration.ofMillis(1000)); 38 | 39 | for(ConsumerRecord record: records){ 40 | System.out.println(record.key() + record.value()); 41 | System.out.println(record.topic() + record.partition() + record.offset()); 42 | } 43 | 44 | //Commit consumer offset manually (recommended) 45 | consumer.commitAsync(); 46 | } 47 | 48 | } 49 | } 50 | -------------------------------------------------------------------------------- /Kafka Consumer/Consumer.txt: -------------------------------------------------------------------------------- 1 | What is 
consumer group? 2 | 3 | What is consumer? 4 | 5 | When multiple consumers are subscribed to a topic and belong to the same consumer group, each consumer in the group will receive messages from a different subset of the partitions in the topic. 6 | If we add more consumers to a single group with a single topic than we have partitions, some of the consumers will be idle and get no messages at all. 7 | What is rebalance? 8 | 9 | Moving partition ownership from one consumer to another is called rebalance. 10 | During a rebalance, consumers can’t consume messages, so a rebalance is basically a short window of unavailability of the entire consumer group. 11 | What is group coordinator? 12 | 13 | The way consumers maintain membership in a consumer group and ownership of the partitions assigned to them is by sending heartbeats to a Kafka broker designated as the group coordinator. 14 | Heartbeats are sent when the consumer polls (i.e., retrieves records) and when it commits records it has consumed. 15 | what are the consumer mandatory properties 16 | 17 | bootstrap.servers, key.deserializer and value.deserializer. 18 | group.id is not mandatory, but in most of the situations it will be populated. 19 | subscribing to topics 20 | 21 | subscribe method will take list of topics as parameters like below 22 | consumer.subscribe(Collections.singletonList("customerCountries")); 23 | It is also possible to call subscribe with a regular expression. 24 | consumer.subscribe("test.*"); 25 | poll loops 26 | 27 | poll loop handles all details of coordination, partition rebalances, heartbeats, and data fetching, leaving the developer with a clean API that simply returns available data from the assigned partitions. 28 | The parameter we pass, poll(), is a timeout interval and controls how long poll() will block if data is not available in the consumer buffer. If this is set to 0, poll() will return immediately; otherwise, it will wait for the specified number of milliseconds for data to arrive from the broker. 29 | consumer.poll(100); 30 | poll() returns a list of records. Each record contains the topic and partition the record came from, the offset of the record within the partition, and of course the key and the value of the record. 31 | Always close() the consumer before exiting. This will close the network connections and sockets. 32 | The poll loop does a lot more than just get data. The first time you call poll() with a new consumer, it is responsible for finding the GroupCoordinator, joining the consumer group, and receiving a partition assignment. If a rebalance is triggered, it will be handled inside the poll loop as well. And of course the heartbeats that keep consumers alive are sent from within the poll loop. 33 | fetch.min.bytes 34 | 35 | This property allows a consumer to specify the minimum amount of data that it wants to receive from the broker when fetching records. 36 | fetch.max.wait.ms 37 | 38 | By setting fetch.min.bytes, you tell Kafka to wait until it has enough data to send before responding to the consumer. 39 | default value is 500 ms. 40 | max.partition.fetch.bytes 41 | 42 | This property controls the maximum number of bytes the server will return per partition. 43 | session.timeout.ms 44 | 45 | The amount of time a consumer can be out of contact with the broker while still considered alive defaults to 3 seconds. 46 | heatbeat.interval.ms must be lower than session.timeout.ms, and is usually set to one-third of the timeout value. 
So if session.timeout.ms is 3 seconds, heartbeat.interval.ms should be 1 second. 47 | auto.offset.reset 48 | 49 | latest or earliest 50 | enable.auto.commit 51 | 52 | default is true. 53 | If you set enable.auto.commit to true, then you might also want to control how frequently offsets will be committed using auto.commit.interval.ms. 54 | What are the various partitions assignment strategies? 55 | 56 | org.apache.kafka.clients.consumer.RangeAssignor -> This is default option 57 | org.apache.kafka.clients.consumer.RoundRobinAssignor -> recommend when multi topics are consumered together. 58 | more details are here: https://medium.com/@anyili0928/what-i-have-learned-from-kafka-partition-assignment-strategy-799fdf15d3ab 59 | partition.assignment.strategy 60 | 61 | * Range : default 62 | * Roundrobin 63 | client.id 64 | 65 | This can be any string, and will be used by the brokers to identify messages sent from the client. It is used in logging and metrics, and for quotas. 66 | max.poll.records 67 | 68 | This controls the maximum number of records that a single call to poll() will return. This is useful to help control the amount of data your application will need to process in the polling loop. 69 | receive.buffer.bytes and send.buffer.bytes 70 | 71 | These are the sizes of the TCP send and receive buffers used by the sockets when writing and reading data. If these are set to -1, the OS defaults will be used. It can be a good idea to increase those when producers or consumers communicate with brokers in a different datacenter, because those network links typically have higher latency and lower bandwidth. 72 | commits and offsets 73 | 74 | Kafka stores the offsets in __consumer_offsets topic. 75 | commitsync() will retry untill it either succeeds or encounters a nonretraiable failure. commitAsync() will not retry. 76 | commitasync() 77 | 78 | If we are using commit async and handling retries in our code, then we need to be careful about the order of commits. 79 | combining synchronous and asynchronous commits 80 | 81 | But if we know that this is the last commit before we close the consumer, or before a rebalance, we want to make extra sure that the commit succeeds. 82 | Therefore, a common pattern is to combine commitAsync() with commitSync() just before shutdown. 83 | Rebalance listeners 84 | 85 | ConsumerRebalanceListener onPartitionsRevoked(Collection partitions) and onPartitionsRevoked(Collection partitions). 
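The interface exposes two callbacks: onPartitionsRevoked(Collection<TopicPartition>) and onPartitionsAssigned(Collection<TopicPartition>). Below is a minimal sketch of a custom listener like the one passed to subscribe() on the next line; the class name and the commit-on-revoke behaviour are illustrative.

import java.util.Collection;

import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class HandleRebalanceCustomListener implements ConsumerRebalanceListener {

    private final KafkaConsumer<String, String> consumer;

    public HandleRebalanceCustomListener(KafkaConsumer<String, String> consumer) {
        this.consumer = consumer;
    }

    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Called before partition ownership moves away: commit whatever has been processed so far.
        consumer.commitSync();
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Called after partitions are (re)assigned: seek to stored offsets or initialise state here.
    }
}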
86 | consumer.subscribe(topics, new HandleRebalanceCsutomListener()); 87 | -------------------------------------------------------------------------------- /Kafka Producer API/KafkaProducerTest.java: -------------------------------------------------------------------------------- 1 | import java.util.Properties; 2 | import java.util.concurrent.ExecutionException; 3 | 4 | import org.apache.kafka.clients.producer.KafkaProducer; 5 | import org.apache.kafka.clients.producer.ProducerConfig; 6 | import org.apache.kafka.clients.producer.ProducerRecord; 7 | import org.apache.kafka.common.serialization.StringSerializer; 8 | 9 | public class KafkaProducerTest { 10 | 11 | public static void main(String[] args) throws InterruptedException, ExecutionException{ 12 | //Create producer property 13 | String bootstrapServer = "localhost:9092"; 14 | Properties properties = new Properties(); 15 | properties.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServer); 16 | properties.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName()); 17 | properties.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName()); 18 | 19 | //Create safe producer 20 | properties.setProperty(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true"); 21 | properties.setProperty(ProducerConfig.ACKS_CONFIG, "all"); 22 | properties.setProperty(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "5"); 23 | properties.setProperty(ProducerConfig.RETRIES_CONFIG, Integer.toString(Integer.MAX_VALUE)); 24 | 25 | //High throughput producer (at the expense of a bit of latency and CPU usage) 26 | properties.setProperty(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy"); 27 | properties.setProperty(ProducerConfig.LINGER_MS_CONFIG, "20"); //20ms wait time 28 | properties.setProperty(ProducerConfig.BATCH_SIZE_CONFIG, Integer.toString(32*1024)); //32KB batch size 29 | 30 | //Create producer 31 | KafkaProducer producer = new KafkaProducer<>(properties); 32 | 33 | //create a producer record 34 | ProducerRecord record = new ProducerRecord<>("topicName", "firstRecord"); 35 | //create producer record with key 36 | //new ProducerRecord<>("topicName", "MessageKey", "Message"); 37 | //create producer record with key and partition number 38 | //new ProducerRecord<>("topicName", 1 /*partition number*/, "MessageKey", "Message"); 39 | 40 | //send data - asynchronous 41 | //without callback 42 | //producer.send(record); 43 | //with callback 44 | producer.send(record, (recordMetadata, exception) -> { 45 | if(exception == null){ 46 | System.out.println(recordMetadata.topic() + "+" + recordMetadata.partition() + "+" + recordMetadata.offset()); 47 | }else{ 48 | System.err.println(exception.getMessage()); 49 | } 50 | }); 51 | 52 | //send data - synchronous 53 | //without callback 54 | //producer.send(record).get(); //.get() make it synchronous call 55 | 56 | //flush data 57 | producer.flush(); 58 | 59 | //flush and close producer 60 | producer.close(); 61 | } 62 | } 63 | -------------------------------------------------------------------------------- /Kafka Producer/CustomKafkaProducer.Java: -------------------------------------------------------------------------------- 1 | import io.confluent.common.utils.MockTime; 2 | import io.confluent.common.utils.Time; 3 | import kafka.admin.AdminUtils; 4 | import kafka.admin.RackAwareMode; 5 | import kafka.server.KafkaConfig; 6 | import kafka.server.KafkaServer; 7 | import kafka.utils.TestUtils; 8 | import kafka.utils.ZKStringSerializer$; 9 | import 
kafka.utils.ZkUtils; 10 | import kafka.zk.EmbeddedZookeeper; 11 | import org.I0Itec.zkclient.ZkClient; 12 | import org.junit.Test; 13 | 14 | import java.util.Properties; 15 | 16 | public class CustomKafkaProducerTest { 17 | 18 | private static final String ZKHOST = "127.0.0.1"; 19 | private static final String BROKERHOST = "127.0.0.1"; 20 | private static final String BROKERPORT = "9092"; 21 | private static final String TOPIC = "test"; 22 | 23 | @Test 24 | public void testProducer(){ 25 | /* EmbeddedZookeeper zkServer = new EmbeddedZookeeper(); 26 | String zkConnect = ZKHOST + ":" + zkServer.port(); 27 | ZkClient zkClient = new ZkClient(zkConnect, 30000, 30000, ZKStringSerializer$.MODULE$); 28 | ZkUtils zkUtils = ZkUtils.apply(zkClient, false); 29 | // setup Broker 30 | Properties brokerProps = new Properties(); 31 | brokerProps.setProperty("zookeeper.connect", zkConnect); 32 | brokerProps.setProperty("broker.id", "0"); 33 | // brokerProps.setProperty("log.dirs", Files.createTempDirectory("kafka-").toAbsolutePath().toString()); 34 | brokerProps.setProperty("listeners", "PLAINTEXT://" + BROKERHOST +":" + BROKERPORT); 35 | brokerProps.setProperty("offsets.topic.replication.factor" , "1"); 36 | KafkaConfig config = new KafkaConfig(brokerProps); 37 | *//* Time mock = new MockTime(); 38 | KafkaServer kafkaServer = TestUtils.createServer(config, mock);*//* 39 | // create topic 40 | AdminUtils.createTopic(zkUtils, TOPIC, 1, 1, new Properties(), RackAwareMode.Disabled$.MODULE$);*/ 41 | } 42 | 43 | } 44 | -------------------------------------------------------------------------------- /Kafka Producer/KafkaProducerMain.java: -------------------------------------------------------------------------------- 1 | import org.slf4j.Logger; 2 | import org.slf4j.LoggerFactory; 3 | 4 | public class KafkaProducerMain { 5 | 6 | private static Logger LOGGER = LoggerFactory.getLogger(KafkaProducerMain.class); 7 | 8 | public static void main(String[] args) throws Exception { 9 | LOGGER.info("Running kafka producer..."); 10 | CustomKafkaProducer customKafkaProducer = new CustomKafkaProducer(); 11 | customKafkaProducer.produce("sample-topic","1", "This is a sample message"); 12 | LOGGER.info("Running kafka producer completed!! "); 13 | } 14 | -------------------------------------------------------------------------------- /Kafka Producer/README.md: -------------------------------------------------------------------------------- 1 | Kafka producer 2 | Key points 3 | **What happens after producerRecord is sent? 4 | 5 | Step #1: Once we send the ProducerRecord, the first thing the producer will do is serialize the key and value objects to ByteArrays so they can be sent over the network 6 | Step #2: Data is sent to partitioner, If we specified a partition in the ProducerRecord, the partitioner doesn’t do anything and simply returns the partition we specified. 7 | If we didn’t, the partitioner will choose a partition for us, usually based on the ProducerRecord key. 8 | Step #3: Adds the record to a batch of records that will also be sent to the same topic and partition 9 | Step #4: A separate thread is responsible for sending those batches of records to the appropriate Kafka brokers. 10 | Step #5: Broker receives the messages, it sends back a response. 11 | Successful : it will return a RecordMetadata object with the topic, partition, and the offset of the record within the partition. 12 | Failed: it will return a error code, producer may retry sending few more times before giving up and returning an error. 
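Step #2 above mentions the partitioner. If the default hash-based placement is not what you want, a custom Partitioner can be plugged in through partitioner.class (ProducerConfig.PARTITIONER_CLASS_CONFIG). A minimal sketch follows; the class name and the "vip-customer" key are made up for illustration, and it assumes the topic has more than one partition.

import java.util.Map;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class VipCustomerPartitioner implements Partitioner {

    @Override
    public void configure(Map<String, ?> configs) { }

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionCountForTopic(topic);
        if (keyBytes == null) {
            throw new IllegalArgumentException("This partitioner expects every record to have a key");
        }
        if ("vip-customer".equals(key)) {
            return numPartitions - 1; // reserve the last partition for this one key
        }
        // Hash all other keys over the remaining partitions.
        return (key.hashCode() & 0x7fffffff) % (numPartitions - 1);
    }

    @Override
    public void close() { }
}

It would be enabled on the producer with properties.setProperty(ProducerConfig.PARTITIONER_CLASS_CONFIG, VipCustomerPartitioner.class.getName());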
13 | When will a produced message will be ready to consume? 14 | 15 | Messages written to the partition leader are not immediately readable by consumers regardless of the producer's acknowledgement settings. 16 | When all in-sync replicas have acknowledged the write, then the message is considered committed, which makes it available for reading. 17 | This ensures that messages cannot be lost by a broker failure after they have already been read. 18 | What are the producer key configuration? 19 | 20 | key configurations are explined here : https://docs.confluent.io/current/clients/producer.html 21 | key.serializer 22 | 23 | producer interface allows user to provide the key.serializer. 24 | Avilable serializers: ByteArraySerializer, StringSerializer and IntegerSerializer. 25 | setting key.serializer is required even if you intend to send only values. 26 | value.serializer 27 | 28 | producer interface allows user to provide the key.serializer. like the same way as key.serializer. 29 | retries 30 | 31 | Setting a value greater than zero will cause the client to resend any record whose send fails with a potentially transient error. Note that this retry is no different than if the client resent the record upon receiving the error. Allowing retries without setting max.in.flight.requests.per.connection to 1 will potentially change the ordering of records because if two batches are sent to a single partition, and the first fails and is retried but the second succeeds, then the records in the second batch may appear first. Note additionall that produce requests will be failed before the number of retries has been exhausted if the timeout configured by delivery.timeout.ms expires first before successful acknowledgement. Users should generally prefer to leave this config unset and instead use delivery.timeout.ms to control retry behavior. 32 | default: 2147483647 33 | By default, the producer will wait 100ms between retries, but you can control this using the retry.backoff.ms parameter. 34 | acks 35 | 36 | If acks=0, the producer will not wait for a reply from the broker before assuming the message was sent successfully. 37 | If acks=1, the producer will receive a success response from the broker the moment the leader replica received the message 38 | If the client uses callbacks, latency will be hidden, but throughput will be limited by the number of in-flight messages (i.e., how many messages the producer will send before receiving replies from the server). 39 | If acks=all, the producer will receive a success response from the broker once all in-sync replicas received the message. 40 | if acks is set to all, the request will be stored in a buffer called purgatory until the leader observes that the follower replicas replicated the message, at which point a response is sent to the client 41 | buffer.memory 42 | 43 | This sets the amount of memory the producer will use to buffer messages waiting to be sent to brokers. 44 | If messages are sent by the application faster than they can be delivered to the server, the producer may run out of space and additional send() calls will either block or throw an exception, based on the block.on.buffer.full parameter (replaced with max.block.ms in release 0.9.0.0, which allows blocking for a certain time and then throwing an exception). 45 | batch.size 46 | 47 | When multiple records are sent to the same partition, the producer will batch them together. 48 | This parameter controls the amount of memory in bytes (not messages!) that will be used for each batch. 
49 | When the batch is full, all the messages in the batch will be sent. However, this does not mean that the producer will wait for the batch to become full. 50 | The producer will send half-full batches and even batches with just a single message in them. 51 | Therefore, setting the batch size too large will not cause delays in sending messages; it will just use more memory for the batches. 52 | Setting the batch size too small will add some overhead because the producer will need to send messages more frequently. 53 | linger.ms 54 | 55 | linger.ms controls the amount of time to wait for additional messages before sending the current batch. 56 | KafkaProducer sends a batch of messages either when the current batch is full or when the linger.ms limit is reached. 57 | By default, the producer will send messages as soon as there is a sender thread available to send them, even if there’s just one message in the batch. 58 | By setting linger.ms higher than 0, we instruct the producer to wait a few milliseconds to add additional messages to the batch before sending it to the brokers. 59 | This increases latency but also increases throughput (because we send more messages at once, there is less overhead per message). 60 | compression.type 61 | 62 | By default, messages are sent uncompressed 63 | supported compression types: snappy, gzip and lz4. 64 | snappy is recommended, with low CPU and good performance and decent comression ratio. 65 | Gzip use more CPU and time, but result in better compression ratio. 66 | max.in.flight.requests.per.connection 67 | 68 | This controls how many messages the producer will send to the server without receiving responses. 69 | Setting this high can increase memory usage while improving throughput, but setting it too high can reduce throughput as batching becomes less efficient. 70 | Setting this to 1 will guarantee that messages will be written to the broker in the order in which they were sent, even when retries occur. 71 | Methods of sending messages 72 | 73 | * Fire-and-forget : 74 | We many loose data in this situation. Possbile cases of lossing data : SerializationException when it fails to serialize the message, a BufferExhaustedException or TimeoutException if the buffer is full, or an InterruptException if the sending thread was interrupted. 75 | not recommended for production use. 76 | * Synchronous send 77 | We user Future.get() to wait for a reply from Kafka 78 | * Asynchronous send 79 | We call the send() method with a callback function, which gets triggered when it receives a response from the Kafka broker. 80 | Type of errors 81 | 82 | Retriable: errors are those that can be resolved by sending the message again. 83 | For example, a connection error can be resolved because the connection may get reestablished. 84 | A “no leader” error can be resolved when a new leader is elected for the partition. 85 | KafkaProducer can be configured to retry those errors automatically. 86 | non-retriable: For example, “message size too large.” In those cases, KafkaProducer will not attempt a retry and will return the exception immediately. 87 | Explain bootstrap.servers 88 | 89 | bootstrap.servers property so that the producer can find the Kafka cluster. 90 | Explain client.id 91 | 92 | Although not required, you should always set a client.id since this allows you to easily correlate requests on the broker with the client instance which made it. 93 | Where can i find the full list of configs documentation? 
94 | 95 | https://docs.confluent.io/current/installation/configuration/producer-configs.html#cp-config-producer 96 | unclean.leader.election.enable 97 | 98 | default value is true, this will allow out of sync replicas to become leaders. 99 | This should be disable in critical applications like banking system managing transactions. 100 | min.insync.replicas 101 | 102 | This will ensure the minumm number of replicas are in sync. 103 | NotEnoughReplicasException will be throw to producer when the in sync replicas are less then what is configured. 104 | -------------------------------------------------------------------------------- /Kafka Producer/log4j.properties: -------------------------------------------------------------------------------- 1 | log4j.rootLogger=TRACE, stdout 2 | log4j.appender.stdout=org.apache.log4j.ConsoleAppender 3 | log4j.appender.stdout.Target=System.out 4 | log4j.appender.stdout.layout=org.apache.log4j.PatternLayout 5 | log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd'T'HH:mm:ss.SSS} %-5p [%c] - %m%n 6 | -------------------------------------------------------------------------------- /Kafka-CLI/commands.txt: -------------------------------------------------------------------------------- 1 | KAFKA CLI 2 | ① Start a zookeeper at default port 2181 3 | 4 | $bin/zookeeper-server-start.sh config/zookeeper.properties 5 | ② Start a kafka server at default port 9092 6 | 7 | $bin/kafka-server-start.sh config/server.properties 8 | ③ Create a kafka topic ‘my-first-topic’ with 3 partitions and 3 replicas 9 | 10 | $bin/kafka-topics.sh --zookeeper localhost:2181 --topic my-first-topic --create --replication-factor 3 --partitions 3 11 | ④ List all kafka topics 12 | 13 | $bin/kafka-topics.sh --zookeeper localhost:2181 --list 14 | ⑤ Describe kafka topic ‘my-first-topic’ 15 | 16 | $bin/kafka-topics.sh --zookeeper localhost:2181 --topic my-first-topic --describe 17 | ⑥ Delete kafka topic ‘my-first-topic’ 18 | 19 | $bin/kafka-topics.sh --zookeeper localhost:2181 --topic my-first-topic --delete 20 | Note: This will have no impact if delete.topic.enable is not set to true 21 | 22 | ⑦ Find out all the partitions without a leader 23 | 24 | $bin/kafka-topics.sh --zookeeper localhost:2181 --describe --unavailable-partitions 25 | ⑧ Produce messages to Kafka topic my-first-topic 26 | 27 | $bin/kafka-console-producer.sh --broker-list localhost:9092 --topic my-first-topic --producer-property acks=all 28 | > message 1 29 | > message 2 30 | > ^C 31 | ⑨ Start Consuming messages from kafka topic my-first-topic 32 | 33 | $bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic my-first-topic --from-beginning 34 | > message 1 35 | > message 2 36 | ⑩ Start Consuming messages in a consumer group from kafka topic my-first-topic 37 | 38 | $bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic my-first-topic --group my-first-consumer-group --from-beginning 39 | ⑪ List all consumer groups 40 | 41 | $bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list 42 | ⑫ Describe consumer group 43 | 44 | $bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe -group my-first-consumer-group 45 | ⑬ Reset offset of consumer group to replay all messages 46 | 47 | $bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe -group my-first-consumer-group --reset-offsets --to-earliest --execute --topic my-first-topic 48 | ⑭ Shift offsets by 2 (forward) as another strategy 49 | 50 | bin/kafka-consumer-groups --bootstrap-server 
localhost:9092 --group my-first-consumer-group --reset-offsets --shift-by 2 --execute --topic my-first_topic 51 | ⑮ Shift offsets by 2 (backward) as another strategy 52 | 53 | bin/kafka-consumer-groups --bootstrap-server localhost:9092 --group my-first-consumer-group --reset-offsets --shift-by -2 --execute --t 54 | -------------------------------------------------------------------------------- /Kafka-Core/Kafka-Core.txt: -------------------------------------------------------------------------------- 1 | Kafka Core 2 | How produced messages are sent to kafka server? 3 | 4 | What is a batch? 5 | 6 | A batch is just a collection of messages, all of which are being produced to the same topic and partition. 7 | Batches are also typically compressed, providing more efficient data transfer and storage at the cost of some processing power 8 | What is the role of a controller? 9 | 10 | Within a cluster of brokers, one broker will also function as the cluster controller (elected automatically from the live members of the cluster). 11 | The controller is responsible for administrative operations, including assigning partitions to brokers and monitoring for broker failures. 12 | What is a leader? 13 | 14 | A partition is owned by a single broker in the cluster, and that broker is called the leader of the partition. 15 | A partition may be assigned to multiple brokers, which will result in the partition being replicated (as seen in Figure 1-7). 16 | This provides redundancy of messages in the partition, such that another broker can take over leadership if there is a broker failure 17 | Types of replica 18 | 19 | Leader replica 20 | follower replica 21 | prefered replica 22 | insync replicas Only in-sync replicas are eligible to be elected as partition leaders in case the existing leader fails. 23 | out of sync : If a replica hasn’t requested a message in more than 10 seconds or if it has requested messages but hasn’t caught up to the most recent message in more than 10 seconds, the replica is considered out of sync. 24 | replica.lag.time.max.ms: The amount of time a follower can be inactive or behind before it is considered out of sync. 25 | By default, Kafka is configured with auto.leader.rebalance.enable=true, which will check if the preferred leader replica is not the current leader but is in-sync and trigger leader election to make the preferred leader the current leader. 26 | What is retension? 27 | 28 | Kafka brokers are configured with a default retention setting for topics, either retaining messages for some period of time (e.g., 7 days) or until the topic reaches a certain size in bytes (e.g., 1 GB). 29 | Once these limits are reached, messages are expired and deleted so that the retention configuration is a minimum amount of data available at any time. 30 | What is log compaction? 31 | 32 | Topics can also be configured as log compacted, which means that Kafka will retain only the last message produced with a specific key. 33 | This can be useful for changelog-type data, where only the last update is interesting. 34 | Zookeeper 35 | What is the role of zookeeper in kafka 36 | 37 | zookeeper maintains the list of brokers, brokers info is stored in zookeeper under /brokers/ids 38 | *Emphemeral node 39 | 40 | Every time a broker process starts, it registers itself with its ID in Zookeeper by creating an ephemeral node. 41 | When a broker loses connectivity to Zookeeper, the ephemeral node that the broker created when starting will be automatically removed from Zookeeper. 
42 | Even though the node representing the broker is gone when the broker is stopped, the broker ID still exists in other data structures. For example, the list of replicas of each topic contains the broker IDs for the replica. 43 | This way, if you completely lose a broker and start a brand new broker with the ID of the old one, it will immediately join the cluster in place of the missing broker with the same partitions and topics assigned to it. 44 | What is ensemble? 45 | 46 | A Zookeeper cluster is called an ensemble. 47 | Due to the algorithm used, it is recommended that ensembles contain an odd number of servers (e.g., 3, 5, etc.) as a majority of ensemble members (a quorum) must be working in order for Zookeeper to respond to requests. 48 | This means that in a three-node ensemble, you can run with one node missing. 49 | With a five-node ensemble, you can run with two nodes missing. 50 | Ensemble configuration? 51 | 52 | Example: 53 | tickTime=2000 54 | dataDir=/var/lib/zookeeper 55 | clientPort=2181 56 | initLimit=20 57 | syncLimit=5 58 | server.1=zoo1.example.com:2888:3888 59 | server.2=zoo2.example.com:2888:3888 60 | server.3=zoo3.example.com:2888:3888 61 | 62 | 63 | initLimit is the amount of time to allow followers to connect with a leader. 64 | SyncLimit The syncLimit value limits how out-of-sync followers can be with the leader 65 | clientPort 2181 66 | The servers are specified in the format server.X=hostname:peerPort:leaderPort, with 67 | the following parameters: 68 | X The ID number of the server. This must be an integer, but it does not need to be zero-based or sequential. 69 | hostname The hostname or IP address of the server. 70 | peerPort The TCP port over which servers in the ensemble communicate with each other. (default port: 2888) 71 | leaderPort The TCP port over which leader election is performed. (default port: 3888) 72 | Kafka Broker 73 | broker.id 74 | 75 | Every Kafka broker must have an integer identifier, which is set using the broker.id configuration. 76 | By default, this integer is set to 0, but it can be any value. 77 | The most important thing is that the integer must be unique within a single Kafka cluster. 78 | The selection of this number is arbitrary, and it can be moved between brokers if necessary for maintenance tasks. 79 | A good guideline is to set this value to something intrinsic to the host so that when performing maintenance it is not onerous to map broker ID numbers to hosts. 80 | For example, if your hostnames contain a unique number (such as host1.example.com, host2.example.com, etc.), that is a good choice for the broker.id value. 81 | port 82 | 83 | The example configuration file starts Kafka with a listener on TCP port 9092. 84 | zookeeper.connect 85 | 86 | The location of the Zookeeper used for storing the broker metadata is set using the zookeeper. 87 | Connect configuration parameter. The example configuration uses a Zookeeper running on port 2181 on the local host, which is specified as localhost:2181. 88 | log.dirs 89 | 90 | Kafka persists all messages to disk, and these log segments are stored in the directories specified in the log.dirs configuration. 91 | This is a comma-separated list of paths on the local system. 92 | If more than one path is specified, the broker will store partitions on them in a “least-used” fashion with one partition’s log segments stored within the same path. 
93 | Note that the broker will place a new partition in the path that has the least number of partitions currently stored in it, not the least amount of disk space used. 94 | num.recovery.threads.per.data.dir 95 | 96 | Kafka uses a configurable pool of threads for handling log segments. Currently, this thread pool is used: 97 | • When starting normally, to open each partition’s log segments 98 | • When starting after a failure, to check and truncate each partition’s log segments 99 | • When shutting down, to cleanly close log segments 100 | By default, only one thread per log directory is used. 101 | As these threads are only used during startup and shutdown, it is reasonable to set a larger number of threads in order to parallelize operations. 102 | Specifically, when recovering from an unclean shutdown, this can mean the difference of several hours when restarting a broker with a large number of partitions! When setting this parameter, remember that the number configured is per log directory specified with log.dirs. 103 | This means that if num.recovery.threads.per.data.dir is set to 8, and there are 3 paths specified in log.dirs, this is a total of 24 threads. 104 | auto.create.topics.enable 105 | 106 | The default Kafka configuration specifies that the broker should automatically create a topic under the following circumstances: 107 | • When a producer starts writing messages to the topic 108 | • When a consumer starts reading messages from the topic 109 | • When any client requests metadata for the topic 110 | In many situations, this can be undesirable behavior, especially as there is no way to validate the existence of a topic through the Kafka protocol without causing it to be created. 111 | If you are managing topic creation explicitly, whether manually or through a provisioning system, you can set the auto.create.topics.enable configuration to false. 112 | num.partitions 113 | 114 | default is 1 115 | log.retention.ms 116 | 117 | The most common configuration for how long Kafka will retain messages is by time. 118 | The default is specified in the configuration file using the log.retention.hours 119 | parameter, and it is set to 168 hours, or one week. However, there are two other 120 | parameters allowed, log.retention.minutes and log.retention.ms. All three of 121 | these specify the same configuration—the amount of time after which messages may 122 | be deleted—but the recommended parameter to use is log.retention.ms, as the 123 | smaller unit size will take precedence if more than one is specified. This will make 124 | sure that the value set for log.retention.ms is always the one used. If more than one 125 | is specified, the smaller unit size will take precedence. 126 | log.retention.bytes 127 | 128 | This property will define the amount of data retainied per partition, if the topic has 8 partition then data retained per topic would be 8GB. 129 | if both the log.retention.ms and log.retention.bytes are configured, messages may be removed when either criteria is met. 130 | log.segment.bytes 131 | 132 | Once the log segment has reached the size specified by the log.segment.bytes parameter, which defaults to 1 GB, the log segment is closed and a new one is opened. 133 | Once a log segment has been closed, it can be considered for expiration. 134 | log.segment.ms 135 | 136 | Specifies the amount of time after which a log segment should be closed. 137 | Kafka will close a log segment either when the size limit is reached or when the time limit is reached, whichever comes first. 
138 | By default, there is no setting for log.segment.ms, which results in only closing log segments by size. 139 | message.max.bytes 140 | 141 | message.max.bytes parameter, which defaults to 1000000, or 1 MB. 142 | This configuration deals with compressed message size, actual uncompressed message can be larger then it. 143 | 144 | There are noticeable performance impacts from increasing the allowable message size. 145 | Larger messages will mean that the broker threads that deal with processing network connections and requests will be working longer on each request. 146 | Larger messages also increase the size of disk writes, which will impact I/O throughput. 147 | fetch.message.max.bytes 148 | 149 | If this value is smaller than message.max.bytes, then consumers that encounter larger messages will fail to fetch those messages, resulting in a situation where the consumer gets stuck and cannot proceed. 150 | Memory 151 | 152 | Most of the cases, consumer is caught up and lagging behind the producers very little. consumer will read are from system page cache resulting in faster reads. 153 | Having more memory avilable to the system for page cache will improve the performance of consumer clients. 154 | Kafka itself does not need much heap memory configured for the Java Virtual Machine (JVM). Even a broker that is handling X messages per second and a data rate of X megabits per second can run with a 5 GB heap. 155 | CPU CPU power is mainly utilized for compressing messages from disc and recompress the message batch in order to store in disc. 156 | 157 | Request processing: 158 | 159 | client request --> broker --> partitions leaders --> reponse --> broker --> client All requests sent to the broker from a specific client will be processed in the order in which they were received—this guarantee is what allows Kafka to behave as a message queue and provide ordering guarantees on the messages it stores. 160 | 161 | Request header 162 | 163 | request type 164 | request version 165 | correlation id 166 | client id 167 | client cache topic metadata 168 | 169 | client request for the metadata (request type: metadata request, which includes a list of topics the client is interested in). 170 | metadata containts which partitions exist in the topics, the replicas for each partition, and which replica is the leader 171 | all brokers caches the metadata informations. 172 | metadata.max.age.ms defines the time to refresh the medadata in client. 173 | if a client receives "not a leader", it will refresh metadata before retrying. 174 | where does kafka writes the produced messages 175 | 176 | On Linux, the messages are written to the filesystem cache and there is no guarantee about when they will be written to disk. 177 | Kafka does not wait for the data to get persisted to disk—it relies on replication for message durability. 178 | what are segments 179 | 180 | partitions are further divided into segments, default size of segment is either 1 GB of data or a week of data. 181 | currently writting segments is called active segment. active segment will never be deleted even the retension is passed. 182 | Kafka broker will keep an open file handle to every segment in every partition—even inactive segments. This leads to an usually high number of open file handles, and the OS must be tuned accordingly. 
 183 | Additional message info 184 | 185 | Each message contains—in addition to its key, value, and offset—things like the message size, a checksum code that allows us to detect corruption, a magic byte that indicates the version of the message format, the compression codec (Snappy, GZip, or LZ4), and a timestamp (added in release 0.10.0). The timestamp is given either by the producer when the message was sent or by the broker when the message arrived—depending on configuration. 186 | Indexes 187 | 188 | Kafka maintains an index for each partition; the index maps offsets to segment files and to positions within the file. 189 | compaction 190 | 191 | Policies: delete --> deletes events older than the retention time. compact --> keeps only the most recent value for each key. 192 | 193 | How does compaction work? 194 | 195 | clean --> the portion of the log that has already been compacted. 196 | dirty --> the messages written after the last compaction. 197 | Deleted events --> the producer sends a message with the key set and a null value (a tombstone). 198 | 199 | The compact policy never compacts messages in the current (active) segment; only inactive segments are eligible. 200 | Where does Kafka store dynamic per-broker configurations? 201 | 202 | zookeeper. 203 | Where are dynamic cluster-wide default configs stored? 204 | 205 | zookeeper. 206 | -------------------------------------------------------------------------------- /Kafka-Streams/Stream-Notes.txt: -------------------------------------------------------------------------------- 1 | Kafka streams 2 | What is a topology? 3 | 4 | A topology represents the DAG of a stream processing application. 5 | what are kafka streams? 6 | 7 | Kafka Streams leverages the consumer and producer APIs, so all the properties applicable to consumers and producers are still applicable here. 8 | application.id is specific to stream applications and is used for 9 | the consumer group.id (= application.id) 10 | the default client.id prefix 11 | the prefix of the internal changelog topics 12 | Kafka Streams creates internal intermediate topics 13 | 14 | repartitioning topics 15 | changelog topics 16 | What are the types of stores in streams? 17 | 18 | 1. local or internal state 19 | 2. external state 20 | What is a state store? 21 | 22 | Some operations in streams (like avg, count, etc.) depend on previously calculated information, for which state needs to be stored. 23 | What happens to local state if a node goes down? 24 | 25 | Local state is stored in RocksDB and is also written to a Kafka changelog topic; in case of node failure, the state is recreated from Kafka. 26 | What are the types of windowing options in streams? 27 | 28 | 1. Tumbling window or sliding window 29 | 2. Hopping window 30 | How are out-of-order records handled in streams? 31 | 32 | Stream vs table? 33 | 34 | a stream represents all the events that occurred on a table. insert-only mode. 35 | a table represents the current state of the records. upsert mode. 
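To connect the stream/table distinction above to the API, here is a minimal Kafka Streams sketch (topic names, the application.id, and the broker address are made up for illustration). It reads a pageviews stream, aggregates it into a KTable backed by a local state store and changelog topic, and writes the table's updates back out as a stream.

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class PageViewCountApp {

    public static void main(String[] args) {
        Properties props = new Properties();
        // application.id doubles as the consumer group.id and as the prefix of internal topics
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "pageview-count-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Stream: every pageview event, insert-only.
        KStream<String, String> pageviews = builder.stream("pageviews");

        // Table: current count per key (upsert), backed by a local RocksDB store plus a changelog topic.
        KTable<String, Long> countsPerUser = pageviews
                .groupByKey()
                .count();

        // Emit each update of the table as a record on an output topic.
        countsPerUser.toStream()
                .to("pageview-counts", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}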
36 | -------------------------------------------------------------------------------- /Schema-Registry/Avro Schema/Customer.avro: -------------------------------------------------------------------------------- 1 | { 2 | "type": "record", 3 | "namespace": "com.example", 4 | "name": "Customer", 5 | "version": "1", 6 | "fields": [ 7 | { "name": "first_name", "type": "string", "doc": "First Name of Customer" }, 8 | { "name": "last_name", "type": "string", "doc": "Last Name of Customer" }, 9 | { "name": "age", "type": "int", "doc": "Age at the time of registration" }, 10 | { "name": "height", "type": "float", "doc": "Height at the time of registration in cm" }, 11 | { "name": "weight", "type": "float", "doc": "Weight at the time of registration in kg" }, 12 | { "name": "automated_email", "type": "boolean", "default": true, "doc": "Field indicating if the user is enrolled in marketing emails" } 13 | ] 14 | } 15 | -------------------------------------------------------------------------------- /Schema-Registry/CustomKafkaProducer.java: -------------------------------------------------------------------------------- 1 | import org.apache.kafka.clients.producer.ProducerConfig; 2 | import org.apache.kafka.clients.producer.ProducerRecord; 3 | import org.apache.kafka.clients.producer.RecordMetadata; 4 | import org.apache.kafka.common.serialization.StringSerializer; 5 | 6 | import java.util.Properties; 7 | import java.util.concurrent.Future; 8 | 9 | public class CustomKafkaProducer { 10 | 11 | org.apache.kafka.clients.producer.KafkaProducer producer; 12 | 13 | public CustomKafkaProducer() { 14 | super(); 15 | this.producer = new org.apache.kafka.clients.producer.KafkaProducer(getProperties()); 16 | } 17 | 18 | 19 | private Properties getProperties(){ 20 | Properties config = new Properties(); 21 | config.put("client.id", "sample-client-id"); 22 | config.put("bootstrap.servers", "13.233.142.162:9092,13.233.142.162:29092"); 23 | config.put("acks", "all"); 24 | config.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName()); 25 | config.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName()); 26 | return config; 27 | } 28 | 29 | public boolean produce(String topic, String key, String value){ 30 | final ProducerRecord record = new ProducerRecord(topic, key, value); 31 | Future future = producer.send(record); 32 | return true; 33 | } 34 | 35 | } 36 | -------------------------------------------------------------------------------- /Schema-Registry/KafkaProducerMain.java: -------------------------------------------------------------------------------- 1 | import com.example.Customer; 2 | import io.confluent.kafka.serializers.KafkaAvroSerializer; 3 | import org.apache.kafka.clients.producer.*; 4 | import org.apache.kafka.common.serialization.StringSerializer; 5 | import org.slf4j.Logger; 6 | import org.slf4j.LoggerFactory; 7 | 8 | import java.util.Properties; 9 | 10 | public class KafkaProducerMain { 11 | 12 | private static Logger LOGGER = LoggerFactory.getLogger(KafkaProducerMain.class); 13 | 14 | public static void main(String[] args) throws Exception { 15 | Properties properties = new Properties(); 16 | // normal producer 17 | properties.setProperty("bootstrap.servers", "13.233.161.161:29092"); 18 | properties.setProperty("acks", "all"); 19 | properties.setProperty("retries", "10"); 20 | // avro part 21 | properties.setProperty("key.serializer", StringSerializer.class.getName()); 22 | properties.setProperty("value.serializer", 
KafkaAvroSerializer.class.getName()); 23 | properties.setProperty("schema.registry.url", "http://13.233.161.161:8081"); 24 | 25 | Producer producer = new KafkaProducer(properties); 26 | 27 | String topic = "customer-avro"; 28 | 29 | // copied from avro examples 30 | Customer customer = Customer.newBuilder() 31 | .setAge(34) 32 | .setAutomatedEmail(false) 33 | .setFirstName("John") 34 | .setLastName("Doe") 35 | .setHeight(178f) 36 | .setWeight(75f) 37 | .build(); 38 | 39 | ProducerRecord producerRecord = new ProducerRecord( 40 | topic, customer 41 | ); 42 | 43 | System.out.println(customer); 44 | producer.send(producerRecord, new Callback() { 45 | public void onCompletion(RecordMetadata metadata, Exception exception) { 46 | if (exception == null) { 47 | System.out.println(metadata); 48 | } else { 49 | exception.printStackTrace(); 50 | } 51 | } 52 | }); 53 | 54 | producer.flush(); 55 | producer.close(); 56 | 57 | } 58 | } 59 | -------------------------------------------------------------------------------- /Schema-Registry/Schema-Registry.txt: -------------------------------------------------------------------------------- 1 | Kafka schema-registry 2 | How to connect and run schema registry commands? 3 | docker-commands 4 | docker exec -it schema-registry /bin/bash 5 | 6 | useful kafka commands 7 | kafka-avro-console-producer 8 | --broker-list broker:29092 --topic bar 9 | --property value.schema='{"type":"record","name":"myrecord","fields":[{"name":"f1","type":"string"}]}' 10 | 11 | sample mesages {"f1": "value1"} {"f2": "value2"} 12 | 13 | kafka-avro-console-consumer 14 | --topic customer-avro 15 | --bootstrap-server broker:29092 16 | --from-beginning 17 | 18 | Where does kafka stores schema's? 19 | 20 | Kafka stores schema infromation in _schema topic. 21 | 22 | What are the schema registry properties we need to provide in producer to register schema? 23 | 24 | props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); 25 | props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,io.confluent.kafka.serializers.KafkaAvroSerializer.class); 26 | props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, io.confluent.kafka.serializers.KafkaAvroSerializer.class); 27 | props.put("schema.registry.url", "http://localhost:8081"); 28 | How to registry schema for a key? 29 | 30 | props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,io.confluent.kafka.serializers.KafkaAvroSerializer.class); 31 | This will register schema for a key. 32 | What is a specificrecord or genericRecord? 33 | 34 | Specificrecord: 35 | It is a auto generated pojo class by maven plugin when a .avsc schema file is provided. 36 | This will throw an compilation error when a field is missing. 37 | GenericRecord: 38 | schema will be provided explicitly and field will be accessed either by name or index. 39 | This will throw an compilation error when a field is missing. 40 | Will kafka producer always register schema for each produced message? 41 | 42 | No. schema will be registered when the first message is published and it will be cached in a hashmap by KafkaAvroSerializer. 43 | Next time it will retrive it from cache. 44 | Will kafka consumer always fetch schema for each consumed message? 45 | 46 | No. schema will be registered when the first message is fetched from registry and it will be cached in a hashmap by KafkaAvroDeSerializer. 47 | Next time it will retrive it from cache. 48 | When deserialize a message if schema is not avilable in cache 49 | 50 | Schema will be fetched from schema registry. 51 | 52 | What is magicbyte? 
53 | 54 | magicbyte is a schema id got after schema is registered with schema registry. 55 | It will appened to the message and consumer will refer to this while decoding the message. 56 | What are type of compatibility schema registry supports and how? 57 | 58 | These are the compatibility types: BACKWARD: (default) consumers using the new schema can read data written by producers using the latest registered schema BACKWARD_TRANSITIVE: consumers using the new schema can read data written by producers using all previously registered schemas FORWARD: consumers using the latest registered schema can read data written by producers using the new schema FORWARD_TRANSITIVE: consumers using all previously registered schemas can read data written by producers using the new schema FULL: the new schema is forward and backward compatible with the latest registered schema FULL_TRANSITIVE: the new schema is forward and backward compatible with all previously registered schemas NONE: schema compatibility checks are disabled 59 | 60 | what is the default compatibility type 61 | 62 | BACKWARD 63 | will a custom pojo can be auto registered in register schema? 64 | 65 | No, this has to be handled explicitly and you need to modify your class similar to auto generate class by maven plugin. 66 | How to disable auto schema registry? 67 | 68 | client applications automatically register new schemas. This is very convenient in development environments, but in production environments we recommend that client applications do not automatically register new schemas. 69 | props.put(AbstractKafkaAvroSerDeConfig.AUTO_REGISTER_SCHEMAS, false); 70 | -------------------------------------------------------------------------------- /Schema-Registry/log4j.properties: -------------------------------------------------------------------------------- 1 | log4j.rootLogger=TRACE, stdout 2 | log4j.appender.stdout=org.apache.log4j.ConsoleAppender 3 | log4j.appender.stdout.Target=System.out 4 | log4j.appender.stdout.layout=org.apache.log4j.PatternLayout 5 | log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd'T'HH:mm:ss.SSS} %-5p [%c] - %m%n 6 | -------------------------------------------------------------------------------- /Set-UP README.md: -------------------------------------------------------------------------------- 1 | Setup & Commands 2 | How to run confluent kafka docker in AWS 3 | 1. Create an ec2 instance 4 | 2. install git 5 | 3. install docker 6 | 4. install docker-compose 7 | 5. git clone https://github.com/confluentinc/cp-docker-images 8 | 6. cd cp-docker-images 9 | 7. git checkout 5.2.1-post 10 | 8. cd examples/cp-all-in-one/ 11 | 9. docker-compose up -d --build 12 | Linux command 13 | # List topics docker-compose exec broker kafka-topics --zookeeper zookeeper:2181 --list 14 | # Create topic docker-compose exec broker kafka-topics --create --zookeeper \ 15 | zookeeper:2181 --replication-factor 1 --partitions 1 --topic users 16 | Docker command 17 | # docker command to run in bash or sh shell 18 | bash interactive mode - docker exec -it broker /bin/bash 19 | sh mode - docker-compose exec broker sh 20 | kafka-topics --zookeeper zookeeper:2181 --list 21 | -------------------------------------------------------------------------------- /azure-pipelines.yml: -------------------------------------------------------------------------------- 1 | # Starter pipeline 2 | # Start with a minimal pipeline that you can customize to build and deploy your code. 
3 | # Add steps that build, run tests, deploy, and more: 4 | # https://aka.ms/yaml 5 | 6 | trigger: 7 | - master 8 | 9 | pool: 10 | name: Default 11 | vmImage: vs2017-win2016 12 | 13 | steps: 14 | - script: echo Hello, world! 15 | displayName: 'Run a one-line script' 16 | 17 | - script: | 18 | echo Add other tasks to build, test, and deploy your project. 19 | echo See https://aka.ms/yaml 20 | displayName: 'Run a multi-line script' 21 | - task: WhiteSource@21 22 | inputs: 23 | cwd: '$(System.DefaultWorkingDirectory)' 24 | projectName: 'Kafka-Notes' 25 | --------------------------------------------------------------------------------