├── Important-Links └── README.md ├── KSQL └── KSQL.txt ├── Kafka CLI- Cheat Sheet └── README.md ├── Kafka Connectors └── Kafka-Connectors.txt ├── Kafka Consumer API └── KafkaConsumerTest.java ├── Kafka Consumer └── Consumer.txt ├── Kafka Producer API └── KafkaProducerTest.java ├── Kafka Producer ├── CustomKafkaProducer.Java ├── KafkaProducerMain.java ├── README.md └── log4j.properties ├── Kafka-CLI └── commands.txt ├── Kafka-Core └── Kafka-Core.txt ├── Kafka-Streams └── Stream-Notes.txt ├── Schema-Registry ├── Avro Schema │ └── Customer.avro ├── CustomKafkaProducer.java ├── KafkaProducerMain.java ├── Schema-Registry.txt └── log4j.properties ├── Set-UP README.md └── azure-pipelines.yml /Important-Links/README.md: -------------------------------------------------------------------------------- 1 | Links: 2 | Kafka-rest https://docs.confluent.io/current/kafka-rest/index.html 3 | kafka-rest https://docs.confluent.io/current/kafka-rest/docs/index.html 4 | 5 | kafka-rest https://docs.confluent.io/current/kafka-rest/quickstart.html 6 | 7 | schema registry https://docs.confluent.io/current/schema-registry/schema_registry_tutorial.html 8 | 9 | Kafka-connectors https://www.baeldung.com/kafka-connectors-guide 10 | 11 | Kafka-connect https://data-flair.training/blogs/kafka-connect/amp/ 12 | 13 | Kafka-connect-avro https://docs.confluent.io/current/cp-docker-images/docs/tutorials/connect-avro-jdbc.html 14 | 15 | kafka-connect https://www.confluent.io/blog/simplest-useful-kafka-connect-data-pipeline-world-thereabouts-part-1/ 16 | 17 | kafka-connect https://docs.confluent.io/current/cp-docker-images/docs/tutorials/connect-avro-jdbc.html 18 | 19 | Kafka-streams https://www.confluent.io/stream-processing-cookbook/ 20 | 21 | Kafka-streams https://docs.confluent.io/current/streams/index.html 22 | 23 | KSQL https://docs.confluent.io/current/ksql/docs/faq.html 24 | 25 | KSQL https://docs.confluent.io/current/ksql/docs/tutorials/index.html 26 | 27 | KSQL https://www.michael-noll.com/blog/2018/04/05/of-stream-and-tables-in-kafka-and-stream-processing-part1/ 28 | 29 | Kafka Security https://medium.com/@stephane.maarek/introduction-to-apache-kafka-security-c8951d410adf 30 | 31 | Kafka-control-center https://docs.confluent.io/current/tutorials/cp-demo/docs/index.html 32 | 33 | Video tutorials: 34 | Kafka schema registry and rest-proxy - https://youtu.be/5fjw62LGYNg 35 | 36 | Avro: schema evolution - By Stephane Maarek - https://youtu.be/SZX9DM_gyOE 37 | 38 | Avro producer - By Stephane Maarek https://youtu.be/_6HTHH1NCK0 39 | 40 | Kafka Connect Architecture - By Stephane Maarek https://youtu.be/YOGN7qr2nSE 41 | 42 | Kafka Connect Concepts - By Stephane Maarek https://youtu.be/BUv1IgWm-gQ 43 | 44 | Kafka Connect Distributed architecture - By Stephane Maarek https://youtu.be/52HXoxthRs0 45 | 46 | Kafka-connect https://www.youtube.com/playlist?list=PLt1SIbA8guutTlfh0J7bGboW_Iplm6O_B 47 | 48 | KSQL https://www.youtube.com/watch?v=ExEWJVjj-RA&list=PLa7VYi0yPIH2eX8q3mPpZAn3qCS1eDX8W 49 | 50 | kafka-streams explained by Neha https://www.youtube.com/watch?v=A9KQufewd-s&feature=youtu.be 51 | 52 | kafka-streams https://www.youtube.com/watch?v=Z3JKCLG3VP4&list=PLa7VYi0yPIH1vDclVOB49xUruBAWkOCZD 53 | 54 | kafka-streams https://youtu.be/Z3JKCLG3VP4 55 | 56 | kafka-streams https://youtu.be/LxxeXI1mPKo 57 | 58 | 59 | kafka-streams https://youtu.be/-y2ALVkU5Bc 60 | 61 | kafka-streams - By Stephane Maarek https://youtu.be/wPw3tb_dl70 62 | 63 | kafka-streams - By Tim berglund https://youtu.be/7JYEEx7SBuE 64 | 65 | kafka-streams - 
By Tim berglund https://youtu.be/3kJgYIkAeHs 66 | 67 | Architecture and Advanced Concepts 68 | 69 | Exactly once semantics : https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/ 70 | 71 | Kafka partition https://medium.com/@anyili0928/what-i-have-learned-from-kafka-partition-assignment-strategy-799fdf15d3ab 72 | 73 | kafka stateful DSL : https://kafka.apache.org/20/documentation/streams/developer-guide/dsl-api.html#stateful-transformations 74 | Real-time use-cases & Blog's 75 | 76 | https://www.confluent.io/stream-processing-cookbook/ https://www.confluent.io/blog/ 77 | 78 | Alternate links for preparation 79 | 80 | http://lahotisolutions.blogspot.com/2019/03/apache-kafka-notes.html https://www.quora.com/How-do-I-prepare-for-Kafka-certification-confluent-1 81 | 82 | Sources 83 | Content in this github is sourced from following locations 84 | 85 | Confluent documentation 86 | Apache documentation 87 | Kafka definitive guide Book 88 | -------------------------------------------------------------------------------- /KSQL/KSQL.txt: -------------------------------------------------------------------------------- 1 | Kafka ksql 2 | Steps to connect to ksql running with docker. 3 | docker exec -it ksql-server /bin/bash 4 | comamnd to data generate to ksql stream topic 5 | docker exec -it ksql-datagen bash 6 | ksql-datagen \ 7 | bootstrap-server=broker:29092 \ 8 | quickstart=pageviews \ 9 | format=delimited \ 10 | topic=pageviews \ 11 | maxInterval=500 12 | ksql-datagen schema=./userprofile.avro format=json topic=USERPROFILE key=userid maxInterval=5000 iterations=100ootstrap-server=broker:29092 13 | 14 | KSQL will execute in 2 modes 15 | 16 | Interactive and headless 17 | is rest API supported in headless mode? 18 | 19 | No. rest API is not supported in headless way. 20 | 21 | what is KTable? 22 | 23 | what is GlobalKTable? 24 | 25 | Ktable vs GlobalKTable? 26 | 27 | Type of join supported by KSQL? 28 | 29 | output types in joins 30 | 31 | join between a stream and a stream will return a new stream 32 | join between a stream and a table will return a stream 33 | join between a table and a table will return a table 34 | 35 | How to terminate a query 36 | 37 | terminate query "query_name" 38 | How to run a ksql script from cli 39 | 40 | run script "./path/to/file.ksql" 41 | How to print a topic from beginning 42 | 43 | print 'topicname' from beginning 44 | 45 | How to stream from beginning 46 | 47 | set 'auto.offset.reset'='earliest'; 48 | select * from stream; 49 | Create a new stream from a stream 50 | 51 | create stream user_profile_pretty as select firstname || ' ' || ucase( lastname) || ' from ' || countrycode || ' has a rating of ' || cast(rating as varchar) || ' stars. ' || case when rating < 2.5 then 'Poor' when rating between 2.5 and 4.2 then 'Good'else 'Excellent'end as description from userprofile; 52 | How to set infinite retention in kafka 53 | -------------------------------------------------------------------------------- /Kafka CLI- Cheat Sheet/README.md: -------------------------------------------------------------------------------- 1 | https://medium.com/@TimvanBaarsen/apache-kafka-cli-commands-cheat-sheet-a6f06eac01b 2 | -------------------------------------------------------------------------------- /Kafka Connectors/Kafka-Connectors.txt: -------------------------------------------------------------------------------- 1 | what is a worker? 2 | 3 | A single java process, it can be standalone or disctributed mode. 
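To make the worker/connector/task split concrete, here is a minimal sketch of a source connector class, the kind of plugin a worker process loads and runs. The class names (FileLineSourceConnector, FileLineSourceTask) are made up for illustration; the SourceConnector/SourceTask API is the standard org.apache.kafka.connect one. Note how taskConfigs(maxTasks) is where a connector decides how to split its work, which ties into the tasks.max discussion further below.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.source.SourceConnector;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

// Hypothetical connector, used only to illustrate the Connect plugin API.
public class FileLineSourceConnector extends SourceConnector {

    private Map<String, String> configProps;

    @Override
    public void start(Map<String, String> props) {
        this.configProps = props; // configuration handed to the worker (properties file or REST submission)
    }

    @Override
    public Class<? extends Task> taskClass() {
        return FileLineSourceTask.class;
    }

    @Override
    public List<Map<String, String>> taskConfigs(int maxTasks) {
        // The worker passes tasks.max here; the connector decides how many task configs to hand back.
        List<Map<String, String>> configs = new ArrayList<>();
        for (int i = 0; i < maxTasks; i++) {
            configs.add(new HashMap<>(configProps));
        }
        return configs;
    }

    @Override
    public void stop() { }

    @Override
    public ConfigDef config() {
        return new ConfigDef(); // real connectors declare and validate their settings here
    }

    @Override
    public String version() {
        return "0.1.0";
    }

    // Minimal task skeleton so the sketch is self-contained.
    public static class FileLineSourceTask extends SourceTask {
        @Override
        public void start(Map<String, String> props) { }

        @Override
        public List<SourceRecord> poll() throws InterruptedException {
            return null; // a real task would read from the source and return SourceRecords
        }

        @Override
        public void stop() { }

        @Override
        public String version() {
            return "0.1.0";
        }
    }
}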
4 | 5 | Modes to run kafka connect server 6 | 7 | Standalone mode 8 | Distributed mode - recommended for production use. 9 | What is standalone Mode 10 | 11 | A single process runs your connectors and tasks. 12 | Configration is bundled with your process 13 | Very easy to started with, useful for development and testing. 14 | Not fault tolerant, no scalability, hard to monitor 15 | What is Distributed mode 16 | 17 | Multiple workers run your connetors and tasks. 18 | Configuration is submitted using REST API. 19 | Easy to scale and fault tolerant(rebalance in case a worker dies) 20 | useful for production deployment of connectors. 21 | java memory property to control java heap size 22 | 23 | export KAFKA_HEAP_OPTS="-Xms256M -Xmx2G" 24 | https://stackoverflow.com/questions/50621962/how-to-set-kafka-connect-connector-and-tasks-jvm-heap-size 25 | What is a task and how connector will break them into tasks? 26 | 27 | What is task.maxs property? 28 | 29 | Defined the no of tasks that we want to run in parallel. 30 | For source connector we usually keep it 1(one). 31 | For sink connectors this property will be set to higher? 32 | -------------------------------------------------------------------------------- /Kafka Consumer API/KafkaConsumerTest.java: -------------------------------------------------------------------------------- 1 | import java.time.Duration; 2 | import java.util.Collections; 3 | import java.util.Properties; 4 | import java.util.concurrent.ExecutionException; 5 | 6 | import org.apache.kafka.clients.consumer.ConsumerConfig; 7 | import org.apache.kafka.clients.consumer.ConsumerRecord; 8 | import org.apache.kafka.clients.consumer.ConsumerRecords; 9 | import org.apache.kafka.clients.consumer.KafkaConsumer; 10 | import org.apache.kafka.common.serialization.StringDeserializer; 11 | 12 | public class KafkaConsumerTest { 13 | 14 | public static void main(String[] args) throws InterruptedException, ExecutionException{ 15 | //Create consumer property 16 | String bootstrapServer = "localhost:9092"; 17 | String groupId = "my-first-consumer-group"; 18 | String topicName = "my-first-topic"; 19 | 20 | Properties properties = new Properties(); 21 | properties.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServer); 22 | properties.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName()); 23 | properties.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName()); 24 | properties.setProperty(ConsumerConfig.GROUP_ID_CONFIG, groupId); 25 | properties.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); 26 | properties.setProperty(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); 27 | 28 | //Create consumer 29 | KafkaConsumer consumer = new KafkaConsumer<>(properties); 30 | 31 | //Subscribe consumer to topic(s) 32 | consumer.subscribe(Collections.singleton(topicName)); 33 | 34 | 35 | //Poll for new data 36 | while(true){ 37 | ConsumerRecords records = consumer.poll(Duration.ofMillis(1000)); 38 | 39 | for(ConsumerRecord record: records){ 40 | System.out.println(record.key() + record.value()); 41 | System.out.println(record.topic() + record.partition() + record.offset()); 42 | } 43 | 44 | //Commit consumer offset manually (recommended) 45 | consumer.commitAsync(); 46 | } 47 | 48 | } 49 | } 50 | -------------------------------------------------------------------------------- /Kafka Consumer/Consumer.txt: -------------------------------------------------------------------------------- 1 | What is 
consumer group? 2 | 3 | What is consumer? 4 | 5 | When multiple consumers are subscribed to a topic and belong to the same consumer group, each consumer in the group will receive messages from a different subset of the partitions in the topic. 6 | If we add more consumers to a single group with a single topic than we have partitions, some of the consumers will be idle and get no messages at all. 7 | What is rebalance? 8 | 9 | Moving partition ownership from one consumer to another is called rebalance. 10 | During a rebalance, consumers can’t consume messages, so a rebalance is basically a short window of unavailability of the entire consumer group. 11 | What is group coordinator? 12 | 13 | The way consumers maintain membership in a consumer group and ownership of the partitions assigned to them is by sending heartbeats to a Kafka broker designated as the group coordinator. 14 | Heartbeats are sent when the consumer polls (i.e., retrieves records) and when it commits records it has consumed. 15 | what are the consumer mandatory properties 16 | 17 | bootstrap.servers, key.deserializer and value.deserializer. 18 | group.id is not mandatory, but in most of the situations it will be populated. 19 | subscribing to topics 20 | 21 | subscribe method will take list of topics as parameters like below 22 | consumer.subscribe(Collections.singletonList("customerCountries")); 23 | It is also possible to call subscribe with a regular expression. 24 | consumer.subscribe("test.*"); 25 | poll loops 26 | 27 | poll loop handles all details of coordination, partition rebalances, heartbeats, and data fetching, leaving the developer with a clean API that simply returns available data from the assigned partitions. 28 | The parameter we pass, poll(), is a timeout interval and controls how long poll() will block if data is not available in the consumer buffer. If this is set to 0, poll() will return immediately; otherwise, it will wait for the specified number of milliseconds for data to arrive from the broker. 29 | consumer.poll(100); 30 | poll() returns a list of records. Each record contains the topic and partition the record came from, the offset of the record within the partition, and of course the key and the value of the record. 31 | Always close() the consumer before exiting. This will close the network connections and sockets. 32 | The poll loop does a lot more than just get data. The first time you call poll() with a new consumer, it is responsible for finding the GroupCoordinator, joining the consumer group, and receiving a partition assignment. If a rebalance is triggered, it will be handled inside the poll loop as well. And of course the heartbeats that keep consumers alive are sent from within the poll loop. 33 | fetch.min.bytes 34 | 35 | This property allows a consumer to specify the minimum amount of data that it wants to receive from the broker when fetching records. 36 | fetch.max.wait.ms 37 | 38 | By setting fetch.min.bytes, you tell Kafka to wait until it has enough data to send before responding to the consumer. 39 | default value is 500 ms. 40 | max.partition.fetch.bytes 41 | 42 | This property controls the maximum number of bytes the server will return per partition. 43 | session.timeout.ms 44 | 45 | The amount of time a consumer can be out of contact with the broker while still considered alive defaults to 3 seconds. 46 | heatbeat.interval.ms must be lower than session.timeout.ms, and is usually set to one-third of the timeout value. 
So if session.timeout.ms is 3 seconds, heartbeat.interval.ms should be 1 second. 47 | auto.offset.reset 48 | 49 | latest or earliest 50 | enable.auto.commit 51 | 52 | default is true. 53 | If you set enable.auto.commit to true, then you might also want to control how frequently offsets will be committed using auto.commit.interval.ms. 54 | What are the various partitions assignment strategies? 55 | 56 | org.apache.kafka.clients.consumer.RangeAssignor -> This is default option 57 | org.apache.kafka.clients.consumer.RoundRobinAssignor -> recommend when multi topics are consumered together. 58 | more details are here: https://medium.com/@anyili0928/what-i-have-learned-from-kafka-partition-assignment-strategy-799fdf15d3ab 59 | partition.assignment.strategy 60 | 61 | * Range : default 62 | * Roundrobin 63 | client.id 64 | 65 | This can be any string, and will be used by the brokers to identify messages sent from the client. It is used in logging and metrics, and for quotas. 66 | max.poll.records 67 | 68 | This controls the maximum number of records that a single call to poll() will return. This is useful to help control the amount of data your application will need to process in the polling loop. 69 | receive.buffer.bytes and send.buffer.bytes 70 | 71 | These are the sizes of the TCP send and receive buffers used by the sockets when writing and reading data. If these are set to -1, the OS defaults will be used. It can be a good idea to increase those when producers or consumers communicate with brokers in a different datacenter, because those network links typically have higher latency and lower bandwidth. 72 | commits and offsets 73 | 74 | Kafka stores the offsets in __consumer_offsets topic. 75 | commitsync() will retry untill it either succeeds or encounters a nonretraiable failure. commitAsync() will not retry. 76 | commitasync() 77 | 78 | If we are using commit async and handling retries in our code, then we need to be careful about the order of commits. 79 | combining synchronous and asynchronous commits 80 | 81 | But if we know that this is the last commit before we close the consumer, or before a rebalance, we want to make extra sure that the commit succeeds. 82 | Therefore, a common pattern is to combine commitAsync() with commitSync() just before shutdown. 83 | Rebalance listeners 84 | 85 | ConsumerRebalanceListener onPartitionsRevoked(Collection partitions) and onPartitionsRevoked(Collection partitions). 
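The interface exposes two callbacks: onPartitionsRevoked(Collection<TopicPartition>) and onPartitionsAssigned(Collection<TopicPartition>). Below is a minimal sketch of a custom listener like the one passed to subscribe() on the next line; the class name and the commit-on-revoke behaviour are illustrative.

import java.util.Collection;

import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class HandleRebalanceCustomListener implements ConsumerRebalanceListener {

    private final KafkaConsumer<String, String> consumer;

    public HandleRebalanceCustomListener(KafkaConsumer<String, String> consumer) {
        this.consumer = consumer;
    }

    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Called before partition ownership moves away: commit whatever has been processed so far.
        consumer.commitSync();
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Called after partitions are (re)assigned: seek to stored offsets or initialise state here.
    }
}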
86 | consumer.subscribe(topics, new HandleRebalanceCsutomListener()); 87 | -------------------------------------------------------------------------------- /Kafka Producer API/KafkaProducerTest.java: -------------------------------------------------------------------------------- 1 | import java.util.Properties; 2 | import java.util.concurrent.ExecutionException; 3 | 4 | import org.apache.kafka.clients.producer.KafkaProducer; 5 | import org.apache.kafka.clients.producer.ProducerConfig; 6 | import org.apache.kafka.clients.producer.ProducerRecord; 7 | import org.apache.kafka.common.serialization.StringSerializer; 8 | 9 | public class KafkaProducerTest { 10 | 11 | public static void main(String[] args) throws InterruptedException, ExecutionException{ 12 | //Create producer property 13 | String bootstrapServer = "localhost:9092"; 14 | Properties properties = new Properties(); 15 | properties.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServer); 16 | properties.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName()); 17 | properties.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName()); 18 | 19 | //Create safe producer 20 | properties.setProperty(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true"); 21 | properties.setProperty(ProducerConfig.ACKS_CONFIG, "all"); 22 | properties.setProperty(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "5"); 23 | properties.setProperty(ProducerConfig.RETRIES_CONFIG, Integer.toString(Integer.MAX_VALUE)); 24 | 25 | //High throughput producer (at the expense of a bit of latency and CPU usage) 26 | properties.setProperty(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy"); 27 | properties.setProperty(ProducerConfig.LINGER_MS_CONFIG, "20"); //20ms wait time 28 | properties.setProperty(ProducerConfig.BATCH_SIZE_CONFIG, Integer.toString(32*1024)); //32KB batch size 29 | 30 | //Create producer 31 | KafkaProducer producer = new KafkaProducer<>(properties); 32 | 33 | //create a producer record 34 | ProducerRecord record = new ProducerRecord<>("topicName", "firstRecord"); 35 | //create producer record with key 36 | //new ProducerRecord<>("topicName", "MessageKey", "Message"); 37 | //create producer record with key and partition number 38 | //new ProducerRecord<>("topicName", 1 /*partition number*/, "MessageKey", "Message"); 39 | 40 | //send data - asynchronous 41 | //without callback 42 | //producer.send(record); 43 | //with callback 44 | producer.send(record, (recordMetadata, exception) -> { 45 | if(exception == null){ 46 | System.out.println(recordMetadata.topic() + "+" + recordMetadata.partition() + "+" + recordMetadata.offset()); 47 | }else{ 48 | System.err.println(exception.getMessage()); 49 | } 50 | }); 51 | 52 | //send data - synchronous 53 | //without callback 54 | //producer.send(record).get(); //.get() make it synchronous call 55 | 56 | //flush data 57 | producer.flush(); 58 | 59 | //flush and close producer 60 | producer.close(); 61 | } 62 | } 63 | -------------------------------------------------------------------------------- /Kafka Producer/CustomKafkaProducer.Java: -------------------------------------------------------------------------------- 1 | import io.confluent.common.utils.MockTime; 2 | import io.confluent.common.utils.Time; 3 | import kafka.admin.AdminUtils; 4 | import kafka.admin.RackAwareMode; 5 | import kafka.server.KafkaConfig; 6 | import kafka.server.KafkaServer; 7 | import kafka.utils.TestUtils; 8 | import kafka.utils.ZKStringSerializer$; 9 | import 
kafka.utils.ZkUtils; 10 | import kafka.zk.EmbeddedZookeeper; 11 | import org.I0Itec.zkclient.ZkClient; 12 | import org.junit.Test; 13 | 14 | import java.util.Properties; 15 | 16 | public class CustomKafkaProducerTest { 17 | 18 | private static final String ZKHOST = "127.0.0.1"; 19 | private static final String BROKERHOST = "127.0.0.1"; 20 | private static final String BROKERPORT = "9092"; 21 | private static final String TOPIC = "test"; 22 | 23 | @Test 24 | public void testProducer(){ 25 | /* EmbeddedZookeeper zkServer = new EmbeddedZookeeper(); 26 | String zkConnect = ZKHOST + ":" + zkServer.port(); 27 | ZkClient zkClient = new ZkClient(zkConnect, 30000, 30000, ZKStringSerializer$.MODULE$); 28 | ZkUtils zkUtils = ZkUtils.apply(zkClient, false); 29 | // setup Broker 30 | Properties brokerProps = new Properties(); 31 | brokerProps.setProperty("zookeeper.connect", zkConnect); 32 | brokerProps.setProperty("broker.id", "0"); 33 | // brokerProps.setProperty("log.dirs", Files.createTempDirectory("kafka-").toAbsolutePath().toString()); 34 | brokerProps.setProperty("listeners", "PLAINTEXT://" + BROKERHOST +":" + BROKERPORT); 35 | brokerProps.setProperty("offsets.topic.replication.factor" , "1"); 36 | KafkaConfig config = new KafkaConfig(brokerProps); 37 | *//* Time mock = new MockTime(); 38 | KafkaServer kafkaServer = TestUtils.createServer(config, mock);*//* 39 | // create topic 40 | AdminUtils.createTopic(zkUtils, TOPIC, 1, 1, new Properties(), RackAwareMode.Disabled$.MODULE$);*/ 41 | } 42 | 43 | } 44 | -------------------------------------------------------------------------------- /Kafka Producer/KafkaProducerMain.java: -------------------------------------------------------------------------------- 1 | import org.slf4j.Logger; 2 | import org.slf4j.LoggerFactory; 3 | 4 | public class KafkaProducerMain { 5 | 6 | private static Logger LOGGER = LoggerFactory.getLogger(KafkaProducerMain.class); 7 | 8 | public static void main(String[] args) throws Exception { 9 | LOGGER.info("Running kafka producer..."); 10 | CustomKafkaProducer customKafkaProducer = new CustomKafkaProducer(); 11 | customKafkaProducer.produce("sample-topic","1", "This is a sample message"); 12 | LOGGER.info("Running kafka producer completed!! "); 13 | } 14 | -------------------------------------------------------------------------------- /Kafka Producer/README.md: -------------------------------------------------------------------------------- 1 | Kafka producer 2 | Key points 3 | **What happens after producerRecord is sent? 4 | 5 | Step #1: Once we send the ProducerRecord, the first thing the producer will do is serialize the key and value objects to ByteArrays so they can be sent over the network 6 | Step #2: Data is sent to partitioner, If we specified a partition in the ProducerRecord, the partitioner doesn’t do anything and simply returns the partition we specified. 7 | If we didn’t, the partitioner will choose a partition for us, usually based on the ProducerRecord key. 8 | Step #3: Adds the record to a batch of records that will also be sent to the same topic and partition 9 | Step #4: A separate thread is responsible for sending those batches of records to the appropriate Kafka brokers. 10 | Step #5: Broker receives the messages, it sends back a response. 11 | Successful : it will return a RecordMetadata object with the topic, partition, and the offset of the record within the partition. 12 | Failed: it will return a error code, producer may retry sending few more times before giving up and returning an error. 
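Step #2 above mentions the partitioner. If the default hash-based placement is not what you want, a custom Partitioner can be plugged in through partitioner.class (ProducerConfig.PARTITIONER_CLASS_CONFIG). A minimal sketch follows; the class name and the "vip-customer" key are made up for illustration, and it assumes the topic has more than one partition.

import java.util.Map;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class VipCustomerPartitioner implements Partitioner {

    @Override
    public void configure(Map<String, ?> configs) { }

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionCountForTopic(topic);
        if (keyBytes == null) {
            throw new IllegalArgumentException("This partitioner expects every record to have a key");
        }
        if ("vip-customer".equals(key)) {
            return numPartitions - 1; // reserve the last partition for this one key
        }
        // Hash all other keys over the remaining partitions.
        return (key.hashCode() & 0x7fffffff) % (numPartitions - 1);
    }

    @Override
    public void close() { }
}

It would be enabled on the producer with properties.setProperty(ProducerConfig.PARTITIONER_CLASS_CONFIG, VipCustomerPartitioner.class.getName());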
13 | When will a produced message will be ready to consume? 14 | 15 | Messages written to the partition leader are not immediately readable by consumers regardless of the producer's acknowledgement settings. 16 | When all in-sync replicas have acknowledged the write, then the message is considered committed, which makes it available for reading. 17 | This ensures that messages cannot be lost by a broker failure after they have already been read. 18 | What are the producer key configuration? 19 | 20 | key configurations are explined here : https://docs.confluent.io/current/clients/producer.html 21 | key.serializer 22 | 23 | producer interface allows user to provide the key.serializer. 24 | Avilable serializers: ByteArraySerializer, StringSerializer and IntegerSerializer. 25 | setting key.serializer is required even if you intend to send only values. 26 | value.serializer 27 | 28 | producer interface allows user to provide the key.serializer. like the same way as key.serializer. 29 | retries 30 | 31 | Setting a value greater than zero will cause the client to resend any record whose send fails with a potentially transient error. Note that this retry is no different than if the client resent the record upon receiving the error. Allowing retries without setting max.in.flight.requests.per.connection to 1 will potentially change the ordering of records because if two batches are sent to a single partition, and the first fails and is retried but the second succeeds, then the records in the second batch may appear first. Note additionall that produce requests will be failed before the number of retries has been exhausted if the timeout configured by delivery.timeout.ms expires first before successful acknowledgement. Users should generally prefer to leave this config unset and instead use delivery.timeout.ms to control retry behavior. 32 | default: 2147483647 33 | By default, the producer will wait 100ms between retries, but you can control this using the retry.backoff.ms parameter. 34 | acks 35 | 36 | If acks=0, the producer will not wait for a reply from the broker before assuming the message was sent successfully. 37 | If acks=1, the producer will receive a success response from the broker the moment the leader replica received the message 38 | If the client uses callbacks, latency will be hidden, but throughput will be limited by the number of in-flight messages (i.e., how many messages the producer will send before receiving replies from the server). 39 | If acks=all, the producer will receive a success response from the broker once all in-sync replicas received the message. 40 | if acks is set to all, the request will be stored in a buffer called purgatory until the leader observes that the follower replicas replicated the message, at which point a response is sent to the client 41 | buffer.memory 42 | 43 | This sets the amount of memory the producer will use to buffer messages waiting to be sent to brokers. 44 | If messages are sent by the application faster than they can be delivered to the server, the producer may run out of space and additional send() calls will either block or throw an exception, based on the block.on.buffer.full parameter (replaced with max.block.ms in release 0.9.0.0, which allows blocking for a certain time and then throwing an exception). 45 | batch.size 46 | 47 | When multiple records are sent to the same partition, the producer will batch them together. 48 | This parameter controls the amount of memory in bytes (not messages!) that will be used for each batch. 
49 | When the batch is full, all the messages in the batch will be sent. However, this does not mean that the producer will wait for the batch to become full. 50 | The producer will send half-full batches and even batches with just a single message in them. 51 | Therefore, setting the batch size too large will not cause delays in sending messages; it will just use more memory for the batches. 52 | Setting the batch size too small will add some overhead because the producer will need to send messages more frequently. 53 | linger.ms 54 | 55 | linger.ms controls the amount of time to wait for additional messages before sending the current batch. 56 | KafkaProducer sends a batch of messages either when the current batch is full or when the linger.ms limit is reached. 57 | By default, the producer will send messages as soon as there is a sender thread available to send them, even if there’s just one message in the batch. 58 | By setting linger.ms higher than 0, we instruct the producer to wait a few milliseconds to add additional messages to the batch before sending it to the brokers. 59 | This increases latency but also increases throughput (because we send more messages at once, there is less overhead per message). 60 | compression.type 61 | 62 | By default, messages are sent uncompressed 63 | supported compression types: snappy, gzip and lz4. 64 | snappy is recommended, with low CPU and good performance and decent comression ratio. 65 | Gzip use more CPU and time, but result in better compression ratio. 66 | max.in.flight.requests.per.connection 67 | 68 | This controls how many messages the producer will send to the server without receiving responses. 69 | Setting this high can increase memory usage while improving throughput, but setting it too high can reduce throughput as batching becomes less efficient. 70 | Setting this to 1 will guarantee that messages will be written to the broker in the order in which they were sent, even when retries occur. 71 | Methods of sending messages 72 | 73 | * Fire-and-forget : 74 | We many loose data in this situation. Possbile cases of lossing data : SerializationException when it fails to serialize the message, a BufferExhaustedException or TimeoutException if the buffer is full, or an InterruptException if the sending thread was interrupted. 75 | not recommended for production use. 76 | * Synchronous send 77 | We user Future.get() to wait for a reply from Kafka 78 | * Asynchronous send 79 | We call the send() method with a callback function, which gets triggered when it receives a response from the Kafka broker. 80 | Type of errors 81 | 82 | Retriable: errors are those that can be resolved by sending the message again. 83 | For example, a connection error can be resolved because the connection may get reestablished. 84 | A “no leader” error can be resolved when a new leader is elected for the partition. 85 | KafkaProducer can be configured to retry those errors automatically. 86 | non-retriable: For example, “message size too large.” In those cases, KafkaProducer will not attempt a retry and will return the exception immediately. 87 | Explain bootstrap.servers 88 | 89 | bootstrap.servers property so that the producer can find the Kafka cluster. 90 | Explain client.id 91 | 92 | Although not required, you should always set a client.id since this allows you to easily correlate requests on the broker with the client instance which made it. 93 | Where can i find the full list of configs documentation? 
94 | 95 | https://docs.confluent.io/current/installation/configuration/producer-configs.html#cp-config-producer 96 | unclean.leader.election.enable 97 | 98 | default value is true, this will allow out of sync replicas to become leaders. 99 | This should be disable in critical applications like banking system managing transactions. 100 | min.insync.replicas 101 | 102 | This will ensure the minumm number of replicas are in sync. 103 | NotEnoughReplicasException will be throw to producer when the in sync replicas are less then what is configured. 104 | -------------------------------------------------------------------------------- /Kafka Producer/log4j.properties: -------------------------------------------------------------------------------- 1 | log4j.rootLogger=TRACE, stdout 2 | log4j.appender.stdout=org.apache.log4j.ConsoleAppender 3 | log4j.appender.stdout.Target=System.out 4 | log4j.appender.stdout.layout=org.apache.log4j.PatternLayout 5 | log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd'T'HH:mm:ss.SSS} %-5p [%c] - %m%n 6 | -------------------------------------------------------------------------------- /Kafka-CLI/commands.txt: -------------------------------------------------------------------------------- 1 | KAFKA CLI 2 | ① Start a zookeeper at default port 2181 3 | 4 | $bin/zookeeper-server-start.sh config/zookeeper.properties 5 | ② Start a kafka server at default port 9092 6 | 7 | $bin/kafka-server-start.sh config/server.properties 8 | ③ Create a kafka topic ‘my-first-topic’ with 3 partitions and 3 replicas 9 | 10 | $bin/kafka-topics.sh --zookeeper localhost:2181 --topic my-first-topic --create --replication-factor 3 --partitions 3 11 | ④ List all kafka topics 12 | 13 | $bin/kafka-topics.sh --zookeeper localhost:2181 --list 14 | ⑤ Describe kafka topic ‘my-first-topic’ 15 | 16 | $bin/kafka-topics.sh --zookeeper localhost:2181 --topic my-first-topic --describe 17 | ⑥ Delete kafka topic ‘my-first-topic’ 18 | 19 | $bin/kafka-topics.sh --zookeeper localhost:2181 --topic my-first-topic --delete 20 | Note: This will have no impact if delete.topic.enable is not set to true 21 | 22 | ⑦ Find out all the partitions without a leader 23 | 24 | $bin/kafka-topics.sh --zookeeper localhost:2181 --describe --unavailable-partitions 25 | ⑧ Produce messages to Kafka topic my-first-topic 26 | 27 | $bin/kafka-console-producer.sh --broker-list localhost:9092 --topic my-first-topic --producer-property acks=all 28 | > message 1 29 | > message 2 30 | > ^C 31 | ⑨ Start Consuming messages from kafka topic my-first-topic 32 | 33 | $bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic my-first-topic --from-beginning 34 | > message 1 35 | > message 2 36 | ⑩ Start Consuming messages in a consumer group from kafka topic my-first-topic 37 | 38 | $bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic my-first-topic --group my-first-consumer-group --from-beginning 39 | ⑪ List all consumer groups 40 | 41 | $bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list 42 | ⑫ Describe consumer group 43 | 44 | $bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe -group my-first-consumer-group 45 | ⑬ Reset offset of consumer group to replay all messages 46 | 47 | $bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe -group my-first-consumer-group --reset-offsets --to-earliest --execute --topic my-first-topic 48 | ⑭ Shift offsets by 2 (forward) as another strategy 49 | 50 | bin/kafka-consumer-groups --bootstrap-server 
localhost:9092 --group my-first-consumer-group --reset-offsets --shift-by 2 --execute --topic my-first_topic 51 | ⑮ Shift offsets by 2 (backward) as another strategy 52 | 53 | bin/kafka-consumer-groups --bootstrap-server localhost:9092 --group my-first-consumer-group --reset-offsets --shift-by -2 --execute --t 54 | -------------------------------------------------------------------------------- /Kafka-Core/Kafka-Core.txt: -------------------------------------------------------------------------------- 1 | Kafka Core 2 | How produced messages are sent to kafka server? 3 | 4 | What is a batch? 5 | 6 | A batch is just a collection of messages, all of which are being produced to the same topic and partition. 7 | Batches are also typically compressed, providing more efficient data transfer and storage at the cost of some processing power 8 | What is the role of a controller? 9 | 10 | Within a cluster of brokers, one broker will also function as the cluster controller (elected automatically from the live members of the cluster). 11 | The controller is responsible for administrative operations, including assigning partitions to brokers and monitoring for broker failures. 12 | What is a leader? 13 | 14 | A partition is owned by a single broker in the cluster, and that broker is called the leader of the partition. 15 | A partition may be assigned to multiple brokers, which will result in the partition being replicated (as seen in Figure 1-7). 16 | This provides redundancy of messages in the partition, such that another broker can take over leadership if there is a broker failure 17 | Types of replica 18 | 19 | Leader replica 20 | follower replica 21 | prefered replica 22 | insync replicas Only in-sync replicas are eligible to be elected as partition leaders in case the existing leader fails. 23 | out of sync : If a replica hasn’t requested a message in more than 10 seconds or if it has requested messages but hasn’t caught up to the most recent message in more than 10 seconds, the replica is considered out of sync. 24 | replica.lag.time.max.ms: The amount of time a follower can be inactive or behind before it is considered out of sync. 25 | By default, Kafka is configured with auto.leader.rebalance.enable=true, which will check if the preferred leader replica is not the current leader but is in-sync and trigger leader election to make the preferred leader the current leader. 26 | What is retension? 27 | 28 | Kafka brokers are configured with a default retention setting for topics, either retaining messages for some period of time (e.g., 7 days) or until the topic reaches a certain size in bytes (e.g., 1 GB). 29 | Once these limits are reached, messages are expired and deleted so that the retention configuration is a minimum amount of data available at any time. 30 | What is log compaction? 31 | 32 | Topics can also be configured as log compacted, which means that Kafka will retain only the last message produced with a specific key. 33 | This can be useful for changelog-type data, where only the last update is interesting. 34 | Zookeeper 35 | What is the role of zookeeper in kafka 36 | 37 | zookeeper maintains the list of brokers, brokers info is stored in zookeeper under /brokers/ids 38 | *Emphemeral node 39 | 40 | Every time a broker process starts, it registers itself with its ID in Zookeeper by creating an ephemeral node. 41 | When a broker loses connectivity to Zookeeper, the ephemeral node that the broker created when starting will be automatically removed from Zookeeper. 
42 | Even though the node representing the broker is gone when the broker is stopped, the broker ID still exists in other data structures. For example, the list of replicas of each topic contains the broker IDs for the replica. 43 | This way, if you completely lose a broker and start a brand new broker with the ID of the old one, it will immediately join the cluster in place of the missing broker with the same partitions and topics assigned to it. 44 | What is ensemble? 45 | 46 | A Zookeeper cluster is called an ensemble. 47 | Due to the algorithm used, it is recommended that ensembles contain an odd number of servers (e.g., 3, 5, etc.) as a majority of ensemble members (a quorum) must be working in order for Zookeeper to respond to requests. 48 | This means that in a three-node ensemble, you can run with one node missing. 49 | With a five-node ensemble, you can run with two nodes missing. 50 | Ensemble configuration? 51 | 52 | Example: 53 | tickTime=2000 54 | dataDir=/var/lib/zookeeper 55 | clientPort=2181 56 | initLimit=20 57 | syncLimit=5 58 | server.1=zoo1.example.com:2888:3888 59 | server.2=zoo2.example.com:2888:3888 60 | server.3=zoo3.example.com:2888:3888 61 | 62 | 63 | initLimit is the amount of time to allow followers to connect with a leader. 64 | SyncLimit The syncLimit value limits how out-of-sync followers can be with the leader 65 | clientPort 2181 66 | The servers are specified in the format server.X=hostname:peerPort:leaderPort, with 67 | the following parameters: 68 | X The ID number of the server. This must be an integer, but it does not need to be zero-based or sequential. 69 | hostname The hostname or IP address of the server. 70 | peerPort The TCP port over which servers in the ensemble communicate with each other. (default port: 2888) 71 | leaderPort The TCP port over which leader election is performed. (default port: 3888) 72 | Kafka Broker 73 | broker.id 74 | 75 | Every Kafka broker must have an integer identifier, which is set using the broker.id configuration. 76 | By default, this integer is set to 0, but it can be any value. 77 | The most important thing is that the integer must be unique within a single Kafka cluster. 78 | The selection of this number is arbitrary, and it can be moved between brokers if necessary for maintenance tasks. 79 | A good guideline is to set this value to something intrinsic to the host so that when performing maintenance it is not onerous to map broker ID numbers to hosts. 80 | For example, if your hostnames contain a unique number (such as host1.example.com, host2.example.com, etc.), that is a good choice for the broker.id value. 81 | port 82 | 83 | The example configuration file starts Kafka with a listener on TCP port 9092. 84 | zookeeper.connect 85 | 86 | The location of the Zookeeper used for storing the broker metadata is set using the zookeeper. 87 | Connect configuration parameter. The example configuration uses a Zookeeper running on port 2181 on the local host, which is specified as localhost:2181. 88 | log.dirs 89 | 90 | Kafka persists all messages to disk, and these log segments are stored in the directories specified in the log.dirs configuration. 91 | This is a comma-separated list of paths on the local system. 92 | If more than one path is specified, the broker will store partitions on them in a “least-used” fashion with one partition’s log segments stored within the same path. 
93 | Note that the broker will place a new partition in the path that has the least number of partitions currently stored in it, not the least amount of disk space used. 94 | num.recovery.threads.per.data.dir 95 | 96 | Kafka uses a configurable pool of threads for handling log segments. Currently, this thread pool is used: 97 | • When starting normally, to open each partition’s log segments 98 | • When starting after a failure, to check and truncate each partition’s log segments 99 | • When shutting down, to cleanly close log segments 100 | By default, only one thread per log directory is used. 101 | As these threads are only used during startup and shutdown, it is reasonable to set a larger number of threads in order to parallelize operations. 102 | Specifically, when recovering from an unclean shutdown, this can mean the difference of several hours when restarting a broker with a large number of partitions! When setting this parameter, remember that the number configured is per log directory specified with log.dirs. 103 | This means that if num.recovery.threads.per.data.dir is set to 8, and there are 3 paths specified in log.dirs, this is a total of 24 threads. 104 | auto.create.topics.enable 105 | 106 | The default Kafka configuration specifies that the broker should automatically create a topic under the following circumstances: 107 | • When a producer starts writing messages to the topic 108 | • When a consumer starts reading messages from the topic 109 | • When any client requests metadata for the topic 110 | In many situations, this can be undesirable behavior, especially as there is no way to validate the existence of a topic through the Kafka protocol without causing it to be created. 111 | If you are managing topic creation explicitly, whether manually or through a provisioning system, you can set the auto.create.topics.enable configuration to false. 112 | num.partitions 113 | 114 | default is 1 115 | log.retention.ms 116 | 117 | The most common configuration for how long Kafka will retain messages is by time. 118 | The default is specified in the configuration file using the log.retention.hours 119 | parameter, and it is set to 168 hours, or one week. However, there are two other 120 | parameters allowed, log.retention.minutes and log.retention.ms. All three of 121 | these specify the same configuration—the amount of time after which messages may 122 | be deleted—but the recommended parameter to use is log.retention.ms, as the 123 | smaller unit size will take precedence if more than one is specified. This will make 124 | sure that the value set for log.retention.ms is always the one used. If more than one 125 | is specified, the smaller unit size will take precedence. 126 | log.retention.bytes 127 | 128 | This property will define the amount of data retainied per partition, if the topic has 8 partition then data retained per topic would be 8GB. 129 | if both the log.retention.ms and log.retention.bytes are configured, messages may be removed when either criteria is met. 130 | log.segment.bytes 131 | 132 | Once the log segment has reached the size specified by the log.segment.bytes parameter, which defaults to 1 GB, the log segment is closed and a new one is opened. 133 | Once a log segment has been closed, it can be considered for expiration. 134 | log.segment.ms 135 | 136 | Specifies the amount of time after which a log segment should be closed. 137 | Kafka will close a log segment either when the size limit is reached or when the time limit is reached, whichever comes first. 
138 | By default, there is no setting for log.segment.ms, which results in only closing log segments by size. 139 | message.max.bytes 140 | 141 | message.max.bytes parameter, which defaults to 1000000, or 1 MB. 142 | This configuration deals with compressed message size, actual uncompressed message can be larger then it. 143 | 144 | There are noticeable performance impacts from increasing the allowable message size. 145 | Larger messages will mean that the broker threads that deal with processing network connections and requests will be working longer on each request. 146 | Larger messages also increase the size of disk writes, which will impact I/O throughput. 147 | fetch.message.max.bytes 148 | 149 | If this value is smaller than message.max.bytes, then consumers that encounter larger messages will fail to fetch those messages, resulting in a situation where the consumer gets stuck and cannot proceed. 150 | Memory 151 | 152 | Most of the cases, consumer is caught up and lagging behind the producers very little. consumer will read are from system page cache resulting in faster reads. 153 | Having more memory avilable to the system for page cache will improve the performance of consumer clients. 154 | Kafka itself does not need much heap memory configured for the Java Virtual Machine (JVM). Even a broker that is handling X messages per second and a data rate of X megabits per second can run with a 5 GB heap. 155 | CPU CPU power is mainly utilized for compressing messages from disc and recompress the message batch in order to store in disc. 156 | 157 | Request processing: 158 | 159 | client request --> broker --> partitions leaders --> reponse --> broker --> client All requests sent to the broker from a specific client will be processed in the order in which they were received—this guarantee is what allows Kafka to behave as a message queue and provide ordering guarantees on the messages it stores. 160 | 161 | Request header 162 | 163 | request type 164 | request version 165 | correlation id 166 | client id 167 | client cache topic metadata 168 | 169 | client request for the metadata (request type: metadata request, which includes a list of topics the client is interested in). 170 | metadata containts which partitions exist in the topics, the replicas for each partition, and which replica is the leader 171 | all brokers caches the metadata informations. 172 | metadata.max.age.ms defines the time to refresh the medadata in client. 173 | if a client receives "not a leader", it will refresh metadata before retrying. 174 | where does kafka writes the produced messages 175 | 176 | On Linux, the messages are written to the filesystem cache and there is no guarantee about when they will be written to disk. 177 | Kafka does not wait for the data to get persisted to disk—it relies on replication for message durability. 178 | what are segments 179 | 180 | partitions are further divided into segments, default size of segment is either 1 GB of data or a week of data. 181 | currently writting segments is called active segment. active segment will never be deleted even the retension is passed. 182 | Kafka broker will keep an open file handle to every segment in every partition—even inactive segments. This leads to an usually high number of open file handles, and the OS must be tuned accordingly. 
 183 | Additional message info 184 | 185 | Each message contains—in addition to its key, value, and offset—things like the message size, a checksum code that allows us to detect corruption, a magic byte that indicates the version of the message format, the compression codec (Snappy, GZip, or LZ4), and a timestamp (added in release 0.10.0). The timestamp is given either by the producer when the message was sent or by the broker when the message arrived—depending on configuration. 186 | Indexes 187 | 188 | Kafka maintains an index for each partition; the index maps offsets to segment files and to positions within the file. 189 | compaction 190 | 191 | Policies: delete --> deletes events older than the retention time. compact --> keeps only the most recent value for each key. 192 | 193 | How does compaction work? 194 | 195 | clean --> the portion of the log that has already been compacted. 196 | dirty --> the messages written after the last compaction. 197 | Deleted events --> the producer sends a message with the key set and a null value (a tombstone). 198 | 199 | The compact policy never compacts messages in the current (active) segment; only inactive segments are eligible. 200 | Where does Kafka store dynamic per-broker configurations? 201 | 202 | zookeeper. 203 | Where are dynamic cluster-wide default configs stored? 204 | 205 | zookeeper. 206 | -------------------------------------------------------------------------------- /Kafka-Streams/Stream-Notes.txt: -------------------------------------------------------------------------------- 1 | Kafka streams 2 | What is a topology? 3 | 4 | A topology represents the DAG of a stream processing application. 5 | what are kafka streams? 6 | 7 | Kafka Streams leverages the consumer and producer APIs, so all the properties applicable to consumers and producers are still applicable here. 8 | application.id is specific to stream applications and is used for 9 | the consumer group.id (= application.id) 10 | the default client.id prefix 11 | the prefix of the internal changelog topics 12 | Kafka Streams creates internal intermediate topics 13 | 14 | repartitioning topics 15 | changelog topics 16 | What are the types of stores in streams? 17 | 18 | 1. local or internal state 19 | 2. external state 20 | What is a state store? 21 | 22 | Some operations in streams (like avg, count, etc.) depend on previously calculated information, for which state needs to be stored. 23 | What happens to local state if a node goes down? 24 | 25 | Local state is stored in RocksDB and is also written to a Kafka changelog topic; in case of node failure, the state is recreated from Kafka. 26 | What are the types of windowing options in streams? 27 | 28 | 1. Tumbling window or sliding window 29 | 2. Hopping window 30 | How are out-of-order records handled in streams? 31 | 32 | Stream vs table? 33 | 34 | a stream represents all the events that occurred on a table. insert-only mode. 35 | a table represents the current state of the records. upsert mode. 
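To connect the stream/table distinction above to the API, here is a minimal Kafka Streams sketch (topic names, the application.id, and the broker address are made up for illustration). It reads a pageviews stream, aggregates it into a KTable backed by a local state store and changelog topic, and writes the table's updates back out as a stream.

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class PageViewCountApp {

    public static void main(String[] args) {
        Properties props = new Properties();
        // application.id doubles as the consumer group.id and as the prefix of internal topics
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "pageview-count-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Stream: every pageview event, insert-only.
        KStream<String, String> pageviews = builder.stream("pageviews");

        // Table: current count per key (upsert), backed by a local RocksDB store plus a changelog topic.
        KTable<String, Long> countsPerUser = pageviews
                .groupByKey()
                .count();

        // Emit each update of the table as a record on an output topic.
        countsPerUser.toStream()
                .to("pageview-counts", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}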
36 | -------------------------------------------------------------------------------- /Schema-Registry/Avro Schema/Customer.avro: -------------------------------------------------------------------------------- 1 | { 2 | "type": "record", 3 | "namespace": "com.example", 4 | "name": "Customer", 5 | "version": "1", 6 | "fields": [ 7 | { "name": "first_name", "type": "string", "doc": "First Name of Customer" }, 8 | { "name": "last_name", "type": "string", "doc": "Last Name of Customer" }, 9 | { "name": "age", "type": "int", "doc": "Age at the time of registration" }, 10 | { "name": "height", "type": "float", "doc": "Height at the time of registration in cm" }, 11 | { "name": "weight", "type": "float", "doc": "Weight at the time of registration in kg" }, 12 | { "name": "automated_email", "type": "boolean", "default": true, "doc": "Field indicating if the user is enrolled in marketing emails" } 13 | ] 14 | } 15 | -------------------------------------------------------------------------------- /Schema-Registry/CustomKafkaProducer.java: -------------------------------------------------------------------------------- 1 | import org.apache.kafka.clients.producer.ProducerConfig; 2 | import org.apache.kafka.clients.producer.ProducerRecord; 3 | import org.apache.kafka.clients.producer.RecordMetadata; 4 | import org.apache.kafka.common.serialization.StringSerializer; 5 | 6 | import java.util.Properties; 7 | import java.util.concurrent.Future; 8 | 9 | public class CustomKafkaProducer { 10 | 11 | org.apache.kafka.clients.producer.KafkaProducer producer; 12 | 13 | public CustomKafkaProducer() { 14 | super(); 15 | this.producer = new org.apache.kafka.clients.producer.KafkaProducer(getProperties()); 16 | } 17 | 18 | 19 | private Properties getProperties(){ 20 | Properties config = new Properties(); 21 | config.put("client.id", "sample-client-id"); 22 | config.put("bootstrap.servers", "13.233.142.162:9092,13.233.142.162:29092"); 23 | config.put("acks", "all"); 24 | config.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName()); 25 | config.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName()); 26 | return config; 27 | } 28 | 29 | public boolean produce(String topic, String key, String value){ 30 | final ProducerRecord record = new ProducerRecord(topic, key, value); 31 | Future future = producer.send(record); 32 | return true; 33 | } 34 | 35 | } 36 | -------------------------------------------------------------------------------- /Schema-Registry/KafkaProducerMain.java: -------------------------------------------------------------------------------- 1 | import com.example.Customer; 2 | import io.confluent.kafka.serializers.KafkaAvroSerializer; 3 | import org.apache.kafka.clients.producer.*; 4 | import org.apache.kafka.common.serialization.StringSerializer; 5 | import org.slf4j.Logger; 6 | import org.slf4j.LoggerFactory; 7 | 8 | import java.util.Properties; 9 | 10 | public class KafkaProducerMain { 11 | 12 | private static Logger LOGGER = LoggerFactory.getLogger(KafkaProducerMain.class); 13 | 14 | public static void main(String[] args) throws Exception { 15 | Properties properties = new Properties(); 16 | // normal producer 17 | properties.setProperty("bootstrap.servers", "13.233.161.161:29092"); 18 | properties.setProperty("acks", "all"); 19 | properties.setProperty("retries", "10"); 20 | // avro part 21 | properties.setProperty("key.serializer", StringSerializer.class.getName()); 22 | properties.setProperty("value.serializer", 
KafkaAvroSerializer.class.getName()); 23 | properties.setProperty("schema.registry.url", "http://13.233.161.161:8081"); 24 | 25 | Producer producer = new KafkaProducer(properties); 26 | 27 | String topic = "customer-avro"; 28 | 29 | // copied from avro examples 30 | Customer customer = Customer.newBuilder() 31 | .setAge(34) 32 | .setAutomatedEmail(false) 33 | .setFirstName("John") 34 | .setLastName("Doe") 35 | .setHeight(178f) 36 | .setWeight(75f) 37 | .build(); 38 | 39 | ProducerRecord producerRecord = new ProducerRecord( 40 | topic, customer 41 | ); 42 | 43 | System.out.println(customer); 44 | producer.send(producerRecord, new Callback() { 45 | public void onCompletion(RecordMetadata metadata, Exception exception) { 46 | if (exception == null) { 47 | System.out.println(metadata); 48 | } else { 49 | exception.printStackTrace(); 50 | } 51 | } 52 | }); 53 | 54 | producer.flush(); 55 | producer.close(); 56 | 57 | } 58 | } 59 | -------------------------------------------------------------------------------- /Schema-Registry/Schema-Registry.txt: -------------------------------------------------------------------------------- 1 | Kafka schema-registry 2 | How to connect and run schema registry commands? 3 | docker-commands 4 | docker exec -it schema-registry /bin/bash 5 | 6 | useful kafka commands 7 | kafka-avro-console-producer 8 | --broker-list broker:29092 --topic bar 9 | --property value.schema='{"type":"record","name":"myrecord","fields":[{"name":"f1","type":"string"}]}' 10 | 11 | sample mesages {"f1": "value1"} {"f2": "value2"} 12 | 13 | kafka-avro-console-consumer 14 | --topic customer-avro 15 | --bootstrap-server broker:29092 16 | --from-beginning 17 | 18 | Where does kafka stores schema's? 19 | 20 | Kafka stores schema infromation in _schema topic. 21 | 22 | What are the schema registry properties we need to provide in producer to register schema? 23 | 24 | props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); 25 | props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,io.confluent.kafka.serializers.KafkaAvroSerializer.class); 26 | props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, io.confluent.kafka.serializers.KafkaAvroSerializer.class); 27 | props.put("schema.registry.url", "http://localhost:8081"); 28 | How to registry schema for a key? 29 | 30 | props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,io.confluent.kafka.serializers.KafkaAvroSerializer.class); 31 | This will register schema for a key. 32 | What is a specificrecord or genericRecord? 33 | 34 | Specificrecord: 35 | It is a auto generated pojo class by maven plugin when a .avsc schema file is provided. 36 | This will throw an compilation error when a field is missing. 37 | GenericRecord: 38 | schema will be provided explicitly and field will be accessed either by name or index. 39 | This will throw an compilation error when a field is missing. 40 | Will kafka producer always register schema for each produced message? 41 | 42 | No. schema will be registered when the first message is published and it will be cached in a hashmap by KafkaAvroSerializer. 43 | Next time it will retrive it from cache. 44 | Will kafka consumer always fetch schema for each consumed message? 45 | 46 | No. schema will be registered when the first message is fetched from registry and it will be cached in a hashmap by KafkaAvroDeSerializer. 47 | Next time it will retrive it from cache. 48 | When deserialize a message if schema is not avilable in cache 49 | 50 | Schema will be fetched from schema registry. 51 | 52 | What is magicbyte? 
53 | 54 | magicbyte is a schema id got after schema is registered with schema registry. 55 | It will appened to the message and consumer will refer to this while decoding the message. 56 | What are type of compatibility schema registry supports and how? 57 | 58 | These are the compatibility types: BACKWARD: (default) consumers using the new schema can read data written by producers using the latest registered schema BACKWARD_TRANSITIVE: consumers using the new schema can read data written by producers using all previously registered schemas FORWARD: consumers using the latest registered schema can read data written by producers using the new schema FORWARD_TRANSITIVE: consumers using all previously registered schemas can read data written by producers using the new schema FULL: the new schema is forward and backward compatible with the latest registered schema FULL_TRANSITIVE: the new schema is forward and backward compatible with all previously registered schemas NONE: schema compatibility checks are disabled 59 | 60 | what is the default compatibility type 61 | 62 | BACKWARD 63 | will a custom pojo can be auto registered in register schema? 64 | 65 | No, this has to be handled explicitly and you need to modify your class similar to auto generate class by maven plugin. 66 | How to disable auto schema registry? 67 | 68 | client applications automatically register new schemas. This is very convenient in development environments, but in production environments we recommend that client applications do not automatically register new schemas. 69 | props.put(AbstractKafkaAvroSerDeConfig.AUTO_REGISTER_SCHEMAS, false); 70 | -------------------------------------------------------------------------------- /Schema-Registry/log4j.properties: -------------------------------------------------------------------------------- 1 | log4j.rootLogger=TRACE, stdout 2 | log4j.appender.stdout=org.apache.log4j.ConsoleAppender 3 | log4j.appender.stdout.Target=System.out 4 | log4j.appender.stdout.layout=org.apache.log4j.PatternLayout 5 | log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd'T'HH:mm:ss.SSS} %-5p [%c] - %m%n 6 | -------------------------------------------------------------------------------- /Set-UP README.md: -------------------------------------------------------------------------------- 1 | Setup & Commands 2 | How to run confluent kafka docker in AWS 3 | 1. Create an ec2 instance 4 | 2. install git 5 | 3. install docker 6 | 4. install docker-compose 7 | 5. git clone https://github.com/confluentinc/cp-docker-images 8 | 6. cd cp-docker-images 9 | 7. git checkout 5.2.1-post 10 | 8. cd examples/cp-all-in-one/ 11 | 9. docker-compose up -d --build 12 | Linux command 13 | # List topics docker-compose exec broker kafka-topics --zookeeper zookeeper:2181 --list 14 | # Create topic docker-compose exec broker kafka-topics --create --zookeeper \ 15 | zookeeper:2181 --replication-factor 1 --partitions 1 --topic users 16 | Docker command 17 | # docker command to run in bash or sh shell 18 | bash interactive mode - docker exec -it broker /bin/bash 19 | sh mode - docker-compose exec broker sh 20 | kafka-topics --zookeeper zookeeper:2181 --list 21 | -------------------------------------------------------------------------------- /azure-pipelines.yml: -------------------------------------------------------------------------------- 1 | # Starter pipeline 2 | # Start with a minimal pipeline that you can customize to build and deploy your code. 
3 | # Add steps that build, run tests, deploy, and more: 4 | # https://aka.ms/yaml 5 | 6 | trigger: 7 | - master 8 | 9 | pool: 10 | name: Default 11 | vmImage: vs2017-win2016 12 | 13 | steps: 14 | - script: echo Hello, world! 15 | displayName: 'Run a one-line script' 16 | 17 | - script: | 18 | echo Add other tasks to build, test, and deploy your project. 19 | echo See https://aka.ms/yaml 20 | displayName: 'Run a multi-line script' 21 | - task: WhiteSource@21 22 | inputs: 23 | cwd: '$(System.DefaultWorkingDirectory)' 24 | projectName: 'Kafka-Notes' 25 | --------------------------------------------------------------------------------