├── .gitignore ├── KAFKA-CORE-README.md ├── README.md ├── SETUP-COMMANDS-README.md ├── _config.yml ├── kafka-connectors └── README.md ├── kafka-consumer └── README.md ├── kafka-docker ├── Dockerfile ├── README.md └── docker-compose.yml ├── kafka-ksql └── README.md ├── kafka-producer ├── README.md ├── kafka-producer.iml ├── pom.xml └── src │ ├── main │ ├── java │ │ └── com │ │ │ └── varma │ │ │ └── kafka │ │ │ ├── CustomKafkaProducer.java │ │ │ └── KafkaProducerMain.java │ └── resources │ │ └── log4j.properties │ └── test │ └── java │ └── com │ └── varma │ └── kafka │ ├── CustomKafkaProducerTest.java │ └── ProducerTest.java ├── kafka-schema-registry ├── README.md ├── pom.xml └── src │ └── main │ ├── java │ └── com │ │ └── varma │ │ └── kafka │ │ ├── CustomKafkaProducer.java │ │ └── KafkaProducerMain.java │ └── resources │ ├── avro │ └── customer.avsc │ └── log4j.properties ├── kafka-streams ├── Processor-api-README.md ├── README.md └── kafka-DSL-README.md └── pom.xml /.gitignore: -------------------------------------------------------------------------------- 1 | # Created by .ignore support plugin (hsz.mobi) 2 | 3 | .idea 4 | **/target 5 | **/*.iml 6 | -------------------------------------------------------------------------------- /KAFKA-CORE-README.md: -------------------------------------------------------------------------------- 1 | # Kafka Core # 2 | 3 | **How produced messages are sent to kafka server?** 4 | 5 | **What is a batch?** 6 | 7 | A batch is just a collection of messages, all of which are being produced to the same topic and partition. 8 | Batches are also typically compressed, providing more efficient data transfer and storage at the cost of some processing power 9 | 10 | **What is the role of a controller?** 11 | 12 | Within a cluster of brokers, one broker will also function as the cluster controller (elected automatically from the live members of the cluster). 13 | The controller is responsible for administrative operations, including assigning partitions to brokers and monitoring for broker failures. 14 | 15 | **What is a leader?** 16 | 17 | A partition is owned by a single broker in the cluster, and that broker is called the leader of the partition. 18 | A partition may be assigned to multiple brokers, which will result in the partition being replicated (as seen in Figure 1-7). 19 | This provides redundancy of messages in the partition, such that another broker can take over leadership if there is a broker failure 20 | 21 | **Types of replica** 22 | 23 | Leader replica 24 | follower replica 25 | prefered replica 26 | insync replicas Only in-sync replicas are eligible to be elected as partition leaders in case the existing leader fails. 27 | out of sync : If a replica hasn’t requested a message in more than 10 seconds or if it has requested messages but hasn’t caught up to the most recent message in more than 10 seconds, the replica is considered out of sync. 28 | replica.lag.time.max.ms: The amount of time a follower can be inactive or behind before it is considered out of sync. 29 | By default, Kafka is configured with auto.leader.rebalance.enable=true, which will check if the preferred leader replica is not the current leader but is in-sync and trigger leader election to make the preferred leader the current leader. 30 | 31 | **What is retention?** 32 | 33 | Kafka brokers are configured with a default retention setting for topics, either retaining messages for some period of time (e.g., 7 days) or until the topic reaches a certain size in bytes (e.g., 1 GB). 
34 | Once these limits are reached, messages are expired and deleted so that the retention configuration is a minimum amount of data available at any time. 35 | 36 | **What is log compaction?** 37 | 38 | Topics can also be configured as log compacted, which means that Kafka will retain only the last message produced with a specific key. 39 | This can be useful for changelog-type data, where only the last update is interesting. 40 | 41 | ## Zookeeper ## 42 | 43 | **What is the role of zookeeper in kafka** 44 | 45 | zookeeper maintains the list of brokers, brokers info is stored in zookeeper under /brokers/ids 46 | 47 | ***Emphemeral node** 48 | 49 | Every time a broker process starts, it registers itself with its ID in Zookeeper by creating an ephemeral node. 50 | When a broker loses connectivity to Zookeeper, the ephemeral node that the broker created when starting will be automatically removed from Zookeeper. 51 | Even though the node representing the broker is gone when the broker is stopped, the broker ID still exists in other data structures. For example, the list of replicas of each topic contains the broker IDs for the replica. 52 | This way, if you completely lose a broker and start a brand new broker with the ID of the old one, it will immediately join the cluster in place of the missing broker with the same partitions and topics assigned to it. 53 | 54 | **What is ensemble?** 55 | 56 | A Zookeeper cluster is called an ensemble. 57 | Due to the algorithm used, it is recommended that ensembles contain an odd number of servers (e.g., 3, 5, etc.) as a majority of ensemble members (a quorum) must be working in order for Zookeeper to respond to requests. 58 | This means that in a three-node ensemble, you can run with one node missing. 59 | With a five-node ensemble, you can run with two nodes missing. 60 | 61 | 62 | **Ensemble configuration?** 63 | 64 | Example: 65 | tickTime=2000 66 | dataDir=/var/lib/zookeeper 67 | clientPort=2181 68 | initLimit=20 69 | syncLimit=5 70 | server.1=zoo1.example.com:2888:3888 71 | server.2=zoo2.example.com:2888:3888 72 | server.3=zoo3.example.com:2888:3888 73 | 74 | 75 | initLimit is the amount of time to allow followers to connect with a leader. 76 | SyncLimit The syncLimit value limits how out-of-sync followers can be with the leader 77 | clientPort 2181 78 | The servers are specified in the format server.X=hostname:peerPort:leaderPort, with 79 | the following parameters: 80 | X The ID number of the server. This must be an integer, but it does not need to be zero-based or sequential. 81 | hostname The hostname or IP address of the server. 82 | peerPort The TCP port over which servers in the ensemble communicate with each other. (default port: 2888) 83 | leaderPort The TCP port over which leader election is performed. (default port: 3888) 84 | 85 | 86 | ## Kafka Broker ## 87 | 88 | **broker.id** 89 | 90 | Every Kafka broker must have an integer identifier, which is set using the broker.id configuration. 91 | By default, this integer is set to 0, but it can be any value. 92 | The most important thing is that the integer must be unique within a single Kafka cluster. 93 | The selection of this number is arbitrary, and it can be moved between brokers if necessary for maintenance tasks. 94 | A good guideline is to set this value to something intrinsic to the host so that when performing maintenance it is not onerous to map broker ID numbers to hosts. 
95 | For example, if your hostnames contain a unique number (such as host1.example.com, host2.example.com, etc.), that is a good choice for the broker.id value. 96 | 97 | **port** 98 | 99 | The example configuration file starts Kafka with a listener on TCP port 9092. 100 | 101 | **zookeeper.connect** 102 | 103 | The location of the Zookeeper used for storing the broker metadata is set using the zookeeper. 104 | Connect configuration parameter. The example configuration uses a Zookeeper running on port 2181 on the local host, which is specified as localhost:2181. 105 | 106 | **log.dirs** 107 | 108 | Kafka persists all messages to disk, and these log segments are stored in the directories specified in the log.dirs configuration. 109 | This is a comma-separated list of paths on the local system. 110 | If more than one path is specified, the broker will store partitions on them in a “least-used” fashion with one partition’s log segments stored within the same path. 111 | Note that the broker will place a new partition in the path that has the least number of partitions currently stored in it, not the least amount of disk space used. 112 | 113 | **num.recovery.threads.per.data.dir** 114 | 115 | Kafka uses a configurable pool of threads for handling log segments. Currently, this thread pool is used: 116 | • When starting normally, to open each partition’s log segments 117 | • When starting after a failure, to check and truncate each partition’s log segments 118 | • When shutting down, to cleanly close log segments 119 | By default, only one thread per log directory is used. 120 | As these threads are only used during startup and shutdown, it is reasonable to set a larger number of threads in order to parallelize operations. 121 | Specifically, when recovering from an unclean shutdown, this can mean the difference of several hours when restarting a broker with a large number of partitions! When setting this parameter, remember that the number configured is per log directory specified with log.dirs. 122 | This means that if num.recovery.threads.per.data.dir is set to 8, and there are 3 paths specified in log.dirs, this is a total of 24 threads. 123 | 124 | **auto.create.topics.enable** 125 | 126 | The default Kafka configuration specifies that the broker should automatically create a topic under the following circumstances: 127 | • When a producer starts writing messages to the topic 128 | • When a consumer starts reading messages from the topic 129 | • When any client requests metadata for the topic 130 | In many situations, this can be undesirable behavior, especially as there is no way to validate the existence of a topic through the Kafka protocol without causing it to be created. 131 | If you are managing topic creation explicitly, whether manually or through a provisioning system, you can set the auto.create.topics.enable configuration to false. 132 | 133 | **num.partitions** 134 | 135 | default is 1 136 | 137 | **log.retention.ms** 138 | 139 | The most common configuration for how long Kafka will retain messages is by time. 140 | The default is specified in the configuration file using the log.retention.hours 141 | parameter, and it is set to 168 hours, or one week. However, there are two other 142 | parameters allowed, log.retention.minutes and log.retention.ms. 
All three of these specify the same configuration—the amount of time after which messages may be deleted—but the recommended parameter to use is log.retention.ms: if more than one is specified, the smaller unit size takes precedence, which makes sure that the value set for log.retention.ms is always the one used.

**log.retention.bytes**

This property defines the amount of data retained per partition; for example, with log.retention.bytes set to 1 GB and a topic with 8 partitions, up to 8 GB of data is retained for the topic.
If both log.retention.ms and log.retention.bytes are configured, messages may be removed when either criterion is met.

**log.segment.bytes**

Once the log segment has reached the size specified by the log.segment.bytes parameter, which defaults to 1 GB, the log segment is closed and a new one is opened.
Once a log segment has been closed, it can be considered for expiration.

**log.segment.ms**

Specifies the amount of time after which a log segment should be closed.
Kafka will close a log segment either when the size limit is reached or when the time limit is reached, whichever comes first.
By default, there is no setting for log.segment.ms, which results in closing log segments by size only.

**message.max.bytes**

The message.max.bytes parameter defaults to 1000000, or roughly 1 MB.
This limit applies to the compressed message size; the actual uncompressed message can be larger than that.

There are noticeable performance impacts from increasing the allowable message size.
Larger messages mean that the broker threads that deal with processing network connections and requests will be working longer on each request.
Larger messages also increase the size of disk writes, which will impact I/O throughput.

**fetch.message.max.bytes**

If this value is smaller than message.max.bytes, then consumers that encounter larger messages will fail to fetch those messages, resulting in a situation where the consumer gets stuck and cannot proceed.

**Memory**

In most cases the consumer is caught up, lagging only slightly behind the producers, so its reads are served from the system page cache, resulting in faster reads.
Having more memory available to the system for page cache will improve the performance of consumer clients.
Kafka itself does not need much heap memory configured for the Java Virtual Machine (JVM). Even a broker that is handling X messages per second and a data rate of X megabits per second can run with a 5 GB heap.

**CPU**

CPU is mainly used for decompressing incoming message batches (to validate them) and recompressing them before they are stored on disk.

**Request processing:**

client request --> broker --> partition leader --> response --> broker --> client
All requests sent to the broker from a specific client will be processed in the order in which they were received—this guarantee is what allows Kafka to behave as a message queue and provide ordering guarantees on the messages it stores.
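To make this metadata flow concrete, here is a minimal sketch (illustrative only, not part of this repo) that uses the Java AdminClient to inspect the partition leaders, replicas, and in-sync replicas that clients cache; the bootstrap address and topic name are assumptions.

```
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

import java.util.Collections;
import java.util.Properties;

public class DescribeTopicMetadata {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumed local broker address; adjust to your cluster.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Fetch the same kind of metadata clients cache: partitions, leaders, replicas, ISR.
            TopicDescription description = admin.describeTopics(Collections.singletonList("users"))
                    .all().get().get("users");
            for (TopicPartitionInfo p : description.partitions()) {
                System.out.printf("partition=%d leader=%s replicas=%s isr=%s%n",
                        p.partition(), p.leader(), p.replicas(), p.isr());
            }
        }
    }
}
```

This is the same information a producer or consumer refreshes when it receives a "not a leader" error or when metadata.max.age.ms expires.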
192 | 193 | **Request header** 194 | 195 | request type 196 | request version 197 | correlation id 198 | client id 199 | 200 | **client cache topic metadata** 201 | 202 | client request for the metadata (request type: metadata request, which includes a list of topics the client is interested in). 203 | metadata containts which partitions exist in the topics, the replicas for each partition, and which replica is the leader 204 | all brokers caches the metadata informations. 205 | metadata.max.age.ms defines the time to refresh the medadata in client. 206 | if a client receives "not a leader", it will refresh metadata before retrying. 207 | 208 | **where does kafka writes the produced messages** 209 | 210 | On Linux, the messages are written to the filesystem cache and there is no guarantee about when they will be written to disk. 211 | Kafka does not wait for the data to get persisted to disk—it relies on replication for message durability. 212 | 213 | **what are segments** 214 | 215 | partitions are further divided into segments, default size of segment is either 1 GB of data or a week of data. 216 | currently writting segments is called active segment. active segment will never be deleted even the retension is passed. 217 | Kafka broker will keep an open file handle to every segment in every partition—even inactive segments. This leads to an usually high number of open file handles, and the OS must be tuned accordingly. 218 | 219 | **message additional infor** 220 | 221 | Each message contains—in addition to its key, value, and offset—things like the message size, checksum code that allows us to detect corruption, magic byte that indicates the version of the message format, compression codec (Snappy, GZip, or LZ4), and a timestamp (added in release 0.10.0). The timestamp is given either by the producer when the message was sent or by the broker when the message arrived—depending on configuration. 222 | 223 | **Indexes** 224 | 225 | Kafka maintains indexes for each partition, indexes maps offsets to segment files and position within the file. 226 | 227 | **compaction** 228 | 229 | Policies: 230 | delete --> delete events older then retension time. 231 | compact --> keeps only the recent version of a particular key. 232 | 233 | **How compactions works?** 234 | 235 | clean 236 | dirty 237 | Deleted events --> producer will send a mesasge with key and value as null. 238 | 239 | Compact policy will never delete a compact messages in current segment. 240 | 241 | **Where does kafka stores dynamic per broker configurations?** 242 | 243 | zookeeper. 244 | 245 | **Where does dynamic cluster-wide default configs stored?** 246 | 247 | zookeeper. 248 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Confluent Certified Developer for Apache Kafka (CCDAK) 2 | 3 | This repos is to keep all the relevant informations related for confluent-kafka-certification (CCDAK). 4 | Note: This is work in progress, it will be improved overtime. 
5 | 6 | ## Contents ## 7 | 8 | | Description | Links | 9 | | ---------------- | --------------------------------------- | 10 | | setup & commands | [README.md](./SETUP-COMMANDS-README.md) | 11 | | kafka-core | [README.md](./KAFKA-CORE-README.md) | 12 | | Kafka producer | [README.md](kafka-producer/README.md) | 13 | | Kafka consummer | [README.md](kafka-consumer/README.md) | 14 | | Kakfa Connectors | [README.md](kafka-connectors/README.md) | 15 | | Kakfa schema-registry | [README.md](kafka-schema-registry/README.md) | 16 | | Kakfa Streams | [README.md](kafka-streams/README.md) | 17 | | Kakfa KSQL | [README.md](kafka-ksql/README.md) | 18 | | Kafka-DSL | | 19 | | Kakfa KSQL | | 20 | | Kafka-control-center | | 21 | | Kafka-security | | 22 | 23 | ## Links: ## 24 | 25 | * Kafka-rest [https://docs.confluent.io/current/kafka-rest/index.html](https://docs.confluent.io/current/kafka-rest/index.html) 26 | * kafka-rest [https://docs.confluent.io/current/kafka-rest/docs/index.html](https://docs.confluent.io/current/kafka-rest/docs/index.html) 27 | * kafka-rest [https://docs.confluent.io/current/kafka-rest/quickstart.html](https://docs.confluent.io/current/kafka-rest/quickstart.html) 28 | * schema registry [https://docs.confluent.io/current/schema-registry/schema_registry_tutorial.html](https://docs.confluent.io/current/schema-registry/schema_registry_tutorial.html) 29 | * Kafka-connectors [https://www.baeldung.com/kafka-connectors-guide](https://www.baeldung.com/kafka-connectors-guide) 30 | * Kafka-connect [https://data-flair.training/blogs/kafka-connect/amp/](https://data-flair.training/blogs/kafka-connect/amp/) 31 | * Kafka-connect-avro [https://docs.confluent.io/current/cp-docker-images/docs/tutorials/connect-avro-jdbc.html](https://docs.confluent.io/current/cp-docker-images/docs/tutorials/connect-avro-jdbc.html) 32 | * kafka-connect [https://www.confluent.io/blog/simplest-useful-kafka-connect-data-pipeline-world-thereabouts-part-1/](https://www.confluent.io/blog/simplest-useful-kafka-connect-data-pipeline-world-thereabouts-part-1/) 33 | * kafka-connect [https://docs.confluent.io/current/cp-docker-images/docs/tutorials/connect-avro-jdbc.html](https://docs.confluent.io/current/cp-docker-images/docs/tutorials/connect-avro-jdbc.html) 34 | * Kafka-streams [https://www.confluent.io/stream-processing-cookbook/](https://www.confluent.io/stream-processing-cookbook/) 35 | * Kafka-streams [https://docs.confluent.io/current/streams/index.html](https://docs.confluent.io/current/streams/index.html) 36 | * KSQL [https://docs.confluent.io/current/ksql/docs/faq.html](https://docs.confluent.io/current/ksql/docs/faq.html) 37 | * KSQL [https://docs.confluent.io/current/ksql/docs/tutorials/index.html](https://docs.confluent.io/current/ksql/docs/tutorials/index.html) 38 | * KSQL [https://www.michael-noll.com/blog/2018/04/05/of-stream-and-tables-in-kafka-and-stream-processing-part1/](https://www.michael-noll.com/blog/2018/04/05/of-stream-and-tables-in-kafka-and-stream-processing-part1/) 39 | * Kafka Security [https://medium.com/@stephane.maarek/introduction-to-apache-kafka-security-c8951d410adf](https://medium.com/@stephane.maarek/introduction-to-apache-kafka-security-c8951d410adf) 40 | * Kafka-control-center [https://docs.confluent.io/current/tutorials/cp-demo/docs/index.html](https://docs.confluent.io/current/tutorials/cp-demo/docs/index.html) 41 | 42 | ## Video tutorials: ## 43 | 44 | * Kafka schema registry and rest-proxy - https://youtu.be/5fjw62LGYNg 45 | * Avro: schema evolution - By Stephane Maarek - 
https://youtu.be/SZX9DM_gyOE 46 | * Avro producer - By Stephane Maarek https://youtu.be/_6HTHH1NCK0 47 | * Kafka Connect Architecture - By Stephane Maarek https://youtu.be/YOGN7qr2nSE 48 | * Kafka Connect Concepts - By Stephane Maarek https://youtu.be/BUv1IgWm-gQ 49 | * Kafka Connect Distributed architecture - By Stephane Maarek https://youtu.be/52HXoxthRs0 50 | * Kafka-connect https://www.youtube.com/playlist?list=PLt1SIbA8guutTlfh0J7bGboW_Iplm6O_B 51 | * KSQL https://www.youtube.com/watch?v=ExEWJVjj-RA&list=PLa7VYi0yPIH2eX8q3mPpZAn3qCS1eDX8W 52 | * kafka-streams explained by Neha https://www.youtube.com/watch?v=A9KQufewd-s&feature=youtu.be 53 | * kafka-streams https://www.youtube.com/watch?v=Z3JKCLG3VP4&list=PLa7VYi0yPIH1vDclVOB49xUruBAWkOCZD 54 | * kafka-streams https://youtu.be/Z3JKCLG3VP4 55 | * kafka-streams https://youtu.be/LxxeXI1mPKo 56 | * kafka-streams https://youtu.be/-y2ALVkU5Bc 57 | * kafka-streams - By Stephane Maarek https://youtu.be/wPw3tb_dl70 58 | * kafka-streams - By Tim berglund https://youtu.be/7JYEEx7SBuE 59 | * kafka-streams - By Tim berglund https://youtu.be/3kJgYIkAeHs 60 | 61 | ## Architecture and Advanced Concepts ## 62 | 63 | * Exactly once semantics : https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/ 64 | * Kafka partition https://medium.com/@anyili0928/what-i-have-learned-from-kafka-partition-assignment-strategy-799fdf15d3ab 65 | * kafka stateful DSL : https://kafka.apache.org/20/documentation/streams/developer-guide/dsl-api.html#stateful-transformations 66 | 67 | ## Real-time use-cases & Blog's ## 68 | 69 | https://www.confluent.io/stream-processing-cookbook/ 70 | https://www.confluent.io/blog/ 71 | 72 | ## Alternate links for preparation ## 73 | 74 | http://lahotisolutions.blogspot.com/2019/03/apache-kafka-notes.html 75 | https://www.quora.com/How-do-I-prepare-for-Kafka-certification-confluent-1 76 | 77 | ## Sources ## 78 | Content in this github is sourced from following locations 79 | 80 | * [Confluent documentation](https://docs.confluent.io/current/) 81 | * [Apache documentation](https://kafka.apache.org/documentation/) 82 | * Kafka definitive guide Book -------------------------------------------------------------------------------- /SETUP-COMMANDS-README.md: -------------------------------------------------------------------------------- 1 | # Setup & Commands # 2 | 3 | ## How to run confluent kafka docker in AWS ## 4 | ``` 5 | 1. Create an ec2 instance 6 | 2. install git 7 | 3. install docker 8 | 4. install docker-compose 9 | 5. git clone https://github.com/confluentinc/cp-docker-images 10 | 6. cd cp-docker-images 11 | 7. git checkout 5.2.1-post 12 | 8. cd examples/cp-all-in-one/ 13 | 9. 
docker-compose up -d --build
```
## Linux command
```
# List topics
docker-compose exec broker kafka-topics --zookeeper zookeeper:2181 --list

# Create a topic
docker-compose exec broker kafka-topics --create --zookeeper \
  zookeeper:2181 --replication-factor 1 --partitions 1 --topic users
```

## Docker command
```
# Commands to get a bash or sh shell in the broker container
# bash interactive mode
docker exec -it broker /bin/bash
# sh mode
docker-compose exec broker sh

kafka-topics --zookeeper zookeeper:2181 --list
```
--------------------------------------------------------------------------------
/_config.yml:
--------------------------------------------------------------------------------
theme: jekyll-theme-cayman
--------------------------------------------------------------------------------
/kafka-connectors/README.md:
--------------------------------------------------------------------------------
# Kafka connectors

## Video tutorials:
* https://www.youtube.com/playlist?list=PLt1SIbA8guutTlfh0J7bGboW_Iplm6O_B

**What is a worker?**

A worker is a single Java process; it can run in standalone or distributed mode.

**Modes to run the Kafka Connect server**

* Standalone mode
* Distributed mode - recommended for production use.

**What is standalone mode?**

* A single process runs your connectors and tasks.
* Configuration is bundled with your process.
* Very easy to get started with; useful for development and testing.
* Not fault tolerant, no scalability, hard to monitor.

**What is distributed mode?**

* Multiple workers run your connectors and tasks.
* Configuration is submitted using the REST API.
* Easy to scale and fault tolerant (work is rebalanced if a worker dies).
* Useful for production deployment of connectors.

**Java memory property to control the Java heap size**

export KAFKA_HEAP_OPTS="-Xms256M -Xmx2G"
https://stackoverflow.com/questions/50621962/how-to-set-kafka-connect-connector-and-tasks-jvm-heap-size

**What is a task, and how does a connector break its work into tasks?**

A task is the unit of parallelism in Kafka Connect: the connector decides how to split its copying job (for example, which tables or topic partitions each task handles) into at most tasks.max tasks, and the workers run those tasks.

**What is the tasks.max property?**

Defines the maximum number of tasks to run in parallel.
For a source connector it is usually kept at 1 (one).
For sink connectors it is usually set higher (typically up to the number of partitions being consumed).

--------------------------------------------------------------------------------
/kafka-consumer/README.md:
--------------------------------------------------------------------------------
# Kafka Consumer

**What is a consumer group?**

A consumer group is a set of consumers that cooperate to consume a topic; each partition of the topic is consumed by exactly one consumer in the group.

**What is a consumer?**

When multiple consumers are subscribed to a topic and belong to the same consumer group, each consumer in the group will receive messages from a different subset of the partitions in the topic.
If we add more consumers to a single group with a single topic than we have partitions, some of the consumers will be idle and get no messages at all.
(A minimal consumer sketch appears after the rebalance question below.)

**What is rebalance?**

Moving partition ownership from one consumer to another is called a rebalance.
During a rebalance, consumers can’t consume messages, so a rebalance is basically a short window of unavailability of the entire consumer group.
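The sketch below ties these ideas together: every instance started with the same group.id joins the group and receives a share of the topic's partitions, and starting or stopping instances triggers a rebalance. The group id and broker address are assumptions for illustration; the topic name follows the subscribe example used later in this README.

```
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class GroupConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");       // assumed address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "customer-countries-group");      // instances sharing this id split the partitions
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("customerCountries"));
            while (true) {
                // Each instance receives records only from its assigned partitions;
                // poll(Duration) assumes a recent kafka-clients version (2.0+).
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```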
16 | 17 | **What is group coordinator?** 18 | 19 | The way consumers maintain membership in a consumer group and ownership of the partitions assigned to them is by sending heartbeats to a Kafka broker designated as the group coordinator. 20 | Heartbeats are sent when the consumer polls (i.e., retrieves records) and when it commits records it has consumed. 21 | 22 | **what are the consumer mandatory properties** 23 | 24 | bootstrap.servers, key.deserializer and value.deserializer. 25 | group.id is not mandatory, but in most of the situations it will be populated. 26 | 27 | **subscribing to topics** 28 | 29 | subscribe method will take list of topics as parameters like below 30 | consumer.subscribe(Collections.singletonList("customerCountries")); 31 | It is also possible to call subscribe with a regular expression. 32 | consumer.subscribe("test.*"); 33 | 34 | **poll loops** 35 | 36 | poll loop handles all details of coordination, partition rebalances, heartbeats, and data fetching, leaving the developer with a clean API that simply returns available data from the assigned partitions. 37 | The parameter we pass, poll(), is a timeout interval and controls how long poll() will block if data is not available in the consumer buffer. If this is set to 0, poll() will return immediately; otherwise, it will wait for the specified number of milliseconds for data to arrive from the broker. 38 | consumer.poll(100); 39 | poll() returns a list of records. Each record contains the topic and partition the record came from, the offset of the record within the partition, and of course the key and the value of the record. 40 | Always close() the consumer before exiting. This will close the network connections and sockets. 41 | The poll loop does a lot more than just get data. The first time you call poll() with a new consumer, it is responsible for finding the GroupCoordinator, joining the consumer group, and receiving a partition assignment. If a rebalance is triggered, it will be handled inside the poll loop as well. And of course the heartbeats that keep consumers alive are sent from within the poll loop. 42 | 43 | 44 | **fetch.min.bytes** 45 | 46 | This property allows a consumer to specify the minimum amount of data that it wants to receive from the broker when fetching records. 47 | 48 | **fetch.max.wait.ms** 49 | 50 | By setting fetch.min.bytes, you tell Kafka to wait until it has enough data to send before responding to the consumer. 51 | default value is 500 ms. 52 | 53 | **max.partition.fetch.bytes** 54 | 55 | This property controls the maximum number of bytes the server will return per partition. 56 | 57 | **session.timeout.ms** 58 | 59 | The amount of time a consumer can be out of contact with the broker while still considered alive defaults to 3 seconds. 60 | heatbeat.interval.ms must be lower than session.timeout.ms, and is usually set to one-third of the timeout value. So if session.timeout.ms is 3 seconds, heartbeat.interval.ms should be 1 second. 61 | 62 | **auto.offset.reset** 63 | 64 | latest or earliest 65 | 66 | **enable.auto.commit** 67 | 68 | default is true. 69 | If you set enable.auto.commit to true, then you might also want to control how frequently offsets will be committed using auto.commit.interval.ms. 70 | 71 | **What are the various partitions assignment strategies?** 72 | 73 | org.apache.kafka.clients.consumer.RangeAssignor -> This is default option 74 | org.apache.kafka.clients.consumer.RoundRobinAssignor -> recommend when multi topics are consumered together. 
More details are here: https://medium.com/@anyili0928/what-i-have-learned-from-kafka-partition-assignment-strategy-799fdf15d3ab

**partition.assignment.strategy**

* Range: the default
* RoundRobin

**client.id**

This can be any string, and will be used by the brokers to identify messages sent from the client. It is used in logging and metrics, and for quotas.

**max.poll.records**

This controls the maximum number of records that a single call to poll() will return. This is useful to help control the amount of data your application will need to process in the polling loop.

**receive.buffer.bytes and send.buffer.bytes**

These are the sizes of the TCP send and receive buffers used by the sockets when writing and reading data. If these are set to -1, the OS defaults will be used. It can be a good idea to increase those when producers or consumers communicate with brokers in a different datacenter, because those network links typically have higher latency and lower bandwidth.

**commits and offsets**

Kafka stores the offsets in the __consumer_offsets topic.
commitSync() will retry until it either succeeds or encounters a non-retriable failure; commitAsync() will not retry.

**commitAsync()**

If we are using commitAsync() and handling retries in our own code, then we need to be careful about the order of commits, because a retried commit of an older offset can overwrite a newer one.

**combining synchronous and asynchronous commits**

If we know that this is the last commit before we close the consumer, or before a rebalance, we want to make extra sure that the commit succeeds.
Therefore, a common pattern is to combine commitAsync() with commitSync() just before shutdown.

**Rebalance listeners**

ConsumerRebalanceListener has two methods: onPartitionsRevoked(Collection<TopicPartition> partitions) and onPartitionsAssigned(Collection<TopicPartition> partitions).
111 | consumer.subscribe(topics, new HandleRebalanceCsutomListener()); 112 | 113 | -------------------------------------------------------------------------------- /kafka-docker/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM confluentinc/cp-kafka-connect:5.2.1 2 | 3 | ENV CONNECT_PLUGIN_PATH="/usr/share/java,/usr/share/confluent-hub-components" 4 | 5 | RUN confluent-hub install --no-prompt confluentinc/kafka-connect-datagen:latest -------------------------------------------------------------------------------- /kafka-docker/README.md: -------------------------------------------------------------------------------- 1 | # Kafka Docker 2 | 3 | ## Source ## 4 | https://docs.confluent.io/current/quickstart/ce-docker-quickstart.html 5 | https://github.com/confluentinc/cp-docker-images/tree/5.2.1-post/examples -------------------------------------------------------------------------------- /kafka-docker/docker-compose.yml: -------------------------------------------------------------------------------- 1 | --- 2 | version: '2' 3 | services: 4 | zookeeper: 5 | image: confluentinc/cp-zookeeper:5.2.1 6 | hostname: zookeeper 7 | container_name: zookeeper 8 | ports: 9 | - "2181:2181" 10 | environment: 11 | ZOOKEEPER_CLIENT_PORT: 2181 12 | ZOOKEEPER_TICK_TIME: 2000 13 | 14 | broker: 15 | image: confluentinc/cp-enterprise-kafka:5.2.1 16 | hostname: broker 17 | container_name: broker 18 | depends_on: 19 | - zookeeper 20 | ports: 21 | - "29092:29092" 22 | - "9092:9092" 23 | environment: 24 | KAFKA_BROKER_ID: 1 25 | KAFKA_ZOOKEEPER_CONNECT: 'zookeeper:2181' 26 | KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT 27 | KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://broker:29092,PLAINTEXT_HOST://localhost:9092 28 | KAFKA_METRIC_REPORTERS: io.confluent.metrics.reporter.ConfluentMetricsReporter 29 | KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1 30 | KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 0 31 | CONFLUENT_METRICS_REPORTER_BOOTSTRAP_SERVERS: broker:29092 32 | CONFLUENT_METRICS_REPORTER_ZOOKEEPER_CONNECT: zookeeper:2181 33 | CONFLUENT_METRICS_REPORTER_TOPIC_REPLICAS: 1 34 | CONFLUENT_METRICS_ENABLE: 'true' 35 | CONFLUENT_SUPPORT_CUSTOMER_ID: 'anonymous' 36 | 37 | schema-registry: 38 | image: confluentinc/cp-schema-registry:5.2.1 39 | hostname: schema-registry 40 | container_name: schema-registry 41 | depends_on: 42 | - zookeeper 43 | - broker 44 | ports: 45 | - "8081:8081" 46 | environment: 47 | SCHEMA_REGISTRY_HOST_NAME: schema-registry 48 | SCHEMA_REGISTRY_KAFKASTORE_CONNECTION_URL: 'zookeeper:2181' 49 | 50 | connect: 51 | image: confluentinc/kafka-connect-datagen:latest 52 | build: 53 | context: . 
54 | dockerfile: Dockerfile 55 | hostname: connect 56 | container_name: connect 57 | depends_on: 58 | - zookeeper 59 | - broker 60 | - schema-registry 61 | ports: 62 | - "8083:8083" 63 | environment: 64 | CONNECT_BOOTSTRAP_SERVERS: 'broker:29092' 65 | CONNECT_REST_ADVERTISED_HOST_NAME: connect 66 | CONNECT_REST_PORT: 8083 67 | CONNECT_GROUP_ID: compose-connect-group 68 | CONNECT_CONFIG_STORAGE_TOPIC: docker-connect-configs 69 | CONNECT_CONFIG_STORAGE_REPLICATION_FACTOR: 1 70 | CONNECT_OFFSET_FLUSH_INTERVAL_MS: 10000 71 | CONNECT_OFFSET_STORAGE_TOPIC: docker-connect-offsets 72 | CONNECT_OFFSET_STORAGE_REPLICATION_FACTOR: 1 73 | CONNECT_STATUS_STORAGE_TOPIC: docker-connect-status 74 | CONNECT_STATUS_STORAGE_REPLICATION_FACTOR: 1 75 | CONNECT_KEY_CONVERTER: org.apache.kafka.connect.storage.StringConverter 76 | CONNECT_VALUE_CONVERTER: io.confluent.connect.avro.AvroConverter 77 | CONNECT_VALUE_CONVERTER_SCHEMA_REGISTRY_URL: http://schema-registry:8081 78 | CONNECT_INTERNAL_KEY_CONVERTER: "org.apache.kafka.connect.json.JsonConverter" 79 | CONNECT_INTERNAL_VALUE_CONVERTER: "org.apache.kafka.connect.json.JsonConverter" 80 | CONNECT_ZOOKEEPER_CONNECT: 'zookeeper:2181' 81 | # Assumes image is based on confluentinc/kafka-connect-datagen:latest which is pulling 5.1.1 Connect image 82 | CLASSPATH: /usr/share/java/monitoring-interceptors/monitoring-interceptors-5.2.1.jar 83 | CONNECT_PRODUCER_INTERCEPTOR_CLASSES: "io.confluent.monitoring.clients.interceptor.MonitoringProducerInterceptor" 84 | CONNECT_CONSUMER_INTERCEPTOR_CLASSES: "io.confluent.monitoring.clients.interceptor.MonitoringConsumerInterceptor" 85 | CONNECT_PLUGIN_PATH: "/usr/share/java,/usr/share/confluent-hub-components" 86 | CONNECT_LOG4J_LOGGERS: org.apache.zookeeper=ERROR,org.I0Itec.zkclient=ERROR,org.reflections=ERROR 87 | 88 | control-center: 89 | image: confluentinc/cp-enterprise-control-center:5.2.1 90 | hostname: control-center 91 | container_name: control-center 92 | depends_on: 93 | - zookeeper 94 | - broker 95 | - schema-registry 96 | - connect 97 | - ksql-server 98 | ports: 99 | - "9021:9021" 100 | environment: 101 | CONTROL_CENTER_BOOTSTRAP_SERVERS: 'broker:29092' 102 | CONTROL_CENTER_ZOOKEEPER_CONNECT: 'zookeeper:2181' 103 | CONTROL_CENTER_CONNECT_CLUSTER: 'connect:8083' 104 | CONTROL_CENTER_KSQL_URL: "http://ksql-server:8088" 105 | CONTROL_CENTER_KSQL_ADVERTISED_URL: "http://localhost:8088" 106 | CONTROL_CENTER_SCHEMA_REGISTRY_URL: "http://schema-registry:8081" 107 | CONTROL_CENTER_REPLICATION_FACTOR: 1 108 | CONTROL_CENTER_INTERNAL_TOPICS_PARTITIONS: 1 109 | CONTROL_CENTER_MONITORING_INTERCEPTOR_TOPIC_PARTITIONS: 1 110 | CONFLUENT_METRICS_TOPIC_REPLICATION: 1 111 | PORT: 9021 112 | 113 | ksql-server: 114 | image: confluentinc/cp-ksql-server:5.2.1 115 | hostname: ksql-server 116 | container_name: ksql-server 117 | depends_on: 118 | - broker 119 | - connect 120 | ports: 121 | - "8088:8088" 122 | environment: 123 | KSQL_CONFIG_DIR: "/etc/ksql" 124 | KSQL_LOG4J_OPTS: "-Dlog4j.configuration=file:/etc/ksql/log4j-rolling.properties" 125 | KSQL_BOOTSTRAP_SERVERS: "broker:29092" 126 | KSQL_HOST_NAME: ksql-server 127 | KSQL_APPLICATION_ID: "cp-all-in-one" 128 | KSQL_LISTENERS: "http://0.0.0.0:8088" 129 | KSQL_CACHE_MAX_BYTES_BUFFERING: 0 130 | KSQL_KSQL_SCHEMA_REGISTRY_URL: "http://schema-registry:8081" 131 | KSQL_PRODUCER_INTERCEPTOR_CLASSES: "io.confluent.monitoring.clients.interceptor.MonitoringProducerInterceptor" 132 | KSQL_CONSUMER_INTERCEPTOR_CLASSES: "io.confluent.monitoring.clients.interceptor.MonitoringConsumerInterceptor" 
133 | 134 | ksql-cli: 135 | image: confluentinc/cp-ksql-cli:5.2.1 136 | container_name: ksql-cli 137 | depends_on: 138 | - broker 139 | - connect 140 | - ksql-server 141 | entrypoint: /bin/sh 142 | tty: true 143 | 144 | ksql-datagen: 145 | # Downrev ksql-examples to 5.1.2 due to DEVX-798 (work around issues in 5.2.0) 146 | image: confluentinc/ksql-examples:5.1.2 147 | hostname: ksql-datagen 148 | container_name: ksql-datagen 149 | depends_on: 150 | - ksql-server 151 | - broker 152 | - schema-registry 153 | - connect 154 | command: "bash -c 'echo Waiting for Kafka to be ready... && \ 155 | cub kafka-ready -b broker:29092 1 40 && \ 156 | echo Waiting for Confluent Schema Registry to be ready... && \ 157 | cub sr-ready schema-registry 8081 40 && \ 158 | echo Waiting a few seconds for topic creation to finish... && \ 159 | sleep 11 && \ 160 | tail -f /dev/null'" 161 | environment: 162 | KSQL_CONFIG_DIR: "/etc/ksql" 163 | KSQL_LOG4J_OPTS: "-Dlog4j.configuration=file:/etc/ksql/log4j-rolling.properties" 164 | STREAMS_BOOTSTRAP_SERVERS: broker:29092 165 | STREAMS_SCHEMA_REGISTRY_HOST: schema-registry 166 | STREAMS_SCHEMA_REGISTRY_PORT: 8081 167 | 168 | rest-proxy: 169 | image: confluentinc/cp-kafka-rest:5.2.1 170 | depends_on: 171 | - zookeeper 172 | - broker 173 | - schema-registry 174 | ports: 175 | - 8082:8082 176 | hostname: rest-proxy 177 | container_name: rest-proxy 178 | environment: 179 | KAFKA_REST_HOST_NAME: rest-proxy 180 | KAFKA_REST_BOOTSTRAP_SERVERS: 'broker:29092' 181 | KAFKA_REST_LISTENERS: "http://0.0.0.0:8082" 182 | KAFKA_REST_SCHEMA_REGISTRY_URL: 'http://schema-registry:8081' -------------------------------------------------------------------------------- /kafka-ksql/README.md: -------------------------------------------------------------------------------- 1 | # Kafka ksql 2 | 3 | ## Steps to connect to ksql running with docker. ## 4 | 5 | docker exec -it ksql-server /bin/bash 6 | 7 | ## comamnd to data generate to ksql stream topic ## 8 | docker exec -it ksql-datagen bash 9 | ksql-datagen \ 10 | bootstrap-server=broker:29092 \ 11 | quickstart=pageviews \ 12 | format=delimited \ 13 | topic=pageviews \ 14 | maxInterval=500 15 | 16 | ksql-datagen schema=./userprofile.avro format=json topic=USERPROFILE key=userid maxInterval=5000 iterations=100ootstrap-server=broker:29092 17 | 18 | **KSQL will execute in 2 modes** 19 | 20 | Interactive and headless 21 | 22 | **is rest API supported in headless mode?** 23 | 24 | No. rest API is not supported in headless way. 25 | 26 | **what is KTable?** 27 | 28 | **what is GlobalKTable?** 29 | 30 | **Ktable vs GlobalKTable?** 31 | 32 | **Type of join supported by KSQL?** 33 | 34 | **output types in joins** 35 | 36 | join between a stream and a stream will return a new stream 37 | join between a stream and a table will return a stream 38 | join between a table and a table will return a table 39 | 40 | **How to terminate a query** 41 | 42 | terminate query "query_name" 43 | 44 | **How to run a ksql script from cli** 45 | 46 | run script "./path/to/file.ksql" 47 | 48 | **How to print a topic from beginning** 49 | 50 | print 'topicname' from beginning 51 | 52 | **How to stream from beginning** 53 | 54 | set 'auto.offset.reset'='earliest'; 55 | select * from stream; 56 | 57 | **Create a new stream from a stream** 58 | 59 | create stream user_profile_pretty as select firstname || ' ' || ucase( lastname) || ' from ' || countrycode || ' has a rating of ' || cast(rating as varchar) || ' stars. 
' || case when rating < 2.5 then 'Poor' when rating between 2.5 and 4.2 then 'Good'else 'Excellent'end as description from userprofile; 60 | 61 | **How to set infinite retention in kafka** 62 | 63 | The kafka topic underpinning the static reference table can have infinite retention (log.retention.hours set to -1). That way the data will "always" be there for a join -------------------------------------------------------------------------------- /kafka-producer/README.md: -------------------------------------------------------------------------------- 1 | # Kafka producer 2 | 3 | ## Key points 4 | 5 | **What happens after producerRecord is sent? 6 | 7 | Step #1: Once we send the ProducerRecord, the first thing the producer will do is serialize the key and value objects to ByteArrays so they can be sent over the network 8 | Step #2: Data is sent to partitioner, If we specified a partition in the ProducerRecord, the partitioner doesn’t do anything and simply returns the partition we specified. 9 | If we didn’t, the partitioner will choose a partition for us, usually based on the ProducerRecord key. 10 | Step #3: Adds the record to a batch of records that will also be sent to the same topic and partition 11 | Step #4: A separate thread is responsible for sending those batches of records to the appropriate Kafka brokers. 12 | Step #5: Broker receives the messages, it sends back a response. 13 | Successful : it will return a RecordMetadata object with the topic, partition, and the offset of the record within the partition. 14 | Failed: it will return a error code, producer may retry sending few more times before giving up and returning an error. 15 | 16 | 17 | **When will a produced message will be ready to consume?** 18 | 19 | Messages written to the partition leader are not immediately readable by consumers regardless of the producer's acknowledgment settings. 20 | When all in-sync replicas have acknowledged the write, then the message is considered committed, which makes it available for reading. 21 | This ensures that messages cannot be lost by a broker failure after they have already been read. 22 | 23 | **What are the producer key configuration?** 24 | 25 | key configurations are explained here : https://docs.confluent.io/current/clients/producer.html 26 | 27 | **key.serializer** 28 | 29 | producer interface allows user to provide the key.serializer. 30 | Available serializers: ByteArraySerializer, StringSerializer and IntegerSerializer. 31 | setting key.serializer is required even if you intend to send only values. 32 | 33 | **value.serializer** 34 | 35 | producer interface allows the user to provide the value.serializer. like the same way as key.serializer. 36 | 37 | **retries** 38 | 39 | Setting a value greater than zero will cause the client to resend any record whose send fails with a potentially transient error. Note that this retry is no different than if the client resents the record upon receiving the error. Allowing retries without setting max.in.flight.requests.per.connection to 1 will potentially change the ordering of records because if two batches are sent to a single partition, and the first fails and is retried but the second succeeds, then the records in the second batch may appear first. Note additionally that produce requests will be failed before the number of retries has been exhausted if the timeout configured by delivery.timeout.ms expires first before the successful acknowledgment. 
Users should generally prefer to leave this config unset and instead use delivery.timeout.ms to control retry behavior. 40 | default: 2147483647 41 | By default, the producer will wait 100ms between retries, but you can control this using the retry.backoff.ms parameter. 42 | 43 | **acks** 44 | 45 | If acks=0, the producer will not wait for a reply from the broker before assuming the message was sent successfully. 46 | If acks=1, the producer will receive a success response from the broker the moment the leader replica received the message 47 | If the client uses callbacks, latency will be hidden, but throughput will be limited by the number of in-flight messages (i.e., how many messages the producer will send before receiving replies from the server). 48 | If acks=all, the producer will receive a success response from the broker once all in-sync replicas received the message. 49 | if acks is set to all, the request will be stored in a buffer called purgatory until the leader observes that the follower replicas replicated the message, at which point a response is sent to the client 50 | 51 | **buffer.memory** 52 | 53 | This sets the amount of memory the producer will use to buffer messages waiting to be sent to brokers. 54 | If messages are sent by the application faster than they can be delivered to the server, the producer may run out of space and additional send() calls will either block or throw an exception, based on the block.on.buffer.full parameter (replaced with max.block.ms in release 0.9.0.0, which allows blocking for a certain time and then throwing an exception). 55 | 56 | **batch.size** 57 | 58 | When multiple records are sent to the same partition, the producer will batch them together. 59 | This parameter controls the amount of memory in bytes (not messages!) that will be used for each batch. 60 | When the batch is full, all the messages in the batch will be sent. However, this does not mean that the producer will wait for the batch to become full. 61 | The producer will send half-full batches and even batches with just a single message in them. 62 | Therefore, setting the batch size too large will not cause delays in sending messages; it will just use more memory for the batches. 63 | Setting the batch size too small will add some overhead because the producer will need to send messages more frequently. 64 | 65 | **linger.ms** 66 | 67 | linger.ms controls the amount of time to wait for additional messages before sending the current batch. 68 | KafkaProducer sends a batch of messages either when the current batch is full or when the linger.ms limit is reached. 69 | By default, the producer will send messages as soon as there is a sender thread available to send them, even if there’s just one message in the batch. 70 | By setting linger.ms higher than 0, we instruct the producer to wait a few milliseconds to add additional messages to the batch before sending it to the brokers. 71 | This increases latency but also increases throughput (because we send more messages at once, there is less overhead per message). 72 | 73 | **compression.type** 74 | 75 | By default, messages are sent uncompressed 76 | supported compression types: snappy, gzip and lz4. 77 | snappy is recommended, with low CPU and good performance and decent compression ratio. 78 | Gzip use more CPU and time, but result in better compression ratio. 79 | 80 | **max.in.flight.requests.per.connection** 81 | 82 | This controls how many messages the producer will send to the server without receiving responses. 
83 | Setting this high can increase memory usage while improving throughput, but setting it too high can reduce throughput as batching becomes less efficient. 84 | Setting this to 1 will guarantee that messages will be written to the broker in the order in which they were sent, even when retries occur. 85 | 86 | **Methods of sending messages** 87 | 88 | * Fire-and-forget : 89 | We may lose data in this situation. Possible cases of losing data: SerializationException when it fails to serialize the message, a BufferExhaustedException or TimeoutException if the buffer is full, or an InterruptException if the sending thread was interrupted. 90 | not recommended for production use. 91 | * Synchronous send 92 | We user Future.get() to wait for a reply from Kafka 93 | * Asynchronous send 94 | We call the send() method with a callback function, which gets triggered when it receives a response from the Kafka broker. 95 | 96 | **Type of errors** 97 | 98 | Retriable: errors are those that can be resolved by sending the message again. 99 | For example, a connection error can be resolved because the connection may get reestablished. 100 | A “no leader” error can be resolved when a new leader is elected for the partition. 101 | KafkaProducer can be configured to retry those errors automatically. 102 | non-retriable: For example, “message size too large.” In those cases, KafkaProducer will not attempt a retry and will return the exception immediately. 103 | 104 | **Explain bootstrap.servers** 105 | 106 | bootstrap.servers property so that the producer can find the Kafka cluster. 107 | 108 | **Explain client.id** 109 | 110 | Although not required, you should always set a client.id since this allows you to easily correlate requests on the broker with the client instance which made it. 111 | 112 | 113 | **Where can I find the full list of configs documentation?** 114 | 115 | https://docs.confluent.io/current/installation/configuration/producer-configs.html#cp-config-producer 116 | 117 | **unclean.leader.election.enable** 118 | 119 | the default value is true, this will allow out of sync replicas to become leaders. 120 | This should be disabled in critical applications like banking system managing transactions. 121 | 122 | **min.insync.replicas** 123 | 124 | This will ensure the minimum number of replicas are in sync. 125 | NotEnoughReplicasException will be thrown to the producer when the in-sync replicas are less then what is configured. 
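Pulling the producer settings above together, here is a hedged sketch of a producer configured for durability (acks=all, bounded in-flight requests, light batching and compression). The broker address, topic, and values are illustrative assumptions, not recommendations from this repo.

```
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class SafeProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");        // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");                                // wait for all in-sync replicas
        props.put(ProducerConfig.RETRIES_CONFIG, 3);                                 // retry transient errors
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 1);          // preserve ordering across retries
        props.put(ProducerConfig.LINGER_MS_CONFIG, 5);                               // small batching delay for throughput
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");                 // low-CPU compression

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("sample-topic", "1", "hello"),
                    (metadata, exception) -> {
                        // Asynchronous send: the callback reports the partition/offset or the error.
                        if (exception != null) {
                            exception.printStackTrace();
                        } else {
                            System.out.printf("partition=%d offset=%d%n", metadata.partition(), metadata.offset());
                        }
                    });
            producer.flush();
        }
    }
}
```

Note that min.insync.replicas itself is a broker/topic-level setting; acks=all only gives its full guarantee when that is also configured on the topic.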
126 | 127 | 128 | -------------------------------------------------------------------------------- /kafka-producer/kafka-producer.iml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | -------------------------------------------------------------------------------- /kafka-producer/pom.xml: -------------------------------------------------------------------------------- 1 | 5 | 6 | 4.0.0 7 | com.varma.kafka1 8 | kafka-producer 9 | 1.0.0 10 | 11 | 12 | 13 | confluent 14 | http://packages.confluent.io/maven/ 15 | 16 | 17 | 18 | 19 | 20 | org.apache.kafka 21 | kafka_2.12 22 | 5.3.0-ccs 23 | 24 | 25 | 26 | org.apache.kafka 27 | kafka_2.11 28 | 2.2.0 29 | test 30 | test 31 | 32 | 33 | 34 | org.apache.kafka 35 | kafka-clients 36 | 0.11.0.2 37 | test 38 | test 39 | 40 | 41 | 42 | org.slf4j 43 | slf4j-api 44 | 1.7.25 45 | 46 | 47 | 48 | org.slf4j 49 | slf4j-jdk14 50 | 1.7.25 51 | 52 | 53 | 54 | io.confluent 55 | kafka-avro-serializer 56 | 5.2.1 57 | 58 | 59 | 60 | junit 61 | junit 62 | 4.13.1 63 | test 64 | 65 | 66 | 67 | -------------------------------------------------------------------------------- /kafka-producer/src/main/java/com/varma/kafka/CustomKafkaProducer.java: -------------------------------------------------------------------------------- 1 | package com.varma.kafka; 2 | 3 | import org.apache.kafka.clients.producer.ProducerConfig; 4 | import org.apache.kafka.clients.producer.ProducerRecord; 5 | import org.apache.kafka.clients.producer.RecordMetadata; 6 | import org.apache.kafka.common.serialization.StringSerializer; 7 | 8 | import java.util.Properties; 9 | import java.util.concurrent.Future; 10 | 11 | public class CustomKafkaProducer { 12 | 13 | org.apache.kafka.clients.producer.KafkaProducer producer; 14 | 15 | public CustomKafkaProducer() { 16 | super(); 17 | this.producer = new org.apache.kafka.clients.producer.KafkaProducer(getProperties()); 18 | } 19 | 20 | 21 | private Properties getProperties(){ 22 | Properties config = new Properties(); 23 | config.put("client.id", "sample-client-id"); 24 | config.put("bootstrap.servers", "13.233.142.162:9092,13.233.142.162:29092"); 25 | config.put("acks", "all"); 26 | config.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName()); 27 | config.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName()); 28 | return config; 29 | } 30 | 31 | public boolean produce(String topic, String key, String value){ 32 | final ProducerRecord record = new ProducerRecord(topic, key, value); 33 | Future future = producer.send(record); 34 | return true; 35 | } 36 | 37 | } 38 | -------------------------------------------------------------------------------- /kafka-producer/src/main/java/com/varma/kafka/KafkaProducerMain.java: -------------------------------------------------------------------------------- 1 | package com.varma.kafka; 2 | 3 | import org.slf4j.Logger; 4 | import org.slf4j.LoggerFactory; 5 | 6 | public class KafkaProducerMain { 7 | 8 | private static Logger LOGGER = LoggerFactory.getLogger(KafkaProducerMain.class); 9 | 10 | public static void main(String[] args) throws Exception { 11 | LOGGER.info("Running kafka producer..."); 12 | CustomKafkaProducer customKafkaProducer = new CustomKafkaProducer(); 13 | 
customKafkaProducer.produce("sample-topic","1", "This is a sample message"); 14 | LOGGER.info("Running kafka producer completed!! "); 15 | } 16 | } 17 | -------------------------------------------------------------------------------- /kafka-producer/src/main/resources/log4j.properties: -------------------------------------------------------------------------------- 1 | log4j.rootLogger=TRACE, stdout 2 | log4j.appender.stdout=org.apache.log4j.ConsoleAppender 3 | log4j.appender.stdout.Target=System.out 4 | log4j.appender.stdout.layout=org.apache.log4j.PatternLayout 5 | log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd'T'HH:mm:ss.SSS} %-5p [%c] - %m%n -------------------------------------------------------------------------------- /kafka-producer/src/test/java/com/varma/kafka/CustomKafkaProducerTest.java: -------------------------------------------------------------------------------- 1 | package com.varma.kafka; 2 | 3 | import io.confluent.common.utils.MockTime; 4 | import io.confluent.common.utils.Time; 5 | import kafka.admin.AdminUtils; 6 | import kafka.admin.RackAwareMode; 7 | import kafka.server.KafkaConfig; 8 | import kafka.server.KafkaServer; 9 | import kafka.utils.TestUtils; 10 | import kafka.utils.ZKStringSerializer$; 11 | import kafka.utils.ZkUtils; 12 | import kafka.zk.EmbeddedZookeeper; 13 | import org.I0Itec.zkclient.ZkClient; 14 | import org.junit.Test; 15 | 16 | import java.util.Properties; 17 | 18 | public class CustomKafkaProducerTest { 19 | 20 | private static final String ZKHOST = "127.0.0.1"; 21 | private static final String BROKERHOST = "127.0.0.1"; 22 | private static final String BROKERPORT = "9092"; 23 | private static final String TOPIC = "test"; 24 | 25 | @Test 26 | public void testProducer(){ 27 | /* EmbeddedZookeeper zkServer = new EmbeddedZookeeper(); 28 | String zkConnect = ZKHOST + ":" + zkServer.port(); 29 | ZkClient zkClient = new ZkClient(zkConnect, 30000, 30000, ZKStringSerializer$.MODULE$); 30 | ZkUtils zkUtils = ZkUtils.apply(zkClient, false); 31 | 32 | // setup Broker 33 | Properties brokerProps = new Properties(); 34 | brokerProps.setProperty("zookeeper.connect", zkConnect); 35 | brokerProps.setProperty("broker.id", "0"); 36 | // brokerProps.setProperty("log.dirs", Files.createTempDirectory("kafka-").toAbsolutePath().toString()); 37 | brokerProps.setProperty("listeners", "PLAINTEXT://" + BROKERHOST +":" + BROKERPORT); 38 | brokerProps.setProperty("offsets.topic.replication.factor" , "1"); 39 | KafkaConfig config = new KafkaConfig(brokerProps); 40 | *//* Time mock = new MockTime(); 41 | KafkaServer kafkaServer = TestUtils.createServer(config, mock);*//* 42 | 43 | // create topic 44 | AdminUtils.createTopic(zkUtils, TOPIC, 1, 1, new Properties(), RackAwareMode.Disabled$.MODULE$);*/ 45 | } 46 | 47 | } -------------------------------------------------------------------------------- /kafka-producer/src/test/java/com/varma/kafka/ProducerTest.java: -------------------------------------------------------------------------------- 1 | package com.varma.kafka; 2 | 3 | import org.junit.Test; 4 | 5 | import static org.junit.Assert.*; 6 | 7 | public class ProducerTest { 8 | 9 | @Test 10 | public void testKafkaProducer(){ 11 | 12 | } 13 | 14 | } -------------------------------------------------------------------------------- /kafka-schema-registry/README.md: -------------------------------------------------------------------------------- 1 | # Kafka schema-registry 2 | 3 | ## How to connect and run schema registry commands? 
4 | 5 | ### docker-commands 6 | docker exec -it schema-registry /bin/bash 7 | 8 | ### useful kafka commands 9 | 10 | kafka-avro-console-producer \ 11 | --broker-list broker:29092 --topic bar \ 12 | --property value.schema='{"type":"record","name":"myrecord","fields":[{"name":"f1","type":"string"}]}' 13 | 14 | Sample messages: 15 | {"f1": "value1"} 16 | {"f1": "value2"} 17 | 18 | kafka-avro-console-consumer \ 19 | --topic customer-avro \ 20 | --bootstrap-server broker:29092 \ 21 | --from-beginning 22 | 23 | **Where does Kafka store schemas?** 24 | 25 | Schema Registry stores schema information in the internal _schemas Kafka topic. 26 | 27 | **Which schema-registry properties does a producer need in order to register a schema?** 28 | 29 | props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); 30 | props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, io.confluent.kafka.serializers.KafkaAvroSerializer.class); 31 | props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, io.confluent.kafka.serializers.KafkaAvroSerializer.class); 32 | props.put("schema.registry.url", "http://localhost:8081"); 33 | 34 | **How to register a schema for a key?** 35 | 36 | props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, io.confluent.kafka.serializers.KafkaAvroSerializer.class); 37 | Setting the key serializer to KafkaAvroSerializer registers a schema for the key as well (under the <topic>-key subject). 38 | 39 | **What is a SpecificRecord or GenericRecord?** 40 | 41 | SpecificRecord: 42 | An auto-generated POJO created by the avro-maven-plugin from a provided .avsc schema file. 43 | A missing or renamed field is caught at compile time. 44 | GenericRecord: 45 | The schema is supplied explicitly at runtime and fields are accessed by name or index. 46 | A missing field is only detected at runtime, not at compile time. 47 | 48 | **Will the Kafka producer register the schema for every produced message?** 49 | 50 | No. The schema is registered when the first message is published and is then cached in a map inside KafkaAvroSerializer. 51 | Subsequent sends reuse the cached schema ID. 52 | 53 | **Will the Kafka consumer fetch the schema for every consumed message?** 54 | 55 | No. The schema is fetched from the registry when the first message carrying that schema ID is consumed and is then cached in a map inside KafkaAvroDeserializer. 56 | Subsequent records reuse the cached schema. 57 | 58 | **What happens when deserializing a message if the schema is not available in the cache?** 59 | 60 | The schema is fetched from the schema registry using the schema ID embedded in the message, and then cached. 61 | 62 | **What is the magic byte?** 63 | 64 | The magic byte is a single byte (currently 0) identifying the Confluent wire-format version; it is followed by the 4-byte schema ID returned when the schema was registered with the schema registry. 65 | This prefix is prepended to each serialized message, and the consumer reads it to look up the schema while decoding the message, as in the sketch below.
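The prefix is easy to inspect by hand. The helper below is a minimal sketch (it is not part of this repo; the class and method names are made up) assuming the standard Confluent wire format: one magic byte with value 0, then a 4-byte schema ID, then the Avro-encoded payload.

import java.nio.ByteBuffer;

public class WireFormatInspector {

    /** Returns the schema ID embedded in a Confluent-serialized record value. */
    public static int schemaIdOf(byte[] serializedValue) {
        ByteBuffer buffer = ByteBuffer.wrap(serializedValue);
        byte magicByte = buffer.get();      // wire-format version, currently always 0
        if (magicByte != 0) {
            throw new IllegalArgumentException("Unknown magic byte: " + magicByte);
        }
        int schemaId = buffer.getInt();     // 4-byte schema ID assigned by the registry
        // whatever remains in the buffer is the Avro-encoded payload
        return schemaId;
    }
}

KafkaAvroDeserializer performs the same lookup internally before asking its cache (or the registry) for the schema with that ID.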
66 | 67 | **What compatibility types does the schema registry support?** 68 | 69 | These are the compatibility types: 70 | BACKWARD: (default) consumers using the new schema can read data written by producers using the latest registered schema 71 | BACKWARD_TRANSITIVE: consumers using the new schema can read data written by producers using all previously registered schemas 72 | FORWARD: consumers using the latest registered schema can read data written by producers using the new schema 73 | FORWARD_TRANSITIVE: consumers using all previously registered schemas can read data written by producers using the new schema 74 | FULL: the new schema is forward and backward compatible with the latest registered schema 75 | FULL_TRANSITIVE: the new schema is forward and backward compatible with all previously registered schemas 76 | NONE: schema compatibility checks are disabled 77 | 78 | **What is the default compatibility type?** 79 | 80 | BACKWARD 81 | 82 | **Can a custom POJO be auto-registered with the schema registry?** 83 | 84 | No. This has to be handled explicitly: the class must be adapted to look like the classes auto-generated by the maven plugin (i.e. an Avro SpecificRecord) before its schema can be registered. 85 | 86 | **How to disable automatic schema registration?** 87 | 88 | By default, client applications automatically register new schemas. This is very convenient in development environments, but in production environments it is recommended that client applications do not automatically register new schemas. 89 | props.put(AbstractKafkaAvroSerDeConfig.AUTO_REGISTER_SCHEMAS, false); 90 | -------------------------------------------------------------------------------- /kafka-schema-registry/pom.xml: -------------------------------------------------------------------------------- 1 | 5 | 6 | 4.0.0 7 | com.varma.kafka1 8 | kafka-schema-registry 9 | 1.0.0 10 | 11 | 12 | 1.8.2 13 | 0.11.0.1-cp1 14 | 3.3.1 15 | 16 | 17 | 18 | 19 | 20 | confluent 21 | http://packages.confluent.io/maven/ 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | org.apache.avro 31 | avro 32 | ${avro.version} 33 | 34 | 35 | 36 | 37 | 38 | org.apache.kafka 39 | kafka-clients 40 | ${kafka.version} 41 | 42 | 43 | 44 | io.confluent 45 | kafka-avro-serializer 46 | ${confluent.version} 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | org.apache.maven.plugins 55 | maven-compiler-plugin 56 | 3.7.0 57 | 58 | 1.8 59 | 1.8 60 | 61 | 62 | 63 | 64 | 65 | 66 | org.apache.avro 67 | avro-maven-plugin 68 | ${avro.version} 69 | 70 | 71 | generate-sources 72 | 73 | schema 74 | protocol 75 | idl-protocol 76 | 77 | 78 | ${project.basedir}/src/main/resources/avro 79 | String 80 | false 81 | true 82 | private 83 | 84 | 85 | 86 | 87 | 88 | 89 | org.codehaus.mojo 90 | build-helper-maven-plugin 91 | 3.0.0 92 | 93 | 94 | add-source 95 | generate-sources 96 | 97 | add-source 98 | 99 | 100 | 101 | target/generated-sources/avro 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | -------------------------------------------------------------------------------- /kafka-schema-registry/src/main/java/com/varma/kafka/CustomKafkaProducer.java: -------------------------------------------------------------------------------- 1 | package com.varma.kafka; 2 | 3 | import org.apache.kafka.clients.producer.ProducerConfig; 4 | import org.apache.kafka.clients.producer.ProducerRecord; 5 | import org.apache.kafka.clients.producer.RecordMetadata; 6 | import org.apache.kafka.common.serialization.StringSerializer; 7 | 8 | import java.util.Properties; 9 | import java.util.concurrent.Future; 10 | 11 | public class CustomKafkaProducer { 12 | 13 |
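    // Note: this copy of CustomKafkaProducer still uses StringSerializer for key and value;
    // the Avro + schema-registry configuration for this module lives in KafkaProducerMain below.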
org.apache.kafka.clients.producer.KafkaProducer producer; 14 | 15 | public CustomKafkaProducer() { 16 | super(); 17 | this.producer = new org.apache.kafka.clients.producer.KafkaProducer(getProperties()); 18 | } 19 | 20 | 21 | private Properties getProperties(){ 22 | Properties config = new Properties(); 23 | config.put("client.id", "sample-client-id"); 24 | config.put("bootstrap.servers", "13.233.142.162:9092,13.233.142.162:29092"); 25 | config.put("acks", "all"); 26 | config.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName()); 27 | config.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName()); 28 | return config; 29 | } 30 | 31 | public boolean produce(String topic, String key, String value){ 32 | final ProducerRecord record = new ProducerRecord(topic, key, value); 33 | Future future = producer.send(record); 34 | return true; 35 | } 36 | 37 | } 38 | -------------------------------------------------------------------------------- /kafka-schema-registry/src/main/java/com/varma/kafka/KafkaProducerMain.java: -------------------------------------------------------------------------------- 1 | package com.varma.kafka; 2 | 3 | import com.example.Customer; 4 | import io.confluent.kafka.serializers.KafkaAvroSerializer; 5 | import org.apache.kafka.clients.producer.*; 6 | import org.apache.kafka.common.serialization.StringSerializer; 7 | import org.slf4j.Logger; 8 | import org.slf4j.LoggerFactory; 9 | 10 | import java.util.Properties; 11 | 12 | public class KafkaProducerMain { 13 | 14 | private static Logger LOGGER = LoggerFactory.getLogger(KafkaProducerMain.class); 15 | 16 | public static void main(String[] args) throws Exception { 17 | Properties properties = new Properties(); 18 | // normal producer 19 | properties.setProperty("bootstrap.servers", "13.233.161.161:29092"); 20 | properties.setProperty("acks", "all"); 21 | properties.setProperty("retries", "10"); 22 | // avro part 23 | properties.setProperty("key.serializer", StringSerializer.class.getName()); 24 | properties.setProperty("value.serializer", KafkaAvroSerializer.class.getName()); 25 | properties.setProperty("schema.registry.url", "http://13.233.161.161:8081"); 26 | 27 | Producer producer = new KafkaProducer(properties); 28 | 29 | String topic = "customer-avro"; 30 | 31 | // copied from avro examples 32 | Customer customer = Customer.newBuilder() 33 | .setAge(34) 34 | .setAutomatedEmail(false) 35 | .setFirstName("John") 36 | .setLastName("Doe") 37 | .setHeight(178f) 38 | .setWeight(75f) 39 | .build(); 40 | 41 | ProducerRecord producerRecord = new ProducerRecord( 42 | topic, customer 43 | ); 44 | 45 | System.out.println(customer); 46 | producer.send(producerRecord, new Callback() { 47 | public void onCompletion(RecordMetadata metadata, Exception exception) { 48 | if (exception == null) { 49 | System.out.println(metadata); 50 | } else { 51 | exception.printStackTrace(); 52 | } 53 | } 54 | }); 55 | 56 | producer.flush(); 57 | producer.close(); 58 | 59 | } 60 | } 61 | -------------------------------------------------------------------------------- /kafka-schema-registry/src/main/resources/avro/customer.avsc: -------------------------------------------------------------------------------- 1 | { 2 | "type": "record", 3 | "namespace": "com.example", 4 | "name": "Customer", 5 | "version": "1", 6 | "fields": [ 7 | { "name": "first_name", "type": "string", "doc": "First Name of Customer" }, 8 | { "name": "last_name", "type": "string", "doc": "Last Name of Customer" }, 9 | { "name": 
"age", "type": "int", "doc": "Age at the time of registration" }, 10 | { "name": "height", "type": "float", "doc": "Height at the time of registration in cm" }, 11 | { "name": "weight", "type": "float", "doc": "Weight at the time of registration in kg" }, 12 | { "name": "automated_email", "type": "boolean", "default": true, "doc": "Field indicating if the user is enrolled in marketing emails" } 13 | ] 14 | } -------------------------------------------------------------------------------- /kafka-schema-registry/src/main/resources/log4j.properties: -------------------------------------------------------------------------------- 1 | log4j.rootLogger=TRACE, stdout 2 | log4j.appender.stdout=org.apache.log4j.ConsoleAppender 3 | log4j.appender.stdout.Target=System.out 4 | log4j.appender.stdout.layout=org.apache.log4j.PatternLayout 5 | log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd'T'HH:mm:ss.SSS} %-5p [%c] - %m%n -------------------------------------------------------------------------------- /kafka-streams/Processor-api-README.md: -------------------------------------------------------------------------------- 1 | # Kafka streams 2 | 3 | -------------------------------------------------------------------------------- /kafka-streams/README.md: -------------------------------------------------------------------------------- 1 | # Kafka streams 2 | 3 | **What is a topology?** 4 | 5 | Topology represent the DAG for streaming process. 6 | 7 | **what are kafka streams?** 8 | 9 | Kafka streams leveeraged the consumer and producer API, so all the properties applicable to consumer and producer is sill applicable here. 10 | application.id specific to stream applications will be used for 11 | consumer group.id = application.id 12 | default client.id prefix 13 | prefix to the internal changelog topics 14 | 15 | **Kafka streams creates internal intermidiate topics** 16 | 17 | repartitioning topics 18 | changelog topics 19 | 20 | **What are the types of stores in streams?** 21 | 22 | 1. local or internal state 23 | 2. external state 24 | 25 | **What is state store?** 26 | 27 | Some operations(like avg, count etc) in streams depends on previously calculated infromation for which state need to be stored. 28 | 29 | **What happens to local state if a node goes down?** 30 | 31 | local state will be stored in rockDB and it is also writted in kafka topic, in case of node failure state will be recreated from kafka. 32 | 33 | 34 | **What are the type of windowing options in streams?** 35 | 36 | 1. Tumbling window or sliding window 37 | 2. hoping window 38 | 39 | **How out of sync records are handled in streams?** 40 | 41 | **Stream vs table??** 42 | 43 | stream represents all the events occured in table. insert only mode. 44 | table represents the current state of the records. upsert mode. -------------------------------------------------------------------------------- /kafka-streams/kafka-DSL-README.md: -------------------------------------------------------------------------------- 1 | # Kafka streams 2 | 3 | -------------------------------------------------------------------------------- /pom.xml: -------------------------------------------------------------------------------- 1 | 2 | 6 | 4.0.0 7 | 8 | 9 | com.varma.kafka 10 | kafka-modules 11 | pom 12 | 1.0 13 | 14 | 15 | 16 | kafka-producer 17 | kafka-schema-registry 18 | 19 | 20 | --------------------------------------------------------------------------------