| Property Name | Value |
|---|---|
| spark.cassandra.connection.host | Your Cassandra endpoint: ACCOUNT_NAME.cassandra.cosmosdb.azure.com |
| spark.cassandra.connection.port | 10350 |
| spark.cassandra.connection.ssl.enabled | true |
| spark.cassandra.auth.username | COSMOSDB_ACCOUNTNAME |
| spark.cassandra.auth.password | COSMOSDB_KEY |
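
As one way of wiring these up, here is a minimal sketch that sets the values on a `SparkConf`; the endpoint, account name, and key are placeholders from the table above, and you could equally pass them via `spark-submit --conf` or your notebook's configuration.

```scala
import org.apache.spark.SparkConf

// Placeholder values: substitute your own Cosmos DB account endpoint,
// account name, and primary key before running.
val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "ACCOUNT_NAME.cassandra.cosmosdb.azure.com")
  .set("spark.cassandra.connection.port", "10350")
  .set("spark.cassandra.connection.ssl.enabled", "true")
  .set("spark.cassandra.auth.username", "COSMOSDB_ACCOUNTNAME")
  .set("spark.cassandra.auth.password", "COSMOSDB_KEY")
```
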
| Property Name | Description |
|---|---|
| spark.cassandra.output.batch.size.rows | 1. Leave this at 1; it is preferred for Cosmos DB's provisioning model in order to achieve higher throughput for heavy workloads. |
| spark.cassandra.connection.connections_per_executor_max | 10*n. Equivalent to 10 connections per node in an n-node Cassandra cluster. Hence, if you require 5 connections per node per executor for a 5-node Cassandra cluster, set this to 25. (Modify based on the degree of parallelism/number of executors that your Spark jobs are configured for.) |
| spark.cassandra.output.concurrent.writes | 100. Defines the number of parallel writes that can occur per executor. Because batch.size.rows is 1, make sure to scale up this value accordingly. (Modify this based on the degree of parallelism/throughput that you want to achieve for your workload.) |
| spark.cassandra.concurrent.reads | 512. Defines the number of parallel reads that can occur per executor. (Modify this based on the degree of parallelism/throughput that you want to achieve for your workload.) |
| spark.cassandra.output.throughput_mb_per_sec | Defines the total write throughput per executor. This can be used as an upper cap on your Spark job's throughput; base it on the provisioned throughput of your Cosmos DB collection. |
| spark.cassandra.input.reads_per_sec | Defines the total read throughput per executor. This can be used as an upper cap on your Spark job's throughput; base it on the provisioned throughput of your Cosmos DB collection. |
| spark.cassandra.output.batch.grouping.buffer.size | 1000 |
| spark.cassandra.connection.keep_alive_ms | 60000 |
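
As an illustration, the sketch below applies the starting values from the table to the `conf` built earlier. The value 10 for connections_per_executor_max assumes the 10*n guideline with n = 1; these numbers are the table's suggested starting points, not universally correct settings, so scale them to your cluster size, executor count, and provisioned RUs.

```scala
// Starting values from the tuning table above; adjust to your workload.
val tunedConf = conf
  .set("spark.cassandra.output.batch.size.rows", "1")
  .set("spark.cassandra.connection.connections_per_executor_max", "10") // 10*n, here n = 1
  .set("spark.cassandra.output.concurrent.writes", "100")
  .set("spark.cassandra.concurrent.reads", "512")
  .set("spark.cassandra.output.batch.grouping.buffer.size", "1000")
  .set("spark.cassandra.connection.keep_alive_ms", "60000")
```
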
* CosmosDbConnectionFactory.scala
* CosmosDbMultipleRetryPolicy.scala

### Retry Policy
The retry policy for Cosmos DB is configured to handle HTTP status code 429 (Request Rate Too Large) exceptions. The Cosmos DB Cassandra API translates these exceptions into overloaded errors on the Cassandra native protocol, which we want to retry with back-offs.
The reason for doing so is that Cosmos DB follows a provisioned throughput model; this retry policy protects your Spark jobs against spikes of data ingress/egress that momentarily exceed the throughput allocated to your collection and would otherwise surface as rate-limiting exceptions.

*Note: this retry policy is only meant to protect your Spark jobs against momentary spikes. If you have not provisioned enough RUs on your collection for the intended throughput of your workload, so that the retries cannot catch up, the retry policy will rethrow the exceptions.*

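For illustration only, here is a minimal sketch of such a policy written against the DataStax Java driver 3.x `RetryPolicy` interface. The class name, retry budget, and linear back-off schedule are assumptions made for this example, not the repo's actual CosmosDbMultipleRetryPolicy.

```scala
import com.datastax.driver.core.exceptions.{DriverException, OverloadedException}
import com.datastax.driver.core.policies.RetryPolicy
import com.datastax.driver.core.policies.RetryPolicy.RetryDecision
import com.datastax.driver.core.{Cluster, ConsistencyLevel, Statement, WriteType}

// Sketch only: maxRetries and the linear back-off are illustrative defaults.
class OverloadedBackoffRetryPolicy(maxRetries: Int = 10, backoffMs: Long = 1000)
    extends RetryPolicy {

  // Retry overloaded errors (Cosmos DB's translation of HTTP 429) after a
  // growing pause; once the retry budget is exhausted, rethrow to the caller.
  override def onRequestError(stmt: Statement, cl: ConsistencyLevel,
                              e: DriverException, nbRetry: Int): RetryDecision =
    e match {
      case _: OverloadedException if nbRetry < maxRetries =>
        Thread.sleep(backoffMs * (nbRetry + 1)) // linear back-off between attempts
        RetryDecision.retry(cl)
      case _ => RetryDecision.rethrow()         // retries didn't catch up: rethrow
    }

  // Timeouts and unavailability are not rate-limiting: rethrow immediately.
  override def onReadTimeout(stmt: Statement, cl: ConsistencyLevel,
                             requiredResponses: Int, receivedResponses: Int,
                             dataRetrieved: Boolean, nbRetry: Int): RetryDecision =
    RetryDecision.rethrow()

  override def onWriteTimeout(stmt: Statement, cl: ConsistencyLevel,
                              writeType: WriteType, requiredAcks: Int,
                              receivedAcks: Int, nbRetry: Int): RetryDecision =
    RetryDecision.rethrow()

  override def onUnavailable(stmt: Statement, cl: ConsistencyLevel,
                             requiredReplica: Int, aliveReplica: Int,
                             nbRetry: Int): RetryDecision =
    RetryDecision.rethrow()

  override def init(cluster: Cluster): Unit = ()
  override def close(): Unit = ()
}
```
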
## Known Issues

### Tokens and Token Range Filters
We do not currently support methods that make use of tokens for filtering data, so please avoid APIs that perform table scans.

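For example, a token-based predicate such as the following is the kind of call to avoid (hypothetical keyspace, table, and key names, assuming an existing `SparkContext` named `sc`):

```scala
import com.datastax.spark.connector._

// Token-range predicates drive table scans, which the Cassandra API
// does not currently support, so avoid filters like this one.
val rows = sc.cassandraTable("mykeyspace", "mytable")
  .where("token(id) > ?", Long.MinValue)
```
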
## Resources
- [DataStax Spark Cassandra Connector](https://github.com/datastax/spark-cassandra-connector)
- [Cosmos DB Cassandra API](https://docs.microsoft.com/en-us/azure/cosmos-db/cassandra-introduction)
- [Apache Spark](https://spark.apache.org/docs/latest/index.html)
- [HDInsight Spark Cluster](https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-jupyter-spark-sql)

--------------------------------------------------------------------------------
/pom.xml:
--------------------------------------------------------------------------------