├── src
│   └── main
│       ├── resources
│       │   ├── META-INF
│       │   │   └── MANIFEST.MF
│       │   └── log4j.properties
│       └── scala
│           └── com
│               └── microsoft
│                   └── azure
│                       └── cosmosdb
│                           └── cassandra
│                               ├── CosmosDbMultipleRetryPolicy.scala
│                               ├── SampleCosmosDBApp.scala
│                               └── CosmosDbConnectionFactory.scala
├── CHANGELOG.md
├── .gitignore
├── .github
│   ├── ISSUE_TEMPLATE.md
│   └── PULL_REQUEST_TEMPLATE.md
├── LICENSE.md
├── CONTRIBUTING.md
├── README.md
└── pom.xml

/src/main/resources/META-INF/MANIFEST.MF:
--------------------------------------------------------------------------------
Manifest-Version: 1.0
Main-Class: com.microsoft.azure.cosmosdb.cassandra.SampleCosmosDBApp

--------------------------------------------------------------------------------
/CHANGELOG.md:
--------------------------------------------------------------------------------
## [project-title] Changelog


# x.y.z (yyyy-mm-dd)

*Features*
* ...

*Bug Fixes*
* ...

*Breaking Changes*
* ...

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# Generated class files/logs
*.class
*.log

# Build output
target/

# IntelliJ
*.iml
.idea/

# BlueJ files
*.ctxt

# Mobile Tools for Java (J2ME)
.mtj.tmp/

# Package files
*.jar
*.war
*.ear
*.zip
*.tar.gz
*.rar

# Virtual machine crash logs, see http://www.java.com/en/download/help/error_hotspot.xml
hs_err_pid*

# macOS
.DS_Store

# Azure Functions
local.settings.json
bin/
obj/

# Metals / Eclipse / VS Code
.metals/metals.h2.db
.metals/metals.lock.db
.classpath
.factorypath
.project
.settings/org.eclipse.jdt.core.prefs
.settings/org.eclipse.jdt.apt.core.prefs
.settings/org.eclipse.m2e.core.prefs
.vscode/settings.json

--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE.md:
--------------------------------------------------------------------------------
> Please provide us with the following information:
> ---------------------------------------------------------------

### This issue is for a: (mark with an `x`)
```
- [ ] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)
```

### Minimal steps to reproduce
>

### Any log messages given by the failure
>

### Expected/desired behavior
>

### OS and Version?
> Windows 7, 8 or 10. Linux (which distribution). macOS (Yosemite? El Capitan? Sierra?)

### Versions
>

### Mention any other details that might be useful

> ---------------------------------------------------------------
> Thanks! We'll be in touch soon.

--------------------------------------------------------------------------------
/.github/PULL_REQUEST_TEMPLATE.md:
--------------------------------------------------------------------------------
## Purpose

* ...

## Does this introduce a breaking change?

```
[ ] Yes
[ ] No
```

## Pull Request Type
What kind of change does this Pull Request introduce?

```
[ ] Bugfix
[ ] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:
```

## How to Test
* Get the code

```
git clone [repo-address]
cd [repo-name]
git checkout [branch-name]
mvn clean package
```

* Test the code

```
```

## What to Check
Verify that the following are valid
* ...

## Other Information

--------------------------------------------------------------------------------
/LICENSE.md:
--------------------------------------------------------------------------------
MIT License

Copyright (c) Microsoft Corporation. All rights reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/src/main/resources/log4j.properties:
--------------------------------------------------------------------------------
# Set everything to be logged to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Set the default spark-shell log level to WARN. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=WARN

# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark_project.jetty=WARN
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO

# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR

# Parquet related logging
log4j.logger.org.apache.parquet.CorruptStatistics=ERROR
log4j.logger.parquet.CorruptStatistics=ERROR

--------------------------------------------------------------------------------
/src/main/scala/com/microsoft/azure/cosmosdb/cassandra/CosmosDbMultipleRetryPolicy.scala:
--------------------------------------------------------------------------------
/**
 *
 * Copyright (c) Microsoft. All rights reserved.
 *
 */

package com.microsoft.azure.cosmosdb.cassandra

import com.datastax.driver.core.exceptions._
import com.datastax.driver.core.policies.RetryPolicy.RetryDecision
import com.datastax.driver.core.{ConsistencyLevel, Statement}
import com.datastax.spark.connector.cql.MultipleRetryPolicy


/**
 * This retry policy extends MultipleRetryPolicy and additionally performs retries with back-offs
 * for overloaded exceptions. For more details, refer to the "Retry Policy" section of README.md.
 */
class CosmosDbMultipleRetryPolicy(maxRetryCount: Int)
  extends MultipleRetryPolicy(maxRetryCount) {

  /**
   * The retry policy performs growing/fixed back-offs for overloaded exceptions based on the max retries:
   * 1. If maxRetryCount == -1, i.e., retry infinitely, we follow a fixed back-off of 5 seconds on each retry.
   * 2. If maxRetryCount is any positive number n, we follow a growing back-off of (nbRetry * 1) seconds,
   *    where nbRetry is the number of retries already performed.
   * If you'd like to modify the back-off intervals, update GrowingBackOffTimeMs and FixedBackOffTimeMs accordingly.
   */
  val GrowingBackOffTimeMs: Int = 1000
  val FixedBackOffTimeMs: Int = 5000

  // scalastyle:off null
  private def retryManyTimesWithBackOffOrThrow(nbRetry: Int): RetryDecision = maxRetryCount match {
    case -1 =>
      // Retry indefinitely, sleeping a fixed 5 seconds between attempts.
      Thread.sleep(FixedBackOffTimeMs)
      RetryDecision.retry(null)
    case maxRetries =>
      if (nbRetry < maxRetries) {
        // Back-off grows linearly with the retry count (nbRetry * 1 s).
        Thread.sleep(GrowingBackOffTimeMs * nbRetry)
        RetryDecision.retry(null)
      } else {
        RetryDecision.rethrow()
      }
  }

  override def init(cluster: com.datastax.driver.core.Cluster): Unit = {}
  override def close(): Unit = {}

  override def onRequestError(
      stmt: Statement,
      cl: ConsistencyLevel,
      ex: DriverException,
      nbRetry: Int): RetryDecision = {
    ex match {
      // Cosmos DB surfaces request rate limiting (HTTP 429) as OverloadedException.
      case _: OverloadedException => retryManyTimesWithBackOffOrThrow(nbRetry)
      case _ => RetryDecision.rethrow()
    }
  }
}

--------------------------------------------------------------------------------
/src/main/scala/com/microsoft/azure/cosmosdb/cassandra/SampleCosmosDBApp.scala:
--------------------------------------------------------------------------------
/**
 *
 * Copyright (c) Microsoft. All rights reserved.
 *
 */

package com.microsoft.azure.cosmosdb.cassandra

import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

import scala.util.Random

object SampleCosmosDBApp extends Serializable {

  // GENERATE RANDOM DATA: produces (end - start) rows across DoP partitions,
  // with random ids and column values bounded by col1/col2.
  def randomDataPerPartitionId(sc: SparkContext, DoP: Int, start: Int, end: Int, col1: Int, col2: Int): RDD[(Int, Int, Int, String)] = {
    sc.parallelize(start until end, DoP).map { _ =>
      val id1Val: Int = Random.nextInt(col1)
      val id2Val: Int = Random.nextInt(col2)
      val col1Val: Int = Random.nextInt(col1)
      val col2Val: String = s"text_${Random.nextInt(col2)}"
      (id1Val, id2Val, col1Val, col2Val)
    }
  }

  // MAIN
  def main(arg: Array[String]): Unit = {

    // CONFIG. *NOTE*: Please read the README.md for more details regarding each conf value.
    val conf = new SparkConf(true)
      .setAppName("SampleCosmosDBCassandraApp")
      // Cosmos DB Cassandra API connection configs
      .set("spark.cassandra.connection.host", "") // e.g. ACCOUNT_NAME.cassandra.cosmosdb.azure.com
      .set("spark.cassandra.connection.port", "10350")
      .set("spark.cassandra.connection.ssl.enabled", "true")
      .set("spark.cassandra.auth.username", "COSMOSDB_ACCOUNTNAME")
      .set("spark.cassandra.auth.password", "COSMOSDB_KEY")
      // Parallelism and throughput configs.
      .set("spark.cassandra.output.batch.size.rows", "1")
      // *NOTE*: The values below are meant as defaults for a sample workload. Please read the README.md for more information on fine-tuning these conf values.
      .set("spark.cassandra.connection.connections_per_executor_max", "10")
      .set("spark.cassandra.output.concurrent.writes", "100")
      .set("spark.cassandra.concurrent.reads", "512")
      .set("spark.cassandra.output.batch.grouping.buffer.size", "1000")
      .set("spark.cassandra.connection.keep_alive_ms", "60000")
      // Cosmos DB connection factory, configured with a retry policy for rate limiting.
      .set("spark.cassandra.connection.factory", "com.microsoft.azure.cosmosdb.cassandra.CosmosDbConnectionFactory")


    // SPARK CONTEXT
    val sc = new SparkContext(conf)

    // CREATE KEYSPACE/TABLE, AND ANY ARBITRARY QUERY STRING.
    CassandraConnector(conf).withSessionDo { session =>
      session.execute("CREATE KEYSPACE kspc WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 }")
      session.execute("CREATE TABLE kspc.tble (id1 int, id2 int, col1 int, col2 text, PRIMARY KEY(id1, id2))")
    }

    // INSERT DATA
    val collection = sc.parallelize(Seq((1, 1, 1, "text_1"), (2, 2, 2, "text_2")))
    collection.saveToCassandra("kspc", "tble", SomeColumns("id1", "id2", "col1", "col2"))

    // INSERT GENERATED DATA. *NOTE*: targets keyspace/table "large"."large",
    // which must already exist with the same schema as kspc.tble.
    randomDataPerPartitionId(sc, DoP = 10, start = 0, end = 10000, col1 = 100000, col2 = 100000).
      saveToCassandra("large", "large", SomeColumns("id1", "id2", "col1", "col2"))

    // SELECT
    val rdd1 = sc.cassandraTable("kspc", "tble").select("id1", "id2", "col1", "col2").where("id1 = ?", 1)

    // SELECT and print rows
    sc.cassandraTable("kspc", "tble").select("id1", "id2", "col1", "col2").where("id1 = ?", 2).collect().foreach(println)

    sc.stop()
  }
}

--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
# Contributing to [project-title]

This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions
provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.

- [Code of Conduct](#coc)
- [Issues and Bugs](#issue)
- [Feature Requests](#feature)
- [Submission Guidelines](#submit)

## Code of Conduct
Help us keep this project open and inclusive. Please read and follow our [Code of Conduct](https://opensource.microsoft.com/codeofconduct/).

## Found an Issue?
If you find a bug in the source code or a mistake in the documentation, you can help us by
[submitting an issue](#submit-issue) to the GitHub Repository. Even better, you can
[submit a Pull Request](#submit-pr) with a fix.

## Want a Feature?
You can *request* a new feature by [submitting an issue](#submit-issue) to the GitHub
Repository. If you would like to *implement* a new feature, please submit an issue with
a proposal for your work first, to be sure that we can use it.

* **Small Features** can be crafted and directly [submitted as a Pull Request](#submit-pr).

## Submission Guidelines

### Submitting an Issue
Before you submit an issue, search the archive; your question may already have been answered.

If your issue appears to be a bug and hasn't been reported, open a new issue.
Help us maximize the effort we can spend fixing issues and adding new
features by not reporting duplicate issues. Providing the following information will increase the
chances of your issue being dealt with quickly:

* **Overview of the Issue** - if an error is being thrown, a non-minified stack trace helps
* **Version** - what version is affected (e.g. 0.1.2)
* **Motivation for or Use Case** - explain what you are trying to do and why the current behavior is a bug for you
* **Browsers and Operating System** - is this a problem with all browsers?
* **Reproduce the Error** - provide a live example or an unambiguous set of steps
* **Related Issues** - has a similar issue been reported before?
* **Suggest a Fix** - if you can't fix the bug yourself, perhaps you can point to what might be
  causing the problem (line of code or commit)

You can file new issues by providing the above information at the corresponding repository's issues link: https://github.com/[organization-name]/[repository-name]/issues/new.

### Submitting a Pull Request (PR)
Before you submit your Pull Request (PR) consider the following guidelines:

* Search the repository (https://github.com/[organization-name]/[repository-name]/pulls) for an open or closed PR
  that relates to your submission. You don't want to duplicate effort.

* Make your changes in a new git fork:

  * Commit your changes using a descriptive commit message
  * Push your fork to GitHub
  * In GitHub, create a pull request
* If we suggest changes then:
  * Make the required updates.
  * Rebase your fork and force push to your GitHub repository (this will update your Pull Request):

    ```shell
    git rebase master -i
    git push -f
    ```

That's it! Thank you for your contribution!

--------------------------------------------------------------------------------
/src/main/scala/com/microsoft/azure/cosmosdb/cassandra/CosmosDbConnectionFactory.scala:
--------------------------------------------------------------------------------
/**
 *
 * Copyright (c) Microsoft. All rights reserved.
 *
 */
package com.microsoft.azure.cosmosdb.cassandra

import java.nio.file.{Files, Path, Paths}
import java.security.{KeyStore, SecureRandom}
import javax.net.ssl.{KeyManagerFactory, SSLContext, TrustManagerFactory}

import com.datastax.driver.core._
import com.datastax.driver.core.policies.ExponentialReconnectionPolicy
import com.datastax.spark.connector.cql.CassandraConnectorConf.CassandraSSLConf
import com.datastax.spark.connector.cql._
import org.apache.commons.io.IOUtils

object CosmosDbConnectionFactory extends CassandraConnectionFactory {

  /** Returns the Cluster.Builder object used to set up the Cluster instance. */
  def clusterBuilder(conf: CassandraConnectorConf): Cluster.Builder = {
    val options = new SocketOptions()
      .setConnectTimeoutMillis(conf.connectTimeoutMillis)
      .setReadTimeoutMillis(conf.readTimeoutMillis)

    val builder = Cluster.builder()
      .addContactPoints(conf.hosts.toSeq: _*)
      .withPort(conf.port)
      /**
       * Make use of the custom retry policy for Cosmos DB. This is needed for retrying scenarios specific to Cosmos DB.
       * Please refer to the "Retry Policy" section of the README.md for more information regarding this.
       */
      .withRetryPolicy(
        new CosmosDbMultipleRetryPolicy(conf.queryRetryCount))
      .withReconnectionPolicy(
        new ExponentialReconnectionPolicy(conf.minReconnectionDelayMillis, conf.maxReconnectionDelayMillis))
      .withLoadBalancingPolicy(
        new LocalNodeFirstLoadBalancingPolicy(conf.hosts, conf.localDC))
      .withAuthProvider(conf.authConf.authProvider)
      .withSocketOptions(options)
      .withCompression(conf.compression)
      .withQueryOptions(
        new QueryOptions()
          .setRefreshNodeIntervalMillis(0)
          .setRefreshNodeListIntervalMillis(0)
          .setRefreshSchemaIntervalMillis(0))

    if (conf.cassandraSSLConf.enabled) {
      maybeCreateSSLOptions(conf.cassandraSSLConf) match {
        case Some(sslOptions) ⇒ builder.withSSL(sslOptions)
        case None ⇒ builder.withSSL()
      }
    } else {
      builder
    }
  }

  private def getKeyStore(
      ksType: String,
      ksPassword: Option[String],
      ksPath: Option[Path]): Option[KeyStore] = {

    ksPath match {
      case Some(path) =>
        val ksIn = Files.newInputStream(path)
        try {
          val keyStore = KeyStore.getInstance(ksType)
          keyStore.load(ksIn, ksPassword.map(_.toCharArray).orNull)
          Some(keyStore)
        } finally {
          IOUtils.closeQuietly(ksIn)
        }
      case None => None
    }
  }

  private def maybeCreateSSLOptions(conf: CassandraSSLConf): Option[SSLOptions] = {
    lazy val trustStore =
      getKeyStore(conf.trustStoreType, conf.trustStorePassword, conf.trustStorePath.map(Paths.get(_)))
    lazy val keyStore =
      getKeyStore(conf.keyStoreType, conf.keyStorePassword, conf.keyStorePath.map(Paths.get(_)))

    if (conf.enabled) {
      val trustManagerFactory = for (ts <- trustStore) yield {
        val tmf = TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm)
        tmf.init(ts)
        tmf
      }

      val keyManagerFactory = if (conf.clientAuthEnabled) {
        for (ks <- keyStore) yield {
          val kmf = KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm)
          kmf.init(ks, conf.keyStorePassword.map(_.toCharArray).orNull)
          kmf
        }
      } else {
        None
      }

      val context = SSLContext.getInstance(conf.protocol)
      context.init(
        keyManagerFactory.map(_.getKeyManagers).orNull,
        trustManagerFactory.map(_.getTrustManagers).orNull,
        new SecureRandom)

      Some(
        JdkSSLOptions.builder()
          .withSSLContext(context)
          .withCipherSuites(conf.enabledAlgorithms.toArray)
          .build())
    } else {
      None
    }
  }

  /** Creates and configures the Cassandra connection. */
  override def createCluster(conf: CassandraConnectorConf): Cluster = {
    clusterBuilder(conf).build()
  }

}

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
---
page_type: sample
languages:
- scala
products:
- azure
description: "This Maven project provides samples and best practices for using the DataStax Spark Cassandra Connector against Azure Cosmos DB's Cassandra API."
urlFragment: azure-cosmos-db-cassandra-api-spark-connector-sample
---

# Azure Cosmos DB Cassandra API - DataStax Spark Connector Sample

This Maven project provides samples and best practices for using the [DataStax Spark Cassandra Connector](https://github.com/datastax/spark-cassandra-connector) against [Azure Cosmos DB's Cassandra API](https://docs.microsoft.com/azure/cosmos-db/cassandra-introduction).
For the purposes of providing an end-to-end sample, we've made use of an [Azure HDI Spark Cluster](https://docs.microsoft.com/azure/hdinsight/spark/apache-spark-jupyter-spark-sql) to run the Spark jobs provided in the example.
All samples are in Scala and built with Maven.

*Note - this sample is configured against version 2.4.3 of the Spark connector (see pom.xml).*

## Running this Sample

### Prerequisites
- Cosmos DB account configured with the Cassandra API
- Spark cluster

## Quick Start
Submitting Spark jobs is not covered as part of this sample; please refer to Apache Spark's [documentation](https://spark.apache.org/docs/latest/submitting-applications.html).
To run this sample, configure it for your cluster (as discussed below), build the project to generate the required jar(s), and then submit the job to your Spark cluster.

## Cassandra API Connection Parameters
In order for your Spark jobs to connect to Cosmos DB's Cassandra API, you must set the following configurations:

*Note - all these values can be found on the ["Connection String" blade](https://docs.microsoft.com/azure/cosmos-db/manage-account#keys) of your Cosmos DB account*

| Property Name | Value |
| --- | --- |
| `spark.cassandra.connection.host` | Your Cassandra endpoint: `ACCOUNT_NAME.cassandra.cosmosdb.azure.com` |
| `spark.cassandra.connection.port` | `10350` |
| `spark.cassandra.connection.ssl.enabled` | `true` |
| `spark.cassandra.auth.username` | `COSMOSDB_ACCOUNTNAME` |
| `spark.cassandra.auth.password` | `COSMOSDB_KEY` |
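
For reference, here is a minimal sketch of wiring these parameters into a `SparkConf`, mirroring the sample app; the host, username, and key values are placeholders that must be replaced with your own account's values:

```scala
import org.apache.spark.SparkConf

// Minimal sketch: Cosmos DB Cassandra API connection settings.
// ACCOUNT_NAME / COSMOSDB_ACCOUNTNAME / COSMOSDB_KEY are placeholders.
val connConf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "ACCOUNT_NAME.cassandra.cosmosdb.azure.com")
  .set("spark.cassandra.connection.port", "10350")
  .set("spark.cassandra.connection.ssl.enabled", "true")
  .set("spark.cassandra.auth.username", "COSMOSDB_ACCOUNTNAME")
  .set("spark.cassandra.auth.password", "COSMOSDB_KEY")
```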

## Configurations for Throughput Optimization
Because Cosmos DB follows a provisioned throughput model, it is important to tune the relevant configurations of the connector to optimize for this model.
General information regarding these configurations can be found on the [Configuration Reference](https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md) page of the DataStax Spark Cassandra Connector GitHub repository.

| Property Name | Description |
| --- | --- |
| `spark.cassandra.output.batch.size.rows` | Leave this at 1. This is preferred for Cosmos DB's provisioning model in order to achieve higher throughput for heavy workloads. |
| `spark.cassandra.connection.connections_per_executor_max` | 10*n, which is equivalent to 10 connections per node in an n-node Cassandra cluster. So if you require 5 connections per node per executor for a 5-node Cassandra cluster, set this configuration to 25. (Modify based on the degree of parallelism/number of executors that your Spark jobs are configured for.) |
| `spark.cassandra.output.concurrent.writes` | 100. Defines the number of parallel writes that can occur per executor. Because batch.size.rows is 1, make sure to scale up this value accordingly. (Modify based on the degree of parallelism/throughput that you want to achieve for your workload.) |
| `spark.cassandra.concurrent.reads` | 512. Defines the number of parallel reads that can occur per executor. (Modify based on the degree of parallelism/throughput that you want to achieve for your workload.) |
| `spark.cassandra.output.throughput_mb_per_sec` | Defines the total write throughput per executor. Use it as an upper cap on your Spark job's throughput, based on the provisioned throughput of your Cosmos DB collection. |
| `spark.cassandra.input.reads_per_sec` | Defines the total read throughput per executor. Use it as an upper cap on your Spark job's throughput, based on the provisioned throughput of your Cosmos DB collection. |
| `spark.cassandra.output.batch.grouping.buffer.size` | 1000 |
| `spark.cassandra.connection.keep_alive_ms` | 60000 |
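
As a sketch, these knobs all end up on the same `SparkConf`. The numbers below are the sample's defaults, except the two caps (`throughput_mb_per_sec` and `reads_per_sec`), which are illustrative values only; derive your own from the RU/s provisioned on your collection:

```scala
import org.apache.spark.SparkConf

// Illustrative sketch of the throughput-related settings from the table above.
// The two cap values are hypothetical; tune them to your provisioned RU/s.
val throughputConf = new SparkConf(true)
  .set("spark.cassandra.output.batch.size.rows", "1")
  .set("spark.cassandra.connection.connections_per_executor_max", "10")
  .set("spark.cassandra.output.concurrent.writes", "100")
  .set("spark.cassandra.concurrent.reads", "512")
  .set("spark.cassandra.output.throughput_mb_per_sec", "5")   // hypothetical cap
  .set("spark.cassandra.input.reads_per_sec", "1000")         // hypothetical cap
  .set("spark.cassandra.output.batch.grouping.buffer.size", "1000")
  .set("spark.cassandra.connection.keep_alive_ms", "60000")
```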

Regarding throughput and degree of parallelism, it is important to tune these parameters based on the load you expect from your upstream/downstream flows, the executors provisioned for your Spark jobs, and the throughput you have provisioned for your Cosmos DB account.

## Connection Factory Configuration and Retry Policy
As part of this sample, we have provided a connection factory and a custom retry policy for Cosmos DB. A custom connection factory is needed because that is the only way to configure a retry policy on the connector - [SPARKC-437](https://datastax-oss.atlassian.net/browse/SPARKC-437).
* CosmosDbConnectionFactory.scala
* CosmosDbMultipleRetryPolicy.scala

### Retry Policy
The retry policy for Cosmos DB is configured to handle HTTP status code 429 ("Request Rate Too Large") exceptions. The Cosmos DB Cassandra API translates these exceptions into overloaded errors on the Cassandra native protocol, which we retry with back-offs: with a positive max retry count the back-off grows linearly with the number of retries already performed (1 second per prior retry), while a max retry count of -1 retries indefinitely with a fixed 5-second back-off.
Because Cosmos DB follows a provisioned throughput model, this retry policy protects your Spark jobs against spikes of data ingress/egress that momentarily exceed the throughput allocated to your collection, which would otherwise surface as request rate limiting exceptions.

*Note - this retry policy is only meant to protect your Spark jobs against momentary spikes. If you have not configured enough RUs on your collection for the intended throughput of your workload, so that the retries can't catch up, the retry policy will end in rethrows.*

## Known Issues

### Tokens and Token Range Filters
We do not currently support methods that make use of tokens for filtering data, so please avoid any APIs that perform table scans; restrict reads by partition key instead, as in the sketch below.
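
A minimal sketch of the preferred pattern, assuming the `kspc.tble` schema created by SampleCosmosDBApp (`sc` is a configured SparkContext; the helper name is illustrative):

```scala
import com.datastax.spark.connector._
import org.apache.spark.SparkContext

// Reads restricted to a single partition: the `where` predicate on the
// partition key (id1) is pushed down to the server, instead of scanning
// the whole table by token ranges.
def readPartition(sc: SparkContext, id1: Int) =
  sc.cassandraTable("kspc", "tble")
    .select("id1", "id2", "col1", "col2")
    .where("id1 = ?", id1)
```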

## Resources
- [DataStax Spark Cassandra Connector](https://github.com/datastax/spark-cassandra-connector)
- [CosmosDB Cassandra API](https://docs.microsoft.com/en-us/azure/cosmos-db/cassandra-introduction)
- [Apache Spark](https://spark.apache.org/docs/latest/index.html)
- [HDI Spark Cluster](https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-jupyter-spark-sql)

--------------------------------------------------------------------------------
/pom.xml:
--------------------------------------------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.microsoft.azure.cosmosdb</groupId>
    <artifactId>azure-cosmos-cassandra-spark-helper</artifactId>
    <version>1.2.0</version>
    <name>${project.groupId}:${project.artifactId}</name>
    <description>Cassandra Api Spark Connector Helper for Microsoft Azure CosmosDB</description>
    <url>http://azure.microsoft.com/en-us/services/documentdb/</url>

    <licenses>
        <license>
            <name>MIT License</name>
            <url>http://www.opensource.org/licenses/mit-license.php</url>
        </license>
    </licenses>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.4.4</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.4.4</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>2.4.4</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.11</artifactId>
            <version>2.4.4</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>com.datastax.spark</groupId>
            <artifactId>spark-cassandra-connector_2.11</artifactId>
            <version>2.4.3</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.2</version>
                <configuration>
                    <scalaVersion>2.11.12</scalaVersion>
                    <recompileMode>incremental</recompileMode>
                </configuration>
                <executions>
                    <execution>
                        <id>scala-compile-first</id>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>add-source</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>scala-compile</id>
                        <phase>compile</phase>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>scala-testCompile</id>
                        <phase>test-compile</phase>
                        <goals>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>scala-doc</id>
                        <phase>prepare-package</phase>
                        <goals>
                            <goal>doc</goal>
                            <goal>doc-jar</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-source-plugin</artifactId>
                <version>2.2.1</version>
                <executions>
                    <execution>
                        <id>attach-sources</id>
                        <goals>
                            <goal>jar-no-fork</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.0.0</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <relocations>
                                <relocation>
                                    <pattern>com.microsoft.azure.cosmosdb.cassandra</pattern>
                                    <shadedPattern>cosmosdb_cassandra_connector_shaded.com.microsoft.azure.cosmosdb.cassandra</shadedPattern>
                                    <includes>
                                        <include>com.microsoft.azure.cosmosdb.cassandra.*</include>
                                    </includes>
                                </relocation>
                                <relocation>
                                    <pattern>com.datastax</pattern>
                                    <shadedPattern>cosmosdb_connector_shaded.com.datastax</shadedPattern>
                                </relocation>
                            </relocations>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/services/javax.annotation.processing.Processor</exclude>
                                        <exclude>META-INF/*.MF</exclude>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                                <filter>
                                    <artifact>commons-logging:commons-logging</artifact>
                                    <excludes>
                                        <exclude>**</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <artifactSet>
                                <excludes>
                                    <exclude>org.tachyonproject:tachyon-client</exclude>
                                    <exclude>org.apache.hadoop:*</exclude>
                                    <exclude>org.apache.spark:*</exclude>
                                    <exclude>org.scala-lang:*</exclude>
                                    <exclude>org.apache.tinkerpop:*</exclude>
                                </excludes>
                            </artifactSet>
                            <shadedArtifactAttached>true</shadedArtifactAttached>
                            <finalName>${project.artifactId}-${project.version}-uber</finalName>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.5</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>

    <developers>
        <developer>
            <name>CosmosDB Programmability Devs</name>
            <email>askcosmosdb@microsoft.com</email>
            <organization>Microsoft</organization>
            <organizationUrl>http://www.microsoft.com/</organizationUrl>
        </developer>
    </developers>

    <scm>
        <connection>scm:git:git@github.com:Azure-Samples/azure-cosmos-db-cassandra-api-spark-connector-sample.git</connection>
        <developerConnection>scm:git:git@github.com:Azure-Samples/azure-cosmos-db-cassandra-api-spark-connector-sample.git</developerConnection>
        <url>git@github.com:Azure-Samples/azure-cosmos-db-cassandra-api-spark-connector-sample.git</url>
    </scm>
</project>
--------------------------------------------------------------------------------