Writing Test Runs
Below are the results of some writing test runs using the different Spark to Cosmos DB connector methods.
The Cosmos DB configurations used:
- Partitioned Collection: 250,000 RUs, 50 partitions
- Collection 1: customerfactors_part, 110GB with 19.80M documents
- Collection 2: customerfactors, 462.44GB with 64.00M documents
The Apache Spark configurations used are listed below; a sketch of how one of these executor configurations maps to standard Spark settings follows the list.
- HDI Cluster: Spark 2.1 on HDI 3.5, multi-node Spark cluster, 2 masters, 2 ZooKeepers, driver memory 10g
- Cluster 1: 4 workers (8 cores, 28GB memory), 32 cores, 128GB memory
  - Config 1.1: executor: 7 cores, 22GB memory
  - Config 1.2: executor: 2 cores, 6GB memory
- Cluster 2: 9 workers (8 cores, 56GB memory), 72 cores, 504GB memory
  - Config 2.1: executor: 7 cores, 22GB memory
- Cluster 3: 14 workers (4 cores, 28GB memory), 56 cores, 392GB memory
  - Config 3.1: executor: 3 cores, 22GB memory
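For reference, a configuration like Config 1.1 (7 executor cores, 22GB executor memory, 10g driver memory) can be expressed through standard Spark settings. This is a minimal sketch only; the application name is hypothetical, and the original runs may have supplied these values via spark-submit or the HDI cluster configuration instead.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch of Config 1.1 on Cluster 1, expressed as standard Spark settings.
val spark = SparkSession.builder()
  .appName("cosmosdb-writing-test")        // hypothetical application name
  .config("spark.executor.cores", "7")     // Config 1.1: 7 cores per executor
  .config("spark.executor.memory", "22g")  // Config 1.1: 22GB per executor
  .config("spark.driver.memory", "10g")    // documented value; normally set at launch time (e.g. spark-submit)
  .getOrCreate()
```

With spark-submit, these values correspond to the --executor-cores, --executor-memory, and --driver-memory options.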
Data migration: reading from a Cosmos DB collection and writing it directly to another Cosmos DB collection:
```scala
import com.microsoft.azure.cosmosdb.spark.schema._
import com.microsoft.azure.cosmosdb.spark._
import com.microsoft.azure.cosmosdb.spark.config.Config

// Read configuration for the source collection
val readConfig = Config(Map(
  "Endpoint" -> "",
  "Masterkey" -> "",
  "Database" -> "writingtests",
  "Collection" -> "customerfactors",
  "PageSize" -> "2000000",
  "preferredRegions" -> "North Europe;",
  "SamplingRatio" -> "1.0"))

val readDf = spark.sqlContext.read.cosmosDB(readConfig)

// Write configuration for the target (shadow) collection
val writeConfig = Config(Map(
  "Endpoint" -> "",
  "Masterkey" -> "",
  "Database" -> "writingtests",
  "Collection" -> "customerfactors_shadow",
  "WritingBatchSize" -> "500",
  "preferredRegions" -> "North Europe;",
  "SamplingRatio" -> "1.0"))

// Write the source DataFrame directly to the target collection
readDf.write.cosmosDB(writeConfig)
```
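As a quick sanity check after a migration run, the same connector API can be used to read the target collection back and compare document counts. This is an assumed follow-up step, not part of the recorded test runs:

```scala
// Assumed verification step: read the shadow collection back and compare
// document counts with the source DataFrame.
val verifyConfig = Config(Map(
  "Endpoint" -> "",
  "Masterkey" -> "",
  "Database" -> "writingtests",
  "Collection" -> "customerfactors_shadow",
  "SamplingRatio" -> "1.0"))

val verifyDf = spark.sqlContext.read.cosmosDB(verifyConfig)
println(s"source count = ${readDf.count()}, target count = ${verifyDf.count()}")
```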
Below are the writing performance results for the various setups.
Cluster | Configuration | Collection | # of rows | Writing batch size | Run time (hh:mm:ss) | Avg. task time (hh:mm:ss) | Batches per executor | Peak RU/s | End-to-end throughput |
---|---|---|---|---|---|---|---|---|---|
1 | 1.1 | 2 | 64M | 500 | 01:18:00 | 00:35:00 | 2 | 120k | 13782 |
1 | 1.1 | 1 | 19M | 1000 | 00:18:00 | 00:09:06 | 2 | | 18333 |
1 | 1.2 | 1 | 19M | 500 | 01:42:00 | | 2 | | 3235 |
1 | 1.2 | 1 | 19M | 2500 | 02:06:00 | 01:00:00 | 2 | 70k | 2619 |
2 | 2.1 | 1 | 19M | 1000 | 00:09:12 | 00:09:12 | 1 | 120k | 35869 |
3 | 3.1 | 1 | 19M | 1000 | 00:18:00 | 00:09:30 | 2 | 200k | 18333 |
3 | 3.1 | 2 | 64M | 1000 | 01:18:00 | 00:30:00 | 2 | 200k | 13782 |
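The end-to-end throughput column appears to be documents written per second, i.e. the number of documents divided by the total run time; the table does not state this explicitly, so treat it as an inference. A quick check against the Collection 1 / Config 1.1 row:

```scala
// Inferred throughput formula: documents / run time in seconds.
// Collection 1 holds ~19.80M documents and the run took 00:18:00.
val documents = 19800000L
val runTimeSeconds = 18 * 60
val docsPerSecond = documents / runTimeSeconds  // ≈ 18,333, matching the table
```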