Writing Test Runs
Below are the results of some writing test runs using the different Spark to Cosmos DB connector methods.
The Cosmos DB configurations used:
- Partitioned Collection: 250,000 RUs, 50 partitions
- Collection 1: customerfactors_part, 110GB with 19.80M documents
- Collection 2: customerfactors, 462.44GB with 64.00M documents
The Apache Spark configurations used are listed below; a sketch of how one of these executor configurations maps to standard Spark settings follows the list.
- HDI Cluster: Spark 2.1 on HDI 3.5, multi-node Spark cluster, 2 masters, 2 ZooKeepers, driver memory 10g
- Cluster 1: 4 workers (8 cores, 28GB memory), 32 cores, 128GB memory
  - Config 1.1: executor: 7 cores, 22GB memory
  - Config 1.2: executor: 2 cores, 6GB memory
- Cluster 2: 9 workers (8 cores, 56GB memory), 72 cores, 504GB memory
  - Config 2.1: executor: 7 cores, 22GB memory
- Cluster 3: 14 workers (4 cores, 28GB memory), 56 cores, 392GB memory
  - Config 3.1: executor: 3 cores, 22GB memory
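For reference, a configuration like Config 1.1 (7 executor cores, 22GB executor memory, 10g driver memory) can be expressed through standard Spark settings. This is a minimal sketch only; the application name is hypothetical, and the original runs may have supplied these values via spark-submit or the HDI cluster configuration instead.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch of Config 1.1 on Cluster 1, expressed as standard Spark settings.
val spark = SparkSession.builder()
  .appName("cosmosdb-writing-test")        // hypothetical application name
  .config("spark.executor.cores", "7")     // Config 1.1: 7 cores per executor
  .config("spark.executor.memory", "22g")  // Config 1.1: 22GB per executor
  .config("spark.driver.memory", "10g")    // documented value; normally set at launch time (e.g. spark-submit)
  .getOrCreate()
```

With spark-submit, these values correspond to the --executor-cores, --executor-memory, and --driver-memory options.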
Data migration: reading from a Cosmos DB collection and writing it directly to another Cosmos DB collection:
```scala
import com.microsoft.azure.cosmosdb.spark.schema._
import com.microsoft.azure.cosmosdb.spark._
import com.microsoft.azure.cosmosdb.spark.config.Config

// Read configuration for the source collection
val readConfig = Config(Map(
  "Endpoint" -> "",
  "Masterkey" -> "",
  "Database" -> "writingtests",
  "Collection" -> "customerfactors",
  "PageSize" -> "2000000",
  "preferredRegions" -> "North Europe;",
  "SamplingRatio" -> "1.0"))

val readDf = spark.sqlContext.read.cosmosDB(readConfig)

// Write configuration for the target (shadow) collection
val writeConfig = Config(Map(
  "Endpoint" -> "",
  "Masterkey" -> "",
  "Database" -> "writingtests",
  "Collection" -> "customerfactors_shadow",
  "WritingBatchSize" -> "500",
  "preferredRegions" -> "North Europe;",
  "SamplingRatio" -> "1.0"))

// Write the source DataFrame directly to the target collection
readDf.write.cosmosDB(writeConfig)
```
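As a quick sanity check after a migration run, the same connector API can be used to read the target collection back and compare document counts. This is an assumed follow-up step, not part of the recorded test runs:

```scala
// Assumed verification step: read the shadow collection back and compare
// document counts with the source DataFrame.
val verifyConfig = Config(Map(
  "Endpoint" -> "",
  "Masterkey" -> "",
  "Database" -> "writingtests",
  "Collection" -> "customerfactors_shadow",
  "SamplingRatio" -> "1.0"))

val verifyDf = spark.sqlContext.read.cosmosDB(verifyConfig)
println(s"source count = ${readDf.count()}, target count = ${verifyDf.count()}")
```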
Below are the writing performance results for the various setups.
Cluster | Configuration | Collection | # of rows | Writing batch size | Run time (hh:mm:ss) | Avg. task time (hh:mm:ss) | Batches per executor | Peak RU/s | End-to-end throughput |
---|---|---|---|---|---|---|---|---|---|
1 | 1.1 | 2 | 64M | 500 | 01:18:00 | 00:35:00 | 2 | 120k | 13782 |
1 | 1.1 | 1 | 19M | 1000 | 00:18:00 | 00:09:06 | 2 | | 18333 |
1 | 1.2 | 1 | 19M | 500 | 01:42:00 | | 2 | | 3235 |
1 | 1.2 | 1 | 19M | 2500 | 02:06:00 | 01:00:00 | 2 | 70k | 2619 |
2 | 2.1 | 1 | 19M | 1000 | 00:09:12 | 00:09:12 | 1 | 120k | 35869 |
3 | 3.1 | 1 | 19M | 1000 | 00:18:00 | 00:09:30 | 2 | 200k | 18333 |
3 | 3.1 | 2 | 64M | 1000 | 01:18:00 | 00:30:00 | 2 | 200k | 13782 |
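The end-to-end throughput column appears to be documents written per second, i.e. the number of documents divided by the total run time; the table does not state this explicitly, so treat it as an inference. A quick check against the Collection 1 / Config 1.1 row:

```scala
// Inferred throughput formula: documents / run time in seconds.
// Collection 1 holds ~19.80M documents and the run took 00:18:00.
val documents = 19800000L
val runTimeSeconds = 18 * 60
val docsPerSecond = documents / runTimeSeconds  // ≈ 18,333, matching the table
```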