
Query Test Runs

Denny Lee edited this page May 23, 2017 · 6 revisions

Below are the results of some query test runs using the different Spark to Cosmos DB connector methods.

Cosmos DB Configurations Used for Query Tests

Below are the Cosmos DB configurations used:

  • Single-partition collections: 10,000 RU/s
      • airport.codes has 512 documents
      • DepartureDelays.flights has 1.05M documents (single collection)
  • Partitioned collection: 250,000 RU/s
      • DepartureDelays.flights_pcoll has 1.39M documents (partitioned collection)

Apache Spark Configurations Used for Query Tests

Below are the Apache Spark configurations used:

  • dev box: Single VM Spark cluster (one master, one worker) on Azure DS11 v2 VM (14GB RAM, 2 cores) running Ubuntu 16.04 LTS using Spark 2.1.
  • HDI Cluster: HDI 3.5 multi-node Spark cluster (2 masters, multiple workers, 3 zookeepers) using Spark 2.0.2.

Queries and Collections Used

The queries were:

  • Q1: SELECT c.City FROM c WHERE c.State='WA'
  • Q2a: SELECT TOP 100 c.date, c.delay, c.distance, c.origin, c.destination FROM c
  • Q2b: SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c WHERE c.origin = 'SEA' LIMIT 100
  • Q3: SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c WHERE c.origin = 'SEA'
  • Q4: SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c

Note that Q2a and Q2b differ slightly: Q2a runs through pyDocumentDB as a Cosmos DB SQL query (using TOP), while Q2b runs through azure-cosmosdb-spark as a Spark SQL query (using LIMIT).

The queries were run against the following collections:

  • C1: airport.codes
  • C2: DepartureDelays.flights
  • C3: DepartureDelays.flights_pcoll
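As a rough illustration of how a query such as Q3 could be issued against C3 through azure-cosmosdb-spark, the sketch below builds a read configuration and loads it as a DataFrame. It is a sketch under assumptions, not the setup used for these test runs: the endpoint and key are placeholders, the split of DepartureDelays.flights_pcoll into database and collection names is assumed, and the option names (Endpoint, Masterkey, Database, Collection, query_custom) should be checked against the connector version in use. The helper names build_read_config and count_rows are hypothetical. The sketch also pushes the query down as a Cosmos DB SQL query, whereas the runs above may instead have used Spark SQL over a loaded DataFrame.

```python
def build_read_config(endpoint, master_key, database, collection, query):
    """Return connector read options (option names are assumptions to verify)."""
    return {
        "Endpoint": endpoint,
        "Masterkey": master_key,
        "Database": database,
        "Collection": collection,
        # Cosmos DB SQL query pushed down to the service.
        "query_custom": query,
    }


def count_rows(spark, config):
    """Load the collection through the connector and return df.count().

    `spark` is an existing SparkSession with the connector on its classpath.
    """
    df = (
        spark.read.format("com.microsoft.azure.cosmosdb.spark")
        .options(**config)
        .load()
    )
    return df.count()


Q3 = (
    "SELECT c.date, c.delay, c.distance, c.origin, c.destination "
    "FROM c WHERE c.origin = 'SEA'"
)

# Placeholder endpoint and key; substitute real account values before running.
config = build_read_config(
    endpoint="https://<account>.documents.azure.com:443/",
    master_key="<master-key>",
    database="DepartureDelays",
    collection="flights_pcoll",
    query=Q3,
)
```

With a live account and a SparkSession named spark, count_rows(spark, config) would return the row count for the query.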

The results below are from executing df.count() for each query.
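A minimal sketch of how first and second response times like these could be captured is shown here. The helper time_runs is hypothetical and is not the harness used for these runs; the stand-in action simply counts a range so the example runs without Spark or Cosmos DB.

```python
import time
from datetime import timedelta


def time_runs(action, runs=2):
    """Call `action` `runs` times; return (last result, list of timedelta)."""
    elapsed = []
    result = None
    for _ in range(runs):
        start = time.perf_counter()
        result = action()
        elapsed.append(timedelta(seconds=time.perf_counter() - start))
    return result, elapsed


# Stand-in for df.count(); with Spark this would be `lambda: df.count()`.
rows, timings = time_runs(lambda: sum(1 for _ in range(100_000)))
print(rows, [str(t) for t in timings])
```

Printing a timedelta gives the h:mm:ss.ffffff form seen in the pyDocumentDB tables.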

pyDocumentDB Performance

Below are the results of connecting Spark to Cosmos DB via pyDocumentDB from the dev box:

Single Collection

Below are the results from querying a single collection:

| Query | # of rows | Collection | Response Time (First) | Response Time (Second) |
|---|---|---|---|---|
| Q1 | 7 | C1 | 0:00:00.225645 | 0:00:00.006784 |
| Q2a | 100 | C2 | 0:00:00.214985 | 0:00:00.009669 |
| Q3 | 14,808 | C2 | 0:00:01.498699 | 0:00:01.323917 |
| Q4 | 1,048,575 | C2 | 0:01:37.518344 | |

Partitioned Collection

Below are the results from querying a partitioned collection (25 partitions):

| Query | # of rows | Collection | Response Time (First) | Response Time (Second) |
|---|---|---|---|---|
| Q2a | 100 | C3 | 0:00:00.774820 | 0:00:00.508290 |
| Q3 | 23,078 | C3 | 0:00:05.146107 | 0:00:03.234670 |
| Q4 | 1,391,578 | C3 | 0:02:36.335267 | |


azure-cosmosdb-spark Performance

Below are the results of connecting Spark to Cosmos DB via azure-cosmosdb-spark:

Single Collection

Below are the results from querying a single collection from the dev box:

| Query | # of rows | Collection | Response Time (First) | Response Time (Second) |
|---|---|---|---|---|
| Q2b | 100 | C2 | 00:00:01.183 | 00:00:00.958 |
| Q3 | 14,808 | C2 | 00:00:01.802 | 00:00:01.558 |
| Q4 | 1,048,575 | C2 | 00:00:56.642 | 00:00:54.931 |
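As a rough comparison of the two approaches on the single collection (C2), the sketch below converts the reported Q3 and Q4 first-run times to seconds and computes their ratio. The helper name to_seconds is illustrative and not part of either library; the time strings are copied from the dev box results reported on this page.

```python
def to_seconds(stamp):
    """Convert an h:mm:ss.fff string (either format used here) to seconds."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# First-run response times on C2, copied from the reported results.
first_run = {
    "Q3": {"pyDocumentDB": "0:00:01.498699", "azure-cosmosdb-spark": "00:00:01.802"},
    "Q4": {"pyDocumentDB": "0:01:37.518344", "azure-cosmosdb-spark": "00:00:56.642"},
}

ratios = {}
for query, times in first_run.items():
    py_secs = to_seconds(times["pyDocumentDB"])
    spark_secs = to_seconds(times["azure-cosmosdb-spark"])
    # A ratio above 1 means azure-cosmosdb-spark was faster for that query.
    ratios[query] = py_secs / spark_secs
    print(f"{query}: pyDocumentDB {py_secs:.3f}s, "
          f"azure-cosmosdb-spark {spark_secs:.3f}s, ratio {ratios[query]:.2f}")
```

On these numbers, azure-cosmosdb-spark is slightly slower for the small Q3 result but roughly 1.7 times faster for the full scan in Q4.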

Partitioned Collection

Below are the results from querying a partitioned collection (25 partitions):

Dev Box Results

| Query | # of rows | Collection | Response Time (First) | Response Time (Second) |
|---|---|---|---|---|
| Q2b | 100 | C3 | 0:00:00.774820 | 0:00:00.508290 |
| Q3 | 23,078 | C3 | 0:00:05.146107 | 0:00:03.234670 |
| Q4 | 1,391,578 | C3 | 0:02:36.335267 | |

HDI Cluster: 2 workers

| Query | # of rows | Collection | Response Time (First) | Response Time (Second) |
|---|---|---|---|---|
| Q2b | 100 | C3 | 00:00:01.286 | 00:00:00.868 |
| Q3 | 23,078 | C3 | 00:00:01.582 | 00:00:01.339 |
| Q4 | 1,391,578 | C3 | 00:00:16.955 | 00:00:12.982 |

HDI Cluster: for Q4, multiple worker configurations

Q4 was re-run while scaling the cluster to different worker counts to see the impact of parallel queries:

| # of workers | Response Time (First) | Response Time (Second) |
|---|---|---|
| 4 | 00:00:11.129 | 00:00:09.958 |
| 6 | 00:00:10.028 | 00:00:10.495 |
| 8 | 00:00:10.323 | 00:00:09.723 |
| 12 | 00:00:08.899 | 00:00:09.153 |
| 20 | 00:00:10.210 | 00:00:10.398 |
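To make the scaling effect easier to read, the sketch below computes the Q4 first-run speedup for each worker count relative to the 2-worker HDI result (00:00:16.955). The times are the reported values converted to seconds by hand; the variable names are illustrative.

```python
# Q4 first-run response times in seconds on C3, keyed by worker count.
# The 2-worker value comes from the "HDI Cluster: 2 workers" results.
q4_first_run = {2: 16.955, 4: 11.129, 6: 10.028, 8: 10.323, 12: 8.899, 20: 10.210}

baseline = q4_first_run[2]
speedup = {workers: baseline / secs for workers, secs in q4_first_run.items()}
best_workers = max(speedup, key=speedup.get)

for workers in sorted(speedup):
    print(f"{workers:>2} workers: {q4_first_run[workers]:6.3f}s, "
          f"speedup {speedup[workers]:.2f}x")
print("fastest first run:", best_workers, "workers")
```

Past 4 workers the first-run times stay between about 8.9 and 10.3 seconds, so for this data size adding workers yields little further gain.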