-
Notifications
You must be signed in to change notification settings - Fork 121
Query Test Runs
Below are the results of some query test runs using the different Spark to Cosmos DB connector methods.
Below are the Cosmos DB configurations used:
- Single partition collection: 10,000 RUs
-
airport.codes
has 512 documents -
DepartureDelays.flights
has 1.05M documents (single collection) - Partitioned Collection: 250,000 RUs
-
DepartureDelays.flights (pColl)
has 1.39M documents (partitioned collection)
Below are the Apache Spark configurations used:
- dev box: Single VM Spark cluster (one master, one worker) on Azure DS11 v2 VM (14GB RAM, 2 cores) running Ubuntu 16.04 LTS using Spark 2.1.
- HDI Cluster: HDI 3.5 multi-node Spark cluster (2 master, multiple workers, 3 zookeepers) using Spark 2.0.2
The queries were:
- Q1:
SELECT c.City FROM c WHERE c.State='WA'
- Q2a:
SELECT TOP 100 c.date, c.delay, c.distance, c.origin, c.destination FROM c
- Q2b:
SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c WHERE c.origin = 'SEA' LIMIT 100
- Q3:
SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c WHERE c.origin = 'SEA'
- Q4:
SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c
Note the slight differences between
Q2a
andQ2b
; this is because Q2a viapyDocumentDB
is using a Cosmos DB SQL query (i.e. usingTOP
) while Q2b viaazure-cosmosdb-spark
is using a Spark SQL query (i.e. usingLIMIT
).
against the following collections:
- C1:
airport.codes
- C2:
DepartureDelays.flights
- C3:
DepartureDelays.flights_pcoll
The query results below are from executing
df.count()
Below are the results of connecting Spark to Cosmos DB via pyDocumentDB
from the dev box:
Below are the results from querying a single collection
Query | # of rows | Collection | Response Time (First) | Response Time (Second) |
---|---|---|---|---|
Q1 | 7 | C1 | 0:00:00.225645 | 0:00:00.006784 |
Q2a | 100 | C2 | 0:00:00.214985 | 0:00:00.009669 |
Q3 | 14,808 | C2 | 0:00:01.498699 | 0:00:01.323917 |
Q4 | 1,048,575 | C2 | 0:01:37.518344 |
Below are the results from querying a partitioned collection (25 partitions)
Query | # of rows | Collection | Response Time (First) | Response Time (Second) |
---|---|---|---|---|
Q2a | 100 | C3 | 0:00:00.774820 | 0:00:00.508290 |
Q3 | 23,078 | C3 | 0:00:05.146107 | 0:00:03.234670 |
Q4 | 1,391,578 | C3 | 0:02:36.335267 |
.
Below are the results of connecting Spark to Cosmos DB via azure-cosmosdb-spark
:
Below are the results from querying a single collection from the dev box:
Query | # of rows | Collection | Response Time (First) | Response Time (Second) |
---|---|---|---|---|
Q2 | 100 | C2 | 00:00:01.183 | 00:00:00.958 |
Q3 | 14,808 | C2 | 00:00:01.802 | 00:00:01.558 |
Q4 | 1,048,575 | C2 | 00:00:56.642 | 00:00:54.931 |
Below are the results from querying a partitioned collection (25 partitions):
Query | # of rows | Collection | Response Time (First) | Response Time (Second) |
---|---|---|---|---|
Q2 | 100 | C3 | 0:00:00.774820 | 0:00:00.508290 |
Q3 | 23,078 | C3 | 0:00:05.146107 | 0:00:03.234670 |
Q4 | 1,391,578 | C3 | 0:02:36.335267 |
Query | # of rows | Collection | Response Time (First) | Response Time (Second) |
---|---|---|---|---|
Q2 | 100 | C3 | 00:00:01.286 | 00:00:00.868 |
Q3 | 23,078 | C3 | 00:00:01.582 | 00:00:01.339 |
Q4 | 1,391,578 | C3 | 00:00:16.955 | 00:00:12.982 |
Re-running Q4
but scaling the cluster to see the impact of parallel queries
# of workers | Response Time (First) | Response Time (Second) |
---|---|---|
4 | 00:00:11.129 | 00:00:09.958 |
6 | 00:00:10.028 | 00:00:10.495 |
8 | 00:00:10.323 | 00:00:09.723 |
12 | 00:00:08.899 | 00:00:09.153 |
20 | 00:00:10.210 | 00:00:10.398 |