
DOCSP-36546 Scan Multiple Collections #193

32 changes: 32 additions & 0 deletions source/release-notes.txt
@@ -2,6 +2,38 @@
Release Notes
=============

MongoDB Connector for Spark 10.3
--------------------------------

The 10.3 connector release includes the following new features:

- Added support for reading multiple collections when using micro-batch or
continuous streaming modes.

.. warning:: Breaking Change

Support for reading multiple collections introduces the following breaking
changes:

- If the name of a collection used in your ``collection`` configuration
option contains a comma, the
{+connector-short+} treats it as two different collections. To avoid
this, you must escape the comma by preceding it with a backslash (\\).

- If the name of a collection used in your ``collection`` configuration
option is "*", the {+connector-short+} interprets it as a specification
to scan all collections. To avoid this, you must escape the asterisk by preceding it
with a backslash (\\).

- If the name of a collection used in your ``collection`` configuration
option contains a backslash (\\), the
{+connector-short+} treats the backslash as an escape character, which
might change how it interprets the value. To avoid this, you must escape
the backslash by preceding it with another backslash.

To learn more about scanning multiple collections, see the :ref:`collection
configuration property <spark-streaming-input-conf>` description.

MongoDB Connector for Spark 10.2
--------------------------------

95 changes: 93 additions & 2 deletions source/streaming-mode/streaming-read-config.txt
@@ -46,6 +46,10 @@
* - ``collection``
- | **Required.**
| The collection name configuration.
| You can specify multiple collections by separating the collection names
with a comma.
|
| To learn more about specifying multiple collections, see :ref:`spark-specify-multiple-collections`.

* - ``comment``
- | The comment to append to the read operation. Comments appear in the
@@ -168,7 +172,7 @@
omit the ``fullDocument`` field and publishes only the value of the
field.
- If you don't specify a schema, the connector infers the schema
from the change stream document.

**Default**: ``false``

@@ -203,4 +207,91 @@
Specifying Properties in ``connection.uri``
-------------------------------------------

.. include:: /includes/connection-read-config.rst

.. _spark-specify-multiple-collections:

Specifying Multiple Collections in the ``collection`` Property
--------------------------------------------------------------

You can specify multiple collections in the ``collection`` change stream
configuration property by separating the collection names
with a comma. Do not add a space between the collection names unless the space
is part of the collection name.

Specify multiple collections as shown in the following example:

.. code-block:: java

...
.option("spark.mongodb.collection", "collectionOne,collectionTwo")

If a collection name is "*", or if the name includes a comma or a backslash (\\),
you must escape the character as follows:

- If the name of a collection used in your ``collection`` configuration
option contains a comma, the {+connector-short+} treats it as two different
collections. To avoid this, you must escape the comma by preceding it with
a backslash (\\). Escape a collection named "my,collection" as follows:

.. code-block:: java

"my\,collection"

- If the name of a collection used in your ``collection`` configuration
option is "*", the {+connector-short+} interprets it as a specification
to scan all collections. To avoid this, you must escape the asterisk by preceding it
with a backslash (\\). Escape a collection named "*" as follows:

.. code-block:: java

"\*"

- If the name of a collection used in your ``collection`` configuration
option contains a backslash (\\), the
{+connector-short+} treats the backslash as an escape character, which
might change how it interprets the value. To avoid this, you must escape
the backslash by preceding it with another backslash. Escape a collection named "\\collection" as follows:

.. code-block:: java

"\\collection"

.. note::

When specifying the collection name as a string literal in Java, you must
further escape each backslash with another one. For example, escape a collection
named "\\collection" as follows:

.. code-block:: java

"\\\\collection"

You can stream from all collections in the database by passing an
asterisk (*) as a string for the collection name.

Specify all collections as shown in the following example:

.. code-block:: java

...
.option("spark.mongodb.collection", "*")

If you create a collection while streaming from all collections, the new
collection is automatically included in the stream.

You can drop collections at any time while streaming from multiple collections.

.. important:: Inferring the Schema with Multiple Collections

If you set the ``change.stream.publish.full.document.only``
option to ``true``, the {+connector-short+} infers the schema of a ``DataFrame``
by using the schema of the scanned documents.

Schema inference happens at the beginning of streaming, and does not take
into account collections that are created during streaming.

When streaming from multiple collections and inferring the schema, the connector
samples each collection sequentially. Streaming from many
collections can cause the schema inference to have noticeably slower
performance. This performance impact occurs only while inferring the schema.
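
If schema inference is too slow for your workload, one option is to provide an
explicit schema so that the connector does not need to sample the collections.
The following is a minimal sketch; the field names are placeholder assumptions:

.. code-block:: java

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Supplying a schema skips the sequential sampling of each collection.
StructType schema = new StructType()
    .add("_id", DataTypes.StringType)
    .add("status", DataTypes.StringType);

Dataset<Row> df = spark.readStream()
    .format("mongodb")
    .option("spark.mongodb.collection", "collectionOne,collectionTwo")
    .schema(schema)
    .load();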
17 changes: 11 additions & 6 deletions source/streaming-mode/streaming-read.txt
@@ -15,6 +15,13 @@ Read from MongoDB in Streaming Mode
:depth: 1
:class: singlecol

.. facet::
:name: genre
:values: reference

.. meta::
:keywords: change stream

Overview
--------

@@ -344,12 +351,10 @@ The following example shows how to stream data from MongoDB to your console.

.. important:: Inferring the Schema of a Change Stream

If you set the ``change.stream.publish.full.document.only``
option to ``true``, the {+connector-short+} infers the schema of a ``DataFrame``
by using the schema of the scanned documents. If you set the option to
``false``, you must specify a schema.
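
As an illustration, a sketch of enabling the option, assuming the
``spark.mongodb``-prefixed form of the key, might look like the following; when
the option is ``false``, you would instead pass an explicit schema with
``.schema(...)`` before ``.load()``:

.. code-block:: java

...
// Publish only the fullDocument value; the schema is inferred from the
// scanned documents.
.option("spark.mongodb.change.stream.publish.full.document.only", "true")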

For more information about this setting, and to see a full list of change stream
configuration options, see the