diff --git a/source/release-notes.txt b/source/release-notes.txt
index ba5a835..bc0d653 100644
--- a/source/release-notes.txt
+++ b/source/release-notes.txt
@@ -2,6 +2,38 @@
 Release Notes
 =============
 
+MongoDB Connector for Spark 10.3
+--------------------------------
+
+The 10.3 connector release includes the following new features:
+
+- Added support for reading multiple collections when using micro-batch or
+  continuous streaming modes.
+
+  .. warning:: Breaking Change
+
+     Support for reading multiple collections introduces the following
+     breaking changes:
+
+     - If the name of a collection used in your ``collection`` configuration
+       option contains a comma, the {+connector-short+} treats it as two
+       different collections. To avoid this, you must escape the comma by
+       preceding it with a backslash (\\).
+
+     - If the name of a collection used in your ``collection`` configuration
+       option is "*", the {+connector-short+} interprets it as a specification
+       to scan all collections. To avoid this, you must escape the asterisk
+       by preceding it with a backslash (\\).
+
+     - If the name of a collection used in your ``collection`` configuration
+       option contains a backslash (\\), the {+connector-short+} treats the
+       backslash as an escape character, which might change how it interprets
+       the value. To avoid this, you must escape the backslash by preceding
+       it with another backslash.
+
+  To learn more about scanning multiple collections, see the :ref:`collection
+  configuration property <spark-specify-multiple-collections>` description.
+
 MongoDB Connector for Spark 10.2
 --------------------------------
 
diff --git a/source/streaming-mode/streaming-read-config.txt b/source/streaming-mode/streaming-read-config.txt
index 621a412..997d175 100644
--- a/source/streaming-mode/streaming-read-config.txt
+++ b/source/streaming-mode/streaming-read-config.txt
@@ -46,6 +46,10 @@ You can configure the following properties when reading data from MongoDB in streaming mode:
    * - ``collection``
      - | **Required.**
        | The collection name configuration.
+       | You can specify multiple collections by separating the collection names
+         with a comma.
+       |
+       | To learn more about specifying multiple collections, see :ref:`spark-specify-multiple-collections`.
 
    * - ``comment``
      - | The comment to append to the read operation. Comments appear in the
@@ -168,7 +172,7 @@ You can configure the following properties when reading a change stream from MongoDB:
        omit the ``fullDocument`` field and publishes only the value of the field.
 
      - If you don't specify a schema, the connector infers the schema
-       from the change stream document rather than from the underlying collection.
+       from the change stream document.
 
      **Default**: ``false``
 
@@ -203,4 +207,91 @@ You can configure the following properties when reading a change stream from MongoDB:
 Specifying Properties in ``connection.uri``
 -------------------------------------------
 
-.. include:: /includes/connection-read-config.rst
\ No newline at end of file
+.. include:: /includes/connection-read-config.rst
+
+.. _spark-specify-multiple-collections:
+
+Specifying Multiple Collections in the ``collection`` Property
+--------------------------------------------------------------
+
+You can specify multiple collections in the ``collection`` change stream
+configuration property by separating the collection names with a comma. Do
+not add a space between the collections unless the space is part of the
+collection name.
+
+Specify multiple collections as shown in the following example:
+
+.. code-block:: java
+
+   ...
+   .option("spark.mongodb.collection", "collectionOne,collectionTwo")
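+
+The following minimal sketch shows one way the ``collection`` option might
+fit into a complete streaming read. The connection string, database name,
+and collection names are placeholder values, and ``spark`` is assumed to be
+an existing ``SparkSession``:
+
+.. code-block:: java
+
+   import org.apache.spark.sql.Dataset;
+   import org.apache.spark.sql.Row;
+
+   // Sketch only: stream change events from two collections at once.
+   Dataset<Row> events = spark.readStream()
+       .format("mongodb")
+       .option("spark.mongodb.connection.uri", "mongodb://localhost:27017")
+       .option("spark.mongodb.database", "exampleDb")
+       .option("spark.mongodb.collection", "collectionOne,collectionTwo")
+       .load();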
+
+If a collection name is "*", or if the name includes a comma or a backslash
+(\\), you must escape the character as follows:
+
+- If the name of a collection used in your ``collection`` configuration
+  option contains a comma, the {+connector-short+} treats it as two different
+  collections. To avoid this, you must escape the comma by preceding it with
+  a backslash (\\). Escape a collection named "my,collection" as follows:
+
+  .. code-block:: java
+
+     "my\,collection"
+
+- If the name of a collection used in your ``collection`` configuration
+  option is "*", the {+connector-short+} interprets it as a specification to
+  scan all collections. To avoid this, you must escape the asterisk by
+  preceding it with a backslash (\\). Escape a collection named "*" as
+  follows:
+
+  .. code-block:: java
+
+     "\*"
+
+- If the name of a collection used in your ``collection`` configuration
+  option contains a backslash (\\), the {+connector-short+} treats the
+  backslash as an escape character, which might change how it interprets the
+  value. To avoid this, you must escape the backslash by preceding it with
+  another backslash. Escape a collection named "\\collection" as follows:
+
+  .. code-block:: java
+
+     "\\collection"
+
+  .. note::
+
+     When specifying the collection name as a string literal in Java, you
+     must further escape each backslash with another one. For example, escape
+     a collection named "\\collection" as follows:
+
+     .. code-block:: java
+
+        "\\\\collection"
+
+You can stream from all collections in the database by passing an
+asterisk (*) as a string for the collection name.
+
+Specify all collections as shown in the following example:
+
+.. code-block:: java
+
+   ...
+   .option("spark.mongodb.collection", "*")
+
+If you create a collection while streaming from all collections, the new
+collection is automatically included in the stream.
+
+You can drop collections at any time while streaming from multiple
+collections.
+
+.. important:: Inferring the Schema with Multiple Collections
+
+   If you set the ``change.stream.publish.full.document.only``
+   option to ``true``, the {+connector-short+} infers the schema of a
+   ``DataFrame`` by using the schema of the scanned documents.
+
+   Schema inference happens at the beginning of streaming and does not take
+   into account collections that are created during streaming.
+
+   When streaming from multiple collections and inferring the schema, the
+   connector samples each collection sequentially. Streaming from a large
+   number of collections can make schema inference noticeably slower. This
+   performance impact occurs only while inferring the schema.
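+
+As an illustration only, the following sketch combines a multiple-collection
+``collection`` value with the ``change.stream.publish.full.document.only``
+option described above. The option key is shown with the ``spark.mongodb.``
+prefix used by the other examples on this page; adjust it to match how you
+set your other read options. All names are placeholder values:
+
+.. code-block:: java
+
+   // Sketch only: publish full documents so that the connector infers the
+   // schema by sampling the scanned collections when the stream starts.
+   Dataset<Row> fullDocs = spark.readStream()
+       .format("mongodb")
+       .option("spark.mongodb.collection", "collectionOne,collectionTwo")
+       .option("spark.mongodb.change.stream.publish.full.document.only", "true")
+       .load();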
diff --git a/source/streaming-mode/streaming-read.txt b/source/streaming-mode/streaming-read.txt
index d7433cc..ac8fb7b 100644
--- a/source/streaming-mode/streaming-read.txt
+++ b/source/streaming-mode/streaming-read.txt
@@ -15,6 +15,13 @@ Read from MongoDB in Streaming Mode
    :depth: 1
    :class: singlecol
 
+.. facet::
+   :name: genre
+   :values: reference
+
+.. meta::
+   :keywords: change stream
+
 Overview
 --------
 
@@ -344,12 +351,10 @@ The following example shows how to stream data from MongoDB to your console.
 
 .. important:: Inferring the Schema of a Change Stream
 
-   When the {+connector-short+} infers the schema of a DataFrame
-   read from a change stream, by default,
-   it uses the schema of the underlying collection rather than that
-   of the change stream. If you set the ``change.stream.publish.full.document.only``
-   option to ``true``, the connector uses the schema of the
-   change stream instead.
+   If you set the ``change.stream.publish.full.document.only``
+   option to ``true``, the {+connector-short+} infers the schema of a
+   ``DataFrame`` by using the schema of the scanned documents. If you set the
+   option to ``false``, you must specify a schema.
 
    For more information about this setting, and to see a full list of
    change stream configuration options, see the