Merge pull request #226 from alxndrnh/copy_example_gen_prod

Copy example gen prod
tensorflow · May 19, 2023 · 8507648
2 parents 3c10a96 + ff15f35

Showing 3 changed files with 159 additions and 16 deletions.
diff --git a/tfx_addons/copy_example_gen/README.md b/tfx_addons/copy_example_gen/README.md
@@ -10,31 +10,66 @@
 **Project name:** CopyExampleGen component
 
 ## Project Description
-CopyExampleGen will allow the user to copy a pre-existing Tfrecord dataset or raw data and ingest it into the pipeline, ultimately skipping the process of shuffling and running the Beam job. This process will require a dict input with split_names and their respective URI. This will output an Examples Artifact (same as the Artifact output from the ExampleGen component)  in which downstream components can use.
+CopyExampleGen lets the user copy pre-existing tfrecords and ingest them into the pipeline as examples, skipping the shuffling and the Beam job that the standard ExampleGen component runs. The user supplies a dict with split names as keys and their respective URIs as values. The component then sets the artifact's properties, generates the output dict, and registers contexts and execution for downstream components to use. The tfrecord file(s) at each URI must use the same `.gz` file format as the output of the ExampleGen component.
+
+Example of pipeline component definition:
+```python
+tfrecords_dict: Dict[str, str] = {
+    "train": "gs://path/to/tfrecords/examples/Split-train/",
+    "eval": "gs://path/to/tfrecords/examples/Split-eval/",
+}
+
+copy_example_gen = component.CopyExampleGen(
+    input_json_str=json.dumps(tfrecords_dict)
+)
+```
+
+As of April 10th, 2023, `tfx.dsl.components.Parameter` only supports primitive types. To use CopyExampleGen, the `Dict[str, str]` input therefore needs to be converted into a JSON string and passed as `input_json_str`, which can be done by calling `json.dumps()` with `tfrecords_dict` as its argument.
+
 
 ## Project Category
-Component
+Addon Component
 
 ## Project Use-Case(s)
-CopyExampleGen will allow the user to add a dict input with split_names as the key and their respective pre-existing Tfrecords URIs as their value, then format the director structure so that it matches that of an Example Artifact.
+CopyExampleGen replaces ExampleGen when the user already has tfrecords and split names. No Beam job is run and the tfrecords are not shuffled or randomized, which saves data ingestion time in the pipeline.
+
+Currently, ingesting data with the ExampleGen component always runs a Beam job and does not provide a way to split without random data shuffling. This component will save significant time (hours for large amounts of data) per pipeline run when a run does not require the data to be shuffled. Some challenges users have had:
+
+  1. “Reshuffle doesn't work well with DirectRunner and causes OOMing. Users have been patching out shuffling in every release and doing it in the DB query. They have given up on Beam based ExampleGen and have created an entire custom ExampleGen that reads from the database and doesn’t use Beam”.
+
+  2. “When the use case is a time series problem using sliding windows, shuffling before splitting in train and eval set is counterproductive as the user would need a coherent training set”.
 
-Currently, ingesting data with the ExampleGen requires a Beam job to be ran and requires the data to be shuffled. This component will save users hours/ days of having to create a workaround fully custom ExampleGen component. Some challenges our users have had:
-Reshuffle doesn't work well with DirectRunner and causes OOMing. Users have been patching out shuffling in every release and doing it in the DB query. They have given up on Beam based ExampleGen and have created an entire custom ExampleGen that reads from the database and doesn’t use Beam. Link.
-When the use case is a time series problem using sliding windows, shuffling before splitting in train and eval set is counterproductive as the user would need a coherent training set. Link.
-Almost impossible to use ExampleGen based components for large datasets. Without it, Beam knows how to write to disk after transforming from input format to output format, allowing it to transform (slowly) large datasets that would otherwise not fit into memory. Link.
 
 ## Project Implementation
-Use case #1 - Tfrecords as input URIs:
-This component will:
-1. Accept a dict i.e. {'split_name1': './path/to/split_name1/tfrecord1', 'split_name2': './path/to/split_name2/tfrecord2'}
-2. Retrieve the tfrecords
-3. Create an Examples Artifact, following Examples directory structure and properties required for an Examples Artifact
-4. Register the Examples Artifact into MLMD
-5. Output as 'examples' to be ingested from downstream components
+### Component
+
+Custom Python function component: CopyExampleGen
 
+ - `input_json_str`: the input parameter for CopyExampleGen, of type `tfx.dsl.components.Parameter[str]`. The user's `Dict[str, str]` input, `tfrecords_dict`, is passed through it. Because Python custom component development only supports primitive parameter types, `input_json_str` must be set to `json.dumps(tfrecords_dict)`.
+
+ - `output_example`: the output artifact, which can be referenced inside the component function as an object of its declared ArtifactType. For example, if the ArtifactType is Examples, the properties of an Examples artifact (span, version, split_names, etc.) can be accessed on the OutputArtifact object. This is the variable used to build and register the Examples artifact after parsing the tfrecords_dict input.
+
+
+### Python Custom Component Implementation Details
+
+  Using fileio.mkdir and fileio.copy, the component creates a `Split-name` folder for each split name in the input dict, then copies the files found at that split's uri path into the designated `Split-name` folder.
+
+  Thoughts from original implementation in phase 1:
+  This step can possibly use the [importer.generate_output_dict](https://github.com/tensorflow/tfx/blob/f8ce19339568ae58519d4eecfdd73078f80f84a2/tfx/dsl/components/common/importer.py#L153) function:
+  Create standard ‘output_dict’ variable. The value will be created by calling the worker function. If file copying is done before this step, this method can probably be used as is to register the artifact.
+
+  Using the keys and values from `tfrecords_dict`:
+  Convert input_dict.keys() to a str in the format required by the `split_names` property, i.e. '["train","eval"]'
+
+## Possible Future Development Directions
+  1. There are a few open questions about how the file copying should actually be done. Where does the copying that importer does actually happen? And what's the best way to change that? Are there other ways in TFX to do copying in a robust way? Maybe something in tfx.io? If there's an existing method, what has to happen in the `parse_tfrecords_dict`? Depending on the copying capabilities available, will there be a need to detect the execution environment? Does TFX rely on other tools to execute a copy that handle this? Is detection of the execution environment and the copying itself separate? What could be reused?
+
+  - If it's not easy to detect the execution environment without also performing a copy, will the user have to specify the execution environment and therefore how to do the copy (e.g., local copy, GCS, S3). And then what's the best way to handle that?
+
+  2. Should the dictionary of file inputs take a path to a folder? Globs? Lists of individual files?
+  3. Assuming file copying is done entirely separately, can [importer.generate_output_dict](https://github.com/tensorflow/tfx/blob/f8ce19339568ae58519d4eecfdd73078f80f84a2/tfx/dsl/components/common/importer.py#L153) be used as is to register the artifacts, or does some separate code using [MLMD](https://www.tensorflow.org/tfx/guide/mlmd) in a different way need to be written?
 
-## Project Dependencies
-Using: Python 3.8.2, Tensorflow 2.11.0, TFX 1.12.0
 
 ## Project Team
 Alex Ho, alexanderho@google.com, @alxndrnh
+
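Review note on the README above: the JSON round trip it describes (serialize the split dictionary for `Parameter[str]`, decode it inside the component, then derive the `split_names` string) can be sketched with the standard library alone. This is a minimal illustration, not the component's code; `encode_splits`, `decode_splits`, and `split_names_property` are hypothetical helper names that do not exist in TFX or in this PR.

```python
import json
from typing import Dict


def encode_splits(tfrecords_dict: Dict[str, str]) -> str:
    """Serialize the split-name -> URI mapping so it fits a str parameter."""
    return json.dumps(tfrecords_dict)


def decode_splits(input_json_str: str) -> Dict[str, str]:
    """Recover the mapping, rejecting input that is not a JSON object."""
    decoded = json.loads(input_json_str)
    if not isinstance(decoded, dict):
        raise ValueError("input_json_str must encode a JSON object")
    return {str(name): str(uri) for name, uri in decoded.items()}


def split_names_property(tfrecords_dict: Dict[str, str]) -> str:
    """Build the split_names string from the keys, in insertion order."""
    # Compact separators reproduce the '["train","eval"]' form from the README.
    return json.dumps(list(tfrecords_dict), separators=(",", ":"))


tfrecords_dict = {
    "train": "gs://path/to/tfrecords/examples/Split-train/",
    "eval": "gs://path/to/tfrecords/examples/Split-eval/",
}
input_json_str = encode_splits(tfrecords_dict)
```

Letting `json.dumps` build the `split_names` string, rather than joining the keys by hand, also escapes any quote characters that appear in a split name.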
diff --git a/tfx_addons/copy_example_gen/__init__.py b/tfx_addons/copy_example_gen/__init__.py
@@ -0,0 +1,14 @@
+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
diff --git a/tfx_addons/copy_example_gen/component.py b/tfx_addons/copy_example_gen/component.py
@@ -0,0 +1,94 @@
+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""CopyExampleGen custom component.
+
+This component will accept tfrecord files and register them as an
+Examples Artifact for downstream components to use. CopyExampleGen accepts
+a dictionary where keys are the split-names and their respective value is a
+uri to the folder that contains the tfrecords file(s).
+
+Tfrecord file(s) in uri must use the same `.gz` file format as the output of
+the ExampleGen component.
+
+The user needs to create a dictionary of type Dict[str, str]; in this example
+it is named 'tfrecords_dict':
+
+  tfrecords_dict: Dict[str, str]={
+      "train":"gs://path/to/examples/Split-train/",
+      "eval":"gs://path/to/examples/Split-eval/"
+    }
+
+'tfx.dsl.components.Parameter' only supports primitive types. Therefore, to
+use CopyExampleGen, the dictionary of type Dict[str, str] needs to be
+converted into a JSON str. This can be done by passing 'tfrecords_dict' to
+'json.dumps()' like so:
+
+  copy_example=component.CopyExampleGen(
+      input_json_str=json.dumps(tfrecords_dict)
+    )
+
+"""
+import json
+import os
+
+from tfx import v1 as tfx
+from tfx.dsl.component.experimental.decorators import component
+from tfx.dsl.io import fileio
+from tfx.v1.types.standard_artifacts import Examples
+
+
+@component
+def CopyExampleGen(  # pylint: disable=C0103
+    input_json_str: tfx.dsl.components.Parameter[str],
+    output_example: tfx.dsl.components.OutputArtifact[Examples]
+) -> tfx.dsl.components.OutputDict():
+  """
+  CopyExampleGen first converts the string input to a Dict, input_dict, and
+  extracts its keys to build a string containing the split names. This string
+  is assigned to the output_example.split_names property to register the
+  splits.
+
+  This component then creates a `Split-name` folder for each split name.
+  Following the creation of the `Split-name` folder, the files in the uri path
+  are copied into the designated `Split-name` folder.
+
+  """
+
+  # Convert primitive type str to Dict[str, str].
+  input_dict = json.loads(input_json_str)
+
+  # Create a Split-<name> directory under the output artifact uri for each
+  # split and copy that split's tfrecord files into it.
+  tfrecords_list = []
+  output_example_uri = output_example.uri
+
+  for split_label, split_tfrecords_uri in input_dict.items():
+    # Create Split-name folder name and create directory.
+    # output_example_uri = output_example.uri
+    split_value = (f"/Split-{split_label}/")
+    fileio.mkdir(f"{output_example_uri}{split_value}")
+
+    # List the .gz files under the uri, with or without a trailing slash.
+    tfrecords_list = fileio.glob(os.path.join(split_tfrecords_uri, "*.gz"))
+
+    # Copy files into folder directories.
+    for tfrecord in tfrecords_list:
+      file_name = os.path.basename(os.path.normpath(tfrecord))
+      file_destination = (f"{output_example_uri}{split_value}{file_name}")
+      fileio.copy(tfrecord, file_destination, overwrite=True)
+
+  # Build split_names in required Examples Artifact properties format.
+  example_properties_split_names = "[\"{}\"]".format('","'.join(
+      input_dict.keys()))
+  output_example.split_names = example_properties_split_names
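Review note on the copy loop above: its directory layout and file selection can be tried out locally without TFX. The sketch below swaps in the standard library (`glob`, `os`, `shutil`) for `tfx.dsl.io.fileio`, so it only handles local paths and is not a drop-in replacement for the component; `copy_split_files` is a hypothetical name used for illustration.

```python
import glob
import os
import shutil
from typing import Dict, List


def copy_split_files(input_dict: Dict[str, str], output_uri: str) -> List[str]:
    """Copy each split's *.gz files into <output_uri>/Split-<name>/.

    Local-filesystem stand-in for the fileio calls used by the component.
    Returns the destination paths in the order they were written.
    """
    copied = []
    for split_label, split_uri in input_dict.items():
        split_dir = os.path.join(output_uri, f"Split-{split_label}")
        os.makedirs(split_dir, exist_ok=True)
        # os.path.join keeps the pattern valid whether or not split_uri
        # ends with a path separator.
        pattern = os.path.join(split_uri, "*.gz")
        for src in sorted(glob.glob(pattern)):
            dst = os.path.join(split_dir, os.path.basename(src))
            shutil.copy(src, dst)
            copied.append(dst)
    return copied
```

Only files matching `*.gz` are copied, which mirrors the requirement that the inputs use the same `.gz` format as ExampleGen output; anything else in the source folder is ignored.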