Verify gcs target #218

Open · blackvvine wants to merge 7 commits into main

Conversation

blackvvine
Collaborator

Fixes #209

Collaborator

@alxmrs alxmrs left a comment

I've had to stop this review short; however, I have a few notes for things to work on so far.

One general idea: another approach would be to allow the pipeline to fail in the fetch step when this error is encountered (or for other errors where failure is preferable).

@@ -109,3 +111,46 @@ def optimize_selection_partition(selection: t.Dict) -> t.Dict:
        del selection_['year']

    return selection_


def prepare_partitions(config: Config) -> t.Iterator[Config]:
Collaborator

I'm curious: why did you move these functions to this file? It seems like partition.py is a good place for them.

Collaborator Author

This was to avoid a cyclic dependency issue. In partitions.py we have:

    from .parsers import prepare_target_name

Collaborator Author

I just reverted that commit and confirmed it'd cause an ImportError.
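
For reference, a minimal illustration of the cycle (the module contents here are assumptions for illustration, not the actual files):

    # partitions.py
    from .parsers import prepare_target_name    # partitions depends on parsers

    # parsers.py -- if prepare_partitions had stayed in partitions.py,
    # the new validation code would need:
    from .partitions import prepare_partitions  # parsers depends back on partitions
    # Importing either module then raises ImportError (circular import).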

for partition_conf in prepare_partitions(config):
    target = prepare_target_name(partition_conf)
    parsed = urlparse(target)
    if parsed.scheme == 'gs':
Collaborator

Is there a way this validation could be written to make the check portable (to other clouds)? For example, via Beam's FileSystems API?

Collaborator Author

That's a great point. I figured the "path writeable" check needs to distinguish between file systems, so I stuck to GCS. Nevertheless, FileSystems does have an FS-agnostic way of checking whether a directory exists. I've updated the CL to support this.
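
For context, the FS-agnostic check referred to here is Beam's FileSystems.exists; a minimal sketch (target_dir is a placeholder name, not taken from this PR):

    from apache_beam.io.filesystems import FileSystems

    # FileSystems dispatches on the path's scheme, so the same call
    # works for gs://, s3://, and local paths.
    if not FileSystems.exists(target_dir):
        raise ValueError(f'Target directory {target_dir} does not exist.')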

Collaborator

@alxmrs alxmrs left a comment

I have a few open-ended questions that generally ask how this validation will interact with other features in weather-dl. Also, some nits :).

target = prepare_target_name(partition_conf)
parsed = urlparse(target)
if FileSystems.exists(target):
    raise ValueError(f"Path {target} already exists.")
Collaborator

I think we're ok with users downloading to paths that already exist. I'm not sure if we want to do this.

Collaborator

For example, we have skipping logic to not re-download anything that already exists, plus a -f/--force flag that lets users override this behavior. How does this validation interact with these two features?
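
For reference, a hypothetical sketch of the skip/force interplay described above (the helper name and signature are assumptions, not weather-dl code):

    from apache_beam.io.filesystems import FileSystems

    def should_skip(target: str, force: bool) -> bool:
        # Skip targets that already exist, unless the user passed -f/--force.
        return FileSystems.exists(target) and not force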

Comment on lines +344 to +349
if parsed.scheme == "gs":
GcsIO().open(target, 'w').close()
elif parsed.scheme == "s3":
S3IO().open(target, 'w').close()
elif parsed.scheme == "" or parsed.scheme == "fs":
open(target, 'w').close()
Collaborator

Can we use a FileSystems.create() call instead? I think it's a bit more portable.
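
A minimal sketch of the suggested portable probe; deleting the sentinel file after the check is an assumption, not part of this PR:

    from apache_beam.io.filesystems import FileSystems

    # Create, then remove, an empty sentinel object to prove the target
    # is writable; FileSystems picks the right backend from the scheme.
    handle = FileSystems.create(target)
    handle.close()
    FileSystems.delete([target])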

Comment on lines +483 to +485
partition_dict = OrderedDict(
    (key, typecast(key, partition_config.selection[key][0]))
    for key in partition_config.partition_keys)
Collaborator

I'm ok with making this multiline, but then let's go all the way :)

Suggested change
-partition_dict = OrderedDict(
-    (key, typecast(key, partition_config.selection[key][0]))
-    for key in partition_config.partition_keys)
+partition_dict = OrderedDict(
+    (key, typecast(key, partition_config.selection[key][0]))
+    for key in partition_config.partition_keys
+)

Comment on lines +354 to +355
except Exception:
    raise ValueError(f"Unable to write to {target}")
Collaborator

@alxmrs alxmrs Sep 15, 2022

One more idea – let's chain the errors:

Suggested change
-except Exception:
-    raise ValueError(f"Unable to write to {target}")
+except Exception as e:
+    raise ValueError(f"Unable to write to {target}") from e

@alxmrs
Collaborator

alxmrs commented Sep 15, 2022

I just re-read #209 -- a general thought: Should the fix even be in the config parser? It seems like the source of the error has to do with the pipeline args / environment.

Furthermore, another possible fix is to change which kinds of errors we choose to retry and which we let bring down the pipeline. These are generally set here:

    def _retry_if_valid_input_but_server_or_socket_error_and_timeout_filter(exception) -> bool:
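
A hypothetical sketch of how such a predicate separates retryable from fatal errors (the simplified name and the error types are assumptions, not the actual weather-dl implementation):

    import socket

    def _retry_filter(exception) -> bool:
        # Return True to retry the download; returning False lets the
        # error propagate and bring down the pipeline.
        if isinstance(exception, ValueError):   # e.g. an unwritable target
            return False                        # fail fast
        return isinstance(exception, (socket.timeout, ConnectionError))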

@blackvvine
Collaborator Author

> I just re-read #209 -- a general thought: Should the fix even be in the config parser? It seems like the source of the error has to do with the pipeline args / environment.
>
> Furthermore, another possible fix is to change which kinds of errors we choose to retry and which we let bring down the pipeline. These are generally set here:
>
>     def _retry_if_valid_input_but_server_or_socket_error_and_timeout_filter(exception) -> bool:

I see the point that logically it might fit better in the error handler, but IMO it makes more sense for the check to be done in the arg parser, since it's fail-fast. Even if we fix the error handler to tear down the job when the target location is unreachable, the user still needlessly incurs costs running the job up to the point of failure, and it takes somewhere between 5 and 20 minutes to get there.

@alxmrs
Collaborator

alxmrs commented Sep 23, 2022

Ok, the validation is very reasonable to me. I like the principle of failing fast.

My one nit is: this should at least occur in the run function and not in the parser, since it's testing resource choices rather than parsing (i.e., it's validating, not parsing).
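
A minimal sketch of what moving the check into run would look like (the run signature and the pipeline body are assumptions based on the snippets above, not the final code):

    from apache_beam.io.filesystems import FileSystems

    def run(config: Config) -> None:
        # Validate every partition's target before launching the pipeline,
        # so an unwritable or pre-existing target fails fast and cheaply.
        for partition_conf in prepare_partitions(config):
            target = prepare_target_name(partition_conf)
            if FileSystems.exists(target):
                raise ValueError(f'Path {target} already exists.')
        # ... construct and run the Beam pipeline here ...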

Linked issue #209: Validate weather-dl target location before starting the job