should earthaccess provide a way to keep track of input query parameters? #847

JessicaS11 · 2024-10-24T14:33:23Z

JessicaS11
Oct 24, 2024
Maintainer

In exploring replacing the icepyx.Query module with direct usage of earthaccess, a few fundamental questions are surfacing that I think are also relevant topics to explore in earthaccess. This one concerns a basic difference between an ipx.Query object and earthaccess.search (which returns a list). Specifically, earthaccess does not in any way "store" the user's search criteria. It passes them through and returns a list of results. Both approaches (storing an object with the search criteria and results vs. not) have their advantages and disadvantages.

I'm curious to hear from others if the benefit of having some of this information stored in an object is worth the cost of having the object (and having the user interact with it rather than, e.g., the functions surfaced directly through earthaccess.api). My personal bias (surprise!) is that the object is nice to have: I can see exactly what search parameters I used (temporal, spatial, cloud-or-not, collection, etc.) for the set of results attached to it. I can use that information to feed into another API or tool, and if I change my code without updating my object I'm not confused by the results I have (plus, I don't have to scroll to the top of my notebook if I want any of that info). And I can have multiple objects with different search parameters and results attached (that I'm thus less likely to muck up, and also that I can do per-dataset operations on).

I think this is an important conversation for moving forward with how earthaccess and icepyx will interface (and other plugins too). How will the plugins need to check if the earthaccess results they've been passed are valid for working with their tool? How would icepyx "get" and "use" earthaccess search results for its other capabilities (submitting a subset request; reading in data), given users can change filenames and not all datasets have the same metadata (so there'd be a lot of try/if statements to guess at what the user has passed in)?

jhkennedy · 2024-10-25T19:31:02Z

jhkennedy
Oct 25, 2024
Maintainer

@JessicaS11 I think keeping track of the search parameters that got you to a search result is a good idea. As you note, we return a vanilla list, so there's nowhere to store that kind of information in the results object. For this, and the reasons discussed here, I think we should pivot to returning a results object so we can provide richer methods/metadata about results.

Looking at ipx.Query, it looks like both searching/ordering and the results of those operations are contained in the class. On the earthaccess side, I think the search methods (e.g., search_data) would stay package-level methods^ but would return a Results object instead of a vanilla list:

>>> results = earthaccess.search_data(...)
>>> type(results)
earthaccess.Results
>>> search_args = results.search_args()  # some dict, or a dict-like object that allows you to do:
>>> results_again = earthaccess.search_data(**search_args)
>>> results == results_again
True

Which I think would get you most, if not all, of the functionality you want.
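To make the idea above concrete, here is a minimal, self-contained sketch of what such a results object might look like. It does not use earthaccess at all: `SearchResults` and the stand-in `search_data` function below are hypothetical names invented for illustration, and the "granules" are just strings. It only shows that a list-like object can carry the keyword arguments that produced it and replay them.

```python
from collections.abc import Sequence
from copy import deepcopy
from typing import Any


class SearchResults(Sequence):
    """Hypothetical list-like container that remembers how it was produced."""

    def __init__(self, items: list[Any], search_args: dict[str, Any]) -> None:
        self._items = list(items)
        # Copy so later changes to the caller's dict cannot alter the record.
        self._search_args = deepcopy(search_args)

    def __getitem__(self, index):
        return self._items[index]

    def __len__(self) -> int:
        return len(self._items)

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, SearchResults):
            return NotImplemented
        return (self._items, self._search_args) == (other._items, other._search_args)

    def search_args(self) -> dict[str, Any]:
        """Return a copy of the keyword arguments used for the search."""
        return deepcopy(self._search_args)


def search_data(**kwargs: Any) -> SearchResults:
    """Stand-in for a real search; fabricates granule names from the arguments."""
    count = kwargs.get("count", 2)
    short_name = kwargs.get("short_name", "UNKNOWN")
    items = [f"{short_name}-granule-{i}" for i in range(count)]
    return SearchResults(items, kwargs)


results = search_data(short_name="ATL06", count=3)
results_again = search_data(**results.search_args())
print(results == results_again)
```

Because the container implements the `Sequence` interface, code that only iterates over or indexes the current plain list would keep working in this sketch; code that relies on list-only methods such as `append` would not.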

given users can change filenames and not all datasets have the same metadata (so there'd be a lot of try/if statements to guess at what the user has passed in)?

Can you expand on both of these? I am not sure I quite follow.

^ There's a good argument to pull the search stuff into a class as well so that you could search multiple maturities or different catalogs at the same time. Technically, earthaccess already has this as packages are just singleton class objects, but allowing multiple instances could be helpful (e.g., you wouldn't have to pass an auth object around).

I do, however, prefer keeping the search classes and the results classes separate instead of combined like in ipx.Query.

0 replies

chuckwondo · 2024-10-25T20:11:12Z

chuckwondo
Oct 25, 2024
Maintainer

I'm a bit confused. Don't earthaccess.DataCollections and earthaccess.DataGranules already serve (at least most of) this purpose?

3 replies

jhkennedy Oct 25, 2024
Maintainer

@chuckwondo you're right, DataCollections and DataGranules do provide the stuff in my footnote/aside; that is, they are effectively API query classes. I always forget about them as they have rather unfortunate names (the non-plural versions of them, DataGranule and DataCollection, are result objects), and we mostly steer users towards earthaccess.search_datasets/earthaccess.search_data, which call them under the hood.

The issue @JessicaS11 is struggling with is that DataCollections/DataGranules return plain lists of DataCollection/DataGranule objects:
https://github.com/nsidc/earthaccess/blob/main/earthaccess/search.py#L112-L115

And so from the search result there's no way to reproduce the search that got that result, and we can't provide a richer representation of a search result.

But yes, I think you're right that those classes could do effectively everything she needs with little (stuffing the results into an object attribute) or no modification (delay calling get).

jhkennedy Oct 25, 2024
Maintainer

And by unfortunately named, I mean that:

DataCollections/DataGranules don't give a hint of what they do, unlike the base python_cmr CollectionQuery and GranuleQuery class names
They would be the logical name for a class that contains a group of DataCollection and DataGranule instances (e.g., a search result)

This also applies to DataServices:
https://github.com/nsidc/earthaccess/blob/main/earthaccess/services.py#L11

jhkennedy Oct 25, 2024
Maintainer

🤔 I suppose if you think about them as lazy, they are effectively a results object. With that view, what I'd like is a "load" method that stores the actual results, methods to index the results, and (potentially) ways to combine multiple results, which all could be added to these classes.

I still think they are overloaded, however, and querying and results should be separate classes.
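As a rough illustration of the "lazy" view described above, the sketch below shows an object that holds query parameters, fetches only when `load` is called (or when it is first indexed), keeps the results, and can be combined with another instance. `LazyResults` and `fake_fetch` are hypothetical names for this example only; they are not earthaccess or python_cmr classes, and the fetch function simply fabricates strings.

```python
from typing import Any, Callable, Optional


class LazyResults:
    """Hypothetical lazy results: holds query parameters, fetches on load()."""

    def __init__(self, fetch: Callable[..., list[Any]], **params: Any) -> None:
        self._fetch = fetch
        self.params = dict(params)
        self._items: Optional[list[Any]] = None  # None means "not loaded yet"

    def load(self) -> "LazyResults":
        """Run the query once and keep the results on the object."""
        if self._items is None:
            self._items = self._fetch(**self.params)
        return self

    def __getitem__(self, index):
        return self.load()._items[index]

    def __len__(self) -> int:
        return len(self.load()._items)

    def __add__(self, other: "LazyResults") -> list[Any]:
        """Combine two result sets into a plain list of items."""
        return list(self.load()._items) + list(other.load()._items)


calls = []  # records every time the stand-in fetch actually runs


def fake_fetch(**params: Any) -> list[str]:
    """Stand-in for a remote query; fabricates two item names."""
    calls.append(params)
    return [f"{params['short_name']}-{i}" for i in range(2)]


atl06 = LazyResults(fake_fetch, short_name="ATL06")
atl08 = LazyResults(fake_fetch, short_name="ATL08")
print(len(calls))  # nothing has been fetched at this point
print(atl06[0])
print(len(atl06 + atl08))
```

Note that this sketch deliberately keeps querying and results in one class, which is the overloading objected to above; splitting it would mean `load` returns a separate results object instead of `self`.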

JessicaS11 · 2024-10-29T19:11:49Z

JessicaS11
Oct 29, 2024
Maintainer Author

Notes from today's hack session:
Goal: return a results object (in earthaccess.api) instead of a list of specific results
[short-term] Plan: add granules property and new methods to earthaccess.DataCollections to make it behave like a list, then return to the user in earthaccess.search_datasets the actual DataCollections object.
[longer-term] Plan: Ultimately, earthaccess API methods like download and open would then act on this object rather than expecting the user to supply granules and provider inputs

For further discussion: separate the query and results objects entirely. This would be a breaking change but also help users when they need to be authenticated with multiple providers (since the auth wouldn't be attached to a package level earthaccess object). Another question is whether or not users directly call DataCollections rather than the api search_datasets as shown (e.g.) in the Earthdata Cloud Cookbook, which could influence how breaking a change it truly is.

3 replies

weiji14 Nov 12, 2024

Had a chat with @ebolch today on why icepyx would appreciate a 'results class' (which he's working on in #860). The key idea is to have the class store the following:

Input spatiotemporal search arguments (e.g. bbox, time range, etc) made to earthaccess.search_datasets
Additional metadata returned from the CMR query

[short-term] Plan: add granules property and new methods to earthaccess.DataCollections to make it behave like a list, then return to the user in earthaccess.search_datasets the actual DataCollections object.

So instead of earthaccess.search_datasets returning a list[DataCollection] currently, we want it to return a DataCollection which is a list[DataGranules] or something? Similar to how a pystac.ItemCollection is a list[pystac.Item]?

JessicaS11 Nov 13, 2024
Maintainer Author

So instead of earthaccess.search_datasets returning a list[DataCollection] currently, we want it to return a DataCollection which is a list[DataGranules] or something? Similar to how a pystac.ItemCollection is a list[pystac.Item]?

Exactly!

mfisher87 Nov 21, 2024
Maintainer

Input spatiotemporal search arguments (e.g. bbox, time range, etc) made to earthaccess.search_datasets

Exactly as input, correct? So a prior search could be reproduced with e.g. earthaccess.search_data(**results.input_params)?
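One way to picture "exactly as input" is a wrapper that snapshots the keyword arguments before the search runs and attaches them to the returned list. The sketch below assumes nothing about earthaccess internals: `record_input_params`, `RecordedList`, and the stand-in `search_data` are hypothetical names, and `input_params` is simply the attribute name used in the question above.

```python
import functools
from copy import deepcopy
from typing import Any, Callable


class RecordedList(list):
    """Hypothetical list subclass carrying the keyword arguments of its search."""

    input_params: dict[str, Any]


def record_input_params(search: Callable[..., list[Any]]) -> Callable[..., RecordedList]:
    """Wrap a search function so its result remembers the kwargs verbatim."""

    @functools.wraps(search)
    def wrapper(**kwargs: Any) -> RecordedList:
        recorded = deepcopy(kwargs)  # snapshot before the search can touch them
        results = RecordedList(search(**kwargs))
        results.input_params = recorded
        return results

    return wrapper


@record_input_params
def search_data(**kwargs: Any) -> list[str]:
    """Stand-in search that fabricates one result from its arguments."""
    return [f"{kwargs['short_name']}:{kwargs['temporal'][0]}"]


first = search_data(short_name="ATL06", temporal=("2024-01-01", "2024-02-01"))
again = search_data(**first.input_params)
print(first.input_params)
print(first == again)
```

Subclassing `list` here is a design choice for the sketch: the return value still passes `isinstance(..., list)` checks, so existing user code would be less likely to break than with a brand-new container type.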

chuckwondo · 2024-11-22T12:01:10Z

chuckwondo
Nov 22, 2024
Maintainer

I will continue to raise this in our various discussions related to CMR queries: we keep ignoring python_cmr.

Perhaps as part of this discussion we should reconsider the relationship between earthaccess and python_cmr.

If we want earthaccess to continue leveraging python_cmr, we need to have these CMR-query conversations over in that library, not here. Otherwise, we will continue to add functionality to earthaccess that further distances it from python_cmr, and further blurs/confuses the delineation.

One way or the other (leveraging or abandoning python_cmr), there is code that should really be moved one direction or the other. If we commit to leveraging python_cmr, then there's a fair bit of code in earthaccess that needs to be moved to python_cmr. If we choose to abandon use of python_cmr, then there's code we need to move in the other direction.

Currently, we're sitting in a sort of "limbo," and I'd like to suggest we make a decision one way or the other and make progress towards getting out of "limbo," which continues to add to our technical debt, IMO.

6 replies

chuckwondo Nov 23, 2024
Maintainer

The NAMS wall should be less of a concern now than it was before. Frank, Brianna, and myself are maintainers of python_cmr, so we can certainly address python_cmr requests better than previously, when lack of response drove us to "duplicate" work in earthaccess.

mfisher87 Nov 23, 2024
Maintainer

I agree things are certainly better, but I do still feel the current state is not resilient to change. Not a big deal short-term. It doesn't change my "leverage python_cmr" stance, it just prevents me from feeling like this will be completely smooth long-term.

jhkennedy Nov 26, 2024
Maintainer

It might be worth a separate discussion about python_cmr specifically and possibly sending it to the decision committee.

I agree that a lot of things should be python_cmr things, but I'm more and more coming around to "I'd like to eat it", or at least co-locate it "here" (or in an earthaccess org).

I don't like the status quo for a few reasons:

From a docs/user perspective, it's not very satisfactory to push a lot of our primary function signatures to another module and send people to a different location for docs (a significant portion of what we do is query cmr)
I'd prefer to isolate our extras in the DataGranules/DataCollections/DataServices classes into results containers and instead use the GranuleQuery/CollectionQuery/ServiceQuery classes directly
both packages require community support to maintain, and it would be nice to align them so that developing on either is effectively the same
while it's significantly better than it was, with the python_cmr maintainers being very active now, building a community around python_cmr will be harder and I'm not sure it's sustainable (bringing in new maintainers does still require a NAMS request), especially given how security things are handled at NASA (I have this concern about being in the NSIDC org as well, even though it's significantly less restrictive)

All can be done without needing to eat python_cmr, but I think it would make things better, overall.

jhkennedy Nov 26, 2024
Maintainer

One way or the other (leveraging or abandoning python_cmr), there is code that should really be moved one direction or the other. If we commit to leveraging python_cmr, then there's a fair bit of code in earthaccess that needs to be moved to python_cmr. If we choose to abandon use of python_cmr, then there's code we need to move in the other direction.

It'd be handy to do this inventory -- I suspect there won't be much of earthaccess left, or at least anything coherent if we push everything that should be done via python_cmr in that direction, and similarly, we basically use everything in python_cmr directly and only thinly wrap it before handing it to users. It seems to me the major problem is they are one and the same.

frankinspace Dec 6, 2024

I started contributing to python_cmr because PO.DAAC had a need to query CMR from Python, and at the time there were at least 3 (all non-active) Python CMR libraries listed on pip. That need, a Python API to CMR, still exists for many groups, and that is the intent of python_cmr. How much that goal overlaps with the goals of the earthaccess library I'm not sure, but I do know that splitting development again into multiple libraries maintained independently from each other increases the risk that we end up in the situation where there are multiple options available from PyPI, with different levels of functionality and support.

I do have access to invite outside collaborators as maintainers of the python_cmr repository in the NASA org and am more than willing to add more individuals. Thing is, open-source project governance is always tough because it is basically just a bunch of willing volunteers. I have tried making the case to ESDIS that libraries like this need to be identified as core libraries that need to have dedicated persistent support and I will continue to advocate for python_cmr in that regard.

If anything, being part of the NASA org does at least give us a path where in a situation where all the currently active maintainers of python_cmr disappeared, anyone with a NASA badge and enough motivation could revive the project, because that's exactly how I got involved :)
