Data Type Extension: single point definition #273
Comments
I'm not sure I understand what you're asking. A data type extension just means adding support to the spec or implementation for a new data type. It is referenced in the metadata by name, just like any existing data type |
Sorry, let me try to be clearer.
which is defined in the metadata for an array. |
Yes, because in principle the arrays could have different data types. How else should the data type of an array be specified, if not in the metadata for the array? |
But does each reference to "datetime" need to include the "configuration" or is can |
datetime hasn't been standardized so it is kind of hypothetical at this point. But the intent of configuration is to allow parameterized data types without having to encode all of the parameters as a single string. Until there actually are any parameterized data types, though, you might just ignore the possibility in your implementation. That is what I've done in tensorstore. |
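To make the point about parameterized data types concrete, here is a minimal Python sketch contrasting the two encodings. Everything specific in it is an assumption for illustration: the `datetime` type has not been standardized, and both the `unit` parameter and the packed `datetime[ns]` string syntax are hypothetical.

```python
import json
import re

# Hypothetical packed form: every parameter lives in one string,
# so each reader needs its own parser for the string syntax.
packed = "datetime[ns]"
match = re.fullmatch(r"(\w+)\[(\w+)\]", packed)
parsed = {"name": match.group(1), "configuration": {"unit": match.group(2)}}

# Hypothetical structured form: a name plus a configuration object,
# readable with an ordinary JSON parser and no custom string grammar.
structured = {"name": "datetime", "configuration": {"unit": "ns"}}

print(json.dumps(parsed))
print(json.dumps(structured))
```

Both forms carry the same information; the structured one simply avoids inventing a mini-language inside the type name.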
The specific example -- datetime -- is irrelevant, |
Is the idea that |
See also zarr-developers/zeps#47 regarding a variable-length string proposal. |
No, char will be an 8-bit unsigned integer whose purpose is to hold a single character in some encoding, |
I see, in that case you would not need any configuration options, unless you want to use a configuration option to indicate the character encoding. |
Yes, with the option of specifying the encoding. |
I think there is in general a question as to what should go into the data type configuration and what should go into a separate attribute to indicate "units". For example, if we are storing a temperature in degrees C, we would probably not have a separate "degrees C stored as float64" data type. Instead, we would store it with a data type of "float64" and use some other attribute to indicate that the unit is "degrees C". For datetime, a unit rather than a separate data type would also seem to me to be cleaner, but many data storage formats, including zarr v2 as implemented by zarr-python, do have a separate data type for datetime. |
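The temperature example can be sketched as a metadata fragment, written here in Python. This is an illustration under assumptions: the `units` attribute key is a convention chosen for the example, and the other fields an array's metadata would need are omitted.

```python
import json

# Fragment of array metadata: the data type stays a plain "float64",
# and the unit travels as a user attribute rather than as part of the type.
# The "units" key is an assumed convention, not something the thread defines.
array_metadata = {
    "data_type": "float64",
    "attributes": {"units": "degrees C"},
}

print(json.dumps(array_metadata, indent=2))
```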
We seem to have strayed from my original question, namely, does the full datatype definition, including |
If the data type has no configuration options, then it can be specified as a plain string, |
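A reader that accepts both forms might normalize them as in the sketch below. The helper name `normalize_data_type` is hypothetical, the object form with `name` and `configuration` keys is assumed from the discussion above, and the `char` type with an `encoding` option is the hypothetical example from earlier in the thread, not a standardized data type.

```python
def normalize_data_type(spec):
    """Return (name, configuration) for a plain-string or object data type."""
    if isinstance(spec, str):
        # A plain string means a data type with no configuration options.
        return spec, {}
    if isinstance(spec, dict) and isinstance(spec.get("name"), str):
        # Object form: a name plus an optional configuration mapping.
        return spec["name"], dict(spec.get("configuration", {}))
    raise ValueError(f"unrecognized data type specification: {spec!r}")


print(normalize_data_type("float64"))
print(normalize_data_type({"name": "char", "configuration": {"encoding": "utf-8"}}))
```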
An alternative is to reify the type and declare it once in, say, the zarr.info of a group, and then |
For the cases we've discussed so far, In general I think nominal typing is problematic because a given program may be working with arrays from more than one group. For example, suppose you are copying an array from local disk storage to s3. If the local disk array somehow references a data type defined in its parent group, and you compare data types nominally rather than structurally, there would be no way for the array stored on s3 to specify the same data type. |
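The difference between the two comparison styles in the copy scenario can be sketched in Python. All names here are hypothetical: the store URLs, the `/g/T` definition path, and the `datetime` configuration are made up for illustration, and a "nominal" reference is modeled simply as the pair of store and path where the type was declared.

```python
def structurally_equal(a, b):
    """Two data types match if their names and configurations match."""
    return (a["name"] == b["name"]
            and a.get("configuration", {}) == b.get("configuration", {}))


def nominally_equal(a_ref, b_ref):
    """Two data types match only if they refer to the same declaration."""
    return a_ref == b_ref


# The same definition, declared once per store (hypothetical locations).
local_ref = ("file:///data", "/g/T")
s3_ref = ("s3://bucket", "/g/T")
local_def = {"name": "datetime", "configuration": {"unit": "ns"}}
s3_def = {"name": "datetime", "configuration": {"unit": "ns"}}

print(structurally_equal(local_def, s3_def))  # identical content
print(nominally_equal(local_ref, s3_ref))     # different declarations
```

Under structural comparison the copied array's type matches the original; under nominal comparison it cannot, because the declaration it points to lives in a different store.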
Sure there is. The standard approach is to use fully qualified names (FQNs). |
Can you give an example of what you have in mind? I think we run into a problem if the "fully-qualified name" is both the unique identifier and the location of the definition. If the "fully-qualified name" is independent of the location of the definition, or the definition is always provided inline with the fully-qualified name, then it seems fine. |
I don't follow your last paragraph.
|
Can you clarify how the name works? Using the datetime example, would |
Regarding the "fully-qualified" name, suppose we are working with two arrays, one stored on the local disk and one stored on s3. How do we know that "/g/T" relative to some group on our local disk is supposed to be the same as "/g/T" relative to some group on s3? Even if they have identical base data type name and configuration parameters, per the idea of nominal typing we would not want to assume that they are identical. |
Yes, T would be e.g. datetime. |
I misread this comment. My assumption has always been that the metadata for a given file was self-contained |
Currently there isn't really any type of "reference" to other arrays/groups/attributes/metadata fields of any kind anywhere in the spec. |
Interesting. That validates my belief that array declarations are intended to be self-contained. |
It appears that a data type extension must be re-defined at every use (specifically in each array's metadata).
It would certainly be useful if a data type extension could be defined once somewhere and used where needed.