There are a number of file formats that are effectively XML of a specific kind. Their signatures currently rely on finding the correct namespace for the XML schema in the byte sequence, often within X bytes of the BOF. This has a couple of issues:
In files where lots of namespaces are referenced, there is no guarantee that the default namespace is within X bytes of the start of the file; it only needs to be within the opening tag of the root node.
In some cases, the root node can be in the wrong namespace, while the correct namespace is still referenced within X bytes of the start of the file as part of an embedded set of tags.
More formally, we generally want to declare that a format is XML with a root node of X, within the namespace Y. It might make sense to stop at identification of fmt/101 for XML, then hand off to some specific XML parsing to assess the root node and namespace criteria.
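A minimal sketch of that hand-off step, assuming Python and its standard-library `xml.etree.ElementTree`. The function name `root_matches` and the constant `GPX_NS` are hypothetical names used only for illustration, not part of any existing identification tool. The parser is stopped as soon as the root element's start tag has been read, so the position of the namespace declaration within that tag does not matter:

```python
import io
import xml.etree.ElementTree as ET

# GPX 1.1 namespace, used here only as an example value.
GPX_NS = "http://www.topografix.com/GPX/1/1"


def root_matches(data: bytes, root_name: str, namespace: str) -> bool:
    """Return True if the root element of `data` is `root_name` in `namespace`.

    Hypothetical helper: it would run only after a generic XML match.
    """
    try:
        # The first "start" event is always the root element, so we can
        # return immediately without reading the rest of the document.
        for _event, elem in ET.iterparse(io.BytesIO(data), events=("start",)):
            # ElementTree reports qualified names as "{namespace}localname",
            # regardless of which prefix (if any) the file used.
            return elem.tag == f"{{{namespace}}}{root_name}"
    except ET.ParseError:
        return False
    return False
```

With this approach a root written as `<gpx xmlns="...">` and one written as `<g:gpx xmlns:g="...">` are treated the same, while a GPX fragment embedded inside a root from another namespace is rejected.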
This might alternatively be solved with something that allows more regex-like specifications, including back-referenceable groups and greedy/non-greedy matching, and might thus be related to #12, for example:
[Optional <?xml.. declaration] followed by opening <, followed by (optional prefix) + :, followed by {RootNode}, followed by non-greedy matching of any number of bytes (not including >), followed by "xmlns[optional :\1]={SpecificNamespace}", followed by non-greedy matching of any number of bytes, followed by >
Which hopefully says, "the XML declaration is optional, after that, within the first <> pair, you should have a specific RootNode name, optionally prefixed with an unknown string, and a namespace declaration either of a default, or the unknown prefix string named namespace, where the namespace itself is {SpecificNamespace}".
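The description above can be approximated with an ordinary regular expression. The following is a sketch in Python's `re` syntax over bytes, not a proposal for the actual signature syntax. The function name `root_signature` is hypothetical, and the optional prefix is handled with two alternatives (unprefixed root with a default `xmlns`, or prefixed root with a matching `xmlns:prefix`) instead of a single optional back-reference. It also assumes nothing but whitespace and an optional XML declaration precedes the root tag, and, like the description, it does not handle a literal `>` inside an attribute value:

```python
import re


def root_signature(root_name: bytes, namespace: bytes) -> re.Pattern:
    """Build a pattern for 'root element root_name declared in namespace'."""
    root = re.escape(root_name)
    ns = re.escape(namespace)
    parts = [
        rb"\A(?:\xef\xbb\xbf)?\s*",      # optional UTF-8 BOM and whitespace
        rb"(?:<\?xml[^>]*\?>\s*)?",      # optional XML declaration
        rb"<(?:",
        # Alternative 1: unprefixed root with a default namespace declaration.
        root + rb"(?=\s)[^>]*?\sxmlns",
        rb"|",
        # Alternative 2: prefixed root; the declaration must reuse the prefix.
        rb"(?P<prefix>[A-Za-z_][\w.-]*):" + root
        + rb"(?=\s)[^>]*?\sxmlns:(?P=prefix)",
        rb")",
        rb"\s*=\s*[\"']" + ns + rb"[\"']",  # ={SpecificNamespace}
        rb"[^>]*?>",                     # remainder of the opening tag
    ]
    return re.compile(b"".join(parts))


# Example: a signature for GPX 1.1 files.
GPX_SIGNATURE = root_signature(b"gpx", b"http://www.topografix.com/GPX/1/1")
```

Calling `GPX_SIGNATURE.match(data)` on the first bytes of a file returns a match object only when the namespace is declared inside the root element's own opening tag, however many other namespaces are declared before it.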
Attached files demonstrate the problem for GPX (fmt/1134)
example1.gpx – (route from Strava) - identified OK
example2.gpx – (route from OS Maps) - identified as generic XML
example6.gpx – (ride from Strava) - identified as generic XML
Thanks @marhop, I had missed that conversation the first time around. Without taking the time to fully absorb everything, it seems like pushing for some extension of the regex-like syntax, with classes and back-referenceable groups, might allow us to solve a number of issues.