Improve identification of XML based formats #13

Open
jackdos opened this issue Nov 27, 2019 · 3 comments

@jackdos

jackdos commented Nov 27, 2019

A number of file formats are effectively XML of a specific kind. Their signatures currently rely on finding the correct namespace for the XML schema in the byte sequence, often within X bytes of the BOF. This has a couple of issues:

  • In files where lots of namespaces are referenced, there is no guarantee that the default namespace is within X bytes of the start of the file; it only has to be within the opening tag of the root node.

  • In some cases, the root node can be in the wrong namespace, but the correct namespace can still be referenced within X bytes of the start of the file as part of an embedded set of tags.

More formally, we generally want to declare that a format is XML with a root node of X, within the namespace Y. It might make sense to stop at identification of fmt/101 for XML, then hand off to some specific XML parsing to assess the root node and namespace criteria.
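A minimal sketch of the hand-off idea in Python, assuming the file has already been identified as XML. The function names `root_qname` and `is_format` and the `example.org` namespaces are made up for illustration; they are not part of any existing tool. It checks only the root element, so a crowded opening tag or a matching namespace on a nested element does not affect the result:

```python
import io
import xml.etree.ElementTree as ET


def root_qname(data):
    """Return (namespace, local_name) of the root element, or None if the
    bytes cannot be parsed as XML up to the root's opening tag."""
    try:
        # The first "start" event belongs to the root element, so we
        # return as soon as we see it.
        for _event, elem in ET.iterparse(io.BytesIO(data), events=("start",)):
            if elem.tag.startswith("{"):
                # ElementTree reports namespaced tags as "{namespace}local".
                namespace, local_name = elem.tag[1:].split("}", 1)
                return namespace, local_name
            return "", elem.tag
    except ET.ParseError:
        pass
    return None


def is_format(data, root_node, namespace):
    """True if the root element is `root_node` in `namespace` (hypothetical helper)."""
    return root_qname(data) == (namespace, root_node)


# Hypothetical sample documents.
CROWDED = (
    b'<?xml version="1.0"?>'
    b'<root xmlns:a="http://example.org/a" xmlns:b="http://example.org/b" '
    b'xmlns="http://example.org/ns/y"><a:child/></root>'
)
EMBEDDED = (
    b'<other xmlns="http://example.org/ns/z">'
    b'<root xmlns="http://example.org/ns/y"/></other>'
)
PREFIXED = b'<p:root xmlns:p="http://example.org/ns/y"/>'
```

With these samples, `is_format(CROWDED, "root", "http://example.org/ns/y")` and the same call on `PREFIXED` should both succeed, while `EMBEDDED` should be rejected because its root node is in a different namespace.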

This might alternatively be solved with something that allows more regex-like specifications, including back-referenceable groups and greedy/non-greedy matching, and might thus be related to #12, i.e.:

[Optional <?xml.. declaration] followed by opening <, followed by (optional prefix) + :, followed by {RootNode}, followed by non-greedy matching of any number of bytes (not including >), followed by "xmlns[optional :\1]={SpecificNamespace}", followed by non-greedy matching of any number of bytes, followed by >

Which hopefully says, "the XML declaration is optional, after that, within the first <> pair, you should have a specific RootNode name, optionally prefixed with an unknown string, and a namespace declaration either of a default, or the unknown prefix string named namespace, where the namespace itself is {SpecificNamespace}".
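To make the pattern above concrete, here is a rough translation into Python's `re` syntax. This is only a sketch of what such a specification could look like, not a proposal for the actual signature syntax; `root_pattern` and the `example.org` namespaces are hypothetical. It also assumes a single-byte-compatible encoding, no BOM, comment or DOCTYPE before the root node, and no `>` inside attribute values:

```python
import re


def root_pattern(root_node, namespace):
    """Compile a bytes regex for 'root node `root_node` in `namespace`'."""
    root = re.escape(root_node.encode("ascii"))
    ns = re.escape(namespace.encode("ascii"))
    return re.compile(
        rb"\A\s*(?:<\?xml[^>]*\?>\s*)?"      # optional <?xml ...?> declaration
        + rb"<(?:(?P<prefix>[^\s:<>/]+):)?"  # '<' plus optional 'prefix:'
        + root
        + rb"(?=[\s/>])"                     # the root name must end here
        + rb"[^>]*?"                         # non-greedy, never crossing '>'
        + rb"\sxmlns(?(prefix):(?P=prefix))" # 'xmlns' or 'xmlns:<same prefix>'
        + rb"\s*=\s*[\"']" + ns + rb"[\"']"  # the specific namespace, quoted
        + rb"[^>]*?>"                        # rest of the opening tag
    )


# Hypothetical root node and namespace.
PATTERN = root_pattern("root", "http://example.org/ns/y")


def matches(data):
    return PATTERN.match(data) is not None
```

The conditional group `(?(prefix)...)` plays the role of "optional :\1": if a prefix was captured on the root node, the namespace declaration must use that same prefix, otherwise it must be the default `xmlns`.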

@JonTilbury

Attached files demonstrate the problem for GPX (fmt/1134)
example1.gpx – (route from Strava) - identified OK
example2.gpx – (route from OS Maps) - identified as generic XML
example6.gpx – (ride from Strava) - identified as generic XML

GPXexamples.zip

@marhop

marhop commented Jan 6, 2020

Just in case you missed it, also have a look at @richardlehane's thoughts on this topic and the discussion around them.

@jackdos
Author

jackdos commented Jan 6, 2020

Thanks @marhop, I had missed that conversation the first time around. Without having taken the time to fully absorb everything, it seems like pushing for some extension of the regex-like syntax, with classes and back-referenceable groups, might allow us to solve a number of issues.
