Container patterns with variably named files #10

Open
Dclipsham opened this issue Oct 15, 2019 · 18 comments

Comments

@Dclipsham

Container signatures currently require finding files with specific names within the ZIP container. In certain circumstances the file names may be variable. For Gnumeric, for example, and other formats that use GZIP-based compression as standard, the contained file usually shares the name of the GZ container.
For Thumbs.db generated later than Windows XP, the contained file is named with what appears to be a partial checksum, with a pattern '256_xxxxxxxxxxxxxxxx' where each 'x' is a value from the hexadecimal range.

It would therefore be useful to be able to express container signatures with variably named files. This needs careful consideration: a full wildcard name, for example, would mean that every file inside any ZIP container would have to be scanned!
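
To illustrate the wildcard concern, here is a minimal sketch in Python. It assumes glob-style matching via the standard library's `fnmatch` module; the helper `members_to_scan` and the member listing are hypothetical, made up for illustration, and do not reflect how DROID or Siegfried work internally.

```python
import fnmatch

def members_to_scan(member_names, path_pattern):
    """Return the members whose names a glob-style path would select."""
    return [n for n in member_names if fnmatch.fnmatchcase(n, path_pattern)]

# Hypothetical listing of a container's members, for illustration only.
members = ["Catalog", "256_0123456789abcdef", "96_0123456789abcdef", "notes.txt"]

# A bounded pattern selects only the names with the expected shape.
bounded = members_to_scan(members, "256_" + "?" * 16)

# A bare wildcard selects every member, so every file would need scanning.
unbounded = members_to_scan(members, "*")
```

With a bounded pattern only one of the four members is selected, while the bare `*` selects all of them.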

@thorsted

A use case for this would be iWork 2019 documents. In order to distinguish between Pages, Numbers, and Keynote, specific files would need to be referenced, but files specific to, say, Keynote have variable filenames.

In one sample we have "Slide1.iwa"; in a later sample we have "Slide-9273.iwa" or "Slide-10181.iwa". The files have the same content, but can't be referenced in container ""

@thorsted

thorsted commented Sep 8, 2020

Another container format with variable names is the project file for Autodesk ReCap. .RCP files contain three files:
a BMP, a JPG, and an XML. The JPG and XML are named with a 32-character alphanumeric string that is unique to the file. The XML contains the root tag <Autodesk Version="1.0">, which would be an identifiable string for a signature, but the name of the XML file is not static.
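
A rough sketch of what this would need from a tool, written in Python with the standard `zipfile` and `fnmatch` modules. The function `find_member_with_marker`, the 4096-byte read limit, and the in-memory stand-in archive are all assumptions for illustration; this is not a real .rcp file and not how any existing tool is implemented.

```python
import fnmatch
import io
import zipfile

def find_member_with_marker(zip_bytes, name_glob, marker, max_bytes=4096):
    """Return the first member whose name matches name_glob and whose
    first max_bytes bytes contain marker, or None if there is none."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            if not fnmatch.fnmatchcase(name, name_glob):
                continue
            with zf.open(name) as fh:
                if marker in fh.read(max_bytes):
                    return name
    return None

# Build a small stand-in archive in memory (hypothetical names and content).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("preview.bmp", b"BM")
    zf.writestr("0123456789abcdef0123456789abcdef.jpg", b"\xff\xd8\xff")
    zf.writestr("0123456789abcdef0123456789abcdef.xml",
                b'<Autodesk Version="1.0"></Autodesk>')

# 32 single-character wildcards stand in for the variable part of the name.
hit = find_member_with_marker(buf.getvalue(), "?" * 32 + ".xml",
                              b'<Autodesk Version="1.0">')
```

Here the file name is matched by shape only, and the root tag does the actual identification.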

@thorsted

The USDZ file format is another ZIP container file format with variable names and folders, making identification difficult. The USDC file inside can have a static name, a UUID, or both nested deep within other variable folder names.

@thorsted

thorsted commented Apr 1, 2022

Another container format with variable names is the project file for Autodesk ReCap. .RCP files contain three files: a BMP, a JPG, and an XML. The JPG and XML are named with a 32-character alphanumeric string that is unique to the file. The XML contains the root tag <Autodesk Version="1.0">, which would be an identifiable string for a signature, but the name of the XML file is not static.

Link to sample RCP file

@richardlehane

I propose using glob patterns to express these names.

Rationale:

  • it is the standard for filename pattern matching in Unix
  • not as complex/costly as full regex
  • Java, Python and Go all have standard library support, hopefully reducing the cost of implementing this.
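
As a small illustration of the standard library point, here is a sketch using Python's `fnmatch` module. The first three names are the iWork samples mentioned earlier in the thread; `Document.iwa` is a made-up counter-example. `fnmatch.translate` turns a glob into an equivalent regular expression, which hints at why globs are the simpler subset.

```python
import fnmatch
import re

names = ["Slide1.iwa", "Slide-9273.iwa", "Slide-10181.iwa", "Document.iwa"]

# Compile the glob to a regular expression once, then reuse it.
glob_regex = re.compile(fnmatch.translate("Slide*.iwa"))

matched = [n for n in names if glob_regex.match(n)]
```

The single pattern covers all three slide names and leaves the unrelated file alone.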

@thorsted

Another example of container format with variable name:

Tableau Packaged Workbook (.twbx)
Samples here: https://community.tableau.com/s/topic/0TO4T000000RcA5WAK/workbook-calculation-library

@thorsted

Another example of a container format with a variable named file.

Web Archive Collection Zipped (WACZ)
https://specs.webrecorder.net/wacz/1.1.1/#archive

@Dclipsham
Author

@thorsted with WACZ there appear to be enough mandatory files (e.g. datapackage.json, pages/pages.jsonl) to create a reliable container signature pattern. We've got a signature going through internal testing right now that's proving reliable so far. Do you have files that are variant?

@thorsted

@thorsted with WACZ there appear to be enough mandatory files (e.g. datapackage.json, pages/pages.jsonl) to create a reliable container signature pattern. We've got a signature going through internal testing right now that's proving reliable so far. Do you have files that are variant?

No variants. Greg and I did send the following in to PRONOM; it looks like it will be released in v110.

<File>
    <Path>datapackage.json</Path>
    <BinarySignatures>
        <InternalSignatureCollection>
            <InternalSignature ID="300">
                <ByteSequence Reference="BOFoffset">
                    <SubSequence Position="1" SubSeqMinOffset="0" SubSeqMaxOffset="4096">
                        <Sequence>'wacz_version'</Sequence>
                    </SubSequence>
                </ByteSequence>
            </InternalSignature>
        </InternalSignatureCollection>
    </BinarySignatures>
</File>

@Dclipsham
Author

Great, thank you!

richardlehane added a commit to richardlehane/siegfried that referenced this issue May 27, 2023
@richardlehane

I've tried implementing this on siegfried's develop branch. Now, if you use a container path that looks like a glob (contains *, ? or [] chars) and is a valid glob, it will do glob matching instead of literal string matching.

@thorsted if you'd like some binaries to try let me know what OS and I can build for you. Or, if you can share some files and container signatures I can test for you.

With glob syntax you can use the * (many characters) and ? (single character) wildcards:

  • Slide*.iwa
  • ????????-????-????-????-????????????.xml

You can also do character sets:

  • 256_[abcdefABCDEF0123456789][abcdefABCDEF0123456789][abcdefABCDEF0123456789][abcdefABCDEF0123456789][abcdefABCDEF0123456789][abcdefABCDEF0123456789][abcdefABCDEF0123456789][abcdefABCDEF0123456789][abcdefABCDEF0123456789][abcdefABCDEF0123456789][abcdefABCDEF0123456789][abcdefABCDEF0123456789][abcdefABCDEF0123456789][abcdefABCDEF0123456789][abcdefABCDEF0123456789][abcdefABCDEF0123456789]

Character sets with a lot of repeats are verbose but maybe 256_???????????????? is precise enough anyway.
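
To make the precision trade-off concrete, here is a small sketch using Python's `fnmatch` as a stand-in for the glob matcher (siegfried itself is written in Go, so this is only an approximation of its behaviour). The candidate names are made up for illustration.

```python
import fnmatch

HEX = "[abcdefABCDEF0123456789]"
strict = "256_" + HEX * 16   # the character-set form from the bullet above
loose = "256_" + "?" * 16    # the shorter single-character wildcard form

# Hypothetical candidate names.
candidates = ["256_0123456789abcdef", "256_zzzzzzzzzzzzzzzz", "256_0123"]

# Map each name to (matches strict, matches loose).
results = {
    name: (fnmatch.fnmatchcase(name, strict), fnmatch.fnmatchcase(name, loose))
    for name in candidates
}
```

Both forms enforce the length, but only the character-set form rejects a 16-character suffix that is not hexadecimal.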

One issue I hit was a number of container signatures have "[Content_Types].xml" as the path test. If you interpret this as a glob, then the literal path won't match (instead C.xml, o.xml, n.xml etc. would all match). I ended up just special casing this but it is a potential future footgun unless you can distinguish when paths should be interpreted as globs/regex or as literal strings.
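
The same footgun can be reproduced with Python's `fnmatch`, shown here only as an analogue of the behaviour described above. `glob.escape` is Python's own way of forcing a literal reading; whether an equivalent escape suits container signatures is an open question.

```python
import fnmatch
import glob

path = "[Content_Types].xml"

# Read as a glob, the bracket expression is a one-character set,
# so the literal name no longer matches its own pattern.
literal_matches_itself = fnmatch.fnmatchcase(path, path)

# Single-character names drawn from the set match instead.
single_char_matches = [
    n for n in ("C.xml", "o.xml", "n.xml", "x.xml")
    if fnmatch.fnmatchcase(n, path)
]

# Escaping the glob metacharacters restores the literal reading.
escaped = glob.escape(path)
escaped_matches_literal = fnmatch.fnmatchcase(path, escaped)
```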

@ross-spencer

ross-spencer commented May 28, 2023

Nice work Richard!

I ended up just special casing this but it is a potential future footgun unless you can distinguish when paths should be interpreted as globs/regex or as literal strings.

Depending how far down the development path recent PRONOM work is, extending the container XML might be an option. Perhaps an attribute in the XML?

  • <path type="glob">ACDC</path>

Or a new element?

  • <globPath>AC*DC</globPath>

There are a number of reasons I feel it would be better if the pattern type were explicit, rather than left for the reader of the signature file to interpret.

Depending how far down the development path recent PRONOM work is...

Answering my own question a bit as I write, but if DROID didn't complain about either of the two options, i.e. simply ignored them, then the XML could maybe be agreed upon and extended prior to its inclusion in a future PRONOM user interface?
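
A sketch of how a reader could honour an explicit marker, using Python's `xml.etree.ElementTree`. The `type="glob"` attribute on `Path` is purely hypothetical, modelled on the first option above, and is not part of any published container signature schema; `build_matchers` is an illustrative helper name.

```python
import fnmatch
import xml.etree.ElementTree as ET

# Hypothetical fragment: the type attribute is an assumption, not real schema.
FRAGMENT = """
<Files>
  <File><Path>[Content_Types].xml</Path></File>
  <File><Path type="glob">Slide*.iwa</Path></File>
</Files>
"""

def build_matchers(xml_text):
    """Return one matcher per Path element: glob if marked, else literal."""
    matchers = []
    for path_el in ET.fromstring(xml_text).iter("Path"):
        pattern = path_el.text or ""
        if path_el.get("type") == "glob":
            matchers.append(lambda name, p=pattern: fnmatch.fnmatchcase(name, p))
        else:
            matchers.append(lambda name, p=pattern: name == p)
    return matchers

literal_match, glob_match = build_matchers(FRAGMENT)
```

With the type stated in the file, `[Content_Types].xml` stays literal without any special casing, and only paths marked as globs are treated as patterns.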

@richardlehane

Another option would be to enclose ambiguous paths (just [Content_Types].xml at the moment) within single or double quote marks to force literal matching (the same way you do it in a command line)
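
A minimal sketch of the quoting idea in Python. The function `match_path` and its rules (matching outer quotes force a literal comparison, otherwise any of `*`, `?` or `[` triggers glob matching) are assumptions for illustration and not the siegfried implementation.

```python
import fnmatch

def match_path(signature_path, member_name):
    """Match a container member name against a signature path.

    A path wrapped in matching single or double quotes is compared
    literally; an unquoted path containing glob characters is a glob.
    """
    quoted = (
        len(signature_path) >= 2
        and signature_path[0] == signature_path[-1]
        and signature_path[0] in "'\""
    )
    if quoted:
        return member_name == signature_path[1:-1]
    if any(ch in signature_path for ch in "*?["):
        return fnmatch.fnmatchcase(member_name, signature_path)
    return member_name == signature_path
```

For example, `match_path('"[Content_Types].xml"', "[Content_Types].xml")` compares literally, while the unquoted form is read as a glob and does not match the literal name.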

@thorsted

Started keeping a list. May be others I need to add.
https://docs.google.com/spreadsheets/d/120Xt6oP4QVV3aj_MelvewytjBJNL4RgR-6z0DHWMT_E/edit?usp=sharing

@richardlehane I would love to test, any Mac version will do. I can also put together a test set of formats with these unique structures for all of us to test with.

@richardlehane

@richardlehane I would love to test, any Mac version will do. I can also put together a test set of formats with these unique structures for all of us to test with.

There are fresh sf and roy binaries in the *mac64.zip file here: https://github.com/richardlehane/siegfried/releases/tag/v1.11.0-rc1

@ross-spencer

Related issue: digital-preservation/droid#823

@ross-spencer

Another option would be to enclose ambiguous paths (just [Content_Types].xml at the moment) within single or double quote marks to force literal matching (the same way you do it in a command line)

Much more elegant.

The only issue may be backward compatibility.

If you zip a file with the name [content-type].xml and use this combination of signature files:

Standard sig

<?xml version="1.0" encoding="UTF-8"?>
<FFSignatureFile xmlns="http://www.nationalarchives.gov.uk/pronom/SignatureFile" Version="1" DateCreated="2023-05-31T14:02:16">
 <InternalSignatureCollection>
  <InternalSignature ID="2" Specificity="Specific">
   <ByteSequence Reference="BOFoffset">
    <SubSequence Position="1" MinFragLength="0" SubSeqMinOffset="0" SubSeqMaxOffset="4">
     <Sequence>504B0304</Sequence>
    </SubSequence>
   </ByteSequence>
   <ByteSequence Reference="EOFoffset">
    <SubSequence Position="1" MinFragLength="0" SubSeqMinOffset="61" SubSeqMaxOffset="65565">
     <Sequence>504B01</Sequence>
    </SubSequence>
   </ByteSequence>
   <ByteSequence Reference="EOFoffset">
    <SubSequence Position="1" MinFragLength="0" SubSeqMinOffset="0" SubSeqMaxOffset="65535">
     <Sequence>504B0506</Sequence>
    </SubSequence>
   </ByteSequence>
  </InternalSignature>
  <InternalSignature ID="3" Specificity="Specific">
   <ByteSequence Reference="BOFoffset">
    <SubSequence Position="1" MinFragLength="0" SubSeqMinOffset="0" SubSeqMaxOffset="0">
     <Sequence>504B0304</Sequence>
    </SubSequence>
   </ByteSequence>
   <ByteSequence Reference="BOFoffset">
    <SubSequence Position="1" MinFragLength="0" SubSeqMinOffset="4" SubSeqMaxOffset="30">
     <Sequence>5B436F6E74656E745F54797065735D2E786D6C20A2</Sequence>
    </SubSequence>
   </ByteSequence>
   <ByteSequence Reference="EOFoffset">
    <SubSequence Position="1" MinFragLength="0" SubSeqMinOffset="0" SubSeqMaxOffset="65535">
     <Sequence>504B0102</Sequence>
    </SubSequence>
   </ByteSequence>
   <ByteSequence Reference="EOFoffset">
    <SubSequence Position="1" MinFragLength="0" SubSeqMinOffset="0" SubSeqMaxOffset="65535">
     <Sequence>504B0506</Sequence>
    </SubSequence>
   </ByteSequence>
  </InternalSignature>
  <InternalSignature ID="4" Specificity="Specific">
   <ByteSequence Reference="BOFoffset">
    <SubSequence Position="1" MinFragLength="0" SubSeqMinOffset="0" SubSeqMaxOffset="0">
     <Sequence>D0CF11E0A1B11AE1</Sequence>
    </SubSequence>
   </ByteSequence>
   <ByteSequence Reference="BOFoffset">
    <SubSequence Position="1" MinFragLength="0" SubSeqMinOffset="0" SubSeqMaxOffset="28">
     <Sequence>FEFF</Sequence>
    </SubSequence>
   </ByteSequence>
  </InternalSignature>
 </InternalSignatureCollection>
 <FileFormatCollection>
  <FileFormat ID="1" Name="Development Signature" PUID="dev/1" Version="1.0" MIMEType="application/octet-stream">
   <Extension>ext</Extension>
  </FileFormat>
  <FileFormat ID="2" Name="ZIP Format" PUID="x-fmt/263" Version="" MIMEType="application/zip">
   <InternalSignatureID>2</InternalSignatureID>
   <Extension>zip</Extension>
  </FileFormat>
  <FileFormat ID="3" Name="Microsoft Office Open XML" PUID="fmt/189" Version="" MIMEType="application/octet-stream">
   <InternalSignatureID>3</InternalSignatureID>
  </FileFormat>
  <FileFormat ID="4" Name="OLE2 Compound Document Format" PUID="fmt/111" Version="" MIMEType="application/octet-stream">
   <InternalSignatureID>4</InternalSignatureID>
  </FileFormat>
 </FileFormatCollection>
</FFSignatureFile>

Container sig

<?xml version="1.0" encoding="UTF-8"?>
<ContainerSignatureMapping SchemaVersion="1.0" SignatureVersion="1">
 <ContainerSignatures>
  <ContainerSignature Id="2" ContainerType="ZIP">
   <Description>Development Signature</Description>
   <Files>
    <File>
     <Path>[content-type].xml</Path>
    </File>
   </Files>
  </ContainerSignature>
 </ContainerSignatures>
 <FileFormatMappings>
  <FileFormatMapping signatureId="2" Puid="dev/1"></FileFormatMapping>
 </FileFormatMappings>
 <TriggerPuids>
  <TriggerPuid ContainerType="OLE2" Puid="fmt/111"></TriggerPuid>
  <TriggerPuid ContainerType="ZIP" Puid="fmt/189"></TriggerPuid>
  <TriggerPuid ContainerType="ZIP" Puid="x-fmt/263"></TriggerPuid>
 </TriggerPuids>
</ContainerSignatureMapping>

Then the sequence will be matched in DROID and Siegfried:

---
siegfried   : 1.9.3
scandate    : 2023-05-31T16:10:10+02:00
signature   : default.sig
created     : 2023-05-31T16:10:08+02:00
identifiers : 
  - name    : 'pronom'
    details : 'my-standard-sig.xml; mysig.xml; built without reports'
---
filename : 'sample.ext'
filesize : 216
modified : 2023-05-31T15:35:24+02:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'dev/1'
    format  : 'Development Signature'
    version : '1.0'
    mime    : 'application/octet-stream'
    basis   : 'extension match ext; container name [content-type].xml with name only'
    warning : 

But then if that path is quoted, e.g. "[content-type].xml", neither tool knows what to do with it out of the box, so it doesn't match. Maybe there's an escape I am missing? Otherwise, I'm not sure off the top of my head what the instruction is to DROID here.

So, the question may come down to this: how do Siegfried and DROID start taking advantage of glob-enabled signature files today, while still allowing current versions to use existing container signatures?

@thorsted

Another format with a variable file name is the MXL (Compressed MusicXML) format. It has an XML file inside the container with an identifiable root entry pattern, but the name of the XML file is variable.
