Issue with archives.Identify when filename contains a compression extension #7

Open
luotianqi777 opened this issue Dec 17, 2024 · 2 comments

luotianqi777 commented Dec 17, 2024

Description:

I encountered an issue while using the archives.Identify function from the github.com/mholt/archives package. When the filename merely contains a compression extension as a substring (e.g., test.gzero.zip, which contains .gz), the function returns an error: gzip: invalid header.

Here is the code snippet to reproduce the issue:

package main

import (
	"context"
	"fmt"
	"io"
	"os"

	"github.com/mholt/archives"
)

func main() {
	ctx := context.Background()

	stream, err := os.Open("testzip") // a zip file, opened as a stream
	if err != nil {
		fmt.Println(err)
		return
	}
	defer stream.Close()

	_, _, err = archives.Identify(ctx, "test.gzero.zip", stream) // filename contains the substring ".gz"
	fmt.Println(err) // matching zip: gzip: invalid header

	// Rewind so the second call reads the stream from the beginning.
	if _, err = stream.Seek(0, io.SeekStart); err != nil {
		fmt.Println(err)
		return
	}

	_, _, err = archives.Identify(ctx, "test.zip", stream)
	fmt.Println(err) // <nil>
}

Steps to Reproduce:

  1. Open a zip file stream using os.Open.
  2. Pass the filename with a compression extension (e.g., test.gzero.zip) to archives.Identify.
  3. Observe that the function returns an error (gzip: invalid header).

luotianqi777 commented Dec 17, 2024

The current implementation of the filename matching logic in archives.Identify uses strings.Contains, which may result in incorrect matches when the filename contains multiple extensions. For example, a file named test.gzero.szx.zip may cause unexpected behavior.

// match filename
if strings.Contains(strings.ToLower(filename), gz.Name()) {
	mr.ByName = true
}

Maybe we should split the filename by . and check each part for equality with the expected format.

// match filename
for _, w := range strings.Split(filename, ".")[1:] {
	if strings.EqualFold(gz.Name(), "."+w) {
		mr.ByName = true
		break
	}
}
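To make the difference between the two checks concrete, here is a minimal standalone sketch. The helper names containsExt and hasExtSegment are hypothetical and are not part of the archives package; the extension is passed in as a plain string such as ".gz" instead of coming from gz.Name().

```go
package main

import (
	"fmt"
	"strings"
)

// containsExt mirrors the substring check quoted above (hypothetical helper).
func containsExt(filename, ext string) bool {
	return strings.Contains(strings.ToLower(filename), ext)
}

// hasExtSegment reports whether any dot-separated segment after the first
// one equals ext, ignoring case. ext includes the leading dot, e.g. ".gz".
// This is the proposed check, rewritten as a standalone function.
func hasExtSegment(filename, ext string) bool {
	for _, seg := range strings.Split(filename, ".")[1:] {
		if strings.EqualFold("."+seg, ext) {
			return true
		}
	}
	return false
}

func main() {
	names := []string{"test.gzero.zip", "test.tar.gz", "TEST.GZ", "foo.gz.zip"}
	for _, name := range names {
		fmt.Printf("%-16s contains=%v segment=%v\n",
			name, containsExt(name, ".gz"), hasExtSegment(name, ".gz"))
	}
}
```

With this sketch, test.gzero.zip no longer matches .gz, but a name such as foo.gz.zip still does, so segment matching alone does not settle which layer is the outer one.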

Or provide a strict mode that matches only using the file header.
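
As a rough illustration of what header-only matching could look like, the sketch below inspects leading magic bytes and ignores the filename entirely. It is not how the archives package is implemented; sniffFormat is a hypothetical helper, and it only knows two signatures: the gzip magic bytes 0x1f 0x8b and the zip local file header "PK\x03\x04".

```go
package main

import (
	"bytes"
	"fmt"
)

// Leading signatures for the two formats discussed in this issue.
var (
	gzipMagic = []byte{0x1f, 0x8b}
	zipMagic  = []byte("PK\x03\x04") // local file header; empty zip archives start differently
)

// sniffFormat is a hypothetical helper that guesses the format from the
// first bytes of a stream. It returns "" when no known signature matches.
func sniffFormat(header []byte) string {
	switch {
	case bytes.HasPrefix(header, gzipMagic):
		return ".gz"
	case bytes.HasPrefix(header, zipMagic):
		return ".zip"
	default:
		return ""
	}
}

func main() {
	fmt.Println(sniffFormat([]byte("PK\x03\x04rest of archive")))
	fmt.Println(sniffFormat([]byte{0x1f, 0x8b, 0x08}))
}
```

Under this approach a zip stream named test.gzero.zip would be identified by its content alone, at the cost of not being able to identify anything when no stream is available.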

mholt (Owner) commented Dec 17, 2024

Well, maybe we need to start with test cases then.

Is foo.tar.gz a tar file or a gzipped file?

What is foo.gz.zip?

Anyway, I agree we could improve this logic, but I am not sure what the answers are yet.

Identify(), and thus Match(), are used to determine how to read files... typically they expect an outer compression layer, if any, and then an archive format if there's a second match, within the compressed layer (if any).

So maybe the answer is a combination of chopping off a file extension after matching it, before matching the inner layer, or something like that; and making the filename matching more strict.
