Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: enhance API filetype detection #445

Merged
merged 3 commits into from
Aug 6, 2024
Merged

feat: enhance API filetype detection #445

merged 3 commits into from
Aug 6, 2024

Conversation

awalker4
Copy link
Collaborator

@awalker4 awalker4 commented Aug 6, 2024

Use the library for filetype detection

The mimetype detection has always been very naive in the API - we rely on the file extension. If the user doesn't include a filename, we return an error that Filetype None is not supported. The library has a detect_filetype that actually inspects the file bytes, so let's reuse this.

Add a content_type param to override filetype detection

Add an optional content_type param that allows the user to override the filetype detection. We'll use this value if it's set, or take the file.content_type which is based on the multipart Content-Type header. This provides an alternative when clients are unable to modify the header.

Testing

The important thing is that test_happy_path_all_types passes in the docker smoke test - this contains all filetypes that we want the API to support.

To test manually, you can try sending files to the server with and without the filename/content_type defined.

Check out this branch and run make run-web-app.

Example sending with no extension in filename. This correctly processes a pdf.

import requests

filename = "sample-docs/layout-parser-paper-fast.pdf"
url = "http://localhost:8000/general/v0/general"

with open(filename, 'rb') as f:
    files = {'files': ("sample-doc", f)}
    response = requests.post(url, files=files)
    print(response.text)

For the new param, you can try modifying the content type for a text based file.

Verify that you can change the metadata.filetype of the response using the new param:

 curl --location 'http://localhost:8000/general/v0/general' \
--form 'files=@"sample-docs/family-day.eml"' \
--form 'content_type="text/plain"'

[
    {
        "type": "UncategorizedText",
        "element_id": "5cafe1ce2b0a96f8e3eba232e790db19",
        "text": "MIME-Version: 1.0 Date: Wed, 21 Dec 2022 10:28:53 -0600 Message-ID: <CAPgNNXQKR=o6AsOTr74VMrsDNhUJW0Keou9n3vLa2UO_Nv+tZw@mail.gmail.com> Subject: Family Day From: Mallori Harrell <mallori@unstructured.io> To: Mallori Harrell <mallori@unstructured.io> Content-Type: multipart/alternative; boundary=\"0000000000005c115405f0590ce4\"",
        "metadata": {
            "filename": "family-day.eml",
            "languages": [
                "eng"
            ],
            "filetype": "text/plain"
        }
    },
    ...
]

The mimetype detection has always been very naive in the API - we rely on the file extension. If the user doesn't include a filename, we return an error that `Filetype None is not supported`. The library has a `detect_filetype` that actually inspects the file bytes, so let's reuse this.
Add an optional `content_type` param that allows the user to override the filetype detection. We'll use this value if it's set, or take the `file.content_type` which is based on the multipart `Content-Type` header. This provides an alternative when clients are unable to modify the header.

Also, rename a `content_type` variable that's actually referring to the response mimetype.
@@ -100,6 +101,15 @@ def as_form(
),
BeforeValidator(SmartValueParser[bool]().value_or_first_element),
] = False,
content_type: Annotated[
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about calling this parameter content_type_hint to be more explicit?

Copy link
Contributor

@MthwRobinson MthwRobinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question but LGTM overall! Really like how much that cleaned up the MIME type logic.

@awalker4 awalker4 enabled auto-merge (squash) August 6, 2024 15:54
@awalker4 awalker4 disabled auto-merge August 6, 2024 16:00
@awalker4 awalker4 enabled auto-merge (squash) August 6, 2024 16:02
@awalker4 awalker4 merged commit 7468938 into main Aug 6, 2024
6 checks passed
@awalker4 awalker4 deleted the filetype-detection branch August 6, 2024 16:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants