feat: enhance API filetype detection #445

awalker4 · 2024-08-06T15:43:47Z

Use the library for filetype detection

The mimetype detection has always been very naive in the API - we rely on the file extension. If the user doesn't include a filename, we return an error that Filetype None is not supported. The library has a detect_filetype that actually inspects the file bytes, so let's reuse this.

Add a `content_type` param to override filetype detection

Add an optional content_type param that allows the user to override the filetype detection. We'll use this value if it's set, or take the file.content_type which is based on the multipart Content-Type header. This provides an alternative when clients are unable to modify the header.

Testing

The important thing is that test_happy_path_all_types passes in the docker smoke test - this contains all filetypes that we want the API to support.

To test manually, you can try sending files to the server with and without the filename/content_type defined.

Check out this branch and run make run-web-app.

Example sending with no extension in filename. This correctly processes a pdf.

import requests

filename = "sample-docs/layout-parser-paper-fast.pdf"
url = "http://localhost:8000/general/v0/general"

with open(filename, 'rb') as f:
    files = {'files': ("sample-doc", f)}
    response = requests.post(url, files=files)
    print(response.text)

For the new param, you can try modifying the content type for a text based file.

Verify that you can change the metadata.filetype of the response using the new param:

 curl --location 'http://localhost:8000/general/v0/general' \
--form 'files=@"sample-docs/family-day.eml"' \
--form 'content_type="text/plain"'

[
    {
        "type": "UncategorizedText",
        "element_id": "5cafe1ce2b0a96f8e3eba232e790db19",
        "text": "MIME-Version: 1.0 Date: Wed, 21 Dec 2022 10:28:53 -0600 Message-ID: <CAPgNNXQKR=o6AsOTr74VMrsDNhUJW0Keou9n3vLa2UO_Nv+tZw@mail.gmail.com> Subject: Family Day From: Mallori Harrell <mallori@unstructured.io> To: Mallori Harrell <mallori@unstructured.io> Content-Type: multipart/alternative; boundary=\"0000000000005c115405f0590ce4\"",
        "metadata": {
            "filename": "family-day.eml",
            "languages": [
                "eng"
            ],
            "filetype": "text/plain"
        }
    },
    ...
]

The mimetype detection has always been very naive in the API - we rely on the file extension. If the user doesn't include a filename, we return an error that `Filetype None is not supported`. The library has a `detect_filetype` that actually inspects the file bytes, so let's reuse this.

Add an optional `content_type` param that allows the user to override the filetype detection. We'll use this value if it's set, or take the `file.content_type` which is based on the multipart `Content-Type` header. This provides an alternative when clients are unable to modify the header. Also, rename a `content_type` variable that's actually referring to the response mimetype.

MthwRobinson · 2024-08-06T15:49:41Z

prepline_general/api/models/form_params.py

@@ -100,6 +101,15 @@ def as_form(
            ),
            BeforeValidator(SmartValueParser[bool]().value_or_first_element),
        ] = False,
+        content_type: Annotated[


What do you think about calling this parameter content_type_hint to be more explicit?

MthwRobinson

One question but LGTM overall! Really like how much that cleaned up the MIME type logic.

awalker4 added 2 commits August 6, 2024 11:37

awalker4 requested review from christinestraub and MthwRobinson August 6, 2024 15:46

Add unstructured version to changlog

094c91d

MthwRobinson reviewed Aug 6, 2024

View reviewed changes

MthwRobinson approved these changes Aug 6, 2024

View reviewed changes

awalker4 enabled auto-merge (squash) August 6, 2024 15:54

awalker4 disabled auto-merge August 6, 2024 16:00

awalker4 enabled auto-merge (squash) August 6, 2024 16:02

awalker4 merged commit 7468938 into main Aug 6, 2024
6 checks passed

awalker4 deleted the filetype-detection branch August 6, 2024 16:43

This was referenced Aug 6, 2024

How do I supply Files.contentType? Unstructured-IO/unstructured-js-client#20

Closed

File type None is not supported. #365

Closed

Unstructured Inflexible with Client Inconsistencies Unstructured-IO/unstructured#2476

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: enhance API filetype detection #445

feat: enhance API filetype detection #445

awalker4 commented Aug 6, 2024 •

edited

Loading

MthwRobinson Aug 6, 2024

MthwRobinson left a comment

feat: enhance API filetype detection #445

feat: enhance API filetype detection #445

Conversation

awalker4 commented Aug 6, 2024 • edited Loading

Use the library for filetype detection

Add a content_type param to override filetype detection

Testing

MthwRobinson Aug 6, 2024

Choose a reason for hiding this comment

MthwRobinson left a comment

Choose a reason for hiding this comment

awalker4 commented Aug 6, 2024 •

edited

Loading

Add a `content_type` param to override filetype detection