Generic Multimodal Support #1021
Conversation
This is amazing. When this is merged, please ping me. I would like to adapt it for OpenAI + Gemini 1.5 Pro. ✌️
This is amazing! It would be nice to extend this to the OpenAI API as well, if possible.
Yes, amazing! It would be great to also have OpenAI-like API compatibility; so many open-source multimodal models are available, like Idefics2, Llava, llama-3-vision, ... :)
Hey @Saghen, the PR is looking great from my local testing! We changed a few things last week since we switched our Docker image to a new build process. That probably introduced some conflicts, but I don't mind fixing them for you since I created them 😅 If you're OK giving me write access on the PR, then I can just do the merge commit directly.
@nsarrazin that'd be great, thanks! Granted you permission.
Overall looks pretty good! Left some comments, let me know what you think.
```diff
-const module = await import("browser-image-resizer");
-// currently, only IDEFICS is supported by TGI
-// the size of images is hardcoded to 224x224 in TGI
-// this will need to be configurable when support for more models is added
-const resizedImages = await Promise.all(
-	files.map(async (file) => {
-		return await module
-			.readAndCompressImage(file, {
-				maxHeight: 224,
-				maxWidth: 224,
-				quality: 1,
-			})
-			.then(async (el) => await file2base64(el as File));
-	})
+const base64Files = await Promise.all(
+	(files ?? []).map((file) =>
+		file2base64(file).then((value) => ({ type: "base64" as const, value, mime: file.type }))
+	)
 );
```
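The snippet above relies on a `file2base64` helper that is not shown in this excerpt; a minimal sketch of what such a browser-side helper could look like (an assumption, not the PR's actual implementation):

```ts
// Hypothetical sketch of a file2base64 helper: reads a File in the browser
// and resolves to its raw base64 payload, without the "data:<mime>;base64," prefix.
async function file2base64(file: File): Promise<string> {
	return new Promise<string>((resolve, reject) => {
		const reader = new FileReader();
		reader.onload = () => {
			const dataUrl = reader.result as string;
			resolve(dataUrl.split(",")[1] ?? "");
		};
		reader.onerror = () => reject(reader.error);
		reader.readAsDataURL(file);
	});
}
```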
Would be nice to be able to do the resizing in the browser, but it should be configurable indeed. It would save quite some bandwidth, I think, at HuggingChat scale.
Maybe `model.multimodal` could be `true | { maxSize?: number, preferredMimeType?: string }`? That would give us a place to store multimodal-specific settings, and the client has access to the model, so we could do the resizing there.
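A rough sketch of that suggested shape (only `maxSize` and `preferredMimeType` come from the comment above; everything else is illustrative):

```ts
// Sketch of the suggested model.multimodal shape. `true` keeps the current
// behaviour; the object form carries client-side processing hints.
type MultimodalConfig =
	| true
	| {
			maxSize?: number; // maximum upload size, e.g. in bytes
			preferredMimeType?: string; // e.g. "image/webp"
	  };

// Illustrative model config carrying the flag; field names here are assumptions.
interface ModelConfig {
	name: string;
	multimodal?: MultimodalConfig;
}
```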
I created a more generic image processor in c8814f4 and put the multimodal options at `endpoints[*].multimodal`. What do you think?
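For illustration only, that placement might look roughly like the following in a model entry; the fields inside `multimodal` are assumptions rather than the PR's actual schema:

```ts
// Hypothetical illustration of multimodal options living on an endpoint entry
// rather than on the model itself. Option names are assumptions, mirroring
// the suggestion above, not the PR's actual schema.
const modelEntry = {
	name: "claude-3-haiku",
	endpoints: [
		{
			type: "anthropic",
			multimodal: {
				maxSize: 5 * 1024 * 1024, // assumed option name
				preferredMimeType: "image/webp", // assumed option name
			},
		},
	],
};
```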
WRT bandwidth, it might make sense to do some image processing on the server on upload to convert to a more suitable format (i.e. put everything in AVIF/WebP?), but I found the existing upload limit got in the way.
And thanks for exposing the mime type in files 🔥 that's gonna be super handy down the road as we support more modalities.
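A minimal sketch of the kind of server-side conversion being floated here, assuming sharp (which this PR adds as a dependency) and WebP as the target format; the size and quality numbers are illustrative:

```ts
import sharp from "sharp";

// Sketch: convert an uploaded image buffer to WebP on the server,
// capping its dimensions before it is stored or forwarded.
async function toWebp(input: Buffer, maxDim = 2048): Promise<{ value: string; mime: string }> {
	const output = await sharp(input)
		.resize({ width: maxDim, height: maxDim, fit: "inside", withoutEnlargement: true })
		.webp({ quality: 80 }) // illustrative quality setting
		.toBuffer();
	return { value: output.toString("base64"), mime: "image/webp" };
}
```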
@Ichigo3766 @Extremys @flexchar heads up: it was trivial, so I added support for OpenAI in this PR as well.
@Saghen I will review it soon. Could you merge/rebase with main so that the merge conflicts are gone? ❤️
left some nits. I think we are close to merge 🚀
besides the last two nits I've left, LGTM 🚀
LGTM! testing it one more time before merge
This reverts commit 57f8934.
@flexchar are you still planning on adding support for Gemini Pro?
@flexchar I would be happy to help if needed!
related #1330
Hi Arthur, unfortunately I will not be able to. It was for my personal "chatgpt" local alternative, and I have since discovered Open Web-UI, which I am running locally in Docker and it provides me with much more. Worth a note, I've been prototyping with … Hope that's alright! Maybe it will also allow a sooner merge, thus not leaving the PR very stale. ✌️
* feat: multimodal anthropic support
* docs: add claude haiku and multimodal support
* feat: uploaded file detection and image conversion
* fix deps with sharp
* fix resvg deps?
* fix: image conversion, retry with files
* feat: generic image processing and size target
* docs: multimodal review comments

Co-authored-by: Mishig <mishig.davaadorj@coloradocollege.edu>

* docs: multimodal review comments

Co-authored-by: Mishig <mishig.davaadorj@coloradocollege.edu>

* feat: review comment resolution
* fix: type error on image params
* feat: add multimodal for vertex ai anthropic
* style: uploadFile timeout number

Co-authored-by: Mishig <mishig.davaadorj@coloradocollege.edu>

---------

Co-authored-by: Nathan Sarrazin <sarrazin.nathan@gmail.com>
Co-authored-by: Mishig <mishig.davaadorj@coloradocollege.edu>
Revert "Generic Multimodal Support (huggingface#1021)" This reverts commit 57f8934.
Adds support for multimodal with Anthropic by increasing the maximum file size, adjusting the `message.files` type to support mime, and removing the assumptions around TGI.

- `message.files` changes from `string[]` to `{ type: 'hash' | 'base64', value: string, mime: string }`
- `![](base64)` prompting moves to the TGI endpoint code

I'd like to move the file upload logic out of the UI code and begin uploading immediately upon selecting a file, but that's outside the scope of this PR. However, that should allow for processing files earlier, which could be particularly useful for non-images (i.e. making embeddings for PDFs).
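For reference, a sketch of the described type change and of the `![](base64)` prompting idea; the helper below is hypothetical, and the data-URL form is an assumption rather than necessarily the PR's exact prompt format:

```ts
// The file shape described above: either a hash reference to a stored file
// or an inline base64 payload, plus its mime type.
type MessageFile = {
	type: "hash" | "base64";
	value: string;
	mime: string;
};

// Hypothetical helper illustrating ![](base64) prompting: an inline image is
// embedded as a markdown image inside the prompt sent to the endpoint.
function toMarkdownImage(file: MessageFile): string {
	if (file.type !== "base64") {
		throw new Error("hash files must be resolved to base64 first");
	}
	return `![](data:${file.mime};base64,${file.value})`;
}
```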