Generic Multimodal Support #1021
Conversation
This is amazing. When this is merged, please ping me. I would like to adapt it for OpenAI + Gemini 1.5 Pro. ✌️
This is amazing! It would be nice to extend this to the OpenAI API as well, if possible.
Yes, amazing! It would be great to also have OpenAI-like API compatibility; so many open-source multimodal models are available, like Idefics2, Llava, llama-3-vision, ... :)
Hey @Saghen, the PR is looking great from my local testing! We changed a few things last week since we switched our Docker image to a new build process. That probably introduced some conflicts, but I don't mind fixing them for you since I created them 😅 If you're OK giving me write access on the PR, then I can just do the merge commit directly.
@nsarrazin that'd be great, thanks! Granted you permission.
Overall looks pretty good! Left some comments, let me know what you think.
```diff
-const module = await import("browser-image-resizer");
-// currently, only IDEFICS is supported by TGI
-// the size of images is hardcoded to 224x224 in TGI
-// this will need to be configurable when support for more models is added
-const resizedImages = await Promise.all(
-	files.map(async (file) => {
-		return await module
-			.readAndCompressImage(file, {
-				maxHeight: 224,
-				maxWidth: 224,
-				quality: 1,
-			})
-			.then(async (el) => await file2base64(el as File));
-	})
+const base64Files = await Promise.all(
+	(files ?? []).map((file) =>
+		file2base64(file).then((value) => ({ type: "base64" as const, value, mime: file.type }))
+	)
 );
```
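The snippet above relies on a `file2base64` helper that is not shown in this excerpt; a minimal sketch of what such a browser-side helper could look like (an assumption, not the PR's actual implementation):

```ts
// Hypothetical sketch of a file2base64 helper: reads a File in the browser
// and resolves to its raw base64 payload, without the "data:<mime>;base64," prefix.
async function file2base64(file: File): Promise<string> {
	return new Promise<string>((resolve, reject) => {
		const reader = new FileReader();
		reader.onload = () => {
			const dataUrl = reader.result as string;
			resolve(dataUrl.split(",")[1] ?? "");
		};
		reader.onerror = () => reject(reader.error);
		reader.readAsDataURL(file);
	});
}
```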
Would be nice to be able to do the resizing in the browser, but it should be configurable indeed. It would save quite some bandwidth, I think, at HuggingChat scale.
Maybe `model.multimodal` could be `true | { maxSize?: number, preferredMimeType?: string }`? That would give us a place to store multimodal-specific settings, and the client has access to the model, so we could do the resizing there.
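A rough sketch of that suggested shape (only `maxSize` and `preferredMimeType` come from the comment above; everything else is illustrative):

```ts
// Sketch of the suggested model.multimodal shape. `true` keeps the current
// behaviour; the object form carries client-side processing hints.
type MultimodalConfig =
	| true
	| {
			maxSize?: number; // maximum upload size, e.g. in bytes
			preferredMimeType?: string; // e.g. "image/webp"
	  };

// Illustrative model config carrying the flag; field names here are assumptions.
interface ModelConfig {
	name: string;
	multimodal?: MultimodalConfig;
}
```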
I created a more generic image processor in c8814f4 and put the multimodal options at `endpoints[*].multimodal`. What do you think?
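For illustration only, that placement might look roughly like the following in a model entry; the fields inside `multimodal` are assumptions rather than the PR's actual schema:

```ts
// Hypothetical illustration of multimodal options living on an endpoint entry
// rather than on the model itself. Option names are assumptions, mirroring
// the suggestion above, not the PR's actual schema.
const modelEntry = {
	name: "claude-3-haiku",
	endpoints: [
		{
			type: "anthropic",
			multimodal: {
				maxSize: 5 * 1024 * 1024, // assumed option name
				preferredMimeType: "image/webp", // assumed option name
			},
		},
	],
};
```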
WRT bandwidth, it might make sense to do some image processing on the server on upload to convert to a more suitable format (i.e. put everything in AVIF/WebP?), but I found the existing upload limit got in the way.
And thanks for exposing the mime type in files 🔥 that's gonna be super handy down the road as we support more modalities.
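A minimal sketch of the kind of server-side conversion being floated here, assuming sharp (which this PR adds as a dependency) and WebP as the target format; the size and quality numbers are illustrative:

```ts
import sharp from "sharp";

// Sketch: convert an uploaded image buffer to WebP on the server,
// capping its dimensions before it is stored or forwarded.
async function toWebp(input: Buffer, maxDim = 2048): Promise<{ value: string; mime: string }> {
	const output = await sharp(input)
		.resize({ width: maxDim, height: maxDim, fit: "inside", withoutEnlargement: true })
		.webp({ quality: 80 }) // illustrative quality setting
		.toBuffer();
	return { value: output.toString("base64"), mime: "image/webp" };
}
```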
@Ichigo3766 @Extremys @flexchar heads up: it was trivial, so I added support for OpenAI in this PR as well.
@Saghen I will review it soon. Could you merge/rebase with main so that the merge conflicts are gone? ❤️
left some nits. I think we are close to merge 🚀
besides the last two nits I've left, LGTM 🚀
LGTM! testing it one more time before merge
This reverts commit 57f8934.
@flexchar are you still planning on adding support for Gemini Pro?
@flexchar I would be happy to help if needed!
related #1330
Hi Arthur, unfortunately I will not be able to. It was for my personal "chatgpt" local alternative, and I have since discovered Open Web-UI, which I am running locally in Docker and it provides me with much more. Worth a note, I've been prototyping with … Hope that's alright! Maybe it will also allow a sooner merge, thus not leaving the PR very stale. ✌️
* feat: multimodal anthropic support
* docs: add claude haiku and multimodal support
* feat: uploaded file detection and image conversion
* fix deps with sharp
* fix resvg deps?
* fix: image conversion, retry with files
* feat: generic image processing and size target
* docs: multimodal review comments

Co-authored-by: Mishig <mishig.davaadorj@coloradocollege.edu>

* docs: multimodal review comments

Co-authored-by: Mishig <mishig.davaadorj@coloradocollege.edu>

* feat: review comment resolution
* fix: type error on image params
* feat: add multimodal for vertex ai anthropic
* style: uploadFile timeout number

Co-authored-by: Mishig <mishig.davaadorj@coloradocollege.edu>

---------

Co-authored-by: Nathan Sarrazin <sarrazin.nathan@gmail.com>
Co-authored-by: Mishig <mishig.davaadorj@coloradocollege.edu>
Revert "Generic Multimodal Support (huggingface#1021)" This reverts commit 57f8934.
Adds support for multimodal with Anthropic by increasing the maximum file size, adjusting the `message.files` type to support mime, and removing the assumptions around TGI.

- `message.files` changes from `string[]` to `{ type: 'hash' | 'base64', value: string, mime: string }`
- `![](base64)` prompting moves to the TGI endpoint code

I'd like to move the file upload logic out of the UI code and begin uploading immediately upon selecting a file, but that's outside the scope of this PR. However, that should allow for processing files earlier, which could be particularly useful for non-images (i.e. making embeddings for PDFs).
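For reference, a sketch of the described type change and of the `![](base64)` prompting idea; the helper below is hypothetical, and the data-URL form is an assumption rather than necessarily the PR's exact prompt format:

```ts
// The file shape described above: either a hash reference to a stored file
// or an inline base64 payload, plus its mime type.
type MessageFile = {
	type: "hash" | "base64";
	value: string;
	mime: string;
};

// Hypothetical helper illustrating ![](base64) prompting: an inline image is
// embedded as a markdown image inside the prompt sent to the endpoint.
function toMarkdownImage(file: MessageFile): string {
	if (file.type !== "base64") {
		throw new Error("hash files must be resolved to base64 first");
	}
	return `![](data:${file.mime};base64,${file.value})`;
}
```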