Phi-3 Vision Model Overview
Phi-3 Vision is a lightweight, state-of-the-art open multimodal model that excels in handling both text and visual data. It is part of the Phi-3 model family and supports a context length of up to 128K tokens. The model has undergone rigorous enhancement through supervised fine-tuning and direct preference optimization, ensuring precise instruction adherence and robust safety measures.
Model Architecture and Specifications
Model Name: Phi-3-Vision-128K-Instruct
Parameters: 4.2 billion
Components: Includes an image encoder, connector, projector, and Phi-3 Mini language model
Inputs: Text and image, optimized for chat format prompts
Context Length: 128K tokens
Status: Static model with training cutoff on March 15, 2024; potential future updates
Key Applications
Phi-3 Vision is designed for broad commercial and research use, particularly in English. It is well-suited for applications that require:
Memory and compute efficiency
Low latency
General image understanding
Optical Character Recognition (OCR)
Chart and table interpretation
These capabilities make Phi-3 Vision an excellent choice for building general-purpose AI systems that integrate both visual and text inputs, especially in environments where resources are limited or latency is critical.
For further technical details and benchmarks, refer to the Phi-3 Blog.
Implementation Approach (Temporary Content, To Be Removed After Completion)
Phi-3 Vision consists of three main modules: CLIP ViT-L/14 for encoding visual inputs, a text embedding layer for encoding text into word embeddings, and Phi-3-mini as the base language model. Microsoft provides ONNX files for these three components. I will attempt to convert these files with PNNX into a format that NCNN can load, which may require implementing some missing operators. For instance, Phi-3-mini is based on the LLaMA-2 architecture and uses RoPE for positional encoding, so I might need to add a RoPE operator to the attention computation.
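For reference, below is a minimal C++ sketch of what such a RoPE operator has to compute. The function name, memory layout, and dimension-pairing convention are assumptions for illustration only and do not reflect ncnn's or PNNX's actual operator interface.

```cpp
#include <cmath>
#include <vector>

// Sketch of rotary position embedding (RoPE) applied to the query/key
// vectors of one attention head before the dot-product attention.
// x        : flattened [seq_len][head_dim] buffer (head_dim must be even)
// theta    : base frequency, typically 10000
// Note: some implementations pair dimension i with i + head_dim/2
// ("rotate half") instead of adjacent pairs; the converted operator
// must match the original model's convention.
void apply_rope(std::vector<float>& x, int seq_len, int head_dim, float theta = 10000.f)
{
    for (int pos = 0; pos < seq_len; pos++)
    {
        float* v = x.data() + pos * head_dim;
        for (int i = 0; i < head_dim / 2; i++)
        {
            // rotation frequency for this dimension pair
            float freq = std::pow(theta, -2.0f * i / head_dim);
            float angle = pos * freq;
            float c = std::cos(angle);
            float s = std::sin(angle);

            // rotate the (v[2i], v[2i+1]) pair by `angle`
            float x0 = v[2 * i];
            float x1 = v[2 * i + 1];
            v[2 * i]     = x0 * c - x1 * s;
            v[2 * i + 1] = x0 * s + x1 * c;
        }
    }
}
```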
After a successful conversion, I will also need to implement the surrounding inference pipeline, including image preprocessing, the tokenizer, LLM sampling strategies, and autoregressive generation.
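As a rough sketch of the autoregressive generation part, here is a minimal greedy decoding loop in C++. `DecodeStep` is a hypothetical stand-in for one forward pass of the converted ncnn text decoder, not an actual ncnn API; in the real pipeline it would also carry a KV cache and the projected image features, and the arg-max would be replaced by the chosen sampling strategy.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Hypothetical callback: given the tokens generated so far, return
// logits over the vocabulary for the next position.
using DecodeStep = std::function<std::vector<float>(const std::vector<int>&)>;

// Greedy autoregressive generation sketch.
std::vector<int> generate(DecodeStep decode_step,
                          std::vector<int> tokens,
                          int eos_id,
                          int max_new_tokens)
{
    for (int step = 0; step < max_new_tokens; step++)
    {
        std::vector<float> logits = decode_step(tokens);

        // Greedy: take the highest-scoring token as the next one.
        int next = (int)(std::max_element(logits.begin(), logits.end()) - logits.begin());

        tokens.push_back(next);
        if (next == eos_id)  // stop at end-of-sequence
            break;
    }
    return tokens;
}
```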