Phi-3 Vision Model Overview
Phi-3 Vision is a lightweight, state-of-the-art open multimodal model that excels in handling both text and visual data. It is part of the Phi-3 model family and supports a context length of up to 128K tokens. The model has undergone rigorous enhancement through supervised fine-tuning and direct preference optimization, ensuring precise instruction adherence and robust safety measures.
Model Architecture and Specifications
Model Name: Phi-3-Vision-128K-Instruct
Parameters: 4.2 billion
Components: Includes an image encoder, connector, projector, and Phi-3 Mini language model
Inputs: Text and image, optimized for chat format prompts
Context Length: 128K tokens
Status: Static model with training cutoff on March 15, 2024; potential future updates
Key Applications
Phi-3 Vision is designed for broad commercial and research use, particularly in English. It is well-suited for applications that require:
Memory and compute efficiency
Low latency
General image understanding
Optical Character Recognition (OCR)
Chart and table interpretation
These capabilities make Phi-3 Vision an excellent choice for building general-purpose AI systems that integrate both visual and text inputs, especially in environments where resources are limited or latency is critical.
For further technical details and benchmarks, refer to the Phi-3 Blog.
Implementation Approach (Temporary Content, To Be Removed After Completion)
Phi-3 Vision consists of three main modules: CLIP ViT-L/14 for encoding visual inputs, a text embedding layer for encoding text into word embeddings, and Phi-3-mini as the base language model. Microsoft provides ONNX files for these three components. I will attempt to convert these files with PNNX into a format that NCNN can load, which may require implementing some missing operators. For instance, Phi-3-mini is based on the LLaMA-2 architecture and uses RoPE for positional encoding, so I might need to add a RoPE operator to the attention computation.
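For reference, below is a minimal C++ sketch of what such a RoPE operator has to compute. The function name, memory layout, and dimension-pairing convention are assumptions for illustration only and do not reflect ncnn's or PNNX's actual operator interface.

```cpp
#include <cmath>
#include <vector>

// Sketch of rotary position embedding (RoPE) applied to the query/key
// vectors of one attention head before the dot-product attention.
// x        : flattened [seq_len][head_dim] buffer (head_dim must be even)
// theta    : base frequency, typically 10000
// Note: some implementations pair dimension i with i + head_dim/2
// ("rotate half") instead of adjacent pairs; the converted operator
// must match the original model's convention.
void apply_rope(std::vector<float>& x, int seq_len, int head_dim, float theta = 10000.f)
{
    for (int pos = 0; pos < seq_len; pos++)
    {
        float* v = x.data() + pos * head_dim;
        for (int i = 0; i < head_dim / 2; i++)
        {
            // rotation frequency for this dimension pair
            float freq = std::pow(theta, -2.0f * i / head_dim);
            float angle = pos * freq;
            float c = std::cos(angle);
            float s = std::sin(angle);

            // rotate the (v[2i], v[2i+1]) pair by `angle`
            float x0 = v[2 * i];
            float x1 = v[2 * i + 1];
            v[2 * i]     = x0 * c - x1 * s;
            v[2 * i + 1] = x0 * s + x1 * c;
        }
    }
}
```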
After a successful conversion, I will also need to implement the surrounding inference pipeline, including image preprocessing, the tokenizer, LLM sampling strategies, and autoregressive generation.
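As a rough sketch of the autoregressive generation part, here is a minimal greedy decoding loop in C++. `DecodeStep` is a hypothetical stand-in for one forward pass of the converted ncnn text decoder, not an actual ncnn API; in the real pipeline it would also carry a KV cache and the projected image features, and the arg-max would be replaced by the chosen sampling strategy.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Hypothetical callback: given the tokens generated so far, return
// logits over the vocabulary for the next position.
using DecodeStep = std::function<std::vector<float>(const std::vector<int>&)>;

// Greedy autoregressive generation sketch.
std::vector<int> generate(DecodeStep decode_step,
                          std::vector<int> tokens,
                          int eos_id,
                          int max_new_tokens)
{
    for (int step = 0; step < max_new_tokens; step++)
    {
        std::vector<float> logits = decode_step(tokens);

        // Greedy: take the highest-scoring token as the next one.
        int next = (int)(std::max_element(logits.begin(), logits.end()) - logits.begin());

        tokens.push_back(next);
        if (next == eos_id)  // stop at end-of-sequence
            break;
    }
    return tokens;
}
```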