🚀 Transformers.js v3.1 — any-to-any, text-to-image, image-to-text, pose estimation, time series forecasting, and more!

Table of contents:

🤖 New models: Janus, Qwen2-VL, JinaCLIP, LLaVA-OneVision, ViTPose, MGP-STR, PatchTST, PatchTSMixer.
🐛 Bug fixes
📝 Documentation improvements
🛠️ Other improvements
🤗 New contributors

🤖 New models: Janus, Qwen2-VL, JinaCLIP, LLaVA-OneVision, ViTPose, MGP-STR, PatchTST, PatchTSMixer.

Janus for any-to-any generation (e.g., image-to-text and text-to-image)

This release adds support for Janus, a novel autoregressive framework that unifies multimodal understanding and generation. The most popular model, deepseek-ai/Janus-1.3B, is tagged as an "any-to-any" model and has been trained specifically for the following tasks:

Example: Image-Text-to-Text

import { AutoProcessor, MultiModalityCausalLM } from "@huggingface/transformers";

// Load processor and model
const model_id = "onnx-community/Janus-1.3B-ONNX";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await MultiModalityCausalLM.from_pretrained(model_id);

// Prepare inputs
const conversation = [
  {
    role: "User",
    content: "<image_placeholder>\nConvert the formula into latex code.",
    images: ["https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/quadratic_formula.png"],
  },
];
const inputs = await processor(conversation);

// Generate response
const outputs = await model.generate({
  ...inputs,
  max_new_tokens: 150,
  do_sample: false,
});

// Decode output
const new_tokens = outputs.slice(null, [inputs.input_ids.dims.at(-1), null]);
const decoded = processor.batch_decode(new_tokens, { skip_special_tokens: true });
console.log(decoded[0]);

Sample output:

Sure, here is the LaTeX code for the given formula:

```
x = \frac{-b \pm \sqrt{b^2 - 4a c}}{2a}
```

This code represents the mathematical expression for the variable \( x \).
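
The decode step in the example above slices off the prompt before decoding, because the generated sequence contains the prompt tokens followed by the new ones. A minimal plain-JavaScript sketch of that idea, using nested arrays instead of the library's Tensor class (the helper name `stripPrompt` and the token ids are made up for illustration):

```javascript
// Hypothetical helper: keep only the tokens generated after the prompt.
// `sequences` is a batch of token-id arrays (prompt followed by new tokens),
// `promptLength` is the number of prompt tokens in each row.
function stripPrompt(sequences, promptLength) {
  return sequences.map((row) => row.slice(promptLength));
}

// Made-up token ids: the first three entries of each row are the prompt.
const sequences = [
  [101, 7, 8, 42, 43, 44],
  [101, 9, 5, 50, 51, 52],
];
const newTokens = stripPrompt(sequences, 3);
console.log(newTokens); // [ [ 42, 43, 44 ], [ 50, 51, 52 ] ]
```

In the real example, the prompt length comes from `inputs.input_ids.dims.at(-1)`.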

Example: Text-to-Image

import { AutoProcessor, MultiModalityCausalLM } from "@huggingface/transformers";

// Load processor and model
const model_id = "onnx-community/Janus-1.3B-ONNX";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await MultiModalityCausalLM.from_pretrained(model_id);

// Prepare inputs
const conversation = [
  {
    role: "User",
    content: "A cute and adorable baby fox with big brown eyes, autumn leaves in the background enchanting,immortal,fluffy, shiny mane,Petals,fairyism,unreal engine 5 and Octane Render,highly detailed, photorealistic, cinematic, natural colors.",
  },
];
const inputs = await processor(conversation, { chat_template: "text_to_image" });

// Generate response
const num_image_tokens = processor.num_image_tokens;
const outputs = await model.generate_images({
  ...inputs,
  min_new_tokens: num_image_tokens,
  max_new_tokens: num_image_tokens,
  do_sample: true,
});

// Save the generated image
await outputs[0].save("test.png");

Sample outputs: [generated images not shown]

Qwen2-VL for Image-Text-to-Text

Next, we added support for Qwen2-VL, the multimodal large language model series developed by the Qwen team at Alibaba Cloud. It introduces the Naive Dynamic Resolution mechanism, which lets the model process images of varying resolutions and leads to more efficient and accurate visual representations.

Example: Image-Text-to-Text

import { AutoProcessor, Qwen2VLForConditionalGeneration, RawImage } from "@huggingface/transformers";

// Load processor and model
const model_id = "onnx-community/Qwen2-VL-2B-Instruct";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await Qwen2VLForConditionalGeneration.from_pretrained(model_id);

// Prepare inputs
const url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg";
const image = await (await RawImage.read(url)).resize(448, 448);
const conversation = [
  {
    role: "user",
    content: [
      { type: "image" },
      { type: "text", text: "Describe this image." },
    ],
  },
];
const text = processor.apply_chat_template(conversation, { add_generation_prompt: true });
const inputs = await processor(text, image);

// Perform inference
const outputs = await model.generate({
  ...inputs,
  max_new_tokens: 128,
});

// Decode output
const decoded = processor.batch_decode(
  outputs.slice(null, [inputs.input_ids.dims.at(-1), null]),
  { skip_special_tokens: true },
);
console.log(decoded[0]);
// The image depicts a serene beach scene with a woman and a dog. The woman is sitting on the sand, wearing a plaid shirt, and appears to be engaged in a playful interaction with the dog. The dog, which is a large breed, is sitting on its hind legs and appears to be reaching out to the woman, possibly to give her a high-five or a paw. The background shows the ocean with gentle waves, and the sky is clear, suggesting it might be either sunrise or sunset. The overall atmosphere is calm and relaxed, capturing a moment of connection between the woman and the dog.
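
To give a feel for what dynamic resolution means in practice, here is a rough sketch that estimates how many visual tokens an image produces. It assumes 14-pixel patches and 2x2 patch merging (values taken from the Qwen2-VL paper, not from this release), and it ignores the minimum/maximum pixel limits that real preprocessing applies. The function name `estimateVisualTokens` is hypothetical.

```javascript
// Rough estimate of the visual token count under dynamic resolution.
// Assumptions (not from this release): 14px patches, 2x2 patch merging,
// and no min/max pixel clamping.
function estimateVisualTokens(width, height, patchSize = 14, mergeSize = 2) {
  const unit = patchSize * mergeSize; // each merged token covers unit x unit pixels
  // Round each side to the nearest multiple of `unit` (at least one unit).
  const w = Math.max(unit, Math.round(width / unit) * unit);
  const h = Math.max(unit, Math.round(height / unit) * unit);
  return (w / unit) * (h / unit);
}

console.log(estimateVisualTokens(448, 448)); // 256
console.log(estimateVisualTokens(224, 224)); // 64
console.log(estimateVisualTokens(1024, 512)); // 666 (37 x 18)
```

Under these assumptions, resizing the input to 448x448 as in the example keeps the visual prompt at a few hundred tokens, while larger images cost proportionally more.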

JinaCLIP for multimodal embeddings

JinaCLIP is a series of general-purpose multilingual multimodal embedding models for text & images, created by Jina AI.

Example: Compute text and/or image embeddings with jinaai/jina-clip-v2:

import { AutoModel, AutoProcessor, RawImage, matmul } from "@huggingface/transformers";

// Load processor and model
const model_id = "jinaai/jina-clip-v2";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await AutoModel.from_pretrained(model_id, { dtype: "q4" /* e.g., "fp16", "q8", or "q4" */ });

// Prepare inputs
const urls = ["https://i.ibb.co/nQNGqL0/beach1.jpg", "https://i.ibb.co/r5w8hG8/beach2.jpg"];
const images = await Promise.all(urls.map(url => RawImage.read(url)));
const sentences = [
    "غروب جميل على الشاطئ", // Arabic
    "海滩上美丽的日落", // Chinese
    "Un beau coucher de soleil sur la plage", // French
    "Ein wunderschöner Sonnenuntergang am Strand", // German
    "Ένα όμορφο ηλιοβασίλεμα πάνω από την παραλία", // Greek
    "समुद्र तट पर एक खूबसूरत सूर्यास्त", // Hindi
    "Un bellissimo tramonto sulla spiaggia", // Italian
    "浜辺に沈む美しい夕日", // Japanese
    "해변 위로 아름다운 일몰", // Korean
];

// Encode text and images
const inputs = await processor(sentences, images, { padding: true, truncation: true });
const { l2norm_text_embeddings, l2norm_image_embeddings } = await model(inputs);

// Encode query (text-only)
const query_prefix = "Represent the query for retrieving evidence documents: ";
const query_inputs = await processor(query_prefix + "beautiful sunset over the beach");
const { l2norm_text_embeddings: query_embeddings } = await model(query_inputs);

// Compute text-image similarity scores
const text_to_image_scores = await matmul(query_embeddings, l2norm_image_embeddings.transpose(1, 0));
console.log("text-image similarity scores", text_to_image_scores.tolist()[0]); // [0.29530206322669983, 0.3183615803718567]

// Compute image-image similarity scores
const image_to_image_score = await matmul(l2norm_image_embeddings[0], l2norm_image_embeddings[1]);
console.log("image-image similarity score", image_to_image_score.item()); // 0.9344457387924194

// Compute text-text similarity scores
const text_to_text_scores = await matmul(query_embeddings, l2norm_text_embeddings.transpose(1, 0));
console.log("text-text similarity scores", text_to_text_scores.tolist()[0]); // [0.5566609501838684, 0.7028406858444214, 0.582255482673645, 0.6648036241531372, 0.5462006330490112, 0.6791588068008423, 0.6192430257797241, 0.6258729100227356, 0.6453716158866882]
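
The similarity scores above are plain matrix products because the embeddings are already L2-normalized: for unit-length vectors, the dot product equals the cosine similarity. A small self-contained sketch with made-up 3-dimensional vectors (the helper names `l2normalize` and `dot` are illustrative, not library functions):

```javascript
// Scale a vector to unit length (L2 normalization).
function l2normalize(vec) {
  const norm = Math.hypot(...vec);
  return vec.map((v) => v / norm);
}

// Dot product of two equal-length vectors.
function dot(a, b) {
  return a.reduce((sum, v, i) => sum + v * b[i], 0);
}

// Made-up "embeddings" for illustration only.
const query = l2normalize([1, 2, 2]);
const docs = [[2, 4, 4], [2, 1, 0], [1, 0, 0]].map(l2normalize);

// For unit vectors, the dot product equals the cosine similarity.
const scores = docs.map((d) => dot(query, d));
console.log(scores.map((s) => s.toFixed(3))); // [ '1.000', '0.596', '0.333' ]
```

The first document points in the same direction as the query, so its score is 1 regardless of its original length.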

LLaVA-OneVision for Image-Text-to-Text

LLaVA-OneVision is a Vision-Language Model that can generate text conditioned on one or several images/videos. The model consists of a SigLIP vision encoder and a Qwen2 language backbone.

Example: Multi-round conversations w/ PKV (past key values) caching

import { AutoProcessor, AutoTokenizer, LlavaOnevisionForConditionalGeneration, RawImage } from '@huggingface/transformers';

// Load tokenizer, processor and model
const model_id = 'llava-hf/llava-onevision-qwen2-0.5b-ov-hf';

const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await LlavaOnevisionForConditionalGeneration.from_pretrained(model_id, {
    dtype: {
        embed_tokens: 'fp16', // or 'fp32' or 'q8'
        vision_encoder: 'fp16', // or 'fp32' or 'q8'
        decoder_model_merged: 'q4', // or 'q8'
    },
    // device: 'webgpu',
});

// Prepare text inputs
const prompt = 'What does the text say?';
const messages = [
    { role: 'system', content: 'Answer the question.' },
    { role: 'user', content: `<image>\n${prompt}` }
]
const text = tokenizer.apply_chat_template(messages, { tokenize: false, add_generation_prompt: true });
const text_inputs = tokenizer(text);

// Prepare vision inputs
const url = 'https://huggingface.co/qnguyen3/nanoLLaVA/resolve/main/example_1.png';
const image = await RawImage.fromURL(url);
const vision_inputs = await processor(image);

// Generate response
const { past_key_values, sequences } = await model.generate({
    ...text_inputs,
    ...vision_inputs,
    do_sample: false,
    max_new_tokens: 64,
    return_dict_in_generate: true,
});

// Decode output
const answer = tokenizer.decode(
    sequences.slice(0, [text_inputs.input_ids.dims[1], null]),
    { skip_special_tokens: true },
);
console.log(answer);
// The text says "small but mighty" in a playful font.

const new_messages = [
    ...messages,
    { role: 'assistant', content: answer },
    { role: 'user', content: 'How does the text correlate to the context of the image?' }
]
const new_text = tokenizer.apply_chat_template(new_messages, { tokenize: false, add_generation_prompt: true });
const new_text_inputs = tokenizer(new_text);

// Generate another response
const output = await model.generate({
    ...new_text_inputs,
    past_key_values,
    do_sample: false,
    max_new_tokens: 256,
});
const new_answer = tokenizer.decode(
    output.slice(0, [new_text_inputs.input_ids.dims[1], null]),
    { skip_special_tokens: true },
);
console.log(new_answer);
// The text "small but mighty" is likely a playful or humorous reference to the image of the blue mouse with the orange dumbbell. It could be used as a motivational phrase or a playful way to express the idea that even small things can be impressive or powerful.

ViTPose for pose-estimation

ViTPose is a state-of-the-art pose estimation model that employs a standard, non-hierarchical vision transformer as the backbone for keypoint estimation, combined with a simple decoder head that predicts heatmaps from a given image.

Example: Pose estimation w/ onnx-community/vitpose-base-simple.

import { AutoModel, AutoImageProcessor, RawImage } from '@huggingface/transformers';

// Load model and processor
const model_id = 'onnx-community/vitpose-base-simple';
const model = await AutoModel.from_pretrained(model_id);
const processor = await AutoImageProcessor.from_pretrained(model_id);

// Load image and prepare inputs
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/ryan-gosling.jpg';
const image = await RawImage.read(url);
const inputs = await processor(image);

// Predict heatmaps
const { heatmaps } = await model(inputs);

// Post-process heatmaps to get keypoints and scores
const boxes = [[[0, 0, image.width, image.height]]];
const results = processor.post_process_pose_estimation(heatmaps, boxes)[0][0];
console.log(results);
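
To illustrate what turning heatmaps into keypoints involves, the following simplified sketch finds the peak of a single heatmap and maps it into a bounding box. It is not the library's implementation (which may apply further refinement); the helper name `heatmapToKeypoint` and the `[x, y, width, height]` box layout are assumptions made for this example.

```javascript
// Simplified sketch: find the peak of one heatmap and map it into a bounding box.
// `heatmap` is a 2D array [rows][cols]; `box` is assumed to be [x, y, width, height].
function heatmapToKeypoint(heatmap, box) {
  const rows = heatmap.length;
  const cols = heatmap[0].length;
  let best = { row: 0, col: 0, score: -Infinity };
  for (let r = 0; r < rows; r++) {
    for (let c = 0; c < cols; c++) {
      if (heatmap[r][c] > best.score) best = { row: r, col: c, score: heatmap[r][c] };
    }
  }
  const [bx, by, bw, bh] = box;
  // Scale heatmap cell indices to image coordinates inside the box.
  const x = bx + (best.col / cols) * bw;
  const y = by + (best.row / rows) * bh;
  return { keypoint: [x, y], score: best.score };
}

// Made-up 4x4 heatmap with its peak at row 1, column 2.
const heatmap = [
  [0.0, 0.1, 0.2, 0.0],
  [0.1, 0.3, 0.9, 0.2],
  [0.0, 0.1, 0.2, 0.1],
  [0.0, 0.0, 0.1, 0.0],
];
const result = heatmapToKeypoint(heatmap, [0, 0, 400, 200]);
console.log(result); // { keypoint: [ 200, 50 ], score: 0.9 }
```

The peak value doubles as a confidence score, which is useful for skipping low-confidence keypoints when drawing.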

Optionally, visualize the outputs (Node.js usage shown here, using the node-canvas library):

import { createCanvas, createImageData } from 'canvas';

// Create canvas and draw image
const canvas = createCanvas(image.width, image.height);
const ctx = canvas.getContext('2d');
const imageData = createImageData(image.rgba().data, image.width, image.height);
ctx.putImageData(imageData, 0, 0);

// Draw edges between keypoints
const points = results.keypoints;
ctx.lineWidth = 4;
ctx.strokeStyle = 'blue';
for (const [i, j] of model.config.edges) {
    const [x1, y1] = points[i];
    const [x2, y2] = points[j];
    ctx.beginPath();
    ctx.moveTo(x1, y1);
    ctx.lineTo(x2, y2);
    ctx.stroke();
}

// Draw circle at each keypoint
ctx.fillStyle = 'red';
for (const [x, y] of points) {
    ctx.beginPath();
    ctx.arc(x, y, 8, 0, 2 * Math.PI);
    ctx.fill();
}

// Save image to file
import fs from 'fs';
const out = fs.createWriteStream('pose.png');
const stream = canvas.createPNGStream();
stream.pipe(out)
out.on('finish', () =>  console.log('The PNG file was created.'));

[Figure: input image and output image, not shown]

MGP-STR for Optical Character Recognition (OCR)

MGP-STR is a simple yet powerful vision scene text recognition model built upon the vision transformer (ViT).

Example: Optical Character Recognition (OCR) w/ onnx-community/mgp-str-base

import { MgpstrForSceneTextRecognition, MgpstrProcessor, RawImage } from '@huggingface/transformers';

const model_id = 'onnx-community/mgp-str-base';
const model = await MgpstrForSceneTextRecognition.from_pretrained(model_id);
const processor = await MgpstrProcessor.from_pretrained(model_id);

// Load image from the IIIT-5k dataset
const url = "https://i.postimg.cc/ZKwLg2Gw/367-14.png";
const image = await RawImage.read(url);

// Preprocess the image
const result = await processor(image);

// Perform inference
const outputs = await model(result);

// Decode the model outputs
const generated_text = processor.batch_decode(outputs.logits).generated_text;
console.log(generated_text); // [ 'ticket' ]
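
Decoding logits into text boils down to picking the most likely class at each position and stopping at an end marker. The sketch below shows only that greedy step on made-up data. The real MGP-STR decoder combines several prediction heads, so treat this as an illustration of the idea rather than what `batch_decode` does internally; the vocabulary and the `[s]` end marker are assumptions.

```javascript
// Simplified greedy decoding over per-position character scores.
// `logits` is [positions][vocab]; `vocab` maps class index to a character.
// The vocabulary and end-of-sequence marker are made up for illustration.
function greedyDecode(logits, vocab, eosToken = "[s]") {
  let text = "";
  for (const scores of logits) {
    const index = scores.indexOf(Math.max(...scores)); // argmax
    const token = vocab[index];
    if (token === eosToken) break; // stop at the end-of-sequence marker
    text += token;
  }
  return text;
}

const vocab = ["[s]", "a", "b", "c"];
const logits = [
  [0.1, 0.2, 0.1, 2.5], // "c"
  [0.0, 3.1, 0.4, 0.2], // "a"
  [0.3, 0.1, 1.9, 0.2], // "b"
  [4.0, 0.1, 0.2, 0.3], // "[s]" ends the sequence
  [0.0, 9.0, 0.0, 0.0], // ignored: comes after the end marker
];
const decodedText = greedyDecode(logits, vocab);
console.log(decodedText); // cab
```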

PatchTST and PatchTSMixer for time series forecasting

PatchTST and PatchTSMixer are models that can be used for multivariate time series forecasting.

Example: Time series forecasting w/ onnx-community/granite-timeseries-patchtst

import { PatchTSTForPrediction, Tensor } from "@huggingface/transformers";

const model_id = "onnx-community/granite-timeseries-patchtst";
const model = await PatchTSTForPrediction.from_pretrained(model_id, { dtype: "fp32" });

const dims = [64, 512, 7];
const prod = dims.reduce((a, b) => a * b, 1);
const past_values = new Tensor('float32',
    Float32Array.from({ length: prod }, (_, i) => i / prod),
    dims,
);
const { prediction_outputs } = await model({ past_values });
console.log(prediction_outputs);

Example: Time series forecasting w/ onnx-community/granite-timeseries-patchtsmixer

import { PatchTSMixerForPrediction, Tensor } from "@huggingface/transformers";

const model_id = "onnx-community/granite-timeseries-patchtsmixer";
const model = await PatchTSMixerForPrediction.from_pretrained(model_id, { dtype: "fp32" });

const dims = [64, 512, 7];
const prod = dims.reduce((a, b) => a * b, 1);
const past_values = new Tensor('float32',
    Float32Array.from({ length: prod }, (_, i) => i / prod),
    dims,
);
const { prediction_outputs } = await model({ past_values });
console.log(prediction_outputs);
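
Both examples above fill `past_values` with synthetic numbers. With real data, the nested series has to be flattened into a typed array plus a shape. The sketch below assumes the three dimensions are ordered as batch, time step, and channel, and that the data is stored in row-major order; the helper name `packSeries` is hypothetical.

```javascript
// Flatten nested series data [batch][time][channel] into row-major order,
// producing the (data, dims) pair that a tensor constructor expects.
// The [batch, time, channel] ordering is an assumption of this sketch.
function packSeries(batches) {
  const dims = [batches.length, batches[0].length, batches[0][0].length];
  const data = new Float32Array(dims[0] * dims[1] * dims[2]);
  let offset = 0;
  for (const series of batches) {
    for (const step of series) {
      for (const value of step) data[offset++] = value;
    }
  }
  return { data, dims };
}

// Made-up example: 1 batch, 3 time steps, 2 channels.
const { data, dims } = packSeries([[[1, 10], [2, 20], [3, 30]]]);
console.log(dims); // [ 1, 3, 2 ]
console.log(Array.from(data)); // [ 1, 10, 2, 20, 3, 30 ]
```

The result can then be wrapped as `new Tensor('float32', data, dims)`, mirroring how the examples construct their input.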

🐛 Bug fixes

When padding an image, the dimensions get stretched by @BritishWerewolf in #1015
fix(scale): add missing scale element by @tosinamuda in #1017

📝 Documentation improvements

Updated link to sentence similarity models. by @uzyn in #893
fix(docs): fixed a broken link to quantization guide by @ThomasWT in #1014
fix(docs): Fixed Typos in README and docs/snippets/6_supported-models.snippet by @hitchhiker3010 in #1030

🛠️ Other improvements

Add option to maintain aspect ratio on resize by @BritishWerewolf in #971
Add functionality to split RawImage into channels; Update slice documentation and tests by @BritishWerewolf in #978
Avoid resizing images when they already have the desired size by @nemphys in #1027
Add support for Split pretokenizer w/ behavior=removed & invert=false by @xenova in #1033
Add type declaration for progress_callback by @ocavue in #1034
Add support for op_block_list by @pdufour in #1036

🤗 New contributors

@uzyn made their first contribution in #893
@ThomasWT made their first contribution in #1014
@tosinamuda made their first contribution in #1017
@nemphys made their first contribution in #1027
@hitchhiker3010 made their first contribution in #1030
@pdufour made their first contribution in #1036

Full Changelog: 3.0.2...3.1.0
