Releases: huggingface/transformers.js

2.15.0

06 Feb 14:03

What's new?

🤖 Qwen1.5 Chat models (0.5B and 1.8B)

Yesterday, the Qwen team (Alibaba Group) released the Qwen1.5 series of chat models. As part of the release, they published several sub-2B-parameter models, including Qwen/Qwen1.5-0.5B-Chat and Qwen/Qwen1.5-1.8B-Chat, which both demonstrate strong performance despite their small sizes. The best part? They can run in the browser with Transformers.js (PR)! 🚀 See here for the full list of supported models.

Example: Text generation with Xenova/Qwen1.5-0.5B-Chat.

import { pipeline } from '@xenova/transformers';

// Create text-generation pipeline
const generator = await pipeline('text-generation', 'Xenova/Qwen1.5-0.5B-Chat');

// Define the prompt and list of messages
const prompt = "Give me a short introduction to large language model."
const messages = [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": prompt }
]

// Apply chat template
const text = generator.tokenizer.apply_chat_template(messages, {
    tokenize: false,
    add_generation_prompt: true,
});

// Generate text
const output = await generator(text, {
    max_new_tokens: 128,
    do_sample: false,
});
console.log(output[0].generated_text);
// 'A large language model is a type of artificial intelligence system that can generate text based on the input provided by users, such as books, articles, or websites. It uses advanced algorithms and techniques to learn from vast amounts of data and improve its performance over time through machine learning and natural language processing (NLP). Large language models have become increasingly popular in recent years due to their ability to handle complex tasks such as generating human-like text quickly and accurately. They have also been used in various fields such as customer service chatbots, virtual assistants, and search engines for information retrieval purposes.'
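
To make the "Apply chat template" step more concrete, here is a minimal sketch of what a ChatML-style template produces. It assumes the Qwen1.5 chat models use the ChatML layout (`<|im_start|>` / `<|im_end|>` markers); `formatChatML` is a hypothetical helper and not part of Transformers.js. The authoritative template is whatever ships with the model's tokenizer, so prefer `apply_chat_template` in real code.

```javascript
// Hypothetical helper (not part of Transformers.js): formats messages in the
// ChatML layout, which is ASSUMED here to match the Qwen1.5 chat template.
function formatChatML(messages, addGenerationPrompt = true) {
    let text = '';
    for (const { role, content } of messages) {
        text += `<|im_start|>${role}\n${content}<|im_end|>\n`;
    }
    if (addGenerationPrompt) {
        // Open an assistant turn so the model continues from here
        text += '<|im_start|>assistant\n';
    }
    return text;
}

const exampleMessages = [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Hello!' },
];
console.log(formatChatML(exampleMessages));
```

The trailing assistant marker plays the role of `add_generation_prompt: true` in the example above: it tells the model that the next tokens belong to the assistant's reply.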

🧍 MODNet for Portrait Image Matting

Next, we added support for MODNet, a small (but powerful) portrait image matting model (PR). Thanks to @cyio for the suggestion!

Example: Perform portrait image matting with Xenova/modnet.

import { AutoModel, AutoProcessor, RawImage } from '@xenova/transformers';

// Load model and processor
const model = await AutoModel.from_pretrained('Xenova/modnet', { quantized: false });
const processor = await AutoProcessor.from_pretrained('Xenova/modnet');

// Load image from URL
const url = 'https://images.pexels.com/photos/5965592/pexels-photo-5965592.jpeg?auto=compress&cs=tinysrgb&w=1024';
const image = await RawImage.fromURL(url);

// Pre-process image
const { pixel_values } = await processor(image);

// Predict alpha matte
const { output } = await model({ input: pixel_values });

// Save output mask
const mask = await RawImage.fromTensor(output[0].mul(255).to('uint8')).resize(image.width, image.height);
mask.save('mask.png');

[Figure: input image and output mask]

🧠 New text embedding models

We also added support for several new text embedding models, including:

Check out the links for example usage.

🛠️ Other improvements

  • Fix example links in documentation (#550).
  • Improve unknown model warnings (#554).
  • Update jsdoc-to-markdown dev dependency (#574).

Full Changelog: 2.14.2...2.15.0

2.14.2

29 Jan 12:54

What's new?

Full Changelog: 2.14.1...2.14.2

2.14.1

25 Jan 14:16

What's new?

  • Add support for Depth Anything (#534). See here for the list of available models.

    Example: Depth estimation with Xenova/depth-anything-small-hf.

    import { pipeline } from '@xenova/transformers';
    
    // Create depth-estimation pipeline
    const depth_estimator = await pipeline('depth-estimation', 'Xenova/depth-anything-small-hf');
    
    // Predict depth map for the given image
    const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/bread_small.png';
    const output = await depth_estimator(url);
    // {
    //   predicted_depth: Tensor {
    //     dims: [350, 518],
    //     type: 'float32',
    //     data: Float32Array(181300) [...],
    //     size: 181300
    //   },
    //   depth: RawImage {
    //     data: Uint8Array(271360) [...],
    //     width: 640,
    //     height: 424,
    //     channels: 1
    //   }
    // }

    You can visualize the output with:

    output.depth.save('depth.png');

    [Figure: input image and visualized output]

    Online demo: https://huggingface.co/spaces/Xenova/depth-anything-web

  • Fix typo in tokenizers.js (#518)

  • Return empty tokens array if text is empty after normalization (#535)
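
For the Depth Anything example above, `predicted_depth` holds raw floating-point values while `depth` is an 8-bit image. One common way to turn raw depth values into a displayable grayscale range is min-max scaling, sketched below. This is an illustration only: `depthToUint8` is a hypothetical helper, and the pipeline may compute its `depth` image differently (for instance, it also resizes to the input image size).

```javascript
// Hypothetical helper (not part of Transformers.js): min-max scale raw depth
// values into the 0-255 range so they can be shown as a grayscale image.
function depthToUint8(depth) {
    let min = Infinity;
    let max = -Infinity;
    for (const v of depth) {
        if (v < min) min = v;
        if (v > max) max = v;
    }
    const range = (max - min) || 1; // avoid division by zero on flat input
    const out = new Uint8Array(depth.length);
    for (let i = 0; i < depth.length; ++i) {
        out[i] = Math.round(((depth[i] - min) / range) * 255);
    }
    return out;
}

const sampleDepth = new Float32Array([0.5, 1.0, 1.5, 2.5]);
console.log(depthToUint8(sampleDepth));
// Uint8Array(4) [ 0, 64, 128, 255 ]
```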

Full Changelog: 2.14.0...2.14.1

2.14.0

10 Jan 17:50

What's new?

🚀 Segment Anything Model (SAM)

The Segment Anything Model (SAM) can be used to generate segmentation masks for objects in a scene, given an input image and input points. See here for the full list of pre-converted models. Support for this model was added in #510.

Demo + source code: https://huggingface.co/spaces/Xenova/segment-anything-web

Example: Perform mask generation w/ Xenova/slimsam-77-uniform.

import { SamModel, AutoProcessor, RawImage } from '@xenova/transformers';

const model = await SamModel.from_pretrained('Xenova/slimsam-77-uniform');
const processor = await AutoProcessor.from_pretrained('Xenova/slimsam-77-uniform');

const img_url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/corgi.jpg';
const raw_image = await RawImage.read(img_url);
const input_points = [[[340, 250]]]; // 2D location (x, y) of the input point

const inputs = await processor(raw_image, input_points);
const outputs = await model(inputs);

const masks = await processor.post_process_masks(outputs.pred_masks, inputs.original_sizes, inputs.reshaped_input_sizes);
console.log(masks);
// [
//   Tensor {
//     dims: [ 1, 3, 410, 614 ],
//     type: 'bool',
//     data: Uint8Array(755220) [ ... ],
//     size: 755220
//   }
// ]
const scores = outputs.iou_scores;
console.log(scores);
// Tensor {
//   dims: [ 1, 1, 3 ],
//   type: 'float32',
//   data: Float32Array(3) [
//     0.8350210189819336,
//     0.9786665439605713,
//     0.8379436731338501
//   ],
//   size: 3
// }

You can then visualize the 3 predicted masks with:

const image = RawImage.fromTensor(masks[0][0].mul(255));
image.save('mask.png');

[Figure: input image (corgi) and visualized output (mask)]

Next, select the channel with the highest IoU score, which in this case is the second (green) channel. Intersecting this with the original image gives us an isolated version of the subject:

[Figure: selected mask and the masked corgi image]
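
The selection and intersection steps described above can be sketched as follows. To keep the example self-contained, it works on plain typed arrays rather than library tensors; `argmax` and `applyMask` are hypothetical helpers, and the flattened `[numMasks, height, width]` mask layout is an assumption based on the `dims` shown earlier.

```javascript
// Hypothetical helpers (not part of Transformers.js).

// Return the index of the largest value (e.g. the highest IoU score).
function argmax(values) {
    let best = 0;
    for (let i = 1; i < values.length; ++i) {
        if (values[i] > values[best]) best = i;
    }
    return best;
}

// Keep RGBA pixels where the chosen mask is set and make the rest transparent.
// `maskData` is ASSUMED to be laid out as [numMasks, height, width], flattened.
function applyMask(rgba, maskData, maskIndex, width, height) {
    const out = new Uint8ClampedArray(rgba); // copy, so the input is untouched
    const offset = maskIndex * width * height;
    for (let p = 0; p < width * height; ++p) {
        if (!maskData[offset + p]) out[4 * p + 3] = 0; // zero the alpha channel
    }
    return out;
}

// IoU scores from the example above
const iouScores = [0.8350210189819336, 0.9786665439605713, 0.8379436731338501];
const bestIndex = argmax(iouScores);
console.log(bestIndex); // 1 (the second channel)
```

With real data, `rgba` would be the image's RGBA pixel buffer and `maskData` the `data` array of the post-processed mask tensor.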

🛠️ Improvements

  • Add support for processing non-square images w/ ConvNextFeatureExtractor in #503
  • Encode revision in remote URL in #507

Full Changelog: 2.13.4...2.14.0

2.13.4

04 Jan 17:31

What's new?

  • Add support for cross-encoder models (+fix token type ids) (#501)

    Example: Information Retrieval w/ Xenova/ms-marco-TinyBERT-L-2-v2.

    import { AutoTokenizer, AutoModelForSequenceClassification } from '@xenova/transformers';
    
    const model = await AutoModelForSequenceClassification.from_pretrained('Xenova/ms-marco-TinyBERT-L-2-v2');
    const tokenizer = await AutoTokenizer.from_pretrained('Xenova/ms-marco-TinyBERT-L-2-v2');
    
    const features = tokenizer(
        ['How many people live in Berlin?', 'How many people live in Berlin?'],
        {
            text_pair: [
                'Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.',
                'New York City is famous for the Metropolitan Museum of Art.',
            ],
            padding: true,
            truncation: true,
        }
    )
    
    const { logits } = await model(features)
    console.log(logits.data);
    // quantized:   [ 7.210887908935547, -11.559350967407227 ]
    // unquantized: [ 7.235750675201416, -11.562294006347656 ]

    Check out the list of pre-converted models here. We also put out a demo for you to try out.
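
For retrieval, the logits from the cross-encoder example are typically used to order candidate passages, with higher values indicating a better match for the query. A minimal sketch, assuming one logit per (query, passage) pair as in the output above; `rankPassages` is a hypothetical helper, not a library function.

```javascript
// Hypothetical helper (not part of Transformers.js): pair each passage with
// its logit and sort so that the most relevant passage comes first.
function rankPassages(passages, logits) {
    return passages
        .map((text, i) => ({ text, score: logits[i] }))
        .sort((a, b) => b.score - a.score);
}

const ranked = rankPassages(
    [
        'Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.',
        'New York City is famous for the Metropolitan Museum of Art.',
    ],
    [7.210887908935547, -11.559350967407227], // quantized logits from the example above
);
console.log(ranked[0].text); // prints the Berlin passage
```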

Full Changelog: 2.13.3...2.13.4

2.13.3

04 Jan 00:41

What's new?

  • Fix typo in JSDoc in #498
  • Fix properties on pipelines in #500. Thanks to @wesbos for reporting the issue!

Full Changelog: 2.13.2...2.13.3

2.13.2

03 Jan 14:57

What's new?

This release is a follow-up to #485, with additional intellisense-focused improvements (see PR).

Full Changelog: 2.13.1...2.13.2

2.13.1

03 Jan 11:24

What's new?

  • Improve typing of pipeline function in #485. Thanks to @wesbos for the suggestion!

    This also means when you hover over the class name, you'll get example code to help you out.

  • Add phi-1_5 model in #493.

    Example: Text generation with Xenova/phi-1_5_dev.

    import { pipeline } from '@xenova/transformers';
    
    // Create a text-generation pipeline
    const generator = await pipeline('text-generation', 'Xenova/phi-1_5_dev');
    
    // Construct prompt
    const prompt = `\`\`\`py
    import math
    def print_prime(n):
        """
        Print all primes between 1 and n
        """`;
    
    // Generate text
    const result = await generator(prompt, {
      max_new_tokens: 100,
    });
    console.log(result[0].generated_text);

    Results in:

    import math
    def print_prime(n):
        """
        Print all primes between 1 and n
        """
        primes = []
        for num in range(2, n+1):
            is_prime = True
            for i in range(2, int(math.sqrt(num))+1):
                if num % i == 0:
                    is_prime = False
                    break
            if is_prime:
                primes.append(num)
        print(primes)
    
    print_prime(20)

    Running the code produces the correct result:

    [2, 3, 5, 7, 11, 13, 17, 19]
    

Full Changelog: 2.13.0...2.13.1

2.13.0

27 Dec 15:00

What's new?

🎄 7 new architectures!

This release adds support for many new multimodal architectures, bringing the total number of supported architectures to 80! 🤯

1. VITS for multilingual text-to-speech in over 1000 languages! (#466)

import { pipeline } from '@xenova/transformers';

// Create English text-to-speech pipeline
const synthesizer = await pipeline('text-to-speech', 'Xenova/mms-tts-eng');

// Generate speech
const output = await synthesizer('I love transformers');
// {
//   audio: Float32Array(26112) [...],
//   sampling_rate: 16000
// }

See here for the list of available models. To start, we've converted 12 of the ~1140 models on the Hugging Face Hub. If we haven't added the one you wish to use, you can make it web-ready using our conversion script.
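
The text-to-speech pipeline returns raw samples (`audio`) and a `sampling_rate`, so the result still has to be wrapped in a container format before it can be played. The sketch below encodes mono Float32 samples as a 16-bit PCM WAV file. `encodeWav` is a hypothetical helper, not a Transformers.js API, and it assumes the samples lie in the range [-1, 1].

```javascript
// Hypothetical helper (not part of Transformers.js): encode mono Float32
// samples, ASSUMED to be in [-1, 1], as a 16-bit PCM WAV file.
function encodeWav(samples, sampleRate) {
    const bytesPerSample = 2;
    const dataSize = samples.length * bytesPerSample;
    const buffer = new ArrayBuffer(44 + dataSize);
    const view = new DataView(buffer);
    const writeString = (offset, s) => {
        for (let i = 0; i < s.length; ++i) view.setUint8(offset + i, s.charCodeAt(i));
    };

    writeString(0, 'RIFF');
    view.setUint32(4, 36 + dataSize, true);                // remaining file size
    writeString(8, 'WAVE');
    writeString(12, 'fmt ');
    view.setUint32(16, 16, true);                          // fmt chunk size
    view.setUint16(20, 1, true);                           // audio format: PCM
    view.setUint16(22, 1, true);                           // channels: mono
    view.setUint32(24, sampleRate, true);                  // sample rate
    view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
    view.setUint16(32, bytesPerSample, true);              // block align
    view.setUint16(34, 16, true);                          // bits per sample
    writeString(36, 'data');
    view.setUint32(40, dataSize, true);

    for (let i = 0; i < samples.length; ++i) {
        const s = Math.max(-1, Math.min(1, samples[i]));   // clamp to [-1, 1]
        view.setInt16(44 + 2 * i, Math.round(s * 32767), true);
    }
    return new Uint8Array(buffer);
}

const wavBytes = encodeWav(new Float32Array([0, 0.5, -0.5]), 16000);
console.log(wavBytes.length); // 50 (44-byte header + 3 samples * 2 bytes)
```

In Node.js, the bytes could then be written with `fs.writeFileSync('out.wav', encodeWav(output.audio, output.sampling_rate))`; in the browser, they can be wrapped in a `Blob` with type `audio/wav`.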

2. CLIPSeg for zero-shot image segmentation. (#478)

import { AutoTokenizer, AutoProcessor, CLIPSegForImageSegmentation, RawImage } from '@xenova/transformers';

// Load tokenizer, processor, and model
const tokenizer = await AutoTokenizer.from_pretrained('Xenova/clipseg-rd64-refined');
const processor = await AutoProcessor.from_pretrained('Xenova/clipseg-rd64-refined');
const model = await CLIPSegForImageSegmentation.from_pretrained('Xenova/clipseg-rd64-refined');

// Run tokenization
const texts = ['a glass', 'something to fill', 'wood', 'a jar'];
const text_inputs = tokenizer(texts, { padding: true, truncation: true });

// Read image and run processor
const image = await RawImage.read('https://github.com/timojl/clipseg/blob/master/example_image.jpg?raw=true');
const image_inputs = await processor(image);

// Run model with both text and pixel inputs
const { logits } = await model({ ...text_inputs, ...image_inputs });
// logits: Tensor {
//   dims: [4, 352, 352],
//   type: 'float32',
//   data: Float32Array(495616)[ ... ],
//   size: 495616
// }

You can visualize the predictions as follows:

const preds = logits
  .unsqueeze_(1)
  .sigmoid_()
  .mul_(255)
  .round_()
  .to('uint8');

for (let i = 0; i < preds.dims[0]; ++i) {
  const img = RawImage.fromTensor(preds[i]);
  img.save(`prediction_${i}.png`);
}

[Figure: original image and predictions for "a glass", "something to fill", "wood", and "a jar"]

See here for the list of available models.

3. SegFormer for semantic segmentation and image classification. (#480)

import { pipeline } from '@xenova/transformers';

// Create an image segmentation pipeline
const segmenter = await pipeline('image-segmentation', 'Xenova/segformer_b2_clothes');

// Segment an image
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/young-man-standing-and-leaning-on-car.jpg';
const output = await segmenter(url);

Output:

[
  {
    score: null,
    label: 'Background',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Hair',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Upper-clothes',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Pants',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Left-shoe',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Right-shoe',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Face',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Left-leg',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Right-leg',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Left-arm',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Right-arm',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  }
]

See here for the list of available models.

4. Table Transformer for table extraction from unstructured documents. (#477)

import { pipeline } from '@xenova/transformers';

// Create an object detection pipeline
const detector = await pipeline('object-detection', 'Xenova/table-transformer-detection', { quantized: false });

// Detect tables in an image
const img = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/invoice-with-table.png';
const output = await detector(img);
// [{ score: 0.9967531561851501, label: 'table', box: { xmin: 52, ymin: 322, xmax: 546, ymax: 525 } }]

See here for the list of available models.

5. DiT for document image classification. (#474)

import { pipeline } from '@xenova/transformers';

// Create an image classification pipeline
const classifier = await pipeline('image-classification', 'Xenova/dit-base-finetuned-rvlcdip');

// Classify an image 
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/coca_cola_advertisement.png';
const output = await classifier(url);
// [{ label: 'advertisement', score: 0.9035086035728455 }]

See here for the list of available models.

6. SigLIP for zero-shot image classification. (#473)

import { pipeline } from '@xenova/transformers';

// Create a zero-shot image classification pipeline
const classifier = await pipeline('zero-shot-image-classification', 'Xenova/siglip-base-patch16-224');

// Classify images according to provided labels
const url = 'http://images.cocodataset.org/val2017/000000039769.jpg';
const output = await classifier(url, ['2 cats', '2 dogs'], {
    hypothesis_template: 'a photo of {}',
});
// [
//   { score: 0.16770583391189575, label: '2 cats' },
//   { score: 0.000022096000975579955, label: '2 dogs' }
// ]

See here for the list of available models.

7. RoFormer for masked language modelling, sequence classification, token classification, and question answering. (#464)

import { pipeline } from '@xenova/transformers';

// Create a masked language modelling pipeline
const pipe = await pipeline('fill-mask', 'Xenova/antiberta2');

// Predict missing token
const output = await pipe('Ḣ Q V Q ... C A [MASK] D ... T V S S');

Output:

[
  {
    score: 0.48774364590644836,
    token: 19,
    token_str: 'R',
    sequence: 'Ḣ Q V Q C A R D T V S S'
  },
  {
    score: 0.2768442928791046,
    token: 18,
    token_str: 'Q...

2.12.1

18 Dec 21:30

What's new?

Patch for release 2.12.0, making @huggingface/jinja a dependency instead of a peer dependency. This also means apply_chat_template is now synchronous (and does not lazily load the module). In the future, we may want to add lazy loading back, but for now, it causes issues when the library is loaded from a CDN.

import { AutoTokenizer } from "@xenova/transformers";

// Load tokenizer from the Hugging Face Hub
const tokenizer = await AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1");

// Define chat messages
const chat = [
  { role: "user", content: "Hello, how are you?" },
  { role: "assistant", content: "I'm doing great. How can I help you today?" },
  { role: "user", content: "I'd like to show off how chat templating works!" },
]

const text = tokenizer.apply_chat_template(chat, { tokenize: false });
// "<s>[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today?</s> [INST] I'd like to show off how chat templating works! [/INST]"

const input_ids = tokenizer.apply_chat_template(chat, { tokenize: true, return_tensor: false });
// [1, 733, 16289, 28793, 22557, 28725, 910, 460, 368, 28804, 733, 28748, 16289, 28793, ...]

Full Changelog: 2.12.0...2.12.1