Tokenizers

Go bindings for the HuggingFace Tokenizers library.

Installation

make build to build libtokenizers.a that you need to run your application that uses bindings. In addition, you need to inform the linker where to find that static library: go run -ldflags="-extldflags '-L./path/to/libtokenizers/directory'" . or just add it to the CGO_LDFLAGS environment variable: CGO_LDFLAGS="-L./path/to/libtokenizers/directory" to avoid specifying it every time.

Using pre-built binaries

If you don't want to install Rust toolchain, build it in docker: docker build --platform=linux/amd64 -f release/Dockerfile . or use prebuilt binaries from the releases page. Prebuilt libraries are available for:

Getting started

TLDR: working example.

Load a tokenizer from a JSON config:

import "github.com/daulet/tokenizers"

tk, err := tokenizers.FromFile("./data/bert-base-uncased.json")
if err != nil {
    return err
}
// release native resources
defer tk.Close()

Load a tokenizer from Huggingface:

import "github.com/daulet/tokenizers"

tk, err := tokenizers.FromPretrained("google-bert/bert-base-uncased")
if err != nil {
    return err
}
// release native resources
defer tk.Close()

Encode text and decode tokens:

fmt.Println("Vocab size:", tk.VocabSize())
// Vocab size: 30522
fmt.Println(tk.Encode("brown fox jumps over the lazy dog", false))
// [2829 4419 14523 2058 1996 13971 3899] [brown fox jumps over the lazy dog]
fmt.Println(tk.Encode("brown fox jumps over the lazy dog", true))
// [101 2829 4419 14523 2058 1996 13971 3899 102] [[CLS] brown fox jumps over the lazy dog [SEP]]
fmt.Println(tk.Decode([]uint32{2829, 4419, 14523, 2058, 1996, 13971, 3899}, true))
// brown fox jumps over the lazy dog

Encode text with options:

var encodeOptions []tokenizers.EncodeOption
encodeOptions = append(encodeOptions, tokenizers.WithReturnTypeIDs())
encodeOptions = append(encodeOptions, tokenizers.WithReturnAttentionMask())
encodeOptions = append(encodeOptions, tokenizers.WithReturnTokens())
encodeOptions = append(encodeOptions, tokenizers.WithReturnOffsets())
encodeOptions = append(encodeOptions, tokenizers.WithReturnSpecialTokensMask())

// Or just basically
// encodeOptions = append(encodeOptions, tokenizers.WithReturnAllAttributes())

encodingResponse := tk.EncodeWithOptions("brown fox jumps over the lazy dog", false, encodeOptions...)
fmt.Println(encodingResponse.IDs)
// [2829 4419 14523 2058 1996 13971 3899]
fmt.Println(encodingResponse.TypeIDs)
// [0 0 0 0 0 0 0]
fmt.Println(encodingResponse.SpecialTokensMask)
// [0 0 0 0 0 0 0]
fmt.Println(encodingResponse.AttentionMask)
// [1 1 1 1 1 1 1]
fmt.Println(encodingResponse.Tokens)
// [brown fox jumps over the lazy dog]
fmt.Println(encodingResponse.Offsets)
// [[0 5] [6 9] [10 15] [16 20] [21 24] [25 29] [30 33]]

Benchmarks

go test . -run=^\$ -bench=. -benchmem -count=10 > test/benchmark/$(git rev-parse HEAD).txt

Decoding overhead (due to CGO and extra allocations) is between 2% to 9% depending on the benchmark.

go test . -bench=. -benchmem -benchtime=10s

goos: darwin
goarch: arm64
pkg: github.com/daulet/tokenizers
BenchmarkEncodeNTimes-10     	  959494	     12622 ns/op	     232 B/op	      12 allocs/op
BenchmarkEncodeNChars-10      1000000000	     2.046 ns/op	       0 B/op	       0 allocs/op
BenchmarkDecodeNTimes-10     	 2758072	      4345 ns/op	      96 B/op	       3 allocs/op
BenchmarkDecodeNTokens-10    	18689725	     648.5 ns/op	       7 B/op	       0 allocs/op
PASS
ok   github.com/daulet/tokenizers 126.681s

Run equivalent Rust tests with cargo bench.

decode_n_times          time:   [3.9812 µs 3.9874 µs 3.9939 µs]
                        change: [-0.4103% -0.1338% +0.1275%] (p = 0.33 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild

decode_n_tokens         time:   [651.72 ns 661.73 ns 675.78 ns]
                        change: [+0.3504% +2.0016% +3.5507%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

Contributing

Please refer to CONTRIBUTING.md for information on how to contribute a PR to this project.

Name		Name	Last commit message	Last commit date
Latest commit History 74 Commits
.github/workflows		.github/workflows
benches		benches
example		example
release		release
src		src
test		test
.gitignore		.gitignore
BUILD.bazel		BUILD.bazel
CONTRIBUTING.md		CONTRIBUTING.md
Cargo.Bazel.lock		Cargo.Bazel.lock
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE		LICENSE
MODULE.bazel		MODULE.bazel
MODULE.bazel.lock		MODULE.bazel.lock
Makefile		Makefile
README.md		README.md
WORKSPACE		WORKSPACE
benchmark.py		benchmark.py
deps.bzl		deps.bzl
go.mod		go.mod
go.sum		go.sum
tokenizer.go		tokenizer.go
tokenizer_test.go		tokenizer_test.go
tokenizers.h		tokenizers.h

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Tokenizers

Installation

Using pre-built binaries

Getting started

Benchmarks

Contributing

About

Releases 22

Packages

Contributors 7

Languages

License

daulet/tokenizers

Folders and files

Latest commit

History

Repository files navigation

Tokenizers

Installation

Using pre-built binaries

Getting started

Benchmarks

Contributing

About

Resources

License

Stars

Watchers

Forks

Releases 22

Packages 0

Contributors 7

Languages

Packages