better-trigram

A (better) trigram tokenizer for SQLite3 FTS5

While SQLite3 already has a built-in trigram tokenizer, it does not have any kind of word segmentation support. For example, i am a bird gets tokenized as ['i a', ' am', 'm a', ' a ', ' bi', 'bir', 'ird'] which isn't too bad but in real-world use cases users usually type queries as SELECT * FROM fts_table WHERE title MATCH 'a bird' which returns no results.

This tokenizer fixes this by treating spaces as a word boundary. The result is that i am a bird gets tokenized as ['i', 'am', 'a', 'bir', 'ird'] and SELECT * FROM fts_table WHERE title MATCH 'a bird' correctly returns the expected results. You get all the benefits for substring matching just with a wider range of queries.

Furthermore, the built-in trigram tokenizer treats CJK as normal characters and creates trigrams out of them. The problem is, in CJK a single Unicode character can be a whole word. better-trigram fixes this by treating each CJK character as its own token. For example: 李红：那是钢笔 gets tokenized as ['李','红','：','那','是','钢','笔'] and if there are any non-CJK words mixed in the input, they also get properly tokenized automatically.

Compatibility with `trigram`

better-trigram is 99% compatible with trigram. This means it has full UTF-8 support, handles all the same edge cases etc. To ensure better-trigram remains compatible, it passes all the trigram tokenizer tests. Yay!

With that being said, better-trigram doesn't support LIKE & GLOB patterns. This is a limitation in FTS5 because it doesn't allow custom tokenizers to opt-in to this behavior. (You could technically compile a custom version of FTS5 that enables support for this but I haven't looked into it.) Using LIKE & GLOB will fallback to full table scans (not recommended).

Usage

You can use this tokenizer like any other custom tokenizer:

make

Load the better-trigram.so file as a loadable SQLite extension (e.g. .load better-trigram.so).

Then specify it when creating your FTS5 virtual table:

CREATE VIRTUAL TABLE t1 USING fts5(y, tokenize='better_trigram')

Options

better-trigram supports exactly the same options as trigram and copies the exact behavior of the original tokenizer when specifying invalid options. This means setting case_sensitive 1 and remove_diacritics 1 will throw an error.

Refer to FTS5 documentation for the Trigram tokenizer for more details on what each options does and how you can use it.

Performance

better-trigram is significantly faster than trigram. Here are the benchmarks:

Text length: 289091
clk: ~4.09 GHz
cpu: AMD Ryzen 5 PRO 5650U with Radeon Graphics
runtime: bun 1.1.31 (x64-linux)

benchmark              avg (min … max) p75   p99    (min … top 1%)
-------------------------------------- -------------------------------
better-trigram            6.37 ms/iter   6.86 ms   7.31 ms ▂▅▃▂▂▃▄▇█▄▂
trigram                  10.21 ms/iter  11.23 ms  12.75 ms █▄▄▄▃▅▅▇▄▂▂

summary
  better-trigram
   1.6x faster than trigram

To reproduce on your own machine, run bun bench.ts.

Contributing

All kinds of PRs are welcome, of course. Just make sure all the tests pass. You can run the tests like this:

make test

License

2024-10-21

The author disclaims copyright to this source code.  In place of
a legal notice, here is a blessing:

    May you do good and not evil.
    May you find forgiveness for yourself and forgive others.
    May you share freely, never taking more than you give.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

better-trigram

Compatibility with `trigram`

Usage

Options

Performance

Contributing

License

Files

README.md

Latest commit

History

README.md

File metadata and controls

better-trigram

Compatibility with trigram

Usage

Options

Performance

Contributing

License

Compatibility with `trigram`