SPDX-FileCopyrightText: 2024 PyThaiNLP Project
SPDX-License-Identifier: Apache-2.0

nlpO3

Thai natural language processing library in Rust, with Python and Node bindings. Formerly oxidized-thainlp.

To use as a library in a Rust project:

cargo add nlpo3

To use as a library in a Python project:

pip install nlpo3

Features

Thai word tokenizer
- Uses a maximal-matching, dictionary-based tokenization algorithm and honors Thai Character Cluster boundaries
  - 2.5x faster than a similar pure-Python implementation (PyThaiNLP's newmm)
- Loads a dictionary from a plain text file (one word per line) or from a Vec<String>
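
To give a rough feel for dictionary-based matching, here is a minimal sketch of greedy longest-match segmentation. It is not nlpO3's implementation: it ignores Thai Character Cluster boundaries, does no backtracking, and the function name `longest_match_segment` is hypothetical, chosen only for this illustration.

```rust
use std::collections::HashSet;

/// Illustrative only (not nlpO3's algorithm): greedy longest-match
/// segmentation over Unicode scalar values, with no Thai Character
/// Cluster handling and no backtracking.
fn longest_match_segment(text: &str, dict: &HashSet<String>) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    // The longest dictionary entry (in chars) bounds the lookahead window.
    let max_len = dict.iter().map(|w| w.chars().count()).max().unwrap_or(0);
    let mut tokens = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let window_end = chars.len().min(start + max_len);
        // Fallback: emit a single character when no dictionary word matches.
        let mut end = start + 1;
        // Try the longest candidate first, then shrink.
        for candidate_end in (start + 1..=window_end).rev() {
            let candidate: String = chars[start..candidate_end].iter().collect();
            if dict.contains(&candidate) {
                end = candidate_end;
                break;
            }
        }
        tokens.push(chars[start..end].iter().collect());
        start = end;
    }
    tokens
}
```

With a dictionary containing ฉัน, กิน, and ข้าว, this sketch splits "ฉันกินข้าว" into those three words. Text not covered by the dictionary falls back to single-character tokens, which is one reason a real tokenizer needs more than this.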

Use

Node.js binding

See nlpo3-nodejs.

Python binding

Example:

from nlpo3 import load_dict, segment

load_dict("path/to/dict.file", "dict_name")
segment("สวัสดีครับ", "dict_name")

See more at nlpo3-python.

Rust library

Add to dependency

To use as a library in a Rust project:

cargo add nlpo3

It will add "nlpo3" to Cargo.toml:

[dependencies]
# ...
nlpo3 = "1.4.0"

Example

Create a tokenizer using a dictionary from file, then use it to tokenize a string (safe mode = true, and parallel mode = false):

use nlpo3::tokenizer::newmm::NewmmTokenizer;
use nlpo3::tokenizer::tokenizer_trait::Tokenizer;

let tokenizer = NewmmTokenizer::new("path/to/dict.file");
let tokens = tokenizer.segment("ห้องสมุดประชาชน", true, false).unwrap();

Create a tokenizer using a dictionary from a vector of Strings:

let words = vec!["ปาลิเมนต์".to_string(), "คอนสติติวชั่น".to_string()];
// `mut` so that words can be added or removed later
let mut tokenizer = NewmmTokenizer::from_word_list(words);

Add words to an existing tokenizer:

tokenizer.add_word(&["มิวเซียม"]);

Remove words from an existing tokenizer:

tokenizer.remove_word(&["กระเพรา", "ชานชลา"]);

Command-line interface

Example:

echo "ฉันกินข้าว" | nlpo3 segment

See more at nlpo3-cli.

Dictionary

In the interest of library size, nlpO3 does not assume which dictionary the user would like to use, and it does not ship with one.
A dictionary is needed for the dictionary-based word tokenizer.
For a tokenization dictionary, try:
- words_th.txt from PyThaiNLP
  - ~62,000 words
  - CC0-1.0
- word break dictionary from libthai
  - consists of dictionaries in different categories, with a make script
  - LGPL-2.1

Build

Requirements

Rust 2018 Edition

Steps

Generic test:

cargo test

Build the API documentation and open it to check:

cargo doc --open

Build (remove --release to keep debug information):

cargo build --release

Check target/ for build artifacts.

Develop

Development document

Notes on custom string

Issues

Please report issues at https://github.com/PyThaiNLP/nlpo3/issues

License

nlpO3 is copyrighted by its authors and licensed under the terms of the Apache License 2.0 (Apache-2.0). See the LICENSE file for details.
