Notes on versioning:
The project follows semantic versioning 2.0.0. The API covers the following symbols:
- C++
  - `onmt::BPELearner`
  - `onmt::BPE`
  - `onmt::SPMLearner`
  - `onmt::SentencePiece`
  - `onmt::SpaceTokenizer`
  - `onmt::Tokenizer`
  - `onmt::Vocab`
  - `onmt::unicode::*`
- Python
  - `pyonmttok.BPELearner`
  - `pyonmttok.SentencePieceLearner`
  - `pyonmttok.SentencePieceTokenizer`
  - `pyonmttok.Tokenizer`
  - `pyonmttok.Vocab`
v1.37.1 (2023-03-01)
- Consider escaped characters as single characters in BPE
- Ignore undefined scripts when resolving inherited or common scripts
v1.37.0 (2023-02-28)
- Add tokenization option `allow_isolated_marks` to allow combining marks to appear isolated in the tokenization output in specific conditions
- Fix infinite loop when the text contains an invalid Unicode character
- Fix segmentation fault when the `BPELearner` does not find any pairs of characters in the tokenized data
- [Python] Update ICU to 72.1
v1.36.0 (2023-01-11)
- [Python] Add argument `vocabulary` in the `Tokenizer` constructor to set the vocabulary with a list of tokens instead of using a file
- [Python] Add function `pyonmttok.is_valid_language` to check if a language code is valid and can be passed to the `Tokenizer` constructor
v1.35.0 (2022-12-06)
- [Python] Add pickling support to `pyonmttok.Vocab`
- Update pybind11 to 2.10.1
- Update cibuildwheel to 2.11.2
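Pickling support for a wrapper class is typically added through the `__getstate__`/`__setstate__` hooks. A minimal stdlib-only sketch of that mechanism (the `SimpleVocab` class below is hypothetical and illustrates the general pattern, not the actual `pyonmttok.Vocab` implementation):

```python
import pickle

class SimpleVocab:
    """Toy vocabulary mapping tokens to ids; illustrates pickling hooks."""

    def __init__(self, tokens=()):
        self.ids_to_tokens = list(tokens)
        self.tokens_to_ids = {t: i for i, t in enumerate(self.ids_to_tokens)}

    def __getstate__(self):
        # Serialize only the token list; the reverse map is rebuilt on load.
        return {"tokens": self.ids_to_tokens}

    def __setstate__(self, state):
        self.__init__(state["tokens"])

vocab = SimpleVocab(["<unk>", "hello", "world"])
restored = pickle.loads(pickle.dumps(vocab))
assert restored.tokens_to_ids == {"<unk>": 0, "hello": 1, "world": 2}
```

Serializing only the forward list and rebuilding the derived map keeps the pickle payload small and avoids storing redundant state.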
v1.34.0 (2022-09-13)
- [Python] Wheels are now built under `manylinux2014` and require `pip` >= 19.3 for installation
- [Python] Build wheels for Python 3.11
- Improve error handling when reading token frequencies in the vocabulary file
- [Python] Fix possible crash when `pyonmttok` is imported before `torch`
- [Python] Update ICU to 71.1
- [C++] Fix static compilation with `-DBUILD_SHARED_LIBS=OFF`
- [C++] Fix CMake warning when compiling the tests
v1.33.0 (2022-08-29)
- [Python] Build ARM64 wheels for macOS
- [CLI] Fix error when the option `--segment_alphabet` is not set
- Fix SentencePiece build warning when compiling with Clang
v1.32.0 (2022-07-25)
- Add property `pyonmttok.Vocab.counters` to retrieve the number of occurrences of each token
- Update pybind11 to 2.10.0
- Update cxxopts to 3.0.0
v1.31.0 (2022-03-07)
- Add utilities to build and use vocabularies:
  - `pyonmttok.Vocab`
  - `pyonmttok.build_vocab_from_tokens`
  - `pyonmttok.build_vocab_from_lines`
- Define the method `Tokenizer.__call__` to simplify the tokenizer usage when additional features are unused: `tokens = tokenizer(text)`
- Update pybind11 to 2.9.1
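The `tokens = tokenizer(text)` shortcut is the standard Python callable-object pattern: the class defines `__call__` to forward to its main method. A stdlib-only sketch of the idea (the `WhitespaceTokenizer` class is a hypothetical stand-in, not the pyonmttok implementation):

```python
class WhitespaceTokenizer:
    """Toy tokenizer; __call__ forwards to tokenize() for the common case."""

    def tokenize(self, text):
        # The full API may also return extra features; here, tokens only.
        return text.split()

    def __call__(self, text):
        # Callable-object shortcut: tokenizer(text) == tokenizer.tokenize(text)
        return self.tokenize(text)

tokenizer = WhitespaceTokenizer()
tokens = tokenizer("Hello World !")
assert tokens == ["Hello", "World", "!"]
```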
v1.30.1 (2022-01-25)
- Fix deprecated language codes in ICU that are incorrectly considered invalid (e.g. "tl" for Tagalog)
v1.30.0 (2021-11-29)
- [Python] Build wheels for AArch64 Linux
- [Python] Update ICU to 70.1
v1.29.0 (2021-10-08)
- [Python] Drop support for Python 3.5
- [Python] Build wheels for Python 3.10
- [Python] Add tokenization method `Tokenizer.tokenize_batch`
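A batch tokenization method is commonly a thin wrapper that maps the single-text method over a list, optionally across threads. A hedged stdlib sketch of that shape (all names here are hypothetical, not the pyonmttok internals, which do the parallel work in C++ with the GIL released):

```python
from concurrent.futures import ThreadPoolExecutor

class WhitespaceTokenizer:
    """Toy tokenizer used to illustrate a batch wrapper."""

    def tokenize(self, text):
        return text.split()

    def tokenize_batch(self, texts, num_threads=2):
        # pool.map preserves input order; threads only pay off when
        # tokenize() releases the GIL (as a C++ extension can).
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            return list(pool.map(self.tokenize, texts))

batch = WhitespaceTokenizer().tokenize_batch(["a b", "c d e"])
assert batch == [["a", "b"], ["c", "d", "e"]]
```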
v1.28.1 (2021-09-30)
- Fix detokenization when a token includes a fullwidth percent sign (%) that is not used as an escape sequence (version 1.27.0 contained a partial fix for this bug)
v1.28.0 (2021-09-17)
- [C++] Remove the `SpaceTokenizer` class, which is not meant to be public and can be confused with the "space" tokenization mode
- Build Python wheels for Windows
- Add option `tokens_delimiter` to configure how tokens are delimited in tokenized files (default is a space)
- Expose option `with_separators` in Python and CLI to include whitespace characters in the tokenized output
- [Python] Add package version information in `pyonmttok.__version__`
- Fix detokenization when option `with_separators` is enabled
v1.27.0 (2021-08-30)
- Linux Python wheels are now compiled with `manylinux2010` and require `pip` >= 19.0 for installation
- macOS Python wheels now require macOS >= 10.14
- Fix casing resolution when some letters do not have case information
- Fix detokenization when a token includes a fullwidth percent sign (%) that is not used as an escape sequence
- Improve error message when setting invalid `segment_alphabet` or `lang` options
- Update SentencePiece to 0.1.96
- [Python] Improve declaration of functions and classes for better type hints and checks
- [Python] Update ICU to 69.1
v1.26.4 (2021-06-25)
- Fix regression introduced in last version for preserved tokens that are not segmented by BPE
v1.26.3 (2021-06-24)
- Fix another divergence with the SentencePiece output when there is only one subword and the spacer is detached
v1.26.2 (2021-06-08)
- Fix a divergence with the SentencePiece output when the spacer is detached from the word
v1.26.1 (2021-05-31)
- Fix application of the BPE vocabulary when using `preserve_segmented_tokens` and a subword appears without a joiner in the vocabulary
- Fix compilation with ICU versions older than 60
v1.26.0 (2021-04-19)
- Add `lang` tokenization option to apply language-specific case mappings
- Use ICU to convert strings to Unicode values instead of a custom implementation
v1.25.0 (2021-03-15)
- Add `training` flag in tokenization methods to disable subword regularization during inference
- [Python] Implement the `__len__` method in the `Token` class
- Raise an error when enabling `case_markup` with the incompatible tokenization modes "space" and "none"
- [Python] Improve parallelization when `Tokenizer.tokenize` is called from multiple Python threads (the Python GIL is now released)
- [Python] Clean up some manual Python <-> C++ type conversions
v1.24.0 (2021-02-16)
- Add `verbose` flag in file tokenization APIs to log progress every 100,000 lines
- [Python] Add `options` property to `Tokenizer` instances
- [Python] Add class `pyonmttok.SentencePieceTokenizer` to help create a tokenizer compatible with SentencePiece
- Fix deserialization into `Token` objects that was sometimes incorrect
- Fix Windows compilation
- Fix Google Test integration that was sometimes installed as part of `make install`
- [Python] Update pybind11 to 2.6.2
- [Python] Update ICU to 66.1
- [Python] Compile ICU with optimization flags
v1.23.0 (2020-12-30)
- Drop Python 2 support
- Publish Python wheels for macOS
- Improve performance in all tokenization modes (up to 2x faster)
- Fix missing space escaping within protected sequences in "none" and "space" tokenization modes
- Fix a regression introduced in 1.20 where `segment_alphabet_*` options behave differently on characters that appear in multiple Unicode scripts (e.g. some Japanese characters can belong to both the Hiragana and Katakana scripts and should not trigger a segmentation)
- Fix a regression introduced in 1.21 where a joiner is incorrectly placed when using `preserve_segmented_tokens` and the word is segmented by both a `segment_*` option and BPE
- Fix incorrect tokenization when using `support_prior_joiners` and some joiners are within protected sequences
v1.22.2 (2020-11-12)
- Do not require "none" tokenization mode for SentencePiece vocabulary restriction
v1.22.1 (2020-10-30)
- Fix error when enabling vocabulary restriction with SentencePiece and `spacer_annotate` is not explicitly set
- Fix backward compatibility with Kangxi and Kanbun scripts (see the `segment_alphabet` option)
v1.22.0 (2020-10-29)
- [C++] Subword model caching is no longer supported and should be handled by the client. The subword encoder instance can now be passed as a `std::shared_ptr` to make it outlive the `Tokenizer` instance.
- Add `set_random_seed` function to make subword regularization reproducible
- [Python] Support serialization of `Token` instances
- [C++] Add `Options` structure to configure tokenization options (`Flags` can still be used for backward compatibility)
- Fix BPE vocabulary restriction when using `joiner_new`, `spacer_annotate`, or `spacer_new` (the previous implementation always assumed `joiner_annotate` was used)
- [Python] Fix `spacer` argument name in the `Token` constructor
- [C++] Fix ambiguous subword encoder ownership by using a `std::shared_ptr`
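Making stochastic subword segmentation reproducible boils down to seeding the random number generator used by the samplers. A stdlib-only illustration of the principle (the `sample_segmentation` toy below is hypothetical, not the library's actual regularization algorithm):

```python
import random

# One shared RNG for all sampling calls, mirroring a global seed function.
_rng = random.Random()

def set_random_seed(seed):
    # Seeding once makes every subsequent sampling call deterministic.
    _rng.seed(seed)

def sample_segmentation(word):
    # Toy stand-in for subword regularization: pick a random split point.
    cut = _rng.randint(1, len(word) - 1)
    return [word[:cut], word[cut:]]

set_random_seed(42)
first = sample_segmentation("tokenization")
set_random_seed(42)
second = sample_segmentation("tokenization")
assert first == second  # same seed, same sampled segmentation
```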
v1.21.0 (2020-10-22)
- Accept vocabularies with tab-separated frequencies (format produced by SentencePiece)
- Fix BPE vocabulary restriction when words have a leading or trailing joiner
- Raise an error when using a multi-character joiner and `support_prior_joiners`
- [Python] Implement the `__hash__` method of `pyonmttok.Token` objects to be consistent with the `__eq__` implementation
- [Python] Declare `pyonmttok.Tokenizer` arguments (except `mode`) as keyword-only
- [Python] Improve compatibility with Python 3.9
v1.20.0 (2020-09-24)
- The following changes affect users compiling the project from source. They ensure users get the best performance and all features by default:
  - ICU is now required to improve performance and Unicode support
  - SentencePiece is now integrated as a Git submodule and linked statically to the project
  - Boost is no longer required; the project now uses cxxopts, which is integrated as a Git submodule
  - The project is compiled in `Release` mode by default
  - Tests are no longer compiled by default (use `-DBUILD_TESTS=ON` to compile the tests)
- Accept any Unicode script aliases in the `segment_alphabet` option
- Update SentencePiece to 0.1.92
- [Python] Improve the capabilities of the `Token` class:
  - Implement the `__repr__` method
  - Allow setting all attributes in the constructor
  - Add a copy constructor
- [Python] Add a copy constructor for the `Tokenizer` class
- [Python] Accept `None` value for the `segment_alphabet` argument
v1.19.0 (2020-09-02)
- Add BPE dropout (Provilkov et al. 2019)
- [Python] Introduce the "Token API": a set of methods that manipulate `Token` objects instead of serialized strings
- [Python] Add `unicode_ranges` argument to the `detokenize_with_ranges` method to return ranges over Unicode characters instead of bytes
- Include "Half-width kana" in Katakana script detection
v1.18.5 (2020-07-07)
- Fix possible crash when applying a case insensitive BPE model on Unicode characters
v1.18.4 (2020-05-22)
- Fix segmentation fault on `cli/tokenize` exit
- Ignore empty tokens during detokenization
- When writing to a file, avoid flushing the output stream on each line
- Update `cli/CMakeLists.txt` to mark Boost.ProgramOptions as required
v1.18.3 (2020-03-09)
- Strip token annotations when calling `SubwordLearner.ingest_token`
v1.18.2 (2020-02-17)
- Speed and memory improvements for BPE learning
v1.18.1 (2020-01-16)
- [Python] Fix memory leak when deleting Tokenizer object
v1.18.0 (2020-01-06)
- Include `is_placeholder` function in the Python API
- Add `ingest_token` method to learner objects to allow external tokenization
v1.17.2 (2019-12-06)
- Fix joiner annotation when SentencePiece returns isolated spacers
- Apply `preserve_segmented_tokens` in "none" tokenization mode
- Performance improvements when using `case_feature` or `case_markup`
- Add missing `--no_substitution` flag on the command line client
v1.17.1 (2019-11-28)
- Fix missing case features for isolated joiners or spacers
v1.17.0 (2019-11-13)
- Flag `soft_case_regions` to minimize the number of uppercase regions when using `case_markup`
- Fix mismatch between subword learning and encoding when using `case_feature`
- [C++] Fix missing default value for the new argument of the `SPMLearner` constructor
v1.16.1 (2019-10-21)
- Fix invalid SentencePiece training file when generated with `SentencePieceLearner.ingest` (newlines were missing)
- Correctly ignore placeholders when using `SentencePieceLearner` without a tokenizer
v1.16.0 (2019-10-07)
- Support keeping the vocabulary generated by SentencePiece with the `keep_vocab` argument
- [C++] Add intermediate method to annotate tokens before detokenization
- Improve detection of file read/write errors
- [Python] Lower the risk of ABI incompatibilities with other pybind11 extensions
v1.15.7 (2019-09-20)
- Do not apply case modifiers on placeholder tokens
v1.15.6 (2019-09-16)
- Fix placeholder tokenization when followed by a combining mark
v1.15.5 (2019-09-16)
- [Python] Downgrade `pybind11` to fix a segmentation fault when importing after non-compliant Python wheels
v1.15.4 (2019-09-14)
- [Python] Fix possible runtime error on program exit when using `SentencePieceLearner`
v1.15.3 (2019-09-13)
- Fix possible memory issues when run in multiple threads with ICU
v1.15.2 (2019-09-11)
- [Python] Improve error checking in file based functions
v1.15.1 (2019-09-05)
- Fix regression in space tokenization: characters inside placeholders were incorrectly normalized
v1.15.0 (2019-09-05)
- `support_prior_joiners` flag to support tokenizing a pre-tokenized input
- Fix case markup when joiners or spacers are individual tokens
v1.14.1 (2019-08-07)
- Improve error checking
v1.14.0 (2019-07-19)
- [C++] Method to detokenize from `AnnotatedToken`s
- [Python] Release the GIL in time consuming functions (e.g. file tokenization, subword learning, etc.)
- Performance improvements
v1.13.0 (2019-06-12)
- [Python] File-based tokenization and detokenization APIs
- Support tokenizing files with multiple threads
- Respect "NoSubstitution" flag for combining marks applied on spaces
v1.12.1 (2019-05-27)
- Fix Python package
v1.12.0 (2019-05-27)
- Python API for subword learning (BPE and SentencePiece)
- C++ tokenization method to get the intermediate token representation
- Replace Boost.Python by pybind11 for the Python wrapper
- Fix verbose flag for SentencePiece training
- Check and raise possible errors during SentencePiece training
v1.11.0 (2019-02-05)
- Support copy operators on the Python client
- Support returning token locations in detokenized text
- Hide SentencePiece dependency in public headers
v1.10.6 (2019-01-15)
- Update SentencePiece to 0.1.8 in the Python package
- Allow naming positional arguments in the Python API
v1.10.5 (2019-01-03)
- More strict handling of combining marks (fixes #57 and #58)
v1.10.4 (2018-12-18)
- Harden detokenization on invalid case markup combinations
v1.10.3 (2018-11-05)
- Fix case markup for one-letter words
v1.10.2 (2018-10-18)
- Fix compilation errors when SentencePiece is not installed
- Fix DLLs builds using Visual Studio
- Handle rare cases where SentencePiece returns 0 pieces
v1.10.1 (2018-10-08)
- Fix regression for SentencePiece: spacer annotation was not automatically enabled in tokenization mode "none"
v1.10.0 (2018-10-05)
- `CaseMarkup` flag to inject case information as new tokens
- Do not break compilation for users with old SentencePiece versions
v1.9.0 (2018-09-25)
- Vocabulary restriction for SentencePiece encoding
- Improve Tokenizer constructor for subword configuration
v1.8.4 (2018-09-24)
- Expose base methods in the `Tokenizer` class
- Small performance improvements for standard use cases
v1.8.3 (2018-09-18)
- Fix count of Arabic characters in the map of detected alphabets
v1.8.2 (2018-09-10)
- Minor fix to CMakeLists.txt for SentencePiece compilation
v1.8.1 (2018-09-07)
- Support training SentencePiece as a subtokenizer
v1.8.0 (2018-09-07)
- Add learning interface for SentencePiece
v1.7.0 (2018-09-04)
- Add integrated subword learning, with initial support for BPE
- Preserve placeholders as independent tokens for all modes
v1.6.2 (2018-08-29)
- Support SentencePiece sampling API
- Additional +30% speedup for BPE tokenization
- Fix BPE not respecting `PreserveSegmentedTokens` (#30)
v1.6.1 (2018-07-31)
- Fix Python package
v1.6.0 (2018-07-30)
- `PreserveSegmentedTokens` flag to not attach joiners or spacers to tokens segmented by any `Segment*` flags
- Do not rebuild `bpe_vocab` if already loaded (e.g. when `CacheModel` is set)
v1.5.3 (2018-07-13)
- Fix `PreservePlaceholders` with `JoinerAnnotate` that possibly modified other tokens
v1.5.2 (2018-07-12)
- Fix support of BPE models v0.2 trained with `learn_bpe.py`
v1.5.1 (2018-07-12)
- Do not escape spaces in placeholder values if `NoSubstitution` is enabled
v1.5.0 (2018-07-03)
- Support `apply_bpe.py` 0.3 mode
- Up to 3x faster tokenization and detokenization
v1.4.0 (2018-06-13)
- New character-level tokenization mode `Char`
- Flag `SpacerNew` to make spacers independent tokens
- Replace spacer tokens by substitutes when found in the input text
- Do not enable spacers by default when SentencePiece is used as a subtokenizer
v1.3.0 (2018-04-07)
- New tokenization mode `None` that simply forwards the input text
- Support SentencePiece as a tokenizer or sub-tokenizer
- Flag `PreservePlaceholders` to not mark placeholders with joiners or spacers
- Revisit Python compilation to support wheel building
v1.2.0 (2018-03-28)
- Add API to retrieve discovered alphabet during tokenization
- Flag to convert joiners to spacers
- Add install target for the Python bindings library
v1.1.1 (2018-01-23)
- Make `Alphabet.h` public
v1.1.0 (2018-01-22)
- Python bindings
- Tokenization flag to disable special characters substitution
- Fix incorrect behavior when `--segment_alphabet` is not set by the client
- Fix alphabet identification
- Fix segmentation fault when tokenizing empty string on spaces
v1.0.0 (2017-12-11)
- New `Tokenizer` constructor requiring bit flags
- Support BPE modes from `learn_bpe.lua`
- Case-insensitive BPE models
- Space tokenization mode
- Alphabet segmentation
- Do not tokenize blocks encapsulated by ⦅ and ⦆
- `segment_numbers` flag to split numbers into digits
- `segment_case` flag to split words on case changes
- `segment_alphabet_change` flag to split on alphabet change
- `cache_bpe_model` flag to cache BPE models for future instances
- Fix `SpaceTokenizer` crash with leading or trailing spaces
- Fix incorrect tokenization around the tabulation character (#5)
- Fix incorrect joiner between numeric and punctuation
v0.2.0 (2017-03-08)
- Add CMake install rule
- Add API option to include separators
- Add static library compilation support
- Rename library to libOpenNMTTokenizer
- Make words features optional in tokenizer API
- Make `unicode` headers private
v0.1.0 (2017-02-14)
Initial release.