Skip to content

kagiannis/fast-tokenizer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

fast-tokenizer

A cammand line tokenizer based on re2c. It tries to perform with acceptable accurancy for most weastern languages while being many times faster than competiting tokenizers like spacy's.

Compilation

make

Help

ftok: [OPTIONS] [INPUT_FILES]
        -a type1,type2,... accept only selected token types
        -r type1,type2,... reject selected token types
        -w                 print word count
        -W                 print tokenization in stdout and word count in stderr
valid token types: word, email, punct, num, date, symbol, uri, price, greek, other

About

A fast tokenizer based on re2c

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published