This repository contains several demo tokenizers, mainly for Chinese word segmentation.
Note: this project is modeled on jieba; its data files and reference implementations come from the jieba code base.
Methods:

- method1 (DONE): Maximum Matching (a minimal sketch follows this list)
- method2 (DONE): UniGram
- method3 (DONE): HMM (a Viterbi sketch follows the file tree below)
- method4 (TODO): CRF
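As a quick illustration of method1, here is a minimal, self-contained sketch of forward maximum matching. The function names (`load_words`, `forward_max_match`) and the `max_len` parameter are illustrative, not the actual `module1` API; the only assumption is that each line of `datas/data1/dict.txt` starts with a word, as in jieba's `dict.txt` format (`word frequency pos_tag`).

```python
def load_words(dict_path="datas/data1/dict.txt"):
    # Each line of jieba's dict.txt looks like `word frequency pos_tag`;
    # maximum matching only needs the word itself.
    words = set()
    with open(dict_path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if parts:
                words.add(parts[0])
    return words


def forward_max_match(text, words, max_len=5):
    # Greedy left-to-right scan: at each position take the longest
    # dictionary word, falling back to a single character.
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in words:
                tokens.append(candidate)
                i += length
                break
    return tokens
```

Backward maximum matching is the mirror image of this (scan right to left); the demo module may implement either direction.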
+ datas
  + data1
    - dict.txt # keep in sync with `jieba/dict.txt`
  + data2
    - prob_emit.py # keep in sync with `jieba/finalseg/prob_emit.py`
    - prob_start.py # keep in sync with `jieba/finalseg/prob_start.py`
    - prob_trans.py # keep in sync with `jieba/finalseg/prob_trans.py`
+ modules
  - module1
  - module2 # refers to `jieba/__init__.py`
  - module3 # refers to `jieba/finalseg/__init__.py`
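The three `data2` tables are exactly what the HMM segmenter (method3) needs: start, transition, and emission log probabilities over the four character tags B (begin of word), M (middle), E (end), and S (single-character word). Below is a condensed, self-contained Viterbi sketch in the spirit of `jieba/finalseg/__init__.py`; the toy `start_p`/`trans_p` values and the `viterbi` name are placeholders for illustration, while the real tables live in `prob_start.py`, `prob_trans.py`, and `prob_emit.py`.

```python
import math

MIN_LOG = -3.14e100  # stands in for log(0), as jieba's tables do

# Toy tables for illustration only; real values come from datas/data2.
start_p = {"B": math.log(0.6), "S": math.log(0.4), "M": MIN_LOG, "E": MIN_LOG}
trans_p = {
    "B": {"M": math.log(0.3), "E": math.log(0.7)},
    "M": {"M": math.log(0.3), "E": math.log(0.7)},
    "E": {"B": math.log(0.5), "S": math.log(0.5)},
    "S": {"B": math.log(0.5), "S": math.log(0.5)},
}


def viterbi(text, emit_p):
    """Return the most likely B/M/E/S tag sequence for a non-empty `text`.

    `emit_p` maps each state to {char: log prob}, like prob_emit.py.
    """
    # V[t][state] = best log prob of any tag path ending in `state` at t
    V = [{}]
    path = {}
    for s in "BMES":
        V[0][s] = start_p.get(s, MIN_LOG) + emit_p[s].get(text[0], MIN_LOG)
        path[s] = [s]
    for t in range(1, len(text)):
        V.append({})
        new_path = {}
        for s in "BMES":
            em = emit_p[s].get(text[t], MIN_LOG)
            prob, prev = max(
                (V[t - 1][p] + trans_p[p].get(s, MIN_LOG) + em, p) for p in "BMES"
            )
            V[t][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    # A word can only end on an E or S tag.
    prob, best = max((V[-1][s], s) for s in "ES")
    return path[best]
```

Cutting the input at every `E` or `S` tag turns the tag sequence back into words, which is how jieba's `finalseg` produces its segmentation from the Viterbi output.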
Supplement: how do you obtain the data files? The command below comes from a jieba issue.
```sh
PYTHONIOENCODING=utf-8 PYTHONPATH=. python main.py
```
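Here `PYTHONIOENCODING=utf-8` forces UTF-8 on Python's standard streams (useful when printing Chinese text on terminals with a different default encoding), and `PYTHONPATH=.` puts the repository root on the import path, presumably so that `main.py` can import the repository's own packages such as `modules`.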