testing repo
- http://data.gdeltproject.org/gdeltv3/web/ngrams/LASTUPDATE.TXT
- http://data.gdeltproject.org/gdeltv3/web/ngrams/MASTERFILELIST.TXT
- http://data.gdeltproject.org/blog/2019-gfg-august-2019-ngrams/MASTER.LINGUISTIC.1GRAM.TXT.gz
- http://data.gdeltproject.org/blog/2019-gfg-august-2019-ngrams/MASTER.LINGUISTIC.2GRAM.TXT.gz
- http://data.gdeltproject.org/gdeltv3/geg_gcnlapi/MASTERFILELIST.TXT
- http://data.gdeltproject.org/gdeltv3/gfg/alpha/lastupdate.txt
- how Aspell works
- symspell
- An Overview of Fuzzy Name Matching Techniques
- Zipf's law
- https://core.ac.uk/download/pdf/22877794.pdf
- https://www.aclweb.org/anthology/W96-0106.pdf
- https://arxiv.org/pdf/cmp-lg/9606013.pdf
- https://www.degruyter.com/view/journals/cllt/14/1/article-p1.xml?language=en
- https://statweb.stanford.edu/~owen/courses/306a/ZipfAndGutenberg.pdf
- https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.12768
- https://www.unicode.org/Public/UCD/latest/ucdxml/ucd.all.flat.zip
- https://en.wikipedia.org/wiki/ISO_15924
- https://en.wikipedia.org/wiki/International_uniformity_of_braille_alphabets
- https://en.wikipedia.org/wiki/Tengwar
- https://github.com/unicode-org/cldr/blob/master/common/supplemental/supplementalData.xml
- https://unicode-org.github.io/cldr-staging/charts/37/supplemental/languages_and_scripts.html
- https://unicode-org.github.io/cldr-staging/charts/37/supplemental/scripts_and_languages.html
- https://en.wikipedia.org/wiki/List_of_ISO_639-2_codes
- mis, for "uncoded languages"
- mul, for "multiple languages"
- qaa-qtz, a range reserved for local use
- und, for "undetermined"
- zxx, for "no linguistic content; not applicable"
- tokenization?
- NFKD decompose before match?
- allow dropping of M* chars for match?
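A minimal sketch of the NFKD-then-drop-marks idea above, using only the stdlib (`normalize_for_match` is a made-up name):

```python
import unicodedata

def normalize_for_match(text):
    """NFKD-decompose, then drop mark characters (category M*) before matching."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed
                   if not unicodedata.category(c).startswith("M"))

print(normalize_for_match("café"))  # -> "cafe"
```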
- https://en.wiktionary.org/wiki/Category:Basic_word_lists_by_language
- use unicode map
- language -> script (+Zyyy)
- script -> chars
- languages (iso 639-2)
- scripts (iso 15924)
- pycountry?
- need script alias
- how to handle traditional vs simplified chinese
- check the unicode chars?
- clean ngrams using a proper word tokenizer (ignore numbers)
- ignore any words using wrong scripts
- outliers? loanwords?
- common english words
- common multilingual words
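A rough sketch of the tokenize-and-filter step above. Script membership is approximated by checking Unicode character names (e.g. 'LATIN SMALL LETTER A'), which is crude but needs no extra data files; `clean_tokens` is a made-up name:

```python
import re
import unicodedata

def clean_tokens(text, script="LATIN"):
    """Word-tokenize, ignoring numbers and words in the wrong script."""
    # letters only: word chars minus digits and underscore
    tokens = re.findall(r"[^\W\d_]+", text, re.UNICODE)

    def in_script(word):
        # every char's Unicode name must start with the script name
        return all(unicodedata.name(c, "").startswith(script) for c in word)

    return [t for t in tokens if in_script(t)]
```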
- get dictionaries for each language
- dump dictionaries from some free apks
- parse dictionaries
- rebuild models from clean corpora
- other corpora
- binary tree? based on ip-lookup? (ranges in sets)
- unicode max is 0x10FFFF
- masks:
- 0xfffff0
- 0xffffe0
- 0xffffc0
- 0xffff80
- 0xffff00
- 0xfffe00
- 0xfffc00
- 0xfff800 <- max 544 of these
- just use one set per lang / script and use `lru_cache(maxsize=0xFFFF)`
- char -> langs or char -> scripts?
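One way to realize the range-lookup idea without bitmask buckets: binary search over sorted codepoint ranges, memoized per character with `lru_cache` as suggested above. The `SCRIPT_RANGES` table here is a tiny illustrative excerpt; a real one would be generated from the UCD:

```python
from bisect import bisect_right
from functools import lru_cache

# Illustrative excerpt only -- generate the real table from UCD Scripts data.
SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latn"),
    (0x0061, 0x007A, "Latn"),
    (0x0370, 0x03FF, "Grek"),
    (0x0400, 0x04FF, "Cyrl"),
]
_STARTS = [r[0] for r in SCRIPT_RANGES]

@lru_cache(maxsize=0xFFFF)
def char_script(char):
    """Binary-search the range table; cache the result per character."""
    cp = ord(char)
    i = bisect_right(_STARTS, cp) - 1
    if i >= 0 and cp <= SCRIPT_RANGES[i][1]:
        return SCRIPT_RANGES[i][2]
    return "Zzzz"  # unknown script
```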
- language code, variation/dialect name
- eg. "japanese, romaji" or "english, deseret"
- script / charset (whitelist)
- (optional) word ngram freqs
- 1-gram at a minimum
- char ngram freqs (with start/end chars) (fallback)
- n-grams:
[word[i:i + n] for i in range(len(word) - n + 1)]
- if no words, just use ngrams
- (optional) char ngram freqs
- chars only, no spaces etc
- clean on load? or error?
- use the script as whitelist?
- if no chars, build from word freqs
- if no chars and no words, assume uniform distribution over all chars, but L* gets priority over M*
- some kind of smoothing where you specify the total population of ngrams
- chars only, no spaces etc
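The char-ngram-with-boundary-markers step above could look like this (`^`/`$` are arbitrary marker choices):

```python
def char_ngrams(word, n):
    """Character n-grams with explicit start/end markers."""
    padded = "^" + word + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("cat", 2))  # -> ['^c', 'ca', 'at', 't$']
```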
- kenlm?
- (decoder only)
- pip install https://github.com/kpu/kenlm/archive/master.zip
- example.py
- nltk?
- mimic cld2 cleanup
- expand HTML entities
- eg. `&amp;` -> `&`
- delete digits
- delete punctuation
- delete all html tags
- eg. `<br>`
- removing repetitive sequences/words that would otherwise skew the scoring, such as jpg in foo.jpg bar.jpg baz.jpg
- removing web-specific words that convey almost no language information, such as page, link, click, td, tr, copyright, wikipedia, http.
- more cleanup
- emails, urls, twitter handles, hashtags
- common tech terms (pdf, jpg, ppt, docx, htm, href)
- common entities (facebook, instagram, chrome, twitter, wiki)
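A rough sketch of that cleanup pass; the regexes and word lists are illustrative, not exhaustive:

```python
import re

# Cleanup patterns for the items above (rough approximations).
PATTERNS = [
    re.compile(r"\S+@\S+"),             # emails
    re.compile(r"https?://\S+"),        # urls
    re.compile(r"[@#]\w+"),             # twitter handles, hashtags
    re.compile(r"\b(?:pdf|jpg|ppt|docx|htm|href)\b", re.IGNORECASE),
    re.compile(r"\b(?:facebook|instagram|chrome|twitter|wiki)\b", re.IGNORECASE),
]

def strip_noise(text):
    for pat in PATTERNS:
        text = pat.sub(" ", text)
    return " ".join(text.split())
```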
- filter
- by script
- remove 1-char words
- remove common english words
- but keep most common vernacular words (whitelist / dictionary)?
- remove low-count word ngrams
- count char ngrams
- dedupe repeated chars?
- hello -> helo <- hellloooo
- arXiv:1608.03030 -> "for sequences such as 'hahahaha...' or 'arghhhhh...' we restricted any sequence of repeating characters to at most five repetitions, where the repeating pattern can be from one to four characters"
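Both dedupe variants above can be sketched with regexes: `squash_repeats` follows the arXiv:1608.03030 heuristic (a repeating pattern of 1-4 chars capped at five repetitions), while `dedupe_chars` is the simpler single-char collapse:

```python
import re

def squash_repeats(text, max_repeats=5):
    """Cap any repeating 1-4 char pattern at `max_repeats` copies."""
    # match a pattern repeated more than max_repeats times in a row
    pattern = re.compile(r"(.{1,4}?)\1{%d,}" % max_repeats)
    return pattern.sub(lambda m: m.group(1) * max_repeats, text)

def dedupe_chars(word):
    """Collapse runs of the same char: hello -> helo, hellloooo -> helo."""
    return re.sub(r"(.)\1+", r"\1", word)
```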
Several embellishments improve the basic algorithm:
- additional scoring of some sequences of two CJK letters or eight other letters
- scoring some words and word pairs that are distinctive within sets of statistically-close languages, such as {Malay, Indonesian} or {Spanish, Portuguese, Galician}