testing repo
- http://data.gdeltproject.org/gdeltv3/web/ngrams/LASTUPDATE.TXT
- http://data.gdeltproject.org/gdeltv3/web/ngrams/MASTERFILELIST.TXT
- http://data.gdeltproject.org/blog/2019-gfg-august-2019-ngrams/MASTER.LINGUISTIC.1GRAM.TXT.gz
- http://data.gdeltproject.org/blog/2019-gfg-august-2019-ngrams/MASTER.LINGUISTIC.2GRAM.TXT.gz
- http://data.gdeltproject.org/gdeltv3/geg_gcnlapi/MASTERFILELIST.TXT
- http://data.gdeltproject.org/gdeltv3/gfg/alpha/lastupdate.txt
- how Aspell works
- symspell
- An Overview of Fuzzy Name Matching Techniques
- Zipf's law
- https://core.ac.uk/download/pdf/22877794.pdf
- https://www.aclweb.org/anthology/W96-0106.pdf
- https://arxiv.org/pdf/cmp-lg/9606013.pdf
- https://www.degruyter.com/view/journals/cllt/14/1/article-p1.xml?language=en
- https://statweb.stanford.edu/~owen/courses/306a/ZipfAndGutenberg.pdf
- https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.12768
- https://www.unicode.org/Public/UCD/latest/ucdxml/ucd.all.flat.zip
- https://en.wikipedia.org/wiki/ISO_15924
- https://en.wikipedia.org/wiki/International_uniformity_of_braille_alphabets
- https://en.wikipedia.org/wiki/Tengwar
- https://github.com/unicode-org/cldr/blob/master/common/supplemental/supplementalData.xml
- https://unicode-org.github.io/cldr-staging/charts/37/supplemental/languages_and_scripts.html
- https://unicode-org.github.io/cldr-staging/charts/37/supplemental/scripts_and_languages.html
- https://en.wikipedia.org/wiki/List_of_ISO_639-2_codes
- mis, for "uncoded languages"
- mul, for "multiple languages"
- qaa-qtz, a range reserved for local use
- und, for "undetermined"
- zxx, for "no linguistic content; not applicable"
- tokenization?
- NFKD decompose before match?
- allow dropping of M* chars for match?
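A minimal sketch of the NFKD-then-drop-marks idea above, using only the stdlib (`normalize_for_match` is a made-up name):

```python
import unicodedata

def normalize_for_match(text):
    """NFKD-decompose, then drop mark characters (category M*) before matching."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed
                   if not unicodedata.category(c).startswith("M"))

print(normalize_for_match("café"))  # -> "cafe"
```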
- https://en.wiktionary.org/wiki/Category:Basic_word_lists_by_language
- use unicode map
- language -> script (+Zyyy)
- script -> chars
- languages (iso 639-2)
- scripts (iso 15924)
- pycountry?
- need script alias
- how to handle traditional vs simplified chinese
- check the unicode chars?
- clean ngrams using a proper word tokenizer (ignore numbers)
- ignore any words using wrong scripts
- outliers? loanwords?
- common english words
- common multilingual words
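A rough sketch of the tokenize-and-filter step above. Script membership is approximated by checking Unicode character names (e.g. 'LATIN SMALL LETTER A'), which is crude but needs no extra data files; `clean_tokens` is a made-up name:

```python
import re
import unicodedata

def clean_tokens(text, script="LATIN"):
    """Word-tokenize, ignoring numbers and words in the wrong script."""
    # letters only: word chars minus digits and underscore
    tokens = re.findall(r"[^\W\d_]+", text, re.UNICODE)

    def in_script(word):
        # every char's Unicode name must start with the script name
        return all(unicodedata.name(c, "").startswith(script) for c in word)

    return [t for t in tokens if in_script(t)]
```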
- get dictionaries for each language
- dump dictionaries from some free apks
- parse dictionaries
- rebuild models from clean corpora
- other corpora
- binary tree? based on ip-lookup? (ranges in sets)
- unicode max is 0x10FFFF
- masks:
- 0xfffff0
- 0xffffe0
- 0xffffc0
- 0xffff80
- 0xffff00
- 0xfffe00
- 0xfffc00
- 0xfff800 <- max 544 of these
- just use one set per lang / script and use `lru_cache(maxsize=0xFFFF)`
- char -> langs or char -> scripts?
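One way to realize the range-lookup idea without bitmask buckets: binary search over sorted codepoint ranges, memoized per character with `lru_cache` as suggested above. The `SCRIPT_RANGES` table here is a tiny illustrative excerpt; a real one would be generated from the UCD:

```python
from bisect import bisect_right
from functools import lru_cache

# Illustrative excerpt only -- generate the real table from UCD Scripts data.
SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latn"),
    (0x0061, 0x007A, "Latn"),
    (0x0370, 0x03FF, "Grek"),
    (0x0400, 0x04FF, "Cyrl"),
]
_STARTS = [r[0] for r in SCRIPT_RANGES]

@lru_cache(maxsize=0xFFFF)
def char_script(char):
    """Binary-search the range table; cache the result per character."""
    cp = ord(char)
    i = bisect_right(_STARTS, cp) - 1
    if i >= 0 and cp <= SCRIPT_RANGES[i][1]:
        return SCRIPT_RANGES[i][2]
    return "Zzzz"  # unknown script
```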
- language code, variation/dialect name
- eg. "japanese, romaji" or "english, deseret"
- script / charset (whitelist)
- (optional) word ngram freqs
- 1-gram at a minimum
- char ngram freqs (with start/end chars) (fallback)
- n-grams:
[word[i:i + n] for i in range(len(word) - n + 1)]
- if no words, just use ngrams
- (optional) char ngram freqs
- chars only, no spaces etc
- clean on load? or error?
- use the script as whitelist?
- if no chars, build from word freqs
- if no chars and no words, assume uniform distribution over all chars, but L* gets priority over M*
- some kind of smoothing where you specify the total population of ngrams
- chars only, no spaces etc
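The char-ngram-with-boundary-markers step above could look like this (`^`/`$` are arbitrary marker choices):

```python
def char_ngrams(word, n):
    """Character n-grams with explicit start/end markers."""
    padded = "^" + word + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("cat", 2))  # -> ['^c', 'ca', 'at', 't$']
```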
- kenlm?
- (decoder only)
- pip install https://github.com/kpu/kenlm/archive/master.zip
- example.py
- nltk?
- mimic cld2 cleanup
- expand HTML entities
- eg. `&amp;` -> `&`
- delete digits
- delete punctuation
- delete all html tags
- eg. `<br>`
- removing repetitive sequences/words that would otherwise skew the scoring, such as jpg in foo.jpg bar.jpg baz.jpg
- removing web-specific words that convey almost no language information, such as page, link, click, td, tr, copyright, wikipedia, http.
- more cleanup
- emails, urls, twitter handles, hashtags
- common tech terms (pdf, jpg, ppt, docx, htm, href)
- common entities (facebook, instagram, chrome, twitter, wiki)
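A rough sketch of that cleanup pass; the regexes and word lists are illustrative, not exhaustive:

```python
import re

# Cleanup patterns for the items above (rough approximations).
PATTERNS = [
    re.compile(r"\S+@\S+"),             # emails
    re.compile(r"https?://\S+"),        # urls
    re.compile(r"[@#]\w+"),             # twitter handles, hashtags
    re.compile(r"\b(?:pdf|jpg|ppt|docx|htm|href)\b", re.IGNORECASE),
    re.compile(r"\b(?:facebook|instagram|chrome|twitter|wiki)\b", re.IGNORECASE),
]

def strip_noise(text):
    for pat in PATTERNS:
        text = pat.sub(" ", text)
    return " ".join(text.split())
```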
- filter
- by script
- remove 1-char words
- remove common english words
- but keep most common vernacular words (whitelist / dictionary)?
- remove low-count word ngrams
- count char ngrams
- dedupe repeated chars?
- hello -> helo <- hellloooo
- arXiv:1608.03030 -> "for sequences such as 'hahahaha...' or 'arghhhhh...' we restricted any sequence of repeating characters to at most five repetitions, where the repeating pattern can be from one to four characters"
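Both dedupe variants above can be sketched with regexes: `squash_repeats` follows the arXiv:1608.03030 heuristic (a repeating pattern of 1-4 chars capped at five repetitions), while `dedupe_chars` is the simpler single-char collapse:

```python
import re

def squash_repeats(text, max_repeats=5):
    """Cap any repeating 1-4 char pattern at `max_repeats` copies."""
    # match a pattern repeated more than max_repeats times in a row
    pattern = re.compile(r"(.{1,4}?)\1{%d,}" % max_repeats)
    return pattern.sub(lambda m: m.group(1) * max_repeats, text)

def dedupe_chars(word):
    """Collapse runs of the same char: hello -> helo, hellloooo -> helo."""
    return re.sub(r"(.)\1+", r"\1", word)
```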
Several embellishments improve the basic algorithm:
- additional scoring of some sequences of two CJK letters or eight other letters
- scoring some words and word pairs that are distinctive within sets of statistically-close languages, such as {Malay, Indonesian} or {Spanish, Portuguese, Galician}