- Pattern one: the pinyin changes for at most one character (0 or 1) in an idiom.
- Pattern two: the pinyin changes for two or more characters in an idiom.
- Exceptional pattern: any pattern not covered by the two above.
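
The distinction between pattern one and pattern two can be sketched in a few lines of Python. This is only an illustration, not the logic in `make_pattern_table.py`: the `STANDARD_READING` table, the helper names, and the sample readings are all assumptions made for the example. Exceptional patterns are not covered, because they depend on how a phrase overlaps with other phrases rather than on a count.

```python
# Hypothetical table of "standard" readings; the values are illustrative only.
STANDARD_READING = {
    "吐": "tǔ",
    "血": "xuè",
    "银": "yín",
    "行": "xíng",
}


def count_changed(phrase, readings):
    """Count characters whose reading differs from the standard reading."""
    if len(phrase) != len(readings):
        raise ValueError("one reading is required per character")
    return sum(
        1
        for char, reading in zip(phrase, readings)
        # Characters missing from the table are treated as unchanged.
        if STANDARD_READING.get(char, reading) != reading
    )


def classify(phrase, readings):
    """Return 'pattern_two' for 2+ changed characters, else 'pattern_one'."""
    return "pattern_two" if count_changed(phrase, readings) >= 2 else "pattern_one"


print(classify("吐血", ["tù", "xiě"]))    # both characters change
print(classify("银行", ["yín", "háng"]))  # only the second character changes
```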
outputs
├── duoyinzi_pattern_one.txt <- Generated by make_pattern_table.py
├── duoyinzi_pattern_two.json <- Generated by make_pattern_table.py
└── duoyinzi_exceptional_pattern.json <- Used only for exceptional patterns
Currently, duoyinzi_exceptional_pattern.json is generated manually. -> Generation location
.
├── phrase_of_exceptional_pattern.txt <- Collection of idioms containing exceptional replacement patterns (Editable)
├── phrase_of_pattern_one.txt <- Collection of idioms where only 0-1 characters change in pinyin (Editable)
├── phrase_of_pattern_two.txt <- Collection of idioms where 2 or more characters change in pinyin (Editable)
├── phrase_testcase.txt <- Test cases used to verify that validate_phrase.py works correctly
└── scripts
├── check_exsit_duoyinsi_on_word.py
├── make_pattern_table.py
├── phrase.py
├── phrase_holder.py
├── pinyin_getter.py
└── validate_phrase.py
# First, check the dictionary
$ python validate_phrase.py
# Generate pattern table
$ python make_pattern_table.py
flowchart TB
classDef noteclass fill:#fff5ad,stroke:#decc93;
id0["Is there a word duplication?<br/>validate_phrase.get_duplicate_word()"] -- yes --> id1[Remove duplicates]
id0 -- no --> id2["Is there a pattern that affects other words?<br/>validate_phrase.get_duplicate_pattern_of_word()"]
id2 -- yes --> id3["
Write the phrase to phrase_of_exceptional_pattern.txt
In general, prefer the shorter pattern.
For example, keep '阿谀' rather than '胶阿谀'
"]
id2 -- no --> id4["Are 2 or more characters within a word replaced as homographs?<br/>validate_phrase.get_multiple_replacement_by_duoyinzi()"]
id4 -- yes --> id5["
Destination is phrase_of_pattern_two.txt
Create patterns for context-dependent multiple replacements.
make_pattern_table.set_pattern_two()
#Glyph names should be 'ss01'~'ss20'.
#ss00 is for glyphs without any Hanzi.
#ss01 is for standard pinyin.
#ss02 and beyond are for variant pinyin readings."]
id4 -- no --> id6["
Destination is phrase_of_pattern_one.txt
Create patterns for the homograph characters to be replaced.
make_pattern_table.set_pattern_one()
・When every character in the word uses its standard pinyin (no variant reading):
Register the word under the first character found that has multiple readings (read with its standard pinyin here); it is first-come, first-served.
If the word consists only of characters with a single reading, exclude it from the pattern table.
・When exactly one character in the word is a homograph:
Insert the word into the pattern of that Hanzi.
"]
%% Since we can't use NOTE in flowcharts, we use nodes instead
id3 -.- SEVLNOTE0["
e.g.:
[轴子] and [大轴子,压轴子]
[着手] and [背着手]
Exception patterns are described as follows in the calt feature:
lookup calt {
ignore sub uni80CC uni7740' uni624B;
sub uni7740' uni624B by d;
} calt;
Thus,
着手 -> d手
背着手 -> 背着手 (left unchanged by the ignore rule)
so the two phrases are handled by separate patterns.
"]:::noteclass
id5 -.- SEVLNOTE1["
e.g.:
'lookup_table': {
# Variant pinyin readings
# The sequence of numbers should match the index order of marged-mapping-table.txt.
'lookup_10': {
'占' : '占.ss02',
'卜' : '卜.ss02',
'少' : '少.ss02',
'更' : '更.ss02'
}
}"]:::noteclass
id5 -.- SEVLNOTE2["
e.g.:
兴兴头头: xīng/xìng/tou/tóu,
占卜:zhān/bǔ, 吐血:tù/xiě
Example notation (for 2 or more replacements):
lookup calt {
sub A' lookup lookup_0 A' lookup lookup_0 F;
} calt;
lookup lookup_0 {
sub A by X;
} lookup_0;
Thus,
AAF -> XXF
This is how the replacement works when two or more characters change.
"]:::noteclass
%% linkStyle 5 stroke-width:0px;
style id3 text-align:left
style id5 text-align:left
style id6 text-align:left
style SEVLNOTE0 text-align:left
style SEVLNOTE1 text-align:left
style SEVLNOTE2 text-align:left
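
The routing in the flowchart above can be summarized with a short Python sketch. It is a simplification under stated assumptions, not the code in `validate_phrase.py`: the function names are hypothetical, "affects other words" is approximated by a plain substring check, duplicates are assumed to be removed already, and the "prefer the shorter pattern" refinement is not modeled.

```python
EXCEPTIONAL = "phrase_of_exceptional_pattern.txt"
PATTERN_TWO = "phrase_of_pattern_two.txt"
PATTERN_ONE = "phrase_of_pattern_one.txt"


def overlaps_other_phrase(phrase, all_phrases):
    """True if the phrase contains, or is contained in, a different phrase."""
    return any(
        other != phrase and (phrase in other or other in phrase)
        for other in all_phrases
    )


def choose_destination(phrase, changed_count, all_phrases):
    """Pick the phrase file, following the order of checks in the flowchart.

    changed_count is the number of characters whose pinyin is replaced;
    it is supplied by the caller in this sketch.
    """
    if overlaps_other_phrase(phrase, all_phrases):
        return EXCEPTIONAL
    if changed_count >= 2:
        return PATTERN_TWO
    return PATTERN_ONE


# Sample data for illustration only.
phrases = ["着手", "背着手", "吐血", "银行"]
for phrase, changed in [("着手", 1), ("吐血", 2), ("银行", 1)]:
    print(phrase, "->", choose_destination(phrase, changed, phrases))
```

The overlap check runs first so that a phrase such as 着手, which is embedded in 背着手, is never placed in an ordinary pattern file where its rule could affect the longer phrase.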