README_EN.md

About homographs (多音字)

  • Pattern one: idioms in which the pinyin changes for at most one character (0 or 1).
  • Pattern two: idioms in which the pinyin changes for two or more characters.
  • Exception pattern: any pattern not covered by the two above.
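
The split between pattern one and pattern two can be sketched as a count of the characters whose reading differs from the standard one. This is only an illustration: `STANDARD_READING`, `count_changed`, and `classify` are hypothetical names that do not come from the scripts in this repository, and the table values are sample data. Exception patterns are not covered here, because they depend on how phrases overlap rather than on a count.

```python
from typing import Dict, List

# Hypothetical sample table: the standard reading assumed for each character.
# The project's real data may differ.
STANDARD_READING: Dict[str, str] = {
    "占": "zhàn",
    "卜": "bo",
    "银": "yín",
    "行": "xíng",
}


def count_changed(phrase: str, readings: List[str]) -> int:
    """Count characters whose reading in this phrase is not the standard one."""
    if len(phrase) != len(readings):
        raise ValueError("one reading per character is required")
    return sum(
        1
        for char, reading in zip(phrase, readings)
        # Characters missing from the table are treated as unchanged.
        if STANDARD_READING.get(char, reading) != reading
    )


def classify(phrase: str, readings: List[str]) -> str:
    """Return 'pattern_one' for 0 or 1 changed characters, else 'pattern_two'."""
    return "pattern_one" if count_changed(phrase, readings) <= 1 else "pattern_two"
```

With this sample table, `classify("银行", ["yín", "háng"])` falls under pattern one (only 行 changes) and `classify("占卜", ["zhān", "bǔ"])` under pattern two (both characters change).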

File Structure

outputs
   ├── duoyinzi_pattern_one.txt          <- Generated by make_pattern_table.py
   ├── duoyinzi_pattern_two.json         <- Generated by make_pattern_table.py
   └── duoyinzi_exceptional_pattern.json <- Used only for exceptional patterns

Currently, duoyinzi_exceptional_pattern.json is generated manually. -> Generation location

.
├── phrase_of_exceptional_pattern.txt <- Collection of idioms containing exceptional replacement patterns (Editable)
├── phrase_of_pattern_one.txt         <- Collection of idioms where only 0-1 characters change in pinyin (Editable)
├── phrase_of_pattern_two.txt         <- Collection of idioms where 2 or more characters change in pinyin (Editable)
├── phrase_testcase.txt               <- Test cases used to verify that validate_phrase.py works correctly
└── scripts
    ├── check_exsit_duoyinsi_on_word.py
    ├── make_pattern_table.py
    ├── phrase.py
    ├── phrase_holder.py
    ├── pinyin_getter.py
    └── validate_phrase.py

Generation Procedure

# First, check the dictionary
$ python validate_phrase.py

# Generate pattern table
$ python make_pattern_table.py

Overview of make_pattern_table.py

flowchart TB
    classDef noteclass fill:#fff5ad,stroke:#decc93;

    id0["Are there duplicate words?<br/>validate_phrase.get_duplicate_word()"] -- yes --> id1[Remove duplicates]
    id0 -- no --> id2["Is there a pattern that affects other words?<br/>validate_phrase.get_duplicate_pattern_of_word()"]
    id2 -- yes --> id3["
        Destination is phrase_of_exceptional_pattern.txt

        As a rule, keep the shorter pattern.
        For example, keep '阿谀' rather than '胶阿谀'
    "]
    id2 -- no --> id4["Are 2 or more characters in the word replaced as homographs?<br/>validate_phrase.get_multiple_replacement_by_duoyinzi()"]
    id4 -- yes --> id5["
                        Destination is phrase_of_pattern_two.txt

                        Create patterns for context-dependent multiple replacements.
                        make_pattern_table.set_pattern_two()
                        
                        #Glyph names should be 'ss01'~'ss20'.
                        #ss00 is for glyphs without any Hanzi.
                        #ss01 is for standard pinyin.
                        #ss02 and beyond are for variant pinyin readings."]
    id4 -- no --> id6["
                        Destination is phrase_of_pattern_one.txt

                        Create patterns for the homograph characters to be replaced.
                        make_pattern_table.set_pattern_one()

                        ・When every character in the word is read with its standard pinyin (no variant reading):
                          Register the word under the first character found that has multiple readings, even though it is read with its standard pinyin here. In other words, first come, first served.
                          If the word consists only of characters with a single reading, exclude it from the pattern table.
                        ・When exactly one character in the word takes a variant reading:
                          Insert the word into the pattern of that Hanzi.

                        "]
    %% Since we can't use NOTE in flowcharts, we use nodes instead
    id3 -.- SEVLNOTE0["
        e.g.:

        [轴子] and [大轴子,压轴子]
        [着手] and [背着手]
        
        Exception patterns are described as follows in the calt feature:
        lookup calt {
          ignore sub uni80CC uni7740' uni624B;
          sub uni7740' uni624B by d;
        } calt;

        Thus,
        着手 -> d手
        背着手 -> 背着手 (unchanged, because of the ignore rule)
        so the two phrases are handled as separate patterns.
        "]:::noteclass
    id5 -.- SEVLNOTE1["
        e.g.:

        'lookup_table': {
          # Variant pinyin readings
          # The sequence of numbers should match the index order of marged-mapping-table.txt.
          'lookup_10': {
            '占' : '占.ss02',
            '卜' : '卜.ss02',
            '少' : '少.ss02',
            '更' : '更.ss02'
          }
        }"]:::noteclass
    id5 -.- SEVLNOTE2["
            e.g.:
            
            兴兴头头: xīng/xìng/tou/tóu,
            占卜:zhān/bǔ, 吐血:tù/xiě
            
            Example notation (for 2 or more replacements):
            lookup calt {
              sub A' lookup lookup_0 A' lookup lookup_0 F;
            } calt;

            lookup lookup_0 {
              sub A by X;
            } lookup_0;

            Thus, when two or more characters change,
            AAF -> XXF
            is applied by a single contextual rule.
        "]:::noteclass
    %% linkStyle 5 stroke-width:0px;

    style id3 text-align:left
    style id5 text-align:left
    style id6 text-align:left
    style SEVLNOTE0 text-align:left
    style SEVLNOTE1 text-align:left
    style SEVLNOTE2 text-align:left
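
The exception-pattern branch of the flowchart can be sketched in two steps: find a short phrase that appears inside a longer one with a different reading, then emit a `calt` lookup that excludes the longer context. This is a minimal sketch, not the code in `validate_phrase.py` or `make_pattern_table.py`; `glyph_name`, `find_overlaps`, and `exception_rule` are hypothetical helpers, and the readings are sample data. It also assumes the target character occurs only once in each phrase.

```python
from typing import Dict, List, Tuple


def glyph_name(char: str) -> str:
    """Return the 'uniXXXX' glyph name for a BMP character."""
    return f"uni{ord(char):04X}"


def find_overlaps(phrases: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return (short, long) pairs where short occurs in long with other readings."""
    pairs = []
    for short, short_readings in phrases.items():
        for long, long_readings in phrases.items():
            if short == long or short not in long:
                continue
            start = long.index(short)
            # Compare the readings of the shared characters only.
            if long_readings[start:start + len(short)] != short_readings:
                pairs.append((short, long))
    return pairs


def exception_rule(short: str, long: str, target: str, replacement: str) -> str:
    """Build a calt lookup that replaces `target` in `short` but not inside `long`."""

    def marked(phrase: str) -> str:
        # Mark the target glyph with an apostrophe, as in feature-file syntax.
        return " ".join(
            glyph_name(c) + ("'" if c == target else "") for c in phrase
        )

    return "\n".join([
        "lookup calt {",
        f"  ignore sub {marked(long)};",
        f"  sub {marked(short)} by {replacement};",
        "} calt;",
    ])


# Sample data (hypothetical readings table).
sample = {
    "着手": ["zhuó", "shǒu"],
    "背着手": ["bèi", "zhe", "shǒu"],
}
overlaps = find_overlaps(sample)
rule = exception_rule("着手", "背着手", target="着", replacement="d")
```

For the sample data, `overlaps` contains the single pair `("着手", "背着手")`, and `rule` reproduces the `lookup calt` block shown in the flowchart note for 着手 and 背着手.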