WIP: Fix more pronunciations #504

Closed · 2 commits

@xatier xatier commented Jul 6, 2024

This is a follow-up to @ChiahongHong's #503.

While reviewing that PR, I noticed a pattern of inconsistent (and often incorrect) pronunciations within phrase groups like these:

節省 ㄐㄧㄝˊ ㄕㄥˇ
節省下 ㄐㄧㄝˊ ㄒㄧㄥˇ ㄒㄧㄚˋ
節省下來 ㄐㄧㄝˊ ㄒㄧㄥˇ ㄒㄧㄚˋ ㄌㄞˊ


著色 ㄓㄨㄛˊ ㄙㄜˋ
著色劑 ㄓㄜ˙ ㄙㄜˋ ㄐㄧˋ

遺傳學 ㄧˊ ㄔㄨㄢˊ ㄒㄩㄝˊ
遺傳學家 ㄧˊ ㄓㄨㄢˋ ㄒㄩㄝˊ ㄐㄧㄚ

Thanks to the earlier work of sorting the dictionary, these issues are easy to locate, because a phrase and its longer forms now sit close together. I wrote another script to find such cases.

#!/usr/bin/env python

DICT = "Source/Data/BPMFMappings.txt"

# How many following entries to compare each 2-character phrase against.
N = 30

with open(DICT, encoding="utf-8") as f:
    # Keep non-empty entries only; each one is "<phrase> <syllable> ...".
    lines = [line.strip() for line in f if line.strip()]

for i, line in enumerate(lines):
    t, *u = line.split()

    # exclude these
    if "一" in t or "不" in t:
        continue

    if len(t) != 2:
        continue

    # iterate through the next N - 1 phrases (slicing also covers the file's tail)
    for other in lines[i + 1 : i + N]:
        t1, *u1 = other.split()

        if len(t1) == 2:
            continue

        # report longer phrases that start with t but read its two characters differently
        if t1.startswith(t) and u1[:2] != u[:2]:
            print(f"{line}    {other}")
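As a quick illustration of the check, here is a minimal, self-contained sketch that runs on the sample entries from the description rather than on the full dictionary. The function name `find_prefix_mismatches` and its `window` parameter are made up for this example, and the 一/不 exclusion is left out for brevity.

```python
from typing import Iterable


def find_prefix_mismatches(entries: Iterable[str], window: int = 30) -> list[tuple[str, str]]:
    """Return (base, extension) pairs whose shared 2-character prefix is read differently."""
    rows = [e.split() for e in entries if e.strip()]
    mismatches = []
    for i, (base, *base_reading) in enumerate(rows):
        if len(base) != 2:
            continue
        # Only look at the entries that follow closely; the input is assumed sorted.
        for phrase, *reading in rows[i + 1 : i + window]:
            if len(phrase) > 2 and phrase.startswith(base) and reading[:2] != base_reading[:2]:
                mismatches.append((" ".join(rows[i]), " ".join([phrase, *reading])))
    return mismatches


# Sample entries copied from the PR description.
SAMPLE = [
    "節省 ㄐㄧㄝˊ ㄕㄥˇ",
    "節省下 ㄐㄧㄝˊ ㄒㄧㄥˇ ㄒㄧㄚˋ",
    "節省下來 ㄐㄧㄝˊ ㄒㄧㄥˇ ㄒㄧㄚˋ ㄌㄞˊ",
    "著色 ㄓㄨㄛˊ ㄙㄜˋ",
    "著色劑 ㄓㄜ˙ ㄙㄜˋ ㄐㄧˋ",
]

for base, extension in find_prefix_mismatches(SAMPLE):
    print(f"{base}    {extension}")
```

On this sample it reports three pairs: 節省 against 節省下 and 節省下來, and 著色 against 著色劑.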

I went through the ~1300 lines of reports and manually checked them against Moe dict and other dictionaries. In this PR, I have tried my best to make these phrases consistent and correct. Some pronunciations are commonly used even though they differ from the dictionaries (and are strictly incorrect); I preserved those cases to keep the input method usable.

Another interesting finding is that the following characters are often ambiguous, so I provided both available readings for them: 亞個得波播怎麼子露埔液雌


xatier commented Jul 6, 2024

@lukhnos please DO NOT merge this just yet, I'd like to rebase to master once #503 is merged.

@ChiahongHong please kindly help spot issues if you get a chance 🙏

Review link: 10c4f5e

@ChiahongHong commented

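I picked a few characters and words at random from your changes in 10c4f5e to test. Below are the entries that may need to change if we want to stay consistent.

Perhaps we can fix this in stages: first correct only the obviously wrong readings (such as 音樂 ㄧㄣ ㄌㄜˋ), then handle the colloquial / error-tolerant / standard readings later, character by character.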

# Accepted readings for each character or word under audit.
AUDIT = {
    '音樂': {'ㄧㄣ ㄩㄝˋ'},
    '波': {'ㄅㄛ', 'ㄆㄛ'},
    '亞洲': {'ㄧㄚˇ ㄓㄡ', 'ㄧㄚˋ ㄓㄡ'},
    '亞細亞': {'ㄧㄚˇ ㄒㄧˋ ㄧㄚˇ', 'ㄧㄚˋ ㄒㄧˋ ㄧㄚˋ'},
    '亞軍': {'ㄧㄚˇ ㄐㄩㄣ', 'ㄧㄚˋ ㄐㄩㄣ'},
    '倒出': {'ㄉㄠˇ ㄔㄨ', 'ㄉㄠˋ ㄔㄨ'},
    '液': {'ㄧˋ', 'ㄧㄝˋ'},
    '冠狀': {'ㄍㄨㄢ ㄓㄨㄤˋ', 'ㄍㄨㄢˋ ㄓㄨㄤˋ'},
    '鰻': {'ㄇㄢˊ', 'ㄇㄢˋ'},
    '黏膜': {'ㄋㄧㄢˊ ㄇㄛˊ', 'ㄋㄧㄢˊ ㄇㄛˋ'}
}

# Map each dictionary entry to the set of readings recorded for it.
data = {}

with open('BPMFMappings.txt', 'r', encoding='UTF-8') as f:
    for line in f:
        parts = line.strip().split(maxsplit=1)
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, bpmf = parts
        data.setdefault(word, set()).add(bpmf)

for char, audit_pronuns in AUDIT.items():
    first = True
    for word, bpmfs in data.items():
        if char not in word:
            continue

        # Slice out the syllables aligned with the first occurrence of `char`
        # (this relies on one syllable per character).
        index = word.index(char)
        pronuns = {' '.join(bpmf.split()[index:index + len(char)]) for bpmf in bpmfs}

        missing = sorted(audit_pronuns - pronuns)
        unexpected = sorted(pronuns - audit_pronuns)
        if not missing and not unexpected:
            continue

        # Print the group and entry headers before either kind of finding,
        # so an "Unexpected" line is never left without its entry.
        if first:
            print(f'\n## {char}')
            first = False
        print(f'  - {word}')
        if missing:
            print('    - Missing:    ', ', '.join(missing))
        if unexpected:
            print('    - Unexpected: ', ', '.join(unexpected))

音樂

  • 新力音樂
    • Missing: ㄧㄣ ㄩㄝˋ
    • Unexpected: ㄧㄣ ㄌㄜˋ
  • 玩音樂
    • Missing: ㄧㄣ ㄩㄝˋ
    • Unexpected: ㄧㄣ ㄌㄜˋ

波

  • 一波
    • Missing: ㄅㄛ
  • 一波三折
    • Missing: ㄅㄛ
  • 一波又起
    • Missing: ㄅㄛ
  • 一波數折
    • Missing: ㄅㄛ
  • 一波未平
    • Missing: ㄅㄛ
  • 中短波
    • Missing: ㄅㄛ
  • 五五波
    • Missing: ㄅㄛ
  • 仁波切
    • Missing: ㄆㄛ
  • 伊波拉
    • Missing: ㄅㄛ
  • 伊波拉病毒
    • Missing: ㄅㄛ
  • 元波肉圓
    • Missing: ㄅㄛ
  • 入射波
    • Missing: ㄅㄛ
  • 劉曉波
    • Missing: ㄅㄛ
  • 卡波西氏肉瘤
    • Missing: ㄆㄛ
  • 哈利波特
    • Missing: ㄅㄛ
  • 威廉波特
    • Missing: ㄅㄛ
  • 定風波
    • Missing: ㄅㄛ
  • 寒波盪漾
    • Missing: ㄅㄛ
  • 寧波
    • Missing: ㄅㄛ
  • 寧波市
    • Missing: ㄅㄛ
  • 寧波府
    • Missing: ㄅㄛ
  • 小波浪
    • Missing: ㄅㄛ
  • 嵐煙波影
    • Missing: ㄅㄛ
  • 希波克拉底
    • Missing: ㄅㄛ
  • 引起軒然大波
    • Missing: ㄆㄛ
  • 意外風波
    • Missing: ㄅㄛ
  • 新一波
    • Missing: ㄅㄛ
  • 暗送秋波
    • Missing: ㄅㄛ
  • 曼波
    • Missing: ㄅㄛ
  • 柯夢波丹
    • Missing: ㄅㄛ
  • 波多馬克河
    • Missing: ㄅㄛ
  • 波札那
    • Missing: ㄅㄛ
  • 波濤
    • Missing: ㄆㄛ
  • 波濤洶湧
    • Missing: ㄆㄛ
  • 波濤起伏
    • Missing: ㄆㄛ
  • 波特蘭
    • Missing: ㄅㄛ
  • 波特蘭拓荒者
    • Missing: ㄅㄛ
  • 波耳
    • Missing: ㄆㄛ
  • 波耳半徑
    • Missing: ㄆㄛ
  • 波耳模型
    • Missing: ㄆㄛ
  • 波耳氫模型
    • Missing: ㄆㄛ
  • 波茨坦宣言
    • Missing: ㄆㄛ
  • 波西尼亞
    • Missing: ㄅㄛ
  • 波霸
    • Missing: ㄆㄛ
  • 波麗露
    • Missing: ㄅㄛ
  • 海不揚波
    • Missing: ㄆㄛ
  • 海爾波普彗星
    • Missing: ㄅㄛ
  • 消波塊
    • Missing: ㄅㄛ
  • 濾波
    • Missing: ㄅㄛ
  • 濾波器
    • Missing: ㄅㄛ
  • 無線電波
    • Missing: ㄅㄛ
  • 煙波
    • Missing: ㄅㄛ
  • 物質波
    • Missing: ㄅㄛ
  • 王小波
    • Missing: ㄆㄛ
  • 王曉波
    • Missing: ㄅㄛ
  • 的黎波里
    • Missing: ㄅㄛ
  • 短波
    • Missing: ㄅㄛ
  • 碎波
    • Missing: ㄅㄛ
  • 碧波
    • Missing: ㄅㄛ
  • 碧波盪漾
    • Missing: ㄅㄛ
  • 秋波
    • Missing: ㄅㄛ
  • 羅斯卓波維奇
    • Missing: ㄅㄛ
  • 聲波
    • Missing: ㄅㄛ
  • 脈波
    • Missing: ㄅㄛ
  • 腦波
    • Missing: ㄅㄛ
  • 臨去秋波
    • Missing: ㄅㄛ
  • 般若波羅蜜
    • Missing: ㄅㄛ
  • 萬頃碧波
    • Missing: ㄅㄛ
  • 諧波
    • Missing: ㄅㄛ
  • 超短波
    • Missing: ㄅㄛ
  • 超音波
    • Missing: ㄅㄛ
  • 軒然大波
    • Missing: ㄅㄛ
  • 載波
    • Missing: ㄅㄛ
  • 逐波而去
    • Missing: ㄅㄛ
  • 醋海生波
    • Missing: ㄅㄛ
  • 防波堤
    • Missing: ㄅㄛ
  • 陳澄波
    • Missing: ㄅㄛ
  • 難波
    • Missing: ㄅㄛ
  • 電波
    • Missing: ㄅㄛ
  • 震波
    • Missing: ㄅㄛ
  • 風平波息
    • Missing: ㄅㄛ
  • 馬可波羅遊記
    • Missing: ㄅㄛ
  • 麥可波頓
    • Missing: ㄅㄛ

亞洲

  • 自由亞洲電台
    • Missing: ㄧㄚˋ ㄓㄡ

亞細亞

  • 小亞細亞
    • Missing: ㄧㄚˋ ㄒㄧˋ ㄧㄚˋ

亞軍

  • 冠亞軍
    • Missing: ㄧㄚˋ ㄐㄩㄣ

倒出

  • 倒出來
    • Missing: ㄉㄠˇ ㄔㄨ
  • 倒出去
    • Missing: ㄉㄠˋ ㄔㄨ

液

  • 人工血液
    • Missing: ㄧˋ
  • 修正液
    • Missing: ㄧㄝˋ
  • 卸妝液
    • Missing: ㄧㄝˋ
  • 廢液
    • Missing: ㄧㄝˋ
  • 懸浮液
    • Missing: ㄧㄝˋ
  • 懸液計
    • Missing: ㄧㄝˋ
  • 懸濁液
    • Missing: ㄧㄝˋ
  • 標準溶液
    • Missing: ㄧㄝˋ
  • 樹液
    • Missing: ㄧㄝˋ
  • 津液
    • Missing: ㄧˋ
  • 溶液
    • Missing: ㄧㄝˋ
  • 溶液聚合
    • Missing: ㄧㄝˋ
  • 漿液
    • Missing: ㄧㄝˋ
  • 濾液
    • Missing: ㄧㄝˋ
  • 瓊漿玉液
    • Missing: ㄧㄝˋ
  • 硫酸液
    • Missing: ㄧㄝˋ
  • 稀釋液
    • Missing: ㄧㄝˋ
  • 稀釋溶液
    • Missing: ㄧㄝˋ
  • 粉底液
    • Missing: ㄧㄝˋ
  • 粘液
    • Missing: ㄧㄝˋ
  • 細胞液
    • Missing: ㄧㄝˋ
  • 腦液
    • Missing: ㄧㄝˋ
  • 膠體溶液
    • Missing: ㄧㄝˋ
  • 菌液
    • Missing: ㄧㄝˋ
  • 葡萄糖液
    • Missing: ㄧㄝˋ
  • 蒸餾液
    • Missing: ㄧㄝˋ
  • 藥液
    • Missing: ㄧㄝˋ
  • 血液檢驗
    • Missing: ㄧˋ
  • 血液檢體
    • Missing: ㄧㄝˋ
  • 防蚊液
    • Missing: ㄧㄝˋ
  • 顯像液
    • Missing: ㄧㄝˋ
  • 顯影液
    • Missing: ㄧㄝˋ
  • 體液
    • Missing: ㄧㄝˋ
  • 鹼液
    • Missing: ㄧㄝˋ
  • 黏液
    • Missing: ㄧㄝˋ

冠狀

  • 冠狀動脈
    • Missing: ㄍㄨㄢˋ ㄓㄨㄤˋ
  • 冠狀病毒
    • Missing: ㄍㄨㄢ ㄓㄨㄤˋ

鰻

  • 八目鰻
    • Missing: ㄇㄢˋ
  • 大尾鱸鰻
    • Missing: ㄇㄢˋ
  • 歐洲鰻
    • Missing: ㄇㄢˋ
  • 海鰻
    • Missing: ㄇㄢˋ
  • 白鰻魚
    • Missing: ㄇㄢˋ
  • 美洲鰻
    • Missing: ㄇㄢˋ
  • 鰻鱺
    • Missing: ㄇㄢˋ

黏膜

  • 口腔黏膜
    • Missing: ㄋㄧㄢˊ ㄇㄛˋ
  • 鼻黏膜炎
    • Missing: ㄋㄧㄢˊ ㄇㄛˋ
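
One caveat about the audit script: it aligns syllables with characters through `word.index(char)`, so it only inspects the first occurrence of the audited item in each entry. A minimal sketch of checking every occurrence could look like the following. The helper name `readings_of` is hypothetical, and it assumes exactly one Bopomofo syllable per character, the same assumption the slice in the script relies on.

```python
def readings_of(target: str, word: str, bpmf: str) -> list[str]:
    """Return the reading of `target` at every position where it occurs in `word`."""
    syllables = bpmf.split()
    if len(syllables) != len(word):
        # Characters and syllables cannot be aligned one-to-one.
        return []
    found = []
    start = word.find(target)
    while start != -1:
        found.append(" ".join(syllables[start:start + len(target)]))
        start = word.find(target, start + 1)
    return found


# Entries copied from earlier in this thread.
print(readings_of("節省", "節省下來", "ㄐㄧㄝˊ ㄒㄧㄥˇ ㄒㄧㄚˋ ㄌㄞˊ"))
print(readings_of("亞", "亞細亞", "ㄧㄚˇ ㄒㄧˋ ㄧㄚˇ"))
```

The second call returns two readings, one per 亞, where the index-based slice would only see the first.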


xatier commented Jul 7, 2024

Good advice. I've fixed the ones you mentioned (亙剖那). We can then use this change set as a reference and open a series of PRs, one for each group.
