WIP: Fix more pronunciations #504
@lukhnos please DO NOT merge this just yet, I'd like to rebase to master once #503 is merged. @ChiahongHong please kindly help spot issues if you get a chance 🙏 Review link: 10c4f5e |
Below are a few characters and words I picked at random from your changes in 10c4f5e to test, showing what may still need to change if we want to stay consistent. Perhaps we could fix things in stages and start with only the clearly wrong readings (such as …

# Expected readings for each audited character or word.
AUDIT = {
    '音樂': {'ㄧㄣ ㄩㄝˋ'},
    '波': {'ㄅㄛ', 'ㄆㄛ'},
    '亞洲': {'ㄧㄚˇ ㄓㄡ', 'ㄧㄚˋ ㄓㄡ'},
    '亞細亞': {'ㄧㄚˇ ㄒㄧˋ ㄧㄚˇ', 'ㄧㄚˋ ㄒㄧˋ ㄧㄚˋ'},
    '亞軍': {'ㄧㄚˇ ㄐㄩㄣ', 'ㄧㄚˋ ㄐㄩㄣ'},
    '倒出': {'ㄉㄠˇ ㄔㄨ', 'ㄉㄠˋ ㄔㄨ'},
    '液': {'ㄧˋ', 'ㄧㄝˋ'},
    '冠狀': {'ㄍㄨㄢ ㄓㄨㄤˋ', 'ㄍㄨㄢˋ ㄓㄨㄤˋ'},
    '鰻': {'ㄇㄢˊ', 'ㄇㄢˋ'},
    '黏膜': {'ㄋㄧㄢˊ ㄇㄛˊ', 'ㄋㄧㄢˊ ㄇㄛˋ'}
}

# Map every word in the data file to the set of its readings.
data = dict()
with open('BPMFMappings.txt', 'r', encoding='UTF-8') as f:
    for line in f:
        parts = line.strip().split(maxsplit=1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, bpmf = parts
        data.setdefault(word, set()).add(bpmf)

for char, audit_pronuns in AUDIT.items():
    first = True
    for word, bpmfs in data.items():
        if char not in word:
            continue
        # Take the syllables that line up with the audited entry inside the word.
        index = word.index(char)
        pronuns = set()
        for bpmf in bpmfs:
            bpmf = bpmf.split()
            bpmf = bpmf[index:index+len(char)]
            bpmf = ' '.join(bpmf)
            pronuns.add(bpmf)
        # Report expected readings that the word lacks, then any extra ones.
        diff = sorted(audit_pronuns.difference(pronuns))
        if len(diff) > 0:
            if first:
                print(f'\n## {char}')
                first = False
            print(f' - {word}')
            print(' - Missing:', ', '.join(diff))
            diff = sorted(pronuns.difference(audit_pronuns))
            if len(diff) > 0:
                print(' - Unexpected:', ', '.join(diff))

音樂
波
亞洲
亞細亞
亞軍
倒出
液
冠狀
鰻
黏膜
|
Good advice. I've fixed the ones you mentioned (亙剖那). We can then use this change set as a reference and open a series of PRs, one for each group. |
This is a follow-up to @ChiahongHong's #503.
While reviewing that PR, I noticed a pattern of inconsistent (and often incorrect) pronunciations within phrase groups like this:
Thanks to the earlier work of sorting the dictionary, these issues are easy to locate. I wrote another script to find these cases.
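The script itself is not shown in this thread. As a rough sketch of the idea only, the snippet below collects every reading each character receives across all phrases and reports the characters that end up with more than one. The function names `parse_mappings` and `find_inconsistent_readings` are hypothetical, the sample lines are made up for illustration, and the line format (a word followed by space-separated syllables, one per character) is an assumption based on how the audit script in this conversation reads `BPMFMappings.txt`.

```python
from collections import defaultdict


def parse_mappings(lines):
    """Parse 'word syllable syllable ...' lines into (word, [syllables]) pairs."""
    entries = []
    for line in lines:
        parts = line.split()
        # Skip blank lines and entries whose syllable count does not match
        # the number of characters in the word.
        if len(parts) < 2 or len(parts[0]) != len(parts) - 1:
            continue
        entries.append((parts[0], parts[1:]))
    return entries


def find_inconsistent_readings(entries):
    """Return {char: {reading: [words]}} for characters with 2+ readings."""
    readings = defaultdict(lambda: defaultdict(list))
    for word, syllables in entries:
        for char, syllable in zip(word, syllables):
            readings[char][syllable].append(word)
    return {
        char: dict(by_reading)
        for char, by_reading in readings.items()
        if len(by_reading) > 1
    }


# Hypothetical sample lines, not taken from the real data file.
sample = [
    "亞洲 ㄧㄚˇ ㄓㄡ",
    "亞軍 ㄧㄚˋ ㄐㄩㄣ",
    "冠軍 ㄍㄨㄢˋ ㄐㄩㄣ",
]

report = find_inconsistent_readings(parse_mappings(sample))
for char, by_reading in report.items():
    print(f"## {char}")
    for reading, words in sorted(by_reading.items()):
        print(f" - {reading}: {', '.join(words)}")
```

In this sample only 亞 is reported, because it is read ㄧㄚˇ in one phrase and ㄧㄚˋ in another. A character with several legitimate readings would be flagged as well, so a report like this still needs the manual review described next.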
I went through the ~1300 lines of reports and checked them by hand against Moe dict and other dictionaries. In this PR, I have tried my best to make these phrases consistent and correct. Some (incorrect) pronunciations are commonly used but differ from the dictionaries; I preserved those cases to keep usability.
Another interesting finding is that the following characters are often ambiguous, so I provided both available forms for them:
亞個得波播怎麼子露埔液雌