My two cents #1
Hi, thanks for the comments.
Interesting, this is my first time having that happen. I am flattered.
I think you are slightly misunderstanding what our work does. We are doing character (= syllable/음절) level modeling, but in a way that reduces parameter counts by using only subcharacter/자모 features. You can read about it here: https://aclanthology.org/2023.eacl-main.172/.

On the encoding side there are roughly three options:

- One-hot syllable: requires ~11k embedding vectors.
- Ours, three-hot syllable: we still produce a single syllable-level encoding for each syllable in the text, but it's made from a combination of the component jamo parts.

However, our work mainly focused on the output side, where there is a fourth option: independent three-hot syllable (https://koreascience.kr/article/CFKO201832073079068.pdf). We show that this one doesn't properly model syllables, and we propose conditional three-hot syllable decoding, which also requires only ~70 embedding vectors and outputs a full syllable in one timestep.

So, to summarize, we are doing character-level encoding, but with a reduced parameter count (11k -> ~70 embedding vectors) and no increase in sequence length.
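For readers unfamiliar with the decomposition, here is a minimal sketch (my own illustration, not the authors' implementation) of how a three-hot syllable embedding can be built from the standard Unicode Hangul factorization. The embedding dimension and random tables are placeholders:

```python
# Each precomposed Hangul syllable U+AC00..U+D7A3 factors as
#     code_point = 0xAC00 + (initial * 21 + medial) * 28 + final
# with 19 initials, 21 medials, and 28 finals (including "no final"),
# so ~68 small jamo embedding vectors can replace ~11,172 syllable embeddings.
import random

random.seed(0)
DIM = 8  # embedding dimension, chosen arbitrarily for the demo

E_initial = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(19)]
E_medial  = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(21)]
E_final   = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(28)]

def decompose(syllable: str) -> tuple[int, int, int]:
    """Split one precomposed Hangul syllable into (initial, medial, final) indices."""
    code = ord(syllable) - 0xAC00
    if not 0 <= code < 11172:
        raise ValueError("not a precomposed Hangul syllable")
    return code // (21 * 28), (code // 28) % 21, code % 28

def embed(syllable: str) -> list[float]:
    """Three-hot embedding: sum of the three component jamo vectors."""
    i, m, f = decompose(syllable)
    return [a + b + c for a, b, c in zip(E_initial[i], E_medial[m], E_final[f])]

print(decompose("한"))  # (18, 0, 4): ㅎ + ㅏ + ㄴ
```

This is only the encoding side; the conditional decoding the paper proposes for the output side is not shown here.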
Copilot suggested this repository while I was adding additional tokens (James) to my tokenizer.
Here's my two cents:
I'm afraid this is basically character-level encoding, or the same as one-hot encoding with every single Korean character in the vocabulary, because the embedding layer is already doing the same thing.
And three-hot encoding is exactly what the Unicode Korean table does, too.