My two cents #1

Open
christallire opened this issue May 9, 2024 · 1 comment
Comments

@christallire

Copilot suggested this repository while I was adding additional tokens (jamo) to my tokenizer.

Here's my two cents:

I'm afraid to say that this is basically character-level encoding, i.e. the same as one-hot encoding with every single Korean character in the vocabulary, because the embedding layer already does the same thing.
And three-hot encoding is exactly what the Unicode Korean table does, too.

christallire changed the title from "A question" to "My two cents" on May 9, 2024
@mcognetta (Owner) commented May 9, 2024

Hi, thanks for the comments.

> Copilot suggested this repository while I was adding additional tokens (jamo) to my tokenizer.

Interesting, this is my first time having that happen. I am flattered.

> I'm afraid to say that this is basically character-level encoding, i.e. the same as one-hot encoding with every single Korean character in the vocabulary, because the embedding layer already does the same thing.

I think you are slightly misunderstanding what our work does. We are doing character (= syllable/음절) level modeling, but we are doing it in a way that reduces parameter counts by only using subcharacter/자모 features. You can read about it here: https://aclanthology.org/2023.eacl-main.172/.

On the encoding side there are roughly 3 options:

- One-hot syllable: requires ~11k embedding vectors (there are 11,172 precomposed Hangul syllables)
- One-hot jamo: requires ~70 embedding vectors, but 3x sequence length
- Three-hot syllable: requires ~70 embedding vectors, with syllable-level sequence length
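
To make the three options concrete, here is a minimal sketch that encodes a two-syllable string each way. It relies only on the standard Unicode arithmetic for precomposed Hangul syllables (19 initials, 21 medials, 28 finals including "no final"). The ID layout for the jamo vocabulary and the use of a placeholder for a missing final are assumptions made for illustration, not something taken from the repository.

```python
# Unicode Hangul Syllables block: U+AC00..U+D7A3 (11,172 precomposed syllables).
HANGUL_BASE = 0xAC00
N_INITIAL, N_MEDIAL, N_FINAL = 19, 21, 28  # final index 0 means "no final consonant"


def decompose(syllable: str) -> tuple[int, int, int]:
    """Return (initial, medial, final) jamo indices of a precomposed syllable."""
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < N_INITIAL * N_MEDIAL * N_FINAL:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    initial, rest = divmod(offset, N_MEDIAL * N_FINAL)
    medial, final = divmod(rest, N_FINAL)
    return initial, medial, final


text = "한글"
triples = [decompose(ch) for ch in text]

# Option 1, one-hot syllable: one ID per syllable, vocabulary of 11,172.
one_hot_syllable = [ord(ch) - HANGUL_BASE for ch in text]

# Option 2, one-hot jamo: three IDs per syllable from a shared vocabulary of
# 19 + 21 + 28 = 68 (hypothetical layout: initials, then medials, then finals).
one_hot_jamo = [
    jamo_id
    for initial, medial, final in triples
    for jamo_id in (initial, N_INITIAL + medial, N_INITIAL + N_MEDIAL + final)
]

# Option 3, three-hot syllable: one (initial, medial, final) triple per syllable,
# so the sequence stays at syllable length while the vocabulary stays at 68.
three_hot_syllable = triples

print(len(one_hot_syllable), len(one_hot_jamo), len(three_hot_syllable))
```

The one-hot jamo sequence is three times as long as the other two, while only the one-hot syllable option needs the large vocabulary.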

Ours is three-hot syllable, so we do produce a single syllable-level encoding for each syllable in the text, but it's made from a combination of the component jamo parts.
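
As a rough sketch of what "a combination of the component jamo parts" could look like on the input side, the snippet below builds one vector per syllable from three small jamo tables. Summing the three vectors, the embedding width `DIM`, and the random initialisation are all assumptions for illustration; the comment above does not say how the parts are combined.

```python
import random

HANGUL_BASE = 0xAC00
N_INITIAL, N_MEDIAL, N_FINAL = 19, 21, 28  # final index 0 means "no final consonant"
DIM = 8  # hypothetical embedding width

rng = random.Random(0)


def make_table(rows: int) -> list[list[float]]:
    """Randomly initialised embedding table with `rows` vectors of size DIM."""
    return [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(rows)]


# Three small tables, 68 vectors in total, instead of one table with 11,172 rows.
initial_emb = make_table(N_INITIAL)
medial_emb = make_table(N_MEDIAL)
final_emb = make_table(N_FINAL)


def embed_syllable(syllable: str) -> list[float]:
    """One vector per syllable, built from its three jamo vectors (assumed: sum)."""
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < N_INITIAL * N_MEDIAL * N_FINAL:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    initial, rest = divmod(offset, N_MEDIAL * N_FINAL)
    medial, final = divmod(rest, N_FINAL)
    return [
        a + b + c
        for a, b, c in zip(initial_emb[initial], medial_emb[medial], final_emb[final])
    ]


sequence = [embed_syllable(ch) for ch in "한글"]
print(len(sequence), len(sequence[0]))  # one vector per syllable, each of width DIM
```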

However, our work mainly focused on the output side, where there is a fourth option: independent three-hot syllable (https://koreascience.kr/article/CFKO201832073079068.pdf). We show that this one doesn't properly model syllables, and propose conditional three-hot syllable decoding which also only requires ~70 embedding vectors and outputs full syllables in one timestep.
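
One way to see why an independent factorisation can fail to model syllables is a toy example. Suppose the model should output either 가 or 노, each with probability 0.5 (finals are left out for brevity). The sketch below compares the independent product p(initial) * p(medial) with a chain-rule factorisation p(initial) * p(medial | initial). The chain-rule order is an assumption used for illustration and is not necessarily how the paper's conditional decoder is parameterised.

```python
from collections import Counter
from itertools import product

# Toy target distribution over (initial, medial) pairs: 가 and 노, 0.5 each.
target = {("ㄱ", "ㅏ"): 0.5, ("ㄴ", "ㅗ"): 0.5}


def marginal(dist: dict, pos: int) -> Counter:
    """Marginal distribution of the jamo at position `pos`."""
    out = Counter()
    for jamo, p in dist.items():
        out[jamo[pos]] += p
    return out


p_initial = marginal(target, 0)
p_medial = marginal(target, 1)

# Independent three-hot style: each jamo slot is predicted on its own.
independent = {
    (i, m): p_initial[i] * p_medial[m] for i, m in product(p_initial, p_medial)
}


def p_medial_given(initial: str) -> dict:
    """p(medial | initial) derived from the target distribution."""
    return {m: p / p_initial[initial] for (i, m), p in target.items() if i == initial}


# Conditional (chain-rule) style: the medial is predicted given the initial.
conditional = {
    (i, m): p_initial[i] * pm
    for i in p_initial
    for m, pm in p_medial_given(i).items()
}

print(independent)  # mass leaks onto 고 and 나, which the target never produces
print(conditional)  # matches the target
```

The independent product assigns probability 0.25 to each of 가, 고, 나, and 노, so half of its mass goes to syllables the target never produces, while the conditional factorisation reproduces the target exactly.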

So, to summarize, we are doing character-level encoding but with a reduced parameter count (11k -> 70 embedding vectors but no sequence length increase).
