My two cents #1

christallire · 2024-05-09T13:38:47Z

Copilot suggested this repository while I was adding additional tokens (jamo) to my tokenizer.

Here's my two cents:

I'm afraid this is basically character-level encoding, or the same as one-hot encoding with every single Korean character in the vocabulary, because the embedding is already doing the same thing.
And three-hot encoding is exactly what the Unicode Korean table does, too.

mcognetta · 2024-05-09T14:06:51Z

Hi, thanks for the comments.

> Copilot suggested this repository while I was adding additional tokens (jamo) to my tokenizer.

Interesting, this is my first time having that happen. I am flattered.

> I'm afraid this is basically character-level encoding, or the same as one-hot encoding with every single Korean character in the vocabulary, because the embedding is already doing the same thing.

I think you are slightly misunderstanding what our work does. We are doing character (= syllable/음절) level modeling, but we are doing it in a way that reduces parameter counts by only using subcharacter/자모 features. You can read about it here: https://aclanthology.org/2023.eacl-main.172/.

On the encoding side there are roughly 3 options:

- One-hot syllable: requires ~11k embedding vectors
- One-hot jamo: requires ~70 embedding vectors, but 3x the sequence length
- Three-hot syllable: requires ~70 embedding vectors, but keeps the syllable-level sequence length

Ours is three-hot syllable, so we do produce a single syllable-level encoding for each syllable in the text, but it's made from a combination of the component jamo parts.
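To make the three-hot idea concrete, here is a minimal sketch that splits a precomposed Hangul syllable into its initial, medial, and final jamo indices using the standard Unicode arithmetic, then sets three bits in one vector. The function names (`three_hot_indices`, `three_hot_vector`) and the vector layout are assumptions made for illustration, not necessarily how the repository implements it. The 19 + 21 + 28 = 68 slots used here cover only the jamo themselves.

```python
# Constants for the precomposed Hangul Syllables block (U+AC00..U+D7A3).
S_BASE = 0xAC00
N_INITIAL, N_MEDIAL, N_FINAL = 19, 21, 28  # final index 0 means "no final consonant"


def three_hot_indices(syllable: str) -> tuple[int, int, int]:
    """Return (initial, medial, final) jamo indices for one Hangul syllable."""
    offset = ord(syllable) - S_BASE
    if not 0 <= offset < N_INITIAL * N_MEDIAL * N_FINAL:
        raise ValueError(f"{syllable!r} is not a precomposed Hangul syllable")
    initial, rest = divmod(offset, N_MEDIAL * N_FINAL)
    medial, final = divmod(rest, N_FINAL)
    return initial, medial, final


def three_hot_vector(syllable: str) -> list[int]:
    """Hypothetical layout: [initial slots | medial slots | final slots]."""
    initial, medial, final = three_hot_indices(syllable)
    vec = [0] * (N_INITIAL + N_MEDIAL + N_FINAL)
    vec[initial] = 1
    vec[N_INITIAL + medial] = 1
    vec[N_INITIAL + N_MEDIAL + final] = 1
    return vec


print(three_hot_indices("한"))      # (18, 0, 4)
print(sum(three_hot_vector("한")))  # 3
```

Each syllable still occupies one position in the sequence, but its representation is built from three small tables rather than one row out of 11,172.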

However, our work mainly focused on the output side, where there is a fourth option: independent three-hot syllable (https://koreascience.kr/article/CFKO201832073079068.pdf). We show that this one doesn't properly model syllables, and we propose conditional three-hot syllable decoding, which also requires only ~70 embedding vectors and outputs full syllables in one timestep.
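One way to read the difference between the two output schemes is as two factorizations of the syllable probability. This is a sketch of the distinction only: the symbols ($h$ for the model's hidden state at a timestep, $i$, $m$, $f$ for the initial, medial, and final jamo of syllable $s$) are notation assumed here, and the exact parameterization is in the paper.

```latex
% Independent three-hot: each jamo is predicted separately from the same state.
P_{\text{indep}}(s \mid h) = P(i \mid h)\, P(m \mid h)\, P(f \mid h)

% Conditional three-hot: chain rule over the jamo within one timestep.
P_{\text{cond}}(s \mid h) = P(i \mid h)\, P(m \mid h, i)\, P(f \mid h, i, m)
```

The second form is the chain rule and can represent any distribution over syllables, while the first assumes the three jamo are independent given $h$.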

So, to summarize, we are doing character-level encoding but with a reduced parameter count (11k -> 70 embedding vectors but no sequence length increase).

christallire changed the title ~~A question~~ My two cents May 9, 2024
