
Support empty alphabet, for simple CJK word segmentation #75

Open
unhammer opened this issue Oct 28, 2019 · 11 comments

@unhammer
Member

Before 944ed25 / #52 it was possible to use monodix files with an empty <alphabet> in order to segment into all known analyses (presumably symbols without analyses were output as blanks). But after that change, this is no longer possible.

See 944ed25#commitcomment-35679780 for test cases for Chinese/Japanese/Korean.

Maybe the iswalnum test could be turned off by a flag, e.g. lt-proc --no-implicit-alphabet?

unhammer referenced this issue Oct 28, 2019: "Consider alphanumeric characters to be part of the vocabulary" (Solves #45)
@TinoDidriksen
Member

TinoDidriksen commented Oct 28, 2019

Surely this is as trivial as adding an alphabetic_chars.empty() check to the condition.
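
A minimal sketch of that idea, written as a free-standing function for illustration (the real check presumably lives inside the processor; alphabetic_chars and the iswalnum test come from this thread, everything else here is an assumption, not the actual lttoolbox source):

#include <cwctype>  // std::iswalnum
#include <set>

// Hypothetical sketch: a character counts as alphabetic (and is therefore
// glued into an unknown word) if it is in the dictionary's <alphabet>, or,
// only when that alphabet is non-empty, if iswalnum() says it is alphanumeric.
// With <alphabet></alphabet> nothing is implicitly alphabetic, so input would
// again be segmented symbol by symbol, as it was before #52.
bool is_alphabetic(wchar_t c, const std::set<wchar_t>& alphabetic_chars)
{
  return alphabetic_chars.count(c) > 0
         || (!alphabetic_chars.empty() && std::iswalnum(static_cast<wint_t>(c)));
}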

@unhammer
Member Author

unhammer commented Oct 28, 2019

What if someone wants only some chars to be unknown-tokenizable?

@TinoDidriksen
Member

I guess. I say this should be an opt-out, then. The default should be to have as much as possible in the alphabet, and people can then opt out with something like <alphabet verbatim="true">.

@unhammer
Member Author

Definitely opt-out, which is why I suggested --no-implicit-alphabet, though an attribute would be great too. However, an attribute would require a change to the binary format, wouldn't it? (If the iswalnum check is in lt-proc, not lt-comp.)

@TinoDidriksen
Member

The last binary break prepared for this eventuality: https://github.com/apertium/lttoolbox/blob/master/lttoolbox/compression.h#L29 - we can add features without breaking existing files. But yeah, a cmdline flag for now would work.

@ftyers
Member

ftyers commented Oct 28, 2019

Regarding #52, isn't this what the inconditional section is for?

@unhammer
Member Author

oh yeah :) @Fred-Git-Hub ↑ would this cover your use-case? With

<?xml version="1.0" encoding="UTF-8"?>
<dictionary>

   <alphabet>
   </alphabet>

   <sdefs>
      <sdef n="noun"/>
      <sdef n="verb"/>
   </sdefs>

   <section id="main" type="inconditional">
      <e><p><l>我</l><r>我<s n="noun"/></r></p></e>
      <e><p><l>爱</l><r>爱<s n="verb"/></r></p></e>
      <e><p><l>你</l><r>你<s n="noun"/></r></p></e>
   </section>

</dictionary>

I get

$ echo "我爱你" | lt-proc test.bin
^我/我<noun>$^爱/爱<verb>$^你/你<noun>$

(See http://wiki.apertium.org/wiki/Inconditional#inconditional for more info.)

@unhammer
Member Author

unhammer commented Oct 28, 2019

well, the problem is that anything without an analysis in inconditional would turn what follows into one big unknown:

$ echo "熊猫 爱你" |lt-proc test.bin   # space after the bear:
^熊猫/*熊猫$ ^爱/爱<verb>$^你/你<noun>$
$ echo "熊猫爱你" |lt-proc test.bin    # no space, big unknown:
^熊猫爱你/*熊猫爱你$

so then you'd have to make sure to put every symbol you might expect to appear before other symbols into inconditional, including foreign ones like a and b.

@ftyers
Member

ftyers commented Oct 28, 2019

Aha, got it @unhammer, that makes sense. In general I think that in order to deal with this properly we need (1) weights in the lexicon, and (2) a special function of lttoolbox that does segmentation... maybe something like the compounding functionality.

@unhammer
Member Author

unhammer commented Oct 28, 2019

Yeah, I do have the feeling plain LRLM (left-to-right longest match) should eventually hit something it can't handle, but I wonder how far you can get with what @Fred-Git-Hub had going (if the language is mostly single-character words, it should be possible without any new features).

Languages like Thai would need something more, but the current weights and compounding features don't look at context – wouldn't context be needed? Even the simple Norwegian case of ^3./3<adj><ord>/3<num>+.<sent>$ can't be solved without looking at words that are not part of the longest match of any of the analyses (choosing between the ordinal reading and the numeral-plus-sentence-final-period reading depends on what follows).

@ftyers
Member

ftyers commented Oct 28, 2019

Yeah, either you'd be stuck with a unigram model or you'd need to incorporate n-gram information somehow.
