pmhfst tokeniser inconsistently tokenises hyphen minus #28
Comments
How are you running |
Thanks. But that can’t be how you ran |
echo 'TEXT' | hfst-tokenise -S tools/...pmhfst
There are a lot of quirks with tokenization. For Greenlandic we have a helper that wraps around hfst-tokenize and smooths out the quirks, e.g. https://github.com/giellalt/lang-kal/blob/main/tools/shellscripts/kal-tokenise.in#L472 for dashes. Might be a source of inspiration.
@flammie see @TinoDidriksen's comment above, re what we talked about earlier today about having a look at improving tokenisation and text analysis.
Hmm, I know there are a lot of corner cases with tokenisation, some language-specific, others more generic. I'll try to make a list here then:
Actually, I think I'm going to make a test suite: this regresses quite easily and stays unnoticed for a long time...
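A regression suite of the kind mentioned above could be sketched roughly as follows. This is only an illustration: `run_cases` and `CASES` are hypothetical names, `str.split` stands in for a real call to the tokeniser, and the expected token lists are made up for the example rather than being the agreed output for Finnish.

```python
def run_cases(tokenise, cases):
    """Return (text, expected, actual) for every case whose output differs."""
    failures = []
    for text, expected in cases:
        actual = tokenise(text)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures


# Illustrative cases only; the expected lists are assumptions, not agreed output.
CASES = [
    ("tiukoin ottein - niin kuin", ["tiukoin", "ottein", "-", "niin", "kuin"]),
    ("(n. 4200 - 2500 eaa.)", ["(", "n.", "4200", "-", "2500", "eaa.", ")"]),
]

# str.split is a toy stand-in for the real tokeniser.
failures = run_cases(str.split, CASES)
for text, expected, actual in failures:
    print(f"FAIL {text!r}: expected {expected}, got {actual}")
```

With the whitespace stand-in only the second case is reported, because the parentheses stay glued to their neighbours. Swapping in a function that calls the real tokeniser would turn the same table into a check that can run after every lexc change.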
Looks good. One note of caution: a word-final or word-initial hyphen could also be an error (i.e. a missing space between hyphen and word), so I would suggest that these hyphens are part of the word token IFF the full token can be analyzed as such. If not, I would treat them as two separate tokens. Otherwise you will get one unknown token where you could have had two known tokens.
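The rule suggested above can be sketched as a small post-processing step. This is a minimal sketch under assumptions: `split_edge_hyphens` is a hypothetical helper, and `TOY_LEXICON` is a toy stand-in for an analyser lookup, not the contents of any real FST.

```python
def split_edge_hyphens(token, is_analysable):
    """Keep an edge hyphen attached only if the whole token can be analysed."""
    if is_analysable(token):
        return [token]
    if len(token) > 1 and token.endswith("-"):
        return [token[:-1], "-"]
    if len(token) > 1 and token.startswith("-"):
        return ["-", token[1:]]
    return [token]


# Toy stand-in for an analyser lookup (hypothetical contents).
TOY_LEXICON = {"keski-", "-Norjan", "ottein", "-"}


def toy_is_analysable(token):
    return token in TOY_LEXICON


print(split_edge_hyphens("keski-", toy_is_analysable))   # ['keski-']
print(split_edge_hyphens("ottein-", toy_is_analysable))  # ['ottein', '-']
```

Here "keski-" stays whole because the toy lexicon knows it with its hyphen, while "ottein-" is split because only "ottein" is known.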
I am not sure Sjur's approach would work for Estonian. For example, we have the word "industriaalne" (industrial), which is truncated in a compound: "industriaalmaastik" (industrial landscape). "industriaal" alone is not a legitimate word. Now, when it is part of a co-ordinated noun phrase, it is truncated and takes a hyphen: "industriaal- ja linnamaastik" (industrial and city landscape). This means that "industriaal-" is legitimate only together with the final hyphen. This truncation and hyphenation convention is not exceptional.
@merisiga it would work well as long as the word form |
Ok, good. Now, what would you do with industriaal-- (two trailing hyphens)? This is a spelling error, but it consists of a legitimate token plus a hyphen. I am asking because this seems to be one of those corner cases, and I feel it would be easy to come up with an ad hoc solution which might induce some more ad hoc solutions down the pipeline. In short, I have an uneasy feeling, but cannot point to the exact reason for it...
Depends on the task. For tokenisation I would probably just let it be. I am not sure how the tokens would be split, there are several possible outcomes depending on the FST and the tokenisation rules. The easiest solution would be to just add In the case of a grammar checker, I would probably iterate over the errors - two hyphens are usually an error to be corrected to one (again, such error detection and correction can be context dependent), and once corrected, the rest of the sentence including the corrected Ie by treating |
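For the grammar-checker case, the "correct two hyphens to one, then analyse again" step might look roughly like this. It is only a sketch: `collapse_repeated_hyphens` is a hypothetical helper, and it ignores context on purpose, although such corrections can be context dependent (a double hyphen is sometimes meant as a dash).

```python
import re


def collapse_repeated_hyphens(token):
    """Return (corrected_token, was_changed); runs of '-' become one '-'."""
    corrected = re.sub(r"-{2,}", "-", token)
    return corrected, corrected != token


print(collapse_repeated_hyphens("industriaal--"))  # ('industriaal-', True)
print(collapse_repeated_hyphens("industriaal-"))   # ('industriaal-', False)
```

The returned flag lets a caller record that a correction was made before passing the corrected token back to the analyser.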
The Finnish output is now like:
The fixes so far have only touched the lexc files, such as deleting space and space-hyphen from the analysers and adding the expected hyphens to words. In most cases the tokenisation follows from the analyser.
lang-fin/tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
The hyphen-minus is sometimes split off as a separate token and at other times kept attached in +Cmp/SplitR situations.
Here are five separate instances:
(1a)
Ruotsin keski- ja eteläosien välille
(1b)
yleisesti Etelä- ja Keski-Suomen alueella
(2)
Instances of +Cmp/SplitL
Keski-Ruotsin ja -Norjan asumattomille metsäseuduille
In (2), the hyphen-minus is indented and separate, unlike in (1a).
In (3) and (4), a leading whitespace unexpectedly appears before the hyphen-minus.
The "niin kuin" token is also peculiar.
(3)
tiukoin ottein - niin kuin
(4)
(n. 4200 - 2500 eaa.)
In (5), an extra line has been inserted, though this may be related to the « quotation mark.
(5)
tarkoituksenmukaisuus - «muoto seuraa