Digits wrongly processed by divvun-suggest output awk script #25
Ah yeah, these are all interesting ambiguities: a colon at the beginning of the line is used both by the whitespace handling in our cg format and now by suggestions, the comma in the second field is used to separate suggestions (though probably not empty strings), and spaces in the lemma make field counting a bit ambiguous. So the current workarounds are a bit hacky.
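To illustrate the field-counting ambiguity caused by spaces in the lemma, here is a minimal Python sketch (not the awk script under discussion). It assumes the usual CG stream layout, where a reading line starts with a tab and carries the lemma in double quotes followed by space-separated tags; the actual divvun-suggest output may differ. The function name `parse_reading`, the sample lemma and the tags are hypothetical.

```python
import re

# Assumed layout of a reading line: TAB, "lemma" in double quotes, then tags.
# The lemma group allows spaces and backslash-escaped characters.
READING = re.compile(r'^\t+"((?:[^"\\]|\\.)*)"\s*(.*)$')


def parse_reading(line):
    """Return (lemma, tags) for a reading line, or None if it does not match."""
    m = READING.match(line)
    if m is None:
        return None
    lemma, rest = m.groups()
    return lemma, rest.split()


sample = '\t"two words" N Sg Nom'  # hypothetical reading with a space in the lemma

# Naive whitespace splitting cuts the lemma in two and shifts every field.
naive_fields = sample.split()

# Matching the quoted lemma first keeps it intact.
parsed = parse_reading(sample)
```

Splitting on whitespace alone yields five fields for this line, while matching the quoted lemma first yields one lemma and three tags, so later fields keep a stable position.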
The problem with compounds seems to be when one analysis renders a
This leads to the following final text:
Instead of the expected:
Here's an extreme case of the duplicate compound bug:
is turned into:
😀 It seems that in this case, none of the generated forms is identical to the input form (because the input is a NO form, and all output forms are SE forms), and the output forms differ among themselves. In such a case, we just go for the first one; we don't have the time to do anything more advanced.
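The selection rule described here can be sketched as follows. This is a Python illustration only, not the awk implementation; the function name `choose_form` and the sample strings are hypothetical, and returning the input unchanged when there are no candidates is an assumption.

```python
def choose_form(input_form, candidates):
    """Pick a single output form for one cohort.

    Prefer a generated form identical to the input; otherwise fall back
    to the first generated form. With no candidates at all, keep the
    input unchanged (an assumption, not confirmed by the script).
    """
    if not candidates:
        return input_form
    if input_form in candidates:
        return input_form
    return candidates[0]


# One generated form matches the input: keep the input form.
same = choose_form("form-a", ["form-b", "form-a"])

# No generated form matches the input: take the first one.
first = choose_form("form-x", ["form-b", "form-a"])

# Nothing generated: leave the input as it is.
kept = choose_form("form-x", [])
```

With this rule, concatenating several forms never happens: each cohort yields exactly one string.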
Might need a more advanced parser, or a fix on the divvun-suggest side, for some of these cases where there are multiple different results, before the awk becomes too unwieldy... For a quick patch, maybe vislcg's
The Any suggestion welcome 🙂
Actually, it seems to be only two cases left:
Since I don't know awk, I have no idea how hard this would be.
I actually couldn't get it's maybe less spaghetti than it might have been, but the code now tracks roughly one output per cohort, which is not the best; to control it, cg rules might actually be the way to go. For cand.polit. the variants are, I think, with and without a full stop, and maybe one with :v at the end, if I read correctly?
With the changes in 63adeef it is almost perfect. What is still left are some cases where the awk script does not choose the word form that is identical to the input form even when there is such a word form. Example input:
Earlier output (with multiple forms concatenated):
New output after the latest fix:
Expected: since the input form is
Another error:
The æ in "Æládus" has become ä, but not in the last part of the compound, "departemænnta".
Gehtjav makkár {åvddånbuktemvuohke} l gåvån = Gehtjav makkár {} l gåvån åvddånbuktemvuohke e err/cmp.
Anne Silja l aj åvdep giese journalisstan barggam {NRK} Sámeradio åvdås = Anne Silja l aj åvdep giese journalisstan barggam {NRK:A} Sámeradio åvdås
Dán jagásj Bårjjåsin li guokta {vuostasj} artihkkala = Dán jagásj Bårjjåsin li guokta {vuostas} artihkkala
Ja {gájkka} galggá sámegiellaj dáhpáduvvat = Ja {gájkav} galggá sámegiellaj dáhpáduvvat
In retrospect, the processing should have been different:
I.e. keep everything as in the original unless there is a reason to do otherwise. We might still want to do this in future work for other languages and contexts.
Mm, a long-term solution should definitely go into one of the applications that can process whole cohorts and sentences with all information intact. But this prototype will be very useful for the design; we should collect and categorise the problems to keep in mind.
The script earlier found in lang-smj/tools/tts/convert-helper.awk has been moved to giella-core/scripts/convert-divvunsuggest-to-almostplain.awk; I believe it can be useful for more languages.
There are still some issues:
:
?):
?)
Sample input to test the errors
Digits in the thousands (i.e. including a space):
CLB tag in the output:
Compound:
Disappearing commas:
Duplicated compounds:
ends up as: