Digits wrongly processed by divvun-suggest output awk script #25

snomos · 2022-11-24T12:16:34Z

The script earlier found in lang-smj/tools/tts/convert-helper.awk has been moved to giella-core/scripts/convert-divvunsuggest-to-almostplain.awk, since I believe it can be useful for more languages.

There are still some issues:

digits in the thousands
some compounds (followed by :?)
CLB tags in the output (final :?)
Actual commas disappear
Some long compounds get duplicated

Sample input to test the errors

Digits in the thousands (i.e. including a space):

– Nordlánda fylkkamánne le ájnas aktisasjbarggoguojmme gå galggap tjadádit dåjmajt sámegiela doajmmaplánan oarjjel- ja julevsámegielaj gáktuj. Dan diehti doarjju ráddidus fylkkamánne gielladåjmajt jagen 2010 1,7 millijåvnåjn kråvnåjn, javllá ådåstuhttem-, háldadus- ja girkkoministar Rigmor Aasrud. Duodden juolloduvvá 150 000 kråvnå prosjæktaj ”YouTube på lulesamisk”.

CLB tag in the output:

Sij gudi libjjáv oadtju li:

Compound:

Gållelibjes: Oddvar Hansen, Otto Kristian Løvik, Lill Hege Nilsen, Jørn Øverby ja Erik Martinsen Øvergaard.

Disappearing commas:

Silbbalibjes: Lise Berit Aronsen, Thomas Cordtsen, Kjersti Buer Dolve, Sjur Harald Dolve, Marcel Gleffe, Bjørn Kasper Ilaug, Allan Søndergaard Jensen, Bjørn Juvet, Terje Bergan Lien, Christer Lillestrøm, Sissel Martinsen, Ole Johan Simonsen, Julien Lucien Bernard Sué ja Helge Gjerløw Wettre.

Duplicated compounds:

Barggovuorddemrudá binneduvvi 3 165 millijåvnåjn kråvnåjn

ends up as:

BargovuorddemrudáBarggovuorddemrudá binneduvvi 3 165 millijåvnåjn kråvnåjn

flammie · 2022-11-24T14:02:07Z

Ah yeah, these are all interesting ambiguities: a colon at the beginning of a line is used both by the whitespace handling in our CG format and now by suggestions, the comma in the second field is used to separate suggestions (though probably not empty strings), and spaces in the lemma make field counting a bit ambiguous.

So the current workarounds are a bit hacky.

cf issue #25

snomos · 2022-11-24T14:58:59Z

The problem with compounds seems to be when one analysis renders a ?, while the other renders the expected output:

:\n
"<Gållelibjes>"
	"libjes" N Sem/Dummytag Sg Nom
		"gålle" N Sem/Mat Cmp/SgGen Cmp
gålle+N+Cmp/SgGen+Cmp#libjes+N+Sg+Nom	?
	"libjes" N Sem/Dummytag Sg Nom
		"gålle" N Sem/Mat Cmp/SgNom Cmp
gålle+N+Cmp/SgNom+Cmp#libjes+N+Sg+Nom	Gållelibjes
"<:>"
	":" CLB
:+CLB	:
:

This leads to the following final text:

gålle+N+Cmp/SgGen+Cmp#libjes+N+Sg+NomGållelibjes: Oddvar Hansen, Otto Kristian Løvik, Lill Hege Nilsen, Jørn Øverby ja Erik Martinsen Øvergaard.

Instead of the expected:

Gållelibjes: Oddvar Hansen, Otto Kristian Løvik, Lill Hege Nilsen, Jørn Øverby ja Erik Martinsen Øvergaard.
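To illustrate the failure above, here is a minimal Python sketch (not the awk script itself) that collects the generated forms of one cohort and drops failed generations. It assumes that generation lines have the shape analysis, tab, form, that reading lines are tab-indented, and that a failed generation is rendered as a bare `?`; the function name `generated_forms` is hypothetical.

```python
def generated_forms(cohort_lines):
    """Return the generated surface forms of one cohort, skipping failures ('?')."""
    forms = []
    for line in cohort_lines:
        # Reading lines are tab-indented; wordform lines start with "<.
        if line.startswith("\t") or line.startswith('"<'):
            continue
        # Assumption: generation lines are "analysis<TAB>form".
        _analysis, sep, form = line.partition("\t")
        if sep and form != "?":
            forms.append(form)
    return forms


cohort = [
    '"<Gållelibjes>"',
    '\t"libjes" N Sem/Dummytag Sg Nom',
    '\t\t"gålle" N Sem/Mat Cmp/SgGen Cmp',
    'gålle+N+Cmp/SgGen+Cmp#libjes+N+Sg+Nom\t?',
    '\t"libjes" N Sem/Dummytag Sg Nom',
    '\t\t"gålle" N Sem/Mat Cmp/SgNom Cmp',
    'gålle+N+Cmp/SgNom+Cmp#libjes+N+Sg+Nom\tGållelibjes',
]
print(generated_forms(cohort))  # ['Gållelibjes']
```

With the `?` result filtered out, only the form from the second analysis remains, so the analysis string is never copied into the text.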

snomos · 2022-11-24T15:57:11Z

Here's an extreme case of the duplicate compound bug:

Bjørn Olav Megard le låhkåm cand.polit. sosialantropologijja oajvvefágajn Oslo universitehtas.

is turned into:

Bjørn Olav Megard le låhkåm cand.polit.cand.polit socialantropologidjasocialantropologiddjasocialantropologidjasocialantropologiddja oajvvefágajn Oslo universitehtas.

😀

It seems that in this case, none of the generated forms are identical to the input form (because the input is a NO form, and all output forms are SE forms), and the output forms differ among them. In such a case, we just go for the first one, we don't have the time to do anything more advanced.

flammie · 2022-11-24T19:31:46Z

We might need a more advanced parser, or a fix on the divvun-suggest side, for some of these cases where there are multiple different results, before the awk becomes too unwieldy... As a quick patch, maybe vislcg's -1 option for picking one at random as a tie-break is good enough?

snomos · 2022-11-24T19:38:55Z

The vislcg solution will only fix things on the analysis side. The problem is that the generation of the new word forms is ambiguous, and I don't know how to fix this quickly. Adding an option to divvun-suggest to make the output unique would help, but is not enough. A flag on stderr to warn about unresolved ambiguity would also help; there are not that many such cases.

Any suggestion welcome 🙂
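The two wishes above (unique output, and a warning about unresolved ambiguity) could be sketched in a post-processing step as follows. This is a Python illustration only, not an existing divvun-suggest option; the function name `unique_forms`, its `log` parameter and the warning text are assumptions.

```python
import sys


def unique_forms(wordform, generated, log=sys.stderr):
    """Deduplicate generated forms (order preserved) and warn when they stay ambiguous."""
    unique = list(dict.fromkeys(generated))
    # "Unresolved" here means: several distinct forms, none identical to the input.
    if len(unique) > 1 and wordform not in unique:
        print(f"unresolved ambiguity for {wordform!r}: {unique}", file=log)
    return unique


forms = unique_forms(
    "socialantropologijja",
    [
        "socialantropologidja",
        "socialantropologiddja",
        "socialantropologidja",
        "socialantropologiddja",
    ],
)
print(forms)  # ['socialantropologidja', 'socialantropologiddja']
```

The warning goes to stderr, so the converted text on stdout stays clean while the remaining ambiguous cohorts can be collected separately.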

snomos · 2022-11-24T20:21:32Z

Actually, it seems to be only two cases left:

cand.polit., where the output matches the input, but is probably confused by the final full stop; an adjustment of the existing code should be enough
sociala..., where none of the outputs match the input; as a last resort just go with what is at hand / is easiest

Since I don't know awk, I have no idea how hard this would be.

flammie · 2022-11-24T22:06:55Z

I actually couldn't get cg-proc -1 to work anyways 😊

It's maybe less spaghetti than it might have been, but the code now tracks more or less one output per cohort, which is not ideal; to actually control it, cg rules might be the way to go.

For cand.polit. the variants are I think with and without a full stop and maybe one with :v at the end if I read correctly?

snomos · 2022-11-25T06:30:27Z

With the changes in 63adeef it is almost perfect. What is still left are some cases where the awk script does not choose the word form that is identical to the input form even when there is such a word form. Example input:

Jus muhtema mielas nágin le vajáldahtedum, de dåhkki ájn ienep ulmutjijt libjjáj oajvvadit.

Earlier output (with multiple forms concatenated):

Jus muhtemuhtema mielas nágin le vajáldahtedum, de dåhkki ájn ienep ulmutjijt libjjáj oajvvadit.

New output after the latest fix:

Jus muhte mielas nágin le vajáldahtedum, de dåhkki ájn ienep ulmutjijt libjjáj oajvvadit.

Expected: since the input form is muhtema, and that word form is among the generated word forms, I had expected that one to be selected.
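The expected selection can be sketched like this in Python (an illustration, not the awk code): prefer a generated form identical to the input, and only fall back to the first generated form when none matches. The function name `choose_form` is hypothetical, and the lists of generated forms are inferred from the concatenated outputs quoted in this thread.

```python
def choose_form(input_form, generated):
    """Pick the output form for one cohort."""
    if not generated:
        # Nothing was generated: keep the original word form.
        return input_form
    if input_form in generated:
        # A generated form is identical to the input: use it.
        return input_form
    # Last resort: go with the first generated form.
    return generated[0]


print(choose_form("muhtema", ["muhte", "muhtema"]))  # muhtema
print(
    choose_form(
        "socialantropologijja",
        ["socialantropologidja", "socialantropologiddja"],
    )
)  # socialantropologidja
```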

ilm024 · 2022-11-25T07:20:21Z

Another error:

Äládus- ja oasesdepartementaoasesdepartemænnta

The æ in "Æládus" has become ä, but not in the last part of the compound, "departemænnta".

ilm024 · 2022-11-25T13:52:40Z

Gehtjav makkár {åvddånbuktemvuohke} l gåvån = Gehtjav makkár {} l gåvån

åvddånbuktemvuohke e err/cmp.

ilm024 · 2022-11-25T13:56:32Z

Anne Silja l aj åvdep giese journalisstan barggam {NRK} Sámeradio åvdås= Anne Silja l aj åvdep giese journalisstan barggam {NRK:A} Sámeradio åvdås

ilm024 · 2022-11-25T14:06:34Z

Dán jagásj Bårjjåsin li guokta {vuostasj} artihkkala = Dán jagásj Bårjjåsin li guokta {vuostas} artihkkala

ilm024 · 2022-11-25T14:22:58Z

Ja {gájkka} galggá sámegiellaj dáhpáduvvat = Ja {gájkav} galggá sámegiellaj dáhpáduvvat

snomos · 2022-11-28T10:04:54Z

In retrospect, the processing should have been different:

grab original word form by default
grab (first) generated word form only when cohort contains the target tag

I.e. keep everything as in the original unless there is reason to do otherwise.

We might still want to do this for future work for other languages and contexts.
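The alternative processing described above could look roughly like this in Python. It is a sketch under assumptions: readings are taken to be space-separated tag strings, and the tag `TARGET`, the form `GENERATED` and the function name `output_form` are placeholders, since the actual target tag is not named in this thread.

```python
def output_form(wordform, readings, generated, target_tag):
    """Keep the original word form unless a reading carries the target tag."""
    has_target = any(target_tag in reading.split() for reading in readings)
    if has_target and generated:
        # Only now take the (first) generated form.
        return generated[0]
    return wordform


# No target tag in the cohort: the original form is kept untouched.
print(output_form(
    "Gållelibjes",
    ['"libjes" N Sem/Dummytag Sg Nom'],
    ["Gållelibjes"],
    "TARGET",
))  # Gållelibjes

# Hypothetical cohort carrying the placeholder tag: the generated form is used.
print(output_form("2010", ['"2010" Num TARGET'], ["GENERATED"], "TARGET"))  # GENERATED
```

With this default, cohorts without the target tag can never be altered by ambiguous or failed generation.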

flammie · 2022-11-28T10:26:53Z

Mm, a long-term solution should definitely go into one of the applications that can process whole cohorts and sentences with all information intact. But this prototype will be very useful for the design; we should collect and categorise the problems to keep them in mind.

snomos added the bug Something isn't working label Nov 24, 2022

snomos assigned flammie Nov 24, 2022

flammie added a commit that referenced this issue Nov 24, 2022

work around some ambiguous corner cases

95b7fa1

cf issue #25
