Checker info - SEGM and XPOSTAG fields #18

Open
epageperron opened this issue Jan 16, 2018 · 23 comments
@epageperron
Member

epageperron commented Jan 16, 2018

As of now, these are the rules concerning the SEGM and XPOSTAG fields of the CDLI-CoNLL format. They might still change, since we are still running into problems while annotating that require making decisions on the rules.

SEGM
Contains the lemma, which is composed of a dictionary word followed by its sense in square brackets, e.g. udu[sheep] or dab[seize].
For all word types except verbs, there will only be suffixed morphemes, no prefixed ones.
All morphemes except the first element are composed of a dash followed by the morpheme.
Only in the case of verbs, the first prefix is written without a dash, e.g. i[-n]-dab[seize][-ø].
Every morpheme can be enclosed in [] or not.
There are also rules concerning the "slots" for morphemes, but since we are not noting them we will not check the order at this time. We should open a backlog issue for it: since we want to democratize the usage of the tool, checking the possible order of morphemes would be an asset for inexperienced annotators.
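The SEGM rules above can be sketched as a small parser. This is only an illustration under assumptions: the two patterns and the function name `split_segm` are hypothetical and are not the checker's actual regex.

```python
import re

# Hypothetical patterns for illustration; the checker's real regex differs.
# Lemma: a word followed by [sense], where the sense does not start with a dash.
LEMMA_RE = re.compile(r"(?P<word>[^\[\]-]+)\[(?P<sense>[^\]-][^\]]*)\]")
# Morpheme: either bracketed "[-x]" or bare "x" / "-x".
MORPHEME_RE = re.compile(r"\[-([^\]]+)\]|-?([^\[\]-]+)")


def split_segm(segm):
    """Return (prefixes, word, sense, suffixes), or None if no lemma is found."""
    match = LEMMA_RE.search(segm)
    if match is None:
        return None
    before, after = segm[:match.start()], segm[match.end():]
    prefixes = [a or b for a, b in MORPHEME_RE.findall(before)]
    suffixes = [a or b for a, b in MORPHEME_RE.findall(after)]
    return prefixes, match.group("word"), match.group("sense"), suffixes


print(split_segm("i[-n]-dab[seize][-ø]"))  # (['i', 'n'], 'dab', 'seize', ['ø'])
print(split_segm("udu[sheep]"))            # ([], 'udu', 'sheep', [])
```

Splitting first and validating afterwards keeps the "prefixes only on verbs" rule out of the regex; that check needs the XPOSTAG anyway.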

XPOSTAG
This field will display the ETCSRI/ORACC POS tag OR the named entity tag instead of the lemma in the SEGM field.
For all word types except verbs, there will only be suffixed morphological tags, no prefixed ones.
All morphological tags except the first element are composed of a period followed by the tag.
Only verbs can have prefixes, e.g. FIN.3-SG-H-A.V.3-SG-P
Tags can contain dashes; they are not meaningful in this context, since the checker should use a map of morphemes and morph tags.
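A minimal sketch of the XPOSTAG rules, assuming the tag is split on periods only (so dashes stay inside each tag). The base tag inventory is copied from the checker error message quoted later in this thread; the function name `split_xpostag` is hypothetical.

```python
# Base POS and named entity tags, as listed in the checker's error message.
BASE_POS = {
    "AJ", "AV", "NU", "CNJ", "DET", "J", "N", "PRP", "DN", "EN", "GN", "MN",
    "PN", "RN", "SN", "TN", "WN", "AN", "CN", "FN", "LN", "ON", "QN", "YN",
    "V", "V-PL", "V-PF", "V-RDP",
}


def split_xpostag(xpostag):
    """Split on periods and return (prefixes, base, suffixes)."""
    parts = xpostag.split(".")
    for index, part in enumerate(parts):
        if part in BASE_POS:
            prefixes, suffixes = parts[:index], parts[index + 1:]
            # Only verbs may carry tags before the base POS.
            if prefixes and not part.startswith("V"):
                raise ValueError(f"{part} cannot have prefixes: {prefixes}")
            return prefixes, part, suffixes
    raise ValueError(f"no base POS tag found in {xpostag!r}")


print(split_xpostag("FIN.3-SG-H-A.V.3-SG-P"))
# (['FIN', '3-SG-H-A'], 'V', ['3-SG-P'])
```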

@jayanthkmr jayanthkmr self-assigned this Jan 18, 2018

@jayanthkmr
Collaborator

XPOSTAG:
if it is V, there can be prefixes and if not V, then no prefixes.
POS TAGS - pos tag.form regex (form may have dashes)
Check if pos tag is in the inventory.
SEGM field has dashes and XPOSTAGS has periods.

https://github.com/cdli-gh/CDLI-CoNLL-to-CoNLLU-Converter/blob/master/cdliconll2conllu/mapping.json#L25

https://docs.google.com/spreadsheets/d/1Is7MGG0h8h0vfHj9C9mnWOD2utPeuvm1ZeYb1dsaejg/edit#gid=0

only ORACC

SEGM
Main lemma followed by [sense in square brackets]


@epageperron
Member Author

epageperron commented May 5, 2018

Can you please allow underscores in the "sense", i.e. inside the square brackets following a lemma?

Error: The segm bisagdubak[filing_basket] in line number 3 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P340861.conll does not follow the format "^(([a-z0-9]+)((-[a-z0-9]+)|([-[a-z0-9]+]))-)?[A-Za-z0-9()]+[[a-z0-9]]((-[a-z0-9]+)|([-[a-z0-9]+])|([-ø]))*$".


Use underscore in SEGM sense.

@epageperron
Member Author

epageperron commented May 5, 2018

same for the tilde :
Error: The segm bitum[~administration] in line number 5 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P340861.conll does not follow the format "^(([a-z0-9]+)((-[a-z0-9]+)|([-[a-z0-9]+]))-)?[A-Za-z0-9()]+[[a-z0-9]]((-[a-z0-9]+)|([-[a-z0-9]+])|([-ø]))*$".


[~*] - a tilde at the start of the sense is valid (not anywhere in between).
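The two requests about the sense (underscores allowed, tilde allowed only as the first character) could be expressed as a pattern like the one below. This is a sketch: `SENSE_RE` and `is_valid_sense` are hypothetical names, and the character class is an assumption extended from the `[a-z0-9]` class in the error messages.

```python
import re

# Optional leading tilde, then lowercase letters, digits or underscores.
SENSE_RE = re.compile(r"^~?[a-z0-9_]+$")


def is_valid_sense(sense):
    """True if the text between the square brackets is an acceptable sense."""
    return SENSE_RE.match(sense) is not None


print(is_valid_sense("filing_basket"))     # True
print(is_valid_sense("~administration"))   # True
print(is_valid_sense("admin~istration"))   # False
```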

@epageperron
Member Author

epageperron commented May 5, 2018

morphemes and tags attached to nouns can also be attached to verbs :
Error: The xpostag MID.V.3-SG-S.TERM in line number 11 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P202745.conll has a suffix TERM not in verb postag suffix list ['1-SG-A', '1-SG-S', '1-SG-P', '2-SG-A', '2-SG-S', '2-SG-P', '3-SG-S', '3-SG-P', '3-SG-A', '3-SG-S-OB', '1-PL-A', '1-PL-S', '1-PL', '2-PL-A', '2-PL-S', '2-PL', '3-PL-S', '3-PL-P', '3-PL', '3-PL-A', 'PF', 'SUB', 'PLEN'].


Noun tags can also be verb suffixes. So, corresponding morphemes are also part of verb form.
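One way to implement this is to check verb suffixes against the union of both inventories. The sketch below takes the verb suffixes from the error message above; the noun suffixes are only an excerpt of the noun list reported in a later error message in this thread, and the helper name `unknown_verb_suffixes` is hypothetical.

```python
# Verb suffix tags, as listed in the checker's error message.
VERB_SUFFIXES = {
    "1-SG-A", "1-SG-S", "1-SG-P", "2-SG-A", "2-SG-S", "2-SG-P", "3-SG-S",
    "3-SG-P", "3-SG-A", "3-SG-S-OB", "1-PL-A", "1-PL-S", "1-PL", "2-PL-A",
    "2-PL-S", "2-PL", "3-PL-S", "3-PL-P", "3-PL", "3-PL-A", "PF", "SUB", "PLEN",
}
# Excerpt of the noun suffix tags (not the full inventory).
NOUN_SUFFIXES = {"GEN", "TERM", "ABS", "ERG", "ABL", "COM", "PL", "EQU", "ADV"}

# Noun suffixes are accepted after a verb as well.
ALLOWED_AFTER_VERB = VERB_SUFFIXES | NOUN_SUFFIXES


def unknown_verb_suffixes(suffixes):
    """Return the suffix tags that are in neither inventory."""
    return [tag for tag in suffixes if tag not in ALLOWED_AFTER_VERB]


print(unknown_verb_suffixes(["3-SG-S", "TERM"]))  # []
```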

@epageperron
Member Author

epageperron commented May 5, 2018

Error: The xpostag FIN.3-SG-H-A.V.3-SG-P in line number 22 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P102311.conll has a prefix 3-SG-H-A not in verb postag prefix list ['NEG', 'MOD1', 'ANT', 'MOD2', 'MOD4', 'MOD6', 'MOD7', 'MOD3', 'MOD5', 'COOR', 'VEN', 'VEN', 'MID', 'FIN', 'FIN-LI', 'FIN-L2', 'FIN', 'FIN-L2', 'FIN'].

This is correct, a verb can have more than one prefix :

Check in https://docs.google.com/spreadsheets/d/1y0_y9HDQNwH0VqDCjjYuUpFsugw4GEybu6Pu01I_D9c/edit#gid=0

Verbs can display prefixes from these categories:

  • Modal prefixes
  • Conjugation prefixes
  • Dimensional infixes (one pronoun, one or more cases)
  • Pronominal element before base

Add Pronominal element before base into prefix list.
75-81 (dimensional infix pronouns): only one can occur; 84-114 can occur multiple times.
The prefixes follow a fixed order, from modal prefixes to the pronominal element.
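The ordering and multiplicity notes above could be checked roughly as follows. This is a sketch under assumptions: the `SLOT` table is a small illustrative sample, the assignment of each tag to a category is my guess and should be taken from the spreadsheet instead, and `check_prefix_order` is a hypothetical name.

```python
MODAL, CONJUGATION, DIM_PRONOUN, DIM_CASE, PRONOMINAL = range(5)

# Illustrative sample only; category assignments are assumptions.
SLOT = {
    "NEG": MODAL, "MOD1": MODAL,
    "VEN": CONJUGATION, "MID": CONJUGATION, "FIN": CONJUGATION,
    "3-PL-H": DIM_PRONOUN,
    "DAT": DIM_CASE,
    "3-SG-H-A": PRONOMINAL,
}


def check_prefix_order(prefixes):
    """Return a list of problems found in a verb's prefix tags."""
    problems = []
    highest_seen = -1
    pronoun_count = 0
    for tag in prefixes:
        slot = SLOT.get(tag)
        if slot is None:
            problems.append(f"unknown prefix {tag}")
            continue
        # A tag from an earlier category may not follow a later one.
        if slot < highest_seen:
            problems.append(f"{tag} is out of order")
        highest_seen = max(highest_seen, slot)
        if slot == DIM_PRONOUN:
            pronoun_count += 1
    # Only one dimensional pronoun is allowed; cases may repeat.
    if pronoun_count > 1:
        problems.append("more than one dimensional pronoun")
    return problems


print(check_prefix_order(["VEN", "3-PL-H", "DAT", "3-SG-H-A"]))  # []
print(check_prefix_order(["DAT", "FIN"]))  # ['FIN is out of order']
```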

@epageperron
Member Author

epageperron commented May 5, 2018

in text :
o.2.4 gu4-e-us2-sa gueusa[ox-following][-a] V.SUB _ _ _

Error: The segm V.SUB in line number 9 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P102311.conll does not follow the format "^(([a-z0-9]+)((-[a-z0-9]+)|([-[a-z0-9]+]))-)?[A-Za-z0-9()]+[[a-z0-9]]((-[a-z0-9]+)|([-[a-z0-9]+])|([-ø]))*$".

So V.SUB is the XPOSTAG column, no?

Is the dash inside the sense maybe the problem, while the error reports something else? If that is the case, please permit - inside the sense.


Permit dash in the middle of the segm.

@epageperron
Member Author

epageperron commented May 5, 2018

Error: The xpostag _ in line number 9 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P102311.conll does not have a base postag out of "['AJ', 'AV', 'NU', 'CNJ', 'DET', 'J', 'N', 'PRP', 'DN', 'EN', 'GN', 'MN', 'PN', 'RN', 'SN', 'TN', 'WN', 'AN', 'CN', 'FN', 'LN', 'ON', 'QN', 'YN', 'V', 'V-PL', 'V-PF', 'V-RDP']".

Maybe give a warning, not error, saying that segm or xpostag hasn't been filled if you find an underscore?


Tab missing. Error in file.
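Both points (a missing tab shifting the columns, and `_` meaning "not filled yet") could be caught before the SEGM and XPOSTAG checks run. The sketch below assumes seven tab-separated fields with SEGM third and XPOSTAG fourth, as in the sample line quoted earlier in this thread; `check_row` is a hypothetical helper, and the warning level is the suggestion from the comment above, not confirmed behaviour.

```python
def check_row(line, expected_columns=7):
    """Return (level, message) pairs for one token line of a .conll file."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != expected_columns:
        # A missing tab merges two fields, so later columns are misread.
        return [("error",
                 f"expected {expected_columns} tab-separated fields, "
                 f"found {len(fields)}")]
    messages = []
    for name, value in (("SEGM", fields[2]), ("XPOSTAG", fields[3])):
        if value == "_":
            messages.append(("warning", f"{name} has not been filled"))
    return messages


broken = "o.2.4\tgu4-e-us2-sa\tgueusa[ox-following][-a] V.SUB\t_\t_\t_"
print(check_row(broken))
# [('error', 'expected 7 tab-separated fields, found 6')]
```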

@epageperron
Member Author

epageperron commented May 5, 2018

Error: The segm us[follow]-'a in line number 15 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P102315.conll does not follow the format "^(([a-z0-9]+)((-[a-z0-9]+)|([-[a-z0-9]+]))-)?[A-Za-z0-9()]+[[a-z0-9]]((-[a-z0-9]+)|([-[a-z0-9]+])|([-ø]))*$".

the normalized morpheme 'a exists, just that sheets doesn't like apostrophes


Allow apostrophes.

@epageperron
Member Author

epageperron commented May 5, 2018

Error: The segm Geme'enlila[1] in line number 126 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P142753.conll does not follow the format "^(([a-z0-9]+)((-[a-z0-9]+)|([-[a-z0-9]+]))-)?[A-Za-z0-9()]+[[a-z0-9]]((-[a-z0-9]+)|([-[a-z0-9]+])|([-ø]))*$".

apostrophes in lemmata are OK


Allow apostrophes in lemmata.

@epageperron
Member Author

epageperron commented May 5, 2018

Error: The xpostag NU.3-SG-POSS.GEN in line number 31 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P480072.conll has a suffix 3-SG-POSS not in noun postag suffix list ['L1', 'L2-NH', 'GEN', '3-PL-POSS', '3-SG-H-POSS', '3-SG-NH-POSS', 'DEM2', 'COM', 'PL', 'ERG', 'DAT-NH', 'L3-NH', 'DEM', 'PL', 'ADV', 'EQU', '1-SG-POSS', '1-PL-POSS', 'L4', 'ABS', 'DAT-H', 'L2-H', 'L3-H', 'TERM', 'ABL', '2-SG-POSS', '2-PL-POSS', 'COP-1-SG', 'COP-2-SG', 'COP-3-SG', 'COP-1-PL', 'COP-2-PL', 'COP-3-PL', 'EXCEPT'].

You can add NU (numerals) to the list of POS that can take suffixes.


Error in pos tag.

@epageperron
Member Author

epageperron commented May 5, 2018

Forward slash is acceptable in lemma

Error: The segm 1/3(disz)[one] in line number 7 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P480072.conll does not follow the format "^(([a-z0-9]+)((-[a-z0-9]+)|([-[a-z0-9]+]))-)?[A-Za-z0-9()]+[[a-z0-9]]((-[a-z0-9]+)|([-[a-z0-9]+])|([-ø]))*$".


Allow forward slash in lemmata.

@jayanthkmr
Collaborator

Thanks @epageperron for your extensive testing. I will start adding the fixes one by one.

@epageperron
Member Author

epageperron commented May 5, 2018

@ is also fine in a lemma (my browser crashed so I don't have the example anymore, but something like disz@t).


Allow @ in lemmata
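Taken together, the requests for apostrophes, forward slashes and @ amount to widening the character classes for lemmata and morphemes. A sketch, assuming the original `[A-Za-z0-9()]` lemma class is simply extended; the pattern and function names are hypothetical.

```python
import re

# Lemma: letters, digits, parentheses, plus apostrophe, slash and @.
LEMMA_RE = re.compile(r"^[A-Za-z0-9()'/@]+$")
# Normalized morpheme: lowercase letters, digits, ø, plus apostrophe.
MORPHEME_RE = re.compile(r"^[a-z0-9ø']+$")


def is_valid_lemma(lemma):
    return LEMMA_RE.match(lemma) is not None


def is_valid_morpheme(morpheme):
    return MORPHEME_RE.match(morpheme) is not None


for lemma in ("Geme'enlila", "1/3(disz)", "disz@t"):
    print(lemma, is_valid_lemma(lemma))  # each prints True
print(is_valid_morpheme("'a"))  # True
```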

@jayanthkmr
Collaborator

jayanthkmr commented Jun 1, 2018

From Jinyan:

  1. For the errors of the following list, verb postag consists of prefix, infix and suffix, and infix includes dimensional infixes (pronouns and cases), like 3-PL (I corrected 3-PL-H in the following texts) and pronominal elements, like 3-SG-H-A. Please refer to:

https://docs.google.com/spreadsheets/d/1y0_y9HDQNwH0VqDCjjYuUpFsugw4GEybu6Pu01I_D9c/edit#gid=0

Error: The xpostag MID.DAT.V.SUB in line number 210 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P106438.conll has a prefix DAT not in verb postag prefix list ['NEG', 'MOD1', 'ANT', 'MOD2', 'MOD4', 'MOD6', 'MOD7', 'MOD3', 'MOD5', 'COOR', 'VEN', 'VEN', 'MID', 'FIN', 'FIN-LI', 'FIN-L2', 'FIN', 'FIN-L2', 'FIN'].

Error: The xpostag MID.DAT.V.SUB in line number 253 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P106438.conll has a prefix DAT not in verb postag prefix list ['NEG', 'MOD1', 'ANT', 'MOD2', 'MOD4', 'MOD6', 'MOD7', 'MOD3', 'MOD5', 'COOR', 'VEN', 'VEN', 'MID', 'FIN', 'FIN-LI', 'FIN-L2', 'FIN', 'FIN-L2', 'FIN'].

Error: The xpostag VEN.3-PL-H.DAT.3-SG-H-A.V.3-SG-P in line number 26 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P453803.conll has a prefix 3-PL-H not in verb postag prefix list ['NEG', 'MOD1', 'ANT', 'MOD2', 'MOD4', 'MOD6', 'MOD7', 'MOD3', 'MOD5', 'COOR', 'VEN', 'VEN', 'MID', 'FIN', 'FIN-LI', 'FIN-L2', 'FIN', 'FIN-L2', 'FIN'].

Error: The xpostag VEN.3-PL-H.DAT.3-SG-H-A.V.3-SG-P in line number 26 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P453803.conll has a prefix DAT not in verb postag prefix list ['NEG', 'MOD1', 'ANT', 'MOD2', 'MOD4', 'MOD6', 'MOD7', 'MOD3', 'MOD5', 'COOR', 'VEN', 'VEN', 'MID', 'FIN', 'FIN-LI', 'FIN-L2', 'FIN', 'FIN-L2', 'FIN'].

Error: The xpostag VEN.3-PL-H.DAT.3-SG-H-A.V.3-SG-P in line number 26 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P453803.conll has a prefix 3-SG-H-A not in verb postag prefix list ['NEG', 'MOD1', 'ANT', 'MOD2', 'MOD4', 'MOD6', 'MOD7', 'MOD3', 'MOD5', 'COOR', 'VEN', 'VEN', 'MID', 'FIN', 'FIN-LI', 'FIN-L2', 'FIN', 'FIN-L2', 'FIN'].

Error: The xpostag VEN.3-PL-H.DAT.3-SG-H-A.V.3-SG-P in line number 29 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P459210.conll has a prefix 3-PL-H not in verb postag prefix list ['NEG', 'MOD1', 'ANT', 'MOD2', 'MOD4', 'MOD6', 'MOD7', 'MOD3', 'MOD5', 'COOR', 'VEN', 'VEN', 'MID', 'FIN', 'FIN-LI', 'FIN-L2', 'FIN', 'FIN-L2', 'FIN'].

Error: The xpostag VEN.3-PL-H.DAT.3-SG-H-A.V.3-SG-P in line number 29 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P459210.conll has a prefix DAT not in verb postag prefix list ['NEG', 'MOD1', 'ANT', 'MOD2', 'MOD4', 'MOD6', 'MOD7', 'MOD3', 'MOD5', 'COOR', 'VEN', 'VEN', 'MID', 'FIN', 'FIN-LI', 'FIN-L2', 'FIN', 'FIN-L2', 'FIN'].

Error: The xpostag VEN.3-PL-H.DAT.3-SG-H-A.V.3-SG-P in line number 29 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P459210.conll has a prefix 3-SG-H-A not in verb postag prefix list ['NEG', 'MOD1', 'ANT', 'MOD2', 'MOD4', 'MOD6', 'MOD7', 'MOD3', 'MOD5', 'COOR', 'VEN', 'VEN', 'MID', 'FIN', 'FIN-LI', 'FIN-L2', 'FIN', 'FIN-L2', 'FIN'].

Error: The xpostag MID.DAT.V.3-SG-S.SUB in line number 13 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P458899.conll has a prefix DAT not in verb postag prefix list ['NEG', 'MOD1', 'ANT', 'MOD2', 'MOD4', 'MOD6', 'MOD7', 'MOD3', 'MOD5', 'COOR', 'VEN', 'VEN', 'MID', 'FIN', 'FIN-LI', 'FIN-L2', 'FIN', 'FIN-L2', 'FIN'].

  2. For the error of RDP, RDP is included in the stem of the verb. Please refer to:

https://docs.google.com/spreadsheets/d/1y0_y9HDQNwH0VqDCjjYuUpFsugw4GEybu6Pu01I_D9c/edit#gid=0

Error: The xpostag V.RDP.SUB.GENABL in line number 16 in file /alty/data/mtaac_gold_corpus/morph/to_dict/P459210.conll has a suffix RDP not in verb postag suffix list ['1-SG-A', '1-SG-S', '1-SG-P', '2-SG-A', '2-SG-S', '2-SG-P', '3-SG-S', '3-SG-P', '3-SG-A', '3-SG-S-OB', '1-PL-A', '1-PL-S', '1-PL', '2-PL-A', '2-PL-S', '2-PL', '3-PL-S', '3-PL-P', '3-PL', '3-PL-A', 'PF', 'SUB', 'PLEN'].


Handle RDP as part of speech.
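One possible reading of this note is that tags such as RDP, when they directly follow V, are grouped with the stem instead of being looked up in the suffix list. The sketch below assumes exactly that; `STEM_TAGS` and `split_verb_tag` are hypothetical, and the actual fix in the checker may look different.

```python
# Tags treated as part of the verb stem (assumption: only RDP).
STEM_TAGS = {"RDP"}


def split_verb_tag(xpostag):
    """Return (prefixes, stem, suffixes) for a verb XPOSTAG."""
    parts = xpostag.split(".")
    start = parts.index("V")  # raises ValueError if there is no V
    end = start + 1
    # Absorb stem tags that immediately follow V.
    while end < len(parts) and parts[end] in STEM_TAGS:
        end += 1
    return parts[:start], parts[start:end], parts[end:]


print(split_verb_tag("V.RDP.SUB.GENABL"))
# ([], ['V', 'RDP'], ['SUB', 'GENABL'])
```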

@jayanthkmr
Collaborator

Files to check:
P340861.conll
P202745.conll
P102311.conll
P102315.conll
P142753.conll
P480072.conll

@epageperron
Member Author

What should I be looking for in those files exactly?

@jayanthkmr
Collaborator

That note was for me :)

@wangj619

Error: The segm 1/2(disz)[one] in line number 6 in file /Users/user/annotation_assistant/scr/data/mtaac_gold_corpus/morph/processed/P339103.conll does not follow the format "^(([a-z0-9]+)((-[a-z0-9]+)|([-[a-z0-9]+]))-)?[A-Za-z0-9()]+[[a-z0-9]]((-[a-z0-9]+)|([-[a-z0-9]+])|([-ø]))*$".

This error appears again. Has the forward slash been made acceptable in lemmata?

@wangj619

wangj619 commented Aug 24, 2018

Error: The segm igi4(disz)[one]gal[fraction] in line number 23 in file /Users/user/annotation_assistant/scr/data/mtaac_gold_corpus/morph/processed/P339103.conll does not follow the format "^(([a-z0-9]+)((-[a-z0-9]+)|([-[a-z0-9]+]))-)?[A-Za-z0-9()]+[[a-z0-9]]((-[a-z0-9]+)|([-[a-z0-9]+])|([-ø]))*$".

This is correct. This lemma is actually a compound lemma; can the checker handle it?

@wangj619 It passes for "igi4(disz)[one]-gal[-fraction] "
@jayanthjaiswal Actually, [fraction] is the meaning of the whole lemma, not the last part 'gal'. And the number 4(disz)[one] is embedded into it.

@jayanthkmr
Collaborator

jayanthkmr commented Sep 5, 2018

All +1 are now fixed @epageperron @khoidt @wangj619 The segm ones are fixed now.
