forked from UniversalDependencies/UD_Portuguese-Bosque
-
Notifications
You must be signed in to change notification settings - Fork 0
/
HISTORY.txt
140 lines (110 loc) · 5.81 KB
/
HISTORY.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
This is an improved version of Bosque 7.5 UD, originally available
from Linguateca at the following URLs:
http://www.linguateca.pt/Floresta/ficheiros/bosque_CP.udep.conll.gz
http://www.linguateca.pt/Floresta/ficheiros/bosque_CF.udep.conll.gz
Issues found in the original Bosque 7.5 UD have been opened as issues
in this project, with the prefix [bosque-ud]; they have been either
fixed at the source (PALAVRAS and/or UD conversion), workarounds have
been provided by the script "fix-errors.sh".
Manually editing the file is discouraged and should only be made as a
last resort.
For reference:
1. bosque_CP.udep.conll.gz Bick's version of European Portuguese part
of Bosque 7.5 UD annotated, available at
http://www.linguateca.pt/Floresta/ficheiros/bosque_CP.udep.conll.gz
2. bosque_CF.udep.conll.gz ick's version of Brazilian Portuguese part
of Bosque 7.5 UD annotated, available at
http://www.linguateca.pt/Floresta/ficheiros/bosque_CF.udep.conll.gz
3. Dan Zeman's version of Bosque CoNLL (7.3), available at
https://github.com/UniversalDependencies/UD_Portuguese
4. Linguateca Version of Bosque CoNLL (7.3),
http://www.linguateca.pt/floresta/CoNLL-X/
The CG-converted UD Portuguese treebank is originally based on an
improved and enriched version of the 7.4 dependency version of the
revised Bosque part of the Floresta Sintá(c)tica treebank
(cf. Linguateca.pt). 7.4 was in 2006-2008 aligned with a new live run
with the PALAVRAS parser in order to propagate morphological features
from unambiguous to ambiguous words, and to add what the Floresta team
called "searchables", i.e. tags for features distributed across
several tokens, such as NP definiteness and complex tenses. The public
treebank only used this for the constituent version, which was the one
actively revised by the Floresta team until 2008 (Linguateca.pt
version 8.0).
Since 2008 Eckhard Bick has maintained an experimental version of the
dependency Bosque for semantic and other research, and made further
revisions to it, which were not aligned with either the constituent
version or the published 7.4 dependency version. In the beginning of
2016, Eckhard Bick wrote UD conversion rules for Constraint Grammar
input, and applied these to the updated version of the dependency
Bosque (Linguateca.pt version 7.5 of March 2016).
In a team effort in October 2016, Alexandre Rademaker, Cláudia
Freitas, Fabricio Chalub, Valeria de Paiva and Livy Maria Real Coelho,
aiming at full compatibility with ConLL UD specifications,
consistency-checked and discussed the 7.5 UD Bosque, leading to a
further round of manual treebank corrections and conversion rule
changes by Eckhard Bick. The conversion grammar ultimately used
contained some 530 rules. Of these 70 were simple feature mapping
rules, and 130 were local MWE splitting rules, assigning internal
structure, POS and features to MWE's from Bosque. The remainder of the
rules handle UD-specific dependency and function label changes in a
context-dependent fashion, the main issues being raising of copula
dependents to subject complements, inversion of prepositional
dependency and a change from syntactic to semantic verb chain
dependency.
The new UD treebank retains the additional tags for NP definiteness
and complex tenses, as well as the original syntactic functions tags
and secondary morphological tags. This way, the treebank retains its
original linguistic focus in addition to the machine learning uses
targeted by the ConLL UD format. For instance, conjuncts and roots
still feature a direct function tag (e.g. a verb complement role for a
conjunct or "question" for a root. In cases, where UD does not
distinguish between form and function, e.g. n/np adverbial modifiers,
where UD "duplicates" noun POS as 'nmod' function, the Bosque function
tag for free adverbial, adject or adverbial object is retained in
field 4 (@tags). Finally, some lost valency relations may be recovered
from an underspecified UD tag, e.g. the core clause arguments
"prepositional object" ('gostar de ARG') and valency-bound adverbial
('morar em ARG').
CONTRIBUTORS
The conversion was implemented by Eckhard Bick and revised by:
- Claudia Freitas
- Eckhard Bick
- Fabricio Chalub
- Alexandre Rademaker
- Livy Real
- Valeria Paiva
CHANGELOG
2016-10-31 v1.4
* Initial UD release.
LICENSE
See file LICENSE.txt
REFERENCES
- https://github.com/own-pt/bosque-UD (development of this corpus)
- http://www.linguateca.pt/Floresta/ (Floresta Treebank repository)
- http://visl.sdu.dk/tagset_cg_general.pdf (non-UD tags in field 4)
- http://visl.sdu.dk/constraint_grammar.html (cg3 compiler used for
the conversion grammar)
- http://visl.sdu.dk/visl/pt/parsing/automatic/ (PALAVRAS parser used
to create input trees for the manually revised Bosque treebank)
- Afonso, Susana, Eckhard Bick, Renato Haber & Diana Santos (2002),
Floresta sintá(c)tica: a treebank for Portuguese
<http://visl.sdu.dk/%7Eeckhard/pdf/AfonsoetalLREC2002.ps.pdf>, In
/Proceedings of LREC'2002, Las Palmas/. pp. 1698-1703, Paris: ELRA
- Freitas, Cláudia & Rocha, Paulo & Bick, Eckhard (2008), "Floresta
Sintá(c)tica: Bigger, Thicker and Easier", in: António Teixeira et
al. (eds.) /Computational Processing of the Portuguese Language/
(Proceedings of PROPOR 2008, Aveiro, Sept. 8th-10th, 2008),
pp.216-219. Springer
- Bick, Eckhard (2014). PALAVRAS, a Constraint Grammar-based Parsing
System for Portuguese. In: Tony Berber Sardinha & Thelma de Lurdes
São Bento Ferreira (eds.), /Working with Portuguese Corpora/, pp
279-302. London/New York:Bloomsburry Academic. ISBN
978-1-4411-9050-5
--- Machine readable metadata ---
Documentation status: partial
Data source: semi-automatic
Data available since: UD v1.4
License: CC BY-SA 4.0
Genre: news blog
Contributors: Freitas, Claudia; Bick, Eckhard; Chalub, Fabricio; Rademaker, Alexandre; Real, Livy; Paiva, Valeria
Contact: arademaker@gmail.com