Tokenizer eats hyphen between words #81

victorbocharov · 2018-11-18T15:25:50Z

The line "June-July 2000" is tokenized into "June", "July", "2000"; the hyphen disappears.
This is probably not a bug, but it becomes a problem when you need to distinguish "June July" from "June-July" at the rule level.

kleag · 2018-12-10T15:33:39Z

If I remember correctly, the tokenizer itself should leave the hyphen untouched and produce a single token for "June-July". It is then the role of the hyphen alternatives module either to keep the original hyphenated token if it is known from the dictionary, or otherwise to split it into two tokens (here "June" and "July").
But when the token is split, I think the module currently just removes the hyphen. In fact, I think that a third token should be created for the hyphen.
@romaricb what do you think?
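
To make the proposal concrete, here is a minimal Python sketch of the behaviour described above. It is an illustration only, not the project's actual code: `split_hyphenated`, its `dictionary` argument, and the `keep_hyphen` flag are all hypothetical names, and the dictionary is assumed to be a plain set of known forms.

```python
def split_hyphenated(token, dictionary, keep_hyphen=True):
    """Hypothetical helper, not the real hyphen alternatives module.

    Keeps `token` whole if it has no hyphen or is a known dictionary form;
    otherwise splits it on hyphens. With keep_hyphen=True each hyphen is
    emitted as its own token (the proposed behaviour); with
    keep_hyphen=False the hyphen is dropped (the behaviour reported here).
    """
    if "-" not in token or token in dictionary:
        return [token]

    tokens = []
    for index, part in enumerate(token.split("-")):
        if index > 0 and keep_hyphen:
            tokens.append("-")  # the "third token" for the hyphen
        if part:  # skip empty pieces from leading/trailing/double hyphens
            tokens.append(part)
    return tokens


# Assumed dictionary: "June-July" is not a known form, so it gets split.
known_forms = {"well-known"}
line_tokens = [
    tok
    for raw in "June-July 2000".split()
    for tok in split_hyphenated(raw, known_forms)
]
print(line_tokens)  # ['June', '-', 'July', '2000']
```

With the hyphen kept as a separate token, a rule can tell "June-July" apart from "June July", which is the distinction the original report asks for.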

victorbocharov added the question label Nov 18, 2018

kleag self-assigned this Dec 10, 2018

kleag assigned romaricb Dec 10, 2018
