The spaCy tokenizer splits hyphenated words by inserting a space before and after the hyphen. For example, "eye-opening" becomes "eye - opening". Is there a way to keep hyphenated words together, like with the quanteda tokenizers? (@JBGruber : Any idea? :))
Changing the behaviour requires changing the infix patterns, which would be non-trivial from R. But we can use a quick workaround:
library("spacyr")
txt <- "The spaCy tokenizer splits hyphenated words, like eye-opening, by inserting a space before and after the hyphen."

# replace hyphens with a symbol that is not part of the infix patterns
txt2 <- gsub("-", "§", txt, fixed = TRUE)

spacy_parse(txt2)$lemma
#>  [1] "the"         "spacy"       "tokenizer"   "split"       "hyphenated"
#>  [6] "word"        ","           "like"        "eye§opening" ","
#> [11] "by"          "insert"      "a"           "space"       "before"
#> [16] "and"         "after"       "the"         "hyphen"      "."
I use the obscure section sign (§) to replace hyphens, but you can use anything else not in the infix pattern list. After parsing, you can then just change it back:
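A minimal sketch of that back-conversion; the lemma vector below is the parser output from above, hard-coded here so the snippet runs without a spaCy installation:

```r
# lemmas as returned by spacy_parse(txt2)$lemma above (hard-coded for illustration)
lemmas <- c("the", "spacy", "tokenizer", "split", "hyphenated",
            "word", ",", "like", "eye§opening", ",",
            "by", "insert", "a", "space", "before",
            "and", "after", "the", "hyphen", ".")

# swap the placeholder back to a hyphen after parsing
lemmas <- gsub("§", "-", lemmas, fixed = TRUE)

lemmas[9]
#> [1] "eye-opening"
```

Note the `fixed = TRUE`: it makes `gsub()` treat the pattern as a literal string rather than a regular expression, which matters here since `-` is a regex metacharacter inside character classes.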