The spaCy tokenizer splits hyphenated words by inserting a space before and after the hyphen. For example, "eye-opening" becomes "eye - opening". Is there a way to keep hyphenated words together, like with the quanteda tokenizers? (@JBGruber : Any idea? :))
Changing the behaviour requires changing the infix patterns, which would be non-trivial from R. But we can use a quick workaround:
library("spacyr")
txt <- "The spaCy tokenizer splits hyphenated words, like eye-opening, by inserting a space before and after the hyphen."

# replace hyphens with a symbol that is not part of the infix patterns
txt2 <- gsub("-", "§", txt, fixed = TRUE)

spacy_parse(txt2)$lemma
#>  [1] "the"         "spacy"       "tokenizer"   "split"       "hyphenated"
#>  [6] "word"        ","           "like"        "eye§opening" ","
#> [11] "by"          "insert"      "a"           "space"       "before"
#> [16] "and"         "after"       "the"         "hyphen"      "."
I use the obscure section sign (§) to replace hyphens, but you can use anything else not in the infix pattern list. After parsing, you can then just change it back:
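A minimal sketch of that back-conversion; the lemma vector below is the parser output from above, hard-coded here so the snippet runs without a spaCy installation:

```r
# lemmas as returned by spacy_parse(txt2)$lemma above (hard-coded for illustration)
lemmas <- c("the", "spacy", "tokenizer", "split", "hyphenated",
            "word", ",", "like", "eye§opening", ",",
            "by", "insert", "a", "space", "before",
            "and", "after", "the", "hyphen", ".")

# swap the placeholder back to a hyphen after parsing
lemmas <- gsub("§", "-", lemmas, fixed = TRUE)

lemmas[9]
#> [1] "eye-opening"
```

Note the `fixed = TRUE`: it makes `gsub()` treat the pattern as a literal string rather than a regular expression, which matters here since `-` is a regex metacharacter inside character classes.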