Incorporation of pre-trained word embeddings functionality #171
I made some attempts at this option. Install the branch, then:

```r
library(spacyr)
# spacy_download_langmodel("en_core_web_md")
spacy_initialize("en_core_web_md")  # or spacy_initialize("en_core_web_lg")
txt <- "To make them compact and fast, spaCy’s small models (all packages that end in sm) don’t ship with word vectors, and only include context-sensitive tensors."
out <- spacy_parse(txt, embedding = TRUE)
attr(out, "embedding")
```
Hi,

Just experimented with this. A few comments on the branch. Since this is looking up the tokens from the language model using …

We could create similar functions to weight or replace tokens with their L2-normed vector scores, similar to …
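For instance, the L2-norming step itself is a one-liner over a numeric matrix of vectors. A minimal sketch, using a random toy matrix in place of real model vectors:

```r
# toy v x d matrix standing in for real word vectors (real models use d = 300)
wv <- matrix(rnorm(12), nrow = 4, ncol = 3)
# divide each row by its Euclidean (L2) norm so every vector has unit length
wv_l2 <- wv / sqrt(rowSums(wv^2))
rowSums(wv_l2^2)  # all rows now have squared norm 1
```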
+1 for this. It would be great to access embeddings and then, for example, use the similarity() function described on that page you first linked to (https://spacy.io/usage/vectors-similarity).
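Until something like that lands in spacyr, the underlying computation for vectors is just cosine similarity, which can be written by hand. A minimal sketch in R, on two arbitrary numeric vectors:

```r
# cosine similarity: dot product divided by the product of the vector norms
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}
cosine_similarity(c(1, 2, 3), c(2, 4, 6))  # 1 for parallel vectors
```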
@kbenoit and all, I've implemented the very first version of two functions (`spacy_wordvectors_lookup()` and `spacy_wordvectors_apply()`). The following is one of the expected use cases: calculating the similarity of short texts.

```r
## devtools::install_github("quanteda/spacyr", ref = "issue-171")
library(quanteda)
library(tidyverse)
library(spacyr)
library(DBI)

spacy_initialize(model = "en_core_web_md")

# data from here:
# https://www.kaggle.com/crowdflower/twitter-airline-sentiment
db <- dbConnect(RSQLite::SQLite(), "~/Downloads/database.sqlite")

set.seed(20191024)
corpus_tw <- tbl(db, "Tweets") %>% as_tibble() %>% sample_n(1000) %>%
  distinct(text, .keep_all = TRUE) %>%
  corpus(docid_field = "tweet_id")

twitter_parsed <- spacy_parse(corpus_tw, additional_attributes = "is_stop")
wordvectors <- spacy_wordvectors_lookup(twitter_parsed)
wordvec_matrix <- spacy_wordvectors_apply(twitter_parsed, wordvectors)

# convert the matrix to a tibble for further manipulation
wordvec_tb <- wordvec_matrix %>%
  as_tibble(.name_repair = "universal") %>%
  rename_all(str_replace, "\\D+", "D") %>%
  bind_cols(twitter_parsed)

# calculate the average of the word vectors in each text
doc_vec_avg <- wordvec_tb %>%
  filter(!is_stop) %>%
  group_by(doc_id) %>%
  summarise_at(1:300, mean) %>%
  ungroup()

# convert it to a dfm for the similarity calculation (since the matrix is
# dense, other packages might be faster for this)
temp <- doc_vec_avg %>%
  select(-1) %>%
  as.matrix() %>%
  as.dfm()
rownames(temp) <- paste(doc_vec_avg$doc_id)

simil_stat <- textstat_simil(temp, method = "cosine") %>%
  as.data.frame() %>%
  sample_n(1000) %>%
  arrange(-cosine) %>%
  mutate_at(1:2, as.character)

# print the output
for (i in seq(10)) {
  cat(paste0("similarity: ", simil_stat$cosine[i], "\n",
             "doc1: ", corpus_tw[simil_stat$document1[i]], "\n",
             "doc2: ", corpus_tw[simil_stat$document2[i]], "\n\n"))
}
```
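A side note on the dense-matrix point above: the cosine similarities can also be computed directly with base R matrix algebra, skipping the dfm round-trip. A sketch, assuming the `doc_vec_avg` tibble from the example:

```r
# drop the doc_id column and L2-normalise each document vector
m <- as.matrix(doc_vec_avg[, -1])
rownames(m) <- doc_vec_avg$doc_id
m_norm <- m / sqrt(rowSums(m^2))
# cosine similarity is then just the dot product of unit vectors
cos_sim <- m_norm %*% t(m_norm)
cos_sim[1:3, 1:3]
```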
@amatsuo tested and working fine!
How about

```r
# works on a spacyr parsed object
wordvectors_get.spacyr_parsed(x, model)

# works on a named list of characters, such as from spacy_tokenize()
wordvectors_get.list(x, model)
```

to return a v x d matrix, where v is the number of types (unique tokens) and d is the number of dimensions. This is a dense matrix.

```r
# attaches a special attribute of wordvectors to the object
wordvectors_put.spacyr_parsed(x, wordvectors)
```

We don't do this for a list, since we can do that instead in … The important thing here is …
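To illustrate what a v x d return value enables, here is a hypothetical sketch (the matrix values and token names are made up for illustration) of replacing tokens with their vectors by rowname lookup:

```r
# toy v x d matrix: 4 types, 3 dimensions (real models use 300)
wv <- matrix(rnorm(12), nrow = 4, ncol = 3,
             dimnames = list(c("spacy", "ships", "word", "vectors"), NULL))

tokens <- c("spacy", "ships", "word", "vectors", "word")
token_matrix <- wv[tokens, , drop = FALSE]  # one row per token occurrence
dim(token_matrix)  # 5 x 3
```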
Is the pre-trained embedding functionality added?
No, not yet, but we are working on adding some of the predictive functions.
spaCy now has this:
https://spacy.io/usage/vectors-similarity

Maybe we want to make this functionality available in spacyr. Any feedback/suggestions from users are welcome.