Incorporation of pre-trained word embeddings functionality #171
I made some attempts at this option. Install the branch, then:

```r
library(spacyr)
# spacy_download_langmodel("en_core_web_md")
spacy_initialize("en_core_web_md")  # or spacy_initialize("en_core_web_lg")
txt <- "To make them compact and fast, spaCy’s small models (all packages that end in sm) don’t ship with word vectors, and only include context-sensitive tensors."
out <- spacy_parse(txt, embedding = TRUE)
attr(out, "embedding")
```
Hi,

Just experimented with this. A few comments on the branch. Since this is looking up the tokens from the language model using …

We could create similar functions to weight or replace tokens with their L2-normed vector scores, similar to …
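For instance, the L2-norming step itself is a one-liner over a numeric matrix of vectors. A minimal sketch, using a random toy matrix in place of real model vectors:

```r
# toy v x d matrix standing in for real word vectors (real models use d = 300)
wv <- matrix(rnorm(12), nrow = 4, ncol = 3)
# divide each row by its Euclidean (L2) norm so every vector has unit length
wv_l2 <- wv / sqrt(rowSums(wv^2))
rowSums(wv_l2^2)  # all rows now have squared norm 1
```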
+1 for this. It would be great to access embeddings and then, for example, use the similarity() function described on that page you first linked to (https://spacy.io/usage/vectors-similarity).
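Until something like that lands in spacyr, the underlying computation for vectors is just cosine similarity, which can be written by hand. A minimal sketch in R, on two arbitrary numeric vectors:

```r
# cosine similarity: dot product divided by the product of the vector norms
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}
cosine_similarity(c(1, 2, 3), c(2, 4, 6))  # 1 for parallel vectors
```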
@kbenoit and all, I've implemented the very first version of two functions (`spacy_wordvectors_lookup()` and `spacy_wordvectors_apply()`). The following is one of the expected use cases: calculating the similarity of short texts.

```r
## devtools::install_github("quanteda/spacyr", ref = "issue-171")
library(quanteda)
library(tidyverse)
library(spacyr)
library(DBI)

spacy_initialize(model = "en_core_web_md")

# data from here:
# https://www.kaggle.com/crowdflower/twitter-airline-sentiment
db <- dbConnect(RSQLite::SQLite(), "~/Downloads/database.sqlite")

set.seed(20191024)
corpus_tw <- tbl(db, "Tweets") %>% as_tibble() %>% sample_n(1000) %>%
  distinct(text, .keep_all = TRUE) %>%
  corpus(docid_field = "tweet_id")

twitter_parsed <- spacy_parse(corpus_tw, additional_attributes = "is_stop")
wordvectors <- spacy_wordvectors_lookup(twitter_parsed)
wordvec_matrix <- spacy_wordvectors_apply(twitter_parsed, wordvectors)

# convert the matrix to a tibble for further manipulation
wordvec_tb <- wordvec_matrix %>%
  as_tibble(.name_repair = "universal") %>%
  rename_all(str_replace, "\\D+", "D") %>%
  bind_cols(twitter_parsed)

# calculate the average of the word vectors in each text
doc_vec_avg <- wordvec_tb %>%
  filter(!is_stop) %>%
  group_by(doc_id) %>%
  summarise_at(1:300, mean) %>%
  ungroup()

# convert it to a dfm for the similarity calculation (since the matrix is
# dense, other packages might be faster for this)
temp <- doc_vec_avg %>%
  select(-1) %>%
  as.matrix() %>%
  as.dfm()
rownames(temp) <- paste(doc_vec_avg$doc_id)

simil_stat <- textstat_simil(temp, method = "cosine") %>%
  as.data.frame() %>%
  sample_n(1000) %>%
  arrange(-cosine) %>%
  mutate_at(1:2, as.character)

# print the output
for (i in seq(10)) {
  cat(paste0("similarity: ", simil_stat$cosine[i], "\n",
             "doc1: ", corpus_tw[simil_stat$document1[i]], "\n",
             "doc2: ", corpus_tw[simil_stat$document2[i]], "\n\n"))
}
```
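A side note on the dense-matrix point above: the cosine similarities can also be computed directly with base R matrix algebra, skipping the dfm round-trip. A sketch, assuming the `doc_vec_avg` tibble from the example:

```r
# drop the doc_id column and L2-normalise each document vector
m <- as.matrix(doc_vec_avg[, -1])
rownames(m) <- doc_vec_avg$doc_id
m_norm <- m / sqrt(rowSums(m^2))
# cosine similarity is then just the dot product of unit vectors
cos_sim <- m_norm %*% t(m_norm)
cos_sim[1:3, 1:3]
```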
@amatsuo tested and working fine!
How about

```r
# works on a spacyr parsed object
wordvectors_get.spacyr_parsed(x, model)

# works on a named list of characters, such as from spacy_tokenize()
wordvectors_get.list(x, model)
```

to return a v x d matrix, where v is the number of types (unique tokens) and d is the number of dimensions. This is a dense matrix.

```r
# attaches a special attribute of wordvectors to the object
wordvectors_put.spacyr_parsed(x, wordvectors)
```

We don't do this for a list, since we can do that instead in … The important thing here is …
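To illustrate what a v x d return value enables, here is a hypothetical sketch (the matrix values and token names are made up for illustration) of replacing tokens with their vectors by rowname lookup:

```r
# toy v x d matrix: 4 types, 3 dimensions (real models use 300)
wv <- matrix(rnorm(12), nrow = 4, ncol = 3,
             dimnames = list(c("spacy", "ships", "word", "vectors"), NULL))

tokens <- c("spacy", "ships", "word", "vectors", "word")
token_matrix <- wv[tokens, , drop = FALSE]  # one row per token occurrence
dim(token_matrix)  # 5 x 3
```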
Is the pre-trained embedding functionality added?
No, not yet, but we are working on adding some of the predictive functions.
spaCy now has this:
https://spacy.io/usage/vectors-similarity

Maybe we want to make this functionality available in spacyr. Any feedback/suggestions from users are welcome.