
Integration with gensim or fasttext #66

Open

petulla opened this issue Nov 26, 2019 · 1 comment
petulla commented Nov 26, 2019

First, thank you so much for this package. It's very useful.

I know you've already more or less answered this here, but I wasn't able to reproduce the example; I think it was written against an older version of fasttext.

import numpy as np
import fasttext as ft

model2 = ft.load_model('fasttext/wiki.en.bin')

class FastTextEmbeddings(object):
    def __getitem__(self, item):
        item = np.array(item, copy=True)
        # for testing, substitute a known vocabulary size for len(fastText_wv)
        item[item > len(fastText_wv)] = -1
        return fastText_wv.get_input_vector(item)

There are a few issues:

- fastText_wv does not support len(); that said, we can hard-code a value for testing by looking at the vocabulary size reported when the model loads.
- get_input_vector cannot be called:

getInputVector(): incompatible function arguments. The following argument types are supported:
    1. (self: fasttext_pybind.fasttext, arg0: fasttext_pybind.Vector, arg1: int) -> None

Invoked with: <fasttext_pybind.fasttext object at 0x2ab5724b0>, <fasttext_pybind.Vector object at 0x2a29c5370>, array(18446744073709551615, dtype=uint64)

This is with fasttext 0.9.1.
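(A possible explanation, offered tentatively: the 18446744073709551615 in the traceback is -1 stored in an unsigned 64-bit array, which overflows the int argument getInputVector() expects. Passing a valid id as a plain Python int gets past the signature error; the token below is a hypothetical example.)

word_id = model2.get_word_id('apple')  # hypothetical test token
if word_id >= 0:
    vec = model2.get_input_vector(int(word_id))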

I'm sure you have this working with gensim or fasttext; I'm wondering if you could share a code example, as you did with NumPy. I'm not sure where to start debugging. Most of the methods in gensim require supplying the token rather than the word index.
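(For anyone landing here later, a hedged alternative: if the vectors are also available in word2vec text format, gensim's KeyedVectors exposes the raw embedding matrix, so integer ids work directly; the path and class name below are illustrative.)

from gensim.models import KeyedVectors

# sketch only: index embedding rows by integer id, bypassing token-keyed lookups
wv = KeyedVectors.load_word2vec_format('fasttext/wiki.en.vec')

class GensimEmbeddings(object):
    def __getitem__(self, item):
        return wv.vectors[item]  # wv.vectors is the raw (vocab x dim) matrix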


petulla commented Nov 27, 2019

A few things were missing in that post; this should work:

import fasttext as ft

model2 = ft.load_model('fasttext/wiki.en.bin')

The array was not needed:

class FastTextEmbeddings(object):
    def __getitem__(self, item):
        # dim has to be set first (see below); out-of-range ids get a tiny vector
        if item > dim:
            return [1e-8] * 300  # 300 = vector width of wiki.en.bin
        return model2.get_input_vector(item)
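A quick smoke test of the wrapper might look like this (the token is an arbitrary in-vocabulary example):

embeddings = FastTextEmbeddings()
vec = embeddings[model2.get_word_id('apple')]  # arbitrary in-vocabulary token
print(len(vec))  # 300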

Need the fasttext ids:

from collections import Counter

import numpy as np

def buildVectorsFT(row):
    # assuming row carries the document title and raw text
    title, alltext = row
    # spacy is overkill here, but it tokenizes
    nlpAll = nlpSpacy(alltext)
    tokens = [t.text.lower() for t in nlpAll if t.is_alpha and not t.is_stop]

    words = Counter(tokens)
    orths = {t: model2.get_word_id(t) for t in tokens}

    sorted_words = sorted(words)
    documents[title] = (title, [orths[t] for t in sorted_words],
                        np.array([words[t] for t in sorted_words], dtype=np.float32))
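A hedged driver for the function above, assuming nlpSpacy is a loaded spaCy pipeline and each row is a (title, text) pair; the pipeline name and sample rows are illustrative:

import spacy

nlpSpacy = spacy.load('en_core_web_sm')  # any English pipeline works here
documents = {}
for row in [('doc1', 'Some example text.'), ('doc2', 'More example text.')]:
    buildVectorsFT(row)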

Get dim (really the vocabulary size, i.e. the number of rows in the output matrix):

t = model2.get_output_matrix()
dim = len(t)  # number of vocabulary rows, not the vector width
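As a side note: get_dimension() returns the width of each vector (the 300 used in the fallback above), while len(get_output_matrix()) counts vocabulary rows:

vocab_size = len(model2.get_output_matrix())  # rows: one per vocabulary word
vector_dim = model2.get_dimension()           # columns: 300 for wiki.en.bin
print(vocab_size, vector_dim)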
