
Integration with gensim or fasttext #66

Open

petulla opened this issue Nov 26, 2019 · 1 comment
petulla commented Nov 26, 2019

First, thank you so much for this package. It's very useful.

I know you've already more or less answered this here, but I wasn't able to reproduce the example; I think it was written against an older version of fasttext.

import numpy as np
import fasttext as ft

model2 = ft.load_model('fasttext/wiki.en.bin')

class FastTextEmbeddings(object):
    def __getitem__(self, item):
        item = np.array(item, copy=True)
        # for testing, substitute a known vocabulary size for len(fastText_wv)
        item[item > len(fastText_wv)] = -1
        return fastText_wv.get_input_vector(item)

There are a few issues:

- fastText_wv does not support len(); that said, we can hard-code a value for testing by looking at the vocabulary size reported when the model loads.
- get_input_vector cannot be called:

getInputVector(): incompatible function arguments. The following argument types are supported:
    1. (self: fasttext_pybind.fasttext, arg0: fasttext_pybind.Vector, arg1: int) -> None

Invoked with: <fasttext_pybind.fasttext object at 0x2ab5724b0>, <fasttext_pybind.Vector object at 0x2a29c5370>, array(18446744073709551615, dtype=uint64)

This is with fasttext 0.9.1.
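(A possible explanation, offered tentatively: the 18446744073709551615 in the traceback is -1 stored in an unsigned 64-bit array, which overflows the int argument getInputVector() expects. Passing a valid id as a plain Python int gets past the signature error; the token below is a hypothetical example.)

word_id = model2.get_word_id('apple')  # hypothetical test token
if word_id >= 0:
    vec = model2.get_input_vector(int(word_id))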

I'm sure you have this working with gensim or fasttext; I'm wondering if you could share a code example, as you did with NumPy. I'm not sure where to start debugging. Most of the methods in gensim require supplying the token rather than the word index.
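(For anyone landing here later, a hedged alternative: if the vectors are also available in word2vec text format, gensim's KeyedVectors exposes the raw embedding matrix, so integer ids work directly; the path and class name below are illustrative.)

from gensim.models import KeyedVectors

# sketch only: index embedding rows by integer id, bypassing token-keyed lookups
wv = KeyedVectors.load_word2vec_format('fasttext/wiki.en.vec')

class GensimEmbeddings(object):
    def __getitem__(self, item):
        return wv.vectors[item]  # wv.vectors is the raw (vocab x dim) matrix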


petulla commented Nov 27, 2019

A few things were missing in that post; this should work:

import fasttext as ft

model2 = ft.load_model('fasttext/wiki.en.bin')

The array was not needed:

class FastTextEmbeddings(object):
    def __getitem__(self, item):
        # dim has to be set first (see below); out-of-range ids get a tiny vector
        if item > dim:
            return [1e-8] * 300  # 300 = vector width of wiki.en.bin
        return model2.get_input_vector(item)
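A quick smoke test of the wrapper might look like this (the token is an arbitrary in-vocabulary example):

embeddings = FastTextEmbeddings()
vec = embeddings[model2.get_word_id('apple')]  # arbitrary in-vocabulary token
print(len(vec))  # 300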

Need the fasttext ids:

from collections import Counter

import numpy as np

def buildVectorsFT(row):
    # assuming row carries the document title and raw text
    title, alltext = row
    # spacy is overkill here, but it tokenizes
    nlpAll = nlpSpacy(alltext)
    tokens = [t.text.lower() for t in nlpAll if t.is_alpha and not t.is_stop]

    words = Counter(tokens)
    orths = {t: model2.get_word_id(t) for t in tokens}

    sorted_words = sorted(words)
    documents[title] = (title, [orths[t] for t in sorted_words],
                        np.array([words[t] for t in sorted_words], dtype=np.float32))
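A hedged driver for the function above, assuming nlpSpacy is a loaded spaCy pipeline and each row is a (title, text) pair; the pipeline name and sample rows are illustrative:

import spacy

nlpSpacy = spacy.load('en_core_web_sm')  # any English pipeline works here
documents = {}
for row in [('doc1', 'Some example text.'), ('doc2', 'More example text.')]:
    buildVectorsFT(row)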

Get dim (really the vocabulary size, i.e. the number of rows in the output matrix):

t = model2.get_output_matrix()
dim = len(t)  # number of vocabulary rows, not the vector width
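As a side note: get_dimension() returns the width of each vector (the 300 used in the fallback above), while len(get_output_matrix()) counts vocabulary rows:

vocab_size = len(model2.get_output_matrix())  # rows: one per vocabulary word
vector_dim = model2.get_dimension()           # columns: 300 for wiki.en.bin
print(vocab_size, vector_dim)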
