-
In the case where I have an index built from a set of queries, I would like to retrieve the tokenized version of one of those queries. This use case comes up when doing BM25 eval across a matrix of known corpus_tokens. If I tokenize independently...
Is there a way to get a subset from the index that represents the query as it was represented when the index was built?
Replies: 2 comments 6 replies
-
I'm not sure if I understand your question correctly. But if you mean whether it is possible to use the vocabulary of the corpus during tokenization, unfortunately this is not yet supported, though support could be added if there's interest in that (perhaps as a "tokenizer" class). Right now, there's no way to synchronize two tokenized corpora (in your case, xEntity and yEntity). If you want, you can tokenize them together and slice the resulting ids:

```python
all_tokens = bm25s.tokenize(x_corpus + y_corpus)
xlen = len(x_corpus)
x_tokens = bm25s.tokenization.Tokenized(all_tokens.ids[:xlen], all_tokens.vocab)
y_tokens = bm25s.tokenization.Tokenized(all_tokens.ids[xlen:], all_tokens.vocab)
```

If you are instead asking whether the query token ids are correctly mapped to the index's ids (which could create a potential bug), then the answer is that the query tokens are mapped back to text tokens using the index vocab.
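To illustrate the joint-tokenize-then-slice idea end to end, here is a self-contained sketch that swaps in a toy whitespace tokenizer for bm25s.tokenize (toy_tokenize and the sample corpora are made up for this example; only the slicing pattern mirrors the snippet above):

```python
from collections import namedtuple

# Mirrors the shape of bm25s.tokenization.Tokenized:
# token ids per document plus a shared vocabulary.
Tokenized = namedtuple("Tokenized", ["ids", "vocab"])

def toy_tokenize(corpus):
    """Whitespace-split each document, mapping words to ids in one shared vocab."""
    vocab = {}
    ids = []
    for doc in corpus:
        doc_ids = []
        for word in doc.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
            doc_ids.append(vocab[word])
        ids.append(doc_ids)
    return Tokenized(ids=ids, vocab=vocab)

x_corpus = ["architecture diagram", "cloud design"]
y_corpus = ["network diagram"]

# Tokenize both corpora together so they share a single vocabulary...
all_tokens = toy_tokenize(x_corpus + y_corpus)
xlen = len(x_corpus)
# ...then slice the ids back apart, keeping the common vocab on both halves.
x_tokens = Tokenized(all_tokens.ids[:xlen], all_tokens.vocab)
y_tokens = Tokenized(all_tokens.ids[xlen:], all_tokens.vocab)

# "diagram" maps to the same id in both slices because the vocab is shared.
shared_id = all_tokens.vocab["diagram"]
assert x_tokens.ids[0][1] == shared_id and y_tokens.ids[0][1] == shared_id
```

Because both slices point at the same vocab dict, any word appearing in both corpora gets a single id, which is exactly what independent tokenization would not give you.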
-
I made an egregious error to start: in the corpus list, I forgot a comma, and so I was seeing self-inflicted incorrect results out of the tokenizer. Second, it is very possible I have not described the situation well 😄.
I found part of what I was looking for. You don't necessarily need the Tokenized class to pass to retrieve; retrieve can take the list of tokenized words, e.g. [['architectur', 'diagram']]. Here is the functionality I externalized in my code. For any corpus index, I know the positions of the document I am after.
So instead of tokenizing...
I think what I want out of the retriever (instead of externalizing this) is a method like so...
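A minimal sketch of what such a method could look like (the TokenStore class, its get_tokens method, and the corpus_tokens attribute are all hypothetical, not existing bm25s API):

```python
class TokenStore:
    """Hypothetical helper: keeps the corpus tokens exactly as they were
    indexed, so a known document's tokens can be reused as a query."""

    def __init__(self, corpus_tokens):
        # corpus_tokens: one token list per indexed document,
        # e.g. [['architectur', 'diagram'], ['cloud', 'design']]
        self.corpus_tokens = corpus_tokens

    def get_tokens(self, positions):
        """Return the stored token lists for the given document positions,
        in the [[...]] shape that retrieve accepts for queries."""
        return [self.corpus_tokens[i] for i in positions]


store = TokenStore([["architectur", "diagram"], ["cloud", "design"]])
queries = store.get_tokens([0])  # [['architectur', 'diagram']]
```

The lists returned by get_tokens could then be handed straight to retrieve, since (as noted above) it accepts lists of tokenized words rather than only a Tokenized object.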