-
In the case where I have an index built from a set of queries, I would like to retrieve the tokenized version of one of those queries. This use case comes up when doing BM25 eval across a matrix of known corpus_tokens. If I tokenize independently...
Is there a way to get a subset from the index that represents the query as it was represented when the index was built?
Replies: 2 comments 6 replies
-
I'm not sure if I understand your question correctly. But if you mean whether it is possible to use the vocabulary of the corpus during tokenization, unfortunately this is not yet supported, though support could be added if there's interest in that (perhaps as a "tokenizer" class). Right now, there's no way to synchronize two tokenized corpora (in your case, xEntity and yEntity). If you want, you can tokenize them together and slice the resulting ids:

```python
all_tokens = bm25s.tokenize(x_corpus + y_corpus)
xlen = len(x_corpus)
x_tokens = bm25s.tokenization.Tokenized(all_tokens.ids[:xlen], all_tokens.vocab)
y_tokens = bm25s.tokenization.Tokenized(all_tokens.ids[xlen:], all_tokens.vocab)
```

If you are instead asking whether the query token ids are correctly mapped to the index's ids (which could create a potential bug), then the answer is that the query tokens are mapped back to text tokens using the index vocab.
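To illustrate the joint-tokenize-then-slice idea end to end, here is a self-contained sketch that swaps in a toy whitespace tokenizer for bm25s.tokenize (toy_tokenize and the sample corpora are made up for this example; only the slicing pattern mirrors the snippet above):

```python
from collections import namedtuple

# Mirrors the shape of bm25s.tokenization.Tokenized:
# token ids per document plus a shared vocabulary.
Tokenized = namedtuple("Tokenized", ["ids", "vocab"])

def toy_tokenize(corpus):
    """Whitespace-split each document, mapping words to ids in one shared vocab."""
    vocab = {}
    ids = []
    for doc in corpus:
        doc_ids = []
        for word in doc.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
            doc_ids.append(vocab[word])
        ids.append(doc_ids)
    return Tokenized(ids=ids, vocab=vocab)

x_corpus = ["architecture diagram", "cloud design"]
y_corpus = ["network diagram"]

# Tokenize both corpora together so they share a single vocabulary...
all_tokens = toy_tokenize(x_corpus + y_corpus)
xlen = len(x_corpus)
# ...then slice the ids back apart, keeping the common vocab on both halves.
x_tokens = Tokenized(all_tokens.ids[:xlen], all_tokens.vocab)
y_tokens = Tokenized(all_tokens.ids[xlen:], all_tokens.vocab)

# "diagram" maps to the same id in both slices because the vocab is shared.
shared_id = all_tokens.vocab["diagram"]
assert x_tokens.ids[0][1] == shared_id and y_tokens.ids[0][1] == shared_id
```

Because both slices point at the same vocab dict, any word appearing in both corpora gets a single id, which is exactly what independent tokenization would not give you.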
-
I made an egregious error to start: in the corpus list, I forgot a comma, and so I was seeing self-inflicted incorrect results out of the tokenizer. Second, it is very possible I have not described the situation well 😄.
I found part of what I was looking for. You don't necessarily need the Tokenized class to pass to retrieve; retrieve can take the list of tokenized words, e.g. [['architectur', 'diagram']]. Here is the functionality I externalized in my code. For any corpus index, I know the positions of the document I am after.
So instead of tokenizing...
I think what I want out of the retriever (instead of externalizing this) is a method like so...
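A minimal sketch of what such a method could look like (the TokenStore class, its get_tokens method, and the corpus_tokens attribute are all hypothetical, not existing bm25s API):

```python
class TokenStore:
    """Hypothetical helper: keeps the corpus tokens exactly as they were
    indexed, so a known document's tokens can be reused as a query."""

    def __init__(self, corpus_tokens):
        # corpus_tokens: one token list per indexed document,
        # e.g. [['architectur', 'diagram'], ['cloud', 'design']]
        self.corpus_tokens = corpus_tokens

    def get_tokens(self, positions):
        """Return the stored token lists for the given document positions,
        in the [[...]] shape that retrieve accepts for queries."""
        return [self.corpus_tokens[i] for i in positions]


store = TokenStore([["architectur", "diagram"], ["cloud", "design"]])
queries = store.get_tokens([0])  # [['architectur', 'diagram']]
```

The lists returned by get_tokens could then be handed straight to retrieve, since (as noted above) it accepts lists of tokenized words rather than only a Tokenized object.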