Can you query without a tokenization step? #26

Closed Answered by snewcomer
snewcomer asked this question in Q&A

I made an egregious error to start: I forgot a comma in the corpus list, so the incorrect results I was seeing out of the tokenizer were self-inflicted. Second, it is very possible I did not describe the situation well 😄.

I partially found what I was looking for. You don't necessarily need to pass a Tokenized object to retrieve; retrieve can also take a list of tokenized queries, e.g. [['architectur', 'diagram']].

Here is the functionality I externalized in my code. For any corpus index, I know the positions of the document I am after.

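def get_vocab_for_entry(self, idx):
    # Map the token IDs stored for corpus entry `idx` back to their token strings.
    # The outer list makes the result a batch containing a single tokenized entry.
    return [
        [self.reverse_vocab[token_id] for token_id in self.corpus_tokens[idx]]
    ]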
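
For readers who want to try the idea in isolation, here is a minimal, self-contained sketch of the same lookup. The `CorpusIndex` class and the way it builds `vocab` are hypothetical and only for illustration; the attribute names `reverse_vocab` and `corpus_tokens` are taken from the snippet above, and the real objects in the post may be structured differently.

```python
from typing import Dict, List


class CorpusIndex:
    """Hypothetical container; attribute names mirror the snippet in the post."""

    def __init__(self, tokenized_corpus: List[List[str]]) -> None:
        # Assign each distinct token an integer ID in order of first appearance.
        self.vocab: Dict[str, int] = {}
        for doc in tokenized_corpus:
            for token in doc:
                self.vocab.setdefault(token, len(self.vocab))
        # Store every document as a list of token IDs.
        self.corpus_tokens = [
            [self.vocab[token] for token in doc] for doc in tokenized_corpus
        ]
        # Inverse mapping: token ID -> token string.
        self.reverse_vocab = {
            token_id: token for token, token_id in self.vocab.items()
        }

    def get_vocab_for_entry(self, idx: int) -> List[List[str]]:
        # Recover the token strings of entry `idx`, wrapped as a batch of one.
        return [
            [self.reverse_vocab[token_id] for token_id in self.corpus_tokens[idx]]
        ]


index = CorpusIndex([["architectur", "diagram"], ["deploy", "diagram", "note"]])
query_tokens = index.get_vocab_for_entry(0)
print(query_tokens)  # [['architectur', 'diagram']]
```

The nested list has the same shape as the [['architectur', 'diagram']] example mentioned earlier, so it could presumably be handed to retrieve without tokenizing the text a second time. The sketch does not call the library itself.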

So instead of tokenizing

y_te…

Replies: 2 comments 6 replies

Answer selected by xhluca

This discussion was converted from issue #21 on July 10, 2024 14:58.