how to make more embeddings? #12
-
Hi! Thanks a lot for this project, super interesting. How can I generate more embeddings? I assume that embeddings are based on relatively big text chunks, and my question is: how can I improve the quality of the search by making more embeddings with a smaller size?
-
Hi @LexiestLeszek, if I understand your question correctly, I think you're referring to customizing the size of the text chunks. If so, that's a great question - the library attempts to pick some defaults for this, but they are fully configurable with the `TextSplitterProtocol`. Here's how you can do it inside the `PDFExample` with my preferred method - the `RecursiveTokenSplitter`:

```diff
  let splitter = RecursiveTokenSplitter(withTokenizer: BertTokenizer())
- let (splitText, _) = splitter.split(text: documentText)
+ let (splitText, _) = splitter.split(text: documentText, chunkSize: 100)
  chunks = splitText
```

This will set the splitter to try to make chunks of up to 100 tokens without exceeding that limit. The default is 510 tokens for this splitter, so this should reduce the chunk size a good amount and allow you to get more refined results - just keep in mind that 510 is a hard limit for the BERT models, because that is their maximum token window size.

Regarding improving search quality, there are also some methods that start with large chunks and then re-rank them with smaller chunks, but that depends on the use case.

Does that help?
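For illustration, here is a minimal sketch of the large-chunk-then-re-rank idea mentioned above. It reuses `RecursiveTokenSplitter` and the `chunkSize` parameter from the answer, but `rerankSearch` and `cosineSimilarity` are hypothetical helpers, not part of the library's API, and the embedding function is passed in as a closure since the thread doesn't specify a model:

```swift
// Cosine similarity between two equal-length embedding vectors.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    let dot = zip(a, b).map { $0 * $1 }.reduce(0, +)
    let normA = a.map { $0 * $0 }.reduce(0, +).squareRoot()
    let normB = b.map { $0 * $0 }.reduce(0, +).squareRoot()
    return dot / (normA * normB + 1e-8)
}

// Hypothetical two-stage search (not a library function): score large
// chunks for recall, then re-rank the top candidates by their
// best-matching small sub-chunk for precision. `embed` is whatever
// function produces an embedding vector for a string.
func rerankSearch(query: String,
                  largeChunks: [String],
                  topK: Int = 3,
                  embed: (String) -> [Float]) -> [String] {
    let queryVector = embed(query)

    // Stage 1: keep the topK large chunks by similarity to the query.
    let candidates = largeChunks
        .map { (chunk: $0, score: cosineSimilarity(queryVector, embed($0))) }
        .sorted { $0.score > $1.score }
        .prefix(topK)

    // Stage 2: split each candidate into ~100-token pieces, as in the
    // answer above, and re-score by the best sub-chunk match.
    let splitter = RecursiveTokenSplitter(withTokenizer: BertTokenizer())
    return candidates
        .map { candidate -> (chunk: String, score: Float) in
            let (smallChunks, _) = splitter.split(text: candidate.chunk,
                                                  chunkSize: 100)
            let best = smallChunks
                .map { cosineSimilarity(queryVector, embed($0)) }
                .max() ?? 0
            return (chunk: candidate.chunk, score: best)
        }
        .sorted { $0.score > $1.score }
        .map { $0.chunk }
}
```

The idea behind the split is that stage 1 keeps recall high with broad context, while stage 2 sharpens the ranking without having to embed the whole corpus at the finer granularity up front.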
-
It works! Thanks!