how to make more embeddings? #12
-
Hi! Thanks a lot for this project, super interesting. How can I generate more embeddings? I assume that embeddings are based on relatively big text chunks, and my question is: how can I improve the quality of the search by making more embeddings with a smaller size?
-
Hi @LexiestLeszek, if I understand your question correctly, I think you're referring to customizing the size of the text chunks. If so, that's a great question - the library attempts to pick some defaults for this, but they are fully configurable with the `TextSplitterProtocol`. Here's how you can do it inside the `PDFExample` with my preferred method - the `RecursiveTokenSplitter`:

```diff
  let splitter = RecursiveTokenSplitter(withTokenizer: BertTokenizer())
- let (splitText, _) = splitter.split(text: documentText)
+ let (splitText, _) = splitter.split(text: documentText, chunkSize: 100)
  chunks = splitText
```

This will set the splitter to try to make chunks of up to 100 tokens without exceeding that limit. The default is 510 tokens for this splitter, so this should reduce the chunk size a good amount and allow you to get more refined results - just keep in mind that 510 is a hard limit for the BERT models, because that is their maximum token window size.

Regarding improving search quality, there are also some methods that start with large chunks and then re-rank them with smaller chunks, but that depends on the use case.

Does that help?
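For illustration, here is a minimal sketch of the large-chunk-then-re-rank idea mentioned above. It reuses `RecursiveTokenSplitter` and the `chunkSize` parameter from the answer, but `rerankSearch` and `cosineSimilarity` are hypothetical helpers, not part of the library's API, and the embedding function is passed in as a closure since the thread doesn't specify a model:

```swift
// Cosine similarity between two equal-length embedding vectors.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    let dot = zip(a, b).map { $0 * $1 }.reduce(0, +)
    let normA = a.map { $0 * $0 }.reduce(0, +).squareRoot()
    let normB = b.map { $0 * $0 }.reduce(0, +).squareRoot()
    return dot / (normA * normB + 1e-8)
}

// Hypothetical two-stage search (not a library function): score large
// chunks for recall, then re-rank the top candidates by their
// best-matching small sub-chunk for precision. `embed` is whatever
// function produces an embedding vector for a string.
func rerankSearch(query: String,
                  largeChunks: [String],
                  topK: Int = 3,
                  embed: (String) -> [Float]) -> [String] {
    let queryVector = embed(query)

    // Stage 1: keep the topK large chunks by similarity to the query.
    let candidates = largeChunks
        .map { (chunk: $0, score: cosineSimilarity(queryVector, embed($0))) }
        .sorted { $0.score > $1.score }
        .prefix(topK)

    // Stage 2: split each candidate into ~100-token pieces, as in the
    // answer above, and re-score by the best sub-chunk match.
    let splitter = RecursiveTokenSplitter(withTokenizer: BertTokenizer())
    return candidates
        .map { candidate -> (chunk: String, score: Float) in
            let (smallChunks, _) = splitter.split(text: candidate.chunk,
                                                  chunkSize: 100)
            let best = smallChunks
                .map { cosineSimilarity(queryVector, embed($0)) }
                .max() ?? 0
            return (chunk: candidate.chunk, score: best)
        }
        .sorted { $0.score > $1.score }
        .map { $0.chunk }
}
```

The idea behind the split is that stage 1 keeps recall high with broad context, while stage 2 sharpens the ranking without having to embed the whole corpus at the finer granularity up front.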
-
It works! Thanks!