Community: Load a PDF document as per its Table of Contents #28346
I don't fully understand what this does (it looks like the docs page you added isn't actually populated), but I think it's unnecessary to add another PDF document loader because we already have so many. Depending on what the goal of loading the TOC this way is, couldn't this be added as some kind of postprocessing on documents in a guide instead?
Description: Many PDFs, such as textbooks, come with a rich Table of Contents (also known as bookmarks). It would be helpful for a RAG system to see the entire content of a ToC entry as a single unit, rather than limiting it to a single PDF page!
This PR adds a new PDF document loader that loads the content of a PDF according to its Table of Contents.
Many PDFs have a detailed Table of Contents. For example, this book (with the pdf) has this Table of Contents structure...
PyPDFium2Loader emits one document per page, which breaks the flow of content that logically continues on the next page.
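
To make the contrast with page-based loading concrete, here is a minimal sketch of grouping page texts by ToC entry. It is not the loader added in this PR: `Section` and `group_pages_by_toc` are hypothetical names, and the input is assumed to be a list of per-page strings plus a ToC given as `(level, title, start_page)` tuples with 1-based page numbers, sorted by start page.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Section:
    """One ToC entry with the text of all pages it spans (hypothetical type)."""
    title: str
    level: int
    start_page: int  # 1-based, inclusive
    end_page: int    # 1-based, inclusive
    text: str


def group_pages_by_toc(
    pages: List[str], toc: List[Tuple[int, str, int]]
) -> List[Section]:
    """Group per-page texts into one Section per ToC entry.

    Assumes `toc` is sorted by start page and uses 1-based page numbers.
    """
    sections = []
    for i, (level, title, start) in enumerate(toc):
        if i + 1 < len(toc):
            # Stop on the page before the next entry begins; if the next
            # entry starts on the same page, both entries share that page.
            end = max(start, toc[i + 1][2] - 1)
        else:
            # The last entry runs to the end of the document.
            end = len(pages)
        text = "\n".join(pages[start - 1:end])
        sections.append(Section(title, level, start, end, text))
    return sections


# Illustrative data only: a four-page document with three ToC entries.
pages = ["Intro p1", "Intro p2", "Methods p3", "Results p4"]
toc = [(1, "Introduction", 1), (1, "Methods", 3), (1, "Results", 4)]

for section in group_pages_by_toc(pages, toc):
    print(section.title, section.start_page, section.end_page)
```

In this sketch the two introduction pages end up in a single section instead of two separate page documents. The split works at page granularity only, so text that spills onto the first page of the next entry stays with that next entry.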
This PR loads content based on ToC entries. It can