Community: Load a PDF document as per its Table of Contents #28346

vineetm · 2024-11-26T06:46:04Z

Description: Many PDFs, such as textbooks, come with a rich Table of Contents (also known as bookmarks). It would be helpful for a RAG system to see the entire content of a ToC entry together, rather than being limited to a single PDF page.

This PR adds a new PDF document loader that loads the content of a PDF according to its Table of Contents.

Many PDFs have a detailed Table of Contents. For example, this book with the PDF has this Table of Contents structure...

PyPDFium2Loader loads one document per page, which breaks the flow of content that logically continues on the next page.

This PR loads content based on ToC entries. The loader:

Extracts logical chunks within a single page
Extracts logical chunks that span multiple pages
Is not limited by page boundaries
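
To make the idea concrete, here is a minimal, standard-library-only sketch of ToC-based chunking. It is not the code in this PR: `TocEntry` and `split_by_toc` are hypothetical names, the page texts and ToC entries are assumed to be already extracted, and section starts are located by searching for the entry title in the page text, which is a simplification.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TocEntry:
    """Hypothetical ToC record: a title and the zero-based page it starts on."""
    title: str
    page: int


def split_by_toc(pages: List[str], toc: List[TocEntry]) -> Dict[str, str]:
    """Group page text into one chunk per ToC entry.

    Assumes `toc` is in document order and titles are unique.
    """
    # Locate each entry as (page index, character offset within that page).
    # If the title is not found in the page text, fall back to the page start.
    starts = []
    for entry in toc:
        offset = pages[entry.page].find(entry.title)
        starts.append((entry.page, max(offset, 0)))
    # Sentinel marking the end of the document, so the last entry has an end.
    starts.append((len(pages) - 1, len(pages[-1])))

    chunks = {}
    for entry, (p0, o0), (p1, o1) in zip(toc, starts, starts[1:]):
        if p0 == p1:
            # The section starts and ends on the same page.
            text = pages[p0][o0:o1]
        else:
            # The section spans pages: tail of the first page, any whole
            # pages in between, and the head of the last page.
            parts = [pages[p0][o0:]] + pages[p0 + 1:p1] + [pages[p1][:o1]]
            text = "\n".join(parts)
        chunks[entry.title] = text.strip()
    return chunks


pages = [
    "1 Intro\nHello world.\n2 Methods\nWe start",
    "and continue here.\n3 Results\nDone.",
]
toc = [TocEntry("1 Intro", 0), TocEntry("2 Methods", 0), TocEntry("3 Results", 1)]
chunks = split_by_toc(pages, toc)
```

In this toy input, "1 Intro" is a chunk inside a single page, while "2 Methods" begins on the first page and continues onto the second, so its chunk joins text from both pages instead of being cut at the page break.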

efriis · 2024-12-09T22:33:35Z

I don't fully understand what this does (it looks like the docs page you added isn't actually populated), but I think it's unnecessary to add another PDF document loader because we have so many. Depending on what the goal of loading the ToC this way is, couldn't this be added as some kind of postprocessing on documents in a guide instead?

Vineet Kumar added 2 commits November 26, 2024 11:53

Load PDF documents by using Table of Contents based demarcation

1c43be2

sample notebook

ed0d5f6

dosubot bot added labels Nov 26, 2024: size:L (this PR changes 100-499 lines, ignoring generated files), community (related to langchain-community), Ɑ: doc loader (related to document loader module, not documentation)

vercel bot deployed to Preview November 26, 2024 06:57 View deployment

Vineet Kumar added 3 commits November 26, 2024 12:42

update integrations notebook

ffb018c

update integrations notebook

5b605fb

update integrations notebook

f29bea3

vercel bot deployed to Preview November 26, 2024 07:29 View deployment

Vineet Kumar added 2 commits November 26, 2024 13:18

update integrations notebook

b720445

unit tests

04e5870

vercel bot deployed to Preview November 26, 2024 08:05 View deployment

fix metadata

0c0022c

vercel bot deployed to Preview November 26, 2024 15:32 View deployment
