Refactoring all PDF loader and parser #28652

pprados · 2024-12-10T15:23:41Z

WIP

efriis · 2024-12-12T00:46:38Z

Hey @pprados! I don't think this work is going to result in a PR that is reviewable if it's only partially done and already adding 7000 lines.

What is your goal in this work?

pprados · 2024-12-13T12:37:15Z

@efriis

I'm well aware of that. That's why a meeting with LangChain and @eyurtsev is being organized (via AXA France), probably next week, to see how best to proceed.

We're sorry that it may take you several hours to review. The changes are substantial and cannot be published one at a time, because everything is linked. It would be difficult to cut the code into 12 successive PRs and end up with the same result, and doing so would take months. All this work is validated by two matrix tests that ensure the consistency of all the modifications.

We understand that it's important to ensure that the changes don't have a significant impact on existing code. To qualify all the code, we worked in a separate, parallel project that uses the langchain-common structure. This lets us run PDF reads before and after the modifications and compare the results of the historical implementation with the new one. You'll find all the files here. The only difference is the name used to import the classes.
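As a rough illustration of the before/after comparison described above, here is a minimal sketch. It is not the code from the parallel project: `compare_loaders`, the `Loader` type, and the 0.9 threshold are hypothetical names and values chosen for this example, and each loader is assumed to be a plain callable that returns one string per page.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable, List, Tuple

# Hypothetical type: a "loader" is any callable mapping a file path
# to a list of page texts. Real loaders would need a small adapter.
Loader = Callable[[str], List[str]]


def compare_loaders(
    paths: Iterable[str],
    old_loader: Loader,
    new_loader: Loader,
    threshold: float = 0.9,  # assumed similarity cutoff, not from the PR
) -> List[Tuple[str, int, float]]:
    """Return (path, page_index, ratio) for pages that differ too much.

    A page index of -1 means the two loaders disagree on the page count.
    """
    regressions: List[Tuple[str, int, float]] = []
    for path in paths:
        old_pages = old_loader(path)
        new_pages = new_loader(path)
        if len(old_pages) != len(new_pages):
            regressions.append((path, -1, 0.0))
            continue
        for index, (old, new) in enumerate(zip(old_pages, new_pages)):
            # Character-level similarity between the two extractions.
            ratio = SequenceMatcher(None, old, new).ratio()
            if ratio < threshold:
                regressions.append((path, index, ratio))
    return regressions
```

Running this over a folder of sample PDFs with the historical and the refactored loader would give a short list of pages to inspect by hand, rather than a pass/fail verdict.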

We are preparing the PR and its description. Look here to understand our work. We welcome any suggestions that would help us integrate it.

You can already preview the description; the final version won't be far off.

The aim is to submit the PR in early 2025.

efriis · 2024-12-13T21:47:33Z

ah got it - is there an issue or discussion of proposed changes? It might be easier to discuss ideas than these code changes

pprados · 2024-12-14T08:12:08Z

@efriis
90% of my customers work with PDF files and don't have a satisfactory solution at the moment. They cobble together solutions outside LangChain (PDF processing outside loaders and parsers), sacrificing a good part of the benefits of the framework. Seeing the same problems and the same bad solutions over and over again, I couldn't let them go on like that. I had to deal with the problem in the best possible way, for my customers and for all LangChain users.

I've been funded by my client to simultaneously help projects in Belgium, Switzerland, Spain, Italy and France. I couldn't wait for a discussion on the subject. In the end, what I'm proposing is a no-brainer:

integrate tables where possible
let users indicate what to do with images (invoke a multimodal LLM, for example)
standardize the various solutions, to eventually enable automatic selection of the parser according to document characteristics

efriis · 2024-12-15T21:38:16Z

will let you and eugene discuss in the time you have scheduled.

First integration of the refactoring of PDFloader/parser

c348a50

make format

ecbead2

Fix lint

5aa5589

Fix lint 2

4713f6b

pprados added 2 commits December 11, 2024 08:36

Fix lint 3

7930ec7

Fix lint 4

5d59d1c

pprados added 3 commits December 11, 2024 09:05

Fix lint 5

59ec411

Fix lint 6

7e94fd0

Fix lint 7

d7a7090

align doc with template

eadd68d

pprados added 4 commits December 11, 2024 11:40

Fix lint/test 8

d6413ec

Fix lint/test 9

d0d5a1b

Fix lint/test 10

a23b4c7

pprados added 4 commits December 11, 2024 12:00

Fix lint/test 11

d09e5bb

Fix lint/test 12

78d1ebc

Fix lint/test 13

781da0a

Fix lint/test 13

0a35d1f

align doc with template

4ef1a6e

Fix lint/test 14

6d7bf1d

Merge remote-tracking branch 'origin/pprados/refactor_pdf_loaders' in…

c7b392e

…to pprados/refactor_pdf_loaders

pprados added 2 commits December 11, 2024 14:38

Fix lint/test 14

2ce97f5

Fix lint/test 14

5bb6e50

delete dist installation in the notebooks pushed to remote

b505538

efriis self-assigned this Dec 12, 2024

refacto & format lint notebooks

0255868

Add ZeroxPDFLoader

855a2d9

Merge remote-tracking branch 'upstream/master' into pprados/refactor_…

7aa400b

…pdf_loaders # Conflicts: # docs/docs/integrations/vectorstores/azure_cosmos_db.ipynb

Merge remote-tracking branch 'upstream/master' into pprados/refactor_…

7ea7847

…pdf_loaders # Conflicts: # libs/community/langchain_community/document_loaders/parsers/pdf.py

pprados added 3 commits December 16, 2024 18:06

Merge remote-tracking branch 'upstream/master' into pprados/refactor_…

4e2f3e3

…pdf_loaders

Update notebooks & fix bugs

71ce392

Import orders

a262610

efriis assigned eyurtsev and unassigned efriis Dec 16, 2024
