Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactoring all PDF loader and parser #28652

Draft
wants to merge 31 commits into
base: master
Choose a base branch
from

Conversation

pprados
Copy link
Contributor

@pprados pprados commented Dec 10, 2024

WIP

Copy link

vercel bot commented Dec 10, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
langchain ❌ Failed (Inspect) Dec 16, 2024 5:28pm

…to pprados/refactor_pdf_loaders

# Conflicts:
#	docs/docs/how_to/document_loader_custom.ipynb
#	docs/docs/integrations/document_loaders/pdfminer.ipynb
#	docs/docs/integrations/document_loaders/pdfplumber.ipynb
#	docs/docs/integrations/document_loaders/pymupdf.ipynb
#	docs/docs/integrations/document_loaders/pypdfium2.ipynb
#	docs/docs/integrations/document_loaders/pypdfloader.ipynb
Copy link

vercel bot commented Dec 11, 2024

Deployment failed with the following error:

The provided GitHub repository does not contain the requested branch or commit reference. Please ensure the repository is not empty.

@efriis
Copy link
Member

efriis commented Dec 12, 2024

Hey @pprados! I don't think this work is going to result in a PR that is reviewable if it's only partially done and already adding 7000 lines.

What is your goal in this work?

@efriis efriis self-assigned this Dec 12, 2024
@pprados
Copy link
Contributor Author

pprados commented Dec 13, 2024

@efriis

I'm well aware of that. That's why a meeting is to be organized with LangChain (via AXA France), normally next week, to see how best to proceed, with @eyurtsev.

We're sorry, it may take you several hours to validate it. The changes are important and cannot be published one after the other, as everything is linked. It's going to be difficult to cut the code into 12 successive PRs, and end up with the same result. And that's going to take months. All this work is validated by two matrix tests, ensuring the consistency of all modifications.

In order to qualify all the code, we worked on a separate project, using the langchain-common structure. In this way, we can compare the results of the historical implementation with the new ones.

We understand that it's important to ensure that changes don't have a significant impact on existing code. That's why we used a parallel project, using the langchain-common structure, to test PDF readings before and after modifications. This allows us to compare results. You'll find all the files here. The only difference is the name to import classes.

We prepare the PR and its description. Look here to understand our work. We welcome any suggestions you may have to help us integrate it.

You can now pre-view the description. The final version won't be far off.

The aim is to submit the PR in early 2025.

…pdf_loaders

# Conflicts:
#	docs/docs/integrations/vectorstores/azure_cosmos_db.ipynb
@efriis
Copy link
Member

efriis commented Dec 13, 2024

ah got it - is there an issue or discussion of proposed changes? It might be easier to discuss ideas than these code changes

@pprados
Copy link
Contributor Author

pprados commented Dec 14, 2024

@efriis
90% of my customers work with PDF files and don't have a satisfactory solution at the moment. They cobble together solutions outside langchain (pdf processing outside loaders/parsers), sacrificing a good part of the benefits of this framework. Seeing the same problems over and over again, and the same bad solutions, I couldn't let them go on like that. I had to deal with the problem in the best possible way, for my customers and all LangChain users.

I've been funded by my client to simultaneously help projects in Belgium, Switzerland, Spain, Italy and France. I couldn't wait for a discussion on the subject. In the end, what I'm proposing is a no-brainer:

  • Integrate tables where possible
  • to indicate what you want to do with the images (invoke a multimodal LLM, for example)
  • standardize the various solutions to eventually enable automatic selection of the parser according to document characteristics

@efriis
Copy link
Member

efriis commented Dec 15, 2024

will let you and eugene discuss in the time you have scheduled.

…pdf_loaders

# Conflicts:
#	libs/community/langchain_community/document_loaders/parsers/pdf.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Needs support
Development

Successfully merging this pull request may close these issues.

3 participants