Refactoring all PDF loader and parser #28652
base: master
…to pprados/refactor_pdf_loaders # Conflicts: # docs/docs/how_to/document_loader_custom.ipynb # docs/docs/integrations/document_loaders/pdfminer.ipynb # docs/docs/integrations/document_loaders/pdfplumber.ipynb # docs/docs/integrations/document_loaders/pymupdf.ipynb # docs/docs/integrations/document_loaders/pypdfium2.ipynb # docs/docs/integrations/document_loaders/pypdfloader.ipynb
…to pprados/refactor_pdf_loaders
Hey @pprados! I don't think this work is going to result in a PR that is reviewable if it's only partially done and already adding 7000 lines. What is your goal in this work?
I'm well aware of that. That's why a meeting with LangChain (via AXA France) and @eyurtsev is being organized, normally for next week, to see how best to proceed. We're sorry that it may take you several hours to validate it. The changes are significant and cannot be published one after the other, as everything is linked. It would be difficult to cut the code into 12 successive PRs and end up with the same result, and it would take months. All this work is validated by two matrix tests, which ensure the consistency of all the modifications. In order to qualify all the code, we worked on a separate project, using the […]. We understand that it's important to ensure that changes don't have a significant impact on existing code. That's why we used a parallel project, using the […]. We are preparing the PR and its description. Look here to understand our work. We welcome any suggestions you may have to help us integrate it. You can now preview the description. The final version won't be far off. The aim is to submit the PR in early 2025.
…pdf_loaders # Conflicts: # docs/docs/integrations/vectorstores/azure_cosmos_db.ipynb
ah got it - is there an issue or discussion of proposed changes? It might be easier to discuss ideas than these code changes
@efriis I've been funded by my client to simultaneously help projects in Belgium, Switzerland, Spain, Italy and France. I couldn't wait for a discussion on the subject. In the end, what I'm proposing is a no-brainer:
will let you and eugene discuss in the time you have scheduled.
…pdf_loaders # Conflicts: # libs/community/langchain_community/document_loaders/parsers/pdf.py
WIP