Regarding the PDF loader selection #23918
Replies: 4 comments 11 replies
-
To handle table-related content in PDFs effectively, you can use several document loaders in LangChain. Here are some recommendations and code snippets for each:

**1. AmazonTextractPDFLoader**

This loader uses Amazon Textract to extract text, tables, and other document structures from PDFs. It is suitable for documents with complex layouts or scanned images.

```python
from langchain_community.document_loaders import AmazonTextractPDFLoader

file_path = "path/to/your/document.pdf"
loader = AmazonTextractPDFLoader(file_path, textract_features=["TABLES"])
documents = loader.load()
```

**2. PDFPlumberLoader**

This loader provides detailed metadata about the PDF and its pages and returns one document per page. It is useful for more advanced document layout analysis.

```python
from langchain_community.document_loaders import PDFPlumberLoader

file_path = "path/to/your/document.pdf"
loader = PDFPlumberLoader(file_path)
data = loader.load()
print(data[0])
```

**3. Unstructured**

This loader is particularly useful for extracting images, text, and tables from PDFs. It can partition the document and categorize elements by type.

```python
from unstructured.partition.pdf import partition_pdf

def extract_pdf_elements(path, fname):
    return partition_pdf(
        filename=path + fname,
        extract_images_in_pdf=False,
        infer_table_structure=True,
        chunking_strategy="by_title",
        max_characters=4000,
        new_after_n_chars=3800,
        combine_text_under_n_chars=2000,
        image_output_dir_path=path,
    )

def categorize_elements(raw_pdf_elements):
    tables = []
    texts = []
    for element in raw_pdf_elements:
        if "unstructured.documents.elements.Table" in str(type(element)):
            tables.append(str(element))
        elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
            texts.append(str(element))
    return texts, tables

fpath = "/path/to/your/document/"
fname = "document.pdf"
raw_pdf_elements = extract_pdf_elements(fpath, fname)
texts, tables = categorize_elements(raw_pdf_elements)
```

**4. PyMuPDFLoader**

This loader provides detailed metadata about the PDF and its pages, and it returns one document per page. You can pass any option from the PyMuPDF documentation as a keyword argument to the loader.

```python
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader("path/to/your/document.pdf", extract_images=True)
data = loader.load()
print(data[0])
```

**5. AzureAIDocumentIntelligenceLoader**

This loader uses Azure AI Document Intelligence to extract text, tables, and document structures from PDFs. It supports various file formats and provides detailed document analysis.

```python
from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader

file_path = "path/to/your/document.pdf"
endpoint = "your_azure_endpoint"
key = "your_azure_key"
loader = AzureAIDocumentIntelligenceLoader(
    api_endpoint=endpoint, api_key=key, file_path=file_path, api_model="prebuilt-layout"
)
documents = loader.load()
```

Each of these loaders has its own strengths and can be chosen based on the complexity of the PDF and the specific requirements of your application. Once the PDF is loaded and split into pages, you can further process the extracted text to identify and extract tables using LangChain's tools and integrations [1][2][3][4][5].
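As a small illustration of that downstream step, here is a plain-Python sketch (no extra dependencies; the chunk dict keys are illustrative assumptions, not a fixed LangChain schema) that wraps each table string, such as those produced by the `categorize_elements` helper in option 3, into its own chunk with provenance metadata, so tables can be indexed and retrieved as standalone units:

```python
# Illustrative sketch: turn each extracted table string into its own chunk
# dict with provenance metadata. The "page_content"/"metadata" keys mirror
# common document-chunk conventions but are assumptions for this example.

def tables_to_chunks(tables, source):
    chunks = []
    for i, table_text in enumerate(tables):
        chunks.append({
            "page_content": table_text,
            "metadata": {"source": source, "element_type": "table", "table_index": i},
        })
    return chunks

# Example with two small table strings standing in for extracted tables
tables = ["Name | Qty\nWidget | 2", "City | Pop\nParis | 2.1M"]
chunks = tables_to_chunks(tables, "document.pdf")
print(len(chunks))  # 2
```

Keeping each table as its own chunk avoids splitting a table across chunk boundaries, which is a common cause of retrieval failures on tabular content.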
-
Thank you so much for your quick response. The samples above are very helpful, and I will try all of the functionalities. I do have a query, though: if I have tree- or subtree-structured tables, as shown in the screenshot above, will these features handle such cases and give the best answer? And if I have the table structure information, how do these features load the content (text) from the table? Can you give me a brief explanation? Thanks
-
This response looks like the answer to my question. Thank you so much!
On Sat, 6 Jul 2024, 12:35 am, dosubot[bot] wrote:
The document loaders you mentioned, specifically the DocugamiLoader, are designed to handle tree- or subtree-structured tables effectively. The DocugamiLoader breaks documents down into a hierarchical semantic XML tree of chunks, which includes structural attributes like tables and other common elements. This structured representation ensures that complex table structures are accurately extracted.

To further enhance the extraction of complex table structures, you can use specific parameters provided by the DocugamiLoader:

1. **Include XML tags**: Set `loader.include_xml_tags = True` to include additional XML metadata on the returned chunks.
2. **Parent hierarchy levels**: Set `loader.parent_hierarchy_levels` to a non-zero value to get parent chunk references, which can help in understanding the context of subtrees.
3. **Sub-chunk tables**: Set `loader.sub_chunk_tables = True` to return sub-chunks within tables, which is particularly useful for handling complex table structures.

Here is an example of how to set these parameters:

```python
loader = DocugamiLoader(docset_id="your_docset_id")
loader.include_xml_tags = True  # Include additional XML metadata
loader.parent_hierarchy_levels = 3  # Get parent chunk references for expanded context
loader.sub_chunk_tables = True  # Return sub-chunks within tables
chunks = loader.load()
```

These settings will help in accurately extracting and handling complex table structures within your documents [1](https://github.com/langchain-ai/langchain/blob/master/docs/docs/integrations/document_loaders/docugami.ipynb) [2](https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/document_loaders/docugami.py).
Beta Was this translation helpful? Give feedback.
-
Thank you so much for the valuable information, but while we are doing the chunking, do we have any control so that we can create a chunk for each row of the table? Thanks
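One way to get that per-row control, sketched in plain Python (this is an illustrative pattern, not a built-in loader feature; it assumes the table has already been extracted as a list of rows, e.g. via pdfplumber's `page.extract_tables()`, with the first row as the header):

```python
# Illustrative sketch: build one chunk per data row, pairing each cell with
# its column header so every chunk is self-describing. The chunk dict keys
# ("page_content", "metadata") are assumptions chosen for this example.

def table_to_row_chunks(table, source="document.pdf", page=1):
    """Turn a header-plus-rows table into one chunk dict per data row."""
    header, *rows = table  # assume the first row holds the column headers
    chunks = []
    for i, row in enumerate(rows, start=1):
        text = "; ".join(f"{h}: {c}" for h, c in zip(header, row) if c is not None)
        chunks.append({
            "page_content": text,
            "metadata": {"source": source, "page": page, "row": i},
        })
    return chunks

# Example: a small extracted table
table = [
    ["Item", "Qty", "Price"],
    ["Widget", "2", "9.99"],
    ["Gadget", "5", "3.50"],
]
chunks = table_to_row_chunks(table)
print(chunks[0]["page_content"])  # Item: Widget; Qty: 2; Price: 9.99
```

Embedding the header into every row chunk is the key design choice here: a retrieved row then still carries its column meanings, which tends to answer table questions better than raw cell text.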
-
Checked other resources
Commit to Help
Example Code
Description
Hello team,
Thanks in advance for providing a great platform to share issues and questions.
I am facing an issue with the PDF loader: when loading PDF documents, if the chunk or text information is in tabular format, then LangChain fails to fetch the proper information from the table.
My questions are:
1. Which document loader is best for handling table-related content when processing PDFs?
2. Are there any special parameters in the PyMuPDF or PDFPlumber loaders to extract tabular information in a standard format?
3. It would be really helpful if you could share a code snippet if you have handled this previously.
Thanks and Regards,
Ashok Rayapati
System Info