Regarding the PDF loader selection #23918
Replies: 4 comments 11 replies
-
To handle table-related content in PDFs effectively, you can use several document loaders in LangChain. Here are some recommendations and code snippets for each:

**1. AmazonTextractPDFLoader**

This loader uses Amazon Textract to extract text, tables, and other document structures from PDFs. It is suitable for documents with complex layouts or scanned images.

```python
from langchain_community.document_loaders import AmazonTextractPDFLoader

file_path = "path/to/your/document.pdf"
loader = AmazonTextractPDFLoader(file_path, textract_features=["TABLES"])
documents = loader.load()
```

**2. PDFPlumberLoader**

This loader provides detailed metadata about the PDF and its pages and returns one document per page. It is useful for more advanced document layout analysis.

```python
from langchain_community.document_loaders import PDFPlumberLoader

file_path = "path/to/your/document.pdf"
loader = PDFPlumberLoader(file_path)
data = loader.load()
print(data[0])
```

**3. Unstructured**

This loader is particularly useful for extracting images, text, and tables from PDFs. It can partition the document and categorize elements by type.

```python
from unstructured.partition.pdf import partition_pdf

def extract_pdf_elements(path, fname):
    return partition_pdf(
        filename=path + fname,
        extract_images_in_pdf=False,
        infer_table_structure=True,
        chunking_strategy="by_title",
        max_characters=4000,
        new_after_n_chars=3800,
        combine_text_under_n_chars=2000,
        image_output_dir_path=path,
    )

def categorize_elements(raw_pdf_elements):
    tables = []
    texts = []
    for element in raw_pdf_elements:
        if "unstructured.documents.elements.Table" in str(type(element)):
            tables.append(str(element))
        elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
            texts.append(str(element))
    return texts, tables

fpath = "/path/to/your/document/"
fname = "document.pdf"
raw_pdf_elements = extract_pdf_elements(fpath, fname)
texts, tables = categorize_elements(raw_pdf_elements)
```

**4. PyMuPDFLoader**

This loader provides detailed metadata about the PDF and its pages, and it returns one document per page. You can pass any option from the PyMuPDF documentation as a keyword argument to the loader.

```python
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader("path/to/your/document.pdf", extract_images=True)
data = loader.load()
print(data[0])
```

**5. AzureAIDocumentIntelligenceLoader**

This loader uses Azure AI Document Intelligence to extract text, tables, and document structures from PDFs. It supports various file formats and provides detailed document analysis.

```python
from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader

file_path = "path/to/your/document.pdf"
endpoint = "your_azure_endpoint"
key = "your_azure_key"
loader = AzureAIDocumentIntelligenceLoader(
    api_endpoint=endpoint, api_key=key, file_path=file_path, api_model="prebuilt-layout"
)
documents = loader.load()
```

Each of these loaders has its own strengths and can be chosen based on the complexity of the PDF and the specific requirements of your application. Once the PDF is loaded and split into pages, you can further process the extracted text to identify and extract tables using LangChain's tools and integrations [1][2][3][4][5].
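As a small illustration of that downstream step, here is a plain-Python sketch (no extra dependencies; the chunk dict keys are illustrative assumptions, not a fixed LangChain schema) that wraps each table string, such as those produced by the `categorize_elements` helper in option 3, into its own chunk with provenance metadata, so tables can be indexed and retrieved as standalone units:

```python
# Illustrative sketch: turn each extracted table string into its own chunk
# dict with provenance metadata. The "page_content"/"metadata" keys mirror
# common document-chunk conventions but are assumptions for this example.

def tables_to_chunks(tables, source):
    chunks = []
    for i, table_text in enumerate(tables):
        chunks.append({
            "page_content": table_text,
            "metadata": {"source": source, "element_type": "table", "table_index": i},
        })
    return chunks

# Example with two small table strings standing in for extracted tables
tables = ["Name | Qty\nWidget | 2", "City | Pop\nParis | 2.1M"]
chunks = tables_to_chunks(tables, "document.pdf")
print(len(chunks))  # 2
```

Keeping each table as its own chunk avoids splitting a table across chunk boundaries, which is a common cause of retrieval failures on tabular content.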
-
Thank you so much for your quick response. The samples above are very helpful, and I will try all of the functionalities. I do have a query, though: if I have tree- or subtree-structured tables, as shown in the screenshot above, will these features handle such cases and give the best answer? And if I have the table structure information, how do these features load the content (text) from the table? Can you give me a brief explanation? Thanks
-
This response looks like the answer to my question. Thank you so much!
On Sat, 6 Jul 2024, 12:35 am, dosubot[bot] wrote:
The document loaders you mentioned, specifically the DocugamiLoader, are designed to handle tree- or subtree-structured tables effectively. The DocugamiLoader breaks documents down into a hierarchical semantic XML tree of chunks, which includes structural attributes like tables and other common elements. This structured representation ensures that complex table structures are accurately extracted.

To further enhance the extraction of complex table structures, you can use specific parameters provided by the DocugamiLoader:

1. **Include XML tags**: Set `loader.include_xml_tags = True` to include additional XML metadata on the returned chunks.
2. **Parent hierarchy levels**: Set `loader.parent_hierarchy_levels` to a non-zero value to get parent chunk references, which can help in understanding the context of subtrees.
3. **Sub-chunk tables**: Set `loader.sub_chunk_tables = True` to return sub-chunks within tables, which is particularly useful for handling complex table structures.

Here is an example of how to set these parameters:

```python
loader = DocugamiLoader(docset_id="your_docset_id")
loader.include_xml_tags = True  # Include additional XML metadata
loader.parent_hierarchy_levels = 3  # Get parent chunk references for expanded context
loader.sub_chunk_tables = True  # Return sub-chunks within tables
chunks = loader.load()
```

These settings will help in accurately extracting and handling complex table structures within your documents [1](https://github.com/langchain-ai/langchain/blob/master/docs/docs/integrations/document_loaders/docugami.ipynb) [2](https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/document_loaders/docugami.py).
Beta Was this translation helpful? Give feedback.
-
Thank you so much for the valuable information, but while we are doing the chunking, do we have any control so that we can create a chunk for each row of the table? Thanks
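One way to get that per-row control, sketched in plain Python (this is an illustrative pattern, not a built-in loader feature; it assumes the table has already been extracted as a list of rows, e.g. via pdfplumber's `page.extract_tables()`, with the first row as the header):

```python
# Illustrative sketch: build one chunk per data row, pairing each cell with
# its column header so every chunk is self-describing. The chunk dict keys
# ("page_content", "metadata") are assumptions chosen for this example.

def table_to_row_chunks(table, source="document.pdf", page=1):
    """Turn a header-plus-rows table into one chunk dict per data row."""
    header, *rows = table  # assume the first row holds the column headers
    chunks = []
    for i, row in enumerate(rows, start=1):
        text = "; ".join(f"{h}: {c}" for h, c in zip(header, row) if c is not None)
        chunks.append({
            "page_content": text,
            "metadata": {"source": source, "page": page, "row": i},
        })
    return chunks

# Example: a small extracted table
table = [
    ["Item", "Qty", "Price"],
    ["Widget", "2", "9.99"],
    ["Gadget", "5", "3.50"],
]
chunks = table_to_row_chunks(table)
print(chunks[0]["page_content"])  # Item: Widget; Qty: 2; Price: 9.99
```

Embedding the header into every row chunk is the key design choice here: a retrieved row then still carries its column meanings, which tends to answer table questions better than raw cell text.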
-
Checked other resources
Commit to Help
Example Code
Description
Hello team,
Thanks in advance for providing a great platform to share issues and questions.
I am facing an issue with the PDF loader: when loading PDF documents, if the chunk or text information is in tabular format, then LangChain fails to fetch the proper information from the table.
My questions are:
1. Which document loader is best for handling table-related content when processing PDFs?
2. Are there any special parameters in the PyMuPDF or PDFPlumber loaders to extract tabular information in a standard format?
3. It would be really helpful if you could share a code snippet if you have handled this previously.
Thanks and Regards,
Ashok Rayapati
System Info