- Objective
- Features
- Requirements
- Installation
- Usage
- Performance and Accuracy
- Limitations
- Use in Downstream Tasks
- Contributing
- License
This project extracts markdown-formatted content from PDF files for downstream tasks such as Retrieval Augmented Generation (RAG). It preserves markdown elements such as tables, images, links, bold and italic text, blockquotes, and code blocks. The script relies on Python libraries including PyMuPDF (fitz), pdfplumber, and pytesseract, and it handles PDF-to-Markdown conversion only.
- Extracts text, images, tables, and code blocks from PDF files
- Converts PDF content to markdown format optimized for RAG and other NLP tasks
- Preserves formatting for bold, italic, tables, images, links, lists, and code blocks
- Handles complex layouts including multi-column text
- Performs OCR on images to extract text
- Generates image captions using a pre-trained model
- Outputs clean, structured markdown suitable for information retrieval and text generation tasks
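As an illustration of the table-preservation feature, here is a minimal sketch of turning extracted rows into a markdown table. It assumes the rows arrive as a list of lists whose cells are strings or `None`, which is the shape pdfplumber's `page.extract_tables()` returns for each table. The helper names `rows_to_markdown` and `_clean` are hypothetical and are not taken from this project's code.

```python
from typing import List, Optional, Sequence


def _clean(cell: Optional[str]) -> str:
    """Normalize one cell: empty for None, single-line, with pipes escaped."""
    if cell is None:
        return ""
    # Collapse newlines and runs of whitespace, then escape the column separator.
    return " ".join(str(cell).split()).replace("|", "\\|")


def rows_to_markdown(rows: Sequence[Sequence[Optional[str]]]) -> str:
    """Render rows as a markdown table, treating the first row as the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    # Pad short rows so every line has the same number of columns.
    padded: List[List[str]] = [
        [_clean(c) for c in row] + [""] * (width - len(row)) for row in rows
    ]
    header, body = padded[0], padded[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


# Example with a missing cell in the last row.
print(rows_to_markdown([["Name", "Score"], ["Ada", "9.5"], ["Bob", None]]))
```

Treating the first row as the header is a simplification; tables without a header row would need a different rule.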
- Python 3.8+
- PyMuPDF (fitz)
- pdfplumber
- pytesseract
- OpenCV (cv2)
- numpy
- Pillow (PIL)
- transformers
- torch
- Clone the repository:
    git clone https://github.com/iamarunbrahma/pdf-to-markdown.git
    cd pdf-to-markdown
- Create a virtual environment (optional but recommended):
    python -m venv venv
    source venv/bin/activate
- Install the required packages:
    pip install -r requirements.txt
- Install Tesseract OCR:
  - On Ubuntu:
    sudo apt-get install tesseract-ocr
  - On macOS:
    brew install tesseract
  - On Windows: Download and install from GitHub
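After installing, it can help to confirm that the Tesseract executable is reachable, since pytesseract invokes a `tesseract` binary found on `PATH` by default. The following is a small sketch using only the standard library; the function name `find_tesseract` is hypothetical and is not part of this project.

```python
import shutil
from typing import Optional


def find_tesseract(binary: str = "tesseract") -> Optional[str]:
    """Return the full path of the Tesseract executable, or None if it is not on PATH."""
    return shutil.which(binary)


path = find_tesseract()
if path is None:
    print("Tesseract was not found on PATH; install it before running OCR.")
else:
    print(f"Tesseract found at {path}")
```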
Run the script with the path to your PDF file as an argument:
python extract.py --pdf_path path/to/your/file.pdf
The extracted markdown content will be saved in the `outputs` directory with the same name as the input PDF file, but with a `.md` extension.
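The naming rule described above can be sketched as follows. The `--pdf_path` flag and the `outputs` directory come from this README; the use of `argparse` and the helper names `parse_args` and `output_path_for` are assumptions made for illustration, not a description of how `extract.py` is actually written.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def output_path_for(pdf_path: str, out_dir: str = "outputs") -> Path:
    """Map an input PDF path to <out_dir>/<same base name>.md."""
    return Path(out_dir) / (Path(pdf_path).stem + ".md")


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the command line; passing argv explicitly makes this easy to test."""
    parser = argparse.ArgumentParser(description="Convert a PDF file to markdown.")
    parser.add_argument("--pdf_path", required=True, help="Path to the input PDF file")
    return parser.parse_args(argv)


# Example: the same arguments as the command shown above.
args = parse_args(["--pdf_path", "path/to/your/file.pdf"])
print(output_path_for(args.pdf_path))
```

Only the base name of the input is kept, so two PDFs with the same file name in different folders would map to the same output file under this sketch.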
The script is designed to handle various PDF layouts and content types, with a focus on producing high-quality markdown for downstream NLP tasks:
- Accuracy: The extractor aims for high accuracy in preserving the original document's structure and formatting. It handles common elements like text, tables, images, links, and code blocks well, ensuring the output is suitable for tasks like RAG. However, very complex layouts or PDFs with non-standard formatting might require manual review.
- Speed: The processing time depends on the PDF's size and complexity. On average, for a 10-page PDF with mixed content (text, images, tables, and code blocks), the extraction process typically takes about 30-60 seconds on a modern computer.
- Optimization for RAG: The output is structured to facilitate easy parsing and chunking for RAG systems, with clear delineation between different sections and content types.
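To show what heading-based chunking of the output might look like on the consumer side, here is a minimal sketch. It assumes the generated markdown uses ATX headings (`#` through `######`) to mark sections, which this README does not confirm; the function `chunk_by_heading` is hypothetical and is not part of the extractor.

```python
import re
from typing import Dict, List

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
# Fence markers are built programmatically to keep this example easy to embed.
FENCE_MARKERS = ("`" * 3, "~" * 3)


def _emit(chunks: List[Dict[str, str]], title: str, buf: List[str]) -> None:
    """Append a chunk unless it has neither a title nor any text."""
    text = "\n".join(buf).strip()
    if title or text:
        chunks.append({"title": title, "text": text})


def chunk_by_heading(markdown: str) -> List[Dict[str, str]]:
    """Split markdown into one chunk per heading, ignoring '#' lines inside code fences."""
    chunks: List[Dict[str, str]] = []
    title = ""
    buf: List[str] = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE_MARKERS):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            # A new heading closes the previous chunk.
            _emit(chunks, title, buf)
            title, buf = match.group(2), []
        else:
            buf.append(line)
    _emit(chunks, title, buf)
    return chunks
```

Each chunk keeps its heading as `title`, so a retriever can index the section name alongside the body text. Text before the first heading is returned with an empty title.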
- This tool is specifically designed for PDF to Markdown conversion and does not handle other file formats.
- Very large PDFs (100+ pages) may require significant processing time.
- PDFs with complex mathematical formulas or specialized symbols may not be perfectly converted.
- Scanned PDFs without embedded text will rely on OCR, which may not be 100% accurate.
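One way to decide when a page should fall back to OCR is to check how much text its embedded text layer contains. The sketch below is a heuristic under stated assumptions: `page_text` is whatever the text layer yields (for example the string returned by PyMuPDF's `page.get_text()`), and both the function name `needs_ocr` and the threshold of 20 characters are hypothetical values chosen for illustration, not the project's actual logic.

```python
def needs_ocr(page_text: str, min_chars: int = 20) -> bool:
    """Heuristic: treat a page as scanned when its text layer has almost no visible characters."""
    visible = sum(1 for ch in page_text if not ch.isspace())
    return visible < min_chars


# A page whose text layer is empty or whitespace-only would be sent to OCR.
for sample in ["", "  \n ", "This page has an embedded text layer."]:
    print(repr(sample), needs_ocr(sample))
```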
The markdown output from this extractor is particularly well-suited for:
- Retrieval Augmented Generation (RAG): The structured markdown can be easily indexed and retrieved, providing context for language models in RAG systems.
- Text Summarization: Clean, well-formatted markdown facilitates more accurate summarization of document content.
- Information Extraction: The preserved structure aids in extracting specific information from documents.
Contributions to improve the extractor's accuracy, speed, or feature set are welcome, especially those that enhance its utility for RAG and other NLP tasks. Please feel free to submit issues or pull requests.
This project is licensed under the MIT License - see the LICENSE file for details.