👋 Welcome to the Document QA system! This repository contains code that lets you ask questions about your documents and get answers based on their contents. Supported formats include PDF, Word, Excel, PowerPoint, plain text, and images.
- 💻 Supports a variety of document formats, including PDF, Word, Excel, PowerPoint, text files, and images
- 🤖 Uses the Hugging Face Transformers library to create embeddings for document chunks
- 🔍 Uses the FAISS library to create an index for those embeddings, allowing for efficient similarity search
- 💬 Allows users to ask questions about their documents and get answers based on the contents of those documents
- ⚡️ Uses multiprocessing to parallelize the creation of the index for improved performance
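
The chunk, embed, index, and search flow listed above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the repository's implementation: a toy bag-of-words vector stands in for the Hugging Face Transformers embeddings, a brute-force cosine scan stands in for the FAISS index, and every name here (`chunk_text`, `embed`, `build_index`, `search`) and every parameter value is hypothetical.

```python
import math
import re
from collections import Counter
from concurrent.futures import ProcessPoolExecutor


def chunk_text(text, size=40, overlap=10):
    """Split text into overlapping word windows (sizes are illustrative)."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    # Stop before a window that would contain only already-covered words.
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]


def embed(chunk):
    """Toy stand-in for a Transformer embedding: L2-normalized word counts."""
    counts = Counter(re.findall(r"[a-z0-9]+", chunk.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def cosine(a, b):
    """Dot product of two normalized sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


def build_index(chunks, workers=1):
    """Embed every chunk; with workers > 1 the work is spread over processes."""
    if workers > 1:
        with ProcessPoolExecutor(max_workers=workers) as pool:
            vectors = list(pool.map(embed, chunks))
    else:
        vectors = [embed(c) for c in chunks]
    return list(zip(chunks, vectors))


def search(index, question, k=3):
    """Return up to k (score, chunk) pairs, best match first (brute force)."""
    q = embed(question)
    scored = [(cosine(q, vec), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    docs = [
        "FAISS builds an index over dense vectors for similarity search.",
        "Pillow opens and converts image files such as JPEG and PNG.",
        "python-pptx reads slides from PowerPoint presentations.",
    ]
    index = build_index(docs)
    for score, chunk in search(index, "How do I search vectors by similarity?", k=1):
        print(f"{score:.2f}  {chunk}")
```

In a real pipeline the dictionary vectors would be replaced by dense model embeddings and the linear scan by an approximate or exact nearest-neighbor index; the `workers` argument only hints at how index creation could be parallelized.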
- Python 3.6 or higher
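- The following Python packages:
  - transformers
  - langchain
  - PyMuPDF (imported as `fitz`)
  - Pillow
  - textract
  - pandas
  - python-pptx
  - opencv-python (for image support)
- The `concurrent.futures` module, which is part of the Python 3 standard library and needs no separate installation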
- Clone this repository to your local machine:
git clone https://github.com/AiGptCode/AskyourDocuments.git
- Install the required Python packages:
pip install transformers langchain PyMuPDF pillow textract pandas python-pptx opencv-python
- Set your Hugging Face API key as an environment variable:
export HUGGINGFACE_API_TOKEN=your-api-key
- Run the script and enter the path to the directory containing your documents:
python AskyourDocuments.py
- Ask a question about your documents and get an answer based on the contents of those documents.
Note: If you want to include images in your search, make sure they are in a supported format (e.g., JPEG, PNG) and are located in the same directory as your other documents.
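
Before running the script, it can help to confirm that the token is visible to Python and that the directory actually contains supported files. The sketch below is only an illustration: `get_token`, `find_documents`, and the extension-to-type mapping in `SUPPORTED` are hypothetical and inferred from the formats listed in this README, so the real script may recognize a different set.

```python
import os
from pathlib import Path

# Assumed mapping, based on the formats this README mentions.
SUPPORTED = {
    ".pdf": "pdf",
    ".docx": "word",
    ".xlsx": "excel",
    ".pptx": "powerpoint",
    ".txt": "text",
    ".jpg": "image",
    ".jpeg": "image",
    ".png": "image",
}


def get_token(env=os.environ):
    """Return the Hugging Face token from the environment or raise."""
    token = env.get("HUGGINGFACE_API_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HUGGINGFACE_API_TOKEN is not set")
    return token


def find_documents(directory):
    """Group supported file names in one directory (non-recursive) by type."""
    root = Path(directory)
    if not root.is_dir():
        raise NotADirectoryError(str(directory))
    found = {}
    for path in sorted(root.iterdir()):
        kind = SUPPORTED.get(path.suffix.lower())
        if kind and path.is_file():
            found.setdefault(kind, []).append(path.name)
    return found


if __name__ == "__main__":
    try:
        get_token()
    except RuntimeError as err:
        print(f"warning: {err}")
    print(find_documents("."))
```

The scan is deliberately non-recursive, matching the note that images should sit in the same directory as the other documents.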
If you would like to contribute to this project, please follow these steps:
- Fork this repository to your own GitHub account.
- Create a new branch for your changes:
git checkout -b my-feature-branch
- Make your changes and commit them:
git commit -am 'Add some feature'
- Push your changes to your fork:
git push origin my-feature-branch
- Open a pull request against the original repository.
This project is licensed under the MIT License.
- The Hugging Face Transformers library for providing pre-trained models and tokenizers
- The FAISS library for providing efficient similarity search and clustering of dense vectors
- The `langchain` library for providing utilities for creating and working with language models
- The `fitz` library for providing utilities for working with PDF files
- The `Pillow` library for providing utilities for working with image files
- The `textract` library for providing utilities for extracting text from various file formats
- The `pandas` library for providing utilities for working with tabular data in Python
- The `python-pptx` library for providing utilities for working with PowerPoint files
- The `concurrent-futures` library for providing a high-level interface for asynchronously executing callables
- The `opencv-python` library for providing utilities for working with image and video data (for image support)