PaperRAG is a command-line tool that lets users populate a local database with PDF documents and query it using natural-language questions. PaperRAG uses Retrieval-Augmented Generation (RAG), combining retrieval from the document database with answer generation by large language models (LLMs). This approach produces answers synthesized from relevant document passages without exposing the original documents or requiring internet access. PaperRAG is designed with privacy and security in mind, operating entirely on the user's local machine with no data transfer. Specifically, it uses the LangChain library and the Chroma vector database to store and retrieve information from PDF files, and large language models served by Ollama to generate answers to queries.
- Clone the repository: `git clone https://github.com/your_username/paperRAG.git`
- Navigate to the project directory: `cd paperRAG`
- Install the package and its dependencies: `pip install .`
- Check and modify `config.py` to change the models and settings
- Replace the PDFs in the `data` directory with the ones you'd like to query & chat with
- Install Ollama
- Install your embedding / LLM models of choice: `ollama pull mxbai-embed-large` and `ollama pull llama3.1`
- Run ChromaDB: `chroma serve`
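To sanity-check the setup before populating the database, a short script like the one below can confirm that the Chroma server and the pulled Ollama models respond. This is an optional check, not part of PaperRAG itself; it assumes the `chromadb` and `ollama` Python clients are installed and that both services are running on their default local ports.

```python
# Optional setup check (not part of PaperRAG): ping the Chroma server and Ollama.
# Assumes `pip install chromadb ollama` and both services on their default ports.
import chromadb
import ollama

chroma_client = chromadb.HttpClient(host="localhost", port=8000)
print("Chroma heartbeat:", chroma_client.heartbeat())

emb = ollama.embeddings(model="mxbai-embed-large", prompt="hello world")
print("Embedding dimension:", len(emb["embedding"]))

reply = ollama.generate(model="llama3.1", prompt="Reply with one word: ready?")
print("LLM reply:", reply["response"].strip())
```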
PaperRAG provides two main commands: `populate` and `query`.

To populate the database with PDF documents, use the `populate` command:

`paperrag populate [--reset]`
- `--reset`: This optional flag will clear the existing database before populating it with new documents.

The `populate` command will load all PDF files from the `data` directory, split them into chunks, and add them to the Chroma vector database.
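The exact ingestion code lives in the package, but a minimal sketch of an equivalent pipeline, built on the LangChain, Chroma, and Ollama components named above, looks roughly like this. The loader, splitter, and chunk sizes here are illustrative assumptions rather than PaperRAG's actual choices:

```python
# Illustrative ingestion pipeline (a sketch, not PaperRAG's exact implementation).
# Requires the langchain-community, langchain-text-splitters, langchain-ollama,
# langchain-chroma, and pypdf packages.
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings
from langchain_chroma import Chroma

DATA_PATH = "data"      # documented default
CHROMA_PATH = "chroma"  # documented default

# Load every PDF page in the data directory as a LangChain Document.
documents = PyPDFDirectoryLoader(DATA_PATH).load()

# Split pages into overlapping chunks; the sizes here are assumptions.
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=80)
chunks = splitter.split_documents(documents)

# Embed the chunks with the Ollama embedding model and persist them in Chroma.
db = Chroma(
    persist_directory=CHROMA_PATH,
    embedding_function=OllamaEmbeddings(model="mxbai-embed-large"),
)
db.add_documents(chunks)
print(f"Indexed {len(chunks)} chunks from {len(documents)} pages.")
```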
To query the database, use the `query` command:

`paperrag query "your query text" [--num_queries NUM]`

Replace `"your query text"` with the actual query you want to ask. The command will search the database for relevant chunks of text and generate an answer using the Ollama language model.

- `--num_queries NUM`: This optional argument specifies the number of queries to perform. If not provided, the default value of 10 will be used.

The output will include the generated answer and the sources (document IDs) used to construct the answer.
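The query side follows the same stack. The sketch below, again an illustration rather than PaperRAG's exact code, retrieves the highest-scoring chunks from Chroma, fills a prompt template, and asks the Ollama model for an answer; the mapping of the retrieval count `k` to `--num_queries` is an assumption, as is the template wording:

```python
# Illustrative query pipeline (a sketch, not PaperRAG's exact implementation).
from langchain_chroma import Chroma
from langchain_ollama import OllamaEmbeddings, OllamaLLM

# Placeholder template; the real one is PROMPT_TEMPLATE in config.py.
PROMPT_TEMPLATE = (
    "Answer the question based only on the following context:\n\n"
    "{context}\n\n---\n\nQuestion: {question}"
)

db = Chroma(
    persist_directory="chroma",
    embedding_function=OllamaEmbeddings(model="mxbai-embed-large"),
)

question = "What evaluation metrics does the paper report?"
# k=10 mirrors the documented --num_queries default (an assumed mapping).
results = db.similarity_search_with_score(question, k=10)

context = "\n\n---\n\n".join(doc.page_content for doc, _score in results)
answer = OllamaLLM(model="llama3.1").invoke(
    PROMPT_TEMPLATE.format(context=context, question=question)
)
sources = [doc.metadata.get("source") for doc, _score in results]

print("Answer:", answer)
print("Sources:", sources)
```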
The following configuration options are available in the `config.py` file:

- `DATA_PATH`: The directory where your PDF files are stored (default: `"data"`).
- `CHROMA_PATH`: The directory where the Chroma vector database will be stored (default: `"chroma"`).
- `EMBEDDING_MODEL_NAME`: The name of the embedding model to use (default: `"mxbai-embed-large"`).
- `LLM_MODEL_NAME`: The name of the language model (via Ollama) to use (default: `"phi3:mini"`).
- `PROMPT_TEMPLATE`: The prompt template used for querying the language model.
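For reference, a `config.py` using the documented defaults could look like the sketch below. The `PROMPT_TEMPLATE` body is an illustrative placeholder, not the template shipped with the project:

```python
# config.py — illustrative example built from the documented defaults.
DATA_PATH = "data"                      # directory holding your PDFs
CHROMA_PATH = "chroma"                  # directory for the Chroma vector store
EMBEDDING_MODEL_NAME = "mxbai-embed-large"
LLM_MODEL_NAME = "phi3:mini"            # any model pulled via `ollama pull` works

# Placeholder wording; the actual template in the repository may differ.
PROMPT_TEMPLATE = """
Answer the question based only on the following context:

{context}

---

Answer the question based on the above context: {question}
"""
```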
Please see CONTRIBUTING.md for details on how to contribute to this project.
This work is largely inspired by the following resources; thanks to their authors:
- https://huggingface.co/blog/hrishioa/retrieval-augmented-generation-1-basics
- https://github.com/langchain-ai/langgraph/tree/main/examples/rag
- https://github.com/pixegami/rag-tutorial-v2
This project is licensed under the MIT License.
For any questions or issues, please open an issue.