A RAG application that uses OpenAI embeddings to let users interact with their Safaricom M-Pesa PDF statement reports and analyze their transaction patterns.
The application uses llama-index as the base for Retrieval Augmented Generation, with OpenAI embeddings backing the vector index used for similarity search.
The model sometimes gets things wrong during embedding inference and similarity search, so refining the queries or using LlamaParse is necessary to produce quality results.
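To make the similarity search step more concrete, here is a minimal, standard-library sketch of how a query vector is compared against stored chunk vectors. It is not the application's actual code: the three-dimensional vectors, the chunk texts, and the helper names `cosine_similarity` and `top_k` are all made up for illustration, and real OpenAI embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed_chunks, k=2):
    """Return the k chunk texts whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed_chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Hypothetical toy "embeddings" standing in for real embedding vectors.
chunks = [
    ("Paid to merchant: supermarket", [0.9, 0.1, 0.0]),
    ("Received from: employer salary", [0.0, 0.2, 0.9]),
    ("Airtime purchase", [0.7, 0.6, 0.1]),
]
query = [1.0, 0.2, 0.0]  # stand-in for the embedded user question

results = top_k(query, chunks, k=2)
print(results)
```

When the text extracted from the PDF is noisy, the chunk vectors drift away from what the rows actually mean, which is one way a wrong chunk can end up ranked first.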
To improve query results, it is essential for any RAG application to use tools that clean the data. One such tool is LlamaParse. Its main goal is to parse and clean your data, ensuring that it is of good quality before it is passed to any downstream LLM use case such as advanced RAG. Its API offers 1000 free pages, and code snippets for using it are available from the LlamaParse project.
Below is a snippet showing the benefits of using LlamaParse:
Running the command pip install -r requirements.txt installs all the required dependencies, including LlamaParse. To use the parser, one remaining dependency is the nest-asyncio library, which can be installed with the command pip install nest-asyncio.
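Once the dependencies are installed, parsing a statement could look roughly like the sketch below. It assumes the `llama-parse` and `nest-asyncio` packages are installed and that the API key is stored in the `LLAMA_CLOUD_API_KEY` environment variable. The function name `parse_statement` and the file name in the usage note are hypothetical and not part of this repository.

```python
import os


def parse_statement(pdf_path, api_key=None):
    """Parse a PDF with LlamaParse and return the parsed documents."""
    api_key = api_key or os.environ.get("LLAMA_CLOUD_API_KEY")
    if not api_key:
        raise ValueError("Set LLAMA_CLOUD_API_KEY or pass api_key explicitly")
    if not os.path.isfile(pdf_path):
        raise FileNotFoundError(pdf_path)

    # Imported lazily so the input checks above work before installation.
    import nest_asyncio
    from llama_parse import LlamaParse

    # LlamaParse runs async code internally; nest_asyncio allows that inside
    # environments that already have a running event loop, such as notebooks.
    nest_asyncio.apply()

    parser = LlamaParse(api_key=api_key, result_type="markdown")
    return parser.load_data(pdf_path)
```

A call such as `parse_statement("statement.pdf")` returns a list of documents that can then be indexed with llama-index in place of the raw PDF text.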
After adopting LlamaParse, query results improve significantly, especially for applications that involve tables and figures.
Alternatively, instead of converting the PDF fully with OpenAI embeddings, the tabula-py library, which extracts tables from PDFs and converts them to CSVs, can be used. This library is a simple Python wrapper for tabula-java, and its documentation covers all the approaches comprehensively. However, the library requires Java to be installed because it is a Python wrapper around a Java tool.
Below is a snippet about how the library achieves this:
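As a rough sketch of that conversion, assuming the `tabula-py` package and a Java runtime are installed, the example below wraps `tabula.convert_into` in a small helper. The helper name `pdf_tables_to_csv` and the file names in the usage note are hypothetical.

```python
import os
import shutil


def pdf_tables_to_csv(pdf_path, csv_path):
    """Extract the tables from every page of a PDF into a single CSV file."""
    if not os.path.isfile(pdf_path):
        raise FileNotFoundError(pdf_path)
    if shutil.which("java") is None:
        # tabula-py shells out to tabula-java, so Java must be on PATH.
        raise RuntimeError("Java was not found on PATH; tabula-py requires it")

    import tabula  # provided by the tabula-py package

    tabula.convert_into(pdf_path, csv_path, output_format="csv", pages="all")
    return csv_path
```

For example, `pdf_tables_to_csv("statement.pdf", "statement.csv")` would write the extracted rows to `statement.csv`. Password-protected statements may need extra handling that this sketch does not cover.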
After conversion to a CSV, PandasAI can be used to query the data with user prompts. Its documentation is also comprehensive, with a ton of code snippets and examples for querying:
- Excel files
- Google sheets
- CSVs
It also uses various API keys, which serve as credentials for interacting with the generative AI models.
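A hedged sketch of how the API key and a prompt might come together is shown below. It assumes the `SmartDataframe` interface from PandasAI 1.x/2.x (other releases may expose a different API), with `pandas` installed and an OpenAI key stored in the `OPENAI_API_KEY` environment variable. The helper names `load_api_key` and `ask_statement` are hypothetical.

```python
import os


def load_api_key(name="OPENAI_API_KEY"):
    """Read a credential from the environment instead of hard-coding it."""
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {name} is not set")
    return key


def ask_statement(csv_path, question):
    """Ask a natural-language question about the CSV via PandasAI."""
    api_key = load_api_key()

    # Imported lazily; assumes the PandasAI 1.x/2.x SmartDataframe interface.
    import pandas as pd
    from pandasai import SmartDataframe
    from pandasai.llm import OpenAI

    llm = OpenAI(api_token=api_key)
    sdf = SmartDataframe(pd.read_csv(csv_path), config={"llm": llm})
    return sdf.chat(question)
```

Keeping the key in an environment variable keeps it out of the repository. A call might look like `ask_statement("statement.csv", "Which month had the highest total spend?")`.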