From 3e599c7bbeef211dc346e9bc1d3a249113fcc4e4 Mon Sep 17 00:00:00 2001 From: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Date: Tue, 17 Dec 2024 14:24:40 +0100 Subject: [PATCH] docs: add Haystack RAG example (#615) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> --- docs/examples/rag_haystack.ipynb | 386 +++++++++++++++++++++++++++++++ mkdocs.yml | 1 + 2 files changed, 387 insertions(+) create mode 100644 docs/examples/rag_haystack.ipynb diff --git a/docs/examples/rag_haystack.ipynb b/docs/examples/rag_haystack.ipynb new file mode 100644 index 00000000..3e47eacb --- /dev/null +++ b/docs/examples/rag_haystack.ipynb @@ -0,0 +1,386 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# RAG with Haystack" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Overview" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This example leverages the\n", + "[Haystack Docling extension](https://github.com/DS4SD/docling-haystack), along with\n", + "Milvus-based document store and retriever instances, as well as sentence-transformers\n", + "embeddings.\n", + "\n", + "The presented `DoclingConverter` component enables you to:\n", + "- use various document types in your LLM applications with ease and speed, and\n", + "- leverage Docling's rich format for advanced, document-native grounding.\n", + "\n", + "`DoclingConverter` supports two different export modes:\n", + "- `ExportType.MARKDOWN`: if you want to capture each input document as a separate\n", + " Haystack document, or\n", + "- `ExportType.DOC_CHUNKS` (default): if you want to have each input document chunked and\n", + " to then capture each individual chunk as a separate Haystack document downstream.\n", + "\n", + "The example lets you explore both modes via the `EXPORT_TYPE` parameter; depending on the\n", + "value set, the 
ingestion and RAG pipelines are then set up accordingly." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- 👉 For best conversion speed, use GPU acceleration whenever available; e.g. if running on Colab, use a GPU-enabled runtime.\n", "- The notebook uses HuggingFace's Inference API; for an increased LLM quota, a token can be provided via the env var `HF_TOKEN`.\n", "- Requirements can be installed as shown below (`--no-warn-conflicts` is meant for Colab's pre-populated Python env; feel free to remove it for stricter usage):" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install -q --progress-bar off --no-warn-conflicts docling-haystack haystack-ai docling pymilvus milvus-haystack sentence-transformers python-dotenv" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import os\n", "from pathlib import Path\n", "from tempfile import mkdtemp\n", "\n", "from docling_haystack.converter import ExportType\n", "from dotenv import load_dotenv\n", "\n", "def _get_env_from_colab_or_os(key):\n", " try:\n", " from google.colab import userdata\n", "\n", " try:\n", " return userdata.get(key)\n", " except userdata.SecretNotFoundError:\n", " pass\n", " except ImportError:\n", " pass\n", " return os.getenv(key)\n", "\n", "load_dotenv()\n", "HF_TOKEN = _get_env_from_colab_or_os(\"HF_TOKEN\")\n", "PATHS = [\"https://arxiv.org/pdf/2408.09869\"] # Docling Technical Report\n", "EMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\"\n", "GENERATION_MODEL_ID = \"mistralai/Mixtral-8x7B-Instruct-v0.1\"\n", "EXPORT_TYPE = ExportType.DOC_CHUNKS\n", "QUESTION = \"Which are 
the main AI models in Docling?\"\n", + "TOP_K = 3\n", + "MILVUS_URI = str(Path(mkdtemp()) / \"docling.db\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Indexing pipeline" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Token indices sequence length is longer than the specified maximum sequence length for this model (1041 > 512). Running this sequence through the model will result in indexing errors\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "80beca8762c34095a21467fb7f056059", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Batches: 0%| | 0/2 [00:00