Vectara is the trusted GenAI platform providing simple APIs to create conversational experiences (such as chatbots, semantic search, and question answering) from your data.
vectara-ingest is an open source Python project that demonstrates how to crawl datasets and ingest them into Vectara. It provides a step-by-step guide on building your own crawler and some pre-built crawlers for ingesting data from sources such as:
- Websites
- RSS feeds
- Jira tickets
- Notion notes
- Docusaurus documentation sites
- Slack
- And many others...
For more information about this repository, see Code Organization and Crawling.
This guide explains how to create a basic crawler to scrape content from Paul Graham's website, and ingest it into Vectara.
- A Vectara account
- A Vectara corpus with an API key that provides indexing permissions
- Python 3.8 (or higher)
- pyyaml - install with:
pip install pyyaml
- Docker installed
This section explains how to clone the vectara-ingest repository to your machine.
Open a terminal session and clone the repository to a directory on your machine:
git clone https://github.com/vectara/vectara-ingest.git
- Open a Windows PowerShell terminal.
- Update your Windows Subsystem for Linux (WSL):
  wsl --update
- Ensure that WSL has the correct version of Linux:
  wsl --install ubuntu-20.04
- Open your Linux terminal and clone the repository to a directory on your machine:
  git clone https://github.com/vectara/vectara-ingest.git
Note: Run the git clone command in the last step only within your Linux environment, not in the Windows environment. You may need to choose a username for your Ubuntu environment as part of the setup process.
In this example, we index the content of the https://www.paulgraham.com website into Vectara. Because this website provides an RSS feed but no sitemap, we use the vectara-ingest RSS crawler instead of crawling via a sitemap.
- Navigate to the directory that you have cloned.
- Copy secrets.example.toml to secrets.toml.
- In the secrets.toml file, change api_key to your Vectara API key. To retrieve your API key from the Vectara console, click Access Control in your corpus view or use the Authorization tab.
- In the config/ directory, copy the news-bbc.yaml config file to pg-rss.yaml.
- Edit the pg-rss.yaml file and make the following changes:
  - Change the vectara.corpus_id value to the ID of the corpus into which you want to ingest the content of the website. To retrieve your corpus ID from the Vectara console, click Data > Your Corpus Name and you will see the ID at the top of the screen.
  - Change the vectara.customer_id value to the ID of your account. To retrieve your customer ID from the Vectara console, click your username in the upper-right corner.
  - Change rss_crawler.source to pg.
  - Change rss_crawler.rss_pages to ["http://www.aaronsw.com/2002/feeds/pgessays.rss"] so it points to the Paul Graham RSS feed.
  - Change rss_crawler.days_past to 365.
- Ensure that Docker is running on your local machine.
- Run the script from the directory that you cloned, specifying your .yaml configuration file and your default profile from the secrets.toml file:
  bash run.sh config/pg-rss.yaml default
  Note:
  - On Linux, ensure that the run.sh file is executable by running the following command:
    chmod +x run.sh
  - On Windows, ensure that you run this command from within the WSL 2 environment.
  Note: To protect your system's resources and make it easier to move your crawlers to the cloud, the crawler executes inside a Docker container. Setting up the container is a lengthy process because it involves numerous dependencies.
- When the container is set up, you can track your crawler's progress:
  docker logs -f vingest
While your crawler is ingesting data into your Vectara corpus, you can try queries against it in the Vectara console: click Data > Your Corpus Name and then, under the Query tab, type in a query such as "What is a maker schedule?"
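The edits to pg-rss.yaml described in the steps above can also be scripted. The following is a minimal sketch, not part of vectara-ingest: the function names make_pg_config and write_pg_config are hypothetical, the default corpus and customer IDs are placeholders, and it assumes that news-bbc.yaml already contains vectara and rss_crawler sections. Note that rewriting a file with pyyaml drops any comments in the original.

```python
import copy


def make_pg_config(base: dict, corpus_id: int, customer_id: int) -> dict:
    """Return a copy of `base` with the Paul Graham RSS settings applied."""
    config = copy.deepcopy(base)  # leave the caller's dictionary untouched
    vectara = config.setdefault("vectara", {})
    vectara["corpus_id"] = corpus_id
    vectara["customer_id"] = customer_id
    rss = config.setdefault("rss_crawler", {})
    rss["source"] = "pg"
    rss["rss_pages"] = ["http://www.aaronsw.com/2002/feeds/pgessays.rss"]
    rss["days_past"] = 365
    return config


def write_pg_config(src="config/news-bbc.yaml", dst="config/pg-rss.yaml",
                    corpus_id=4, customer_id=1234567):
    """Hypothetical helper: read `src`, apply the changes, write `dst`."""
    import yaml  # pyyaml, listed in the prerequisites

    with open(src) as f:
        base = yaml.safe_load(f)
    with open(dst, "w") as f:
        yaml.safe_dump(make_pg_config(base, corpus_id, customer_id), f,
                       sort_keys=False)
```

Calling write_pg_config() from the root of the cloned repository, with your own corpus and customer IDs, would produce config/pg-rss.yaml.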
The codebase includes the following components.
- run.sh: The main shell script to execute when you want to launch a crawl job (see below for more details).
- ingest.py: The main entry point for a crawl job.
- Dockerfile: The Docker image definition file.
- Various documentation files like README.MD, CONTRIBUTING.MD, or SECURITY.MD.
Fundamental utilities depended upon by the crawlers:
- indexer.py: Defines the Indexer class, which implements helpful methods to index data into Vectara such as index_url(), index_file() and index_document().
- crawler.py: Defines the Crawler class, which implements a base class for crawling, where each specific crawler should implement the crawl() method specific to its type.
- pdf_convert.py: Helper class to convert URLs into local PDF documents.
- extract.py: Utilities for text extraction from HTML.
- utils.py: Some utility functions used by the other code.
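The relationship between the base class and a specific crawler can be illustrated with a small, self-contained sketch. This is not the project's actual code: the constructor arguments, the RecordingIndexer stand-in, the UrlListCrawler class, and the urllist_crawler configuration section are all hypothetical, and the real Indexer and Crawler classes may have different signatures. The only detail taken from the description above is that each crawler implements its own crawl() method.

```python
from abc import ABC, abstractmethod


class RecordingIndexer:
    """Hypothetical stand-in for Indexer: records URLs instead of calling Vectara."""

    def __init__(self):
        self.indexed = []

    def index_url(self, url: str) -> bool:
        self.indexed.append(url)
        return True


class Crawler(ABC):
    """Illustrative base class: holds the config and an indexer."""

    def __init__(self, cfg: dict, indexer: RecordingIndexer):
        self.cfg = cfg
        self.indexer = indexer

    @abstractmethod
    def crawl(self) -> None:
        """Each specific crawler provides its own implementation."""


class UrlListCrawler(Crawler):
    """Hypothetical crawler that indexes a fixed list of URLs from its config."""

    def crawl(self) -> None:
        for url in self.cfg["urllist_crawler"]["urls"]:
            self.indexer.index_url(url)


indexer = RecordingIndexer()
crawler = UrlListCrawler(
    {"urllist_crawler": {"urls": ["https://example.com/a", "https://example.com/b"]}},
    indexer,
)
crawler.crawl()
print(indexer.indexed)
```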
Includes implementations of the various specific crawlers. Crawler files are always in the form of xxx_crawler.py, where xxx is the type of crawler.
Includes example YAML configuration files for various crawling jobs.
Includes some images (png files) used in the documentation
To crawl and index a source, you run a crawl "job", which is controlled by several parameters that you can define in a YAML configuration file. You can see example configuration files in the config/ directory.
Each configuration YAML file includes a set of standard variables, for example:
vectara:
  # the corpus ID for indexing
  corpus_id: 4
  # the Vectara customer ID
  customer_id: 1234567
  # flag: should vectara-ingest reindex if document already exists (optional)
  reindex: false
  # flag: store a copy of all crawled data that is indexed into a local folder
  store_docs: false
  # timeout: sets the URL crawling timeout in seconds (optional)
  timeout: 90
  # post_load_timeout: sets additional timeout past full page load to wait for animations and AJAX
  post_load_timeout: 5
  # flag: if true, will print extra debug messages when active
  verbose: false
  # flag: remove code or not from HTML (optional)
  remove_code: true
  # flag: should text extraction from web pages use special processing to remove boilerplate (optional)
  # this can be helpful when processing news pages or others which have a lot of advertising content
  remove_boilerplate: false
  # flag: enable special processing for tables inside PDFs or HTML (optional)
  # Notes:
  # 1. This processing uses OpenAI, and requires you to list the OPENAI_API_KEY in your `secrets.toml` under a special profile called `general`.
  # 2. If enabled,
  #    - When crawling PDF, PPT or DOC files, the code will extract table content, then use GPT-4o to summarize the table, and ingest this summarized text into Vectara.
  #    - This flag also enables processing of images (e.g. diagrams) and uses GPT-4o to summarize the content of the images using the vision capabilities of GPT-4o.
  #    - Similarly, when crawling HTML files, any HTML table data will be summarized with GPT-4o.
  # 3. This processing is quite slow and will require you to have an additional paid subscription to OpenAI. The code uses the "detectron2_onnx" unstructured model, which is the fastest. You can modify this to use one of the alternatives (https://unstructured-io.github.io/unstructured/best_practices/models.html) if you want a slower but better-performing model.
  # See [here](TABLE_SUMMARY.md) for some examples of how table summary works.
  summarize_tables: false
  # Whether masking of PII is attempted on all text fields (title, text, metadata values)
  # Notes:
  # 1. This masking is never done on files uploaded to Vectara directly (via e.g. indexer.index_file())
  # 2. Masking is done using Microsoft Presidio PII analyzer and anonymizer, and is limited to English only
  mask_pii: false
  # Which whisper model to use for audio files (relevant for YT, S3 and folder crawlers)
  # Valid values: tiny, base, small, medium or large. Defaults to base.
  whisper_model: the model name for whisper
crawling:
  # type of crawler; valid options are website, docusaurus, notion, jira, rss, mediawiki, discourse, github and others (this continues to evolve as new crawler types are added)
  crawler_type: XXX
Following that, where needed, the same YAML configuration file will include a crawler-specific section with crawler-specific parameters (see about crawlers):
XXX_crawler:
  # specific parameters for the crawler XXX
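To make the relationship between crawling.crawler_type and the XXX_crawler section concrete, here is a small sketch that looks up the crawler-specific parameters in an already-parsed configuration. The function name crawler_settings is hypothetical, and this is not how ingest.py is necessarily implemented; it only assumes the naming pattern described above, where a crawler of type rss is configured in a section named rss_crawler.

```python
def crawler_settings(config: dict) -> dict:
    """Return the crawler-specific section for the configured crawler type."""
    # The standard sections are expected in every configuration file.
    for section in ("vectara", "crawling"):
        if section not in config:
            raise KeyError(f"missing required section: {section}")

    crawler_type = config["crawling"]["crawler_type"]
    # The crawler-specific section is only present "where needed",
    # so fall back to an empty dictionary when it is absent.
    return config.get(f"{crawler_type}_crawler") or {}


example = {
    "vectara": {"corpus_id": 4, "customer_id": 1234567},
    "crawling": {"crawler_type": "rss"},
    "rss_crawler": {"source": "pg", "days_past": 365},
}
print(crawler_settings(example))
```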
We use a secrets.toml file to hold secret keys and parameters. You need to create this file in the root directory before running a crawl job. This file can hold multiple "profiles", and specific secrets for each of these profiles. For example:
[general]
OPENAI_API_KEY="sk-..."
[profile1]
api_key="<VECTARA-API-KEY-1>"
[profile2]
api_key="<VECTARA-API-KEY-2>"
[profile3]
api_key="<VECTARA-API-KEY-3>"
NOTION_API_KEY="<YOUR-NOTION-API-KEY>"
The use of the toml standard allows easy secrets management when you have multiple crawl jobs that may not share the same secrets, for example when you have a different Vectara API key for indexing different corpora.
Many of the crawlers have their own secrets, for example Notion, Discourse, Jira, or GitHub. These are also kept in the secrets.toml file in the appropriate section and need to be all upper case (e.g. NOTION_API_KEY or JIRA_PASSWORD).
If you are using the table summarization feature, which uses OpenAI, you have to provide your own OpenAI API key. In this case, you need to put that key under the [general] profile. This is a special profile name reserved for this purpose.
The Indexer class provides useful functionality to index documents into Vectara.
index_url() is probably the most useful method. It takes a URL as input and extracts the content from that URL (using the playwright library), then sends that content to Vectara using the standard indexing API. If the URL points to a PDF document, special care is taken to ensure proper processing.
Please note that the special flag remove_boilerplate can be set to true if you want the content to be stripped of boilerplate text (e.g. advertising content). In this case the indexer uses Goose3 and justext to extract the main (most important) content of the article, ignoring links, ads and other unimportant content.
Use index_file() when you have a file that you want to index using Vectara's file_upload API, so that it takes care of format identification, segmentation of text and indexing.
Use these when you build the document JSON structure directly and want to index this document in the Vectara corpus.
The reindex parameter determines whether an existing document should be reindexed or not. If reindexing is required, the code automatically takes care of that by calling delete_doc() to first remove the document from the corpus and then indexes the document.
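The reindex behavior described above can be sketched with an in-memory stand-in for the corpus. Everything here is hypothetical and for illustration only: the InMemoryCorpus class, the "id" key used to identify a document, and the boolean return value are assumptions, and the real Indexer talks to the Vectara API rather than to a dictionary. The sketch only shows the order of operations: skip an existing document unless reindex is set, otherwise delete it first and then index it again.

```python
class InMemoryCorpus:
    """Hypothetical stand-in for a Vectara corpus, keyed by document ID."""

    def __init__(self, reindex: bool = False):
        self.reindex = reindex
        self.docs = {}

    def delete_doc(self, doc_id: str) -> None:
        self.docs.pop(doc_id, None)

    def index_document(self, document: dict) -> bool:
        """Index `document`; return True if it was stored, False if skipped."""
        doc_id = document["id"]  # assumption: documents carry an "id" key
        if doc_id in self.docs:
            if not self.reindex:
                return False         # document exists and reindexing is off
            self.delete_doc(doc_id)  # remove the old copy before indexing again
        self.docs[doc_id] = document
        return True


corpus = InMemoryCorpus(reindex=False)
first = corpus.index_document({"id": "essay-1", "text": "original"})
second = corpus.index_document({"id": "essay-1", "text": "updated"})

corpus.reindex = True
third = corpus.index_document({"id": "essay-1", "text": "updated"})
print(first, second, third, corpus.docs["essay-1"]["text"])
```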
The project is designed to be used within a Docker container, so that a crawl job can be run anywhere - on a local machine or on any cloud machine. See the Dockerfile for more information on the Docker file structure and build.
To run vectara-ingest locally, perform the following steps:
- Make sure you have Docker installed on your machine, and that there is enough memory and storage to build the Docker image.
- Clone this repo locally with git clone https://github.com/vectara/vectara-ingest.git.
- Enter the directory with cd vectara-ingest.
- Choose the configuration file for your project and run: bash run.sh config/<config-file>.yaml <profile>.
This command creates the Docker container locally, configures it with the parameters specified in your configuration file (with secrets taken from the appropriate <profile> in secrets.toml), and starts up the Docker container.
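If you launch crawl jobs from another Python script, the run command above can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of vectara-ingest; it only builds the same bash run.sh invocation shown above and checks that the configuration file has a YAML extension.

```python
import shlex
from pathlib import Path


def build_run_command(config_file: str, profile: str = "default") -> list:
    """Build the argument list for `bash run.sh <config-file> <profile>`."""
    if Path(config_file).suffix not in (".yaml", ".yml"):
        raise ValueError(f"expected a YAML configuration file, got: {config_file}")
    return ["bash", "run.sh", config_file, profile]


command = build_run_command("config/pg-rss.yaml", "default")
print(shlex.join(command))  # prints: bash run.sh config/pg-rss.yaml default
```

The resulting list can be passed to subprocess.run(command, check=True) from the root of the cloned repository, with Docker running.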
If you want to run vectara-ingest on Render, please follow these steps:
- Sign Up/Log In: If you don't have a Render account, you'll need to create one. If you already have one, just log in.
- Create New Service: Once you're logged in, click the "New" button, usually found on the dashboard, and select "Background Worker".
- Choose "Deploy an existing image from a registry" and click "Next".
- Specify Docker Image: In the "Image URL" field, fill in "vectara/vectara-ingest" and click "Next".
- Choose a name for your deployment (e.g. "vectara-ingest"), and pick a region if you need to or leave the default. Then pick your instance type.
- Click "Create Web Service".
- Click "Environment", then "Add Secret File": name the file config.yaml, and copy in the contents of the config.yaml for your crawler.
- Assuming you have a secrets.toml file with multiple profiles and you want to use the secrets for the profile [my-profile], click "Environment", then "Add Secret File": name the file secrets.toml, and copy only the contents of [my-profile] from your secrets.toml to this file (including the profile name). Make sure to also copy the [general] profile and your OPENAI_API_KEY if you are using table summarization.
- Click "Settings", go to "Docker Command" and click "Edit", then enter the following command:
  /bin/bash -c "mkdir /home/vectara/env && cp /etc/secrets/config.yaml /home/vectara/env/ && cp /etc/secrets/secrets.toml /home/vectara/env/ && python3 ingest.py /home/vectara/env/config.yaml <my-profile>"
Then click "Save Changes", and your application should now be deployed.
Note:
- Hosting in this way does not support the CSV or folder crawlers.
- Where vectara-ingest uses playwright to crawl content (e.g. website crawler or docs crawler), the Render instance may require more RAM to work properly with a headless browser. Make sure your Render deployment uses the correct machine type to allow that.
vectara-ingest can be easily deployed on any cloud platform such as AWS, Azure or GCP. You simply create a cloud VM, SSH into your machine, and follow the local-deployment instructions above.
The vectara-ingest container is available for easy deployment via Docker Hub.
👤 Vectara
- Website: vectara.com
- Twitter: @vectara
- GitHub: @vectara
- LinkedIn: @vectara
- Discord: @vectara
Contributions, issues and feature requests are welcome and appreciated!
Feel free to check the issues page. You can also take a look at the contributing guide.
Give a ⭐️ if this project helped you!
Copyright © 2024 Vectara.
This project is Apache 2.0 licensed.