
Support bulk operations #23

Open
jgpruitt opened this issue Jun 10, 2024 · 8 comments · May be fixed by #280
Labels
enhancement New feature or request

Comments

@jgpruitt
Collaborator

What is the most performant/efficient way to embed lots of rows? Can we build functions or procedures to make this easy? If not, can we document guidance and provide example code?

jgpruitt added the enhancement label on Jun 10, 2024
@kolaente
Contributor

kolaente commented Nov 4, 2024

I have hit this problem with an application I'm building (not yet with pgai). We were ingesting so much data into the system that we'd run into OpenAI's rate limits. The solution was to build a batch processing job that creates OpenAI embedding batches, checks on a schedule whether OpenAI has processed each batch, and then saves the returned embeddings into the database.

I wonder if pgai could do something like this as well?
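
For reference, the OpenAI side of that flow looks roughly like this (a sketch with the openai Python SDK; the file path and model are placeholders, and error handling is omitted):

```python
# Rough sketch of the batch flow described above, using the openai SDK.
# Assumes OPENAI_API_KEY is set; file path and model are placeholders.
import json
from openai import OpenAI

client = OpenAI()

# 1. Upload a JSONL file where each line is one embedding request, e.g.
#    {"custom_id": "42", "method": "POST", "url": "/v1/embeddings",
#     "body": {"model": "text-embedding-3-small", "input": "some text"}}
batch_file = client.files.create(
    file=open("embedding_requests.jsonl", "rb"),
    purpose="batch",
)

# 2. Create the batch job.
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/embeddings",
    completion_window="24h",
)

# 3. On a schedule, check whether OpenAI has processed the batch.
batch = client.batches.retrieve(batch.id)
if batch.status == "completed":
    # 4. Download the results and save the embeddings to the database.
    output = client.files.content(batch.output_file_id)
    for line in output.text.splitlines():
        result = json.loads(line)
        # ...INSERT result["response"]["body"]["data"][0]["embedding"]...
```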

@theodufort

I ran into the same problem when I wanted to bulk-generate embeddings for 25+ million rows; I cannot do it without hitting OSError: [Errno 24] Too many open files

@theodufort

Right now, the only way that works for me without errors is doing it in small batches like this:

```sql
UPDATE public.subjects
SET subject_embedding = ai.ollama_embed('nomic-embed-text:v1.5', name)
WHERE id IN (
    SELECT id
    FROM public.subjects
    WHERE subject_embedding IS NULL
    LIMIT 100
);
```
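
A rough sketch of looping that until the backlog is drained (using psycopg; the connection string is a placeholder):

```python
# Rough sketch: repeat the small-batch UPDATE until no rows are left.
# Assumes psycopg (v3) and the public.subjects table from above; the
# connection string is a placeholder.
import psycopg

BATCH_SQL = """
    UPDATE public.subjects
    SET subject_embedding = ai.ollama_embed('nomic-embed-text:v1.5', name)
    WHERE id IN (
        SELECT id FROM public.subjects
        WHERE subject_embedding IS NULL
        LIMIT 100
    )
"""

with psycopg.connect("postgresql://user:pass@localhost/db") as conn:
    while True:
        with conn.cursor() as cur:
            cur.execute(BATCH_SQL)
            updated = cur.rowcount
        conn.commit()      # commit each batch so progress survives crashes
        if updated == 0:   # nothing left to embed
            break
```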

@alejandrodnm
Contributor

Have you tried setting up a vectorizer? When you run the worker, you can specify the number of batches, concurrency, and poll interval.

Each batch is sent to OpenAI in a single request.

Of course, if you try to do too much in a short period, you're bound to get rate limited. We currently don't support OpenAI's batch API. The alternative is to set the vectorizer's config to stay below the rate limit threshold. This works if your ingest comes in spikes and you're fine with some delay between inserting rows and generating their embeddings.
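
As a back-of-the-envelope illustration of where to set those knobs (all numbers below are made up, not OpenAI's actual limits):

```python
# Back-of-the-envelope: how fast can the worker go without tripping a
# rate limit? All numbers here are illustrative.
requests_per_minute_limit = 3000  # example account RPM limit
rows_per_request = 50             # batch size you configure
safety_factor = 0.8               # headroom for other traffic

max_rows_per_minute = requests_per_minute_limit * rows_per_request * safety_factor
print(f"Keep ingest below ~{max_rows_per_minute:,.0f} rows/minute")  # ~120,000
```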

@kolaente
Contributor

I haven't set one up yet, still exploring options.

Is it possible to extend the vectorizer to support adding embeddings either through OpenAI's batch API or manually, via a service that I run? (The latter would do all the checking, batch creation, etc., but would require marking a chunk as "this will be created asynchronously, please don't do anything with it".)

@alejandrodnm
Contributor

@kolaente Supporting the batch API in the vectorizer worker docker image could take some effort, and it's not on our current roadmap. But you can extend the vectorizer and make your own worker; once you have something running, we can discuss integrating it into the pgai repo.

When you create a vectorizer in your DB with the ai.create_vectorizer function, an embeddings store table and a queue table are created. The queue table is populated whenever something changes in your source table.
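
For example (the model, dimensions, and destination name here are illustrative; check the docs for the config functions available in your pgai version):

```python
# Illustrative only: create a vectorizer for the subjects table from
# earlier in the thread. Model, dimensions, destination name, and the
# connection string are all placeholder choices.
import psycopg

with psycopg.connect("postgresql://user:pass@localhost/db") as conn:
    conn.execute(
        """
        SELECT ai.create_vectorizer(
            'public.subjects'::regclass,
            destination => 'subjects_embeddings',
            embedding => ai.embedding_openai('text-embedding-3-small', 1536),
            chunking => ai.chunking_recursive_character_text_splitter('name')
        )
        """
    )
```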

This is the query we use to fetch items from the queue:

```python
@cached_property
def fetch_work_query(self) -> sql.Composed:
    ...  # full query in the vectorizer worker source
```

From a very simplistic point of view, I think this is more or less what you need to do (a rough sketch follows the list):

  • Create another table (or tables) to keep track of which queue items have been sent and the associated job id.
  • Update the fetch-work query to skip queue items that have already been sent to a batch, or delete them from the queue once you create a batch job with them and track the batches in a separate table.
  • Have a separate process that polls the batch jobs, inserts the embeddings, and cleans up the queues.
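
A very rough skeleton of those pieces, with hypothetical table and column names (pgai generates the real queue table name):

```python
# Very rough skeleton of the steps above. All table and column names
# are hypothetical, and retries/error handling are omitted.
import time
import psycopg

def claim_queue_items(conn, batch_id: str) -> None:
    """Move pending queue items out of the queue and tie them to a batch job."""
    with conn.cursor() as cur:
        cur.execute(
            """
            CREATE TABLE IF NOT EXISTS sent_batches (
                batch_id text,
                item_id  bigint
            )
            """
        )
        # Deleting from the queue means the normal fetch-work query skips
        # these items; sent_batches remembers which batch they belong to.
        cur.execute(
            """
            WITH claimed AS (
                DELETE FROM my_vectorizer_queue RETURNING id
            )
            INSERT INTO sent_batches (batch_id, item_id)
            SELECT %s, id FROM claimed
            """,
            (batch_id,),
        )
    conn.commit()

def poll_batches(conn) -> None:
    """Separate process: poll batch jobs, store embeddings, clean up."""
    while True:
        with conn.cursor() as cur:
            cur.execute("SELECT DISTINCT batch_id FROM sent_batches")
            for (batch_id,) in cur.fetchall():
                # For each job: retrieve it from OpenAI, insert completed
                # embeddings into the store table, then delete the matching
                # rows from sent_batches. Left as an exercise here.
                pass
        time.sleep(60)
```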

There are many more pieces than this, which is why it's non-trivial to add to the project right now.

If you implement something like this, we'd be very interested to learn about your experience.

Hope this helps you get started. Feel free to reach out if you have more questions; you can always find us in the pgai Discord: https://discord.com/channels/1246241636019605616/1246243698111676447

@kolaente
Contributor

@alejandrodnm Thanks! I'll look into implementing this and report back with my findings. (might take a while until I get to it though)

Would I need to fork and build everything from scratch to extend the vectorizer, or is there a clear path to extending it?

@kolaente
Contributor

kolaente commented Dec 5, 2024

I've just opened a PR for this: #280
