
Commit

fix: ai asset links
zlatanpham committed Oct 4, 2024
1 parent a8125d9 commit f494551
Showing 9 changed files with 65 additions and 54 deletions.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -6,8 +6,8 @@ tags:
authors:
- hoangnnh
date: 2024-09-06
title: "Multi-agent collaboration for task completion"
description: "In AI integrated systems, instead of put all workload on a single agent, we can apply divide and conquer strategy to distribute workload to multiple agents. This approach can enhance task completion by leveraging the unique skills and capabilities of each agent.This approach allows for more complex and nuanced problem-solving, as well as increased efficiency and scalability. By coordinating and communicating effectively, agents can work together to achieve common goals, divide labor, and overcome challenges that a single agent might face alone"
title: 'Multi-agent collaboration for task completion'
description: 'In AI integrated systems, instead of put all workload on a single agent, we can apply divide and conquer strategy to distribute workload to multiple agents. This approach can enhance task completion by leveraging the unique skills and capabilities of each agent.This approach allows for more complex and nuanced problem-solving, as well as increased efficiency and scalability. By coordinating and communicating effectively, agents can work together to achieve common goals, divide labor, and overcome challenges that a single agent might face alone'
---

In AI-integrated systems, instead of putting all the workload on a single agent, we can apply a divide-and-conquer strategy to distribute the workload across multiple agents. This approach enhances task completion by leveraging the unique skills and capabilities of each agent, and it allows for more complex and nuanced problem-solving, as well as increased efficiency and scalability. By coordinating and communicating effectively, agents can work together to achieve common goals, divide labor, and overcome challenges that a single agent might face alone.
@@ -18,7 +18,7 @@ Imagine we plan to integrate AI into our application, we build an AI agent with

## System design

![Multi-agent system](assets/multi-agent-design.webp)

A Multi-agent AI system can be designed as follows:

@@ -32,7 +32,7 @@ There are other variations of this design like add a layer of agent to become su

Let's consider a scenario where we have an event management application with features such as event creation and project management. We want to create an AI agent that can handle complex tasks across event creation, project creation, and event management. We can design a multi-agent AI system as follows:

![Multi-agent example](assets/multi-agent-example.webp)

- Supervisor: Responsible for routing the task request to the appropriate agents and collecting the results. We will define its system prompt as below:

@@ -47,15 +47,15 @@ const systemPrompt = `You are a supervisor tasked with managing a conversation b
- Event agent: Responsible for handling the Event module, including creating and managing events within projects. We will define its system prompt similar to this:

```ts
const systemPrompt = `You are an intelligent assistant responsible for handling the Event module. Given an Event struct format, you will collect event information and map it to the Event struct fields when processing requests. Your responses should be concise and focused on the event details.
{event_struct_format}
`
```

- Project agent: Responsible for handling the Project module, including listing projects/workspaces/hubs and creating, updating, and managing projects. We will define its system prompt similar to this:

```ts
const systemPrompt = `You are an intelligent assistant responsible for handling the Project module. Given a project struct format, you will collect project information from user input and map it to the Project struct fields when processing requests. Your responses should be concise and focused on the project details.
{project_struct_format}
`
```
@@ -66,11 +66,11 @@ Now, let's consider a user request: "I want to create event with title "Lady Gag

- Result:

![Multi-agent result](assets/multi-agent-example-result.webp)

With multi-agent AI, the task is completed successfully: two agents collaborate to complete it, and the supervisor agent manages the workflow. So how does the supervisor agent route the task to the appropriate agents? Let's look inside the system.

![Multi-agent routing](assets/multi-agent-example-inside.webp)

As you can see, the supervisor divides the request into smaller tasks and handles them one by one. It routes each task to an agent for reasoning and processing; while processing a task, the agent uses the power of the LLM to decide whether to call a tool. The agent then returns its result to the supervisor, which collects and combines the results and continues reasoning until it reaches the final answer.

@@ -79,5 +79,6 @@
A multi-agent AI system is a powerful tool for solving complex tasks. It allows us to distribute the workload across multiple agents, each of which is responsible for a specific scope of work. This can improve the efficiency and accuracy of the system. However, it also introduces new challenges, such as coordination and communication between agents and managing the workflow. To overcome these challenges, we need to design a well-defined system prompt for each agent, and a supervisor agent to manage the workflow.

## References

- https://arxiv.org/abs/2308.08155
- https://github.com/langchain-ai/langgraphjs/blob/main/examples/multi_agent/agent_supervisor.ipynb
102 changes: 56 additions & 46 deletions AI/Building LLM system/multimodal-in-rag.md
@@ -6,24 +6,26 @@ tags:
authors:
- hoangnnh
date: 2024-06-28
title: "Multimodal in RAG"
description: "In spite of having taken the world by storm, Large Language Models(LLM) still has some limitations such as limited context window and a knowledge cutoff date. Retrieval-Augmented Generation(RAG) steps in to bridge this gap by allowing LLMs to access and utilize external knowledge sources beyond their training data. However, data is not text based only, it also can be image, audio, table in docs,..."
title: 'Multimodal in RAG'
description: 'In spite of having taken the world by storm, Large Language Models(LLM) still has some limitations such as limited context window and a knowledge cutoff date. Retrieval-Augmented Generation(RAG) steps in to bridge this gap by allowing LLMs to access and utilize external knowledge sources beyond their training data. However, data is not text based only, it also can be image, audio, table in docs,...'
---

Despite having taken the world by storm, Large Language Models (LLMs) still have some limitations, such as a limited context window and a knowledge cutoff date. Retrieval-Augmented Generation (RAG) steps in to bridge this gap by allowing LLMs to access and utilize external knowledge sources beyond their training data. However, data is not only text-based; it can also be images, audio, or tables in documents, and that information is lost in most RAG applications. Therefore, preprocessing multimodal data is a problem we should not ignore when building a RAG application. In this note, we will explore how to effectively preprocess and integrate multimodal data to enhance the performance and utility of RAG systems.

## Challenges in multimodal RAG
Take preprocessing a document (.pdf) file as an example: the document contains a mixture of content types, including text, tables, and images. When we chunk and embed the data, text splitting may break up tables, corrupting the data at retrieval time, and the images can lose information along the way. So how do we handle this properly? There are several methods, but two main approaches are currently used:

- Use a multimodal embedding model to embed both text and images.
- Use a multimodal LLM to summarize images and tables, then pass the summaries and text data to a text embedding model such as OpenAI’s “text-embedding-3-small”.
In this note, we will focus on the second method.

## Multimodal LLM

The main idea of this approach is to transform all of your data into a single modality: text. This means that you only need a text embedding model to store all of your data within the same vector space.

![Multimodal LLM](assets/multimodal-in-rag-multimodel-llm.webp)

This method involves the following steps:

@@ -33,43 +35,49 @@
4. When searching for similarity in the retrieval step, get the relevant context and feed the raw data to the LLM to generate the output.

## Implementation

We take this [post](https://cloudedjudgement.substack.com/p/clouded-judgement-111023) for the implementation because it contains many chart images. We will follow the steps above to preprocess this document.

1. **Extract data from document**: We use [Unstructured](https://unstructured.io/) - a great ETL tool well-suited for this because it can extract elements (tables, images, text) from numerous file types and categorize them based on their types.
```python
from unstructured.partition.pdf import partition_pdf

# Partition the PDF into text, table, and image elements
raw_pdf_elements = partition_pdf(
filename=path + fname,
extract_images_in_pdf=True,
infer_table_structure=True,
chunking_strategy="by_title",
max_characters=4000,
new_after_n_chars=3800,
combine_text_under_n_chars=2000,
extract_image_block_types=["Image", "Table"],
extract_image_block_output_dir=path,
extract_image_block_to_payload=False
)
```
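The extracted elements can then be split by type, so that tables and plain text chunks are handled differently in the next step. A minimal sketch (the class-name checks follow Unstructured's element types and may vary with the library version):

```python
# Separate Unstructured elements into tables and text chunks.
# The string checks follow Unstructured's element class names; adjust them
# if your version of the library exposes different types.
tables, texts = [], []
for element in raw_pdf_elements:
    if "unstructured.documents.elements.Table" in str(type(element)):
        tables.append(str(element))
    elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
        texts.append(str(element))
```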

2. **Summarize tables and images**: We chunk text data normally, and we pass each extracted table and image through an LLM (the gpt-4o model) to get a summary. We can use the following prompts for each kind of data to capture the main content.
```python
table_sum_prompt = """You are an assistant tasked with summarizing tables for retrieval. \
These summaries will be embedded and used to retrieve the raw table elements. \
Give a concise summary of the table that is well optimized for retrieval. Table: {element} """

image_sum_prompt = """You are an assistant tasked with summarizing images for retrieval. \
These summaries will be embedded and used to retrieve the raw image. \
Give a concise summary of the image that is well optimized for retrieval."""
```
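As a rough sketch of how these prompts might be applied with the OpenAI Python SDK (the original write-up wires this up through LangChain, so treat the helper functions below as illustrative rather than the exact implementation):

```python
import base64

from openai import OpenAI  # official OpenAI Python SDK

client = OpenAI()

def summarize_table(table_text: str) -> str:
    # Fill the table prompt and ask gpt-4o for a retrieval-optimized summary
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": table_sum_prompt.format(element=table_text)}],
    )
    return resp.choices[0].message.content

def summarize_image(image_path: str) -> str:
    # Encode the extracted image as base64 and send it alongside the image prompt
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": image_sum_prompt},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```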
After summarizing, a sample result will look similar to the one below.

![Image summary](assets/multimodal-in-rag-img-summary.webp)

3. **Embedding data**: We embed the table and image summaries into the vector DB and also store the raw data for reference. Remember that we store the embedded summary (the vector) together with its raw content, not the summarized content.

4. **Retrieval**: When we search for similarity in the vector DB, we get the related context (raw content) and then feed it, together with the user's original input, to the LLM to generate the response. That is why we store the raw data rather than the summaries: we want something like "Hey GPT, I have some images and tables, can you answer my question based on them?", not "Hey GPT, I have some image summaries and table summaries, can you answer my question based on these summaries?".
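Putting steps 3 and 4 together, here is a minimal, self-contained sketch of the storage and lookup logic. The original write-up uses LangChain's multi-vector retriever with an OpenAI text embedding model; this stripped-down version uses Chroma's default embedder and a plain dict as the raw-content store, just to show the idea:

```python
import uuid

import chromadb

# The vector store holds the *summaries*; raw_store maps the same ids back to raw content.
client = chromadb.Client()
collection = client.create_collection("multimodal_rag")
raw_store = {}  # doc_id -> raw table text or raw (base64) image

def index(raw_elements, summaries):
    for raw, summary in zip(raw_elements, summaries):
        doc_id = str(uuid.uuid4())
        collection.add(ids=[doc_id], documents=[summary])  # summary gets embedded
        raw_store[doc_id] = raw                            # raw content kept for generation

def retrieve(question: str, k: int = 4):
    # Similarity search runs against the summaries, but we return the raw content
    hits = collection.query(query_texts=[question], n_results=k)
    return [raw_store[doc_id] for doc_id in hits["ids"][0]]
```

The raw context returned by `retrieve` is then formatted into the final multimodal prompt, which is what the `prompt_func` helper below takes care of.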


```python
def prompt_func(data_dict):
"""
    (docstring and most of the function body are collapsed in the diff view)
    """
    # ...
return [HumanMessage(content=messages)]
```
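Since most of `prompt_func` is collapsed in this diff, the exact shape of its `data_dict` argument is not shown; assuming it carries the retrieved raw context and the user question, the final generation step might be wired up roughly like this (the `ChatOpenAI` import path and the dict keys are assumptions, not taken from the hidden code):

```python
from langchain_openai import ChatOpenAI  # assumed import; adjust to your LangChain setup

model = ChatOpenAI(model="gpt-4o")

question = "What is the actual reported revenue of Datadog in the quarter?"
context = retrieve(question)  # raw tables / base64 images from the sketch above

# "context" and "question" keys are hypothetical, chosen only for illustration
answer = model.invoke(prompt_func({"context": context, "question": question}))
print(answer.content)
```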

5. **Testing**: To test what we have done so far, let's take an image from the document and find out whether our RAG can extract the information from it and answer correctly.

![Testing](assets/multimodal-in-rag-testing.webp)

We take an image which is a table containing data about the reported revenue of tech companies in the quarter. Then we ask for some information from that image, for example: "What is the actual reported revenue of Datadog in the quarter?", which we can see in the image is $547.5 million. Our RAG responds with the correct answer.

## Conclusion

The integration of various data types, such as text and images, into LLMs enhances their ability to generate more holistic responses to a user's queries. New models keep arriving that solve the problems related to handling different types of data in LLMs. This concept of multimodal RAG is an early but important step toward achieving human-like perception in machines.

## References

- https://medium.com/kx-systems/guide-to-multimodal-rag-for-images-and-text-10dab36e3117
- https://blog.langchain.dev/semi-structured-multi-modal-rag/
- https://unstructured.io
