feat: improved the testset generation to_pandas and docs (#1536)
jjmachan authored Oct 19, 2024
1 parent 2ca00c1 commit 8efe80d
Showing 9 changed files with 226 additions and 63 deletions.
1 change: 1 addition & 0 deletions docs/getstarted/rag_evaluation.md
@@ -51,3 +51,4 @@ df = results.to_pandas()
df.head()
```

![evaluation-result](./raga_evaluation_output.png)
111 changes: 110 additions & 1 deletion docs/getstarted/rag_testset_generation.md
@@ -1,7 +1,10 @@
-## Testset Generation for RAG
+# Testset Generation for RAG

This simple guide will help you generate a testset for evaluating your RAG pipeline using your own documents.

## Quickstart
Let's walk through a quick example of generating a testset for a RAG pipeline. After that, we will explore the main components of the testset generation pipeline.

### Load Sample Documents

For the sake of this tutorial we will use sample documents from this [repository](https://huggingface.co/datasets/explodinggradients/Sample_Docs_Markdown). You can replace this with your own documents.
@@ -47,3 +50,109 @@ You may now export and inspect the generated testset.
```python
dataset.to_pandas()
```

![testset](./testset_output.png)


## A Deeper Look

Now that we have seen how to generate a testset, let's take a closer look at the main components of the testset generation pipeline and how you can quickly customize it.

At its core, testset generation consists of two main operations.

1. **KnowledgeGraph Creation**: We first create a [KnowledgeGraph][ragas.testset.graph.KnowledgeGraph] using the documents you provide and use various [Transformations][ragas.testset.transforms.base.BaseGraphTransformation] to enrich the knowledge graph with additional information that we can use to generate the testset. You can learn more about this from the [core concepts section](../concepts/test_data_generation/rag.md#knowledge-graph-creation).
2. **Testset Generation**: We use the [KnowledgeGraph][ragas.testset.graph.KnowledgeGraph] to generate a set of [scenarios][ragas.testset.synthesizers.base.BaseScenario]. These scenarios are used to generate the [testset][ragas.testset.synthesizers.generate.Testset]. You can learn more about this from the [core concepts section](../concepts/test_data_generation/rag.md#scenario-generation).

Now let's see an example of how these components work together to generate a testset.

### KnowledgeGraph Creation

Let's first create a [KnowledgeGraph][ragas.testset.graph.KnowledgeGraph] using the documents we loaded earlier.

```python
from ragas.testset.graph import KnowledgeGraph

kg = KnowledgeGraph()
```
```
KnowledgeGraph(nodes: 0, relationships: 0)
```

and then add the documents to the knowledge graph.

```python
from ragas.testset.graph import Node, NodeType

for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )
```
```
KnowledgeGraph(nodes: 10, relationships: 0)
```

Next, we enrich the knowledge graph with additional information using [Transformations][ragas.testset.transforms.base.BaseGraphTransformation]. Here we use [default_transforms][ragas.testset.transforms.default_transforms] to build the default set of transformations, driven by an LLM and embedding model of your choice.
You can also mix and match transforms or build your own as needed.

```python
from ragas.testset.transforms import apply_transforms, default_transforms

# choose your LLM and Embedding Model
from ragas.llms import llm_factory
from ragas.embeddings import embedding_factory

transformer_llm = llm_factory("gpt-4o")
embedding_model = embedding_factory("text-embedding-3-large")

trans = default_transforms(llm=transformer_llm, embedding_model=embedding_model)
apply_transforms(kg, trans)
```

The knowledge graph now carries the additional information. You can also save it to disk and load it back later.

```python
kg.save("knowledge_graph.json")
loaded_kg = KnowledgeGraph.load("knowledge_graph.json")
loaded_kg
```
```
KnowledgeGraph(nodes: 48, relationships: 605)
```

### Testset Generation

Now we will use the `loaded_kg` to create the [TestsetGenerator][ragas.testset.synthesizers.generate.TestsetGenerator].

```python
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, knowledge_graph=loaded_kg)
```

We can also define the distribution of queries we would like to generate. Here, let's use the default distribution.

```python
from ragas.testset.synthesizers import default_query_distribution

query_distribution = default_query_distribution(generator_llm)
```
```
[
(AbstractQuerySynthesizer(llm=generator_llm), 0.25),
(ComparativeAbstractQuerySynthesizer(llm=generator_llm), 0.25),
(SpecificQuerySynthesizer(llm=generator_llm), 0.5),
]
```
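
To make the weights concrete, here is a minimal sketch of how a distribution like the one above could be turned into per-synthesizer sample counts for `testset_size=10`. The helper `split_by_weights` is hypothetical and uses largest-remainder rounding; it only illustrates the idea of a weighted split and is not necessarily how Ragas allocates samples internally.

```python
from typing import List


def split_by_weights(total: int, weights: List[float]) -> List[int]:
    """Split `total` items across buckets in proportion to `weights`.

    Hypothetical helper (not part of Ragas): largest-remainder rounding,
    so the counts always add up to `total` when the weights sum to 1.
    """
    raw = [total * w for w in weights]
    counts = [int(r) for r in raw]  # round every share down first
    leftover = total - sum(counts)
    # hand the remaining items to the buckets with the largest fractional part
    by_fraction = sorted(
        range(len(weights)), key=lambda i: raw[i] - counts[i], reverse=True
    )
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts


# weights copied from the default distribution shown above
distribution = {
    "AbstractQuerySynthesizer": 0.25,
    "ComparativeAbstractQuerySynthesizer": 0.25,
    "SpecificQuerySynthesizer": 0.5,
}
counts = dict(zip(distribution, split_by_weights(10, list(distribution.values()))))
print(counts)
```

With these weights, half of the ten samples go to the specific-query synthesizer and the rest are shared between the two abstract synthesizers.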

Now we can generate the testset.

```python
testset = generator.generate(testset_size=10, query_distribution=query_distribution)
testset.to_pandas()
```

![testset](./testset_output.png)
Binary file added docs/getstarted/raga_evaluation_output.png
Binary file added docs/getstarted/testset_output.png
4 changes: 2 additions & 2 deletions src/ragas/dataset_schema.py
@@ -237,8 +237,8 @@ def to_csv(self, path: t.Union[str, Path]):
     def to_jsonl(self, path: t.Union[str, Path]):
         """Converts the dataset to a JSONL file."""
         with open(path, "w") as jsonlfile:
-            for sample in self.samples:
-                jsonlfile.write(json.dumps(sample.to_dict(), ensure_ascii=False) + "\n")
+            for sample in self.to_list():
+                jsonlfile.write(json.dumps(sample, ensure_ascii=False) + "\n")

     @classmethod
     def from_jsonl(cls: t.Type[T], path: t.Union[str, Path]) -> T:
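
As a rough illustration of the pattern `to_jsonl` follows after this change, the sketch below writes a list of plain dictionaries (standing in for the output of `to_list()`) as one JSON object per line and reads them back. The helper names `write_jsonl` and `read_jsonl` and the sample rows are assumptions made for this example, not part of Ragas. Unlike the code in the diff, the sketch passes `encoding="utf-8"` explicitly, because `ensure_ascii=False` writes non-ASCII characters as-is.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def write_jsonl(rows: List[Dict], path: Path) -> None:
    """Write one JSON object per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> List[Dict]:
    """Read the file back, skipping empty lines (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# made-up rows shaped like flattened testset samples
rows = [
    {"user_input": "What is RAG?", "synthesizer_name": "example_synthesizer"},
    {"user_input": "Qu'est-ce que l'évaluation ?", "synthesizer_name": "example_synthesizer"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "testset.jsonl"
    write_jsonl(rows, path)
    restored = read_jsonl(path)
    raw_text = path.read_text(encoding="utf-8")
```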
19 changes: 18 additions & 1 deletion src/ragas/testset/synthesizers/generate.py
@@ -20,11 +20,14 @@
from langchain_core.documents import Document as LCDocument
from langchain_core.language_models import BaseLanguageModel as LangchainLLM

from ragas.embeddings.base import BaseRagasEmbeddings
from ragas.llms.base import BaseRagasLLM
from ragas.testset.synthesizers import QueryDistribution
from ragas.testset.synthesizers.base import BaseScenario


RAGAS_TESTSET_GENERATION_GROUP_NAME = "ragas testset generation"
logger = logging.getLogger(__name__)


@dataclass
@@ -60,6 +63,8 @@ def generate_with_langchain_docs(
         documents: t.Sequence[LCDocument],
         testset_size: int,
         transforms: t.Optional[Transforms] = None,
+        transforms_llm: t.Optional[BaseRagasLLM] = None,
+        transforms_embedding_model: t.Optional[BaseRagasEmbeddings] = None,
         query_distribution: t.Optional[QueryDistribution] = None,
         run_config: t.Optional[RunConfig] = None,
         callbacks: t.Optional[Callbacks] = None,
@@ -69,7 +74,19 @@ def generate_with_langchain_docs(
         """
         Generates an evaluation dataset based on given scenarios and parameters.
         """
-        transforms = transforms or default_transforms()
+        if transforms is None:
+            # use default transforms
+            if transforms_llm is None:
+                transforms_llm = self.llm
+                logger.info("Using TestGenerator.llm for transforms")
+            if transforms_embedding_model is None:
+                raise ValueError(
+                    "embedding_model must be provided for default_transforms. Alternatively you can provide your own transforms through the `transforms` parameter."
+                )
+            transforms = default_transforms(
+                llm=transforms_llm or self.llm,
+                embedding_model=transforms_embedding_model,
+            )

         # convert the documents to Ragas nodes
         nodes = []
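
The fallback order in this hunk can be sketched in isolation. The function `resolve_transforms` and the stand-in `fake_default_transforms` below are hypothetical names for this example only; plain strings stand in for the LLM and embedding objects. Explicit transforms win, a missing transforms LLM falls back to the generator's LLM, and a missing embedding model is an error.

```python
from typing import Any, Callable, List, Optional


def resolve_transforms(
    transforms: Optional[List[Any]],
    transforms_llm: Optional[Any],
    transforms_embedding_model: Optional[Any],
    generator_llm: Any,
    build_default: Callable[..., List[Any]],
) -> List[Any]:
    """Mirror the decision order of the hunk above (illustrative only)."""
    if transforms is not None:
        return transforms  # user-supplied transforms are used untouched
    if transforms_embedding_model is None:
        raise ValueError("an embedding model is required to build the default transforms")
    llm = transforms_llm if transforms_llm is not None else generator_llm
    return build_default(llm=llm, embedding_model=transforms_embedding_model)


def fake_default_transforms(llm: Any, embedding_model: Any) -> List[Any]:
    # stand-in for default_transforms: records which models it received
    return [("extract", llm), ("embed", embedding_model)]


resolved = resolve_transforms(
    None, None, "fake-embeddings", "generator-llm", fake_default_transforms
)
print(resolved)
```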
19 changes: 14 additions & 5 deletions src/ragas/testset/synthesizers/testset_schema.py
@@ -51,7 +51,12 @@ def to_list(self) -> t.List[t.Dict]:
         """
         Converts the Testset to a list of dictionaries.
         """
-        return [sample.model_dump() for sample in self.samples]
+        list_dict = []
+        for sample in self.samples:
+            sample_dict = sample.eval_sample.model_dump(exclude_none=True)
+            sample_dict["synthesizer_name"] = sample.synthesizer_name
+            list_dict.append(sample_dict)
+        return list_dict

     @classmethod
     def from_list(cls, data: t.List[t.Dict]) -> Testset:
@@ -61,19 +66,23 @@ def from_list(cls, data: t.List[t.Dict]) -> Testset:
         # first create the samples
         samples = []
         for sample in data:
-            eval_sample = sample["eval_sample"]
+            synthesizer_name = sample["synthesizer_name"]
+            # remove the synthesizer name from the sample
+            sample.pop("synthesizer_name")
+            # the remaining sample is the eval_sample
+            eval_sample = sample

             # if user_input is a list it is MultiTurnSample
             if "user_input" in eval_sample and not isinstance(
                 eval_sample.get("user_input"), list
             ):
-                eval_sample = SingleTurnSample(**sample["eval_sample"])
+                eval_sample = SingleTurnSample(**eval_sample)
             else:
-                eval_sample = MultiTurnSample(**sample["eval_sample"])
+                eval_sample = MultiTurnSample(**eval_sample)

             samples.append(
                 TestsetSample(
-                    eval_sample=eval_sample, synthesizer_name=sample["synthesizer_name"]
+                    eval_sample=eval_sample, synthesizer_name=synthesizer_name
                 )
             )
         # then create the testset
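
The flattened layout introduced here (sample fields at the top level, plus a `synthesizer_name` key) can be sketched with plain dataclasses. `FakeEvalSample` and `FakeTestsetSample` are hypothetical stand-ins for the Pydantic models used in Ragas, and the `None` filter approximates `model_dump(exclude_none=True)`. One deliberate difference from the diff: this sketch copies each row before popping the key, so the caller's dictionaries are left unchanged.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass
class FakeEvalSample:  # stand-in for SingleTurnSample
    user_input: str
    reference: Optional[str] = None


@dataclass
class FakeTestsetSample:  # stand-in for TestsetSample
    eval_sample: FakeEvalSample
    synthesizer_name: str


def to_list(samples: List[FakeTestsetSample]) -> List[Dict]:
    rows = []
    for sample in samples:
        # drop None fields, then add the synthesizer name at the top level
        row = {k: v for k, v in asdict(sample.eval_sample).items() if v is not None}
        row["synthesizer_name"] = sample.synthesizer_name
        rows.append(row)
    return rows


def from_list(rows: List[Dict]) -> List[FakeTestsetSample]:
    samples = []
    for row in rows:
        fields = dict(row)  # copy so the input row is not mutated
        name = fields.pop("synthesizer_name")
        # whatever remains after removing the name is the eval sample
        samples.append(FakeTestsetSample(FakeEvalSample(**fields), name))
    return samples


original = [FakeTestsetSample(FakeEvalSample("What is RAG?"), "example_synthesizer")]
rows = to_list(original)
restored = from_list(rows)
```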
55 changes: 1 addition & 54 deletions src/ragas/testset/transforms/__init__.py
@@ -1,4 +1,5 @@
 from .base import BaseGraphTransformation, Extractor, RelationshipBuilder, Splitter
+from .default import default_transforms
 from .engine import Parallel, Transforms, apply_transforms, rollback_transforms
 from .extractors import (
     EmbeddingExtractor,
@@ -13,60 +14,6 @@
 )
 from .splitters import HeadlineSplitter

-
-def default_transforms() -> Transforms:
-    """
-    Creates and returns a default set of transforms for processing a knowledge graph.
-
-    This function defines a series of transformation steps to be applied to a
-    knowledge graph, including extracting summaries, keyphrases, titles,
-    headlines, and embeddings, as well as building similarity relationships
-    between nodes.
-
-    The transforms are applied in the following order:
-
-    1. Parallel extraction of summaries and headlines
-    2. Embedding of summaries for document nodes
-    3. Splitting of headlines
-    4. Parallel extraction of embeddings, keyphrases, and titles
-    5. Building cosine similarity relationships between nodes
-    6. Building cosine similarity relationships between summaries
-
-    Returns
-    -------
-    Transforms
-        A list of transformation steps to be applied to the knowledge graph.
-    """
-    from ragas.testset.graph import NodeType
-
-    # define the transforms
-    summary_extractor = SummaryExtractor()
-    keyphrase_extractor = KeyphrasesExtractor()
-    title_extractor = TitleExtractor()
-    headline_extractor = HeadlinesExtractor()
-    embedding_extractor = EmbeddingExtractor()
-    headline_splitter = HeadlineSplitter()
-    cosine_sim_builder = CosineSimilarityBuilder(threshold=0.8)
-    summary_embedder = EmbeddingExtractor(
-        name="summary_embedder",
-        property_name="summary_embedding",
-        embed_property_name="summary",
-        filter_nodes=lambda node: True if node.type == NodeType.DOCUMENT else False,
-    )
-    summary_cosine_sim_builder = SummaryCosineSimilarityBuilder(threshold=0.6)
-
-    # specify the transforms and their order to be applied
-    transforms = [
-        Parallel(summary_extractor, headline_extractor),
-        summary_embedder,
-        headline_splitter,
-        Parallel(embedding_extractor, keyphrase_extractor, title_extractor),
-        cosine_sim_builder,
-        summary_cosine_sim_builder,
-    ]
-    return transforms
-
-
 __all__ = [
     # base
     "BaseGraphTransformation",
80 changes: 80 additions & 0 deletions src/ragas/testset/transforms/default.py
@@ -0,0 +1,80 @@
+from __future__ import annotations
+
+import typing as t
+
+from .engine import Parallel
+from .extractors import (
+    EmbeddingExtractor,
+    HeadlinesExtractor,
+    KeyphrasesExtractor,
+    SummaryExtractor,
+    TitleExtractor,
+)
+from .relationship_builders.cosine import (
+    CosineSimilarityBuilder,
+    SummaryCosineSimilarityBuilder,
+)
+from .splitters import HeadlineSplitter
+
+if t.TYPE_CHECKING:
+    from ragas.embeddings.base import BaseRagasEmbeddings
+    from ragas.llms.base import BaseRagasLLM
+
+    from .engine import Transforms
+
+
+def default_transforms(
+    llm: BaseRagasLLM,
+    embedding_model: BaseRagasEmbeddings,
+) -> Transforms:
+    """
+    Creates and returns a default set of transforms for processing a knowledge graph.
+
+    This function defines a series of transformation steps to be applied to a
+    knowledge graph, including extracting summaries, keyphrases, titles,
+    headlines, and embeddings, as well as building similarity relationships
+    between nodes.
+
+    The transforms are applied in the following order:
+
+    1. Parallel extraction of summaries and headlines
+    2. Embedding of summaries for document nodes
+    3. Splitting of headlines
+    4. Parallel extraction of embeddings, keyphrases, and titles
+    5. Building cosine similarity relationships between nodes
+    6. Building cosine similarity relationships between summaries
+
+    Returns
+    -------
+    Transforms
+        A list of transformation steps to be applied to the knowledge graph.
+    """
+    from ragas.testset.graph import NodeType
+
+    # define the transforms
+    summary_extractor = SummaryExtractor(llm=llm)
+    keyphrase_extractor = KeyphrasesExtractor(llm=llm)
+    title_extractor = TitleExtractor(llm=llm)
+    headline_extractor = HeadlinesExtractor(llm=llm)
+    embedding_extractor = EmbeddingExtractor(embedding_model=embedding_model)
+    headline_splitter = HeadlineSplitter()
+    cosine_sim_builder = CosineSimilarityBuilder(threshold=0.8)
+    summary_embedder = EmbeddingExtractor(
+        name="summary_embedder",
+        property_name="summary_embedding",
+        embed_property_name="summary",
+        filter_nodes=lambda node: True if node.type == NodeType.DOCUMENT else False,
+        embedding_model=embedding_model,
+    )
+    summary_cosine_sim_builder = SummaryCosineSimilarityBuilder(threshold=0.6)
+
+    # specify the transforms and their order to be applied
+    transforms = [
+        Parallel(summary_extractor, headline_extractor),
+        summary_embedder,
+        headline_splitter,
+        Parallel(embedding_extractor, keyphrase_extractor, title_extractor),
+        cosine_sim_builder,
+        summary_cosine_sim_builder,
+    ]
+    return transforms
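
To give a feel for what a cosine-similarity threshold such as the `0.8` above means, here is a small standalone sketch that links every pair of embeddings whose cosine similarity reaches the threshold. The function names, the toy two-dimensional vectors, and the use of `>=` are assumptions made for illustration; this is not the implementation of `CosineSimilarityBuilder`.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similar_pairs(
    embeddings: List[Sequence[float]], threshold: float
) -> List[Tuple[int, int, float]]:
    """Return (i, j, score) for every pair scoring at or above the threshold."""
    pairs = []
    for i, j in combinations(range(len(embeddings)), 2):
        score = cosine_similarity(embeddings[i], embeddings[j])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


# toy embeddings: the first two point in nearly the same direction
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
pairs = similar_pairs(embeddings, threshold=0.8)
print(pairs)
```

Only the first two vectors are linked at this threshold; lowering the threshold produces more, weaker relationships.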
