feat: improved the testset generation to_pandas and docs (#1536)

explodinggradients · Oct 19, 2024 · 8efe80d
1 parent 2ca00c1
commit 8efe80d

Showing 9 changed files with 226 additions and 63 deletions.
diff --git a/docs/getstarted/rag_evaluation.md b/docs/getstarted/rag_evaluation.md
@@ -51,3 +51,4 @@ df = results.to_pandas()
 df.head()
 ```
 
+![evaluation-result](./raga_evaluation_output.png)
diff --git a/docs/getstarted/rag_testset_generation.md b/docs/getstarted/rag_testset_generation.md
@@ -1,7 +1,10 @@
-## Testset Generation for RAG
+# Testset Generation for RAG
 
 This simple guide will help you generate a testset for evaluating your RAG pipeline using your own documents.
 
+## Quickstart
+Let's walk through a quick example of generating a testset for a RAG pipeline. After that, we will explore the main components of the testset generation pipeline.
+
 ### Load Sample Documents
 
 For the sake of this tutorial we will use sample documents from this [repository](https://huggingface.co/datasets/explodinggradients/Sample_Docs_Markdown). You can replace this with your own documents.
@@ -47,3 +50,109 @@ You may now export and inspect the generated testset.
 ```python
 dataset.to_pandas()
 ```
+
+![testset](./testset_output.png)
+
+
+## A Deeper Look
+
+Now that we have seen how to generate a testset, let's take a closer look at the main components of the testset generation pipeline and how you can quickly customize it.
+
+At its core, testset generation consists of two main operations.
+
+1. **KnowledgeGraph Creation**: We first create a [KnowledgeGraph][ragas.testset.graph.KnowledgeGraph] using the documents you provide and use various [Transformations][ragas.testset.transforms.base.BaseGraphTransformation] to enrich the knowledge graph with additional information that we can use to generate the testset. You can learn more about this from the [core concepts section](../concepts/test_data_generation/rag.md#knowledge-graph-creation).
+2. **Testset Generation**: We use the [KnowledgeGraph][ragas.testset.graph.KnowledgeGraph] to generate a set of [scenarios][ragas.testset.synthesizers.base.BaseScenario]. These scenarios are used to generate the [testset][ragas.testset.synthesizers.generate.Testset]. You can learn more about this from the [core concepts section](../concepts/test_data_generation/rag.md#scenario-generation).
+
+Now let's see an example of how these components work together to generate a testset.
+
+### KnowledgeGraph Creation
+
+Let's first create a [KnowledgeGraph][ragas.testset.graph.KnowledgeGraph] using the documents we loaded earlier.
+
+```python
+from ragas.testset.graph import KnowledgeGraph
+
+kg = KnowledgeGraph()
+```
+```
+KnowledgeGraph(nodes: 0, relationships: 0)
+```
+
+Then add the documents to the knowledge graph.
+
+```python
+from ragas.testset.graph import Node, NodeType
+
+for doc in docs:
+    kg.nodes.append(
+        Node(
+            type=NodeType.DOCUMENT,
+            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
+        )
+    )
+```
+```
+KnowledgeGraph(nodes: 10, relationships: 0)
+```
+
+Now we will enrich the knowledge graph with additional information using [Transformations][ragas.testset.transforms.base.BaseGraphTransformation]. Here we will use [default_transforms][ragas.testset.transforms.default_transforms] to create a set of default transformations to apply with an LLM and Embedding Model of your choice. 
+But you can mix and match transforms or build your own as needed.
+
+```python
+from ragas.testset.transforms import default_transforms, apply_transforms
+
+# choose your LLM and Embedding Model
+from ragas.llms import llm_factory
+from ragas.embeddings import embedding_factory
+
+transformer_llm = llm_factory("gpt-4o")
+embedding_model = embedding_factory("text-embedding-3-large")
+
+trans = default_transforms(llm=transformer_llm, embedding_model=embedding_model)
+apply_transforms(kg, trans)
+```
+
+Now we have a knowledge graph with additional information. You can save the knowledge graph too.
+
+```python
+kg.save("knowledge_graph.json")
+loaded_kg = KnowledgeGraph.load("knowledge_graph.json")
+loaded_kg
+```
+```
+KnowledgeGraph(nodes: 48, relationships: 605)
+```
+
+### Testset Generation
+
+Now we will use the `loaded_kg` to create the [TestsetGenerator][ragas.testset.synthesizers.generate.TestsetGenerator].
+
+```python
+from ragas.testset import TestsetGenerator
+
+generator = TestsetGenerator(llm=generator_llm, knowledge_graph=loaded_kg)
+```
+
+We can also define the distribution of queries we would like to generate. Here, let's use the default distribution.
+
+```python
+from ragas.testset.synthesizers import default_query_distribution
+
+query_distribution = default_query_distribution(generator_llm)
+```
+```
+[
+    (AbstractQuerySynthesizer(llm=generator_llm), 0.25),
+    (ComparativeAbstractQuerySynthesizer(llm=generator_llm), 0.25),
+    (SpecificQuerySynthesizer(llm=generator_llm), 0.5),
+]
+```
+
+Now we can generate the testset.
+
+```python
+testset = generator.generate(testset_size=10, query_distribution=query_distribution)
+testset.to_pandas()
+```
+
+![testset](./testset_output.png)
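
Note on the query distribution documented above: the weights (0.25 / 0.25 / 0.5) describe what share of the testset each synthesizer should contribute. The following is a minimal sketch of one way such weights could be turned into per-synthesizer sample counts, using the largest-remainder method. `allocate_counts` is a hypothetical helper written for illustration; the diff does not show how ragas itself splits `testset_size`, so this should not be read as its actual algorithm.

```python
from math import floor


def allocate_counts(weights, testset_size):
    """Split testset_size across weights with the largest-remainder method."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    exact = [w * testset_size for w in weights]
    counts = [floor(x) for x in exact]
    leftover = testset_size - sum(counts)
    # Give the remaining samples to the largest fractional parts.
    # sorted() is stable, so ties go to the earlier entry.
    order = sorted(
        range(len(weights)), key=lambda i: exact[i] - counts[i], reverse=True
    )
    for i in order[:leftover]:
        counts[i] += 1
    return counts


# Weights taken from the default distribution shown in the docs above.
counts = allocate_counts([0.25, 0.25, 0.5], testset_size=10)
```

With `testset_size=10` the two 0.25 entries cannot both receive 2.5 samples, so any real allocation has to round somewhere; the sketch makes that rounding explicit and deterministic.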
diff --git a/docs/getstarted/raga_evaluation_output.png b/docs/getstarted/raga_evaluation_output.png
diff --git a/docs/getstarted/testset_output.png b/docs/getstarted/testset_output.png
diff --git a/src/ragas/dataset_schema.py b/src/ragas/dataset_schema.py
@@ -237,8 +237,8 @@ def to_csv(self, path: t.Union[str, Path]):
     def to_jsonl(self, path: t.Union[str, Path]):
         """Converts the dataset to a JSONL file."""
         with open(path, "w") as jsonlfile:
-            for sample in self.samples:
-                jsonlfile.write(json.dumps(sample.to_dict(), ensure_ascii=False) + "\n")
+            for sample in self.to_list():
+                jsonlfile.write(json.dumps(sample, ensure_ascii=False) + "\n")
 
     @classmethod
     def from_jsonl(cls: t.Type[T], path: t.Union[str, Path]) -> T:
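
Note on the `to_jsonl` change above: routing the export through `self.to_list()` means a subclass that overrides `to_list()` automatically changes what is written to disk. A minimal sketch of that pattern follows, using hypothetical stand-in classes (`ToyDataset`, `ToyTestset`) rather than the real ragas classes, and an assumed `read_jsonl` helper for the round trip.

```python
import json
import tempfile
from pathlib import Path


class ToyDataset:
    """Hypothetical stand-in for a dataset class; not the ragas class."""

    def __init__(self, rows):
        self.rows = rows

    def to_list(self):
        return [dict(row) for row in self.rows]

    def to_jsonl(self, path):
        # Write one JSON object per line, based on to_list().
        with open(path, "w", encoding="utf-8") as f:
            for row in self.to_list():
                f.write(json.dumps(row, ensure_ascii=False) + "\n")


class ToyTestset(ToyDataset):
    def to_list(self):
        # Overriding to_list() also changes what to_jsonl() writes.
        return [{k: v for k, v in row.items() if v is not None} for row in self.rows]


def read_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "testset.jsonl"
    ToyTestset([{"user_input": "¿Qué es RAG?", "reference": None}]).to_jsonl(path)
    rows = read_jsonl(path)
```

The sketch opens the file with an explicit `encoding="utf-8"`, which matters when `ensure_ascii=False` lets non-ASCII characters through; the hunk above relies on the platform default encoding.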

diff --git a/src/ragas/testset/synthesizers/generate.py b/src/ragas/testset/synthesizers/generate.py
@@ -20,11 +20,14 @@
     from langchain_core.documents import Document as LCDocument
     from langchain_core.language_models import BaseLanguageModel as LangchainLLM
 
+    from ragas.embeddings.base import BaseRagasEmbeddings
+    from ragas.llms.base import BaseRagasLLM
     from ragas.testset.synthesizers import QueryDistribution
     from ragas.testset.synthesizers.base import BaseScenario
 
 
 RAGAS_TESTSET_GENERATION_GROUP_NAME = "ragas testset generation"
+logger = logging.getLogger(__name__)
 
 
 @dataclass
@@ -60,6 +63,8 @@ def generate_with_langchain_docs(
         documents: t.Sequence[LCDocument],
         testset_size: int,
         transforms: t.Optional[Transforms] = None,
+        transforms_llm: t.Optional[BaseRagasLLM] = None,
+        transforms_embedding_model: t.Optional[BaseRagasEmbeddings] = None,
         query_distribution: t.Optional[QueryDistribution] = None,
         run_config: t.Optional[RunConfig] = None,
         callbacks: t.Optional[Callbacks] = None,
@@ -69,7 +74,19 @@ def generate_with_langchain_docs(
         """
         Generates an evaluation dataset based on given scenarios and parameters.
         """
-        transforms = transforms or default_transforms()
+        if transforms is None:
+            # use default transforms
+            if transforms_llm is None:
+                transforms_llm = self.llm
+                logger.info("Using TestsetGenerator.llm for transforms")
+            if transforms_embedding_model is None:
+                raise ValueError(
+                    "transforms_embedding_model must be provided to use default_transforms. Alternatively, you can provide your own transforms through the `transforms` parameter."
+                )
+            transforms = default_transforms(
+                llm=transforms_llm or self.llm,
+                embedding_model=transforms_embedding_model,
+            )
 
         # convert the documents to Ragas nodes
         nodes = []
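
Note on the fallback logic in this hunk: explicit `transforms` win, the generator's own LLM is used when no `transforms_llm` is passed, and a missing embedding model is an error. The same decision order can be sketched as a standalone function. `resolve_transforms` and `fake_default` are hypothetical names used only for illustration, and plain strings stand in for the LLM and embedding objects.

```python
import logging

logger = logging.getLogger(__name__)


def resolve_transforms(
    transforms, transforms_llm, transforms_embedding_model, generator_llm, build_default
):
    """Mirror the decision order of the hunk above (illustrative only)."""
    if transforms is not None:
        # Caller supplied their own transforms: use them untouched.
        return transforms
    if transforms_llm is None:
        transforms_llm = generator_llm
        logger.info("Falling back to the generator LLM for transforms")
    if transforms_embedding_model is None:
        raise ValueError("an embedding model is required for the default transforms")
    return build_default(llm=transforms_llm, embedding_model=transforms_embedding_model)


def fake_default(llm, embedding_model):
    # Stand-in for default_transforms(); records which models were used.
    return [("summary", llm), ("embed", embedding_model)]


resolved = resolve_transforms(None, None, "emb", "gen-llm", fake_default)
```

Because the fallback assigns `transforms_llm` before the defaults are built, the `transforms_llm or self.llm` expression in the hunk is redundant but harmless.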

diff --git a/src/ragas/testset/synthesizers/testset_schema.py b/src/ragas/testset/synthesizers/testset_schema.py
@@ -51,7 +51,12 @@ def to_list(self) -> t.List[t.Dict]:
         """
         Converts the Testset to a list of dictionaries.
         """
-        return [sample.model_dump() for sample in self.samples]
+        list_dict = []
+        for sample in self.samples:
+            sample_dict = sample.eval_sample.model_dump(exclude_none=True)
+            sample_dict["synthesizer_name"] = sample.synthesizer_name
+            list_dict.append(sample_dict)
+        return list_dict
 
     @classmethod
     def from_list(cls, data: t.List[t.Dict]) -> Testset:
@@ -61,19 +66,23 @@ def from_list(cls, data: t.List[t.Dict]) -> Testset:
         # first create the samples
         samples = []
         for sample in data:
-            eval_sample = sample["eval_sample"]
+            synthesizer_name = sample["synthesizer_name"]
+            # remove the synthesizer name from the sample
+            sample.pop("synthesizer_name")
+            # the remaining sample is the eval_sample
+            eval_sample = sample
 
             # if user_input is a list it is MultiTurnSample
             if "user_input" in eval_sample and not isinstance(
                 eval_sample.get("user_input"), list
             ):
-                eval_sample = SingleTurnSample(**sample["eval_sample"])
+                eval_sample = SingleTurnSample(**eval_sample)
             else:
-                eval_sample = MultiTurnSample(**sample["eval_sample"])
+                eval_sample = MultiTurnSample(**eval_sample)
 
             samples.append(
                 TestsetSample(
-                    eval_sample=eval_sample, synthesizer_name=sample["synthesizer_name"]
+                    eval_sample=eval_sample, synthesizer_name=synthesizer_name
                 )
             )
         # then create the testset
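
Note on the new flat row format: `to_list()` now emits the evaluation sample's fields at the top level plus a `synthesizer_name` key, and `from_list()` reverses that by popping the key. A minimal sketch of the round trip on plain dictionaries follows. `flatten` and `unflatten` are hypothetical helpers, and the `"single_turn"` / `"multi_turn"` labels stand in for the `SingleTurnSample` / `MultiTurnSample` classes.

```python
from typing import Any, Dict, Tuple


def flatten(eval_sample: Dict[str, Any], synthesizer_name: str) -> Dict[str, Any]:
    # Drop None fields (like exclude_none=True) and add the synthesizer name.
    row = {k: v for k, v in eval_sample.items() if v is not None}
    row["synthesizer_name"] = synthesizer_name
    return row


def unflatten(row: Dict[str, Any]) -> Tuple[str, Dict[str, Any], str]:
    fields = dict(row)  # copy, so the caller's dictionary is left intact
    synthesizer_name = fields.pop("synthesizer_name")
    # Same rule as the hunk: a non-list user_input means a single-turn sample.
    is_single = "user_input" in fields and not isinstance(fields["user_input"], list)
    kind = "single_turn" if is_single else "multi_turn"
    return kind, fields, synthesizer_name


row = flatten({"user_input": "What is RAG?", "reference": None}, "specific_query")
kind, fields, name = unflatten(row)
```

Note that `from_list` in the hunk pops `synthesizer_name` directly from each input dictionary, so the caller's data is modified; copying first, as in the sketch, avoids that side effect.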

diff --git a/src/ragas/testset/transforms/__init__.py b/src/ragas/testset/transforms/__init__.py
@@ -1,4 +1,5 @@
 from .base import BaseGraphTransformation, Extractor, RelationshipBuilder, Splitter
+from .default import default_transforms
 from .engine import Parallel, Transforms, apply_transforms, rollback_transforms
 from .extractors import (
     EmbeddingExtractor,
@@ -13,60 +14,6 @@
 )
 from .splitters import HeadlineSplitter
 
-
-def default_transforms() -> Transforms:
-    """
-    Creates and returns a default set of transforms for processing a knowledge graph.
-
-    This function defines a series of transformation steps to be applied to a
-    knowledge graph, including extracting summaries, keyphrases, titles,
-    headlines, and embeddings, as well as building similarity relationships
-    between nodes.
-
-    The transforms are applied in the following order:
-    1. Parallel extraction of summaries and headlines
-    2. Embedding of summaries for document nodes
-    3. Splitting of headlines
-    4. Parallel extraction of embeddings, keyphrases, and titles
-    5. Building cosine similarity relationships between nodes
-    6. Building cosine similarity relationships between summaries
-
-    Returns
-    -------
-    Transforms
-        A list of transformation steps to be applied to the knowledge graph.
-
-    """
-    from ragas.testset.graph import NodeType
-
-    # define the transforms
-    summary_extractor = SummaryExtractor()
-    keyphrase_extractor = KeyphrasesExtractor()
-    title_extractor = TitleExtractor()
-    headline_extractor = HeadlinesExtractor()
-    embedding_extractor = EmbeddingExtractor()
-    headline_splitter = HeadlineSplitter()
-    cosine_sim_builder = CosineSimilarityBuilder(threshold=0.8)
-    summary_embedder = EmbeddingExtractor(
-        name="summary_embedder",
-        property_name="summary_embedding",
-        embed_property_name="summary",
-        filter_nodes=lambda node: True if node.type == NodeType.DOCUMENT else False,
-    )
-    summary_cosine_sim_builder = SummaryCosineSimilarityBuilder(threshold=0.6)
-
-    # specify the transforms and their order to be applied
-    transforms = [
-        Parallel(summary_extractor, headline_extractor),
-        summary_embedder,
-        headline_splitter,
-        Parallel(embedding_extractor, keyphrase_extractor, title_extractor),
-        cosine_sim_builder,
-        summary_cosine_sim_builder,
-    ]
-    return transforms
-
-
 __all__ = [
     # base
     "BaseGraphTransformation",

diff --git a/src/ragas/testset/transforms/default.py b/src/ragas/testset/transforms/default.py
@@ -0,0 +1,80 @@
+from __future__ import annotations
+
+import typing as t
+
+from .engine import Parallel
+from .extractors import (
+    EmbeddingExtractor,
+    HeadlinesExtractor,
+    KeyphrasesExtractor,
+    SummaryExtractor,
+    TitleExtractor,
+)
+from .relationship_builders.cosine import (
+    CosineSimilarityBuilder,
+    SummaryCosineSimilarityBuilder,
+)
+from .splitters import HeadlineSplitter
+
+if t.TYPE_CHECKING:
+    from ragas.embeddings.base import BaseRagasEmbeddings
+    from ragas.llms.base import BaseRagasLLM
+
+    from .engine import Transforms
+
+
+def default_transforms(
+    llm: BaseRagasLLM,
+    embedding_model: BaseRagasEmbeddings,
+) -> Transforms:
+    """
+    Creates and returns a default set of transforms for processing a knowledge graph.
+
+    This function defines a series of transformation steps to be applied to a
+    knowledge graph, including extracting summaries, keyphrases, titles,
+    headlines, and embeddings, as well as building similarity relationships
+    between nodes.
+
+    The transforms are applied in the following order:
+    1. Parallel extraction of summaries and headlines
+    2. Embedding of summaries for document nodes
+    3. Splitting of headlines
+    4. Parallel extraction of embeddings, keyphrases, and titles
+    5. Building cosine similarity relationships between nodes
+    6. Building cosine similarity relationships between summaries
+
+    Returns
+    -------
+    Transforms
+        A list of transformation steps to be applied to the knowledge graph.
+
+    """
+    from ragas.testset.graph import NodeType
+
+    # define the transforms
+    summary_extractor = SummaryExtractor(llm=llm)
+    keyphrase_extractor = KeyphrasesExtractor(llm=llm)
+    title_extractor = TitleExtractor(llm=llm)
+    headline_extractor = HeadlinesExtractor(llm=llm)
+    embedding_extractor = EmbeddingExtractor(embedding_model=embedding_model)
+    headline_splitter = HeadlineSplitter()
+    cosine_sim_builder = CosineSimilarityBuilder(threshold=0.8)
+    summary_embedder = EmbeddingExtractor(
+        name="summary_embedder",
+        property_name="summary_embedding",
+        embed_property_name="summary",
+        filter_nodes=lambda node: node.type == NodeType.DOCUMENT,
+        embedding_model=embedding_model,
+    )
+    summary_cosine_sim_builder = SummaryCosineSimilarityBuilder(threshold=0.6)
+
+    # specify the transforms and their order to be applied
+    transforms = [
+        Parallel(summary_extractor, headline_extractor),
+        summary_embedder,
+        headline_splitter,
+        Parallel(embedding_extractor, keyphrase_extractor, title_extractor),
+        cosine_sim_builder,
+        summary_cosine_sim_builder,
+    ]
+    return transforms
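
Note on the similarity builders configured above (`threshold=0.8` for node embeddings, `threshold=0.6` for summary embeddings): the idea is that two nodes are linked when the cosine similarity of their embeddings clears the threshold. Below is a minimal pure-Python sketch of that idea. `similar_pairs` is a hypothetical helper, the toy 2-d vectors are made up, and whether ragas compares with `>=` or `>` is an assumption not visible in this diff.

```python
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similar_pairs(embeddings, threshold):
    """Return the id pairs whose embeddings are at least `threshold` similar."""
    return [
        (id_a, id_b)
        for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2)
        if cosine(vec_a, vec_b) >= threshold
    ]


# Toy embeddings: "a" and "b" point in nearly the same direction, "c" does not.
embeddings = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
pairs = similar_pairs(embeddings, threshold=0.8)
```

Lowering the threshold links more nodes, which is consistent with the looser 0.6 value used for summaries in the defaults above.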