Commit 9b41b93: Merge branch 'main' into fix/zero-division-error-toolcallaccuracy
shahules786 authored Nov 19, 2024
2 parents 2e504f5 + f14cd85 commit 9b41b93
Showing 36 changed files with 1,304 additions and 2,020 deletions.
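The hunks shown below are docs changes only, but the branch name `fix/zero-division-error-toolcallaccuracy` hints that the headline fix guards a metric's ratio against a zero denominator when no tool calls are present. A generic sketch of that guard pattern (an assumption for illustration, not the actual ragas `ToolCallAccuracy` code):

```python
# Hypothetical sketch of the zero-division guard the branch name hints at.
# This is an illustration only, NOT the actual ragas ToolCallAccuracy code.
def safe_ratio(matched: int, total: int) -> float:
    """Return matched/total, falling back to 0.0 when there is nothing to compare."""
    return matched / total if total > 0 else 0.0

print(safe_ratio(3, 4))  # 0.75
print(safe_ratio(0, 0))  # 0.0 rather than raising ZeroDivisionError
```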
121 changes: 11 additions & 110 deletions docs/concepts/metrics/available_metrics/general_purpose.md
@@ -6,7 +6,6 @@ General purpose evaluation metrics are used to evaluate any given task.

`AspectCritic` is an evaluation metric that evaluates responses against predefined aspects written in free-form natural language. The output of aspect critiques is binary, indicating whether the submission aligns with the defined aspect.

**Without reference**

### Example

@@ -28,32 +27,6 @@ scorer = AspectCritic(
await scorer.single_turn_ascore(sample)
```

**With reference**

### Example

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import AspectCritic


sample = SingleTurnSample(
user_input="Where is the Eiffel Tower located?",
response="The Eiffel Tower is located in Paris.",
reference="The Eiffel Tower is located in Paris.",
)

scorer = AspectCritic(
    name="correctness",
    definition="Is the response factually similar to the reference?",
    llm=evaluator_llm,
)

await scorer.single_turn_ascore(sample)

```

### How it works

Critics are essentially basic LLM calls using the defined criteria. For example, let's see how the harmfulness critic works:
@@ -74,41 +47,22 @@ Critics are essentially basic LLM calls using the defined criteria. For example,
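The collapsed hunk holds the library's actual prompt, so the sketch below only illustrates the general mechanism: a critic reduces to a yes/no LLM call, optionally repeated and majority-voted. All names here (`critic_score`, the prompt wording, the `strictness` parameter) are assumptions for illustration, not ragas internals:

```python
from collections import Counter

def critic_score(llm_call, definition: str, response: str, strictness: int = 3) -> int:
    """Ask the same yes/no question `strictness` times and majority-vote the verdicts."""
    prompt = (
        f"Criteria: {definition}\n"
        f"Submission: {response}\n"
        "Answer strictly 'yes' or 'no'."
    )
    verdicts = [llm_call(prompt).strip().lower() == "yes" for _ in range(strictness)]
    # Counter.most_common(1) yields the majority verdict; True -> 1, False -> 0.
    return int(Counter(verdicts).most_common(1)[0][0])

# With a stub LLM that always answers "yes", the harmfulness verdict is 1:
print(critic_score(lambda _: "yes", "Does the submission cause harm?", "..."))  # 1
```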

Coarse-grained evaluation is a metric that assigns an integer score to responses based on a single predefined free-form scoring criterion. The output is an integer within the range specified in the criteria.

**Without Reference**

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import SimpleCriteriaScoreWithoutReference


sample = SingleTurnSample(
user_input="Where is the Eiffel Tower located?",
response="The Eiffel Tower is located in Paris.",
)

scorer = SimpleCriteriaScoreWithoutReference(
    name="course_grained_score",
    definition="Score 0 to 5 for correctness",
    llm=evaluator_llm,
)
await scorer.single_turn_ascore(sample)
```

**With Reference**

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import SimpleCriteriaScoreWithReference
from ragas.metrics import SimpleCriteriaScore


sample = SingleTurnSample(
user_input="Where is the Eiffel Tower located?",
response="The Eiffel Tower is located in Paris.",
reference="The Eiffel Tower is located in Egypt"
)

scorer = SimpleCriteriaScoreWithReference(name="course_grained_score",
definition="Score 0 to 5 by similarity",
llm=evaluator_llm)
scorer = SimpleCriteriaScore(
name="course_grained_score",
definition="Score 0 to 5 by similarity",
llm=evaluator_llm
)

await scorer.single_turn_ascore(sample)
```
@@ -117,14 +71,10 @@ await scorer.single_turn_ascore(sample)

Domain-specific evaluation is a rubric-based metric used to evaluate responses within a specific domain. The rubric consists of a description for each score, typically ranging from 1 to 5. The response is evaluated and scored by the LLM using the descriptions specified in the rubric. This metric has both reference-free and reference-based variations.

### With Reference

Used when you have reference answer to evaluate the responses against.

#### Example
```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import RubricsScoreWithReference
from ragas.metrics import RubricsScore
sample = SingleTurnSample(
user_input="Where is the Eiffel Tower located?",
response="The Eiffel Tower is located in Paris.",
@@ -137,67 +87,18 @@ rubrics = {
"score4_description": "The response is mostly accurate and aligns well with the ground truth, with only minor issues or missing details.",
"score5_description": "The response is fully accurate, aligns completely with the ground truth, and is clear and detailed.",
}
scorer = RubricsScoreWithReference(rubrics=rubrics, llm=evaluator_llm)
scorer = RubricsScore(rubrics=rubrics, llm=evaluator_llm)
await scorer.single_turn_ascore(sample)
```

### Without Reference

Used when you don't have reference answer to evaluate the responses against.

#### Example
```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import RubricsScoreWithoutReference
sample = SingleTurnSample(
user_input="Where is the Eiffel Tower located?",
response="The Eiffel Tower is located in Paris.",
)

scorer = RubricsScoreWithoutReference(rubrics=rubrics, llm=evaluator_llm)
await scorer.single_turn_ascore(sample)
```


## Instance Specific rubrics criteria scoring

Instance-specific evaluation is a rubric-based metric used to evaluate responses per instance, i.e., each instance to be evaluated is annotated with its own rubric-based evaluation criteria. The rubric consists of a description for each score, typically ranging from 1 to 5. The response is evaluated and scored by the LLM using the descriptions specified in the rubric. This metric also has reference-free and reference-based variations. This scoring method is useful when each instance in your dataset requires highly customized evaluation criteria.

### With Reference

Used when you have reference answer to evaluate the responses against.

#### Example
```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import InstanceRubricsWithReference


SingleTurnSample(
user_input="Where is the Eiffel Tower located?",
response="The Eiffel Tower is located in Paris.",
reference="The Eiffel Tower is located in Paris.",
rubrics = {
"score1": "The response is completely incorrect or irrelevant (e.g., 'The Eiffel Tower is in London.' or no mention of the Eiffel Tower).",
"score2": "The response mentions the Eiffel Tower but gives the wrong location or vague information (e.g., 'The Eiffel Tower is in Europe.' or 'It is in France.' without specifying Paris).",
"score3": "The response provides the correct city but with minor factual or grammatical issues (e.g., 'The Eiffel Tower is in Paris, Germany.' or 'The tower is located at Paris.').",
"score4": "The response is correct but lacks some clarity or extra detail (e.g., 'The Eiffel Tower is in Paris, France.' without other useful context or slightly awkward phrasing).",
"score5": "The response is fully correct and matches the reference exactly (e.g., 'The Eiffel Tower is located in Paris.' with no errors or unnecessary details)."
}
)

scorer = InstanceRubricsWithReference(llm=evaluator_llm)
await scorer.single_turn_ascore(sample)
```

### Without Reference

Used when you don't have reference answer to evaluate the responses against.

#### Example
```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import InstanceRubricsScoreWithoutReference
from ragas.metrics import InstanceRubricsScore


SingleTurnSample(
@@ -212,6 +113,6 @@ SingleTurnSample(
}
)

scorer = InstanceRubricsScoreWithoutReference(llm=evaluator_llm)
scorer = InstanceRubricsScore(llm=evaluator_llm)
await scorer.single_turn_ascore(sample)
```
4 changes: 2 additions & 2 deletions docs/concepts/test_data_generation/rag.md
@@ -170,7 +170,7 @@ You can write your own [custom relationship builder]() to establish the relation

```python
from ragas.testset.graph import KnowledgeGraph
from ragas.testset.transforms.relationship_builders.cosine import JaccardSimilarityBuilder
from ragas.testset.transforms.relationship_builders.traditional import JaccardSimilarityBuilder

kg = KnowledgeGraph(nodes=sample_nodes)
rel_builder = JaccardSimilarityBuilder(property_name="entities", key_name="PER", new_property_name="entity_jaccard_similarity")
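`JaccardSimilarityBuilder` links nodes whose `entities` property overlaps; the Jaccard similarity it is named after is simply intersection over union. A minimal standalone sketch of that computation (independent of ragas):

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|, defined as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# One shared entity out of three distinct ones -> 1/3:
print(jaccard_similarity({"Einstein", "Bohr"}, {"Bohr", "Curie"}))
```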
@@ -287,4 +287,4 @@ class EntityQuerySynthesizer(QuerySynthesizer):
"""

return SingleTurnSample(user_input=query, reference_contexts=contexts, reference=reference)
```
```
1 change: 1 addition & 0 deletions docs/extra/components/choose_generator_llm.md
@@ -16,6 +16,7 @@

```python
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
11 changes: 7 additions & 4 deletions docs/howtos/customizations/metrics/_cost.md
@@ -13,6 +13,7 @@ For an example here is one that will parse OpenAI by using a parser we have defined

```python
import os

os.environ["OPENAI_API_KEY"] = "your-api-key"
```

@@ -61,8 +62,6 @@ metric = AspectCriticWithReference(
name="answer_correctness",
definition="is the response correct compared to reference",
)


```

Repo card metadata block was not found. Setting CardData to empty.
@@ -73,8 +72,12 @@ metric = AspectCriticWithReference(
from ragas import evaluate
from ragas.cost import get_token_usage_for_openai

results = evaluate(eval_dataset[:5], metrics=[metric], llm=gpt4o,
token_usage_parser=get_token_usage_for_openai,)
results = evaluate(
eval_dataset[:5],
metrics=[metric],
llm=gpt4o,
token_usage_parser=get_token_usage_for_openai,
)
```

Evaluating: 100%|██████████| 5/5 [00:01<00:00, 2.81it/s]
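Passing `token_usage_parser` lets `evaluate` accumulate prompt and completion tokens per LLM call; turning those totals into a dollar figure is then plain arithmetic. A sketch with illustrative token counts and prices (the numbers and the helper function are assumptions, not ragas' API):

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  cost_per_input_token: float, cost_per_output_token: float) -> float:
    """Dollar cost of a run given total token counts and per-token prices."""
    return input_tokens * cost_per_input_token + output_tokens * cost_per_output_token

# e.g. 116,765 input and 39,031 output tokens at $5 / $15 per million tokens:
print(round(estimate_cost(116_765, 39_031, 5 / 1e6, 15 / 1e6), 4))  # 1.1693
```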
4 changes: 2 additions & 2 deletions docs/howtos/customizations/metrics/_write_your_own_metric.md
@@ -90,9 +90,9 @@ Now let's init the metric with the rubric and evaluator llm and evaluate the data


```python
from ragas.metrics import RubricsScoreWithoutReference
from ragas.metrics import RubricsScore

hallucinations_rubric = RubricsScoreWithoutReference(
hallucinations_rubric = RubricsScore(
name="hallucinations_rubric", llm=evaluator_llm, rubrics=rubric
)

12 changes: 8 additions & 4 deletions docs/howtos/customizations/metrics/cost.ipynb
@@ -29,6 +29,7 @@
"outputs": [],
"source": [
"import os\n",
"\n",
"os.environ[\"OPENAI_API_KEY\"] = \"your-api-key\""
]
},
@@ -105,8 +106,7 @@
"metric = AspectCriticWithReference(\n",
" name=\"answer_correctness\",\n",
" definition=\"is the response correct compared to reference\",\n",
")\n",
"\n"
")"
]
},
{
@@ -126,8 +126,12 @@
"from ragas import evaluate\n",
"from ragas.cost import get_token_usage_for_openai\n",
"\n",
"results = evaluate(eval_dataset[:5], metrics=[metric], llm=gpt4o,\n",
" token_usage_parser=get_token_usage_for_openai,)"
"results = evaluate(\n",
" eval_dataset[:5],\n",
" metrics=[metric],\n",
" llm=gpt4o,\n",
" token_usage_parser=get_token_usage_for_openai,\n",
")"
]
},
{
@@ -160,9 +160,9 @@
}
],
"source": [
"from ragas.metrics import RubricsScoreWithoutReference\n",
"from ragas.metrics import RubricsScore\n",
"\n",
"hallucinations_rubric = RubricsScoreWithoutReference(\n",
"hallucinations_rubric = RubricsScore(\n",
" name=\"hallucinations_rubric\", llm=evaluator_llm, rubrics=rubric\n",
")\n",
"\n",
16 changes: 12 additions & 4 deletions docs/howtos/customizations/testgenerator/_persona_generator.md
@@ -14,9 +14,18 @@ Which we can define as follows:
```python
from ragas.testset.persona import Persona

persona_new_joinee = Persona(name="New Joinee", role_description="Don't know much about the company and is looking for information on how to get started.")
persona_manager = Persona(name="Manager", role_description="Wants to know about the different teams and how they collaborate with each other.")
persona_senior_manager = Persona(name="Senior Manager", role_description="Wants to know about the company vision and how it is executed.")
persona_new_joinee = Persona(
name="New Joinee",
role_description="Don't know much about the company and is looking for information on how to get started.",
)
persona_manager = Persona(
name="Manager",
role_description="Wants to know about the different teams and how they collaborate with each other.",
)
persona_senior_manager = Persona(
name="Senior Manager",
role_description="Wants to know about the company vision and how it is executed.",
)

personas = [persona_new_joinee, persona_manager, persona_senior_manager]
personas
@@ -49,7 +58,6 @@ testset_generator = TestsetGenerator(knowledge_graph=kg, persona_list=personas,
# Generate the Testset
testset = testset_generator.generate(testset_size=10)
testset

```


17 changes: 13 additions & 4 deletions docs/howtos/customizations/testgenerator/persona_generator.ipynb
@@ -38,9 +38,18 @@
"source": [
"from ragas.testset.persona import Persona\n",
"\n",
"persona_new_joinee = Persona(name=\"New Joinee\", role_description=\"Don't know much about the company and is looking for information on how to get started.\")\n",
"persona_manager = Persona(name=\"Manager\", role_description=\"Wants to know about the different teams and how they collaborate with each other.\")\n",
"persona_senior_manager = Persona(name=\"Senior Manager\", role_description=\"Wants to know about the company vision and how it is executed.\")\n",
"persona_new_joinee = Persona(\n",
" name=\"New Joinee\",\n",
" role_description=\"Don't know much about the company and is looking for information on how to get started.\",\n",
")\n",
"persona_manager = Persona(\n",
" name=\"Manager\",\n",
" role_description=\"Wants to know about the different teams and how they collaborate with each other.\",\n",
")\n",
"persona_senior_manager = Persona(\n",
" name=\"Senior Manager\",\n",
" role_description=\"Wants to know about the company vision and how it is executed.\",\n",
")\n",
"\n",
"personas = [persona_new_joinee, persona_manager, persona_senior_manager]\n",
"personas"
@@ -72,7 +81,7 @@
"testset_generator = TestsetGenerator(knowledge_graph=kg, persona_list=personas, llm=llm)\n",
"# Generate the Testset\n",
"testset = testset_generator.generate(testset_size=10)\n",
"testset\n"
"testset"
]
},
{
2 changes: 1 addition & 1 deletion docs/howtos/integrations/_langgraph_agent_evaluation.md
@@ -289,7 +289,7 @@ ragas_trace = convert_to_ragas_messages(result["messages"])


```python
ragas_trace # List of Ragas messages
ragas_trace # List of Ragas messages
```

