
Add BERTScore as potential "non-LLM" metric / MetricWithEmbedding for context-recall and context-precision #1555

Open
ahgraber opened this issue Oct 22, 2024 · 4 comments
Labels: enhancement (New feature or request), module-metrics (this is part of metrics module)


@ahgraber (Contributor) commented Oct 22, 2024

Describe the Feature
Add BERTScore as an additional evaluation scorer for context-precision and context-recall.

Why is the feature important for you?

As a RAGAS user trying to evaluate/benchmark my RAG applications, I would like metrics that are more deterministic than LLM-as-a-judge and more indicative of actual retrieval performance than naive string matching. BERTScore uses embeddings to compare documents and is flexible with respect to the embedding model used. It does not require invoking LLMs (so it can be run locally on an analyst's machine) and provides research-validated, replicable scores.
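For illustration, scoring with the bert-score package looks roughly like this (the model choice and inputs below are just examples, not a proposed ragas API):

```python
# Rough sketch using the bert-score package (pip install bert-score).
# Inputs are illustrative; the model runs locally, no LLM calls required.
from bert_score import score

candidates = ["The retrieved chunk answers the user's question."]
references = ["The chunk that was retrieved answers the question."]

# Returns per-pair precision, recall, and F1 tensors.
P, R, F1 = score(
    candidates,
    references,
    model_type="microsoft/deberta-xlarge-mnli",  # recommended model; ~500-token limit
)
print(F1.mean().item())
```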

Additional context

The bert-score package has not been updated recently, and the recommended microsoft/deberta-xlarge-mnli model only supports ~500 tokens. Models with longer context windows (allenai/longformer-large-4096-finetuned-triviaqa and microsoft/deberta-v3) do not perform as well, and modern embedding models (nomic-embed-*, BAAI/bge-*) are untested.

It would be interesting to see what happens using the same embedding model used in the RAG pipeline; that way you would only have to embed the answer, since you already have access to the chunk embeddings from the RAG lookup.

ahgraber added the enhancement label Oct 22, 2024
dosubot (bot) added the module-metrics label Oct 22, 2024
@ahgraber (Contributor, Author) commented

I take back my suggestion re: being able to reuse embeddings -- BERTScore is calculated from token-level embeddings, so the document-level embeddings from the RAG pipeline can't be reused.
To avoid adding bert-score as a ragas dependency, I may just try to hack together a custom Metric that calls BERTScore internally.
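Roughly what I have in mind (an untested sketch against the ragas v0.2 SingleTurnMetric API; the class name, required columns, and model choice are my own guesses):

```python
import typing as t
from dataclasses import dataclass, field

from ragas.dataset_schema import SingleTurnSample
from ragas.metrics.base import MetricType, SingleTurnMetric


@dataclass
class BERTScoreF1(SingleTurnMetric):
    name: str = "bert_score_f1"
    _required_columns: t.Dict[MetricType, t.Set[str]] = field(
        default_factory=lambda: {MetricType.SINGLE_TURN: {"response", "reference"}}
    )

    async def _single_turn_ascore(self, sample: SingleTurnSample, callbacks: t.Any) -> float:
        # Import locally so ragas itself doesn't grow a hard dependency on bert-score.
        from bert_score import score

        _, _, f1 = score(
            [sample.response],
            [sample.reference],
            model_type="microsoft/deberta-xlarge-mnli",
        )
        return float(f1.mean())
```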

@shahules786 (Member) commented Oct 25, 2024

@ahgraber I have used BERTScore in the past; given the pace of improvement in embeddings/LLMs, I just don't feel it makes sense to adopt it anymore. It isn't deterministic either. The idea of non-LLM metrics is to provide highly deterministic, model-free metrics.

But if users feel adding it to ragas adds value, we can take a look at it. If not, please feel free to close the issue.

@ahgraber (Contributor, Author) commented

@shahules786 In your experience, does [cosine] similarity over text embeddings effectively replace BERTScore?
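For concreteness, here is the mechanical difference I'm asking about, as a toy sketch (made-up vectors, no IDF weighting; real token embeddings would come from a contextual encoder):

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def pooled_similarity(cand_tokens: np.ndarray, ref_tokens: np.ndarray) -> float:
    # Document-level: mean-pool token vectors into one vector per text, then cosine.
    return cosine(cand_tokens.mean(axis=0), ref_tokens.mean(axis=0))


def bertscore_f1(cand_tokens: np.ndarray, ref_tokens: np.ndarray) -> float:
    # Token-level: greedy max-similarity matching in both directions, then F1.
    c = cand_tokens / np.linalg.norm(cand_tokens, axis=1, keepdims=True)
    r = ref_tokens / np.linalg.norm(ref_tokens, axis=1, keepdims=True)
    sim = c @ r.T                       # pairwise token cosine similarities
    precision = sim.max(axis=1).mean()  # best reference match per candidate token
    recall = sim.max(axis=0).mean()     # best candidate match per reference token
    return float(2 * precision * recall / (precision + recall))


rng = np.random.default_rng(0)
cand = rng.normal(size=(5, 8))  # 5 candidate tokens, 8-dim embeddings (toy)
ref = rng.normal(size=(7, 8))   # 7 reference tokens
print(pooled_similarity(cand, ref), bertscore_f1(cand, ref))
```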
