
Add BERTScore as potential "non-LLM" metric / MetricWithEmbedding for context-recall and context-precision #1555

Open
ahgraber opened this issue Oct 22, 2024 · 4 comments
Labels: enhancement (New feature or request), module-metrics (this is part of metrics module)


@ahgraber (Contributor) commented Oct 22, 2024

Describe the Feature
Add BERTScore as an additional evaluation scorer for context-precision and context-recall.

Why is the feature important for you?

As a RAGAS user trying to evaluate/benchmark my RAG applications, I would like metrics that are more deterministic than LLM-as-a-judge and more indicative of actual retrieval performance than naive string matching. BERTScore uses embeddings to compare documents and is flexible with respect to the embedding model used. It does not require invoking LLMs (so it can be run locally on an analyst's machine) and provides research-validated, replicable scores.
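For illustration, scoring with the bert-score package looks roughly like this (the model choice and inputs below are just examples, not a proposed ragas API):

```python
# Rough sketch using the bert-score package (pip install bert-score).
# Inputs are illustrative; the model runs locally, no LLM calls required.
from bert_score import score

candidates = ["The retrieved chunk answers the user's question."]
references = ["The chunk that was retrieved answers the question."]

# Returns per-pair precision, recall, and F1 tensors.
P, R, F1 = score(
    candidates,
    references,
    model_type="microsoft/deberta-xlarge-mnli",  # recommended model; ~500-token limit
)
print(F1.mean().item())
```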

Additional context

The bert-score package has not been updated recently, and the recommended microsoft/deberta-xlarge-mnli model only supports ~500 tokens. Models with longer context windows (allenai/longformer-large-4096-finetuned-triviaqa and microsoft/deberta-v3) do not perform as well, and modern embedding models (nomic-embed-*, BAAI/bge-*) are untested.

It would be interesting to see what happens using the same embedding model used in the RAG pipeline; that way you would only have to embed the answer, since you already have access to the chunk embeddings from the RAG lookup.

ahgraber added the enhancement label Oct 22, 2024
dosubot (bot) added the module-metrics label Oct 22, 2024
@ahgraber (Contributor, Author) commented

I take back my suggestion re: being able to reuse embeddings -- BERTScore is calculated from token-level embeddings, so the document-level embeddings from the RAG pipeline can't be reused.
To avoid adding bert-score as a ragas dependency, I may just try to hack together a custom Metric that calls BERTScore internally.
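Roughly what I have in mind (an untested sketch against the ragas v0.2 SingleTurnMetric API; the class name, required columns, and model choice are my own guesses):

```python
import typing as t
from dataclasses import dataclass, field

from ragas.dataset_schema import SingleTurnSample
from ragas.metrics.base import MetricType, SingleTurnMetric


@dataclass
class BERTScoreF1(SingleTurnMetric):
    name: str = "bert_score_f1"
    _required_columns: t.Dict[MetricType, t.Set[str]] = field(
        default_factory=lambda: {MetricType.SINGLE_TURN: {"response", "reference"}}
    )

    async def _single_turn_ascore(self, sample: SingleTurnSample, callbacks: t.Any) -> float:
        # Import locally so ragas itself doesn't grow a hard dependency on bert-score.
        from bert_score import score

        _, _, f1 = score(
            [sample.response],
            [sample.reference],
            model_type="microsoft/deberta-xlarge-mnli",
        )
        return float(f1.mean())
```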

@shahules786 (Member) commented Oct 25, 2024

@ahgraber I have used BERTScore in the past; given the pace of improvement in embeddings/LLMs, I just don't feel it makes sense to adopt it anymore. It isn't deterministic either. The idea of non-LLM metrics is to provide highly deterministic, model-free metrics.

But if users feel adding it to ragas adds value, we can take a look at it. If not, please feel free to close the issue.

@ahgraber (Contributor, Author) commented

@shahules786 In your experience, does [cosine] similarity over text embeddings effectively replace BERTScore?
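For concreteness, here is the mechanical difference I'm asking about, as a toy sketch (made-up vectors, no IDF weighting; real token embeddings would come from a contextual encoder):

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def pooled_similarity(cand_tokens: np.ndarray, ref_tokens: np.ndarray) -> float:
    # Document-level: mean-pool token vectors into one vector per text, then cosine.
    return cosine(cand_tokens.mean(axis=0), ref_tokens.mean(axis=0))


def bertscore_f1(cand_tokens: np.ndarray, ref_tokens: np.ndarray) -> float:
    # Token-level: greedy max-similarity matching in both directions, then F1.
    c = cand_tokens / np.linalg.norm(cand_tokens, axis=1, keepdims=True)
    r = ref_tokens / np.linalg.norm(ref_tokens, axis=1, keepdims=True)
    sim = c @ r.T                       # pairwise token cosine similarities
    precision = sim.max(axis=1).mean()  # best reference match per candidate token
    recall = sim.max(axis=0).mean()     # best candidate match per reference token
    return float(2 * precision * recall / (precision + recall))


rng = np.random.default_rng(0)
cand = rng.normal(size=(5, 8))  # 5 candidate tokens, 8-dim embeddings (toy)
ref = rng.normal(size=(7, 8))   # 7 reference tokens
print(pooled_similarity(cand, ref), bertscore_f1(cand, ref))
```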
