Describe the Feature
Add BERTScore as an additional evaluation scorer for the context-precision and context-recall metrics.
Why is the feature important for you?
As a RAGAS user trying to evaluate/benchmark my RAG applications, I would like metrics that are more deterministic than LLM-as-a-judge and more indicative of actual retrieval performance than naive string matching. BERTScore uses embeddings to compare documents and is flexible with respect to the embedding model used. It does not require invoking LLMs (so it can run locally on an analyst's machine) and produces research-validated, replicable scores.
Additional context
The bert-score package has not been updated recently; the recommended microsoft/deberta-xlarge-mnli model only handles ~500 tokens. Models with longer context windows (allenai/longformer-large-4096-finetuned-triviaqa and microsoft/deberta-v3) do not perform as well, and modern embedding models (nomic-embed-*, BAAI/bge-*) are untested.
It would be interesting to see what happens using the same embedding model used in the RAG pipeline; that way you only have to embed the answer, since you already have access to the chunk embeddings from the RAG lookup.
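For illustration, this is roughly how the bert-score package is invoked today; the candidate/reference strings are placeholders, and in a RAG evaluation they would be retrieved chunks vs. the ground-truth answer:

```python
from bert_score import score

# Placeholder pairs for illustration only.
cands = ["Paris is the capital of France."]
refs = ["The capital city of France is Paris."]

# model_type is optional (bert-score picks a per-language default), but the
# recommended model is microsoft/deberta-xlarge-mnli, with its ~500-token limit.
P, R, F1 = score(cands, refs, model_type="microsoft/deberta-xlarge-mnli")
print(f"P={P.mean().item():.3f} R={R.mean().item():.3f} F1={F1.mean().item():.3f}")
```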
I take back my suggestion re: being able to reuse embeddings -- BERTScore is calculated based on token-level embeddings.
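(Concretely, BERTScore recall greedily matches each reference token to its most similar candidate token, $R_{\mathrm{BERT}} = \frac{1}{|x|}\sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} x_i^{\top}\hat{x}_j$, with precision defined symmetrically over candidate tokens, so sentence- or chunk-level embeddings from the RAG index cannot be substituted for the token-level ones.)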
To avoid adding a dependency on bert-score to ragas itself, I may just try to hack a custom Metric that calls BERTScore internally.
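Something along these lines, as a rough sketch: the ragas Metric base class and sample schema vary by version, so this just wraps bert-score's BERTScorer in a plain class that a custom Metric could delegate to. The class name and the max-over-chunks aggregation are my own choices here, not an established API:

```python
from bert_score import BERTScorer

class BERTScoreContextRecall:
    """Hypothetical scorer: how well do the retrieved contexts cover the
    reference answer, per token-level BERTScore recall (no LLM calls)."""

    def __init__(self, model_type: str = "microsoft/deberta-xlarge-mnli"):
        # BERTScorer loads the model once, so reuse a single instance across rows.
        self.scorer = BERTScorer(model_type=model_type, lang="en")

    def score(self, contexts: list[str], reference: str) -> float:
        # Score each retrieved chunk against the reference answer;
        # BERTScorer.score returns (precision, recall, F1) tensors, one entry per pair.
        refs = [reference] * len(contexts)
        _, recall, _ = self.scorer.score(contexts, refs)
        # Max over chunks: the reference should be covered by at least one
        # retrieved chunk (my assumption, not ragas' official definition).
        return recall.max().item()
```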
@ahgraber I have used BERTScore in the past, but given the rate of improvement in embeddings/LLMs, I don't feel it makes sense to adopt it anymore. It's not fully deterministic either. The idea of the non-LLM metrics is to provide highly deterministic, model-free metrics.
But if users feel adding it to ragas adds value, we can take a look at that. If not, please feel free to close the issue.