[RFC] 057 - RAG Eval & Benchmark #3714

arvinxx · 2024-09-01T09:47:53Z

arvinxx
Sep 1, 2024
Maintainer

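Background

RAG is a deep rabbit hole. To improve its performance, one of the most important things is to build a solid benchmark and then evaluate the effect of each change against it. In https://github.com/cy948/lobe-chat-rag-benchmark, @cy948 describes our approach to building that benchmark.

LobeChat now has basic RAG and asynchronous task flow capabilities, so we can start exploring further automation.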

Approach

RAG evaluation will use the RAGAS framework directly. It requires an evaluation dataset with the following structure:

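{
    'question': 'the question from the Q&A pair',
    'context': ['chunks the RAG retriever returned for the question'],
    'answer': 'the answer generated by the LLM',
    'ground_truth': 'the reference answer from the Q&A pair'
}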

Taking the FinLongEval dataset as an example, a single evaluation pair has the following structure:

{
    'question': 'the question from the Q&A pair',
    'reference_answer': 'the reference answer from the Q&A pair',
    'document_name': 'the document that can be consulted to answer the question'
}

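In LobeChat, a single RAG call yields the following data:

question: the content of the user's question
files: the files the user wants to query
relativeChunks: the text chunks retrieved for the user's content
answer: the answer the LLM returns based on the question and the chunks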

So in theory the existing data fully covers the fields RAGAS needs. Doing this by hand is tedious, though, so we need an automated way to run tasks in batches.
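The field mapping described above could be sketched roughly as follows. The `RagCallResult` shape, the `text` property on a chunk, and the `toRagasRecord` helper are assumptions made for illustration, not LobeChat's actual types. Note that `ground_truth` is not produced by the RAG call itself; it comes from the reference answer in the source dataset.

```typescript
// Data available from one LobeChat RAG call, as listed above.
// The chunk shape is an assumption; only its text is used here.
interface RagCallResult {
  question: string;
  files: string[];
  relativeChunks: { text: string }[];
  answer: string;
}

// Record layout of the evaluation dataset described above.
interface RagasRecord {
  question: string;
  context: string[];
  answer: string;
  ground_truth: string;
}

// Hypothetical helper: combine one RAG call with the reference answer
// taken from the source dataset to produce one evaluation record.
function toRagasRecord(call: RagCallResult, referenceAnswer: string): RagasRecord {
  return {
    question: call.question,
    context: call.relativeChunks.map((chunk) => chunk.text),
    answer: call.answer,
    ground_truth: referenceAnswer,
  };
}
```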

Design

Data import: there needs to be a place to import third-party data, and the data must conform to a specific format.

interface DataSetItem {
  question: string;
  ideal: string;
  referenceFiles?: string[];
}
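As an illustration of the import step, a FinLongEval pair could be converted into this format along the following lines. The mapping of `reference_answer` to `ideal` and of `document_name` to a single-entry `referenceFiles` is an assumption, and `fromFinLongEval` is a hypothetical helper name.

```typescript
// A single FinLongEval evaluation pair, as described in the RFC.
interface FinLongEvalItem {
  question: string;
  reference_answer: string;
  document_name: string;
}

// Import format proposed in the RFC.
interface DataSetItem {
  question: string;
  ideal: string;
  referenceFiles?: string[];
}

// Hypothetical converter; the field mapping is an assumption.
function fromFinLongEval(item: FinLongEvalItem): DataSetItem {
  return {
    question: item.question,
    ideal: item.reference_answer,
    referenceFiles: [item.document_name],
  };
}
```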

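Task execution: clicking Run executes an asynchronous RAG task for each item in the dataset.
Each run produces one execution report. The report data can follow the RAGAS data format and be exported as a file or uploaded to S3.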
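A minimal sketch of such a batch run, assuming a caller-supplied `runRag` function that stands in for LobeChat's async RAG task (its name and signature are hypothetical). It processes items with a bounded number of concurrent workers and collects records in the evaluation format, preserving dataset order.

```typescript
interface DataSetItem {
  question: string;
  ideal: string;
  referenceFiles?: string[];
}

interface EvalRecord {
  question: string;
  context: string[];
  answer: string;
  ground_truth: string;
}

// Hypothetical stand-in for the async RAG task.
type RagRunner = (item: DataSetItem) => Promise<{ answer: string; context: string[] }>;

async function runEvaluation(
  items: DataSetItem[],
  runRag: RagRunner,
  concurrency = 4,
): Promise<EvalRecord[]> {
  const records = new Array<EvalRecord>(items.length);
  let next = 0; // index of the next item to claim

  // Each worker repeatedly claims the next unprocessed item until none remain.
  const worker = async (): Promise<void> => {
    while (next < items.length) {
      const index = next++;
      const item = items[index];
      const { answer, context } = await runRag(item);
      records[index] = { question: item.question, context, answer, ground_truth: item.ideal };
    }
  };

  const workerCount = Math.max(1, Math.min(concurrency, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return records;
}
```

This sketch fails the whole run if any single task rejects; a real pipeline would probably want per-item error handling.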

import { integer, jsonb, pgTable, text, uuid } from 'drizzle-orm/pg-core';

import { createdAt, updatedAt } from './_helpers';
import { asyncTasks } from './asyncTask';
import { users } from './user';

// Dataset table
export const evalDatasets = pgTable('rag_eval_datasets', {
  createdAt: createdAt(),

  description: text('description'),
  id: integer('id').generatedAlwaysAsIdentity({ startWith: 30_000 }).primaryKey(),
  name: text('name').notNull(),
  updatedAt: updatedAt(),
  userId: text('user_id').references(() => users.id, { onDelete: 'cascade' }),
});

// DatasetItem table
export const evalDatasetItems = pgTable('rag_eval_dataset_items', {
  createdAt: createdAt(),
  datasetId: integer('dataset_id')
    .references(() => evalDatasets.id)
    .notNull(),
  id: integer('id').generatedAlwaysAsIdentity().primaryKey(),
  ideal: text('ideal'),
  question: text('question'),

  referenceFiles: text('reference_files').array(),

  userId: text('user_id').references(() => users.id, { onDelete: 'cascade' }),
});

// EvalResult table
export const evalResults = pgTable('rag_eval_results', {
  answer: text('answer').notNull(),
  context: jsonb('context').notNull(),
  createdAt: createdAt(),

  datasetItemId: integer('dataset_item_id')
    .references(() => evalDatasetItems.id)
    .notNull(),
  groundTruth: text('ground_truth').notNull(),
  id: integer('id').generatedAlwaysAsIdentity().primaryKey(),
  question: text('question').notNull(),

  taskId: uuid('task_id')
    .references(() => asyncTasks.id)
    .notNull(),

  userId: text('user_id').references(() => users.id, { onDelete: 'cascade' }),
});

// EvalReport table
export const evalReports = pgTable('rag_eval_reports', {
  createdAt: createdAt(),
  exportUrl: text('export_url'),
  id: integer('id').generatedAlwaysAsIdentity().primaryKey(),
  reportData: jsonb('report_data').notNull(),
  taskId: uuid('task_id')
    .references(() => asyncTasks.id)
    .notNull(),

  userId: text('user_id').references(() => users.id, { onDelete: 'cascade' }),
});
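For the export step, rows read from `rag_eval_results` could be serialized into the evaluation format as JSON Lines, roughly as below. This is a sketch: the `EvalResultRow` type mirrors only a subset of the columns, it assumes the `context` JSONB column holds an array of strings, and `toRagasJsonl` is a hypothetical helper name.

```typescript
// Subset of a rag_eval_results row; assumes `context` stores a string array.
interface EvalResultRow {
  question: string;
  context: string[];
  answer: string;
  groundTruth: string;
}

// Serialize rows as JSON Lines, one evaluation record per line,
// renaming groundTruth to the ground_truth key used by the dataset format.
function toRagasJsonl(rows: EvalResultRow[]): string {
  return rows
    .map((row) =>
      JSON.stringify({
        question: row.question,
        context: row.context,
        answer: row.answer,
        ground_truth: row.groundTruth,
      }),
    )
    .join('\n');
}
```

The resulting string can then be written to a file or uploaded, with the location stored in `exportUrl`.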

进展

🐛 fix: fix .PDF can not be chunked #3720

cy948 · 2024-09-01T15:52:30Z

cy948
Sep 1, 2024

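After looking at the structure of several datasets, I found that we may need to chunk and embed dataset content synchronously, rather than asynchronously as we could before: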

neural-bridge/rag-dataset-12000 & neural-bridge/rag-hallucination-dataset-1000

context: the context to be indexed for RAG;
question: the question;
answer: the reference answer;

glaiveai/RAG-v1

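List of documents for context: the context to be indexed for RAG;
Question: the question;
Answer Mode: whether the model is allowed to answer using its built-in knowledge;
Answer: the reference answer;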
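One way such inline-context records might be fitted into the `DataSetItem` format from the RFC is to wrap each context in a synthetic file. The sketch below uses the three fields listed for the neural-bridge datasets; the `SyntheticFile` type, the file naming scheme, and the `fromInlineContext` helper are all hypothetical.

```typescript
// Record shape listed above for the neural-bridge datasets.
interface InlineContextRecord {
  context: string;
  question: string;
  answer: string;
}

// Import format from the RFC.
interface DataSetItem {
  question: string;
  ideal: string;
  referenceFiles?: string[];
}

// Hypothetical in-memory file that carries the inline context.
interface SyntheticFile {
  name: string;
  content: string;
}

// Wrap the inline context in a synthetic file and reference it from the item.
function fromInlineContext(
  record: InlineContextRecord,
  index: number,
): { item: DataSetItem; file: SyntheticFile } {
  const name = `inline-context-${index}.txt`; // naming scheme is an assumption
  return {
    file: { name, content: record.context },
    item: { question: record.question, ideal: record.answer, referenceFiles: [name] },
  };
}
```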

7 replies

arvinxx Sep 1, 2024
Maintainer Author

Could you be more specific? If it is just the context to be indexed for RAG, can't we get that in the async pipeline as well?

cy948 Sep 1, 2024

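The datasets above differ from FinLongEval. FinLongEval specifies a particular document to retrieve from, so we can prepare for retrieval ahead of time. The datasets above do not specify a document; instead they provide a passage of text for us to retrieve from and answer with.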

arvinxx Sep 1, 2024
Maintainer Author

Is there any difference in the RAG flow between being given text and being given a document?

cy948 Sep 1, 2024

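Hmm, there doesn't seem to be any difference in the flow. How about we make good use of FinLongEval first and build an evaluation pipeline, then deal with the remaining issues later?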

arvinxx Sep 1, 2024
Maintainer Author

OK
