
How to set the language type of the generated questions when creating a test set #1485

Open
Z-oo883 opened this issue Oct 14, 2024 · 8 comments
Labels: module-testsetgen (Module testset generation), question (Further information is requested)

Comments


Z-oo883 commented Oct 14, 2024

How to set the language type of the generated questions when creating a test set?
code:

import nest_asyncio
nest_asyncio.apply()
from ragas.testset import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("xx.pdf")
documents = loader.load_and_split()
print(documents)
for document in documents:
    document.metadata['filename'] = document.metadata['source']
# generator with openai models
generator_llm = ChatOpenAI(
    model="Qwen2",
    temperature=0.3,
    openai_api_key="xxx",
    openai_api_base='xxx',
    stop=['<|im_end|>']
)
critic_llm = ChatOpenAI(
    model="Qwen2",
    temperature=0.3,
    openai_api_key="xxx",
    openai_api_base='xxx',
    stop=['<|im_end|>']
)
embedding_model_name = "\embedding\\bge-large-zh-v1.5"
embedding_model_kwargs = {'device': 'cpu'}
embedding_encode_kwargs = {'batch_size': 32, 'normalize_embeddings': True}

embed_model = HuggingFaceEmbeddings(
    model_name=embedding_model_name,
    model_kwargs=embedding_model_kwargs,
    encode_kwargs=embedding_encode_kwargs
)

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embed_model
)

# generate testset
testset = generator.generate_with_langchain_docs(documents, test_size=1, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})
df = testset.to_pandas()
print(testset)
df.to_csv("test_set.csv",index=False,encoding='utf-8_sig')
Z-oo883 added the question (Further information is requested) label on Oct 14, 2024
dosubot bot added the module-testsetgen (Module testset generation) label on Oct 14, 2024
Z-oo883 changed the title from "When creating a test set, the question of how to make the build is Chinese" to "How to make the generated questions in Chinese when creating a test set" on Oct 14, 2024

Z-oo883 commented Oct 14, 2024

@dosu


dosubot bot commented Oct 14, 2024

To generate questions in Chinese using the TestsetGenerator, you need to specify the language when adapting the generator. Here's how you can do it:

from ragas.testset import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.document_loaders import PyPDFLoader

# Load and prepare documents
loader = PyPDFLoader("xx.pdf")
documents = loader.load_and_split()
for document in documents:
    document.metadata['filename'] = document.metadata['source']

# Initialize models
generator_llm = ChatOpenAI(
    model="Qwen2",
    temperature=0.3,
    openai_api_key="xxx",
    openai_api_base='xxx',
    stop=['<|im_end|>']
)
critic_llm = ChatOpenAI(
    model="Qwen2",
    temperature=0.3,
    openai_api_key="xxx",
    openai_api_base='xxx',
    stop=['<|im_end|>']
)
embedding_model_name = "\embedding\\bge-large-zh-v1.5"
embedding_model_kwargs = {'device': 'cpu'}
embedding_encode_kwargs = {'batch_size': 32, 'normalize_embeddings': True}

embed_model = HuggingFaceEmbeddings(
    model_name=embedding_model_name,
    model_kwargs=embedding_model_kwargs,
    encode_kwargs=embedding_encode_kwargs
)

# Create TestsetGenerator
generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embed_model
)

# Adapt to Chinese language
language = "chinese"
generator.adapt(language, evolutions=[simple, reasoning, multi_context])

# Generate test set
testset = generator.generate_with_langchain_docs(documents, test_size=1, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})
df = testset.to_pandas()
df.to_csv("test_set.csv", index=False, encoding='utf-8_sig')

This code adapts the generator to Chinese before generating the test set.

To continue talking to Dosu, mention @dosu.

Z-oo883 changed the title from "How to make the generated questions in Chinese when creating a test set" to "How to set the language type of the generated questions when creating a test set" on Oct 14, 2024

Z-oo883 commented Oct 14, 2024

(Z-oo883 quoted Dosu's suggested adaptation code from the comment above.)

But according to the code you provided, the following error occurred. What should I do? @dosu
Traceback (most recent call last):
  File "D:\RA_LLM\pythonProject\generate_test_data.py", line 53, in <module>
    generator.adapt(language, evolutions=[simple, reasoning, multi_context])
  File "D:\anaconda\envs\ragas\lib\site-packages\ragas\testset\generator.py", line 340, in adapt
    self.docstore.extractor.adapt(language, cache_dir=cache_dir)
  File "D:\anaconda\envs\ragas\lib\site-packages\ragas\testset\extractor.py", line 61, in adapt
    self.extractor_prompt = self.extractor_prompt.adapt(
  File "D:\anaconda\envs\ragas\lib\site-packages\ragas\llms\prompt.py", line 185, in adapt
    self_cp = self._load(language, self.name, cache_dir)
  File "D:\anaconda\envs\ragas\lib\site-packages\ragas\llms\prompt.py", line 286, in _load
    return cls(**json.load(open(path)))
  File "D:\anaconda\envs\ragas\lib\site-packages\pydantic\v1\main.py", line 341, in __init__
    raise validation_error
pydantic.v1.error_wrappers.ValidationError: 1 validation error for Prompt
__root__
  output in example 1 is not in valid json format: Expecting value: line 1 column 1 (char 0) (type=value_error)
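
[Aside, not from the thread: this ValidationError is raised while Prompt.adapt loads a cached adapted prompt whose example output is not valid JSON. A hedged workaround sketch is to re-run the adaptation against a fresh cache directory so the prompts are re-translated; the cache_dir keyword and the save() call below are assumptions based on the traceback and the 0.1-style docs, so verify them against the installed ragas version. Whether re-translation succeeds still depends on the generator/critic model returning valid JSON.]

import shutil

# Hypothetical workaround sketch: start from an empty prompt cache so the
# adapted (translated) prompts are regenerated instead of reloaded.
fresh_cache = "./ragas_prompt_cache_zh"          # hypothetical local cache dir
shutil.rmtree(fresh_cache, ignore_errors=True)   # drop any stale adapted prompts

generator.adapt(
    language,
    evolutions=[simple, reasoning, multi_context],
    cache_dir=fresh_cache,   # assumed keyword; the traceback shows adapt passing cache_dir through
)
generator.save(evolutions=[simple, reasoning, multi_context], cache_dir=fresh_cache)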

@shahules786 (Member)

Hi @Z-oo883, this is a pending item for the new test set generation and we will add it to our roadmap.


tuan3w commented Oct 23, 2024

Would love to see this feature soon! Thanks.

@jjmachan (Member)

@shahules786 would adapting the prompts for the synthesizers help here, right?

@tuan3w thanks for letting us know 🙂 - we'll get this sorted quickly. Which language are you looking to adapt to?


tuan3w commented Oct 23, 2024

Hi @jjmachan

We build LLM agents for enterprises in Vietnam, so we would love to have Vietnamese support. We are building auto-testing systems and want to explore different solutions for data synthesis. I recently heard about the Ragas 0.2 release and would love to test it more.

I think I can adapt the synthesizer to support Vietnamese via this merge request: #1429, with a little trick to bypass the language check, but I'm not sure if it's the best way to do so. The documentation on the website isn't very clear; maybe it can be improved further.

Thanks.
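
[Aside, not from the thread: a minimal sketch of the adaptation tuan3w describes, written against the ragas 0.2 prompt API that #1429 touches. The default_query_distribution, adapt_prompts, and set_prompts names and the query_distribution argument are assumptions about that API; check the installed version before relying on them.]

import asyncio

from ragas.testset.synthesizers import default_query_distribution

async def adapt_distribution_to(language, llm):
    # Translate every query synthesizer's prompts with the generator LLM,
    # then install the translated prompts back on the synthesizer.
    distribution = default_query_distribution(llm)   # assumed: [(synthesizer, weight), ...]
    for synthesizer, _ in distribution:
        adapted_prompts = await synthesizer.adapt_prompts(language, llm=llm)
        synthesizer.set_prompts(**adapted_prompts)
    return distribution

# `generator_llm` is assumed to be a ragas-wrapped LLM, e.g.
# LangchainLLMWrapper(ChatOpenAI(...)). The adapted distribution could then be
# passed to generate_with_langchain_docs via its query_distribution argument.
vi_distribution = asyncio.run(adapt_distribution_to("vietnamese", generator_llm))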

@shahules786 (Member)

@tuan3w I'll be working on this next week and will experiment and update the results/docs for this issue. Please keep the feedback coming.
