
How to set the language type of the generated questions when creating a test set #1485

Open
Z-oo883 opened this issue Oct 14, 2024 · 8 comments
Labels: module-testsetgen (Module testset generation), question (Further information is requested)

Comments


Z-oo883 commented Oct 14, 2024

How to set the language type of the generated questions when creating a test set?
code:

import nest_asyncio
nest_asyncio.apply()
from ragas.testset import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("xx.pdf")
documents = loader.load_and_split()
print(documents)
for document in documents:
    document.metadata['filename'] = document.metadata['source']
# generator with openai models
generator_llm = ChatOpenAI(
    model="Qwen2",
    temperature=0.3,
    openai_api_key="xxx",
    openai_api_base='xxx',
    stop=['<|im_end|>']
)
critic_llm = ChatOpenAI(
    model="Qwen2",
    temperature=0.3,
    openai_api_key="xxx",
    openai_api_base='xxx',
    stop=['<|im_end|>']
)
embedding_model_name = "\embedding\\bge-large-zh-v1.5"
embedding_model_kwargs = {'device': 'cpu'}
embedding_encode_kwargs = {'batch_size': 32, 'normalize_embeddings': True}

embed_model = HuggingFaceEmbeddings(
    model_name=embedding_model_name,
    model_kwargs=embedding_model_kwargs,
    encode_kwargs=embedding_encode_kwargs
)

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embed_model
)

# generate testset
testset = generator.generate_with_langchain_docs(documents, test_size=1, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})
df = testset.to_pandas()
print(testset)
df.to_csv("test_set.csv",index=False,encoding='utf-8_sig')
Z-oo883 added the question (Further information is requested) label on Oct 14, 2024
dosubot bot added the module-testsetgen (Module testset generation) label on Oct 14, 2024
Z-oo883 changed the title from "When creating a test set, the question of how to make the build is Chinese" to "How to make the generated questions in Chinese when creating a test set" on Oct 14, 2024

Z-oo883 commented Oct 14, 2024

@dosu


dosubot bot commented Oct 14, 2024

To generate questions in Chinese using the TestsetGenerator, you need to specify the language when adapting the generator. Here's how you can do it:

from ragas.testset import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.document_loaders import PyPDFLoader

# Load and prepare documents
loader = PyPDFLoader("xx.pdf")
documents = loader.load_and_split()
for document in documents:
    document.metadata['filename'] = document.metadata['source']

# Initialize models
generator_llm = ChatOpenAI(
    model="Qwen2",
    temperature=0.3,
    openai_api_key="xxx",
    openai_api_base='xxx',
    stop=['<|im_end|>']
)
critic_llm = ChatOpenAI(
    model="Qwen2",
    temperature=0.3,
    openai_api_key="xxx",
    openai_api_base='xxx',
    stop=['<|im_end|>']
)
embedding_model_name = "\embedding\\bge-large-zh-v1.5"
embedding_model_kwargs = {'device': 'cpu'}
embedding_encode_kwargs = {'batch_size': 32, 'normalize_embeddings': True}

embed_model = HuggingFaceEmbeddings(
    model_name=embedding_model_name,
    model_kwargs=embedding_model_kwargs,
    encode_kwargs=embedding_encode_kwargs
)

# Create TestsetGenerator
generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embed_model
)

# Adapt to Chinese language
language = "chinese"
generator.adapt(language, evolutions=[simple, reasoning, multi_context])

# Generate test set
testset = generator.generate_with_langchain_docs(documents, test_size=1, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})
df = testset.to_pandas()
df.to_csv("test_set.csv", index=False, encoding='utf-8_sig')

This code adapts the generator to Chinese before generating the test set.

To continue talking to Dosu, mention @dosu.

Z-oo883 changed the title from "How to make the generated questions in Chinese when creating a test set" to "How to set the language type of the generated questions when creating a test set" on Oct 14, 2024

Z-oo883 commented Oct 14, 2024

(Z-oo883 quoted Dosu's suggested adaptation code from the comment above.)

But according to the code you provided, the following error occurred. What should I do? @dosu
Traceback (most recent call last):
  File "D:\RA_LLM\pythonProject\generate_test_data.py", line 53, in <module>
    generator.adapt(language, evolutions=[simple, reasoning, multi_context])
  File "D:\anaconda\envs\ragas\lib\site-packages\ragas\testset\generator.py", line 340, in adapt
    self.docstore.extractor.adapt(language, cache_dir=cache_dir)
  File "D:\anaconda\envs\ragas\lib\site-packages\ragas\testset\extractor.py", line 61, in adapt
    self.extractor_prompt = self.extractor_prompt.adapt(
  File "D:\anaconda\envs\ragas\lib\site-packages\ragas\llms\prompt.py", line 185, in adapt
    self_cp = self._load(language, self.name, cache_dir)
  File "D:\anaconda\envs\ragas\lib\site-packages\ragas\llms\prompt.py", line 286, in _load
    return cls(**json.load(open(path)))
  File "D:\anaconda\envs\ragas\lib\site-packages\pydantic\v1\main.py", line 341, in __init__
    raise validation_error
pydantic.v1.error_wrappers.ValidationError: 1 validation error for Prompt
__root__
  output in example 1 is not in valid json format: Expecting value: line 1 column 1 (char 0) (type=value_error)
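
[Aside, not from the thread: this ValidationError is raised while Prompt.adapt loads a cached adapted prompt whose example output is not valid JSON. A hedged workaround sketch is to re-run the adaptation against a fresh cache directory so the prompts are re-translated; the cache_dir keyword and the save() call below are assumptions based on the traceback and the 0.1-style docs, so verify them against the installed ragas version. Whether re-translation succeeds still depends on the generator/critic model returning valid JSON.]

import shutil

# Hypothetical workaround sketch: start from an empty prompt cache so the
# adapted (translated) prompts are regenerated instead of reloaded.
fresh_cache = "./ragas_prompt_cache_zh"          # hypothetical local cache dir
shutil.rmtree(fresh_cache, ignore_errors=True)   # drop any stale adapted prompts

generator.adapt(
    language,
    evolutions=[simple, reasoning, multi_context],
    cache_dir=fresh_cache,   # assumed keyword; the traceback shows adapt passing cache_dir through
)
generator.save(evolutions=[simple, reasoning, multi_context], cache_dir=fresh_cache)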

@shahules786 (Member)

Hi @Z-oo883, this is a pending item for the new test set generation and we will add it to our roadmap.


tuan3w commented Oct 23, 2024

Would love to see this feature soon! Thanks.

@jjmachan (Member)

@shahules786 would adapting the prompts for the synthesizers help here, right?

@tuan3w thanks for letting us know 🙂 - we'll get this sorted quickly. Which language are you looking to adapt to?


tuan3w commented Oct 23, 2024

Hi @jjmachan

We build LLM agents for enterprises in Vietnam, so we would love to have Vietnamese support. We are building auto-testing systems and want to explore different solutions for data synthesis. I recently heard about the Ragas 0.2 release and would love to test it more.

I think I can adapt the synthesizer to support Vietnamese via this merge request: #1429, with a little trick to bypass the language check, but I'm not sure if it's the best way to do so. The documentation on the website isn't very clear; maybe it can be improved further.

Thanks.
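
[Aside, not from the thread: a minimal sketch of the adaptation tuan3w describes, written against the ragas 0.2 prompt API that #1429 touches. The default_query_distribution, adapt_prompts, and set_prompts names and the query_distribution argument are assumptions about that API; check the installed version before relying on them.]

import asyncio

from ragas.testset.synthesizers import default_query_distribution

async def adapt_distribution_to(language, llm):
    # Translate every query synthesizer's prompts with the generator LLM,
    # then install the translated prompts back on the synthesizer.
    distribution = default_query_distribution(llm)   # assumed: [(synthesizer, weight), ...]
    for synthesizer, _ in distribution:
        adapted_prompts = await synthesizer.adapt_prompts(language, llm=llm)
        synthesizer.set_prompts(**adapted_prompts)
    return distribution

# `generator_llm` is assumed to be a ragas-wrapped LLM, e.g.
# LangchainLLMWrapper(ChatOpenAI(...)). The adapted distribution could then be
# passed to generate_with_langchain_docs via its query_distribution argument.
vi_distribution = asyncio.run(adapt_distribution_to("vietnamese", generator_llm))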

@shahules786 (Member)

@tuan3w I'll be working on this next week and will experiment and update the results/docs for this issue. Please keep the feedback coming.
