How to set the language type of the generated questions when creating a test set #1485
Comments
To generate questions in Chinese using the TestsetGenerator:
from ragas.testset import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.document_loaders import PyPDFLoader
# Load and prepare documents
loader = PyPDFLoader("xx.pdf")
documents = loader.load_and_split()
for document in documents:
    document.metadata['filename'] = document.metadata['source']
# Initialize models
generator_llm = ChatOpenAI(
    model="Qwen2",
    temperature=0.3,
    openai_api_key="xxx",
    openai_api_base='xxx',
    stop=['<im_end>']
)
critic_llm = ChatOpenAI(
    model="Qwen2",
    temperature=0.3,
    openai_api_key="xxx",
    openai_api_base='xxx',
    stop=['<im_end>']
)
# Raw string so the backslashes in the local model path are not treated as escapes
embedding_model_name = r"\embedding\bge-large-zh-v1.5"
embedding_model_kwargs = {'device': 'cpu'}
embedding_encode_kwargs = {'batch_size': 32, 'normalize_embeddings': True}
embed_model = HuggingFaceEmbeddings(
    model_name=embedding_model_name,
    model_kwargs=embedding_model_kwargs,
    encode_kwargs=embedding_encode_kwargs
)
# Create TestsetGenerator
generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embed_model
)
# Adapt to Chinese language
language = "chinese"
generator.adapt(language, evolutions=[simple, reasoning, multi_context])
# Generate test set
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=1,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25}
)
df = testset.to_pandas()
df.to_csv("test_set.csv", index=False, encoding='utf-8_sig')
This code adapts the generator to Chinese before generating the test set.
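Once a test set has been generated, a quick sanity check can confirm that the questions actually came out in Chinese. The sketch below is not part of ragas. The helper names `cjk_ratio` and `looks_chinese` and the 0.3 threshold are hypothetical choices made for illustration, and the check only counts characters in the basic CJK Unified Ideographs range.

```python
def cjk_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are CJK ideographs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    # U+4E00..U+9FFF is the basic CJK Unified Ideographs block.
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)


def looks_chinese(text: str, threshold: float = 0.3) -> bool:
    """Heuristic: treat text as Chinese if enough of its characters are CJK.

    The threshold is an arbitrary assumption; tune it for your data.
    """
    return cjk_ratio(text) >= threshold


if __name__ == "__main__":
    # Hypothetical sample questions, standing in for generated output.
    samples = [
        "什么是检索增强生成？",
        "What is retrieval-augmented generation?",
    ]
    for s in samples:
        print(looks_chinese(s), s)
```

Assuming the DataFrame from `testset.to_pandas()` has a `question` column (the column name may differ between ragas versions), something like `df["question"].map(looks_chinese)` would flag rows that were not generated in the target language.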
But according to the code you provided, the following error occurred. What should I do? @dosu
Hi @Z-oo883, this is a pending item for the new test set generation, and we will add it to our roadmap.
Would love to see this feature soon! Thanks.
@shahules786 adapting the prompts for the synthesizers should help here, right? @tuan3w thanks for letting us know 🙂 - we'll get this sorted quickly. Which language are you looking to adapt to?
Hi @jjmachan, we build LLM agents for enterprises in Vietnam, so we would love to have Vietnamese support. We are building I think I can adapt the synthesizer to support Vietnamese via this merge request: Thanks.
@tuan3w I'll be working on this next week, and will experiment and update the results/docs for this issue. Please keep the feedback coming.
How to set the language type of the generated questions when creating a test set?
code: