Replies: 1 comment
-
As GPT said :D. I can log the resulting prompt with this line of code:
-
I'm sorry, but I can't find any information on this written for a regular human :)
I successfully ran a normal fine-tuning with my dataset, and now I'm interested in doing the same with a reward model scoring responses in real time.
I've noticed that even Qwen 3B can give a good score for a prompt-response pair.
I'd like to know how to use it with LLaMA-Factory PPO fine-tuning, and how to prepare everything to get started.
I've found the line where the score is calculated (in discussion #1487), but it's not clear what the score value should look like.
I mean, the reward model should return a float value, but Qwen needs an additional prompt text around the pair to understand it and produce a score.
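For reference, here is roughly what I have in mind, just a minimal sketch and not LLaMA-Factory's actual reward-model API: wrap a small instruct model in a grading prompt and parse the generated text back into a float, which is the kind of scalar a PPO trainer expects as a reward. The model name, the grading template, and the `score_pair` helper are all my own placeholders.

```python
# Minimal sketch: use a small generative instruct model as a judge for a
# prompt-response pair and return a float score. Not LLaMA-Factory code;
# the model name and grading template are assumptions for illustration.
import re

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-3B-Instruct"  # placeholder; any small instruct model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype="auto", device_map="auto"
)


def score_pair(prompt: str, response: str) -> float:
    """Ask the model to rate the response from 0 to 10 and return a float."""
    messages = [
        {
            "role": "system",
            "content": "You are a strict grader. Reply with a single number from 0 to 10.",
        },
        {
            "role": "user",
            "content": f"Prompt:\n{prompt}\n\nResponse:\n{response}\n\nScore (0-10):",
        },
    ]
    # Build the chat-formatted input and generate a short, deterministic answer.
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        output_ids = model.generate(input_ids, max_new_tokens=8, do_sample=False)
    # Keep only the newly generated tokens and pull the first number out of them.
    text = tokenizer.decode(
        output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True
    )
    match = re.search(r"\d+(\.\d+)?", text)
    return float(match.group()) if match else 0.0


# Example usage: the returned float is what would be fed to PPO as the reward.
print(score_pair("Explain what PPO is.", "PPO is a reinforcement learning algorithm..."))
```

So my question is basically how to plug something like this (or a proper reward model checkpoint) into the LLaMA-Factory PPO stage, and what format that score has to be in.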