PPO is on-policy, but its training used a replay buffer? #3382
Closed
houghtonweihu started this conversation in Community | General

PPO is on-policy, but its training in https://github.com/hpcaitech/ColossalAI/blob/main/applications/Chat/coati/trainer/ppo.py uses a replay buffer, which should only be possible for off-policy algorithms.

Replies: 2 comments 1 reply
-
what is the brawlstar
0 replies
-
Hi @houghtonweihu, in PPO we still collect experience (PPO uses importance sampling), but we don't store it in a replay buffer the way off-policy methods do; we just use the replay-buffer class to hold the current batch, and clear the buffer immediately after use (see the sketch after this comment).
1 reply
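To make that concrete, here is a minimal sketch of the pattern described above. All names (`Experience`, `RolloutBuffer`, `ppo_update`) are illustrative, not the actual coati classes. The point is that the buffer only ever holds the latest on-policy batch, whose stored log-probabilities feed PPO's importance-sampling ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_\text{old}}(a_t \mid s_t)$, and the buffer is cleared after the update:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Experience:
    """One transition, tagged with the log-prob of the policy that produced it."""
    state: list
    action: int
    old_log_prob: float
    reward: float

@dataclass
class RolloutBuffer:
    """Holds a single batch of on-policy experience; emptied after every update."""
    items: List[Experience] = field(default_factory=list)

    def append(self, exp: Experience) -> None:
        self.items.append(exp)

    def clear(self) -> None:
        self.items.clear()

def ppo_update(buffer: RolloutBuffer, num_epochs: int = 4) -> None:
    # Several optimization epochs reuse the SAME freshly collected batch;
    # the stored old_log_prob gives the importance-sampling ratio
    # r = exp(new_log_prob - old_log_prob) used in the clipped objective.
    for _ in range(num_epochs):
        for exp in buffer.items:
            pass  # compute ratio, clipped surrogate loss, backprop, step ...
    # Discarding the batch here is what keeps the scheme on-policy:
    # the next update only ever sees data from the current policy.
    buffer.clear()
```

So the class is a replay buffer in name only: because its contents never outlive one rollout, the training data always comes from the current (or near-current) policy, which is exactly the on-policy requirement.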