PPO is on-policy, but its training used a replay buffer? #3382
Closed
houghtonweihu started this conversation in Community | General

PPO is on-policy, but its training in https://github.com/hpcaitech/ColossalAI/blob/main/applications/Chat/coati/trainer/ppo.py uses a replay buffer, which should only be possible for off-policy algorithms.

Replies: 2 comments 1 reply
-
what is the brawlstar
0 replies
-
Hi @houghtonweihu, in PPO we still collect experience (PPO uses importance sampling), but we don't store it in a replay buffer the way off-policy methods do; we just use the replay-buffer class to hold the current batch, and clear the buffer immediately after use (see the sketch after this comment).
1 reply
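To make that concrete, here is a minimal sketch of the pattern described above. All names (`Experience`, `RolloutBuffer`, `ppo_update`) are illustrative, not the actual coati classes. The point is that the buffer only ever holds the latest on-policy batch, whose stored log-probabilities feed PPO's importance-sampling ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_\text{old}}(a_t \mid s_t)$, and the buffer is cleared after the update:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Experience:
    """One transition, tagged with the log-prob of the policy that produced it."""
    state: list
    action: int
    old_log_prob: float
    reward: float

@dataclass
class RolloutBuffer:
    """Holds a single batch of on-policy experience; emptied after every update."""
    items: List[Experience] = field(default_factory=list)

    def append(self, exp: Experience) -> None:
        self.items.append(exp)

    def clear(self) -> None:
        self.items.clear()

def ppo_update(buffer: RolloutBuffer, num_epochs: int = 4) -> None:
    # Several optimization epochs reuse the SAME freshly collected batch;
    # the stored old_log_prob gives the importance-sampling ratio
    # r = exp(new_log_prob - old_log_prob) used in the clipped objective.
    for _ in range(num_epochs):
        for exp in buffer.items:
            pass  # compute ratio, clipped surrogate loss, backprop, step ...
    # Discarding the batch here is what keeps the scheme on-policy:
    # the next update only ever sees data from the current policy.
    buffer.clear()
```

So the class is a replay buffer in name only: because its contents never outlive one rollout, the training data always comes from the current (or near-current) policy, which is exactly the on-policy requirement.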