This project aims to develop a self-driving car agent in the TORCS simulator using the Actor-Critic algorithm from deep reinforcement learning.
You can install the Python dependencies with `pip install -r Requirements.txt`, and it should just work. If you want to install the packages manually, here's the list:
- Python==3.7
- Tensorflow-gpu==2.3.0
- Keras==2.6.0
- Numpy==1.18.5
- gym_torcs
TORCS is an open-source car racing simulator that is widely used in AI research. It was chosen for this project because the gym_torcs library makes it easy to read states from the game: it uses the SCR plugin to set up a connection with the game server, so commands can be sent and the current state retrieved through a simple interface. Reinforcement learning requires a continuous loop of reading state data and sending action values, so this simulator suited the project well. Self-driving cars are a broad research area spanning many fields, and implementing this project was a good way to apply various reinforcement learning concepts in practice.
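As a rough illustration of how the environment is typically set up with gym_torcs (the exact constructor arguments may differ between versions of the library; `vision`, `throttle` and `gear_change` below are assumptions based on the commonly used fork):

```python
from gym_torcs import TorcsEnv

# Create the TORCS environment; the connection to the game goes through the SCR plugin.
# vision=False -> use the low-dimensional sensor state instead of raw pixels.
env = TorcsEnv(vision=False, throttle=True, gear_change=False)

# relaunch=True restarts the TORCS process, which some forks recommend
# to work around a memory leak during long training runs.
ob = env.reset(relaunch=True)
```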
Imagine you play a video game with a friend who provides you some feedback. You're the Actor and your friend is the Critic. At the beginning, you don't know how to play the game, so you try some actions randomly. The Critic observes your actions and provides feedback. Learning from this feedback, you update your policy and get better at playing the game. Your friend (the Critic), in turn, updates their own way of providing feedback so it is better next time. The idea of Actor-Critic is therefore to have two neural networks, estimated and run in parallel. Because there are two models (Actor and Critic) to train, there are two sets of weights; the weights of the Actor network are updated with respect to the output of the Critic network. The target networks are updated via a soft update.
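A minimal sketch of the soft update applied to the target networks (the value of `tau` here is illustrative, not necessarily the one used in this project):

```python
def soft_update(target_model, source_model, tau=0.001):
    """Slowly blend the learned weights into the target network:
    theta_target <- tau * theta_source + (1 - tau) * theta_target."""
    new_weights = [tau * s + (1.0 - tau) * t
                   for s, t in zip(source_model.get_weights(),
                                   target_model.get_weights())]
    target_model.set_weights(new_weights)
```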
The Actor-Critic model is a better score function. Instead of waiting until the end of the episode, as in Monte Carlo REINFORCE, we make an update at each step (TD learning). Because we update at every time step, we cannot use the total return R(t). Instead, we train a Critic model that approximates the value function (which estimates the expected future reward for taking a given action in a given state). This value function replaces the reward term in the policy gradient, which otherwise could only be computed at the end of the episode.
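A sketch of the one-step TD target the Critic is trained on, assuming a DDPG-style setup with target networks (all names below are illustrative):

```python
import numpy as np

# One-step TD target: y = r + gamma * Q_target(s', mu_target(s')),
# with the bootstrap term dropped for terminal transitions.
def td_targets(rewards, dones, next_states, actor_target, critic_target, gamma=0.99):
    next_actions = actor_target.predict(next_states)
    next_q = critic_target.predict([next_states, next_actions]).flatten()
    return rewards + gamma * (1.0 - dones) * next_q
```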
s_t = np.hstack((ob.angle, ob.track, ob.trackPos, ob.speedX, ob.speedY, ob.speedZ, ob.wheelSpinVel/100.0, ob.rpm))
retrieves the state data (angle, track sensors, track position, speeds, wheel spin velocity and rpm) from the game server and stacks it into a single state vector.
ob, r_t, done, info = env.step(action)
sends the command (the action to be taken) to the game server and returns the next observation, where r_t is the reward for taking that action.
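Putting the two calls together, one step of the interaction loop looks roughly like this (action selection and exploration noise are simplified; `actor` and `max_steps` are illustrative names, not taken from the project code):

```python
import numpy as np

for step in range(max_steps):
    # Build the state vector from the current observation.
    s_t = np.hstack((ob.angle, ob.track, ob.trackPos, ob.speedX,
                     ob.speedY, ob.speedZ, ob.wheelSpinVel / 100.0, ob.rpm))

    a_t = actor.predict(s_t.reshape(1, -1))[0]   # steering/throttle/brake from the Actor
    ob, r_t, done, info = env.step(a_t)          # apply the action in TORCS, get reward
    if done:
        break
```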
This algorithm was implemented using TensorFlow as follows: