https://sid-sr.github.io/Q-Snake/
About • Features • Usage • Installation • Acknowledgements
A website that visualises the Q-learning RL algorithm and shows how an AI agent can learn to play Snake using it.
- Just two values are used to represent a state:
- Relative location of the apple to the head (8 directions)
- Presence of danger one step ahead of the head in 4 directions (an array of 4 binary values, giving 16 combinations).
- This results in an 8 x 16 x 4 Q-table (a minimal encoding sketch is shown below). The visualization to the right is after training the snake for 5000 episodes.
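Below is a minimal sketch (not the project's actual code; the helper names and packing scheme are assumptions) of how such a state could be indexed into the Q-table:

```js
// Hypothetical state encoding matching the description above.
// appleDirection: 0..7, direction of the apple relative to the head (8 compass directions).
// danger: [up, right, down, left], each 0 or 1, giving 16 possible combinations.
function encodeState(appleDirection, danger) {
  // Pack the four danger flags into a single index from 0 to 15.
  const dangerIndex = (danger[0] << 3) | (danger[1] << 2) | (danger[2] << 1) | danger[3];
  // Together these pick one cell of the 8 x 16 Q-table (each cell holds 4 Q-values, one per move).
  return { appleDirection, dangerIndex };
}
```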
The reward space used here makes the problem a lot easier to solve, but it was chosen to ensure that reasonable results are obtained in a short time frame and that changes in the Q-table can be visualized quickly.
| Condition | Reward |
| --- | --- |
| Hitting the border / eating itself / moving 500 steps without eating the apple | -100 |
| Eating the apple | +30 |
| Moving towards the apple | +1 |
| Moving away from the apple | -5 |
(The state and reward space follow those used in this video: AI learns to play Snake using RL.)
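As a rough illustration only (the outcome labels below are hypothetical, not the project's actual code), the reward scheme in the table could be expressed as:

```js
// Hypothetical reward function mirroring the table above.
function reward(outcome) {
  switch (outcome) {
    case 'died':         return -100; // hit the border, ate itself, or 500 steps without the apple
    case 'ateApple':     return 30;
    case 'movedTowards': return 1;    // step that reduces distance to the apple
    case 'movedAway':    return -5;   // step that increases distance to the apple
    default:             return 0;
  }
}
```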
- The Q-table shown above has dimensions 8 x 16 (with 4 entries in each cell, one for each move).
- Each cell in the grid is a state, i.e. one situation the snake finds itself in, such as: the apple is towards the top left and there is danger to the left; which move do I make: up, left, down, or right?
- Initially, all states are unexplored. As the AI plays the game, it explores the different states and learns which moves work (based on the reward received for each action).
- The white entries correspond to unexplored states.
- The red entries correspond to explored states where the AI has learnt the wrong move.
- The green entries correspond to explored states where the AI has learnt the right move (i.e. the move a human would make). A sketch of this colouring logic is shown below.
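A minimal sketch of how a cell's colour could be derived, assuming each cell stores 4 Q-values (one per move), that an all-zero cell means the state is unexplored, and that `humanMove` is the reference move a human would pick; these names and assumptions are illustrative, not the project's actual code:

```js
// Hypothetical colouring rule for one Q-table cell.
// qValues: [up, right, down, left] Q-values for this state.
// humanMove: index of the move a human would make in this state.
function cellColour(qValues, humanMove) {
  if (qValues.every(q => q === 0)) return 'white';          // unexplored state (assumed all-zero)
  const learntMove = qValues.indexOf(Math.max(...qValues)); // move the AI currently prefers
  return learntMove === humanMove ? 'green' : 'red';        // right vs. wrong move
}
```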
- Episodes: The number of episodes (games/trials) to play and learn from.
- Start Epsilon: The initial probability of exploration. Range: 0 to 1.
- End Epsilon: The final probability of exploration. Range: 0 to 1.
- Discount Factor: The importance given to delayed rewards compared to immediate rewards. Range: 0 to 1.
- Speed/Delay: The delay (in ms) between moves; lower values mean faster games (set it to the lowest value when training).
- The Train button starts training, Stop stops the game, and Test runs the agent without training it further (useful to see how a trained agent plays).
- The probability of exploration decreases linearly over the given number of episodes, so the agent moves randomly at the start to explore the state space, and towards the end of the training phase (and during testing) it makes informed decisions based on the learned Q-values for each state. A sketch of this decay and of the standard Q-learning update is shown below.
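The sketch below shows a linear decay between Start Epsilon and End Epsilon, together with the standard tabular Q-learning update that uses the Discount Factor. It is only an illustration: the function names, the flat state index, and the learning rate `alpha` (not mentioned above) are assumptions, not the project's actual code.

```js
// Hypothetical linear decay from startEpsilon to endEpsilon over the given number of episodes.
function epsilonAt(episode, episodes, startEpsilon, endEpsilon) {
  const t = episodes > 1 ? episode / (episodes - 1) : 1; // 0 at the first episode, 1 at the last
  return startEpsilon + t * (endEpsilon - startEpsilon);
}

// Standard tabular Q-learning update for one step.
// qTable[state] is assumed to be an array of 4 Q-values, one per move.
function qUpdate(qTable, state, action, reward, nextState, alpha, gamma) {
  const target = reward + gamma * Math.max(...qTable[nextState]);    // gamma is the Discount Factor
  qTable[state][action] += alpha * (target - qTable[state][action]); // alpha is the learning rate
}
```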
- Clone the repository.
- Run `npm install`.
- Run `npm start`.
- Coding Snake in React.js
- Excellent explanation of how different rewards can affect the time taken to converge to the optimal Q-values: AI learns to play Snake using RL