Tic Tac Toe played by Double Deep Q-Networks

This repository contains a (successful) attempt to train a Double Deep Q-Network (DDQN) agent to play Tic-Tac-Toe. It learned to:

Distinguish valid from invalid moves
Comprehend how to win a game
Block the opponent when it poses a threat

Key formulas of algorithms used:

Double Deep Q-Networks:

Based on the DDQN algorithm by van Hasselt et al. [1]. The cost function used is:

Where θ represents the trained Q-Network and ϑ represents the semi-static Q-Target network.
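
For reference, the Double DQN cost from [1] is commonly written as below, with θ as the trained network and ϑ as the target network. The transition (s, a, r, s') and the discount factor γ are notation assumed for this sketch, and the exact form used in this repository may differ, for example by averaging over a batch.

```latex
L(\theta) = \Big( r + \gamma \, Q\big(s', \arg\max_{a'} Q(s', a'; \theta); \vartheta\big) - Q(s, a; \theta) \Big)^2
```

The trained network θ selects the next action, while the target network ϑ evaluates it, which is what separates Double DQN from a standard DQN target.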

The Q-Target update rule is the soft update from the DDPG algorithm by Lillicrap et al. [2]:

$$\vartheta \leftarrow \tau \theta + (1 - \tau) \vartheta$$

for some 0 <= τ <= 1.
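
A minimal NumPy sketch of a soft target update of this kind, assuming each network's weights are available as a list of arrays. The function name `soft_update` is hypothetical and not taken from this repository.

```python
import numpy as np


def soft_update(target_weights, online_weights, tau):
    """Return new target weights: tau * online + (1 - tau) * target."""
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must be in [0, 1]")
    return [
        tau * w + (1.0 - tau) * t
        for w, t in zip(online_weights, target_weights)
    ]


# Toy weights standing in for the trained network and the target network.
online = [np.ones((2, 2)), np.ones(2)]
target = [np.zeros((2, 2)), np.zeros(2)]

# One update step moves the target a fraction tau towards the online weights.
target = soft_update(target, online, tau=0.1)
```

With τ = 1 the target network is overwritten by the trained network on every step, and with τ = 0 it never changes; small values in between make the target track the trained network slowly.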

Maximum Entropy Learning:

Based on a paper by Haarnoja et al. [3] and designed according to a blog post by BAIR [4]. Q-values are computed using the soft Bellman equation:
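
In [3] the soft Bellman equation replaces the hard maximum over next actions with a log-sum-exp: Q(s, a) = r + γ · V(s'), where V(s') = α · log Σ exp(Q(s', a') / α). The sketch below shows that target for a discrete action space in NumPy. The function names, the temperature `alpha`, and the `done` flag are assumptions for illustration, not this repository's API.

```python
import numpy as np


def soft_value(q_values, alpha=1.0):
    """Soft state value alpha * log(sum(exp(Q / alpha))), computed stably."""
    z = np.asarray(q_values, dtype=float) / alpha
    m = z.max(axis=-1, keepdims=True)  # subtract the max to avoid overflow
    return alpha * (m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1)))


def soft_q_target(reward, next_q_values, done, gamma=0.99, alpha=1.0):
    """Soft Bellman target r + gamma * V_soft(s'); no bootstrap when done."""
    return reward + gamma * (1.0 - done) * soft_value(next_q_values, alpha)


# Nine Q-values, one per board cell, for a hypothetical next state.
next_q = np.zeros(9)
target = soft_q_target(reward=0.0, next_q_values=next_q, done=0.0)
```

As `alpha` approaches zero the soft value approaches the ordinary maximum, so the target falls back to the standard Q-learning target.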

Trained models:

Two types of agents were trained:

a regular DDQN agent, named 'Q'
an agent which learns using maximum entropy, named 'E'

Both models use a cyclic memory buffer as their experience-replay memory.
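
A cyclic memory buffer keeps the most recent transitions and overwrites the oldest one once it is full. The class below is a minimal sketch of that idea; `CyclicBuffer` and its methods are hypothetical names, and the repository's own implementation may differ.

```python
import random


class CyclicBuffer:
    """Fixed-capacity replay memory that overwrites its oldest entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []
        self.next_index = 0  # slot that the next overwrite will use

    def append(self, item):
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            self.items[self.next_index] = item
        self.next_index = (self.next_index + 1) % self.capacity

    def sample(self, batch_size):
        """Draw a uniform random batch without replacement."""
        return random.sample(self.items, min(batch_size, len(self.items)))

    def __len__(self):
        return len(self.items)


buffer = CyclicBuffer(capacity=3)
for step in range(5):
    buffer.append(step)  # steps 0 and 1 are overwritten by 3 and 4
batch = buffer.sample(2)
```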

Pre-trained models are stored under the models/ directory, with several models for each variant. Q files are DDQN models and E files are DDQN-Max-Entropy models.

Do it yourself:

main.py holds several useful functions (see their doc-strings for more details):

train runs a single training process, saves the weights, and plots graphs. With the current settings, training took me around 70 minutes on a 2018 MacBook Pro
multi_train will train several DDQN and DDQN-Max-Entropy models
play allows a human player to play against a saved model
face_off can be used to compare models by letting them play against each other

The DeepQNetworkModel class can be easily configured using these parameters (among others):

layers_size: set the number and size of the hidden layers of the model (only fully-connected layers are supported)
memory: set memory type (cyclic buffer or reservoir sampling)
double_dqn: set whether to use DDQN or a standard DQN
maximize_entropy: set whether to use maximum entropy learning or not

See the class doc-string for all possible parameters.
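
The `memory` option above mentions reservoir sampling as an alternative to a cyclic buffer. A minimal sketch of that technique (the classic Algorithm R) follows; `ReservoirBuffer` is a hypothetical name and is not the class used by `DeepQNetworkModel`.

```python
import random


class ReservoirBuffer:
    """Keeps a uniform random sample of everything it has ever been given."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # total number of items offered so far
        self.rng = random.Random(seed)

    def append(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
            return
        # Keep the new item with probability capacity / seen,
        # replacing a uniformly chosen stored item.
        j = self.rng.randrange(self.seen)
        if j < self.capacity:
            self.items[j] = item


memory = ReservoirBuffer(capacity=4, seed=0)
for transition in range(100):
    memory.append(transition)
```

Unlike a cyclic buffer, which only holds the most recent transitions, a reservoir gives every transition seen so far the same chance of being in memory.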

Related blogposts:

Read about where I got stuck when developing this code in "Lessons Learned from Tic-Tac-Toe: Practical Reinforcement Learning Tips"
Read about the E Max-Entropy models in "Open Minded AI: Improving Performance by Keeping All Options on the Table"

References:

[1] van Hasselt et al., Deep Reinforcement Learning with Double Q-learning
[2] Lillicrap et al., Continuous control with deep reinforcement learning
[3] Haarnoja et al., Reinforcement Learning with Deep Energy-Based Policies
[4] Tang & Haarnoja, Learning Diverse Skills via Maximum Entropy Deep Reinforcement Learning (blog post)
