cotalks.dev

Training My Rival in Java: A Deep Q-Learning AI to Play Azul by Victor Uria Valle

Channel: Devoxx

Summary

Victor Uria Valle presents a personal Java project that uses deep Q-learning to play the board game Azul. The talk starts with a quick explanation of Azul’s scoring, player board, factories, and legal moves, then moves into Q-learning fundamentals and why a neural network is used instead of a Q-table for larger state spaces. The implementation walkthrough covers game-state and action encoding, a feed-forward neural network with weights and biases, a DQN agent with replay memory, epsilon-greedy action selection, and a trainer that computes rewards from score changes and floor-line penalties. The session ends with a short demo of a human playing against the trained agent.

Key Takeaways

  • Azul is modeled with factories, a central pool, pattern lines, wall tiles, and a floor line for penalties.
  • Classic Q-learning uses a Q-table, but deep Q-learning replaces it with a neural network that predicts Q-values for actions.
  • The agent encodes both game state and legal actions, then selects moves with an epsilon-greedy policy.
  • Replay memory stores past transitions and samples them during training, so each experience can be reused for more than one network update.
  • Rewards are based on score differences plus extra penalties for floor-line placement.
  • The demo shows the trained agent making legal, game-like moves against a human player.

Sections

Azul rules and game-state representation

The talk begins with the core Azul mechanics: tiles are drawn from factories or the central area, placed into pattern lines, and eventually scored on the wall. Extra tiles that do not fit are sent to the floor line, which reduces the player’s score. Victor explains how these elements map to implementation details such as tile enums, factory lists, player boards, pattern lines, wall state, floor penalties, and score tracking.
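
The mapping from rules to data structures can be sketched roughly as follows. This is a minimal illustration, not the talk's code: every class, field, and method name here is hypothetical, and the only values taken from the game itself are the five tile colors, the five pattern lines, the 5x5 wall, and the standard floor-line penalties.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; the project's real classes may differ.
class AzulStateSketch {
    enum Tile { BLUE, YELLOW, RED, BLACK, WHITE }

    // Standard Azul penalties for the 1st to 7th tile on the floor line.
    static final int[] FLOOR_PENALTIES = {-1, -1, -2, -2, -2, -3, -3};

    static class PlayerBoard {
        // Pattern line i (0-based) holds up to i + 1 tiles of one color.
        final Tile[] patternColor = new Tile[5];
        final int[] patternCount = new int[5];
        // wall[row][col] is true once a tile has been placed there.
        final boolean[][] wall = new boolean[5][5];
        final List<Tile> floorLine = new ArrayList<>();
        int score = 0;

        // Sum of penalties for the tiles currently on the floor line.
        int floorPenalty() {
            int total = 0;
            int n = Math.min(floorLine.size(), FLOOR_PENALTIES.length);
            for (int i = 0; i < n; i++) {
                total += FLOOR_PENALTIES[i];
            }
            return total;
        }
    }

    static class GameState {
        // Each factory is a small list of tiles; the center collects leftovers.
        final List<List<Tile>> factories = new ArrayList<>();
        final List<Tile> center = new ArrayList<>();
        final List<PlayerBoard> players = new ArrayList<>();
    }

    public static void main(String[] args) {
        PlayerBoard board = new PlayerBoard();
        board.floorLine.add(Tile.RED);
        board.floorLine.add(Tile.RED);
        board.floorLine.add(Tile.BLUE);
        System.out.println(board.floorPenalty()); // prints -4 (-1 - 1 - 2)
    }
}
```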

Valid moves, placement constraints, and scoring

A substantial part of the explanation focuses on legal actions. A player can place a color into a pattern line only if the line is empty or already holds that color, the line still has room, and the matching wall row does not already contain that color. Any overflow goes to the floor line. The scoring model is also described: each placed tile earns points based on its adjacent tiles, with bonuses for completing rows, columns, or color sets. These rules shape the reward function used for training.
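
A placement check and the adjacency score might look like the sketch below. It assumes the standard Azul rules (one color per pattern line, no color repeated in a wall row, and the printed wall pattern in which each row shifts one column to the right) and represents colors as integers 0 to 4. All method names are hypothetical, not taken from the talk.

```java
// Hypothetical helper methods; colors are the integers 0..4.
class AzulMoveSketch {
    static final int SIZE = 5;

    // Column of a color in a wall row on the standard (printed) Azul wall.
    static int wallColumn(int row, int color) {
        return (color + row) % SIZE;
    }

    // A color fits a pattern line if the line is empty or already holds that
    // color, still has room, and the wall row does not contain the color yet.
    static boolean canPlace(int[] lineColor, int[] lineCount,
                            boolean[][] wall, int row, int color) {
        boolean colorMatches = lineCount[row] == 0 || lineColor[row] == color;
        boolean hasRoom = lineCount[row] < row + 1;
        boolean wallFree = !wall[row][wallColumn(row, color)];
        return colorMatches && hasRoom && wallFree;
    }

    // Adds tiles to a pattern line and returns how many overflow to the floor.
    static int place(int[] lineColor, int[] lineCount,
                     int row, int color, int tiles) {
        int capacity = row + 1 - lineCount[row];
        int placed = Math.min(capacity, tiles);
        lineColor[row] = color;
        lineCount[row] += placed;
        return tiles - placed;
    }

    // Points for a tile placed at (r, c): length of the horizontal run plus
    // the vertical run when both exceed one tile, otherwise the longer run.
    static int scoreTile(boolean[][] wall, int r, int c) {
        int h = 1;
        int v = 1;
        for (int x = c - 1; x >= 0 && wall[r][x]; x--) h++;
        for (int x = c + 1; x < SIZE && wall[r][x]; x++) h++;
        for (int y = r - 1; y >= 0 && wall[y][c]; y--) v++;
        for (int y = r + 1; y < SIZE && wall[y][c]; y++) v++;
        if (h > 1 && v > 1) {
            return h + v;
        }
        return Math.max(h, v);
    }
}
```

For example, taking three tiles of one color into the two-slot pattern line fills the line and sends one tile to the floor.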

From Q-learning to deep Q-learning

Victor explains Q-learning as a state-action value table, where each entry estimates the value of taking an action from a given state. He then shows why a Q-table does not scale well for complex games with many states and actions. Deep Q-learning addresses this by using a neural network to approximate Q-values from encoded game state instead of storing them in a table.
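
For reference, the tabular form can be sketched in a few lines. This is the textbook Q-learning update rather than code from the talk; the class name, the integer state and action indices, and the learning-rate and discount parameters (alpha, gamma) are illustrative assumptions.

```java
// Textbook tabular Q-learning; names and parameters are illustrative.
class QTableSketch {
    final double[][] q;   // q[state][action] = estimated value
    final double alpha;   // learning rate
    final double gamma;   // discount factor

    QTableSketch(int states, int actions, double alpha, double gamma) {
        this.q = new double[states][actions];
        this.alpha = alpha;
        this.gamma = gamma;
    }

    // Highest estimated value among all actions in a state.
    double maxQ(int state) {
        double best = q[state][0];
        for (double value : q[state]) {
            best = Math.max(best, value);
        }
        return best;
    }

    // Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    void update(int state, int action, double reward,
                int nextState, boolean terminal) {
        double future = terminal ? 0.0 : gamma * maxQ(nextState);
        double target = reward + future;
        q[state][action] += alpha * (target - q[state][action]);
    }
}
```

The table needs one row per distinct state, which is what becomes impractical for a game like Azul; a network that maps an encoded state to Q-values avoids enumerating the states.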

Neural network and DQN agent implementation

The implementation section covers a simple feed-forward network with input, hidden, and output layers. Inputs are encoded game features, including the board state and possible actions. The network uses weights and biases initialized with He initialization. The DQN agent wraps the network, maintains replay memory, and stores transitions so training can reuse past experience.
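
A network of this shape could be written in plain Java along the lines below. The sketch assumes a single hidden layer with ReLU activation and a linear output layer; the summary only confirms the feed-forward structure, the weights and biases, and He initialization, so the activation choice, layer count, and all names are assumptions.

```java
import java.util.Random;

// Illustrative one-hidden-layer network; layer sizes and ReLU are assumptions.
class TinyNetworkSketch {
    final double[][] w1; // [hidden][input]
    final double[] b1;
    final double[][] w2; // [output][hidden]
    final double[] b2;

    TinyNetworkSketch(int inputs, int hidden, int outputs, long seed) {
        Random rng = new Random(seed);
        w1 = heInit(hidden, inputs, rng);
        b1 = new double[hidden];   // biases start at zero
        w2 = heInit(outputs, hidden, rng);
        b2 = new double[outputs];
    }

    // He initialization: zero-mean Gaussian with std = sqrt(2 / fanIn).
    static double[][] heInit(int rows, int fanIn, Random rng) {
        double std = Math.sqrt(2.0 / fanIn);
        double[][] w = new double[rows][fanIn];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < fanIn; c++) {
                w[r][c] = rng.nextGaussian() * std;
            }
        }
        return w;
    }

    // Forward pass: ReLU hidden layer, linear outputs (one Q-value each).
    double[] forward(double[] input) {
        double[] hidden = new double[b1.length];
        for (int h = 0; h < hidden.length; h++) {
            double sum = b1[h];
            for (int i = 0; i < input.length; i++) {
                sum += w1[h][i] * input[i];
            }
            hidden[h] = Math.max(0.0, sum);
        }
        double[] out = new double[b2.length];
        for (int o = 0; o < out.length; o++) {
            double sum = b2[o];
            for (int h = 0; h < hidden.length; h++) {
                sum += w2[o][h] * hidden[h];
            }
            out[o] = sum;
        }
        return out;
    }
}
```

He initialization scales the weights by the number of inputs to each layer, which keeps activations from shrinking or blowing up as they pass through ReLU units.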

Training loop, exploration, and replay

The trainer creates games, records transitions, computes rewards, and updates the agent. The reward function uses score delta between turns and adds extra punishment for floor-line penalties. Action selection uses an epsilon-greedy policy: most of the time the agent chooses the best predicted action, but occasionally it explores randomly. The replay step samples stored experiences to update the network, and the talk mentions periodically syncing the rival network as part of the training strategy.
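
The three pieces named above (reward, epsilon-greedy selection, and replay memory) can be sketched as follows. The names, the extra floor-penalty weight, and the ring-buffer storage are assumptions made for illustration; the talk's actual reward constants and memory layout are not given in this summary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative training helpers; names and constants are assumptions.
class DqnTrainingSketch {
    static final class Transition {
        final double[] state;
        final int action;
        final double reward;
        final double[] nextState;
        final boolean done;

        Transition(double[] state, int action, double reward,
                   double[] nextState, boolean done) {
            this.state = state;
            this.action = action;
            this.reward = reward;
            this.nextState = nextState;
            this.done = done;
        }
    }

    // Score delta plus an extra weighted floor penalty (floorPenalty <= 0).
    static double reward(int scoreBefore, int scoreAfter,
                         int floorPenalty, double extraWeight) {
        return (scoreAfter - scoreBefore) + extraWeight * floorPenalty;
    }

    // With probability epsilon pick a random legal action, else the legal
    // action with the highest predicted Q-value.
    static int selectAction(double[] qValues, List<Integer> legal,
                            double epsilon, Random rng) {
        if (rng.nextDouble() < epsilon) {
            return legal.get(rng.nextInt(legal.size()));
        }
        int best = legal.get(0);
        for (int action : legal) {
            if (qValues[action] > qValues[best]) {
                best = action;
            }
        }
        return best;
    }

    // Fixed-capacity memory that overwrites the oldest transition when full.
    static final class ReplayMemory {
        private final List<Transition> items = new ArrayList<>();
        private final int capacity;
        private int next = 0;

        ReplayMemory(int capacity) {
            this.capacity = capacity;
        }

        void add(Transition t) {
            if (items.size() < capacity) {
                items.add(t);
            } else {
                items.set(next, t);
            }
            next = (next + 1) % capacity;
        }

        // Uniform sampling with replacement from the stored transitions.
        List<Transition> sample(int batchSize, Random rng) {
            List<Transition> batch = new ArrayList<>();
            for (int i = 0; i < batchSize; i++) {
                batch.add(items.get(rng.nextInt(items.size())));
            }
            return batch;
        }

        int size() {
            return items.size();
        }
    }
}
```

Restricting the greedy choice to the legal actions means the agent never picks a move the rules forbid, even when the network assigns an illegal action a high value.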

Demo and outcome

In the demo, a saved model is loaded and a human plays a short match against the agent. Victor shows the AI selecting legal moves that resemble reasonable Azul play, demonstrating that the trained DQN can act on the encoded game state and board constraints.

Keywords: deep q-learning, dqn, q-learning, java neural network, azul board game ai, replay buffer, epsilon-greedy policy, reward function, game state encoding, legal action masking, feed-forward neural network, he initialization, experience replay, reinforcement learning, board game ai
