Playing Breakout With a World Model
One of the great early triumphs of deep learning was DeepMind's agent that learned to play Atari games, most famously Breakout, using deep Q-learning.
Then, in 2022, Yann LeCun published a position paper, "A Path Towards Autonomous Machine Intelligence," proposing an architecture for achieving AGI centered on the learning of a world model.
The question: is it possible to build an agent that plays Breakout using LeCun's architecture?
Architecture
The core architecture is derived chiefly from LeCun's paper: a perception module encodes raw observations into latent states, a world model predicts how those states evolve, and a critic evaluates them.
State Modeling
Each game state is represented by 4 consecutive frames of raw pixels (84×110 resolution), processed through a convolutional neural network (perception module) to create latent vector representations. Actions are embedded as one-hot vectors and combined with the state representations during world model training.
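A rough sketch of what such a perception module might look like in PyTorch follows; the layer sizes, latent dimension, and frame orientation are illustrative assumptions, not details from the project:

```python
import torch
import torch.nn as nn

class PerceptionModule(nn.Module):
    """Encodes a stack of 4 grayscale frames into a latent vector.

    Assumes frames arrive as (batch, 4, 110, 84), i.e. height 110 and
    width 84; the conv stack and latent size are illustrative choices.
    """
    def __init__(self, latent_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),  # 4 stacked frames in
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # Infer the flattened size with a dummy forward pass.
        with torch.no_grad():
            n_flat = self.conv(torch.zeros(1, 4, 110, 84)).shape[1]
        self.fc = nn.Linear(n_flat, latent_dim)

    def forward(self, frames):  # frames: (batch, 4, 110, 84)
        return self.fc(self.conv(frames))
```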
World Model
The world model is a 6-layer transformer decoder with 8 attention heads, trained to predict the next latent state when given the current state sequence and an action. It learns the dynamics of the game by predicting how states evolve over time.
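A minimal sketch of such a world model is below. The 6 layers and 8 heads come from the description above; the latent dimension, context length, and the way the action embedding is appended to the sequence are my assumptions. A causally masked encoder stack is used here, which is the standard way to build a decoder-only transformer in PyTorch:

```python
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """Predicts the next latent state from a state sequence plus an action."""
    def __init__(self, latent_dim=256, n_actions=4, n_layers=6, n_heads=8, max_len=16):
        super().__init__()
        self.action_emb = nn.Linear(n_actions, latent_dim)  # one-hot action -> latent
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len + 1, latent_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=latent_dim, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(latent_dim, latent_dim)  # next-latent prediction

    def forward(self, states, action_onehot):
        # states: (batch, seq, latent_dim); action_onehot: (batch, n_actions)
        a = self.action_emb(action_onehot).unsqueeze(1)  # (batch, 1, latent_dim)
        x = torch.cat([states, a], dim=1) + self.pos_emb[:, : states.size(1) + 1]
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.transformer(x, mask=mask)  # causal mask = decoder-only behavior
        return self.head(h[:, -1])          # predicted next latent state
```

Training would then minimize, for example, the MSE between the predicted latent and the encoding of the actual next state.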
The world model enables the agent to simulate future scenarios during action selection. Combined with Monte Carlo Tree Search (MCTS), it explores possible action sequences to find the most promising paths forward.
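To illustrate the planning loop, here is a deliberately simplified stand-in for the search: instead of full MCTS, it brute-forces every action sequence of a small fixed depth in latent space and scores the imagined end state with the critic. All names and the search strategy itself are illustrative, not the original implementation:

```python
import itertools
import torch
import torch.nn.functional as F

@torch.no_grad()
def plan_action(world_model, critic, state_seq, n_actions=4, depth=2):
    """Roll out every action sequence of length `depth` with the world
    model and return the first action of the best-scoring rollout."""
    best_action, best_value = 0, float("-inf")
    for plan in itertools.product(range(n_actions), repeat=depth):
        seq = state_seq.clone()                     # (1, seq_len, latent_dim)
        for action in plan:
            onehot = F.one_hot(torch.tensor([action]), n_actions).float()
            next_latent = world_model(seq, onehot)  # imagined next state
            seq = torch.cat([seq[:, 1:], next_latent.unsqueeze(1)], dim=1)
        value = critic(seq[:, -1]).item()           # score the imagined end state
        if value > best_value:
            best_action, best_value = plan[0], value
    return best_action
```

Full MCTS replaces this exhaustive enumeration with selective expansion, which is what makes deeper lookahead tractable.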
Critic
The critic is a 2-layer fully connected network that evaluates game states by predicting expected rewards. It takes the final frame’s latent representation from a state sequence and outputs a scalar value indicating how promising that state is for achieving high scores.
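Since the critic is just a small MLP, a sketch is straightforward; the hidden width is an assumption:

```python
import torch.nn as nn

class Critic(nn.Module):
    """Maps the final latent state of a sequence to a scalar value estimate."""
    def __init__(self, latent_dim=256, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # predicted expected reward
        )

    def forward(self, latent):  # latent: (batch, latent_dim)
        return self.net(latent).squeeze(-1)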
Results
The agent learned to move the paddle and sometimes tracked the ball in the correct direction, but its overall success was limited.
Analysis
The main limitations were:
- Model capacity: The relatively simple architectures for the world model and critic may not have had enough representational power to fully capture the game dynamics
- Sparse rewards: Brick-hitting events are rare, making learning difficult even with oversampling of these transitions by dropping 99% of zero-reward states (see the sketch after this list)
- Temporal credit assignment: The delay between actions and their consequences (hitting bricks) makes it challenging for the agent to learn cause-and-effect relationships
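The rebalancing mentioned in the sparse-rewards point amounts to a simple filter over the replay data. A minimal sketch, assuming transitions are stored as dicts with a "reward" key (an illustrative choice, not the project's actual data layout):

```python
import random

def filter_transitions(transitions, keep_zero_frac=0.01):
    """Keep every transition with a nonzero reward, but only ~1% of
    zero-reward ones, matching the 99% drop rate described above."""
    return [t for t in transitions
            if t["reward"] != 0 or random.random() < keep_zero_frac]
```

Even with this rebalancing, reward events remain rare in absolute terms, which compounds the credit-assignment problem noted above.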