Q-Learning Table Structure
Q-Learning uses a discrete state representation with 512 possible states (2^9 for the 9-bit encoding below), stored as a lookup table that maps each state-action pair to a Q-value:
| State (Binary) | UP | DOWN | LEFT | RIGHT |
|---|---|---|---|---|
| 000000000 (safe, no food) | 0.12 | 0.15 | 0.08 | 0.11 |
| 000000100 (food right) | 0.05 | 0.07 | -0.02 | 0.85 |
| 100000010 (danger ahead, food up) | 0.92 | -0.85 | 0.15 | 0.18 |
| 111000000 (surrounded) | -0.95 | -0.92 | -0.89 | -0.88 |
State Encoding: [danger_straight, danger_left, danger_right, direction_bit1, direction_bit2, food_left, food_right, food_up, food_down]
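As a minimal sketch of how such a table can be stored and updated, the snippet below uses a dictionary keyed by the 9-bit state tuple, an epsilon-greedy action helper, and the standard tabular Q-learning update. The dictionary layout, helper names, and the learning-rate/discount/epsilon values are illustrative assumptions, not taken from the article.

```python
import random
from collections import defaultdict

# States are 9-bit tuples: (danger_straight, danger_left, danger_right,
# direction_bit1, direction_bit2, food_left, food_right, food_up, food_down).
ACTIONS = ["UP", "DOWN", "LEFT", "RIGHT"]

# Each state maps to one Q-value per action, initialised to 0.0 (illustrative choice).
q_table = defaultdict(lambda: {a: 0.0 for a in ACTIONS})

def choose_action(state, epsilon=0.1):
    """Epsilon-greedy selection over the four Q-values for this state."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(q_table[state], key=q_table[state].get)

def update(state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Tabular Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q_table[next_state].values())
    td_target = reward + gamma * best_next
    q_table[state][action] += alpha * (td_target - q_table[state][action])

# Example: the "danger ahead, food up" state (100000010) from the table above.
state = (1, 0, 0, 0, 0, 0, 0, 1, 0)
action = choose_action(state)
```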
Deep Q-Network (DQN) Architecture
Input Layer (8D state vector: [danger, direction, food]) → Hidden Layer 1 (64 neurons, ReLU) → Hidden Layer 2 (64 neurons, ReLU) → Output Layer (4 Q-values: [Q(s,up), Q(s,down), Q(s,left), Q(s,right)])
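The layer sizes above translate directly into a small feed-forward network. The sketch below assumes PyTorch; the class name, default hyperparameters, and the placeholder input are illustrative rather than taken from the article.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """8 -> 64 -> 64 -> 4 network matching the diagram above."""
    def __init__(self, state_dim=8, hidden=64, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),  # Q(s,up), Q(s,down), Q(s,left), Q(s,right)
        )

    def forward(self, state):
        return self.net(state)

# Greedy action: index of the largest of the four Q-values.
dqn = DQN()
state = torch.zeros(1, 8)          # placeholder 8D state vector
action = dqn(state).argmax(dim=1)  # 0=up, 1=down, 2=left, 3=right
```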
PPO & Actor-Critic Architecture (Shared Pattern)
Both PPO and Actor-Critic use the same dual-network architecture pattern with separate policy and value networks:
Policy/Actor Network: 8 → 64 → 64 → 4, softmax output → action probabilities [P(up), P(down), P(left), P(right)]
Value/Critic Network: 8 → 64 → 64 → 1, linear output → state value V(s), the expected return from state s
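A minimal sketch of this dual-network pattern, again assuming PyTorch; the Actor/Critic class names and default sizes mirror the shapes above but are otherwise illustrative.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: 8 -> 64 -> 64 -> 4, softmax over the four actions."""
    def __init__(self, state_dim=8, hidden=64, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        # [P(up), P(down), P(left), P(right)]
        return torch.softmax(self.net(state), dim=-1)

class Critic(nn.Module):
    """Value network: 8 -> 64 -> 64 -> 1, linear output giving V(s)."""
    def __init__(self, state_dim=8, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.net(state).squeeze(-1)
```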
Key Difference: PPO constrains each policy update with a clipped surrogate objective, while vanilla Actor-Critic applies unclipped policy gradients weighted by the advantage estimate.
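To make that difference concrete, the sketch below contrasts the two policy losses. It assumes PyTorch tensors of log-probabilities and precomputed advantage estimates; the function names and the 0.2 clip range are common illustrative defaults, not values from the article.

```python
import torch

def actor_critic_policy_loss(log_probs, advantages):
    """Vanilla actor-critic: policy gradient weighted by the advantage estimate."""
    return -(log_probs * advantages.detach()).mean()

def ppo_policy_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO clipped surrogate: the probability ratio is clipped to [1-eps, 1+eps],
    so a single update cannot move the policy too far from the old policy."""
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```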