SnakeAI-MLOps

Multi-Agent Reinforcement Learning Platform with Production MLOps Pipeline

Created by Pranav Mishra - Connect for collaboration and opportunities

AI in Action

This project demonstrates the power of reinforcement learning by training AI agents to master the classic Snake game. Watch how trained agents exhibit intelligent navigation patterns compared to random exploration:

Trained Agent Gameplay

Intelligent Decision Making

Trained agents efficiently navigate towards food while avoiding obstacles, demonstrating learned strategic behavior and spatial awareness.

Untrained Agent Behavior

Random Exploration

Untrained agents move randomly without strategy, frequently colliding with walls or their own bodies, which underscores how much the trained agents have learned.

Project Overview

SnakeAI-MLOps is a comprehensive reinforcement learning platform that implements and compares four major machine learning paradigms within a unified experimental framework. The project bridges the gap between theoretical research and practical deployment by providing GPU-accelerated training pipelines, real-time inference capabilities, and systematic performance evaluation.

This platform addresses critical challenges in reinforcement learning research: algorithm comparison, reproducibility, and production deployment. By implementing multiple RL techniques within the same environment, researchers and developers can conduct meaningful comparative studies and understand the strengths and limitations of different approaches.

Key Capabilities

Multi-Algorithm Training
Train and compare Q-Learning, DQN, PPO, and Actor-Critic algorithms within the same environment using standardized evaluation metrics.
GPU Acceleration
PyTorch CUDA support provides 15-50x training speedup over CPU-only implementations, enabling rapid experimentation and iteration.
Production Integration
Python-trained models integrate seamlessly with the C++ game engine via LibTorch for real-time inference during gameplay.
MLOps Pipeline
Automated training, evaluation, model versioning, and deployment through GitHub Actions CI/CD with Docker containerization.
Cross-Platform Deployment
Windows, Linux, and containerized deployment options with comprehensive dependency management and automated builds.
Interactive Gameplay
Multiple game modes including human vs AI, AI vs AI, and hybrid interactions with real-time performance monitoring.

Reinforcement Learning Algorithms

The platform implements four distinct reinforcement learning paradigms, each representing a different approach to learning optimal policies. These implementations provide comprehensive coverage of major RL methodologies, from classical tabular methods to modern deep learning approaches.

Q-Learning (Tabular)
Classical value-based reinforcement learning using tabular Q-value storage for discrete state spaces.
Approach: Builds a lookup table mapping state-action pairs to Q-values, learning through temporal difference updates (a minimal update sketch follows the list below).
  • State Space: 512 discrete states (9-bit binary encoding)
  • Action Space: 4 discrete actions (Up, Down, Left, Right)
  • Learning: Bellman equation with epsilon-greedy exploration
  • Storage: JSON format for cross-platform compatibility
  • Performance: Fastest training, deterministic inference
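
To make the tabular update concrete, the sketch below applies the Bellman (temporal-difference) update with epsilon-greedy action selection and JSON storage, as described above. Class and hyperparameter names (QTable, alpha, gamma, epsilon) are illustrative assumptions, not the repository's actual code.

```python
import json
import random

class QTable:
    """Minimal tabular Q-Learning sketch (illustrative, not the repository's implementation)."""

    def __init__(self, n_actions=4, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = {}  # maps a discrete state key (e.g. "001001000") to a list of 4 Q-values
        self.n_actions = n_actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def values(self, state):
        return self.q.setdefault(state, [0.0] * self.n_actions)

    def act(self, state):
        # Epsilon-greedy exploration over the 4 discrete actions (Up, Down, Left, Right)
        if random.random() < self.epsilon:
            return random.randrange(self.n_actions)
        vals = self.values(state)
        return vals.index(max(vals))

    def update(self, state, action, reward, next_state, done):
        # Bellman update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        target = reward if done else reward + self.gamma * max(self.values(next_state))
        self.values(state)[action] += self.alpha * (target - self.values(state)[action])

    def save(self, path):
        # JSON storage keeps the table portable between the Python trainer and the C++ engine
        with open(path, "w") as f:
            json.dump(self.q, f)
```
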
Deep Q-Network (DQN)
Neural network approximation of Q-values with experience replay and target networks for stable learning.
Approach: Uses deep neural networks to approximate the Q-function, enabling learning in larger state spaces (a training-step sketch follows the list below).
  • Architecture: 8 → 64 → 64 → 4 fully connected layers
  • Features: Experience replay buffer, target network updates
  • Training: Double DQN to reduce overestimation bias
  • Optimization: Adam optimizer with gradient clipping
  • Memory: 10,000 transition replay buffer
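
The sketch below shows how the pieces listed above (the 10,000-transition replay buffer, a target network, the Double DQN target, and gradient clipping) typically fit together in PyTorch. The function and variable names, batch size, and loss choice are assumptions rather than the repository's actual code; the optimizer passed in would be Adam as noted above.

```python
import random
from collections import deque

import torch
import torch.nn.functional as F

# Transitions are stored as (state, action, reward, next_state, done) tuples, e.g.
# replay_buffer.append((state, action, reward, next_state, float(done)))
replay_buffer = deque(maxlen=10_000)

def train_step(online_net, target_net, optimizer, batch_size=64, gamma=0.99):
    if len(replay_buffer) < batch_size:
        return
    batch = random.sample(replay_buffer, batch_size)
    states, actions, rewards, next_states, dones = map(
        lambda x: torch.as_tensor(x, dtype=torch.float32), zip(*batch)
    )
    actions = actions.long()

    # Q(s,a) from the online network for the actions actually taken
    q_sa = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # Double DQN: the online net selects the next action, the target net evaluates it
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        targets = rewards + gamma * next_q * (1.0 - dones)

    loss = F.smooth_l1_loss(q_sa, targets)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(online_net.parameters(), 1.0)  # gradient clipping
    optimizer.step()
```
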
Proximal Policy Optimization (PPO)
Advanced policy gradient method with clipped objective functions for stable policy updates.
Approach: Direct policy optimization using a clipped surrogate objective to prevent large policy updates (a loss-function sketch follows the list below).
  • Networks: Separate policy and value networks
  • Policy: 8 → 64 → 64 → 4 (action probabilities)
  • Value: 8 → 64 → 64 → 1 (state value estimation)
  • Training: Multiple epochs per trajectory with clipping
  • Features: Entropy regularization for exploration
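
The clipped surrogate objective can be written in a few lines of PyTorch. The sketch below computes the PPO policy loss with an entropy bonus; the function name and coefficient values (clip_eps=0.2, entropy_coef=0.01) are illustrative assumptions.

```python
import torch

def ppo_policy_loss(new_log_probs, old_log_probs, advantages,
                    entropy, clip_eps=0.2, entropy_coef=0.01):
    """Clipped surrogate objective with entropy regularization (illustrative sketch)."""
    # Probability ratio r(theta) = pi_theta(a|s) / pi_theta_old(a|s)
    ratio = torch.exp(new_log_probs - old_log_probs)

    # Clipping the ratio keeps each update close to the policy that collected the trajectory
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages

    # Maximize the objective by minimizing its negative; the entropy bonus encourages exploration
    return -(torch.min(unclipped, clipped).mean() + entropy_coef * entropy.mean())
```

During training, new_log_probs and entropy would come from the current policy network's action distribution, while old_log_probs and advantages are stored when the trajectory is collected; the same batch is reused over multiple epochs as noted above.
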
Actor-Critic (A2C)
Hybrid approach combining policy gradients with value function estimation for reduced variance learning.
Approach: Combines an actor (policy) network and a critic (value) network, where the critic reduces variance in policy gradient updates (an update sketch follows the list below).
  • Actor Network: 8 → 64 → 64 → 4 (policy π(a|s))
  • Critic Network: 8 → 64 → 64 → 1 (value V(s))
  • Updates: Simultaneous actor and critic optimization
  • Advantage: TD error provides variance reduction
  • Learning: Separate learning rates for actor and critic
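
The sketch below shows a one-step Actor-Critic update in which the TD error serves as the advantage and the actor and critic keep separate optimizers, matching the points above. All names and hyperparameters are illustrative; inputs are assumed to be tensors (state and next_state of shape [1, 8], action a long tensor, reward and done floats).

```python
import torch
import torch.nn.functional as F

def a2c_update(actor, critic, actor_opt, critic_opt,
               state, action, reward, next_state, done, gamma=0.99):
    """One-step Actor-Critic update sketch (illustrative, not the repository's code)."""
    value = critic(state)                          # V(s)
    with torch.no_grad():
        next_value = critic(next_state)            # V(s')
        td_target = reward + gamma * next_value * (1.0 - done)

    # TD error used as a low-variance advantage estimate
    advantage = (td_target - value).detach()

    # Critic: regress V(s) toward the TD target
    critic_loss = F.mse_loss(value, td_target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: policy gradient step weighted by the advantage
    log_prob = torch.distributions.Categorical(logits=actor(state)).log_prob(action)
    actor_loss = -(log_prob * advantage).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```
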
Model Architectures

Q-Learning Table Structure

Q-Learning uses a discrete state representation with 512 possible states, stored as a lookup table mapping state-action pairs to Q-values:

| State (Binary) | Situation | UP | DOWN | LEFT | RIGHT |
|----------------|-----------|-------|-------|-------|-------|
| 000000000 | safe, no food | 0.12 | 0.15 | 0.08 | 0.11 |
| 001001000 | food right | 0.05 | 0.07 | -0.02 | 0.85 |
| 100000010 | danger ahead, food up | 0.92 | -0.85 | 0.15 | 0.18 |
| 111000000 | surrounded | -0.95 | -0.92 | -0.89 | -0.88 |

State Encoding: [danger_straight, danger_left, danger_right, direction_bit1, direction_bit2, food_left, food_right, food_up, food_down]
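
A small helper along these lines could build the discrete state key used by the table. The bit ordering follows the encoding listed above; the function name, the two-bit direction packing, and the string key format are assumptions for illustration.

```python
def encode_state(danger_straight, danger_left, danger_right,
                 direction, food_left, food_right, food_up, food_down):
    """Pack the nine binary features into a 9-bit state key such as '000110100'."""
    bits = [
        int(danger_straight), int(danger_left), int(danger_right),
        (direction >> 1) & 1, direction & 1,   # direction_bit1, direction_bit2 (direction in 0-3)
        int(food_left), int(food_right), int(food_up), int(food_down),
    ]
    return "".join(str(b) for b in bits)

# Example: no danger, direction 3, food to the right -> '000110100'
key = encode_state(0, 0, 0, 3, 0, 1, 0, 0)
```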

Deep Q-Network (DQN) Architecture

  • Input Layer: 8D state vector [danger, direction, food]
  • Hidden Layer 1: 64 neurons, ReLU activation
  • Hidden Layer 2: 64 neurons, ReLU activation
  • Output Layer: 4 Q-values [Q(s,up), Q(s,down), Q(s,left), Q(s,right)]
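
This layer stack maps directly onto a small PyTorch module. The sketch below mirrors the 8 → 64 → 64 → 4 sizes listed above; the class name and default arguments are illustrative.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Fully connected Q-network: 8 -> 64 -> 64 -> 4."""

    def __init__(self, state_dim=8, hidden_dim=64, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),  # one Q-value per action
        )

    def forward(self, state):
        return self.net(state)

# Greedy action at inference time: index of the largest Q-value
# action = DQN()(torch.zeros(1, 8)).argmax(dim=1)
```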

PPO & Actor-Critic Architecture (Shared Pattern)

Both PPO and Actor-Critic use the same dual-network architecture pattern with separate policy and value networks:

  • Input: 8D state vector
  • Policy/Actor Network: 8 → 64 → 64 → 4, softmax output producing action probabilities [P(up), P(down), P(left), P(right)]
  • Value/Critic Network: 8 → 64 → 64 → 1, linear output producing the state value V(s), the expected return

Key Difference: PPO uses clipped objective functions for policy updates, while Actor-Critic uses direct policy gradients with advantage estimation.
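
A minimal PyTorch sketch of this dual-network pattern is shown below, with separate policy and value heads of the sizes listed above. The class name and the choice to bundle both networks in one module are assumptions; PPO and Actor-Critic differ only in how the outputs are used during updates.

```python
import torch
import torch.nn as nn

class PolicyValueNets(nn.Module):
    """Separate policy (8 -> 64 -> 64 -> 4) and value (8 -> 64 -> 64 -> 1) networks."""

    def __init__(self, state_dim=8, hidden_dim=64, n_actions=4):
        super().__init__()
        self.policy = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),   # logits, converted to probabilities below
        )
        self.value = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),           # scalar V(s)
        )

    def forward(self, state):
        action_probs = torch.softmax(self.policy(state), dim=-1)  # [P(up), P(down), P(left), P(right)]
        state_value = self.value(state)                           # expected return V(s)
        return action_probs, state_value
```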

Technical Implementation

The platform employs a hybrid architecture combining Python's flexibility for research and experimentation with C++'s performance for production deployment. This approach enables rapid algorithm development while maintaining real-time performance requirements.

Training Pipeline (Python)

PyTorch Framework
GPU-accelerated tensor operations with automatic differentiation for efficient neural network training. CUDA support provides significant speedup over CPU-only implementations (a device-selection sketch follows below).
Modular Architecture
Separate trainer modules for each algorithm with shared utilities for environment interaction, state generation, and evaluation metrics.
Comprehensive Logging
Detailed training metrics, model checkpointing, and performance visualization with automated report generation and statistical analysis.
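
As a small illustration of the GPU acceleration mentioned above, training code typically selects CUDA when available and falls back to CPU; the placeholder model and tensor below are illustrative only.

```python
import torch

# Use the GPU when CUDA is available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Any of the networks sketched earlier would be moved to the device the same way,
# with each state batch created on (or sent to) that device before the forward pass.
model = torch.nn.Linear(8, 4).to(device)
states = torch.zeros(32, 8, device=device)
q_values = model(states)
```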

Game Engine (C++)

SFML Graphics
Cross-platform multimedia library providing 60fps gameplay with responsive input handling and smooth graphics rendering.
LibTorch Integration
Real-time neural network inference using PyTorch's C++ API, enabling seamless integration of trained Python models into the game engine (a TorchScript export sketch follows below).
State Management
Efficient game state representation and agent interface supporting multiple AI algorithms with standardized action and observation spaces.
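
On the Python side, the LibTorch integration described above is commonly set up by exporting a trained model as TorchScript so the C++ engine can load it with torch::jit::load. The sketch below shows that pattern; the stand-in network and the file name dqn_policy.pt are illustrative, not the repository's actual export path.

```python
import torch
import torch.nn as nn

# Stand-in for a trained policy network (8 -> 64 -> 64 -> 4); in practice the trained
# weights would be loaded before export.
model = nn.Sequential(
    nn.Linear(8, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 4),
)
model.eval()

# Trace with a dummy 8D state and save as TorchScript; the C++ engine can then load
# the file via LibTorch with torch::jit::load("dqn_policy.pt").
example_state = torch.zeros(1, 8)
scripted = torch.jit.trace(model, example_state)
scripted.save("dqn_policy.pt")
```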

MLOps Infrastructure

CI/CD Pipeline
GitHub Actions workflow for automated testing, building, and deployment across Windows and Linux platforms with comprehensive artifact generation.
Container Deployment
Docker containerization for consistent deployment environments with all dependencies pre-configured and optimized for different hardware configurations.
Model Versioning
Automated model storage, versioning, and performance tracking with comprehensive metadata and evaluation metrics for reproducible research.