Q-Learning Table Structure
Q-Learning uses a discrete state representation with 512 possible states (2^9 for the 9-bit encoding below), stored as a lookup table that maps each state-action pair to a Q-value:
| State (Binary) | UP | DOWN | LEFT | RIGHT |
|---|---|---|---|---|
| 000000000 (safe, no food) | 0.12 | 0.15 | 0.08 | 0.11 |
| 000000100 (food right) | 0.05 | 0.07 | -0.02 | 0.85 |
| 100000010 (danger ahead, food up) | 0.92 | -0.85 | 0.15 | 0.18 |
| 111000000 (surrounded) | -0.95 | -0.92 | -0.89 | -0.88 |
State Encoding: [danger_straight, danger_left, danger_right, direction_bit1, direction_bit2, food_left, food_right, food_up, food_down]
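As a minimal sketch of how such a table can be stored and updated, the snippet below uses a dictionary keyed by the 9-bit state tuple, an epsilon-greedy action helper, and the standard tabular Q-learning update. The dictionary layout, helper names, and the learning-rate/discount/epsilon values are illustrative assumptions, not taken from the article.

```python
import random
from collections import defaultdict

# States are 9-bit tuples: (danger_straight, danger_left, danger_right,
# direction_bit1, direction_bit2, food_left, food_right, food_up, food_down).
ACTIONS = ["UP", "DOWN", "LEFT", "RIGHT"]

# Each state maps to one Q-value per action, initialised to 0.0 (illustrative choice).
q_table = defaultdict(lambda: {a: 0.0 for a in ACTIONS})

def choose_action(state, epsilon=0.1):
    """Epsilon-greedy selection over the four Q-values for this state."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(q_table[state], key=q_table[state].get)

def update(state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Tabular Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q_table[next_state].values())
    td_target = reward + gamma * best_next
    q_table[state][action] += alpha * (td_target - q_table[state][action])

# Example: the "danger ahead, food up" state (100000010) from the table above.
state = (1, 0, 0, 0, 0, 0, 0, 1, 0)
action = choose_action(state)
```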
Deep Q-Network (DQN) Architecture
Input Layer (8D state vector: [danger, direction, food]) → Hidden Layer 1 (64 neurons, ReLU) → Hidden Layer 2 (64 neurons, ReLU) → Output Layer (4 Q-values: [Q(s,up), Q(s,down), Q(s,left), Q(s,right)])
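The layer sizes above translate directly into a small feed-forward network. The sketch below assumes PyTorch; the class name, default hyperparameters, and the placeholder input are illustrative rather than taken from the article.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """8 -> 64 -> 64 -> 4 network matching the diagram above."""
    def __init__(self, state_dim=8, hidden=64, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),  # Q(s,up), Q(s,down), Q(s,left), Q(s,right)
        )

    def forward(self, state):
        return self.net(state)

# Greedy action: index of the largest of the four Q-values.
dqn = DQN()
state = torch.zeros(1, 8)          # placeholder 8D state vector
action = dqn(state).argmax(dim=1)  # 0=up, 1=down, 2=left, 3=right
```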
PPO & Actor-Critic Architecture (Shared Pattern)
Both PPO and Actor-Critic use the same dual-network architecture pattern with separate policy and value networks:
Policy/Actor Network: 8 → 64 → 64 → 4, softmax output → action probabilities [P(up), P(down), P(left), P(right)]
Value/Critic Network: 8 → 64 → 64 → 1, linear output → state value V(s), the expected return from state s
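A minimal sketch of this dual-network pattern, again assuming PyTorch; the Actor/Critic class names and default sizes mirror the shapes above but are otherwise illustrative.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: 8 -> 64 -> 64 -> 4, softmax over the four actions."""
    def __init__(self, state_dim=8, hidden=64, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        # [P(up), P(down), P(left), P(right)]
        return torch.softmax(self.net(state), dim=-1)

class Critic(nn.Module):
    """Value network: 8 -> 64 -> 64 -> 1, linear output giving V(s)."""
    def __init__(self, state_dim=8, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.net(state).squeeze(-1)
```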
Key Difference: PPO constrains each policy update with a clipped surrogate objective, while vanilla Actor-Critic applies unclipped policy gradients weighted by the advantage estimate.
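To make that difference concrete, the sketch below contrasts the two policy losses. It assumes PyTorch tensors of log-probabilities and precomputed advantage estimates; the function names and the 0.2 clip range are common illustrative defaults, not values from the article.

```python
import torch

def actor_critic_policy_loss(log_probs, advantages):
    """Vanilla actor-critic: policy gradient weighted by the advantage estimate."""
    return -(log_probs * advantages.detach()).mean()

def ppo_policy_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO clipped surrogate: the probability ratio is clipped to [1-eps, 1+eps],
    so a single update cannot move the policy too far from the old policy."""
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```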