Reinforcement Learning Framework

Modular framework for training Reinforcement Learning agents

Reinforcement Learning · Stable Baselines3 · Python · Gymnasium · Deep Learning

Project Overview

Developed a modular reinforcement learning framework for training and evaluating RL agents across multiple environments. The framework leverages Stable Baselines3 for its algorithm implementations and provides a flexible architecture for experimenting with different hyperparameters and training configurations.

This project demonstrates practical applications of reinforcement learning algorithms and is designed to be easily extensible, allowing for rapid prototyping and testing of different RL approaches.

Training Demonstration

Watch a trained PPO agent navigate the Walker2D environment in PyBullet. The agent has learned to coordinate its joints and maintain balance, achieving smooth forward locomotion and reaching a reward of 2400+.

Trained PPO agent demonstrating stable forward locomotion in Walker2D

Hyperparameter Tuning Impact

Before vs. After Hyperparameter Optimization

Comparison demonstrating the impact of Optuna-based hyperparameter optimization on SAC performance in LunarLanderContinuous. The tuned agent (mean reward: 290.84) exhibits significantly smoother control, more efficient fuel usage, and more consistent successful landings compared to default hyperparameters.

Before Tuning
After Tuning

Technologies Used

Python

Stable Baselines3

Gymnasium

PyTorch

TensorBoard

NumPy

Algorithms Implemented

The framework supports several reinforcement learning algorithms, each optimized for different types of control tasks and action spaces.

PPO

Proximal Policy Optimization: A policy gradient method that strikes a balance between sample efficiency and ease of implementation. Well suited to both discrete and continuous control tasks.

A2C

Advantage Actor-Critic: An actor-critic method that uses parallel environments for efficient data collection. Great for faster training cycles.

DQN

Deep Q-Network: Value-based method using experience replay and target networks. Applicable to discrete action spaces only.

SAC

Soft Actor-Critic: Off-policy algorithm that maximizes both reward and entropy. Excellent for continuous control with sample efficiency.

DDPG

Deep Deterministic Policy Gradient: Off-policy actor–critic algorithm that learns a deterministic policy. Good for continuous control but prone to instability and over-estimation without careful tuning.

TD3

Twin Delayed DDPG: Improved variant of DDPG that fixes its weaknesses with twin critics, delayed policy updates, and target smoothing. Much more stable and reliable for continuous control.
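The distinctions above can be captured as a small metadata table. The sketch below shows how a framework might use such a table to filter candidate algorithms for a task; the table and helper are illustrative, not part of Stable Baselines3.

```python
# Illustrative metadata for the six supported algorithms: whether each is
# on-policy and which action-space types it handles.
ALGORITHMS = {
    "PPO":  {"on_policy": True,  "action_spaces": {"discrete", "continuous"}},
    "A2C":  {"on_policy": True,  "action_spaces": {"discrete", "continuous"}},
    "DQN":  {"on_policy": False, "action_spaces": {"discrete"}},
    "SAC":  {"on_policy": False, "action_spaces": {"continuous"}},
    "DDPG": {"on_policy": False, "action_spaces": {"continuous"}},
    "TD3":  {"on_policy": False, "action_spaces": {"continuous"}},
}

def candidates(action_space: str) -> list:
    """Return the algorithms applicable to a given action-space type."""
    return [name for name, meta in ALGORITHMS.items()
            if action_space in meta["action_spaces"]]

print(candidates("discrete"))    # PPO, A2C, DQN
print(candidates("continuous"))  # PPO, A2C, SAC, DDPG, TD3
```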

Training Environments

Training and evaluation across multiple Gymnasium and PyBullet environments, from classic control problems to complex robotic locomotion tasks.

Gymnasium Environments

CartPole-v1

Balance a pole on a moving cart through precise control.

Pendulum-v1

Swing up and balance an inverted pendulum with continuous control.

Acrobot-v1

Swing a two-link robot arm to reach above a target height.

LunarLander-v2

Land spacecraft safely using discrete thruster controls.

LunarLanderContinuous-v2

Continuous control variant with smooth thruster adjustments.

BipedalWalker-v3

Train a 2D bipedal robot to walk across varied terrain.

PyBullet Environments

AntBulletEnv-v0

Quadruped robot learning to walk and navigate efficiently.

HalfCheetahBulletEnv-v0

High-speed 2D running with a cheetah-inspired robot.

HopperBulletEnv-v0

Single-legged robot learning to balance and hop forward.

Walker2DBulletEnv-v0

Bipedal robot with two articulated legs learning stable balance and forward locomotion.

Technical Approach

Framework Architecture

The framework follows a modular design pattern with separate components for environment setup, agent configuration, training loops, and evaluation metrics. This separation allows for easy swapping of algorithms and environments without restructuring the codebase.
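One way to realize this swappability is a registry-based factory: algorithms register themselves under a name, and the trainer never imports them directly. The sketch below illustrates the pattern with hypothetical names; it is not the project's actual code.

```python
# Registry-based factory sketch: adding a new algorithm means registering a
# class, with no changes to the training code. All names are illustrative.
from typing import Callable, Dict

AGENT_REGISTRY: Dict[str, Callable] = {}

def register(name: str):
    """Class decorator that adds an agent builder to the registry."""
    def wrap(cls):
        AGENT_REGISTRY[name] = cls
        return cls
    return wrap

@register("ppo")
class PPOAgent:
    def __init__(self, env_id: str, **hparams):
        self.env_id, self.hparams = env_id, hparams

def make_agent(name: str, env_id: str, **hparams):
    """Look up an agent builder by name and construct it."""
    return AGENT_REGISTRY[name](env_id, **hparams)

agent = make_agent("ppo", "Walker2DBulletEnv-v0", learning_rate=3e-4)
```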

Key Features

Modular Design

Clean separation of concerns with reusable components for training, evaluation, and logging.

Hyperparameter Management

YAML-based configuration files for easy experimentation with different parameters.
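A configuration file in this style might look as follows; the keys and values below are hypothetical, not taken from the project.

```yaml
# ppo_walker2d.yaml -- illustrative example, not the project's actual file
algorithm: PPO
environment: Walker2DBulletEnv-v0
total_timesteps: 2000000
hyperparameters:
  learning_rate: 0.0003
  n_steps: 2048
  batch_size: 64
  ent_coef: 0.0
```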

Training Monitoring

Real-time tracking of rewards, losses, and other metrics using TensorBoard integration.

Model Checkpointing

Automatic saving of best-performing models based on evaluation metrics.
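The "save only when the evaluation score improves" logic reduces to tracking a running best. The stand-alone sketch below shows the idea with hypothetical names; in practice Stable Baselines3's `EvalCallback` provides equivalent behavior out of the box.

```python
# Sketch of best-model checkpointing: keep the running best evaluation
# score and signal a save only when it improves. Names are illustrative.
class BestModelCheckpoint:
    def __init__(self):
        self.best_mean_reward = float("-inf")

    def update(self, mean_reward: float) -> bool:
        """Return True when the new score beats the best seen so far."""
        if mean_reward > self.best_mean_reward:
            self.best_mean_reward = mean_reward
            return True   # caller would save the model here
        return False

ckpt = BestModelCheckpoint()
print([ckpt.update(r) for r in [120.0, 90.0, 250.0, 250.0]])
# -> [True, False, True, False]
```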

Multi-Environment Support

Unified interface for training across different Gymnasium environments.

Training Process

The training pipeline consists of several stages: environment initialization, agent creation with specified hyperparameters, iterative training with periodic evaluation, and final model demonstration. The framework supports both on-policy (PPO, A2C) and off-policy (DQN, SAC, DDPG, TD3) algorithms, each with tailored and optimized training loops.
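The stages above can be outlined as a single loop that alternates training chunks with evaluation. Every name in this sketch is a hypothetical stand-in, not the project's API; the stubs at the bottom exist only to make the outline runnable.

```python
# Illustrative outline of the pipeline: train in chunks, evaluate
# periodically, and keep the evaluation history.
def run_pipeline(train_chunk, evaluate, total_steps, eval_every):
    """Alternate training and evaluation; return (step, score) history."""
    history = []
    for step in range(eval_every, total_steps + 1, eval_every):
        train_chunk(eval_every)              # e.g. model.learn(eval_every)
        history.append((step, evaluate()))   # e.g. mean episode reward
    return history

# Stub usage: "training" just counts steps, "evaluation" reports the count.
state = {"steps": 0}
hist = run_pipeline(lambda n: state.__setitem__("steps", state["steps"] + n),
                    lambda: state["steps"],
                    total_steps=4000, eval_every=1000)
print(hist)  # -> [(1000, 1000), (2000, 2000), (3000, 3000), (4000, 4000)]
```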

Results & Performance

After extensive training and hyperparameter tuning, the agents achieved strong performance across all tested environments. The PPO algorithm consistently demonstrated the best balance between sample efficiency and final performance.

Key Achievements

  • Walker2DBulletEnv-v0: Achieved 2403 reward with PPO through systematic hyperparameter optimization
  • Implemented Optuna-based Bayesian optimization reducing hyperparameter search time by efficiently exploring 50+ trial configurations
  • Built modular factory pattern architecture enabling rapid experimentation across 10 environments and 6 RL algorithms
  • Designed comprehensive evaluation framework with automated metrics tracking, TensorBoard integration, and JSON result logging

Hyperparameter Tuning Results

Systematic hyperparameter optimization led to significant improvements in agent performance. Key findings include the importance of learning rate scheduling, optimal batch sizes for different algorithms, and the critical role of entropy coefficient in exploration vs. exploitation balance.
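The actual search used Optuna's Bayesian sampler; the loop below sketches the same sample-and-score idea with plain random sampling from the standard library, so the search space and the stand-in objective are both illustrative, not the project's real ones.

```python
# Random-search sketch of hyperparameter tuning (stand-in for Optuna).
import random

random.seed(0)  # reproducible sketch

def sample_config():
    """Draw one configuration from a hypothetical SAC-style search space."""
    return {
        "learning_rate": 10 ** random.uniform(-5, -3),
        "batch_size": random.choice([64, 128, 256]),
        "ent_coef": random.uniform(0.0, 0.1),
    }

def objective(cfg):
    """Stand-in for 'train briefly, return mean eval reward'."""
    return -abs(cfg["learning_rate"] - 3e-4) * 1e4 - cfg["ent_coef"]

best = max((sample_config() for _ in range(50)), key=objective)
print(best)
```

Optuna replaces the blind `sample_config` with a sampler that concentrates trials in promising regions, which is why 50 trials sufficed here.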

Challenges & Solutions

High Variance and Seed Sensitivity

With identical hyperparameters, Walker2D performance varied widely across runs, with final episode rewards ranging from roughly 500 to 2400. The random seed used for initialization had a large impact on final results.

Solution:

Implemented multi-seed validation methodology, running each configuration with multiple random seeds (42, 43, 44) to assess reproducibility and measure performance variance.
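Aggregating across seeds reduces to reporting the mean and spread of the per-seed results. The sketch below uses made-up reward numbers spanning the variance range observed above; only the seeds (42, 43, 44) come from the source.

```python
# Multi-seed validation sketch: summarize final rewards across seeds.
# The reward values are fabricated for illustration.
from statistics import mean, stdev

results = {42: 2400.0, 43: 500.0, 44: 1800.0}   # seed -> final reward (fake)
rewards = list(results.values())
print(f"mean {mean(rewards):.0f}, stdev {stdev(rewards):.0f} "
      f"over {len(rewards)} seeds")
```

Reporting the standard deviation alongside the mean is what distinguishes a genuinely better configuration from a lucky seed.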

Hyperparameter Stability

The hyperparameters Optuna selected produced fast early learning (driven by low entropy coefficients) but caused catastrophic forgetting during longer training runs.

Solution:

Initial hyperparameters were tuned on shorter runs (500k timesteps) for efficiency, followed by full-duration validation runs (2M timesteps). Hyperparameters were then adjusted to support longer, more stable training.