Reinforcement Learning Framework
Modular framework for training Reinforcement Learning agents
Project Overview
Developed a modular reinforcement learning framework for training and evaluating RL agents across multiple environments. The framework leverages Stable Baselines3 for algorithm implementations and provides a flexible architecture for experimenting with different hyperparameters and training configurations.
This project demonstrates practical applications of reinforcement learning algorithms and is designed to be easily extensible, allowing for rapid prototyping and testing of different RL approaches.
Training Demonstration
Watch a trained PPO agent navigate the Walker2D environment in PyBullet. The agent has learned to coordinate its joints and maintain balance, achieving smooth forward locomotion and reaching a reward of 2400+.
Trained PPO agent demonstrating stable forward locomotion in Walker2D
Hyperparameter Tuning Impact
Before vs. After Hyperparameter Optimization
Comparison demonstrating the impact of Optuna-based hyperparameter optimization on SAC performance in LunarLanderContinuous. The tuned agent (mean reward: 290.84) exhibits significantly smoother control, more efficient fuel usage, and more consistent successful landings compared to default hyperparameters.
Technologies Used
Python
Stable Baselines3
Gymnasium
PyTorch
TensorBoard
NumPy
Algorithms Implemented
The framework supports several reinforcement learning algorithms, each optimized for different types of control tasks and action spaces.
PPO
Proximal Policy Optimization: A policy gradient method that strikes a balance between sample efficiency and ease of implementation. Ideal for continuous control tasks.
A2C
Advantage Actor-Critic: An actor-critic method that uses parallel environments for efficient data collection. Great for faster training cycles.
DQN
Deep Q-Network: Value-based method using experience replay and target networks. Perfect for discrete action spaces.
SAC
Soft Actor-Critic: Off-policy algorithm that maximizes both reward and entropy. Excellent for continuous control with sample efficiency.
DDPG
Deep Deterministic Policy Gradient: Off-policy actor–critic algorithm that learns a deterministic policy. Good for continuous control but prone to instability and over-estimation without careful tuning.
TD3
Twin Delayed DDPG: Improved variant of DDPG that fixes its weaknesses with twin critics, delayed policy updates, and target smoothing. Much more stable and reliable for continuous control.
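A factory-style registry over these six algorithms might look like the following sketch; the registry layout and the `make_agent` helper are assumptions for illustration, not the project's actual code:

```python
# Hypothetical registry mapping algorithm names to Stable Baselines3 classes.
# The lazy import keeps the framework usable even before SB3 is loaded.
import importlib

ALGOS = {
    "PPO":  {"module": "stable_baselines3", "cls": "PPO",  "on_policy": True},
    "A2C":  {"module": "stable_baselines3", "cls": "A2C",  "on_policy": True},
    "DQN":  {"module": "stable_baselines3", "cls": "DQN",  "on_policy": False},
    "SAC":  {"module": "stable_baselines3", "cls": "SAC",  "on_policy": False},
    "DDPG": {"module": "stable_baselines3", "cls": "DDPG", "on_policy": False},
    "TD3":  {"module": "stable_baselines3", "cls": "TD3",  "on_policy": False},
}

def make_agent(name, env, **hyperparams):
    """Instantiate an SB3 agent by name using an MLP policy."""
    spec = ALGOS[name]
    cls = getattr(importlib.import_module(spec["module"]), spec["cls"])
    return cls("MlpPolicy", env, **hyperparams)
```

A registry like this is what lets the same training loop serve both on-policy and off-policy algorithms without conditional imports scattered through the codebase.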
Training Environments
Training and evaluation across multiple Gymnasium and PyBullet environments, from classic control problems to complex robotic locomotion tasks.
Gymnasium Environments
CartPole-v1
Balance a pole on a moving cart through precise control.
Pendulum-v1
Swing up and balance an inverted pendulum with continuous control.
Acrobot-v1
Swing a two-link robot arm to reach above a target height.
LunarLander-v2
Land spacecraft safely using discrete thruster controls.
LunarLanderContinuous-v2
Continuous control variant with smooth thruster adjustments.
BipedalWalker-v3
Train a 2D bipedal robot to walk across varied terrain.
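To make the environment/algorithm pairing concrete, here is an illustrative map of the Gymnasium tasks above to their action-space type; the `compatible_algorithms` helper is hypothetical, not part of the framework:

```python
# Action-space types of the Gymnasium tasks listed above. DQN requires a
# discrete action space, while SAC/DDPG/TD3 require a continuous one;
# PPO and A2C handle both.
ENV_ACTION_SPACES = {
    "CartPole-v1": "discrete",
    "Pendulum-v1": "continuous",
    "Acrobot-v1": "discrete",
    "LunarLander-v2": "discrete",
    "LunarLanderContinuous-v2": "continuous",
    "BipedalWalker-v3": "continuous",
}

def compatible_algorithms(env_id):
    """Return the subset of supported algorithms applicable to an env."""
    if ENV_ACTION_SPACES[env_id] == "discrete":
        return ["PPO", "A2C", "DQN"]
    return ["PPO", "A2C", "SAC", "DDPG", "TD3"]
```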
PyBullet Environments
AntBulletEnv-v0
Quadruped robot learning to walk and navigate efficiently.
HalfCheetahBulletEnv-v0
High-speed 2D running with a cheetah-inspired robot.
HopperBulletEnv-v0
Single-legged hopping robot balancing and forward movement.
Walker2DBulletEnv-v0
Bipedal robot with two articulated legs learning stable balance and forward locomotion.
Technical Approach
Framework Architecture
The framework follows a modular design pattern with separate components for environment setup, agent configuration, training loops, and evaluation metrics. This separation allows for easy swapping of algorithms and environments without restructuring the codebase.
Key Features
Modular Design
Clean separation of concerns with reusable components for training, evaluation, and logging.
Hyperparameter Management
YAML-based configuration files for easy experimentation with different parameters.
Training Monitoring
Real-time tracking of rewards, losses, and other metrics using TensorBoard integration.
Model Checkpointing
Automatic saving of best-performing models based on evaluation metrics.
Multi-Environment Support
Unified interface for training across different Gymnasium environments.
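As one concrete example of the YAML-based configuration mentioned above, a run description like the following could be parsed into agent keyword arguments; the schema shown here is an assumption for illustration, not the project's actual file format:

```python
# Sketch of YAML-driven hyperparameter loading (requires PyYAML).
import yaml

CONFIG = """
algorithm: PPO
env_id: Walker2DBulletEnv-v0
total_timesteps: 2000000
hyperparams:
  learning_rate: 3.0e-4
  n_steps: 2048
  batch_size: 64
  ent_coef: 0.0
"""

cfg = yaml.safe_load(CONFIG)
# cfg["hyperparams"] can be passed straight through as **kwargs when
# constructing the SB3 model, so experiments differ only by config file.
```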
Training Process
The training pipeline consists of several stages: environment initialization, agent creation with specified hyperparameters, iterative training with periodic evaluation, and final model demonstration. The framework supports both on-policy (PPO, A2C) and off-policy (DQN, SAC, DDPG, TD3) algorithms, each with tailored and optimized training loops.
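The stages above can be sketched with Stable Baselines3 as follows; the environment id, timestep counts, and checkpoint path are illustrative defaults, not the project's exact configuration:

```python
# Minimal end-to-end sketch of the four pipeline stages described above.
def train(env_id="CartPole-v1", total_timesteps=10000, eval_freq=2000):
    import gymnasium as gym
    from stable_baselines3 import PPO
    from stable_baselines3.common.callbacks import EvalCallback

    env = gym.make(env_id)                        # 1. environment initialization
    model = PPO("MlpPolicy", env, verbose=0)      # 2. agent creation
    eval_cb = EvalCallback(                       # 3. periodic evaluation,
        gym.make(env_id),                         #    saving the best model
        best_model_save_path="./checkpoints",
        eval_freq=eval_freq,
    )
    model.learn(total_timesteps, callback=eval_cb)  # 4. iterative training
    return model
```

Swapping `PPO` for any other registered algorithm leaves the rest of this pipeline unchanged, which is the point of the modular design.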
Results & Performance
After extensive training and hyperparameter tuning, the agents achieved strong performance across all tested environments. The PPO algorithm consistently demonstrated the best balance between sample efficiency and final performance.
Key Achievements
- Walker2DBulletEnv-v0: Achieved 2403 reward with PPO through systematic hyperparameter optimization
- Implemented Optuna-based Bayesian optimization reducing hyperparameter search time by efficiently exploring 50+ trial configurations
- Built modular factory pattern architecture enabling rapid experimentation across 10 environments and 6 RL algorithms
- Designed comprehensive evaluation framework with automated metrics tracking, TensorBoard integration, and JSON result logging
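A minimal sketch of what the Optuna search space for a PPO-style tuning run might look like; the parameter names and ranges are plausible assumptions, not the project's exact configuration:

```python
# Hypothetical Optuna search space for PPO-style hyperparameters.
def sample_ppo_params(trial):
    """Sample one hyperparameter configuration from an Optuna trial."""
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True),
        "n_steps": trial.suggest_categorical("n_steps", [512, 1024, 2048]),
        "batch_size": trial.suggest_categorical("batch_size", [32, 64, 128]),
        "ent_coef": trial.suggest_float("ent_coef", 1e-8, 0.1, log=True),
    }

# The objective would train an agent with these params and return its mean
# evaluation reward; Optuna's Bayesian sampler then drives the 50+ trials:
#   study = optuna.create_study(direction="maximize")
#   study.optimize(objective, n_trials=50)
```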
Hyperparameter Tuning Results
Systematic hyperparameter optimization led to significant improvements in agent performance. Key findings include the importance of learning rate scheduling, optimal batch sizes for different algorithms, and the critical role of entropy coefficient in exploration vs. exploitation balance.
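On the learning-rate-scheduling finding: Stable Baselines3 accepts a callable for `learning_rate` that receives the remaining training progress (decaying from 1.0 to 0.0). A linear schedule is a common choice and is one plausible reading of what scheduling looked like here:

```python
# Linear learning rate schedule in the form SB3 expects: a function of
# remaining progress (1.0 at the start of training, 0.0 at the end).
def linear_schedule(initial_value):
    def schedule(progress_remaining):
        return progress_remaining * initial_value
    return schedule

# Usage (illustrative): PPO("MlpPolicy", env, learning_rate=linear_schedule(3e-4))
```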
Challenges & Solutions
High Variance and Seed Sensitivity
With identical hyperparameters, final performance on the Walker2D environment varied widely (rewards ranging from roughly 500 to 2400), because the random seed used at initialization had a large impact on final results.
Solution: Implemented a multi-seed validation methodology, running each configuration with multiple random seeds (42, 43, 44) to assess reproducibility and measure performance variance.
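The multi-seed summary step could be as simple as the sketch below; the reward values in the test of concept are placeholders, not measured results:

```python
# Aggregate per-seed evaluation rewards into a mean and spread, so a
# configuration is judged by its distribution rather than a single run.
import statistics

def summarize_seeds(results):
    """results: {seed: mean_eval_reward} -> (mean, population std dev)."""
    rewards = list(results.values())
    return statistics.mean(rewards), statistics.pstdev(rewards)
```

Reporting the standard deviation alongside the mean is what makes the 500-2400 spread visible instead of hidden behind a single lucky seed.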
Hyperparameter Stability
Optuna-selected hyperparameters produced fast early learning (driven by low entropy coefficients) but led to catastrophic forgetting during longer training runs.
Solution: Hyperparameters were first tuned on shorter runs (500k timesteps) for efficiency, then validated over the full 2M-timestep duration and adjusted to support longer, more stable training.