Adapting Deterministic Environments into Stochastic Ones¶
Install¶
Uncomment the following cells:
[1]:
#!git clone https://github.com/ricgama/maenvs4vrp.git
[2]:
# When using Colab
#%cd maenvs4vrp
#%mv maenvs4vrp/ repo_temp/
#%mv repo_temp/ ..
#%cd ..
#%cp maenvs4vrp/setup.py repo_temp/
#%rm -r maenvs4vrp
#%mv repo_temp/ maenvs4vrp/
#%cd maenvs4vrp/
#!pip install .
This tutorial aims to demonstrate how to adapt the existing deterministic environments in the library to create stochastic variants. Introducing randomness into environments is crucial for improving robustness and realism in agent training. For instance, stochasticity can better simulate real-world variability in factors like customer demand, travel times, or service durations. For more context on its importance, check Opportunities for reinforcement learning in stochastic dynamic vehicle routing.
[4]:
import torch
import matplotlib.pyplot as plt
from typing import Optional
from tensordict import TensorDict
Capacitated Vehicle Routing Problem with Soft Time Windows and Stochastic Travel Times¶
In this example, we will base our approach on the AI4TSP competition environment, where travel times between nodes are affected by random noise. Specifically, the travel time between nodes \(i\) and \(j\) is modeled as:
Here, \(\text{dist\_euclidean}(i, j)\) denotes the Euclidean distance between nodes i and j, and \(U(0, \text{dist\_euclidean}(i, j))\) represents uniform noise in the range from 0 to the Euclidean distance. In this environment, violating a time window incurred a penalty of 1 unit, while exceeding the total available time resulted in a penalty of 2 units (assuming precedence on tw penalty).
To do this, we will change the Capacitated Vehicle Routing Problem with Soft Time Windows environment by introducing these elements of randomness.
[5]:
from maenvs4vrp.environments.cvrpstw.env import Environment
from maenvs4vrp.environments.cvrpstw.env_agent_selector import AgentSelector
from maenvs4vrp.environments.cvrpstw.observations import Observations
from maenvs4vrp.environments.cvrpstw.instances_generator import InstanceGenerator
from maenvs4vrp.environments.cvrpstw.env_agent_reward import DenseReward
[6]:
gen = InstanceGenerator()
sel = AgentSelector()
rew = DenseReward()
obs = Observations()
Let’s start by updating the _update_feasibility method. Since the environment involves randomness, the only checks we need to perform are the Capacity constraint and whether nodes have been visited previously, so we comment out everything else (i.e. we remove the time window hard constraints ):
[7]:
class Environment(Environment):
def _update_feasibility(self):
"""
"""
_mask = self.td_state['nodes']['active_nodes_mask'].clone()
# time windows constraints
#loc = self.td_state['coords'].gather(1, self.td_state['cur_agent']['cur_node'][:,:,None].expand(-1, -1, 2))
#ptime = self.td_state['cur_agent']['cur_time'].clone()
#time2j = torch.pairwise_distance(loc, self.td_state["coords"], eps=0, keepdim = False)
#if self.n_digits is not None:
# time2j = torch.floor(self.n_digits * time2j) / self.n_digits
#arrivej = ptime + time2j
#waitj = torch.clip(self.td_state['tw_low']-arrivej, min=0)
#service_startj = arrivej + waitj
#c1 = service_startj <= self.td_state['tw_high']
#c2 = service_startj + self.td_state['service_time'] + self.td_state['time2depot'] <= self.td_state['end_time'].unsqueeze(-1)
# capacity constraints
c3 = self.td_state['demands'] <= self.td_state['cur_agent']['cur_load']
_mask = _mask * c3 # * c1 * c2
# if agent is done, close all services and open depot
agents_done = self.td_state['agents']['active_agents_mask'].gather(1, self.td_state['cur_agent_idx']).clone()
_mask = _mask * agents_done
_mask.scatter_(1, self.td_state['depot_idx'], True)
# update state
self.td_state['cur_agent'].update({'action_mask': _mask})
self.td_state['agents']['feasible_nodes'].scatter_(1,
self.td_state['cur_agent_idx'][:,:,None].expand(-1,-1,self.num_nodes), _mask.unsqueeze(1))
Now, we have to introduce stochasticity when updating the environment state. While the distribution’s parameters and object are typically defined in the InstanceGenerator class, for simplicity, we’ll make a hardcoded change directly in the environment by modifying the following method:
[8]:
class Environment(Environment):
def _update_state(self, action):
"""
Update environment state.
Args:
action(torch.Tensor): Tensor with agent moves.
Returns:
None.
"""
loc = self.td_state['coords'].gather(1, self.td_state['cur_agent']['cur_node'][:,:,None].expand(-1, -1, 2))
next_loc = self.td_state['coords'].gather(1, action[:,:,None].expand(-1, -1, 2))
ptime = self.td_state['cur_agent']['cur_time'].clone()
time2j = torch.pairwise_distance(loc, next_loc, eps=0, keepdim = False)
#
time2j = time2j + torch.rand(*self.batch_size).unsqueeze(1) * time2j
if self.n_digits is not None:
time2j = torch.floor(self.n_digits * time2j) / self.n_digits
tw = self.td_state['tw_low'].gather(1, action)
service_time = self.td_state['service_time'].gather(1, action)
arrivej = ptime + time2j
waitj = torch.clip(tw-arrivej, min=0)
time_update = arrivej + waitj + service_time
# update agent cur node
self.td_state['cur_agent']['cur_node'] = action
self.td_state['agents']['cur_node'].scatter_(1, self.td_state['cur_agent_idx'], self.td_state['cur_agent']['cur_node'])
# update agent cur time
self.td_state['cur_agent']['cur_time'] = time_update
agents_done = ~self.td_state['agents']['active_agents_mask'].gather(1, self.td_state['cur_agent_idx']).clone()
# Overwriting `self.early_penalty` and `self.late_penalty` here.
# Note: this should be handled inside the `env` class instead!
self.early_penalty = 0
self.late_penalty = 1
self.end_penalty = 2
tw_low = self.td_state['tw_low'].gather(1, action)
tw_high = self.td_state['tw_high'].gather(1, action)
penalty = - self.late_penalty * (arrivej > tw_high).float()
penalty = torch.where(agents_done, penalty-self.end_penalty * (arrivej > self.td_state['end_time'].unsqueeze(1)).float(),
penalty)
self.td_state['cur_agent']['cur_penalty'] = penalty
self.td_state['cur_agent']['cum_penalty'] += penalty
self.td_state['agents']['cur_penalty'].scatter_(1, self.td_state['cur_agent_idx'], self.td_state['cur_agent']['cur_penalty'])
self.td_state['agents']['cum_penalty'].scatter_(1, self.td_state['cur_agent_idx'], self.td_state['cur_agent']['cum_penalty'])
# if agent is done set agent time to end_time
self.td_state['cur_agent']['cur_time'] = torch.where(agents_done, self.td_state['end_time'].unsqueeze(-1),
self.td_state['cur_agent']['cur_time'])
self.td_state['agents']['cur_time'].scatter_(1, self.td_state['cur_agent_idx'], self.td_state['cur_agent']['cur_time'])
# update agent cum traveled time
self.td_state['cur_agent']['cur_ttime'] = time2j
self.td_state['cur_agent']['cum_ttime'] += time2j
self.td_state['agents']['cur_ttime'].scatter_(1, self.td_state['cur_agent_idx'], self.td_state['cur_agent']['cur_ttime'])
self.td_state['agents']['cum_ttime'].scatter_(1, self.td_state['cur_agent_idx'], self.td_state['cur_agent']['cum_ttime'])
# update agent load and node demands
self.td_state['cur_agent']['cur_load'] -= self.td_state['demands'].gather(1, action)
# is agent is done set agent cur_load to 0
self.td_state['cur_agent']['cur_load'] = torch.where( agents_done, 0.,
self.td_state['cur_agent']['cur_load'])
self.td_state['nodes']['cur_demands'].scatter_(1, action, torch.zeros_like(action, dtype = torch.float))
self.td_state['agents']['cur_load'].scatter_(1, self.td_state['cur_agent_idx'], self.td_state['cur_agent']['cur_load'])
# update visited nodes
r = torch.arange(*self.td_state.batch_size, device=self.device)
self.td_state['agents']['visited_nodes'][r, self.td_state['cur_agent_idx'].squeeze(-1), action.squeeze(-1)] = True
# update agent step
self.td_state['cur_agent']['cur_step'] = torch.where(~agents_done, self.td_state['cur_agent']['cur_step']+1,
self.td_state['cur_agent']['cur_step'])
self.td_state['agents']['cur_step'].scatter_(1, self.td_state['cur_agent_idx'], self.td_state['cur_agent']['cur_step'])
self.td_state['cur_node_idx'] = action.clone()
# if all done activate first agent to guarantee batch consistency during agent sampling
self.td_state['agents']['active_agents_mask'][self.td_state['agents']['active_agents_mask'].sum(1).eq(0), 0] = True
[9]:
import yaml
[10]:
feature_list = yaml.safe_load("""
nodes_static:
x_coordinate:
feat: x_coordinate
norm: min_max
x_coordinate:
feat: x_coordinate
norm:
tw_low:
feat: tw_low
norm:
tw_high:
feat: tw_high
norm:
nodes_dynamic:
- time2open_div_end_time
- time2close_div_end_time
agent:
- x_coordinate
- y_coordinate
- frac_current_time
""")
[11]:
obs = Observations(feature_list)
Let’s do an episode rollout and check the reward and penalty through every step:
[12]:
env = Environment(instance_generator_object=gen,
obs_builder_object=obs,
agent_selector_object=sel,
reward_evaluator=rew,
batch_size= 4,
seed=0)
[13]:
td = env.reset(num_agents=5, num_nodes=20)
while not td["done"].all():
td = env.sample_action(td)
td = env.step(td)
step = env.env_nsteps
reward = td['reward']
penalty = td['penalty']
print(f'env step number:{step}, reward: {reward}, penalty: {penalty}')
print(td["done"])
env step number:1, reward: tensor([[-0.7656],
[-0.6015],
[-0.8266],
[-0.5014]]), penalty: tensor([[-0.],
[-0.],
[-0.],
[-0.]])
tensor([False, False, False, False])
env step number:2, reward: tensor([[-0.6651],
[-0.2208],
[-0.6709],
[-1.0395]]), penalty: tensor([[-1.],
[-0.],
[-0.],
[-1.]])
tensor([False, False, False, False])
env step number:3, reward: tensor([[-0.2147],
[-1.0062],
[-0.7760],
[-0.6304]]), penalty: tensor([[-0.],
[-1.],
[-1.],
[-1.]])
tensor([False, False, False, False])
env step number:4, reward: tensor([[-0.7853],
[-1.7492],
[-0.5923],
[-0.8371]]), penalty: tensor([[-1.],
[-1.],
[-1.],
[-3.]])
tensor([False, False, False, False])
env step number:5, reward: tensor([[-0.8125],
[-1.4126],
[-0.9101],
[-0.7897]]), penalty: tensor([[-1.],
[-3.],
[-1.],
[-0.]])
tensor([False, False, False, False])
env step number:6, reward: tensor([[-2.0561],
[-0.7305],
[-1.0155],
[-1.1571]]), penalty: tensor([[-1.],
[-0.],
[-1.],
[-1.]])
tensor([False, False, False, False])
env step number:7, reward: tensor([[-0.7170],
[-0.3220],
[-0.9293],
[-0.9205]]), penalty: tensor([[-1.],
[-0.],
[-1.],
[-1.]])
tensor([False, False, False, False])
env step number:8, reward: tensor([[-0.6861],
[-0.9620],
[-1.4791],
[-1.5058]]), penalty: tensor([[-1.],
[-1.],
[-1.],
[-1.]])
tensor([False, False, False, False])
env step number:9, reward: tensor([[-0.8908],
[-0.3081],
[-0.4904],
[-0.9430]]), penalty: tensor([[-1.],
[-1.],
[-1.],
[-1.]])
tensor([False, False, False, False])
env step number:10, reward: tensor([[-0.1595],
[-0.4244],
[-1.7242],
[-0.6727]]), penalty: tensor([[-1.],
[-3.],
[-3.],
[-1.]])
tensor([False, False, False, False])
env step number:11, reward: tensor([[-1.2807],
[-0.0000],
[-0.3076],
[-0.0768]]), penalty: tensor([[-1.],
[-0.],
[-0.],
[-1.]])
tensor([False, False, False, False])
env step number:12, reward: tensor([[-0.2636],
[-0.6424],
[-0.2109],
[-0.5478]]), penalty: tensor([[-3.],
[-0.],
[-0.],
[-1.]])
tensor([False, False, False, False])
env step number:13, reward: tensor([[-1.2871],
[-1.1685],
[-1.6815],
[-0.6537]]), penalty: tensor([[-0.],
[-0.],
[-0.],
[-1.]])
tensor([False, False, False, False])
env step number:14, reward: tensor([[-0.8789],
[-1.2895],
[-1.2558],
[-0.4680]]), penalty: tensor([[-1.],
[-1.],
[-1.],
[-3.]])
tensor([False, False, False, False])
env step number:15, reward: tensor([[-0.8738],
[-1.1818],
[-0.5148],
[-0.3399]]), penalty: tensor([[-1.],
[-1.],
[-3.],
[-0.]])
tensor([False, False, False, False])
env step number:16, reward: tensor([[-1.4279],
[-1.6755],
[-1.3336],
[-0.5009]]), penalty: tensor([[-1.],
[-1.],
[-0.],
[-1.]])
tensor([False, False, False, False])
env step number:17, reward: tensor([[-0.9529],
[-1.0475],
[-0.7451],
[-0.6007]]), penalty: tensor([[-1.],
[-1.],
[-1.],
[-0.]])
tensor([False, False, False, False])
env step number:18, reward: tensor([[-0.1508],
[-0.5902],
[-1.3093],
[-0.2654]]), penalty: tensor([[-3.],
[-1.],
[-1.],
[-0.]])
tensor([False, False, False, False])
env step number:19, reward: tensor([[-0.3491],
[-0.1075],
[-0.4890],
[-0.5021]]), penalty: tensor([[-0.],
[-1.],
[-1.],
[-0.]])
tensor([False, False, False, False])
env step number:20, reward: tensor([[-0.1878],
[-0.1044],
[-1.0011],
[-0.4731]]), penalty: tensor([[-0.],
[-3.],
[-1.],
[-1.]])
tensor([False, False, False, False])
env step number:21, reward: tensor([[-0.6789],
[-0.4127],
[-0.8434],
[-0.7046]]), penalty: tensor([[-1.],
[-0.],
[-1.],
[-1.]])
tensor([False, False, False, False])
env step number:22, reward: tensor([[-1.1422],
[-0.6898],
[-0.8781],
[-0.3995]]), penalty: tensor([[-3.0000],
[-5.9069],
[-3.0000],
[-1.0000]])
tensor([False, True, False, False])
env step number:23, reward: tensor([[-0.0000],
[-0.0000],
[-0.7991],
[-0.7031]]), penalty: tensor([[-0.],
[-0.],
[-0.],
[-3.]])
tensor([False, True, False, False])
env step number:24, reward: tensor([[-0.0000],
[-0.0000],
[-1.0580],
[-0.0000]]), penalty: tensor([[-0.],
[-0.],
[-0.],
[-0.]])
tensor([True, True, True, True])
Team Orienteering Problem with Time Windows and Stochastic Profits (TOPTWSP)¶
For this example, we will assume that the profit associated with each location is a random variable following a Poisson distribution with mean \(\lambda_i\). This means that each time the environment is restarted, the profits will vary according to this distribution, introducing again another layer of stochasticity to the decision-making process.
[14]:
from maenvs4vrp.environments.toptw.env import Environment
from maenvs4vrp.environments.toptw.env_agent_selector import AgentSelector
from maenvs4vrp.environments.toptw.observations import Observations
from maenvs4vrp.environments.toptw.instances_generator import InstanceGenerator
from maenvs4vrp.environments.toptw.env_agent_reward import DenseReward
Before we begin, let’s initialize the environment for the Team Orientation Problem with Time Windows using the following arguments:
sample_type='augment'n_augment=4
If batch=8, this will generate two instances of the problem, each with 4 copies. Let’s see:
[15]:
gen = InstanceGenerator(batch_size = 8)
obs = Observations()
sel = AgentSelector()
rew = DenseReward()
env = Environment(instance_generator_object=gen,
obs_builder_object=obs,
agent_selector_object=sel,
reward_evaluator=rew,
seed=0)
[16]:
td = env.reset(batch_size = 8,
num_agents=2,
num_nodes=9,
profits = 'uniform',
sample_type='augment',
n_augment=4)
[17]:
env.td_state['profits']
[17]:
tensor([[0., 6., 1., 6., 2., 0., 3., 9., 9.],
[0., 2., 8., 5., 9., 0., 5., 9., 8.],
[0., 6., 1., 6., 2., 0., 3., 9., 9.],
[0., 2., 8., 5., 9., 0., 5., 9., 8.],
[0., 6., 1., 6., 2., 0., 3., 9., 9.],
[0., 2., 8., 5., 9., 0., 5., 9., 8.],
[0., 6., 1., 6., 2., 0., 3., 9., 9.],
[0., 2., 8., 5., 9., 0., 5., 9., 8.]])
As we can see, in the deterministic problem, for the same instance the nodes present the same profit values.
Modifying the Environment for Random Profit Sampling¶
Now let’s change the environment so that, for the same instance, we obtain random samples of the points’ profits.
To achieve this, we will directly modify the ``InstanceGenerator`` class. Specifically:
Assume that the
profitsattribute, created in therandom_generate_instancemethod, establishes the mean values of a Poisson distribution.These mean values will then be used in the
augment_generate_instancemethod when sampling the profits.
[18]:
class InstanceGenerator(InstanceGenerator):
def augment_generate_instance(self, num_agents:int=20,
num_nodes:int=100,
service_times:int=0.2,
profits:str='distance',
batch_size: Optional[torch.Size] = None,
n_augment:int = 2,
seed:int=None)-> TensorDict:
"""
Generate augmentated instance.
Args:
num_agents(int): Total number of agents. Defaults to 20.
num_nodes(int): Total number of nodes. Defaults to 100.
capacity(int): Total capacity for each agent. Defaults to 50.
service_times(int): Service time in the nodes. Defaults to 0.2.
batch_size(torch.Size, optional): Batch size. Defaults to None.
n_augment(int): Data augmentation. Defaults to 2.
seed(int, optional): Random number generator seed. Defaults to None.
Returns:
TensorDict: Instance data.
"""
if seed is not None:
self._set_seed(seed)
if num_agents is not None:
assert num_agents>0, f"number of agents must be grater them 0!"
self.max_num_agents = num_agents
if num_nodes is not None:
assert num_nodes>0, f"number of services must be grater them 0!"
self.max_num_nodes = num_nodes
if service_times is not None:
self.service_times = service_times
if batch_size is not None:
batch_size = [batch_size] if isinstance(batch_size, int) else batch_size
self.batch_size = torch.Size(batch_size)
assert self.batch_size.numel()%n_augment == 0, f"batch_size must be divisible by n_augment"
s_batch_size = self.batch_size.numel() // n_augment
self.s_batch_size = torch.Size([s_batch_size])
instance_info_s = self.random_generate_instance(num_agents=num_agents,
num_nodes=num_nodes,
profits=profits,
service_times=service_times,
batch_size = self.s_batch_size,
seed=seed)
self.batch_size = torch.Size(batch_size)
instance = TensorDict({}, batch_size=self.batch_size, device=self.device)
for key in instance_info_s['data'].keys():
if len(instance_info_s['data'][key].shape) == 3:
instance[key] = instance_info_s['data'][key].repeat(n_augment, 1, 1)
elif len(instance_info_s['data'][key].shape) == 2:
instance[key] = instance_info_s['data'][key].repeat(n_augment, 1)
elif len(instance_info_s['data'][key].shape) == 1:
instance[key] = instance_info_s['data'][key].repeat(n_augment)
# Here we sample the profits:
instance['profits'] = torch.poisson(instance['profits'])
instance_info = {'name':'random_instance',
'num_nodes': self.max_num_nodes,
'num_agents':self.max_num_agents,
'data':instance}
return instance_info
[19]:
stoch_gen = InstanceGenerator(batch_size = 8)
[20]:
stoch_env = Environment(instance_generator_object=stoch_gen,
obs_builder_object=obs,
agent_selector_object=sel,
reward_evaluator=rew,
seed=0)
restarting the environment we get:
[21]:
td = stoch_env.reset(batch_size = 8,
num_agents=2,
num_nodes=9,
profits = 'uniform',
sample_type='augment',
n_augment=4)
[22]:
stoch_env.td_state['profits']
[22]:
tensor([[ 0., 7., 2., 2., 2., 0., 3., 8., 6.],
[ 0., 1., 9., 3., 15., 0., 3., 8., 4.],
[ 0., 5., 1., 5., 1., 0., 2., 17., 10.],
[ 0., 2., 8., 3., 13., 0., 4., 7., 7.],
[ 0., 9., 2., 5., 1., 0., 5., 13., 8.],
[ 0., 2., 11., 6., 5., 0., 6., 11., 8.],
[ 0., 10., 2., 5., 2., 0., 3., 6., 7.],
[ 0., 2., 9., 9., 4., 0., 9., 9., 9.]])
As we can see, for each instance implementation, the profit values differ as expected.
It is important to note that in these two demonstrative examples, our goal is simply to illustrate how deterministic environments can be modified to include stochastic elements.
When creating stochastic environments from scratch, we may use deterministic environments as a base, but this should be done in a more structured manner — introducing new names, attributes, and clear definitions to distinguish them from the deterministic versions.