{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "13282c70",
   "metadata": {},
   "source": [
    "# Adapting Deterministic Environments into Stochastic Ones\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "150495c9-3c9c-4d4e-8ac2-508f8136769e",
   "metadata": {},
   "source": [
    "### Install"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "67b9bf6a-b916-462f-9a2d-5a5f907b7a54",
   "metadata": {},
   "source": [
    "Uncomment the cells below that match your platform (Colab or Binder):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "62e47599-a094-4805-9f08-7d824da2f156",
   "metadata": {},
   "outputs": [],
   "source": [
    "# !git clone https://github.com/ricgama/maenvs4vrp_beta.git # When using Colab"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8efe0fdb-2534-4e1a-b1a8-a40a2d79451d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# When using Colab\n",
    "# %cd maenvs4vrp_beta/\n",
    "# ! pip install -e .\n",
    "#%cd maenvs4vrp/notebooks/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "36902a54-ffaf-4c13-917d-1b76e5ae7f87",
   "metadata": {},
   "outputs": [],
   "source": [
    "# When using Binder\n",
    "#%cd ../../\n",
    "#! pip install -e . "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6ee52736-3bdc-4b54-bfc1-70978993887e",
   "metadata": {},
   "source": [
    "This tutorial shows how to adapt the library's existing **deterministic environments** into **stochastic variants**. Introducing randomness makes training more realistic and encourages more robust agents, because it mimics real-world variability in factors such as customer demand, travel times, and service durations. \n",
    "For more context on why this matters, see [Opportunities for reinforcement learning in stochastic dynamic vehicle routing](https://www.sciencedirect.com/science/article/abs/pii/S030505482200301X).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "id": "65b92310-3db2-422b-8ac6-fa93cd98bba1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The autoreload extension is already loaded. To reload it, use:\n",
      "  %reload_ext autoreload\n"
     ]
    }
   ],
   "source": [
    "import torch\n",
    "import matplotlib.pyplot as plt\n",
    "from typing import Optional\n",
    "from tensordict import TensorDict\n",
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9d0ead09",
   "metadata": {},
   "source": [
    "## Capacitated Vehicle Routing Problem with Soft Time Windows and Stochastic Travel Times"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "148bbace",
   "metadata": {},
   "source": [
    "In this example, we will base our approach on the [AI4TSP competition environment](https://paulorocosta.gitbook.io/ai4tsp-competition/), where travel times between nodes are affected by random noise. Specifically, the travel time between nodes $i$ and $j$ is modeled as:\n",
    "\n",
    "$$\n",
    "\\text{dist}_{ij} = \\text{dist\\_euclidean}(i, j) + U\\left(0, \\text{dist\\_euclidean}(i, j)\\right)\n",
    "$$\n",
    "\n",
    "Here, $\\text{dist\\_euclidean}(i, j)$ denotes the Euclidean distance between nodes $i$ and $j$, and $U(0, \\text{dist\\_euclidean}(i, j))$ is uniform noise drawn between 0 and that distance. In that environment, violating a time window incurs a penalty of **1 unit**, while exceeding the total available time incurs a penalty of **2 units** (assuming the time-window penalty takes precedence).\n",
    "\n",
    "To reproduce this setting, we modify the Capacitated Vehicle Routing Problem with Soft Time Windows environment by adding the random travel times and the penalties described above.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "23b56fa7",
   "metadata": {},
   "outputs": [],
   "source": [
    "from maenvs4vrp.environments.cvrpstw.env import Environment\n",
    "from maenvs4vrp.environments.cvrpstw.env_agent_selector import AgentSelector\n",
    "from maenvs4vrp.environments.cvrpstw.observations import Observations\n",
    "from maenvs4vrp.environments.cvrpstw.instances_generator import InstanceGenerator\n",
    "from maenvs4vrp.environments.cvrpstw.env_agent_reward import DenseReward"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "2dffdef0",
   "metadata": {},
   "outputs": [],
   "source": [
    "gen = InstanceGenerator()\n",
    "sel = AgentSelector()\n",
    "rew = DenseReward()\n",
    "obs = Observations()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b51c863f",
   "metadata": {},
   "source": [
    "Let's start by updating the `_update_feasibility` method. Because travel times are now random, time-window feasibility can no longer be checked in advance, so the only checks that remain are the capacity constraint and whether a node has already been visited. We comment out everything else (i.e., we remove the hard time-window constraints):"
   ]
  },
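  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "e1a357f1",
   "metadata": {},
   "outputs": [],
   "source": [
    "class Environment(Environment):\n",
    " \n",
    "    def _update_feasibility(self):\n",
    "\n",
    "        \"\"\"\n",
    "        Update the action mask of the current agent.\n",
    "\n",
    "        Only the capacity constraint and the active-nodes mask (nodes not yet\n",
    "        visited) are enforced; the hard time-window checks are commented out\n",
    "        because travel times are stochastic.\n",
    "        \"\"\"\n",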
    "\n",
    "        _mask = self.td_state['nodes']['active_nodes_mask'].clone() \n",
    "\n",
    "        # time windows constraints\n",
    "        #loc = self.td_state['coords'].gather(1, self.td_state['cur_agent']['cur_node'][:,:,None].expand(-1, -1, 2))\n",
    "        #ptime = self.td_state['cur_agent']['cur_time'].clone()\n",
    "        #time2j = torch.pairwise_distance(loc, self.td_state[\"coords\"], eps=0, keepdim = False)        \n",
    "        #if self.n_digits is not None:\n",
    "        #    time2j = torch.floor(self.n_digits * time2j) / self.n_digits\n",
    "        \n",
    "        #arrivej = ptime + time2j\n",
    "        #waitj = torch.clip(self.td_state['tw_low']-arrivej, min=0)\n",
    "        #service_startj = arrivej + waitj\n",
    "        #c1 = service_startj <= self.td_state['tw_high']\n",
    "        #c2 = service_startj + self.td_state['service_time'] + self.td_state['time2depot'] <= self.td_state['end_time'].unsqueeze(-1)\n",
    "        # capacity constraints\n",
    "        c3 = self.td_state['demands'] <= self.td_state['cur_agent']['cur_load']\n",
    "        \n",
    "        _mask = _mask * c3 # * c1 * c2\n",
    "        \n",
    "        # if agent is done, close all services and open depot\n",
    "        agents_done = self.td_state['agents']['active_agents_mask'].gather(1, self.td_state['cur_agent_idx']).clone()\n",
    "        _mask = _mask * agents_done\n",
    "        _mask.scatter_(1, self.td_state['depot_idx'], True)\n",
    "        # update state\n",
    "        self.td_state['cur_agent'].update({'action_mask': _mask}) \n",
    "        self.td_state['agents']['feasible_nodes'].scatter_(1, \n",
    "                                            self.td_state['cur_agent_idx'][:,:,None].expand(-1,-1,self.num_nodes), _mask.unsqueeze(1))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4cbadf5f-da7f-4903-9403-8cfb66b62d1e",
   "metadata": {},
   "source": [
    "Next, we introduce stochasticity into the environment's state update. The distribution and its parameters would normally be defined in the `InstanceGenerator` class; for simplicity, we hardcode the change directly in the environment by overriding the `_update_state` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "082a9196",
   "metadata": {},
   "outputs": [],
   "source": [
    "class Environment(Environment):\n",
    " \n",
    "    def _update_state(self, action):\n",
    "\n",
    "        \"\"\"\n",
    "        Update environment state.\n",
    "\n",
    "        Args:\n",
    "            action(torch.Tensor): Tensor with agent moves.\n",
    "\n",
    "        Returns:\n",
    "            None.\n",
    "        \"\"\"\n",
    "\n",
    "        loc = self.td_state['coords'].gather(1, self.td_state['cur_agent']['cur_node'][:,:,None].expand(-1, -1, 2))\n",
    "        next_loc = self.td_state['coords'].gather(1, action[:,:,None].expand(-1, -1, 2))\n",
    "\n",
    "        ptime = self.td_state['cur_agent']['cur_time'].clone()\n",
    "        time2j = torch.pairwise_distance(loc, next_loc, eps=0, keepdim = False)\n",
    "        \n",
    "        # stochastic travel time: add U(0, time2j) noise, one draw per batch element\n",
    "        # (the noise is created on the same device as time2j)\n",
    "        time2j = time2j + torch.rand(*self.batch_size, device=time2j.device).unsqueeze(1) * time2j\n",
    "        \n",
    "        if self.n_digits is not None:\n",
    "            time2j = torch.floor(self.n_digits * time2j) / self.n_digits\n",
    "        tw = self.td_state['tw_low'].gather(1, action)\n",
    "        service_time = self.td_state['service_time'].gather(1, action)\n",
    "\n",
    "        arrivej = ptime + time2j\n",
    "        waitj = torch.clip(tw-arrivej, min=0)\n",
    "\n",
    "        time_update = arrivej + waitj + service_time\n",
    "        # update agent cur node\n",
    "        self.td_state['cur_agent']['cur_node'] = action\n",
    "        self.td_state['agents']['cur_node'].scatter_(1, self.td_state['cur_agent_idx'], self.td_state['cur_agent']['cur_node'])\n",
    "        # update agent cur time\n",
    "        self.td_state['cur_agent']['cur_time'] = time_update\n",
    "        \n",
    "        agents_done = ~self.td_state['agents']['active_agents_mask'].gather(1, self.td_state['cur_agent_idx']).clone()\n",
    "        # Overwriting `self.early_penalty` and `self.late_penalty` here.\n",
    "        # Note: this should be handled inside the `env` class instead!\n",
    "        self.early_penalty = 0\n",
    "        self.late_penalty = 1\n",
    "        self.end_penalty = 2\n",
    "\n",
    "        tw_low = self.td_state['tw_low'].gather(1, action)\n",
    "        tw_high = self.td_state['tw_high'].gather(1, action)\n",
    "        \n",
    "        penalty = - self.late_penalty * (arrivej > tw_high).float()        \n",
    "        penalty = torch.where(agents_done, penalty-self.end_penalty * (arrivej > self.td_state['end_time'].unsqueeze(1)).float(), \n",
    "                                                             penalty)\n",
    "        \n",
    "        self.td_state['cur_agent']['cur_penalty'] = penalty\n",
    "        self.td_state['cur_agent']['cum_penalty'] += penalty\n",
    "        self.td_state['agents']['cur_penalty'].scatter_(1, self.td_state['cur_agent_idx'], self.td_state['cur_agent']['cur_penalty'])\n",
    "        self.td_state['agents']['cum_penalty'].scatter_(1, self.td_state['cur_agent_idx'], self.td_state['cur_agent']['cum_penalty'])\n",
    "        \n",
    "        # if agent is done set agent time to end_time\n",
    "        self.td_state['cur_agent']['cur_time'] = torch.where(agents_done, self.td_state['end_time'].unsqueeze(-1), \n",
    "                                                             self.td_state['cur_agent']['cur_time'])\n",
    "        self.td_state['agents']['cur_time'].scatter_(1, self.td_state['cur_agent_idx'], self.td_state['cur_agent']['cur_time'])\n",
    "\n",
    "        # update agent cum traveled time\n",
    "        self.td_state['cur_agent']['cur_ttime'] = time2j\n",
    "        self.td_state['cur_agent']['cum_ttime'] += time2j\n",
    "        self.td_state['agents']['cur_ttime'].scatter_(1, self.td_state['cur_agent_idx'], self.td_state['cur_agent']['cur_ttime'])\n",
    "        self.td_state['agents']['cum_ttime'].scatter_(1, self.td_state['cur_agent_idx'], self.td_state['cur_agent']['cum_ttime'])\n",
    "        \n",
    "        # update agent load and node demands\n",
    "        self.td_state['cur_agent']['cur_load'] -= self.td_state['demands'].gather(1, action)\n",
    "        # is agent is done set agent cur_load to 0\n",
    "        self.td_state['cur_agent']['cur_load'] = torch.where( agents_done, 0., \n",
    "                                                             self.td_state['cur_agent']['cur_load'])\n",
    "        \n",
    "        self.td_state['nodes']['cur_demands'].scatter_(1, action, torch.zeros_like(action, dtype = torch.float))\n",
    "        self.td_state['agents']['cur_load'].scatter_(1, self.td_state['cur_agent_idx'], self.td_state['cur_agent']['cur_load'])\n",
    "        # update visited nodes\n",
    "        r = torch.arange(*self.td_state.batch_size, device=self.device)\n",
    "        self.td_state['agents']['visited_nodes'][r, self.td_state['cur_agent_idx'].squeeze(-1), action.squeeze(-1)] = True\n",
    "\n",
    "        # update agent step\n",
    "        self.td_state['cur_agent']['cur_step'] = torch.where(~agents_done, self.td_state['cur_agent']['cur_step']+1, \n",
    "                                                             self.td_state['cur_agent']['cur_step'])\n",
    "        self.td_state['agents']['cur_step'].scatter_(1, self.td_state['cur_agent_idx'], self.td_state['cur_agent']['cur_step'])\n",
    "        self.td_state['cur_node_idx'] = action.clone()\n",
    "\n",
    "        # if all done activate first agent to guarantee batch consistency during agent sampling\n",
    "        self.td_state['agents']['active_agents_mask'][self.td_state['agents']['active_agents_mask'].sum(1).eq(0), 0] = True\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b177e9de-6efe-4e9f-b416-72fcb29e782c",
   "metadata": {},
   "source": [
    "When implementing observations, keep in mind that we can only use **statistics of the distribution** (for example, its mean).  \n",
    "The exact travel times are **not available** when the observation is built, since they are only revealed **after `env.step(td)` is performed**."
   ]
  },
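  {
   "cell_type": "markdown",
   "id": "travel-time-statistics-note",
   "metadata": {},
   "source": [
    "As a short worked example of such statistics (a derivation under the noise model above, not something the library computes for you), write $d_{ij} = \\text{dist\\_euclidean}(i, j)$. Because the noise is uniform on $[0, d_{ij}]$, before any rounding we have:\n",
    "\n",
    "$$\n",
    "\\mathbb{E}[\\text{dist}_{ij}] = d_{ij} + \\frac{d_{ij}}{2} = \\frac{3}{2}\\, d_{ij}, \\qquad \\operatorname{Var}[\\text{dist}_{ij}] = \\frac{d_{ij}^{2}}{12}\n",
    "$$\n",
    "\n",
    "An observation feature could therefore use, for instance, $1.5\\, d_{ij}$ as the expected travel time to node $j$, with standard deviation $d_{ij}/\\sqrt{12} \\approx 0.29\\, d_{ij}$."
   ]
  },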
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "a0fc8d65",
   "metadata": {},
   "outputs": [],
   "source": [
    "import yaml"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "6f62722a",
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_list = yaml.safe_load(\"\"\"\n",
    "    nodes_static:\n",
    "        x_coordinate:\n",
    "            feat: x_coordinate\n",
    "            norm: min_max\n",
    "        y_coordinate: \n",
    "            feat: y_coordinate\n",
    "            norm: \n",
    "        tw_low:\n",
    "            feat: tw_low\n",
    "            norm: \n",
    "        tw_high:\n",
    "            feat: tw_high\n",
    "            norm: \n",
    "\n",
    "    nodes_dynamic:\n",
    "        - time2open_div_end_time\n",
    "        - time2close_div_end_time\n",
    "\n",
    "    agent:\n",
    "        - x_coordinate\n",
    "        - y_coordinate\n",
    "        - frac_current_time\n",
    "\n",
    "\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "id": "54e8d248",
   "metadata": {},
   "outputs": [],
   "source": [
    "obs = Observations(feature_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9cf46792",
   "metadata": {},
   "source": [
    "Let's roll out an episode and inspect the `reward` and `penalty` at every step:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "f1e6aa62-1c16-41fc-9845-a327a4125a44",
   "metadata": {},
   "outputs": [],
   "source": [
    "env = Environment(instance_generator_object=gen,  \n",
    "                  obs_builder_object=obs,\n",
    "                  agent_selector_object=sel,\n",
    "                  reward_evaluator=rew,\n",
    "                  batch_size= 4,\n",
    "                  seed=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "id": "8da8a4b0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "env step number:1, reward: tensor([[-1.5527],\n",
      "        [-0.8674],\n",
      "        [-0.7198],\n",
      "        [-0.2406]]), penalty: tensor([[-0.],\n",
      "        [-0.],\n",
      "        [-0.],\n",
      "        [-0.]])\n",
      "tensor([False, False, False, False])\n",
      "env step number:2, reward: tensor([[-0.5187],\n",
      "        [-0.7278],\n",
      "        [-0.5661],\n",
      "        [-0.2973]]), penalty: tensor([[-0.],\n",
      "        [-1.],\n",
      "        [-0.],\n",
      "        [-0.]])\n",
      "tensor([False, False, False, False])\n",
      "env step number:3, reward: tensor([[-0.4253],\n",
      "        [-0.4610],\n",
      "        [-1.8126],\n",
      "        [-0.6411]]), penalty: tensor([[-1.],\n",
      "        [-0.],\n",
      "        [-1.],\n",
      "        [-1.]])\n",
      "tensor([False, False, False, False])\n",
      "env step number:4, reward: tensor([[-0.6622],\n",
      "        [-0.3110],\n",
      "        [-0.3367],\n",
      "        [-0.4018]]), penalty: tensor([[-1.],\n",
      "        [-1.],\n",
      "        [-1.],\n",
      "        [-1.]])\n",
      "tensor([False, False, False, False])\n",
      "env step number:5, reward: tensor([[-0.6630],\n",
      "        [-0.6971],\n",
      "        [-0.7571],\n",
      "        [-0.2861]]), penalty: tensor([[-1.],\n",
      "        [-1.],\n",
      "        [-1.],\n",
      "        [-1.]])\n",
      "tensor([False, False, False, False])\n",
      "env step number:6, reward: tensor([[-0.7132],\n",
      "        [-0.4512],\n",
      "        [-0.5210],\n",
      "        [-0.0133]]), penalty: tensor([[-3.],\n",
      "        [-1.],\n",
      "        [-1.],\n",
      "        [-0.]])\n",
      "tensor([False, False, False, False])\n",
      "env step number:7, reward: tensor([[-0.7232],\n",
      "        [-0.3410],\n",
      "        [-0.3882],\n",
      "        [-0.6682]]), penalty: tensor([[-0.],\n",
      "        [-3.],\n",
      "        [-1.],\n",
      "        [-1.]])\n",
      "tensor([False, False, False, False])\n",
      "env step number:8, reward: tensor([[-0.6049],\n",
      "        [-1.1623],\n",
      "        [-1.0539],\n",
      "        [-1.1353]]), penalty: tensor([[-1.],\n",
      "        [-0.],\n",
      "        [-1.],\n",
      "        [-1.]])\n",
      "tensor([False, False, False, False])\n",
      "env step number:9, reward: tensor([[-1.0882],\n",
      "        [-0.5806],\n",
      "        [-1.0106],\n",
      "        [-0.3545]]), penalty: tensor([[-0.],\n",
      "        [-0.],\n",
      "        [-1.],\n",
      "        [-1.]])\n",
      "tensor([False, False, False, False])\n",
      "env step number:10, reward: tensor([[-0.3974],\n",
      "        [-0.7468],\n",
      "        [-0.7105],\n",
      "        [-0.4945]]), penalty: tensor([[-1.],\n",
      "        [-1.],\n",
      "        [-3.],\n",
      "        [-3.]])\n",
      "tensor([False, False, False, False])\n",
      "env step number:11, reward: tensor([[-0.5061],\n",
      "        [-0.7505],\n",
      "        [-0.3817],\n",
      "        [-0.8687]]), penalty: tensor([[-1.],\n",
      "        [-1.],\n",
      "        [-0.],\n",
      "        [-0.]])\n",
      "tensor([False, False, False, False])\n",
      "env step number:12, reward: tensor([[-0.4072],\n",
      "        [-0.9129],\n",
      "        [-1.0107],\n",
      "        [-0.6193]]), penalty: tensor([[-1.],\n",
      "        [-1.],\n",
      "        [-0.],\n",
      "        [-0.]])\n",
      "tensor([False, False, False, False])\n",
      "env step number:13, reward: tensor([[-1.1827],\n",
      "        [-0.8541],\n",
      "        [-1.2194],\n",
      "        [-1.3943]]), penalty: tensor([[-1.],\n",
      "        [-1.],\n",
      "        [-1.],\n",
      "        [-1.]])\n",
      "tensor([False, False, False, False])\n",
      "env step number:14, reward: tensor([[-0.5243],\n",
      "        [-0.8865],\n",
      "        [-0.6411],\n",
      "        [-1.2067]]), penalty: tensor([[-1.],\n",
      "        [-3.],\n",
      "        [-1.],\n",
      "        [-1.]])\n",
      "tensor([False, False, False, False])\n",
      "env step number:15, reward: tensor([[-1.0377],\n",
      "        [-0.4616],\n",
      "        [-0.3696],\n",
      "        [-0.4966]]), penalty: tensor([[-3.],\n",
      "        [-0.],\n",
      "        [-3.],\n",
      "        [-1.]])\n",
      "tensor([False, False, False, False])\n",
      "env step number:16, reward: tensor([[-1.2262],\n",
      "        [-0.4578],\n",
      "        [-0.8096],\n",
      "        [-1.5891]]), penalty: tensor([[-0.],\n",
      "        [-0.],\n",
      "        [-0.],\n",
      "        [-3.]])\n",
      "tensor([False, False, False, False])\n",
      "env step number:17, reward: tensor([[-0.9060],\n",
      "        [-0.6889],\n",
      "        [-0.4772],\n",
      "        [-1.0585]]), penalty: tensor([[-1.],\n",
      "        [-0.],\n",
      "        [-0.],\n",
      "        [-0.]])\n",
      "tensor([False, False, False, False])\n",
      "env step number:18, reward: tensor([[-0.4192],\n",
      "        [-0.2531],\n",
      "        [-0.8415],\n",
      "        [-0.8915]]), penalty: tensor([[-1.],\n",
      "        [-0.],\n",
      "        [-1.],\n",
      "        [-0.]])\n",
      "tensor([False, False, False, False])\n",
      "env step number:19, reward: tensor([[-1.4608],\n",
      "        [-0.5816],\n",
      "        [-0.5594],\n",
      "        [-0.8171]]), penalty: tensor([[-3.],\n",
      "        [-0.],\n",
      "        [-3.],\n",
      "        [-0.]])\n",
      "tensor([False, False, False, False])\n",
      "env step number:20, reward: tensor([[-0.0873],\n",
      "        [-0.2623],\n",
      "        [-0.5152],\n",
      "        [-0.8023]]), penalty: tensor([[-0.],\n",
      "        [-0.],\n",
      "        [-0.],\n",
      "        [-0.]])\n",
      "tensor([False, False, False, False])\n",
      "env step number:21, reward: tensor([[-0.2555],\n",
      "        [-0.2408],\n",
      "        [-0.9307],\n",
      "        [-0.8137]]), penalty: tensor([[ -1.0000],\n",
      "        [-14.4489],\n",
      "        [ -1.0000],\n",
      "        [ -1.0000]])\n",
      "tensor([False,  True, False, False])\n",
      "env step number:22, reward: tensor([[-0.3009],\n",
      "        [-0.0000],\n",
      "        [-0.4686],\n",
      "        [-0.6228]]), penalty: tensor([[-0.],\n",
      "        [-0.],\n",
      "        [-1.],\n",
      "        [-3.]])\n",
      "tensor([False,  True, False, False])\n",
      "env step number:23, reward: tensor([[-0.0000],\n",
      "        [-0.0000],\n",
      "        [-0.9884],\n",
      "        [-0.8501]]), penalty: tensor([[-2.9564],\n",
      "        [-0.0000],\n",
      "        [-3.0000],\n",
      "        [-0.0000]])\n",
      "tensor([ True,  True, False, False])\n",
      "env step number:24, reward: tensor([[-0.0000],\n",
      "        [-0.0000],\n",
      "        [-0.0000],\n",
      "        [-0.9275]]), penalty: tensor([[-0.],\n",
      "        [-0.],\n",
      "        [-0.],\n",
      "        [-0.]])\n",
      "tensor([True, True, True, True])\n"
     ]
    }
   ],
   "source": [
    "td = env.reset(num_agents=5, num_nodes=20)\n",
    "while not td[\"done\"].all():  \n",
    "    td = env.sample_action(td) \n",
    "    td = env.step(td)\n",
    "    step = env.env_nsteps\n",
    "    reward = td['reward']\n",
    "    penalty = td['penalty']\n",
    "    print(f'env step number:{step}, reward: {reward}, penalty: {penalty}')\n",
    "    print(td[\"done\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "80c2d747-7252-4a4e-8c04-185d1fbab20a",
   "metadata": {},
   "source": [
    "## Team Orienteering Problem with Time Windows and Stochastic Profits (TOPTWSP)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a4634a70-f779-4417-8dba-8d14f407a507",
   "metadata": {},
   "source": [
    "For this example, we assume that the profit associated with each location $i$ is a **random variable** following a **Poisson distribution** with mean $\\lambda_i$. Each time the environment is reset, the profits are resampled from this distribution, adding another layer of **stochasticity** to the decision-making process."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "id": "2097a205-8aeb-42a1-9fb8-7ebb4e467b8d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from maenvs4vrp.environments.toptw.env import Environment\n",
    "from maenvs4vrp.environments.toptw.env_agent_selector import AgentSelector\n",
    "from maenvs4vrp.environments.toptw.observations import Observations\n",
    "from maenvs4vrp.environments.toptw.instances_generator import InstanceGenerator\n",
    "from maenvs4vrp.environments.toptw.env_agent_reward import DenseReward"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30193794-782d-4871-ae7a-e550f86263fa",
   "metadata": {},
   "source": [
    "Before we begin, let's initialize the environment for the **Team Orienteering Problem with Time Windows** and reset it with the following arguments:\n",
    "\n",
    "- `sample_type='augment'`\n",
    "- `n_augment=4`\n",
    "\n",
    "With `batch_size=8`, this generates **two instances of the problem**, each with **4 copies**. Let's see:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "id": "1b07f250-0190-4b8f-95c0-39adecf6344f",
   "metadata": {},
   "outputs": [],
   "source": [
    "gen = InstanceGenerator(batch_size = 8)\n",
    "obs = Observations()\n",
    "sel = AgentSelector()\n",
    "rew = DenseReward()\n",
    "\n",
    "env = Environment(instance_generator_object=gen,  \n",
    "                  obs_builder_object=obs,\n",
    "                  agent_selector_object=sel,\n",
    "                  reward_evaluator=rew,\n",
    "                  seed=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "id": "2f204000-f54a-4482-b385-47f7e624549c",
   "metadata": {},
   "outputs": [],
   "source": [
    "td = env.reset(batch_size = 8, \n",
    "               num_agents=2, \n",
    "               num_nodes=9,\n",
    "               profits = 'uniform',\n",
    "               sample_type='augment',\n",
    "               n_augment=4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "id": "4ded9773-4858-4fbb-bd76-a4bda1eac6ac",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[0., 6., 1., 6., 2., 0., 3., 9., 9.],\n",
       "        [0., 2., 8., 5., 9., 0., 5., 9., 8.],\n",
       "        [0., 6., 1., 6., 2., 0., 3., 9., 9.],\n",
       "        [0., 2., 8., 5., 9., 0., 5., 9., 8.],\n",
       "        [0., 6., 1., 6., 2., 0., 3., 9., 9.],\n",
       "        [0., 2., 8., 5., 9., 0., 5., 9., 8.],\n",
       "        [0., 6., 1., 6., 2., 0., 3., 9., 9.],\n",
       "        [0., 2., 8., 5., 9., 0., 5., 9., 8.]])"
      ]
     },
     "execution_count": 89,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "env.td_state['profits']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a6fdd539-753e-4420-9483-ccc0337d0445",
   "metadata": {},
   "source": [
    "As we can see, in the deterministic problem every copy of the same instance has identical node profits (the rows alternate between the two instances)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "42829095-2055-4a49-b0b5-dd7d944402d8",
   "metadata": {},
   "source": [
    "#### Modifying the Environment for Random Profit Sampling\n",
    "\n",
    "Now let's change the environment so that, for the **same instance**, each copy gets its own **random sample of the node profits**.  \n",
    "\n",
    "To achieve this, we **override `augment_generate_instance` in a subclass of `InstanceGenerator`**. Specifically:\n",
    "\n",
    "- We treat the `profits` values created by the `random_generate_instance` method as the **mean values** of a **Poisson distribution**.\n",
    "- These means are then used in the `augment_generate_instance` method to sample the realized profits.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "id": "98e53828-4631-489e-a729-9b3f2adb6b8e",
   "metadata": {},
   "outputs": [],
   "source": [
    "class InstanceGenerator(InstanceGenerator):\n",
    "\n",
    "\n",
    "    def augment_generate_instance(self, num_agents:int=20, \n",
    "                                 num_nodes:int=100, \n",
    "                                 service_times:float=0.2, \n",
    "                                 profits:str='distance',\n",
    "                                 batch_size: Optional[torch.Size] = None,\n",
    "                                 n_augment:int = 2,\n",
    "                                 seed:Optional[int]=None)-> TensorDict:\n",
    "        \"\"\"\n",
    "        Generate augmented instance.\n",
    "\n",
    "        Args:\n",
    "            num_agents(int): Total number of agents. Defaults to 20.\n",
    "            num_nodes(int):  Total number of nodes. Defaults to 100.\n",
    "            service_times(float): Service time in the nodes. Defaults to 0.2.\n",
    "            profits(str): Profit generation scheme; the generated values are used as Poisson means. Defaults to 'distance'.\n",
    "            batch_size(torch.Size, optional): Batch size. Defaults to None.\n",
    "            n_augment(int): Number of augmented copies per instance. Defaults to 2.\n",
    "            seed(int, optional): Random number generator seed. Defaults to None.\n",
    "\n",
    "        Returns:\n",
    "            TensorDict: Instance data.\n",
    "        \"\"\"\n",
    "        if seed is not None:\n",
    "            self._set_seed(seed)\n",
    "\n",
    "        if num_agents is not None:\n",
    "            assert num_agents>0, \"number of agents must be greater than 0!\"\n",
    "            self.max_num_agents = num_agents\n",
    "        if num_nodes is not None:\n",
    "            assert num_nodes>0, \"number of services must be greater than 0!\"\n",
    "            self.max_num_nodes = num_nodes\n",
    "        if service_times is not None:\n",
    "            self.service_times = service_times\n",
    "\n",
    "        if batch_size is not None:\n",
    "            batch_size = [batch_size] if isinstance(batch_size, int) else batch_size\n",
    "            self.batch_size = torch.Size(batch_size)\n",
    "\n",
    "        assert self.batch_size.numel()%n_augment == 0, f\"batch_size must be divisible by n_augment\"\n",
    "        s_batch_size = self.batch_size.numel() // n_augment\n",
    "        self.s_batch_size = torch.Size([s_batch_size])\n",
    "        \n",
    "        instance_info_s = self.random_generate_instance(num_agents=num_agents, \n",
    "                                                     num_nodes=num_nodes, \n",
    "                                                     profits=profits, \n",
    "                                                     service_times=service_times,\n",
    "                                                     batch_size = self.s_batch_size,\n",
    "                                                     seed=seed)\n",
    "        \n",
    "        self.batch_size = torch.Size(batch_size)\n",
    "\n",
    "        instance = TensorDict({}, batch_size=self.batch_size, device=self.device)\n",
    "        for key in instance_info_s['data'].keys():\n",
    "            if len(instance_info_s['data'][key].shape) == 3:\n",
    "                instance[key] = instance_info_s['data'][key].repeat(n_augment, 1, 1)\n",
    "            elif len(instance_info_s['data'][key].shape) == 2:\n",
    "                instance[key] = instance_info_s['data'][key].repeat(n_augment, 1)\n",
    "            elif len(instance_info_s['data'][key].shape) == 1:\n",
    "                instance[key] = instance_info_s['data'][key].repeat(n_augment)\n",
    "\n",
    "        # Here we sample the profits:\n",
    "        instance['profits'] = torch.poisson(instance['profits'])\n",
    "\n",
    "        instance_info = {'name':'random_instance',\n",
    "                         'num_nodes': self.max_num_nodes,\n",
    "                         'num_agents':self.max_num_agents,\n",
    "                         'data':instance}\n",
    "        return instance_info"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "id": "f13e199c-400d-4cb3-94fe-f43b447a1d7d",
   "metadata": {},
   "outputs": [],
   "source": [
    "stoch_gen = InstanceGenerator(batch_size = 8)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "id": "fc18e36b-dfd4-4f52-9fd8-eb32414d9a4f",
   "metadata": {},
   "outputs": [],
   "source": [
    "stoch_env = Environment(instance_generator_object=stoch_gen,  \n",
    "                  obs_builder_object=obs,\n",
    "                  agent_selector_object=sel,\n",
    "                  reward_evaluator=rew,\n",
    "                  seed=0)\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "66aee634-1580-410a-aab8-1632db006674",
   "metadata": {},
   "source": [
    "restarting the environment we get:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "id": "98033b27-e5db-4e49-bb4f-790dcee34e10",
   "metadata": {},
   "outputs": [],
   "source": [
    "td = stoch_env.reset(batch_size = 8, \n",
    "               num_agents=2, \n",
    "               num_nodes=9,\n",
    "               profits = 'uniform',\n",
    "               sample_type='augment',\n",
    "               n_augment=4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "id": "99d63681-4701-4227-91f8-859a4d2e2c58",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[ 0.,  7.,  2.,  2.,  2.,  0.,  3.,  8.,  6.],\n",
       "        [ 0.,  1.,  9.,  3., 15.,  0.,  3.,  8.,  4.],\n",
       "        [ 0.,  5.,  1.,  5.,  1.,  0.,  2., 17., 10.],\n",
       "        [ 0.,  2.,  8.,  3., 13.,  0.,  4.,  7.,  7.],\n",
       "        [ 0.,  9.,  2.,  5.,  1.,  0.,  5., 13.,  8.],\n",
       "        [ 0.,  2., 11.,  6.,  5.,  0.,  6., 11.,  8.],\n",
       "        [ 0., 10.,  2.,  5.,  2.,  0.,  3.,  6.,  7.],\n",
       "        [ 0.,  2.,  9.,  9.,  4.,  0.,  9.,  9.,  9.]])"
      ]
     },
     "execution_count": 94,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "stoch_env.td_state['profits']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d804d25a-60f5-47d2-a82c-0a0284dd32c1",
   "metadata": {},
   "source": [
    "As we can see, for each instance implementation, the **profit values differ** as expected.  \n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d2f74db8-e3e6-471f-bd21-f85887f4ae34",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "It is important to note that in these two demonstrative examples, our goal is simply to **illustrate how deterministic environments can be modified to include stochastic elements**.  \n",
    "\n",
    "When creating **stochastic environments from scratch**, we may use deterministic environments as a base, but this should be done in a **more structured manner** — introducing new names, attributes, and clear definitions to distinguish them from the deterministic versions."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "maenvs4vrp",
   "language": "python",
   "name": "maenvs4vrp"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.13"
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}