reward_nets module¶
Constructs deep network reward models. Code adapted from https://github.com/HumanCompatibleAI/imitation.git
- class reward_nets.BasicPotentialMLP(observation_space: gym.spaces.space.Space, hid_sizes: Iterable[int], **kwargs)¶
Bases: torch.nn.modules.module.Module
Simple implementation of a potential using an MLP.
- forward(state: torch.Tensor) torch.Tensor ¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note: Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- training: bool¶
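A minimal usage sketch for BasicPotentialMLP. It assumes the module is importable as reward_nets and that observations are flat float vectors; the observation space and layer sizes below are illustrative, not prescribed by the module.

import gym
import numpy as np
import torch

from reward_nets import BasicPotentialMLP

obs_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(4,))  # illustrative space
potential = BasicPotentialMLP(observation_space=obs_space, hid_sizes=[32, 32])

# A batch of two observations; the potential maps each state to a scalar value.
states = torch.as_tensor(np.stack([obs_space.sample() for _ in range(2)]))
values = potential(states)  # expected shape: (2,)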
- class reward_nets.BasicRewardNet(observation_space: gym.spaces.space.Space, action_space: gym.spaces.space.Space, use_state: bool = True, use_action: bool = True, use_next_state: bool = False, use_done: bool = False, **kwargs)¶
Bases: reward_nets.RewardNet
MLP that takes as input the state, action, next state and done flag.
These inputs are flattened and then concatenated to one another. Each input can be enabled or disabled via the use_* constructor keyword arguments.
- forward(state, action, next_state, done)¶
Compute rewards for a batch of transitions and keep gradients.
- training: bool¶
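A sketch of constructing a BasicRewardNet and scoring a batch of transitions without gradients via the inherited predict method; the spaces and batch below are illustrative assumptions.

import gym
import numpy as np

from reward_nets import BasicRewardNet

obs_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(4,))
act_space = gym.spaces.Discrete(2)
net = BasicRewardNet(obs_space, act_space, use_state=True, use_action=True)

batch = 8
state = np.stack([obs_space.sample() for _ in range(batch)])
action = np.array([act_space.sample() for _ in range(batch)])
next_state = np.stack([obs_space.sample() for _ in range(batch)])
done = np.zeros(batch, dtype=bool)

rewards = net.predict(state, action, next_state, done)  # NumPy array of shape (batch,)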
- class reward_nets.BasicShapedRewardNet(observation_space: gym.spaces.space.Space, action_space: gym.spaces.space.Space, *, reward_hid_sizes: Sequence[int] = (32,), potential_hid_sizes: Sequence[int] = (32, 32), use_state: bool = True, use_action: bool = True, use_next_state: bool = True, use_done: bool = True, discount_factor: float = 0.99, **kwargs)¶
Bases: reward_nets.ShapedRewardNet
Shaped reward net based on MLPs.
This is just a very simple convenience class for instantiating a BasicRewardNet and a BasicPotentialMLP and wrapping them inside a ShapedRewardNet. It mainly exists for backwards compatibility after https://github.com/HumanCompatibleAI/imitation/pull/311, to keep the scripts working.
- TODO(ejnnr): if we ever modify AIRL so that it takes in a RewardNet instance directly (instead of a class and kwargs), and we instead instantiate the RewardNet inside the scripts, then it probably makes sense to get rid of this class.
- training: bool¶
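A construction sketch for BasicShapedRewardNet using the documented default hidden sizes and discount factor; the spaces are illustrative assumptions.

import gym

from reward_nets import BasicShapedRewardNet

obs_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(4,))
act_space = gym.spaces.Discrete(2)

shaped_net = BasicShapedRewardNet(
    obs_space,
    act_space,
    reward_hid_sizes=(32,),        # MLP for the base reward
    potential_hid_sizes=(32, 32),  # MLP for the shaping potential
    discount_factor=0.99,
)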
- class reward_nets.NormalizedRewardNet(base: reward_nets.RewardNet, normalize_output_layer: Type[torch.nn.modules.module.Module])¶
Bases: reward_nets.RewardNetWrapper
A reward net that normalizes the output of its base network.
- forward(state: torch.Tensor, action: torch.Tensor, next_state: torch.Tensor, done: torch.Tensor)¶
Compute rewards for a batch of transitions and keep gradients.
- predict_processed(state: numpy.ndarray, action: numpy.ndarray, next_state: numpy.ndarray, done: numpy.ndarray, update_stats: bool = True) numpy.ndarray ¶
Compute normalized rewards for a batch of transitions without gradients.
- Args:
state: Current states of shape (batch_size,) + state_shape.
action: Actions of shape (batch_size,) + action_shape.
next_state: Successor states of shape (batch_size,) + state_shape.
done: End-of-episode (terminal state) indicator of shape (batch_size,).
update_stats: Whether to update the running stats of the normalization layer.
- Returns:
Computed normalized rewards of shape (batch_size,).
- training: bool¶
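A sketch of wrapping a base network so that its outputs are normalized. The choice of normalize_output_layer is an assumption: upstream imitation typically passes imitation.util.networks.RunningNorm, which may or may not ship with this adapted module.

import gym
import numpy as np
from imitation.util import networks  # assumption: the upstream helper is available

from reward_nets import BasicRewardNet, NormalizedRewardNet

obs_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(4,))
act_space = gym.spaces.Discrete(2)
base_net = BasicRewardNet(obs_space, act_space)

norm_net = NormalizedRewardNet(base=base_net, normalize_output_layer=networks.RunningNorm)

batch = 8
state = np.stack([obs_space.sample() for _ in range(batch)])
action = np.array([act_space.sample() for _ in range(batch)])
next_state = np.stack([obs_space.sample() for _ in range(batch)])
done = np.zeros(batch, dtype=bool)

# update_stats=True keeps the running normalization statistics current (e.g. during
# training); pass update_stats=False to freeze them for evaluation.
rewards = norm_net.predict_processed(state, action, next_state, done, update_stats=True)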
- class reward_nets.RewardNet(observation_space: gym.spaces.space.Space, action_space: gym.spaces.space.Space, normalize_images: bool = True)¶
Bases: torch.nn.modules.module.Module, abc.ABC
Minimal abstract reward network.
Only requires the implementation of a forward pass (calculating rewards given a batch of states, actions, next states and dones).
- property device: torch.device¶
Heuristic to determine which device this module is on.
- property dtype: torch.dtype¶
Heuristic to determine dtype of module.
- abstract forward(state: torch.Tensor, action: torch.Tensor, next_state: torch.Tensor, done: torch.Tensor) torch.Tensor ¶
Compute rewards for a batch of transitions and keep gradients.
- predict(state: numpy.ndarray, action: numpy.ndarray, next_state: numpy.ndarray, done: numpy.ndarray) numpy.ndarray ¶
Compute rewards for a batch of transitions without gradients.
Converts the th.Tensor rewards from predict_th to NumPy arrays.
- Args:
state: Current states of shape (batch_size,) + state_shape.
action: Actions of shape (batch_size,) + action_shape.
next_state: Successor states of shape (batch_size,) + state_shape.
done: End-of-episode (terminal state) indicator of shape (batch_size,).
- Returns:
Computed rewards of shape (batch_size,).
- predict_processed(state: numpy.ndarray, action: numpy.ndarray, next_state: numpy.ndarray, done: numpy.ndarray) numpy.ndarray ¶
Compute the processed rewards for a batch of transitions without gradients.
Defaults to calling predict. Subclasses can override this to normalize or otherwise modify the rewards in ways that may help RL training or other applications of the reward function.
- Args:
state: Current states of shape (batch_size,) + state_shape.
action: Actions of shape (batch_size,) + action_shape.
next_state: Successor states of shape (batch_size,) + state_shape.
done: End-of-episode (terminal state) indicator of shape (batch_size,).
- Returns:
Computed processed rewards of shape (batch_size,).
- predict_th(state: numpy.ndarray, action: numpy.ndarray, next_state: numpy.ndarray, done: numpy.ndarray) torch.Tensor ¶
Compute th.Tensor rewards for a batch of transitions without gradients.
Preprocesses the inputs and outputs th.Tensor rewards.
- Args:
state: Current states of shape (batch_size,) + state_shape.
action: Actions of shape (batch_size,) + action_shape.
next_state: Successor states of shape (batch_size,) + state_shape.
done: End-of-episode (terminal state) indicator of shape (batch_size,).
- Returns:
Computed th.Tensor rewards of shape (batch_size,).
- preprocess(state: numpy.ndarray, action: numpy.ndarray, next_state: numpy.ndarray, done: numpy.ndarray) Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor] ¶
Preprocess a batch of input transitions and convert it to PyTorch tensors.
The output of this function is suitable for its forward pass, so a typical usage would be model(*model.preprocess(transitions)).
- Args:
state: The observation input. Its shape is (batch_size,) + observation_space.shape.
action: The action input. Its shape is (batch_size,) + action_space.shape. Its batch dimension is expected to match that of state.
next_state: The successor observation input. Its shape is (batch_size,) + observation_space.shape.
done: Whether the episode has terminated. Its shape is (batch_size,).
- Returns:
Preprocessed transitions: a Tuple of tensors containing observations, actions, next observations and dones.
- training: bool¶
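A sketch of the intended subclassing and calling pattern: only forward is abstract, and the predict/preprocess helpers are inherited. The constant-reward body and the spaces are placeholders, not part of the module.

import gym
import numpy as np
import torch

from reward_nets import RewardNet

class ConstantRewardNet(RewardNet):
    """Toy subclass that assigns the same reward to every transition."""

    def forward(self, state, action, next_state, done):
        # One reward per transition in the batch, shape (batch_size,).
        return torch.zeros(state.shape[0], device=state.device)

obs_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(3,))
act_space = gym.spaces.Discrete(2)
net = ConstantRewardNet(obs_space, act_space)

state = np.stack([obs_space.sample() for _ in range(4)])
action = np.array([act_space.sample() for _ in range(4)])
next_state = np.stack([obs_space.sample() for _ in range(4)])
done = np.zeros(4, dtype=bool)

# predict() preprocesses, evaluates forward without gradients and returns NumPy rewards.
rewards = net.predict(state, action, next_state, done)  # shape (4,)

# Equivalently, preprocess explicitly and call the module directly (keeps gradients).
rewards_th = net(*net.preprocess(state, action, next_state, done))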
- class reward_nets.RewardNetWrapper(base: reward_nets.RewardNet)¶
Bases: reward_nets.RewardNet
An abstract RewardNet wrapping a base network.
A concrete implementation of the forward method is needed. Note: by default, predict, predict_th, preprocess, predict_processed, device and all the PyTorch nn.Module methods will be inherited from RewardNet and not passed through to the base network. If any of these methods is overridden in the base RewardNet, this will not affect RewardNetWrapper.
- property base: reward_nets.RewardNet¶
- training: bool¶
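A minimal wrapper subclass sketch: a concrete forward is the only requirement, and the wrapped network is reachable through the base property. The scaling behaviour here is purely illustrative.

import torch

from reward_nets import RewardNetWrapper

class ScaledRewardNet(RewardNetWrapper):
    """Toy wrapper that halves whatever its base network predicts."""

    def forward(self, state, action, next_state, done):
        return 0.5 * self.base(state, action, next_state, done)

# scaled_net = ScaledRewardNet(some_reward_net)  # some_reward_net: any concrete RewardNet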
- class reward_nets.ShapedRewardNet(base: reward_nets.RewardNet, potential: Callable[[torch.Tensor], torch.Tensor], discount_factor: float)¶
Bases: reward_nets.RewardNetWrapper
A RewardNet consisting of a base network and a potential shaping.
- forward(state: torch.Tensor, action: torch.Tensor, next_state: torch.Tensor, done: torch.Tensor)¶
Compute rewards for a batch of transitions and keep gradients.
- training: bool¶
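A composition sketch mirroring what BasicShapedRewardNet does internally: a base RewardNet plus a callable potential over states, combined with a discount factor. The spaces and hidden sizes are illustrative assumptions.

import gym

from reward_nets import BasicPotentialMLP, BasicRewardNet, ShapedRewardNet

obs_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(4,))
act_space = gym.spaces.Discrete(2)

base_net = BasicRewardNet(obs_space, act_space)
potential = BasicPotentialMLP(observation_space=obs_space, hid_sizes=[32, 32])

# Potential-based shaping: the base reward is adjusted by the discounted change
# in potential between next_state and state.
shaped = ShapedRewardNet(base=base_net, potential=potential, discount_factor=0.99)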