reward_nets module

Constructs deep network reward models. Code adapted from https://github.com/HumanCompatibleAI/imitation.git

class reward_nets.BasicPotentialMLP(observation_space: gym.spaces.space.Space, hid_sizes: Iterable[int], **kwargs)

Bases: torch.nn.modules.module.Module

Simple implementation of a potential using an MLP.
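
A minimal usage sketch (hedged: the observation space, hidden sizes, and batch below are illustrative assumptions, not part of the module):

    # Hypothetical sketch: a potential over a 4-dimensional Box observation space.
    import gym
    import torch as th

    from reward_nets import BasicPotentialMLP

    obs_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(4,))
    potential = BasicPotentialMLP(obs_space, hid_sizes=[32, 32])

    states = th.rand(8, 4)   # batch of preprocessed observations
    phi = potential(states)  # potential values, expected shape (8,)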

forward(state: torch.Tensor) → torch.Tensor

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool
class reward_nets.BasicRewardNet(observation_space: gym.spaces.space.Space, action_space: gym.spaces.space.Space, use_state: bool = True, use_action: bool = True, use_next_state: bool = False, use_done: bool = False, **kwargs)

Bases: reward_nets.RewardNet

MLP that takes as input the state, action, next state and done flag.

These inputs are flattened and then concatenated with one another. Each input can be enabled or disabled by the use_* constructor keyword arguments.
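
For illustration, a hedged sketch of constructing and querying a BasicRewardNet over state and action only; the spaces and batch data are assumptions made for this example:

    # Hypothetical sketch: MLP reward over state and action (next_state and done disabled).
    import gym
    import numpy as np

    from reward_nets import BasicRewardNet

    obs_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(4,))
    act_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(2,))
    net = BasicRewardNet(obs_space, act_space, use_state=True, use_action=True)

    batch = 8
    state = np.random.uniform(-1.0, 1.0, size=(batch, 4)).astype(np.float32)
    action = np.random.uniform(-1.0, 1.0, size=(batch, 2)).astype(np.float32)
    next_state = np.random.uniform(-1.0, 1.0, size=(batch, 4)).astype(np.float32)
    done = np.zeros(batch, dtype=bool)

    rewards = net.predict(state, action, next_state, done)  # numpy array of shape (batch,)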

forward(state, action, next_state, done)

Compute rewards for a batch of transitions and keep gradients.

training: bool
class reward_nets.BasicShapedRewardNet(observation_space: gym.spaces.space.Space, action_space: gym.spaces.space.Space, *, reward_hid_sizes: Sequence[int] = (32,), potential_hid_sizes: Sequence[int] = (32, 32), use_state: bool = True, use_action: bool = True, use_next_state: bool = True, use_done: bool = True, discount_factor: float = 0.99, **kwargs)

Bases: reward_nets.ShapedRewardNet

Shaped reward net based on MLPs.

This is just a very simple convenience class for instantiating a BasicRewardNet and a BasicPotentialMLP and wrapping them inside a ShapedRewardNet. Mainly exists for backwards compatibility after https://github.com/HumanCompatibleAI/imitation/pull/311 to keep the scripts working.

TODO(ejnnr): if we ever modify AIRL so that it takes in a RewardNet instance directly (instead of a class and kwargs) and instead instantiate the RewardNet inside the scripts, then it probably makes sense to get rid of this class.
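
A hedged construction sketch (the spaces are illustrative; the arguments mirror the signature above):

    # Hypothetical sketch: shaped MLP reward net with the default discount factor.
    import gym

    from reward_nets import BasicShapedRewardNet

    obs_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(4,))
    act_space = gym.spaces.Discrete(3)

    shaped_net = BasicShapedRewardNet(
        obs_space,
        act_space,
        reward_hid_sizes=(32,),
        potential_hid_sizes=(32, 32),
        discount_factor=0.99,
    )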

training: bool
class reward_nets.NormalizedRewardNet(base: reward_nets.RewardNet, normalize_output_layer: Type[torch.nn.modules.module.Module])

Bases: reward_nets.RewardNetWrapper

A reward net that normalizes the output of its base network.

forward(state: torch.Tensor, action: torch.Tensor, next_state: torch.Tensor, done: torch.Tensor)

Compute rewards for a batch of transitions and keep gradients.

predict_processed(state: numpy.ndarray, action: numpy.ndarray, next_state: numpy.ndarray, done: numpy.ndarray, update_stats: bool = True) → numpy.ndarray

Compute normalized rewards for a batch of transitions without gradients.

Args:

state: Current states of shape (batch_size,) + state_shape.
action: Actions of shape (batch_size,) + action_shape.
next_state: Successor states of shape (batch_size,) + state_shape.
done: End-of-episode (terminal state) indicator of shape (batch_size,).
update_stats: Whether to update the running stats of the normalization layer.

Returns:

Computed normalized rewards of shape (batch_size,).
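
A hedged usage sketch; the RunningNorm class imported from imitation is an assumption (any nn.Module class with the expected normalization interface can be passed as normalize_output_layer):

    # Hypothetical sketch: wrap a base net so its outputs are normalized by running stats.
    import gym
    import numpy as np
    from imitation.util.networks import RunningNorm  # assumed available

    from reward_nets import BasicRewardNet, NormalizedRewardNet

    obs_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(4,))
    act_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(2,))
    net = NormalizedRewardNet(BasicRewardNet(obs_space, act_space), normalize_output_layer=RunningNorm)

    batch = 8
    state = np.random.rand(batch, 4).astype(np.float32)
    action = np.random.rand(batch, 2).astype(np.float32)
    next_state = np.random.rand(batch, 4).astype(np.float32)
    done = np.zeros(batch, dtype=bool)

    # update_stats=False freezes the running statistics, e.g. at evaluation time.
    rewards = net.predict_processed(state, action, next_state, done, update_stats=False)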

training: bool
class reward_nets.RewardNet(observation_space: gym.spaces.space.Space, action_space: gym.spaces.space.Space, normalize_images: bool = True)

Bases: torch.nn.modules.module.Module, abc.ABC

Minimal abstract reward network.

Only requires the implementation of a forward pass (calculating rewards given a batch of states, actions, next states and dones).
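
A minimal sketch of a concrete subclass, assuming the standard preprocessing of the copied imitation code (the class, spaces, and data below are hypothetical):

    # Hypothetical sketch: only forward() needs to be implemented by a subclass.
    import gym
    import numpy as np
    import torch as th
    from torch import nn

    from reward_nets import RewardNet

    class LinearStateReward(RewardNet):
        """Toy reward net: a linear function of the (flattened) current state."""

        def __init__(self, observation_space, action_space):
            super().__init__(observation_space, action_space)
            self.linear = nn.Linear(int(np.prod(observation_space.shape)), 1)

        def forward(self, state, action, next_state, done):
            # The returned rewards must have shape (batch_size,).
            return self.linear(th.flatten(state, start_dim=1)).squeeze(-1)

    obs_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(4,))
    act_space = gym.spaces.Discrete(2)
    net = LinearStateReward(obs_space, act_space)

    # predict() handles preprocessing and returns a NumPy array of shape (8,).
    rewards = net.predict(
        np.random.rand(8, 4).astype(np.float32),
        np.zeros(8, dtype=np.int64),
        np.random.rand(8, 4).astype(np.float32),
        np.zeros(8, dtype=bool),
    )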

property device: torch.device

Heuristic to determine which device this module is on.

property dtype: torch.dtype

Heuristic to determine dtype of module.

abstract forward(state: torch.Tensor, action: torch.Tensor, next_state: torch.Tensor, done: torch.Tensor) → torch.Tensor

Compute rewards for a batch of transitions and keep gradients.

predict(state: numpy.ndarray, action: numpy.ndarray, next_state: numpy.ndarray, done: numpy.ndarray) → numpy.ndarray

Compute rewards for a batch of transitions without gradients.

Converts the th.Tensor rewards from predict_th to NumPy arrays.

Args:

state: Current states of shape (batch_size,) + state_shape.
action: Actions of shape (batch_size,) + action_shape.
next_state: Successor states of shape (batch_size,) + state_shape.
done: End-of-episode (terminal state) indicator of shape (batch_size,).

Returns:

Computed rewards of shape (batch_size,).

predict_processed(state: numpy.ndarray, action: numpy.ndarray, next_state: numpy.ndarray, done: numpy.ndarray) → numpy.ndarray

Compute the processed rewards for a batch of transitions without gradients.

Defaults to calling predict. Subclasses can override this to normalize or otherwise modify the rewards in ways that may help RL training or other applications of the reward function.

Args:

state: Current states of shape (batch_size,) + state_shape.
action: Actions of shape (batch_size,) + action_shape.
next_state: Successor states of shape (batch_size,) + state_shape.
done: End-of-episode (terminal state) indicator of shape (batch_size,).

Returns:

Computed processed rewards of shape (batch_size,).
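
For example, a subclass might clip the processed rewards; a hedged sketch (the subclass and clipping bounds are illustrative, not part of the module):

    # Hypothetical sketch: postprocess inference-time rewards without touching
    # the gradient-carrying forward pass used for training.
    import numpy as np

    from reward_nets import BasicRewardNet

    class ClippedRewardNet(BasicRewardNet):
        """Illustrative subclass: clip processed rewards to a fixed range."""

        def predict_processed(self, state, action, next_state, done):
            rewards = super().predict_processed(state, action, next_state, done)
            return np.clip(rewards, -10.0, 10.0)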

predict_th(state: numpy.ndarray, action: numpy.ndarray, next_state: numpy.ndarray, done: numpy.ndarray) → torch.Tensor

Compute th.Tensor rewards for a batch of transitions without gradients.

Preprocesses the inputs and outputs th.Tensor rewards.

Args:

state: Current states of shape (batch_size,) + state_shape.
action: Actions of shape (batch_size,) + action_shape.
next_state: Successor states of shape (batch_size,) + state_shape.
done: End-of-episode (terminal state) indicator of shape (batch_size,).

Returns:

Computed th.Tensor rewards of shape (batch_size,).

preprocess(state: numpy.ndarray, action: numpy.ndarray, next_state: numpy.ndarray, done: numpy.ndarray) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]

Preprocess a batch of input transitions and convert it to PyTorch tensors.

The output of this function is suitable for the model's forward pass, so a typical usage is model(*model.preprocess(state, action, next_state, done)).

Args:

state: The observation input. Its shape is (batch_size,) + observation_space.shape.
action: The action input. Its shape is (batch_size,) + action_space.shape. Its batch dimension is expected to match that of state.
next_state: The observation input for the successor state. Its shape is (batch_size,) + observation_space.shape.
done: Whether the episode has terminated. Its shape is (batch_size,).

Returns:

Preprocessed transitions: a Tuple of tensors containing observations, actions, next observations and dones.
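
An illustrative sketch of the model(*model.preprocess(...)) pattern mentioned above; the net, spaces, and transition arrays are assumptions for this example:

    # Hypothetical sketch: preprocess NumPy transitions, then run a gradient-carrying
    # forward pass whose output can feed a training loss.
    import gym
    import numpy as np

    from reward_nets import BasicRewardNet

    obs_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(4,))
    act_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(2,))
    net = BasicRewardNet(obs_space, act_space)

    state = np.random.rand(8, 4).astype(np.float32)
    action = np.random.rand(8, 2).astype(np.float32)
    next_state = np.random.rand(8, 4).astype(np.float32)
    done = np.zeros(8, dtype=bool)

    # preprocess() returns tensors on the net's device and dtype.
    rewards = net(*net.preprocess(state, action, next_state, done))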

training: bool
class reward_nets.RewardNetWrapper(base: reward_nets.RewardNet)

Bases: reward_nets.RewardNet

An abstract RewardNet wrapping a base network.

A concrete implementation of the forward method is needed. Note: by default, predict, predict_th, preprocess, predict_processed, device and all the PyTorch nn.Module methods will be inherited from RewardNet and not passed through to the base network. If any of these methods is overridden in the base RewardNet, this will not affect RewardNetWrapper.
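
A hedged sketch of a concrete wrapper (the scaling behaviour is illustrative): forward delegates to the wrapped network via the base property.

    # Hypothetical sketch: a wrapper that rescales the base network's rewards.
    from reward_nets import RewardNetWrapper

    class ScaledRewardNet(RewardNetWrapper):
        """Illustrative wrapper: multiply the base net's rewards by a fixed scale."""

        def __init__(self, base, scale: float = 0.1):
            super().__init__(base)
            self.scale = scale

        def forward(self, state, action, next_state, done):
            # Delegate to the base network, then rescale; gradients are kept.
            return self.scale * self.base(state, action, next_state, done)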

property base: reward_nets.RewardNet
training: bool
class reward_nets.ShapedRewardNet(base: reward_nets.RewardNet, potential: Callable[[torch.Tensor], torch.Tensor], discount_factor: float)

Bases: reward_nets.RewardNetWrapper

A RewardNet consisting of a base network and a potential shaping.

forward(state: torch.Tensor, action: torch.Tensor, next_state: torch.Tensor, done: torch.Tensor)

Compute rewards for a batch of transitions and keep gradients.
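
The shaping follows the standard potential-based form; a hedged sketch of the arithmetic the forward pass is expected to compute, with the next-state potential zeroed at episode ends (this terminal-state handling follows the upstream imitation code and is an assumption about this copy):

    # Hedged sketch: shaped(s, a, s') = r(s, a, s') + gamma * (1 - done) * phi(s') - phi(s)
    def shaped_reward(base_reward, potential, state, next_state, done, discount_factor):
        new_potential = (1 - done.float()) * potential(next_state).flatten()
        old_potential = potential(state).flatten()
        return base_reward + discount_factor * new_potential - old_potential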

training: bool