Behavioral Cloning

In behavioral cloning (BC), the agent receives as training data both the states encountered by an expert demonstrator and the actions the demonstrator took in those states, and then uses a classifier or regressor to replicate the expert's policy. BC involves two steps: a) collect demonstrations from the expert; b) treating the expert trajectories as i.i.d. state-action pairs, learn a policy with supervised learning by minimizing a loss function over the demonstrated actions, as sketched below.
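
The following is a minimal, self-contained sketch of step (b), independent of the library documented below; the network architecture, data shapes, and optimizer settings are illustrative placeholders rather than recommended values.

    import torch
    import torch.nn as nn

    # Hypothetical demonstration data: 1000 state-action pairs with a
    # 4-dimensional observation space and 2 discrete actions.
    obs = torch.randn(1000, 4)
    acts = torch.randint(0, 2, (1000,))

    # A small policy network mapping observations to action logits.
    policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

    for epoch in range(10):
        logits = policy(obs)
        # Supervised loss: negative log-likelihood of the expert's actions
        # under the current policy.
        loss = nn.functional.cross_entropy(logits, acts)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()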

Classes and Functions

Behavioural Cloning (BC).

Trains a policy by applying supervised learning to a fixed dataset of (observation, action) pairs generated by some expert demonstrator.

Code adapted from https://github.com/HumanCompatibleAI/imitation.git

class bc.BC(*, observation_space: gym.spaces.space.Space, action_space: gym.spaces.space.Space, policy: typing.Optional[stable_baselines3.common.policies.ActorCriticPolicy] = None, demonstrations: typing.Optional[typing.Union[typing.Iterable[types_unique.Trajectory], typing.Iterable[typing.Mapping[str, typing.Union[numpy.ndarray, torch.Tensor]]], TransitionKind]] = None, batch_size: int = 32, optimizer_cls: typing.Type[torch.optim.optimizer.Optimizer] = <class 'torch.optim.adam.Adam'>, optimizer_kwargs: typing.Optional[typing.Mapping[str, typing.Any]] = None, ent_weight: float = 0.001, l2_weight: float = 0.0, device: typing.Union[str, torch.device] = 'auto', custom_logger: typing.Optional[logger.HierarchicalLogger] = None)

Bases: base.DemonstrationAlgorithm

Behavioral cloning (BC).

Recovers a policy via supervised learning from observation-action pairs.

allow_variable_horizon: bool

If True, allow variable horizon trajectories; otherwise error if detected.

property policy: stable_baselines3.common.policies.ActorCriticPolicy

Returns a policy imitating the demonstration data.
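
A minimal usage sketch, assuming the module is importable as imitation.algorithms.bc (per the repository linked above) and that the constructor matches the signature documented here; the environment name, data shapes, and placeholder demonstrations are illustrative.

    import gym
    import numpy as np
    from imitation.algorithms import bc

    env = gym.make("CartPole-v1")

    # Placeholder demonstrations: an iterable of mappings with "obs" and "acts"
    # arrays, as described under set_demonstrations() below. Real data would
    # come from an expert policy or a logged dataset.
    demos = [
        {
            "obs": np.zeros((32, 4), dtype=np.float32),
            "acts": np.zeros((32,), dtype=np.int64),
        }
    ]

    bc_trainer = bc.BC(
        observation_space=env.observation_space,
        action_space=env.action_space,
        demonstrations=demos,
        batch_size=32,
    )
    imitation_policy = bc_trainer.policy  # policy imitating the demonstration data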

save_policy(policy_path: Union[str, bytes, os.PathLike]) → None

Save policy to a path. Can be reloaded by .reconstruct_policy().

Args:

policy_path: path to save policy to.
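
For example, continuing the sketch above, a trained learner might be persisted like this (the file name is illustrative):

    bc_trainer.save_policy("bc_policy.pt")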

set_demonstrations(demonstrations: Union[Iterable[types_unique.Trajectory], Iterable[Mapping[str, Union[numpy.ndarray, torch.Tensor]]], TransitionKind]) → None

Sets the demonstration data.

Changing the demonstration data on-demand can be useful for interactive algorithms like DAgger.

Args:

demonstrations: Either a Torch DataLoader, any other iterator that yields dictionaries containing "obs" and "acts" Tensors or NumPy arrays, a TransitionKind instance, or a Sequence of Trajectory objects.
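
Continuing the earlier sketch, replacing the data on an existing learner might look like this (new_demos stands for demonstration data in any of the accepted forms above):

    bc_trainer.set_demonstrations(new_demos)  # e.g. fresh DAgger-style demonstrations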

train(*, n_epochs: Optional[int] = None, n_batches: Optional[int] = None, on_epoch_end: Optional[Callable[[], None]] = None, on_batch_end: Optional[Callable[[], None]] = None, log_interval: int = 500, log_rollouts_venv: Optional[stable_baselines3.common.vec_env.base_vec_env.VecEnv] = None, log_rollouts_n_episodes: int = 5, progress_bar: bool = True, reset_tensorboard: bool = False)

Train with supervised learning for some number of epochs.

Here an ‘epoch’ is just a complete pass through the expert data loader, as set by self.set_expert_data_loader(). Note that if you specify an n_batches smaller than the number of batches in an epoch, the on_epoch_end callback will never be called.

Args:

n_epochs: Number of complete passes made through expert data before ending training. Provide exactly one of n_epochs and n_batches.

n_batches: Number of batches loaded from dataset before ending training. Provide exactly one of n_epochs and n_batches.

on_epoch_end: Optional callback with no parameters to run at the end of each epoch.

on_batch_end: Optional callback with no parameters to run at the end of each batch.

log_interval: Log stats after every log_interval batches.

log_rollouts_venv: If not None, then this VecEnv (whose observation and action spaces must match self.observation_space and self.action_space) is used to generate rollout stats, including average return and average episode length. If None, then no rollouts are generated.

log_rollouts_n_episodes: Number of rollouts to generate when calculating rollout stats. Non-positive number disables rollouts.

progress_bar: If True, then show a progress bar during training.

reset_tensorboard: If True, then start plotting to Tensorboard from x=0 even if .train() logged to Tensorboard previously. Has no practical effect if .train() is being called for the first time.
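
A sketch of a typical training call, assuming bc_trainer was constructed as in the earlier sketch and eval_venv is a VecEnv whose spaces match the policy (both names are illustrative):

    bc_trainer.train(
        n_epochs=5,                   # provide exactly one of n_epochs / n_batches
        log_interval=100,             # log stats every 100 batches
        log_rollouts_venv=eval_venv,  # optional: generate rollout stats in this VecEnv
        log_rollouts_n_episodes=5,    # rollouts per evaluation
    )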

class bc.BCLogger(logger: logger.HierarchicalLogger)

Bases: object

Utility class to help logging information relevant to Behavior Cloning.

log_batch(batch_num: int, batch_size: int, num_samples_so_far: int, training_metrics: bc.BCTrainingMetrics, rollout_stats: Mapping[str, float])
log_epoch(epoch_number)
reset_tensorboard_steps()
class bc.BCTrainingMetrics(neglogp: torch.Tensor, entropy: torch.Tensor, ent_loss: torch.Tensor, prob_true_act: torch.Tensor, l2_norm: torch.Tensor, l2_loss: torch.Tensor, loss: torch.Tensor)

Bases: object

Container for the different components of behavior cloning loss.

ent_loss: torch.Tensor
entropy: torch.Tensor
l2_loss: torch.Tensor
l2_norm: torch.Tensor
loss: torch.Tensor
neglogp: torch.Tensor
prob_true_act: torch.Tensor
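
A rough sketch of how these components are assumed to combine, given the ent_weight and l2_weight parameters of BehaviorCloningLossCalculator below; the exact expression used by the library may differ.

    # Assumed composition of the BC loss (sketch, not the library's exact code):
    # the policy entropy acts as a weighted bonus and the L2 norm of the policy
    # parameters as a weighted penalty on top of the negative log-likelihood.
    ent_loss = -ent_weight * entropy
    l2_loss = l2_weight * l2_norm
    loss = neglogp + ent_loss + l2_loss
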
class bc.BatchIteratorWithEpochEndCallback(batch_loader: Iterable[Mapping[str, Union[numpy.ndarray, torch.Tensor]]], n_epochs: Optional[int], n_batches: Optional[int], on_epoch_end: Optional[Callable[[int], None]])

Bases: object

Loops through batches from a batch loader and calls a callback after every epoch.

Will throw an exception when an epoch contains no batches.

batch_loader: Iterable[Mapping[str, Union[numpy.ndarray, torch.Tensor]]]
n_batches: Optional[int]
n_epochs: Optional[int]
on_epoch_end: Optional[Callable[[int], None]]
class bc.BehaviorCloningLossCalculator(ent_weight: float, l2_weight: float)

Bases: object

Functor to compute the loss used in Behavior Cloning.

ent_weight: float
l2_weight: float
class bc.BehaviorCloningTrainer(loss: bc.BehaviorCloningLossCalculator, optimizer: torch.optim.optimizer.Optimizer, policy: stable_baselines3.common.policies.ActorCriticPolicy)

Bases: object

Functor to fit a policy to expert demonstration data.

loss: bc.BehaviorCloningLossCalculator
optimizer: torch.optim.optimizer.Optimizer
policy: stable_baselines3.common.policies.ActorCriticPolicy
class bc.RolloutStatsComputer(venv: stable_baselines3.common.vec_env.base_vec_env.VecEnv, n_episodes: int)

Bases: object

Computes statistics about rollouts.

Args:

venv: The vectorized environment in which to compute the rollouts.

n_episodes: The number of episodes to base the statistics on.

n_episodes: int
venv: stable_baselines3.common.vec_env.base_vec_env.VecEnv
bc.enumerate_batches(batch_it: Iterable[Mapping[str, Union[numpy.ndarray, torch.Tensor]]]) → Iterable[Tuple[Tuple[int, int, int], Mapping[str, Union[numpy.ndarray, torch.Tensor]]]]

Prepends batch stats before the batches of a batch iterator.

bc.reconstruct_policy(policy_path: str, device: Union[torch.device, str] = 'auto') → stable_baselines3.common.policies.ActorCriticPolicy

Reconstruct a saved policy.

Args:

policy_path: path where .save_policy() has been run.

device: device on which to load the policy.

Returns:

policy: policy with reloaded weights.
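
A short sketch of reloading a previously saved policy; the file name is illustrative and matches the save_policy sketch above.

    from imitation.algorithms import bc

    policy = bc.reconstruct_policy("bc_policy.pt", device="cpu")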