Behavioral Cloning

In behavioral cloning (BC), the agent receives as training data both the states encountered by an expert demonstrator and the actions the demonstrator took in those states, and then uses a classifier or regressor to replicate the expert's policy. BC involves two steps: a) collect demonstrations from the expert; b) treating the expert trajectories as i.i.d. state-action pairs, learn a policy with supervised learning by minimizing a loss function over the demonstrated actions, as sketched below.
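
The following is a minimal, self-contained sketch of step (b), independent of the library documented below; the network architecture, data shapes, and optimizer settings are illustrative placeholders rather than recommended values.

    import torch
    import torch.nn as nn

    # Hypothetical demonstration data: 1000 state-action pairs with a
    # 4-dimensional observation space and 2 discrete actions.
    obs = torch.randn(1000, 4)
    acts = torch.randint(0, 2, (1000,))

    # A small policy network mapping observations to action logits.
    policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

    for epoch in range(10):
        logits = policy(obs)
        # Supervised loss: negative log-likelihood of the expert's actions
        # under the current policy.
        loss = nn.functional.cross_entropy(logits, acts)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()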

Classes and Functions

Behavioural Cloning (BC).

Trains a policy by applying supervised learning to a fixed dataset of (observation, action) pairs generated by some expert demonstrator.

Code adapted from https://github.com/HumanCompatibleAI/imitation.git

class bc.BC(*, observation_space: gym.spaces.space.Space, action_space: gym.spaces.space.Space, policy: typing.Optional[stable_baselines3.common.policies.ActorCriticPolicy] = None, demonstrations: typing.Optional[typing.Union[typing.Iterable[types_unique.Trajectory], typing.Iterable[typing.Mapping[str, typing.Union[numpy.ndarray, torch.Tensor]]], TransitionKind]] = None, batch_size: int = 32, optimizer_cls: typing.Type[torch.optim.optimizer.Optimizer] = <class 'torch.optim.adam.Adam'>, optimizer_kwargs: typing.Optional[typing.Mapping[str, typing.Any]] = None, ent_weight: float = 0.001, l2_weight: float = 0.0, device: typing.Union[str, torch.device] = 'auto', custom_logger: typing.Optional[logger.HierarchicalLogger] = None)

Bases: base.DemonstrationAlgorithm

Behavioral cloning (BC).

Recovers a policy via supervised learning from observation-action pairs.

allow_variable_horizon: bool

If True, allow variable horizon trajectories; otherwise error if detected.

property policy: stable_baselines3.common.policies.ActorCriticPolicy

Returns a policy imitating the demonstration data.
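
A minimal usage sketch, assuming the module is importable as imitation.algorithms.bc (per the repository linked above) and that the constructor matches the signature documented here; the environment name, data shapes, and placeholder demonstrations are illustrative.

    import gym
    import numpy as np
    from imitation.algorithms import bc

    env = gym.make("CartPole-v1")

    # Placeholder demonstrations: an iterable of mappings with "obs" and "acts"
    # arrays, as described under set_demonstrations() below. Real data would
    # come from an expert policy or a logged dataset.
    demos = [
        {
            "obs": np.zeros((32, 4), dtype=np.float32),
            "acts": np.zeros((32,), dtype=np.int64),
        }
    ]

    bc_trainer = bc.BC(
        observation_space=env.observation_space,
        action_space=env.action_space,
        demonstrations=demos,
        batch_size=32,
    )
    imitation_policy = bc_trainer.policy  # policy imitating the demonstration data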

save_policy(policy_path: Union[str, bytes, os.PathLike]) → None

Save policy to a path. Can be reloaded by .reconstruct_policy().

Args:

policy_path: path to save policy to.
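
For example, continuing the sketch above, a trained learner might be persisted like this (the file name is illustrative):

    bc_trainer.save_policy("bc_policy.pt")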

set_demonstrations(demonstrations: Union[Iterable[types_unique.Trajectory], Iterable[Mapping[str, Union[numpy.ndarray, torch.Tensor]]], TransitionKind]) → None

Sets the demonstration data.

Changing the demonstration data on-demand can be useful for interactive algorithms like DAgger.

Args:

demonstrations: Either a Torch DataLoader, any other iterator that yields dictionaries containing "obs" and "acts" Tensors or NumPy arrays, a TransitionKind instance, or a Sequence of Trajectory objects.
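
Continuing the earlier sketch, replacing the data on an existing learner might look like this (new_demos stands for demonstration data in any of the accepted forms above):

    bc_trainer.set_demonstrations(new_demos)  # e.g. fresh DAgger-style demonstrations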

train(*, n_epochs: Optional[int] = None, n_batches: Optional[int] = None, on_epoch_end: Optional[Callable[[], None]] = None, on_batch_end: Optional[Callable[[], None]] = None, log_interval: int = 500, log_rollouts_venv: Optional[stable_baselines3.common.vec_env.base_vec_env.VecEnv] = None, log_rollouts_n_episodes: int = 5, progress_bar: bool = True, reset_tensorboard: bool = False)

Train with supervised learning for some number of epochs.

Here an ‘epoch’ is just a complete pass through the expert data loader, as set by self.set_expert_data_loader(). Note that if you specify an n_batches smaller than the number of batches in an epoch, the on_epoch_end callback will never be called.

Args:

n_epochs: Number of complete passes made through expert data before ending training. Provide exactly one of n_epochs and n_batches.

n_batches: Number of batches loaded from dataset before ending training. Provide exactly one of n_epochs and n_batches.

on_epoch_end: Optional callback with no parameters to run at the end of each epoch.

on_batch_end: Optional callback with no parameters to run at the end of each batch.

log_interval: Log stats after every log_interval batches.

log_rollouts_venv: If not None, then this VecEnv (whose observation and action spaces must match self.observation_space and self.action_space) is used to generate rollout stats, including average return and average episode length. If None, then no rollouts are generated.

log_rollouts_n_episodes: Number of rollouts to generate when calculating rollout stats. Non-positive number disables rollouts.

progress_bar: If True, then show a progress bar during training.

reset_tensorboard: If True, then start plotting to Tensorboard from x=0 even if .train() logged to Tensorboard previously. Has no practical effect if .train() is being called for the first time.
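
A sketch of a typical training call, assuming bc_trainer was constructed as in the earlier sketch and eval_venv is a VecEnv whose spaces match the policy (both names are illustrative):

    bc_trainer.train(
        n_epochs=5,                   # provide exactly one of n_epochs / n_batches
        log_interval=100,             # log stats every 100 batches
        log_rollouts_venv=eval_venv,  # optional: generate rollout stats in this VecEnv
        log_rollouts_n_episodes=5,    # rollouts per evaluation
    )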

class bc.BCLogger(logger: logger.HierarchicalLogger)

Bases: object

Utility class to help logging information relevant to Behavior Cloning.

log_batch(batch_num: int, batch_size: int, num_samples_so_far: int, training_metrics: bc.BCTrainingMetrics, rollout_stats: Mapping[str, float])
log_epoch(epoch_number)
reset_tensorboard_steps()
class bc.BCTrainingMetrics(neglogp: torch.Tensor, entropy: torch.Tensor, ent_loss: torch.Tensor, prob_true_act: torch.Tensor, l2_norm: torch.Tensor, l2_loss: torch.Tensor, loss: torch.Tensor)

Bases: object

Container for the different components of behavior cloning loss.

ent_loss: torch.Tensor
entropy: torch.Tensor
l2_loss: torch.Tensor
l2_norm: torch.Tensor
loss: torch.Tensor
neglogp: torch.Tensor
prob_true_act: torch.Tensor
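
A rough sketch of how these components are assumed to combine, given the ent_weight and l2_weight parameters of BehaviorCloningLossCalculator below; the exact expression used by the library may differ.

    # Assumed composition of the BC loss (sketch, not the library's exact code):
    # the policy entropy acts as a weighted bonus and the L2 norm of the policy
    # parameters as a weighted penalty on top of the negative log-likelihood.
    ent_loss = -ent_weight * entropy
    l2_loss = l2_weight * l2_norm
    loss = neglogp + ent_loss + l2_loss
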
class bc.BatchIteratorWithEpochEndCallback(batch_loader: Iterable[Mapping[str, Union[numpy.ndarray, torch.Tensor]]], n_epochs: Optional[int], n_batches: Optional[int], on_epoch_end: Optional[Callable[[int], None]])

Bases: object

Loops through batches from a batch loader and calls a callback after every epoch.

Will throw an exception when an epoch contains no batches.

batch_loader: Iterable[Mapping[str, Union[numpy.ndarray, torch.Tensor]]]
n_batches: Optional[int]
n_epochs: Optional[int]
on_epoch_end: Optional[Callable[[int], None]]
class bc.BehaviorCloningLossCalculator(ent_weight: float, l2_weight: float)

Bases: object

Functor to compute the loss used in Behavior Cloning.

ent_weight: float
l2_weight: float
class bc.BehaviorCloningTrainer(loss: bc.BehaviorCloningLossCalculator, optimizer: torch.optim.optimizer.Optimizer, policy: stable_baselines3.common.policies.ActorCriticPolicy)

Bases: object

Functor to fit a policy to expert demonstration data.

loss: bc.BehaviorCloningLossCalculator
optimizer: torch.optim.optimizer.Optimizer
policy: stable_baselines3.common.policies.ActorCriticPolicy
class bc.RolloutStatsComputer(venv: stable_baselines3.common.vec_env.base_vec_env.VecEnv, n_episodes: int)

Bases: object

Computes statistics about rollouts.

Args:

venv: The vectorized environment in which to compute the rollouts.

n_episodes: The number of episodes to base the statistics on.

n_episodes: int
venv: stable_baselines3.common.vec_env.base_vec_env.VecEnv
bc.enumerate_batches(batch_it: Iterable[Mapping[str, Union[numpy.ndarray, torch.Tensor]]]) → Iterable[Tuple[Tuple[int, int, int], Mapping[str, Union[numpy.ndarray, torch.Tensor]]]]

Prepends batch stats before the batches of a batch iterator.

bc.reconstruct_policy(policy_path: str, device: Union[torch.device, str] = 'auto') → stable_baselines3.common.policies.ActorCriticPolicy

Reconstruct a saved policy.

Args:

policy_path: path where .save_policy() has been run.

device: device on which to load the policy.

Returns:

policy: policy with reloaded weights.
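
A short sketch of reloading a previously saved policy; the file name is illustrative and matches the save_policy sketch above.

    from imitation.algorithms import bc

    policy = bc.reconstruct_policy("bc_policy.pt", device="cpu")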