Behavioral Cloning
In Behavioral Cloning (BC), the agent receives as training data both the states the demonstrator encountered and the actions the demonstrator took, and then uses a classifier or regressor to replicate the expert’s policy. BC involves two steps: a) Collect demonstrations from the expert. b) Treating the expert trajectories as i.i.d. state-action pairs, learn a policy with supervised learning by minimizing a loss function.
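The two steps above can be sketched in a few lines. This is a minimal illustration only, assuming a small discrete state space and a tabular policy instead of the neural-network policy this module trains; `collect_demonstrations`, `fit_tabular_policy`, and `expert` are hypothetical names that are not part of the `bc` module.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

Pair = Tuple[Hashable, Hashable]


def collect_demonstrations(
    expert_policy: Callable[[Hashable], Hashable],
    states: Iterable[Hashable],
) -> List[Pair]:
    """Step a): record the expert's action in every visited state."""
    return [(state, expert_policy(state)) for state in states]


def fit_tabular_policy(pairs: Iterable[Pair]) -> Dict[Hashable, Hashable]:
    """Step b): treat the pairs as i.i.d. samples and fit a classifier.

    Picking the most frequent action per state is the maximum-likelihood
    fit of a tabular policy to the demonstration data.
    """
    counts: Dict[Hashable, Counter] = defaultdict(Counter)
    for state, action in pairs:
        counts[state][action] += 1
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}


def expert(state: int) -> str:
    """Hypothetical expert: head towards position 5."""
    return "right" if state < 5 else "left"


demos = collect_demonstrations(expert, states=[0, 2, 4, 5, 7, 9, 2, 7])
policy = fit_tabular_policy(demos)
```

The learned table only covers states that appear in the demonstrations; a state the expert never visited has no entry in `policy`.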
Classes and Functions
Behavioural Cloning (BC).
Trains policy by applying supervised learning to a fixed dataset of (observation, action) pairs generated by some expert demonstrator.
Code adopted from
- class bc.BC(*, observation_space:, action_space:, policy: typing.Optional[stable_baselines3.common.policies.ActorCriticPolicy] = None, demonstrations: typing.Optional[typing.Union[typing.Iterable[types_unique.Trajectory], typing.Iterable[typing.Mapping[str, typing.Union[numpy.ndarray, torch.Tensor]]], typing.TransitionKind]] = None, batch_size: int = 32, optimizer_cls: typing.Type[torch.optim.optimizer.Optimizer] = <class 'torch.optim.adam.Adam'>, optimizer_kwargs: typing.Optional[typing.Mapping[str, typing.Any]] = None, ent_weight: float = 0.001, l2_weight: float = 0.0, device: typing.Union[str, torch.device] = 'auto', custom_logger: typing.Optional[logger.HierarchicalLogger] = None)¶
Behavioral cloning (BC).
Recovers a policy via supervised learning from observation-action pairs.
- allow_variable_horizon: bool¶
If True, allow variable horizon trajectories; otherwise error if detected.
- property policy: stable_baselines3.common.policies.ActorCriticPolicy¶
Returns a policy imitating the demonstration data.
- save_policy(policy_path: Union[str, bytes, os.PathLike]) None ¶
Save policy to a path. Can be reloaded by .reconstruct_policy().
- Args:
policy_path: path to save policy to.
- set_demonstrations(demonstrations: Union[Iterable[types_unique.Trajectory], Iterable[Mapping[str, Union[numpy.ndarray, torch.Tensor]]], TransitionKind]) None ¶
Sets the demonstration data.
Changing the demonstration data on-demand can be useful for interactive algorithms like DAgger.
- Args:
- demonstrations: Either a Torch DataLoader, any other iterator that yields dictionaries containing “obs” and “acts” Tensors or NumPy arrays, a TransitionKind instance, or a Sequence of Trajectory objects.
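As a rough illustration of the "iterator that yields dictionaries" option, the sketch below groups (observation, action) pairs into batches keyed by "obs" and "acts". It is an assumption-laden stand-in: `batch_transitions` is a hypothetical helper, and plain Python lists take the place of the NumPy arrays or Tensors that a real loader must yield.

```python
from typing import Dict, Iterator, List, Sequence, Tuple

Transition = Tuple[List[float], int]  # (observation, action), hypothetical layout


def batch_transitions(
    transitions: Sequence[Transition],
    batch_size: int,
) -> Iterator[Dict[str, list]]:
    """Yield dictionaries with "obs" and "acts" entries, one per batch."""
    for start in range(0, len(transitions), batch_size):
        chunk = transitions[start:start + batch_size]
        yield {
            "obs": [obs for obs, _ in chunk],
            "acts": [act for _, act in chunk],
        }


transitions = [([0.0, 1.0], 1), ([0.5, 0.5], 0), ([1.0, 0.0], 0)]
batches = list(batch_transitions(transitions, batch_size=2))
```

With three transitions and a batch size of two, the last batch is smaller than the others; whether a real loader keeps or drops such a partial batch is a design choice this sketch does not settle.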
- train(*, n_epochs: Optional[int] = None, n_batches: Optional[int] = None, on_epoch_end: Optional[Callable[None]] = None, on_batch_end: Optional[Callable[None]] = None, log_interval: int = 500, log_rollouts_venv: Optional[stable_baselines3.common.vec_env.base_vec_env.VecEnv] = None, log_rollouts_n_episodes: int = 5, progress_bar: bool = True, reset_tensorboard: bool = False)¶
Train with supervised learning for some number of epochs.
Here an ‘epoch’ is just a complete pass through the expert data loader, as set by self.set_expert_data_loader(). Note that if n_batches is smaller than the number of batches in an epoch, the on_epoch_end callback is never called.
- Args:
- n_epochs: Number of complete passes made through expert data before ending training. Provide exactly one of n_epochs and n_batches.
- n_batches: Number of batches loaded from dataset before ending training. Provide exactly one of n_epochs and n_batches.
- on_epoch_end: Optional callback with no parameters to run at the end of each epoch.
- on_batch_end: Optional callback with no parameters to run at the end of each batch.
- log_interval: Log stats after every log_interval batches.
- log_rollouts_venv: If not None, then this VecEnv (whose observation and action spaces must match self.observation_space and self.action_space) is used to generate rollout stats, including average return and average episode length. If None, then no rollouts are generated.
- log_rollouts_n_episodes: Number of rollouts to generate when calculating rollout stats. Non-positive number disables rollouts.
- progress_bar: If True, then show a progress bar during training.
- reset_tensorboard: If True, then start plotting to Tensorboard from x=0 even if .train() logged to Tensorboard previously. Has no practical effect if .train() is being called for the first time.
- class bc.BCLogger(logger: logger.HierarchicalLogger)¶
Utility class to help log information relevant to Behavior Cloning.
- log_batch(batch_num: int, batch_size: int, num_samples_so_far: int, training_metrics: bc.BCTrainingMetrics, rollout_stats: Mapping[str, float])¶
- log_epoch(epoch_number)¶
- reset_tensorboard_steps()¶
- class bc.BCTrainingMetrics(neglogp: torch.Tensor, entropy: torch.Tensor, ent_loss: torch.Tensor, prob_true_act: torch.Tensor, l2_norm: torch.Tensor, l2_loss: torch.Tensor, loss: torch.Tensor)¶
Container for the different components of behavior cloning loss.
- ent_loss: torch.Tensor¶
- entropy: torch.Tensor¶
- l2_loss: torch.Tensor¶
- l2_norm: torch.Tensor¶
- loss: torch.Tensor¶
- neglogp: torch.Tensor¶
- prob_true_act: torch.Tensor¶
- class bc.BatchIteratorWithEpochEndCallback(batch_loader: Iterable[Mapping[str, Union[numpy.ndarray, torch.Tensor]]], n_epochs: Optional[int], n_batches: Optional[int], on_epoch_end: Optional[Callable[int, None]])¶
Loops through batches from a batch loader and calls a callback after every epoch.
Will throw an exception when an epoch contains no batches.
- batch_loader: Iterable[Mapping[str, Union[numpy.ndarray, torch.Tensor]]]¶
- n_batches: Optional[int]¶
- n_epochs: Optional[int]¶
- on_epoch_end: Optional[Callable[int, None]]¶
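The looping behavior described for this class can be sketched as a plain generator. This is not the library's implementation: `iterate_batches` is a hypothetical function, the exception type is a guess, and passing a one-based epoch number to the callback is an assumption based only on the `Callable[int, None]` annotation above.

```python
from typing import Any, Callable, Iterable, Iterator, List, Optional


def iterate_batches(
    batch_loader: Iterable[Any],
    n_epochs: Optional[int] = None,
    n_batches: Optional[int] = None,
    on_epoch_end: Optional[Callable[[int], None]] = None,
) -> Iterator[Any]:
    """Yield batches until n_epochs passes or n_batches batches are done."""
    # Exactly one of the two stopping criteria must be given.
    if (n_epochs is None) == (n_batches is None):
        raise ValueError("Provide exactly one of n_epochs and n_batches.")

    epoch_num = 0
    batch_num = 0
    while True:
        saw_batch = False
        for batch in batch_loader:
            saw_batch = True
            yield batch
            batch_num += 1
            if n_batches is not None and batch_num >= n_batches:
                return  # stopped mid-epoch, so on_epoch_end is skipped
        if not saw_batch:
            raise ValueError("Epoch contains no batches.")
        epoch_num += 1
        if on_epoch_end is not None:
            on_epoch_end(epoch_num)  # assumed: one-based epoch number
        if n_epochs is not None and epoch_num >= n_epochs:
            return


loader = ["batch_a", "batch_b", "batch_c"]  # three batches per epoch

finished_epochs: List[int] = []
seen = list(iterate_batches(loader, n_epochs=2, on_epoch_end=finished_epochs.append))

finished_short: List[int] = []
seen_short = list(
    iterate_batches(loader, n_batches=2, on_epoch_end=finished_short.append)
)
```

In the second call, training stops after two of the three batches, so no epoch completes and `finished_short` stays empty, which mirrors the note on `train()` about `n_batches` being smaller than an epoch.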
- class bc.BehaviorCloningLossCalculator(ent_weight: float, l2_weight: float)¶
Functor to compute the loss used in Behavior Cloning.
- ent_weight: float¶
- l2_weight: float¶
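The field names in BCTrainingMetrics suggest how the loss is assembled from ent_weight and l2_weight. The sketch below works that out for a single discrete action distribution using plain floats; it is an assumption about how the terms combine, not the library's code. `LossTerms` and `bc_loss_terms` are hypothetical names, and halving the sum of squared parameters for `l2_norm` is likewise assumed.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LossTerms:
    """Float stand-in for the tensor fields of BCTrainingMetrics."""

    neglogp: float
    entropy: float
    ent_loss: float
    prob_true_act: float
    l2_norm: float
    l2_loss: float
    loss: float


def bc_loss_terms(
    action_probs: Sequence[float],
    expert_action: int,
    params: Sequence[float],
    ent_weight: float = 0.001,  # default taken from the BC constructor
    l2_weight: float = 0.0,  # default taken from the BC constructor
) -> LossTerms:
    prob_true_act = action_probs[expert_action]
    # Negative log-likelihood of the expert's action under the policy.
    neglogp = -math.log(prob_true_act)
    # Entropy of the policy's action distribution.
    entropy = -sum(p * math.log(p) for p in action_probs if p > 0)
    # Assumed: an entropy bonus, so higher entropy lowers the loss.
    ent_loss = -ent_weight * entropy
    # Assumed: half the sum of squared parameters.
    l2_norm = sum(w * w for w in params) / 2
    l2_loss = l2_weight * l2_norm
    loss = neglogp + ent_loss + l2_loss
    return LossTerms(
        neglogp, entropy, ent_loss, prob_true_act, l2_norm, l2_loss, loss
    )


terms = bc_loss_terms([0.7, 0.2, 0.1], expert_action=0, params=[0.5, -1.0])
```

With the default l2_weight of 0.0 the L2 term drops out, so the loss here is the negative log-likelihood minus a small entropy bonus.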
- class bc.BehaviorCloningTrainer(loss: bc.BehaviorCloningLossCalculator, optimizer: torch.optim.optimizer.Optimizer, policy: stable_baselines3.common.policies.ActorCriticPolicy)¶
Functor to fit a policy to expert demonstration data.
- optimizer: torch.optim.optimizer.Optimizer¶
- policy: stable_baselines3.common.policies.ActorCriticPolicy¶
- class bc.RolloutStatsComputer(venv: stable_baselines3.common.vec_env.base_vec_env.VecEnv, n_episodes: int)¶
Computes statistics about rollouts.
- Args:
venv: The vectorized environment in which to compute the rollouts.
n_episodes: The number of episodes to base the statistics on.
- n_episodes: int¶
- venv: stable_baselines3.common.vec_env.base_vec_env.VecEnv¶
- bc.enumerate_batches(batch_it: Iterable[Mapping[str, Union[numpy.ndarray, torch.Tensor]]]) Iterable[Tuple[Tuple[int, int, int], Mapping[str, Union[numpy.ndarray, torch.Tensor]]]] ¶
Prepends batch stats before the batches of a batch iterator.
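A possible reading of "prepends batch stats" is sketched below. The meaning and order of the three integers are assumptions, chosen to match the `batch_num`, `batch_size`, `num_samples_so_far` parameters of `BCLogger.log_batch`; `enumerate_batches_sketch` is a hypothetical name, and reading the batch size from the "obs" entry is also assumed.

```python
from typing import Any, Iterable, Iterator, Mapping, Tuple

BatchStats = Tuple[int, int, int]  # (batch_num, batch_size, num_samples_so_far)


def enumerate_batches_sketch(
    batch_it: Iterable[Mapping[str, Any]],
) -> Iterator[Tuple[BatchStats, Mapping[str, Any]]]:
    """Pair every batch with running statistics about the iteration."""
    num_samples_so_far = 0
    for batch_num, batch in enumerate(batch_it):
        batch_size = len(batch["obs"])  # assumed: one observation per sample
        num_samples_so_far += batch_size
        yield (batch_num, batch_size, num_samples_so_far), batch


example_batches = [
    {"obs": [0, 1], "acts": [1, 0]},
    {"obs": [2, 3, 4], "acts": [0, 0, 1]},
]
stats = [s for s, _ in enumerate_batches_sketch(example_batches)]
```

Keeping the statistics in a tuple next to the batch lets a training loop unpack both in a single `for` statement.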
- bc.reconstruct_policy(policy_path: str, device: Union[torch.device, str] = 'auto') stable_baselines3.common.policies.ActorCriticPolicy ¶
Reconstruct a saved policy.
- Args:
policy_path: path where .save_policy() has been run.
device: device on which to load the policy.
- Returns:
policy: policy with reloaded weights.