DAgger Imitation Learning¶
Behavior cloning relies on an i.i.d. assumption: the policy is trained on state-action pairs drawn from the distribution of states visited by the demonstrator. Once the learned policy makes a mistake, it drifts into states the demonstrator never visited, and its errors compound over the rest of the trajectory. DAgger (Dataset Aggregation) is a meta-algorithm that instead learns a stationary deterministic policy with good performance guarantees under the distribution of states the policy itself induces. At each iteration it collects a dataset under the current policy, labels the visited states with expert actions, and trains the next policy on the aggregate of all datasets collected so far. The intuition is that, over the iterations, the algorithm builds up the set of inputs the learned policy is likely to encounter during its own execution, based on the previous training iterations.
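The loop below is a minimal, illustrative sketch of this meta-algorithm, not the implementation documented further down; collect_states, query_expert, fit_bc, beta_schedule, initial_policy, and n_rounds are hypothetical placeholders standing in for environment rollouts, demonstrator queries, supervised training, and caller-supplied configuration:

    # Illustrative sketch of the DAgger loop; all names below are placeholders.
    dataset = []                        # aggregated (state, expert action) pairs
    policy = initial_policy             # e.g. obtained by plain behavior cloning

    for round_num in range(n_rounds):
        beta = beta_schedule(round_num)  # fraction of expert actions used this round
        # Roll out a mixture policy: expert action with probability beta,
        # learner action otherwise, and record the visited states.
        states = collect_states(policy, beta)
        # Label every visited state with the expert's action and aggregate.
        dataset.extend((s, query_expert(s)) for s in states)
        # Train the next policy by supervised learning on all data so far.
        policy = fit_bc(dataset)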
Classes and Functions¶
DAgger (https://arxiv.org/pdf/1011.0686.pdf).
Interactively trains policy by collecting some demonstrations, doing BC, collecting more demonstrations, doing BC again, etc. Initially the demonstrations just come from the expert’s policy; over time, they shift to be drawn more and more from the imitator’s policy.
Code adapted from https://github.com/HumanCompatibleAI/imitation.git
- class dagger.BetaSchedule¶
Bases:
abc.ABC
Computes beta (% of time demonstration action used) from training round.
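Because the DAggerTrainer below accepts any callable mapping the round number to beta, a custom schedule can be a small subclass. The sketch assumes the abstract interface is a single __call__(round_num) method (consistent with the Callable[[int], float] type of beta_schedule) and that the module is importable as dagger; the class name and decay rule are illustrative:

    import dagger

    class ExponentialBetaSchedule(dagger.BetaSchedule):
        """Illustrative schedule: beta decays geometrically with the round number."""

        def __init__(self, decay: float = 0.5):
            self.decay = decay

        def __call__(self, round_num: int) -> float:
            # Round 0 uses the expert 100% of the time; later rounds exponentially less.
            return self.decay ** round_num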
- class dagger.DAggerTrainer(*, venv: stable_baselines3.common.vec_env.base_vec_env.VecEnv, scratch_dir: Union[str, bytes, os.PathLike], beta_schedule: Optional[Callable[[int], float]] = None, bc_trainer: bc.BC, custom_logger: Optional[logger.HierarchicalLogger] = None)¶
Bases:
base.BaseImitationAlgorithm
DAgger training class with low-level API suitable for interactive human feedback.
In essence, this is just BC with some helpers for incrementally resuming training and interpolating between demonstrator/learnt policies. Interaction proceeds in “rounds” in which the demonstrator first provides a fresh set of demonstrations, and then an underlying BC is invoked to fine-tune the policy on the entire set of demonstrations collected in all rounds so far. Demonstrations and policy/trainer checkpoints are stored in a directory with the following structure:
    scratch-dir-name/
        checkpoint-001.pkl
        checkpoint-002.pkl
        …
        checkpoint-XYZ.pkl
        checkpoint-latest.pkl
        demos/
            round-000/
                demos_round_000_000.npz
                demos_round_000_001.npz
                …
            round-001/
                demos_round_001_000.npz
                …
            …
            round-XYZ/
                …
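A minimal interactive round loop might look like the following sketch. It assumes the module is importable as dagger and that venv (a VecEnv), bc_trainer (a bc.BC instance), an expert_policy with a stable-baselines3 .predict() method, and N_ROUNDS are already constructed by the caller; the paths are illustrative:

    import numpy as np

    trainer = dagger.DAggerTrainer(
        venv=venv,
        scratch_dir="dagger_scratch",
        bc_trainer=bc_trainer,
    )

    for _ in range(N_ROUNDS):
        collector = trainer.create_trajectory_collector()
        obs = collector.reset()
        dones = np.zeros(venv.num_envs, dtype=bool)
        # Collect expert-labelled transitions until at least one episode finishes.
        while not dones.any():
            expert_acts, _ = expert_policy.predict(obs, deterministic=True)
            obs, _, dones, _ = collector.step(expert_acts)
        # Fit BC on the aggregate of all rounds and advance the round counter.
        trainer.extend_and_update()

    trainer.save_policy("dagger_policy.pt")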
- DEFAULT_N_EPOCHS: int = 4¶
The default number of BC training epochs in extend_and_update.
- property batch_size: int¶
- create_trajectory_collector() → dagger.InteractiveTrajectoryCollector¶
Create trajectory collector to extend current round’s demonstration set.
- Returns:
A collector configured with the appropriate beta, imitator policy, etc. for the current round. Refer to the documentation for InteractiveTrajectoryCollector to see how to use this.
- extend_and_update(bc_train_kwargs: Optional[Mapping] = None) → int¶
Extend internal batch of data and train BC.
Specifically, this method will load new transitions (if necessary), train the model for a while, and advance the round counter. If there are no fresh demonstrations in the demonstration directory for the current round, then this will raise a NeedsDemosException instead of training or advancing the round counter. In that case, the user should call .create_trajectory_collector() and use the returned InteractiveTrajectoryCollector to produce a new set of demonstrations for the current interaction round.
- Arguments:
- bc_train_kwargs: Keyword arguments for calling BC.train(). If
the log_rollouts_venv key is not provided, it is set to self.venv by default. If neither the n_epochs nor the n_batches key is provided, n_epochs is set to self.DEFAULT_N_EPOCHS.
- Returns:
New round number after advancing the round counter.
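Continuing the earlier sketch, a call might override the number of BC epochs and handle the missing-demonstrations case explicitly (names assumed):

    try:
        # Train for 8 BC epochs instead of the default and advance the round.
        new_round = trainer.extend_and_update(bc_train_kwargs=dict(n_epochs=8))
    except dagger.NeedsDemosException:
        # No fresh demonstrations for this round yet: create a collector, gather
        # demos with it (as in the loop above), then call extend_and_update() again.
        collector = trainer.create_trajectory_collector()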
- property logger¶
- property policy: stable_baselines3.common.policies.BasePolicy¶
- save_policy(policy_path: Union[str, bytes, os.PathLike]) → None¶
Save the current policy only (and not the rest of the trainer).
- Args:
policy_path: path to save policy to.
- save_trainer() → Tuple[pathlib.Path, pathlib.Path]¶
Create a snapshot of trainer in the scratch/working directory.
The created snapshot can be reloaded with reconstruct_trainer(). In addition to saving one copy of the policy in the trainer snapshot, this method saves a second copy of the policy in its own file. Having a second copy of the policy is convenient because it can be loaded on its own and passed to evaluation routines for other algorithms.
- Returns:
checkpoint_path: a path to one of the created DAggerTrainer checkpoints.
policy_path: a path to one of the created DAggerTrainer policies.
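For example, continuing the earlier sketch (names assumed):

    # Snapshot the whole trainer plus a standalone copy of the policy.
    checkpoint_path, policy_path = trainer.save_trainer()
    # Both paths point inside the scratch directory shown above.
    print(checkpoint_path, policy_path)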
- class dagger.InteractiveTrajectoryCollector(venv: stable_baselines3.common.vec_env.base_vec_env.VecEnv, get_robot_acts: Callable[[numpy.ndarray], numpy.ndarray], beta: float, save_dir: Union[str, bytes, os.PathLike])¶
Bases:
stable_baselines3.common.vec_env.base_vec_env.VecEnvWrapper
DAgger VecEnvWrapper for querying and saving expert actions.
Every call to .step(actions) accepts and saves expert actions to self.save_dir, but only forwards the expert actions to the wrapped VecEnv with probability self.beta. With probability 1 - self.beta, a “robot” action (i.e. an action from the imitation policy) is forwarded instead.
Demonstrations are saved as TrajectoryWithRew to self.save_dir at the end of every episode.
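Conceptually, the per-environment mixing at each step behaves like the self-contained sketch below; this illustrates the rule rather than the wrapper's exact implementation, and the action arrays are placeholders:

    import numpy as np

    rng = np.random.default_rng(0)
    beta = 0.75
    expert_acts = np.array([0, 1, 1, 0])   # placeholder actions for 4 parallel envs
    robot_acts = np.array([1, 1, 0, 0])    # placeholder imitation-policy actions

    # Forward the expert action with probability beta, the robot action otherwise.
    use_expert = rng.uniform(size=expert_acts.shape[0]) < beta
    forwarded = np.where(use_expert, expert_acts, robot_acts)

    # Regardless of which action was forwarded, the *expert* actions are what get
    # written to the saved demonstrations at the end of the episode.
    saved = expert_acts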
- reset() → numpy.ndarray¶
Resets the environment.
- Returns:
obs: first observation of a new trajectory.
- seed(seed: Optional[int] = None) → List[Union[None, int]]¶
Set the seed for the DAgger random number generator and wrapped VecEnv.
The DAgger RNG is used along with self.beta to determine whether the expert or robot action is forwarded to the wrapped VecEnv.
- Args:
seed: The random seed. May be None for completely random seeding.
- Returns:
A list containing the seeds for each individual env. Note that all list elements may be None, if the env does not return anything when seeded.
- step_async(actions: numpy.ndarray) → None¶
Steps with a 1 - beta chance of using self.get_robot_acts instead.
DAgger needs to be able to inject imitation-policy actions at a random subset of time steps. This method keeps the actions passed in as an argument with probability self.beta, and with probability 1 - self.beta forwards “robot” (i.e. imitation policy) actions generated by self.get_robot_acts instead.
At the end of every episode, a TrajectoryWithRew is saved to self.save_dir, where every saved action is the expert action, regardless of whether the robot action was used during that timestep.
- Args:
- actions: the _intended_ demonstrator/expert actions for the current
state. These will be executed with probability self.beta. Otherwise, a “robot” (typically a BC policy) action will be sampled and executed instead via self.get_robot_acts.
- step_wait() → Tuple[Union[numpy.ndarray, Dict[str, numpy.ndarray], Tuple[numpy.ndarray, ...]], numpy.ndarray, numpy.ndarray, List[Dict]]¶
Returns the observation, reward, etc. from the previous step_async() call.
Stores the transition, and saves trajectory as demo once complete.
- Returns:
Observation, reward, dones (is terminal?) and info dict.
- class dagger.LinearBetaSchedule(rampdown_rounds: int)¶
Bases:
dagger.BetaSchedule
Linearly-decreasing schedule for beta.
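A minimal sketch of such a linear rampdown rule is shown below; it is illustrative and does not claim to reproduce the class's exact formula:

    def linear_beta(round_num: int, rampdown_rounds: int) -> float:
        """Beta is 1 at round 0, decays linearly, and is 0 from rampdown_rounds on."""
        return min(1.0, max(0.0, (rampdown_rounds - round_num) / rampdown_rounds))

    # With rampdown_rounds=10: rounds 0, 5, and 12 give betas 1.0, 0.5, and 0.0.
    assert linear_beta(0, 10) == 1.0
    assert linear_beta(5, 10) == 0.5
    assert linear_beta(12, 10) == 0.0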
- exception dagger.NeedsDemosException¶
Bases:
Exception
Signals demos need to be collected for current round before continuing.
- class dagger.SimpleDAggerTrainer(*, venv: stable_baselines3.common.vec_env.base_vec_env.VecEnv, scratch_dir: Union[str, bytes, os.PathLike], expert_policy: stable_baselines3.common.policies.BasePolicy, expert_trajs: Optional[Sequence[types_unique.Trajectory]] = None, **dagger_trainer_kwargs)¶
Bases:
dagger.DAggerTrainer
Simpler subclass of DAggerTrainer for training with synthetic feedback.
- allow_variable_horizon: bool¶
If True, allow variable horizon trajectories; otherwise error if detected.
- train(total_timesteps: int, *, rollout_round_min_episodes: int = 3, rollout_round_min_timesteps: int = 500, bc_train_kwargs: Optional[dict] = None) → None¶
Train the DAgger agent.
The agent is trained in “rounds” where each round consists of a dataset aggregation step followed by a BC update step.
During a dataset aggregation step, self.expert_policy is used to perform rollouts in the environment but there is a 1 - beta chance (beta is determined from the round number and self.beta_schedule) that the DAgger agent’s action is used instead. Regardless of whether the DAgger agent’s action is used during the rollout, the expert action and corresponding observation are always appended to the dataset. The number of environment steps in the dataset aggregation stage is determined by the rollout_round_min* arguments.
During a BC update step, BC.train() is called to update the DAgger agent on all data collected so far.
- Args:
- total_timesteps: The number of timesteps to train inside the environment.
In practice this is a lower bound, because the number of timesteps is rounded up to finish the minimum number of episodes or timesteps in the last DAgger training round, and environment timesteps are executed in multiples of self.venv.num_envs.
- rollout_round_min_episodes: The number of episodes that must be completed
before a dataset aggregation step ends.
- rollout_round_min_timesteps: The number of environment timesteps that must
be completed before a dataset aggregation step ends. Note that any round will always train for at least self.batch_size timesteps, because otherwise BC could fail to receive any batches.
- bc_train_kwargs: Keyword arguments for calling BC.train(). If
the log_rollouts_venv key is not provided, it is set to self.venv by default. If neither the n_epochs nor the n_batches key is provided, n_epochs is set to self.DEFAULT_N_EPOCHS.
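An end-to-end sketch with synthetic (policy-generated) expert feedback follows; it assumes venv, an SB3 expert_policy, and bc_trainer are already constructed, and the paths and hyperparameters are illustrative:

    simple_trainer = dagger.SimpleDAggerTrainer(
        venv=venv,
        scratch_dir="dagger_scratch",
        expert_policy=expert_policy,
        bc_trainer=bc_trainer,          # forwarded to DAggerTrainer via **dagger_trainer_kwargs
    )
    simple_trainer.train(
        total_timesteps=20_000,
        rollout_round_min_episodes=3,
        rollout_round_min_timesteps=500,
        bc_train_kwargs=dict(n_epochs=4),
    )
    simple_trainer.save_policy("dagger_policy.pt")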
- dagger.reconstruct_trainer(scratch_dir: Union[str, bytes, os.PathLike], venv: stable_baselines3.common.vec_env.base_vec_env.VecEnv, custom_logger: Optional[logger.HierarchicalLogger] = None, device: Union[torch.device, str] = 'auto') → dagger.DAggerTrainer¶
Reconstruct trainer from the latest snapshot in some working directory.
Requires a vectorized environment and (optionally) a logger, as these objects cannot be serialized.
- Args:
- scratch_dir: path to the working directory created by a previous run of
this algorithm. The directory should contain checkpoint-latest.pt and policy-latest.pt files.
- venv: Vectorized training environment.
- custom_logger: Where to log to; if None (default), creates a new logger.
- device: Device on which to load the trainer.
- Returns:
A deserialized DAggerTrainer.
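For example, resuming from the scratch directory written by save_trainer() (names assumed; a fresh venv must be supplied because environments are not serialized):

    resumed = dagger.reconstruct_trainer(
        scratch_dir="dagger_scratch",
        venv=venv,
        device="cpu",
    )
    # Continue interactive training from the latest snapshot.
    collector = resumed.create_trajectory_collector()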