Max Margin IRL using a linear function approximator¶
- class linear_func_approx.Gridworld(grid_size, wind, discount)¶
Bases: object
Gridworld MDP.
- average_reward(n_trajectories, trajectory_length, policy)¶
Calculate the average total reward obtained by following a given policy over n_trajectories trajectories.
- policy: Map from state integers to action integers.
- n_trajectories: Number of trajectories. int.
- trajectory_length: Length of an episode. int.
-> Average reward, standard deviation.
- feature_matrix(feature_map='ident')¶
Get the feature matrix for this gridworld.
- feature_map: Which feature map to use (default ident). String in {ident, coord, proxi}.
-> NumPy array with shape (n_states, d_states).
- feature_vector(i, feature_map='ident')¶
Get the feature vector associated with a state integer.
- i: State int.
- feature_map: Which feature map to use (default ident). String in {ident, coord, proxi}.
-> Feature vector.
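A minimal usage sketch of the feature maps, assuming the module is importable as linear_func_approx; the grid size, wind, and discount values are illustrative, not prescribed by this listing:

```python
# Illustrative sketch only; grid size, wind and discount are arbitrary example values.
from linear_func_approx import Gridworld

gw = Gridworld(grid_size=5, wind=0.3, discount=0.9)

# Full feature matrix: one row per state, shape (n_states, d_states).
ident_features = gw.feature_matrix(feature_map="ident")

# Single-state feature vectors under the other documented feature maps.
coord_features_s0 = gw.feature_vector(0, feature_map="coord")
proxi_features_s0 = gw.feature_vector(0, feature_map="proxi")

print(ident_features.shape)
print(coord_features_s0, proxi_features_s0)
```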
- generate_trajectories(n_trajectories, trajectory_length, policy, random_start=False)¶
Generate n_trajectories trajectories with length trajectory_length, following the given policy.
- n_trajectories: Number of trajectories. int.
- trajectory_length: Length of an episode. int.
- policy: Map from state integers to action integers.
- random_start: Whether to start randomly (default False). bool.
-> [[(state int, action int, reward float)]]
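For example, trajectories under the gridworld's own optimal policy might be sampled as below. This is a sketch that assumes the policy argument may be any callable from state int to action int, as the "map from state integers to action integers" wording suggests; all numeric arguments are arbitrary:

```python
# Sketch: sample trajectories and compare with average_reward.
from linear_func_approx import Gridworld

gw = Gridworld(grid_size=5, wind=0.3, discount=0.9)

# optimal_policy maps a state int to an action int, so it can serve as the policy.
trajectories = gw.generate_trajectories(
    n_trajectories=20,
    trajectory_length=15,
    policy=gw.optimal_policy,
    random_start=True,
)
print(len(trajectories), len(trajectories[0]))  # 20 trajectories of 15 (state, action, reward) triples

# Average total reward (and its standard deviation) under the same policy.
mean_reward, std_reward = gw.average_reward(
    n_trajectories=20, trajectory_length=15, policy=gw.optimal_policy
)
```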
- int_to_point(i)¶
Convert a state int into the corresponding coordinate.
i: State int. -> (x, y) int tuple.
- neighbouring(i, k)¶
Get whether two points neighbour each other. Also returns true if they are the same point.
i: (x, y) int tuple. k: (x, y) int tuple. -> bool.
- optimal_policy(state_int)¶
The optimal policy for this gridworld.
state_int: What state we are in. int. -> Action int.
- optimal_policy_deterministic(state_int)¶
Deterministic version of the optimal policy for this gridworld.
state_int: What state we are in. int. -> Action int.
- point_to_int(p)¶
Convert a coordinate into the corresponding state int.
p: (x, y) tuple. -> State int.
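The two conversion helpers are inverses of one another; a quick sanity-check sketch (the grid size is again an arbitrary example value):

```python
# Round-trip between state ints and (x, y) coordinates.
from linear_func_approx import Gridworld

gw = Gridworld(grid_size=5, wind=0.0, discount=0.9)

for i in range(5 * 5):
    p = gw.int_to_point(i)          # state int -> (x, y)
    assert gw.point_to_int(p) == i  # (x, y) -> state int round-trips
    assert gw.neighbouring(p, p)    # a point neighbours itself by convention
```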
- reward(state_int)¶
Reward for being in state state_int.
state_int: State integer. int. -> Reward.
- linear_func_approx.find_policy(n_states, n_actions, transition_probabilities, reward, discount, threshold=0.01, v=None, stochastic=True)¶
Find the optimal policy.
- n_states: Number of states. int.
- n_actions: Number of actions. int.
- transition_probabilities: Function taking (state, action, state) to transition probabilities.
- reward: Vector of rewards for each state.
- discount: MDP discount factor. float.
- threshold: Convergence threshold, default 1e-2. float.
- v: Value function (if known). Default None.
- stochastic: Whether the policy should be stochastic. Default True.
-> Action probabilities for each state, or an action int for each state (depending on stochasticity).
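A sketch of find_policy on a tiny hand-built two-state MDP. The listing describes transition_probabilities as a function of (state, action, state), so the example wraps a NumPy array in exactly that interface; the MDP itself is invented purely for illustration:

```python
import numpy as np
from linear_func_approx import find_policy

n_states, n_actions = 2, 2
# T[s, a, s'] = probability of moving from s to s' under action a (toy values).
T = np.array([
    [[0.9, 0.1], [0.1, 0.9]],
    [[0.8, 0.2], [0.2, 0.8]],
])

def transition_probabilities(s, a, s_next):
    return T[s, a, s_next]

reward = np.array([0.0, 1.0])  # state 1 is rewarding

# Deterministic policy: one action int per state.
policy = find_policy(n_states, n_actions, transition_probabilities,
                     reward, discount=0.9, stochastic=False)

# Stochastic policy: action probabilities per state.
action_probs = find_policy(n_states, n_actions, transition_probabilities,
                           reward, discount=0.9, stochastic=True)
```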
- linear_func_approx.irl(n_states, n_actions, transition_probability, policy, discount, Rmax, l1)¶
Find a reward function with inverse RL as described in Ng & Russell, 2000.
- n_states: Number of states. int.
- n_actions: Number of actions. int.
- transition_probability: NumPy array mapping (state_i, action, state_k) to the probability of transitioning from state_i to state_k under action. Shape (N, A, N).
- policy: Vector mapping state ints to action ints. Shape (N,).
- discount: Discount factor. float.
- Rmax: Maximum reward. float.
- l1: l1 regularisation. float.
-> Reward vector.
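A sketch of the linear-programming IRL call on the same kind of toy MDP; all numbers are illustrative, and the observed policy simply prefers the action that tends toward state 1:

```python
import numpy as np
from linear_func_approx import irl

n_states, n_actions = 2, 2
# transition_probability[s, a, s'] with shape (N, A, N), as documented.
transition_probability = np.array([
    [[0.9, 0.1], [0.1, 0.9]],
    [[0.8, 0.2], [0.2, 0.8]],
])

# Observed (expert) policy: action int per state, shape (N,).
policy = np.array([1, 1])

recovered_reward = irl(
    n_states, n_actions, transition_probability, policy,
    discount=0.9, Rmax=1.0, l1=0.1,
)
print(recovered_reward)  # one recovered reward per state
```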
- linear_func_approx.large_irl(value, transition_probability, feature_matrix, n_states, n_actions, policy)¶
Find the reward in a large state space.
- value: NumPy matrix for the value function. The (i, j)th component represents the value of the jth state under the ith basis function.
- transition_probability: NumPy array mapping (state_i, action, state_k) to the probability of transitioning from state_i to state_k under action. Shape (N, A, N).
- feature_matrix: Matrix with the nth row representing the nth state. NumPy array with shape (N, D), where N is the number of states and D is the dimensionality of the state.
- n_states: Number of states sampled. int.
- n_actions: Number of actions. int.
- policy: NumPy array mapping state ints to action ints.
-> Reward for each state in states.
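A sketch of large_irl on the toy MDP. The listing only documents the shape and meaning of the value matrix, not how to build it; one plausible construction, used below, is to evaluate the observed policy under each column of the feature matrix treated as a basis reward, via the module's value() function:

```python
import numpy as np
from linear_func_approx import large_irl, value

n_states, n_actions, d = 2, 2, 2
transition_probability = np.array([
    [[0.9, 0.1], [0.1, 0.9]],
    [[0.8, 0.2], [0.2, 0.8]],
])
feature_matrix = np.eye(n_states)   # (N, D); one-hot features, so D = N here
policy = np.array([1, 1])           # observed action int per state
discount = 0.9

def transition_probabilities(s, a, s_next):
    return transition_probability[s, a, s_next]

# value_matrix[i, j] = value of state j when column i of the feature matrix
# is treated as the basis reward (an assumed construction, see note above).
value_matrix = np.array([
    value(policy, n_states, transition_probabilities, feature_matrix[:, i], discount)
    for i in range(d)
])

reward = large_irl(value_matrix, transition_probability, feature_matrix,
                   n_states, n_actions, policy)
```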
- linear_func_approx.large_network_test(grid_size, discount)¶
Run large state space linear programming inverse reinforcement learning on the gridworld MDP.
Plots the reward function.
- grid_size: Grid size. int.
- discount: MDP discount factor. float.
- linear_func_approx.optimal_value(n_states, n_actions, transition_probabilities, reward, discount, threshold=0.01)¶
Find the optimal value function.
- n_states: Number of states. int.
- n_actions: Number of actions. int.
- transition_probabilities: Function taking (state, action, state) to transition probabilities.
- reward: Vector of rewards for each state.
- discount: MDP discount factor. float.
- threshold: Convergence threshold, default 1e-2. float.
-> Array of values for each state.
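A sketch of optimal_value on the toy MDP used above; the resulting values can also be handed to find_policy through its v argument instead of being recomputed:

```python
import numpy as np
from linear_func_approx import optimal_value, find_policy

n_states, n_actions = 2, 2
T = np.array([
    [[0.9, 0.1], [0.1, 0.9]],
    [[0.8, 0.2], [0.2, 0.8]],
])

def transition_probabilities(s, a, s_next):
    return T[s, a, s_next]

reward = np.array([0.0, 1.0])

v = optimal_value(n_states, n_actions, transition_probabilities, reward,
                  discount=0.9, threshold=1e-3)

# Reuse the precomputed values instead of letting find_policy recompute them.
policy = find_policy(n_states, n_actions, transition_probabilities, reward,
                     discount=0.9, v=v, stochastic=False)
```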
- linear_func_approx.small_network_test(grid_size, discount)¶
Run linear programming inverse reinforcement learning on the gridworld MDP. Plots the reward function.
- grid_size: Grid size. int.
- discount: MDP discount factor. float.
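Both demo entry points only need a grid size and a discount factor; a minimal invocation sketch (the values are arbitrary, and each call plots the recovered reward function):

```python
from linear_func_approx import small_network_test, large_network_test

# Exact (small state space) LP IRL demo on a 5x5 gridworld.
small_network_test(grid_size=5, discount=0.9)

# Large state space variant using the linear function approximator.
large_network_test(grid_size=5, discount=0.9)
```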
- linear_func_approx.v_tensor(value, transition_probability, feature_dimension, n_states, n_actions, policy)¶
Finds the v tensor used in large linear IRL.
- value: NumPy matrix for the value function. The (i, j)th component represents the value of the jth state under the ith basis function.
- transition_probability: NumPy array mapping (state_i, action, state_k) to the probability of transitioning from state_i to state_k under action. Shape (N, A, N).
- feature_dimension: Dimension of the feature matrix. int.
- n_states: Number of states sampled. int.
- n_actions: Number of actions. int.
- policy: NumPy array mapping state ints to action ints.
-> v helper tensor.
- linear_func_approx.value(policy, n_states, transition_probabilities, reward, discount, threshold=0.01)¶
Find the value function associated with a policy.
- policy: List of action ints for each state.
- n_states: Number of states. int.
- transition_probabilities: Function taking (state, action, state) to transition probabilities.
- reward: Vector of rewards for each state.
- discount: MDP discount factor. float.
- threshold: Convergence threshold, default 1e-2. float.
-> Array of values for each state.
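Finally, a sketch of policy evaluation with value() on the same toy MDP, compared against the optimal values:

```python
import numpy as np
from linear_func_approx import value, optimal_value

n_states, n_actions = 2, 2
T = np.array([
    [[0.9, 0.1], [0.1, 0.9]],
    [[0.8, 0.2], [0.2, 0.8]],
])

def transition_probabilities(s, a, s_next):
    return T[s, a, s_next]

reward = np.array([0.0, 1.0])
policy = [1, 1]  # action int for each state

v_policy = value(policy, n_states, transition_probabilities, reward, 0.9)
v_star = optimal_value(n_states, n_actions, transition_probabilities, reward, 0.9)

# Any policy's value is bounded above by the optimal value
# (up to the convergence threshold).
print(v_policy, v_star)
```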