Max Margin IRL using a linear function approximator

class linear_func_approx.Gridworld(grid_size, wind, discount)

Bases: object

Gridworld MDP.
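
A minimal construction sketch (the argument values below are illustrative, not documented defaults):

    import linear_func_approx as lfa

    # Build a small gridworld MDP; `wind` is assumed here to control how
    # noisy the transitions are.
    gw = lfa.Gridworld(grid_size=5, wind=0.3, discount=0.9)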

average_reward(n_trajectories, trajectory_length, policy)

Calculate the average total reward obtained by following a given policy over n_trajectories trajectories.

policy: Map from state integers to action integers.
n_trajectories: Number of trajectories. int.
trajectory_length: Length of an episode. int.
-> Average reward, standard deviation.
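
For example, the gridworld's built-in optimal policy (see optimal_policy below) could be scored as in this sketch; passing the bound method as the state-to-action map is an assumption about the accepted callable form:

    # Continuing from the Gridworld construction sketch above.
    mean_reward, std_reward = gw.average_reward(n_trajectories=100,
                                                trajectory_length=20,
                                                policy=gw.optimal_policy)
    print(mean_reward, std_reward)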

feature_matrix(feature_map='ident')

Get the feature matrix for this gridworld.

feature_map: Which feature map to use (default ident). String in {ident, coord, proxi}.
-> NumPy array with shape (n_states, d_states).
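
A usage sketch; the comments on what each map produces are assumptions based on the names:

    ident = gw.feature_matrix()                     # default: identity-style features
    coord = gw.feature_matrix(feature_map='coord')  # coordinate-based features
    proxi = gw.feature_matrix(feature_map='proxi')  # proximity-based features
    print(ident.shape)                              # (n_states, d_states)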

feature_vector(i, feature_map='ident')

Get the feature vector associated with a state integer.

i: State int.
feature_map: Which feature map to use (default ident). String in {ident, coord, proxi}.
-> Feature vector.

generate_trajectories(n_trajectories, trajectory_length, policy, random_start=False)

Generate n_trajectories trajectories with length trajectory_length, following the given policy.

n_trajectories: Number of trajectories. int.
trajectory_length: Length of an episode. int.
policy: Map from state integers to action integers.
random_start: Whether to start randomly (default False). bool.
-> [[(state int, action int, reward float)]]
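
A sketch of rolling out the gridworld's optimal policy (again assuming a bound method is an acceptable state-to-action map):

    trajectories = gw.generate_trajectories(n_trajectories=20,
                                            trajectory_length=15,
                                            policy=gw.optimal_policy,
                                            random_start=True)
    # Each trajectory is a sequence of (state int, action int, reward float).
    state, action, reward = trajectories[0][0]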

int_to_point(i)

Convert a state int into the corresponding coordinate.

i: State int.
-> (x, y) int tuple.

neighbouring(i, k)

Get whether two points neighbour each other. Also returns True if they are the same point.

i: (x, y) int tuple.
k: (x, y) int tuple.
-> bool.

optimal_policy(state_int)

The optimal policy for this gridworld.

state_int: What state we are in. int.
-> Action int.

optimal_policy_deterministic(state_int)

Deterministic version of the optimal policy for this gridworld.

state_int: What state we are in. int.
-> Action int.

point_to_int(p)

Convert a coordinate into the corresponding state int.

p: (x, y) tuple.
-> State int.
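
int_to_point and point_to_int are inverses of one another, e.g.:

    p = gw.int_to_point(7)          # some (x, y) coordinate on the 5x5 grid
    assert gw.point_to_int(p) == 7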

reward(state_int)

Reward for being in state state_int.

state_int: State integer. int.
-> Reward.

linear_func_approx.find_policy(n_states, n_actions, transition_probabilities, reward, discount, threshold=0.01, v=None, stochastic=True)

Find the optimal policy.

n_states: Number of states. int.
n_actions: Number of actions. int.
transition_probabilities: Function taking (state, action, state) to transition probabilities.
reward: Vector of rewards for each state.
discount: MDP discount factor. float.
threshold: Convergence threshold, default 1e-2. float.
v: Value function (if known). Default None.
stochastic: Whether the policy should be stochastic. Default True.
-> Action probabilities for each state, or an action int for each state (depending on stochasticity).
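
A sketch on a hand-built two-state MDP. It assumes a NumPy array indexed as (state, action, state) is an acceptable stand-in for the transition-probability function:

    import numpy as np
    import linear_func_approx as lfa

    # T[s, a, s'] = probability of moving from s to s' under action a.
    T = np.array([[[0.9, 0.1],
                   [0.1, 0.9]],
                  [[0.8, 0.2],
                   [0.2, 0.8]]])
    r = np.array([0.0, 1.0])        # state 1 is the rewarding state

    greedy = lfa.find_policy(n_states=2, n_actions=2,
                             transition_probabilities=T,
                             reward=r, discount=0.9,
                             stochastic=False)
    print(greedy)                   # one action int per state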

linear_func_approx.irl(n_states, n_actions, transition_probability, policy, discount, Rmax, l1)

Find a reward function with inverse RL as described in Ng & Russell, 2000.

n_states: Number of states. int.
n_actions: Number of actions. int.
transition_probability: NumPy array mapping (state_i, action, state_k) to the probability of transitioning from state_i to state_k under action. Shape (N, A, N).
policy: Vector mapping state ints to action ints. Shape (N,).
discount: Discount factor. float.
Rmax: Maximum reward. float.
l1: l1 regularisation. float.
-> Reward vector.
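
A sketch of the Ng & Russell pipeline on the gridworld built earlier. It assumes the Gridworld instance exposes a transition_probability array of shape (N, A, N) and that the grid has four actions; neither is part of the documented interface above, and the Rmax and l1 values are illustrative:

    import numpy as np

    n_states = 5 ** 2      # the 5x5 gridworld constructed earlier
    n_actions = 4          # assumption: up/down/left/right
    policy = np.array([gw.optimal_policy_deterministic(s)
                       for s in range(n_states)])

    recovered_reward = lfa.irl(n_states=n_states, n_actions=n_actions,
                               transition_probability=gw.transition_probability,
                               policy=policy, discount=0.9,
                               Rmax=1.0, l1=1.1)
    print(recovered_reward.shape)   # one recovered reward per state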

linear_func_approx.large_irl(value, transition_probability, feature_matrix, n_states, n_actions, policy)

Find the reward in a large state space.

value: NumPy matrix for the value function. The (i, j)th component represents the value of the jth state under the ith basis function.
transition_probability: NumPy array mapping (state_i, action, state_k) to the probability of transitioning from state_i to state_k under action. Shape (N, A, N).
feature_matrix: Matrix with the nth row representing the nth state. NumPy array with shape (N, D), where N is the number of states and D is the dimensionality of the state.
n_states: Number of states sampled. int.
n_actions: Number of actions. int.
policy: NumPy array mapping state ints to action ints.
-> Reward for each state in states.
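
The value argument must be assembled by the caller. One plausible construction (an assumption, not something this module prescribes) is to evaluate the demonstrated policy with each feature dimension treated as the reward, via value() documented below; this sketch reuses names from the irl example above:

    import numpy as np

    phi = gw.feature_matrix()                    # shape (N, D)
    # Value of every state under each basis (feature) function: shape (D, N).
    basis_values = np.array([
        lfa.value(policy, n_states, gw.transition_probability, phi[:, d], 0.9)
        for d in range(phi.shape[1])
    ])
    r_large = lfa.large_irl(basis_values, gw.transition_probability,
                            phi, n_states, n_actions, policy)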

linear_func_approx.large_network_test(grid_size, discount)

Run large state space linear programming inverse reinforcement learning on the gridworld MDP.

Plots the reward function.

grid_size: Grid size. int.
discount: MDP discount factor. float.

linear_func_approx.optimal_value(n_states, n_actions, transition_probabilities, reward, discount, threshold=0.01)

Find the optimal value function.

n_states: Number of states. int.
n_actions: Number of actions. int.
transition_probabilities: Function taking (state, action, state) to transition probabilities.
reward: Vector of rewards for each state.
discount: MDP discount factor. float.
threshold: Convergence threshold, default 1e-2. float.
-> Array of values for each state.
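
For instance, on the hand-built two-state MDP from the find_policy sketch above:

    v_star = lfa.optimal_value(n_states=2, n_actions=2,
                               transition_probabilities=T,
                               reward=r, discount=0.9)
    print(v_star)                   # one optimal value per state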

linear_func_approx.small_network_test(grid_size, discount)

Run linear programming inverse reinforcement learning on the gridworld MDP.

Plots the reward function.

grid_size: Grid size. int.
discount: MDP discount factor. float.
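
Both test drivers are self-contained entry points, e.g. (illustrative argument values):

    import linear_func_approx as lfa

    lfa.small_network_test(grid_size=5, discount=0.9)
    lfa.large_network_test(grid_size=10, discount=0.9)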

linear_func_approx.v_tensor(value, transition_probability, feature_dimension, n_states, n_actions, policy)

Finds the v tensor used in large linear IRL.

value: NumPy matrix for the value function. The (i, j)th component represents the value of the jth state under the ith basis function.
transition_probability: NumPy array mapping (state_i, action, state_k) to the probability of transitioning from state_i to state_k under action. Shape (N, A, N).
feature_dimension: Dimension of the feature matrix. int.
n_states: Number of states sampled. int.
n_actions: Number of actions. int.
policy: NumPy array mapping state ints to action ints.
-> v helper tensor.

linear_func_approx.value(policy, n_states, transition_probabilities, reward, discount, threshold=0.01)

Find the value function associated with a policy.

policy: List of action ints for each state.
n_states: Number of states. int.
transition_probabilities: Function taking (state, action, state) to transition probabilities.
reward: Vector of rewards for each state.
discount: MDP discount factor. float.
threshold: Convergence threshold, default 1e-2. float.
-> Array of values for each state.
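
A sketch of policy evaluation on the two-state MDP from the find_policy example, again treating the transition array as the transition-probability function:

    fixed_policy = [0, 1]           # one action int per state
    v_pi = lfa.value(fixed_policy, 2, T, r, 0.9)
    print(v_pi)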