for Robot Artificial Intelligence

11. Introduction to Dynamic Programming


Intro to Dynamic Programming

  • Solutions to MDPs
  • Centrepiece of the MDP: the Bellman equation (shown below)
  • If we look carefully, it can be used to solve for V(s) directly
  • |S| equations, |S| unknowns (a linear problem)
  • Many entries (of the transition matrix) will be 0, since transitions s -> s' are sparse
  • Instead, we will use iterative policy evaluation
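
For reference, the Bellman equation for the state-value function of a policy π (standard form, in the notation used later in this post):

  V_π(s) = Σ_a π(a|s) Σ_{s',r} p(s',r|s,a) [ r + γ V_π(s') ]

Treating each V_π(s) as an unknown gives the |S| linear equations in |S| unknowns mentioned above.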

1. Iterative Policy Evaluation

def iterative_policy_evaluation(π):
  initialize V(s) = 0 for all s ∈ S
  while true:
    ∆ = 0
    for each s ∈ S:
      old_v = V(s)
      V(s) = Σ_a π(a|s) Σ_{s',r} p(s',r|s,a)[ r + γ V(s') ]
      ∆ = max(∆, | V(s) - old_v | )
    if ∆ < threshold: break
  return V(s)
  • Technically, it's defined so that V(s) at iteration k+1 is computed from V(s) at iteration k
  • But we can update V(s) "in place", so each update uses the most recently updated values (see the sketch below)
  • This converges faster
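
A minimal runnable sketch of in-place iterative policy evaluation on a toy MDP. The transition table P, the state names, and the uniform policy below are made-up placeholders for illustration, not part of the course code:

# toy 3-state MDP; P[(s, a)] = list of (probability, next_state, reward)
GAMMA = 0.9
THRESHOLD = 1e-4
states = ['s0', 's1', 'terminal']
actions = ['left', 'right']
P = {
  ('s0', 'left'):  [(1.0, 's1', 0.0)],
  ('s0', 'right'): [(1.0, 'terminal', 1.0)],
  ('s1', 'left'):  [(1.0, 's0', 0.0)],
  ('s1', 'right'): [(1.0, 'terminal', -1.0)],
}
# uniformly random policy: π(a|s) = 1/|A(s)|
policy = {s: {a: 1.0 / len(actions) for a in actions} for s in ['s0', 's1']}

V = {s: 0.0 for s in states}
while True:
  delta = 0
  for s in states:
    if s not in policy:
      continue  # terminal state: V stays 0
    old_v = V[s]
    # in-place update: the right-hand side already uses the newest V values
    V[s] = sum(p_a * prob * (r + GAMMA * V[s2])
               for a, p_a in policy[s].items()
               for prob, s2, r in P[(s, a)])
    delta = max(delta, abs(V[s] - old_v))
  if delta < THRESHOLD:
    break
print(V)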

Definition

  • What we just did (finding V(s) given a policy) is called the Prediction Problem
  • Finding the optimal policy is called the Control Problem

2. Designing our RL Program

  • Let’s recap how to do this in supervised learning
class MyModel:
  def fit(self, X, Y):
    pass # our job
  def predict(self, X):
    pass # our job

# boilerplate
Xtrain, Ytrain, Xtest, Ytest = get_data() # 1. Get data
model = MyModel()                         # 2. Instantiate model
model.fit(Xtrain, Ytrain)                 # 3. Train model
model.score(Xtest, Ytest)                 # 4. Evaluate model
  • An RL program is not supervised learning, but there is still a pattern to be followed
  • Same as the bandit section: several algorithms, but all have the same interface
  • Only the algorithm is different, not the layout
  • This applies to all the RL algorithms
  • In designing our RL program, there are 2 types of problems:

    1. Prediction Problem: given a policy, find V(s)
    • Goal: find V(s)

      given: policy
      V(s) = initial value
      for t in range(max_iterations):
        states, actions, rewards = play_game(policy)
        update V(s) given (states, actions, rewards)
      print useful info (change in V(s) vs. time, final V(s), policy)

    2. Control Problem: find the optimal policy and the corresponding value function

    • Goal: find the optimal policy
    • Note: the policy may not be explicitly represented

      initialize value function and policy
      for t in range(max_iterations):
        states, actions, rewards = play_game(policy)
        update the value function and policy given (states, actions, rewards) using the algorithm
      print useful info (change in V(s) vs. time, final V(s), final policy)


3. Iterative Policy Evaluation

  • We will look at 2 different policies (both are sketched in code after this list)
  • First: a completely random (uniform) policy
  • Which distribution is relevant?
    • π(a|s)
    • p(s',r|s,a)
  • Answer: π(a|s)
  • For a uniformly random policy, this will be 1/|A(s)|
  • A(s) = set of possible actions from state s
  • p(s',r|s,a) is only relevant when state transitions are random

  • Second: the policy we'll look at is completely deterministic
  • From the start position, we go directly to the goal
  • Otherwise, we go directly to the losing state
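
A small sketch of the two policy representations (the concrete grid coordinates below are placeholders for illustration, not taken from the course code):

ACTIONS = ('U', 'D', 'L', 'R')

# 1) uniformly random policy: π(a|s) = 1/|A(s)| for every action
def uniform_random_policy(states):
  return {s: {a: 1.0 / len(ACTIONS) for a in ACTIONS} for s in states}

# 2) deterministic policy: exactly one action per state
deterministic_policy = {
  (2, 0): 'U',
  (1, 0): 'U',
  (0, 0): 'R',
  (0, 1): 'R',
  (0, 2): 'R',  # this last step reaches the goal
}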

4. Policy Improvement

  • The Control Problem
  • How to find better policies -> the optimal policy
  • What we know so far: how to find V/Q given a fixed policy
  • Using the current policy, we simply get the state-value function
  • Can we change just this one action for s? i.e. choose a != π(s)
  • Yes
  • We have a finite set of actions, so just go through each one until we get a better Q
  • This looks complicated, but it's very simple
  • All it's saying is: if the policy for state s is currently to go "up",
  • just check "left", "right", and "down" to see if we can get a bigger Q
  • If so, change our policy for state s to the new action
  • Formally, we're finding a new policy π' that gives us a bigger value than we had before (both forms are written out after this list):
  • If we have Q: pick the action that maximizes Q_π(s, a)
  • If we have V: pick the action that maximizes the one-step lookahead over p(s',r|s,a) and V(s')
  • Notice: it's greedy
  • We never consider the value function at all states globally
  • We only look at the current state s
  • Notice: it uses an imperfect version of $V_π$(s)
  • Once we change π, $V_π$(s) also changes
  • When are we finished changing the policy? When the policy no longer changes after a policy improvement step; at that point V(s) also no longer changes
  • The inequality is strict ("<") while we are still improving, and becomes "=" when we are finished
  • Once we have found the optimal policy, the value function stays the same in every state
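
Written out (standard greedy policy improvement, in the notation above):

  if we have Q:  π'(s) = argmax_a Q_π(s, a)
  if we have V:  π'(s) = argmax_a Σ_{s',r} p(s',r|s,a) [ r + γ V_π(s') ]

The policy improvement theorem guarantees V_π'(s) ≥ V_π(s) for all s, with equality everywhere only once π is already optimal.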

5. Policy Iteration

  • This is what we use to find the optimal policy
  • The problem we encountered in the last lecture: when we change the policy, the value function becomes out of date
  • How do we fix the out-of-date value function?
  • Simply recalculate it given the new policy
  • We already know how to find V given π:
    • iterative policy evaluation
  • High-level: alternate between policy evaluation and policy improvement
  • Keep doing this until the policy doesn't change
  • We don't need to check the value function for convergence, because once the policy becomes constant, so will the value function

Steps of Policy Iteration

  1. Randomly initialize V(s) and the policy π(s)
  2. V(s) = iterative_policy_evaluation(π)
  3. Policy improvement:
policy_changed = False
for s in all_states:
  old_a = policy(s)
  policy(s) = argmax[a]{ sum[s',r]{ p(s',r|s,a) * [ r + gamma*V(s') ] } }
  if policy(s) != old_a: policy_changed = True
if policy_changed: go back to step 2

6. Example (Windy Gridworld)

  • Recall, we have 2 probability distributions to deal with:
    • π(a|s)
    • p(s',r|s,a)
  • So far, p(s',r|s,a) has been deterministic
  • In Windy Gridworld, it's not
  • Imagine walking down a windy street: we try to walk straight, but the wind pushes us backward or to the side
  • The same thing happens in Windy Gridworld
  • If the agent tries to go "up", it will do so with probability 0.5
  • But it can also go "left", "down", or "right", each with probability 0.5/3 (a sampling sketch follows below)
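
A minimal sketch of how this transition noise can be sampled; it mirrors the 0.5 and 0.5/3 probabilities used in the course code below, but the helper function itself is just an illustration:

import numpy as np

ALL_POSSIBLE_ACTIONS = ('U', 'D', 'L', 'R')

def windy_action(intended_action):
  # with probability 0.5 the intended action happens;
  # otherwise one of the other 3 actions is chosen uniformly (0.5/3 each)
  if np.random.random() < 0.5:
    return intended_action
  others = [a for a in ALL_POSSIBLE_ACTIONS if a != intended_action]
  return np.random.choice(others)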

code

# https://deeplearningcourses.com/c/artificial-intelligence-reinforcement-learning-in-python
# https://www.udemy.com/artificial-intelligence-reinforcement-learning-in-python
from __future__ import print_function, division
from builtins import range
# Note: you may need to update your version of future
# sudo pip install -U future


import numpy as np
from grid_world import standard_grid, negative_grid
from iterative_policy_evaluation import print_values, print_policy

SMALL_ENOUGH = 1e-3
GAMMA = 0.9
ALL_POSSIBLE_ACTIONS = ('U', 'D', 'L', 'R')

# next state and reward will now have some randomness
# you'll go in your desired direction with probability 0.5
# you'll go in a random direction a' != a with probability 0.5/3

if __name__ == '__main__':
  # this grid gives you a reward of -1.0 for every non-terminal state
  # we want to see if this will encourage finding a shorter path to the goal
  grid = negative_grid(step_cost=-1.0)
  # what is the step_cost? for example, if we are three steps away from the goal,
  # we get minus three reward before we even have a chance to get to the goal
  # if we are only one step away from the losing state, then we only get minus one reward
  # grid = negative_grid(step_cost=-0.1)
  # grid = standard_grid()

  # print rewards
  print("rewards:")
  print_values(grid.rewards, grid)

  # state -> action
  # we'll randomly choose an action and update as we learn
  policy = {}
  for s in grid.actions.keys():
    policy[s] = np.random.choice(ALL_POSSIBLE_ACTIONS)

  # initial policy
  print("initial policy:")
  print_policy(policy, grid)

  # initialize V(s)
  V = {}
  states = grid.all_states()
  for s in states:
    # V[s] = 0
    if s in grid.actions:
      V[s] = np.random.random()
    else:
      # terminal state
      V[s] = 0

  # repeat until convergence - will break out when policy does not change
  while True:

    # policy evaluation step - we already know how to do this!
    while True:
      biggest_change = 0
      for s in states:
        old_v = V[s]
#         print(old_v)

        # V(s) only has value if it's not a terminal state
        new_v = 0
        if s in policy:
          for a in ALL_POSSIBLE_ACTIONS:
            if a == policy[s]:
              p = 0.5
            else:
              p = 0.5/3
            grid.set_state(s)
            r = grid.move(a)
            new_v += p*(r + GAMMA * V[grid.current_state()])
          V[s] = new_v
          biggest_change = max(biggest_change, np.abs(old_v - V[s]))

      if biggest_change < SMALL_ENOUGH:
        break

    # policy improvement step
    is_policy_converged = True
    for s in states:
      if s in policy:
        old_a = policy[s]
        new_a = None
        best_value = float('-inf')
        # loop through all possible actions to find the best current action
        for a in ALL_POSSIBLE_ACTIONS: # chosen action
          v = 0
          for a2 in ALL_POSSIBLE_ACTIONS: # resulting action
            if a == a2:
              p = 0.5
            else:
              p = 0.5/3
            grid.set_state(s)
            r = grid.move(a2)
            v += p*(r + GAMMA * V[grid.current_state()])
          if v > best_value:
            best_value = v
            new_a = a
        policy[s] = new_a
        if new_a != old_a:
          is_policy_converged = False

    if is_policy_converged:
      break

  print("values:")
  print_values(V, grid)
  print("policy:")
  print_policy(policy, grid)
  # result: every move is as bad as losing, so lose as quickly as possible

7. Value Iteration

  • An alternative technique for solving the Control Problem, called value iteration
  • Previous technique: policy iteration
  • Disadvantage of policy iteration:
    • it is an iterative algorithm
      • inside another iterative algorithm
  • In policy iteration, the policy evaluation step only ends once V converges
  • Is there a point before V converges, s.t. the resulting greedy policy wouldn't change?
  • Yes
  • Therefore, we don't need to wait for policy evaluation to finish: just do a few steps
  • because the policy improvement step will find the same policy anyway
  • Value iteration takes this one step further
  • It combines policy evaluation and policy improvement into one step (see the update below)
  • What is the difference? Taking the max over all possible actions
  • Iterative, but we don't need to wait for the (k)th iteration of V to finish before calculating the (k+1)th
  • Just update it "in place" as before
  • Since policy improvement uses the argmax, by taking the max we're just doing the next policy evaluation step without calculating the policy explicitly
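
The combined update is the standard value iteration step (in the notation above):

  V(s) ← max_a Σ_{s',r} p(s',r|s,a) [ r + γ V(s') ]

Once V converges, the optimal policy is recovered by taking the argmax of the same expression.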

8. Summary

  • Last section: defined the Markov Decision Process
  • This section: one method for finding solutions to MDPs: Dynamic Programming
  • Prediction Problem: iterative policy evaluation
  • Control Problem: policy iteration, value iteration
    • finding the optimal policy and the optimal value function

Asynchronous Dynamic Programming

  • Every DP algorithm we have looked at involves looping through the entire set of states
  • Recall that for many realistic games, the state space is ridiculously large
  • Thus, even one iteration can take a long time
  • One way to speed up: update V(s) "in place"
  • We can take that one step further: asynchronous dynamic programming
  • Instead of looping through all states, loop through a few, or only one
  • Choose based on which states are most visited
    • this can be learned by playing the game

Generalized Policy Iteration

  • Main concept behind policy iteration: we iteratively alternate between 2 steps - policy evaluation and policy improvement
  • We converge when the Bellman equation becomes true (i.e. V(s) = right-hand side)

Efficiency of DP

  • Consider how long it would take to do a brute-force search
  • # states = N, # actions = M
  • If we assume we can go from the start state to the goal state in O(N) steps, then we want to explore action sequences of length O(N)
  • M x M x ... x M
  • We did this exercise earlier in the tic-tac-toe section
  • The # of possible action sequences is O(M^N) (a quick worked example follows below)

  • Once we generate all possible sequences, do policy evaluation on all of them to find the best V(s)
  • Exponential in the # of states
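  • As a rough illustration (numbers chosen arbitrarily): with N = 16 states and M = 4 actions, brute force enumerates on the order of 4^16 ≈ 4.3 billion action sequences, while a single DP sweep only touches N x M = 64 state-action pairs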

Model-based vs Model-free

  • Notice how DP requires a full model of the environment
  • In particular, p(s',r|s,a)
  • In the real world, these probabilities may be hard to measure, especially if |S| is large
  • The next sections will look at methods which don't require such a model - called model-free methods
  • Also notice that these iterative methods require an initial estimate
  • We make estimates from other estimates (V(s) and the policy)
  • Making estimates from other estimates is called bootstrapping
  • Monte Carlo (MC) does not require bootstrapping
  • Temporal Difference (TD) learning does

Reference:

Artificial Intelligence Reinforcement Learning

Advanced AI: Deep Reinforcement Learning

Cutting-Edge Deep Reinforcement Learning
