24. TD Lambda
08 Oct 2019 | Reinforcement Learning
TD(λ)
- Generalize N-step method
- λ is associated with something called the ["eligibility trace"](https://sumniya.tistory.com/14)
- The idea of the eligibility trace: among the states visited in the past, we determine which ones influenced the reward we are receiving now, and we distribute that reward back to those states. The measure of how much influence a state had is called credit, and two heuristics are used to assign this credit: 1) Frequency heuristic (how often was the state visited?) 2) Recency heuristic (how recently was it visited?)
- N-step method code is complicated - it's not trivial to keep track of all the rewards (and then flush them once the episode is over)
- TD(λ) gives us a more elegant way to trade off between TD(0) and MC (both extreme targets are restated after this list)
- can update after just 1 step
- λ = 0 gives us TD(0), λ=1 gives us Monte Carlo
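For concreteness (restating standard definitions that are not written out in these notes), the two extremes being traded off are the one-step TD(0) target and the full Monte Carlo return:

```latex
% TD(0) target: one real reward, then bootstrap from the current value estimate
G^{(1)}_t = r_{t+1} + \gamma V(s_{t+1})

% Monte Carlo target: all real rewards until the episode terminates at time T
G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots + \gamma^{T-t-1} r_{T}
```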
N-step method
- Recall the N-step return (reconstructed below)
- strange idea: combine these returns, e.g. average two of them (see below)
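A reconstruction of the recalled equations in standard notation; the 50/50 mix is just an illustrative example of a combination whose weights sum to 1:

```latex
% N-step return: n real rewards, then bootstrap from V
G^{(n)}_t = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-1} r_{t+n} + \gamma^{n} V(s_{t+n})

% "strange idea": any convex combination is still a valid target, e.g.
G^{\text{mix}}_t = \tfrac{1}{2} G^{(1)}_t + \tfrac{1}{2} G^{(2)}_t
```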
TD(λ)
- let’s take this strangeness to the next level - let’s combine more G’s, even an infinite number of G’s
- the coefficients on the G's must sum to 1 so that the result is on the same scale as the individual G's
- in TD(λ), we make the coefficients decrease geometrically: 1, λ, λ², …
- Problem: these don't sum to 1
- we are most interested in the case when n → ∞
- we know from calculus that the geometric series sums to $\sum_{n=1}^{\infty} \lambda^{n-1} = \frac{1}{1-\lambda}$ (for 0 ≤ λ < 1)
- therefore, we should scale the sum by this amount, i.e. multiply by (1 − λ)
- we call this the **λ-return**
- we assume our episode will end at some point (say, time step T)
- once n ≥ T − t, the episode has already ended, so any N-step return beyond this point equals the full MC return, G(t)
- separate the partial N-step returns from the full returns (the algebra is reconstructed below)
- manipulate 2nd term to simplify the sum
- simplify further
- when λ=0, we get the TD(0) return since $0^0$ = 1
- when λ=1, we get the full return (MC)
- any λ between 0 and 1 gives us a combination of individual returns (with geometrically decreasing weights)
- isn’t this much more computational effort than both MC and N-step methods?
- yes, we would need to calculate every N-step return $G^{(1)}(t), G^{(2)}(t), \dots$ up to the full return G(t)
- lots of work (benefit is not clear)
- the algorithm we're about to learn approximates the true λ-return (computing it exactly would be computationally infeasible)
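Since the slide equations did not survive into these notes, here is a standard reconstruction of the steps listed above:

```latex
% geometric coefficients 1, \lambda, \lambda^2, \dots sum to 1/(1-\lambda),
% so scale by (1-\lambda) to get weights that sum to 1 -- the \lambda-return:
G^{\lambda}_t = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G^{(n)}_t

% if the episode ends at time T, every N-step return with n \ge T-t equals
% the full MC return G_t, so the sum splits and simplifies to:
G^{\lambda}_t = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G^{(n)}_t
              + \lambda^{T-t-1} G_t

% \lambda = 0 recovers the TD(0) target G^{(1)}_t; \lambda = 1 recovers the MC return G_t
```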
TD(0)
- let’s go back to TD(0) for a moment
- we call (target − prediction) the TD error
- the parameter update is still gradient descent
- the eligibility trace is a vector the same size as the parameter vector (see the equations below)
- the eligibility trace vector keeps track of old gradients, much like momentum in deep learning
- λ tells us "how much" of the past we want to keep
- $e_0 = 0, \quad e_t = \nabla_\theta V(s_t) + \gamma \lambda e_{t-1}$
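Writing the bullets above as equations (standard semi-gradient TD(0) plus an accumulating trace; θ is the parameter vector and α the learning rate, symbols not spelled out in the original notes):

```latex
% TD error: target minus prediction
\delta_t = r_{t+1} + \gamma V(s_{t+1}; \theta) - V(s_t; \theta)

% plain TD(0) parameter update (gradient-descent style)
\theta \leftarrow \theta + \alpha \, \delta_t \, \nabla_\theta V(s_t; \theta)

% accumulating eligibility trace: the added gradient captures the frequency
% heuristic, the \gamma\lambda decay captures the recency heuristic
e_0 = 0, \qquad e_t = \nabla_\theta V(s_t; \theta) + \gamma \lambda \, e_{t-1}
```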
back to TD(λ)
- redefine the parameter update to use e instead of only the gradient: $\theta \leftarrow \theta + \alpha \delta_t e_t$
- recall how momentum works
- update only depends on next state
- we can do updates in 1 step rather than waiting N steps
- but still, just an approximation to using true λ-return
- N-step method and true λ-return require waiting for future rewards
- we call this the “forward view”
- in TD(λ), we update the current param based on past errors
- we call this the "backward view" (see the sketch below)
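To make the backward view concrete before the course code below, here is a minimal sketch for the prediction case (state values with linear function approximation). `env`, `policy`, and the feature map `phi` are assumed placeholders, not part of the original notes:

```python
import numpy as np

def td_lambda_prediction(env, policy, phi, n_features,
                         alpha=0.01, gamma=0.99, lambda_=0.7, n_episodes=100):
    """Backward-view TD(lambda) for V(s) = phi(s).dot(theta) (sketch only)."""
    theta = np.zeros(n_features)
    for _ in range(n_episodes):
        s = env.reset()
        e = np.zeros(n_features)          # eligibility trace, reset each episode
        done = False
        while not done:
            s2, r, done, _ = env.step(policy(s))
            # TD error = target - prediction (terminal state has value 0)
            v_s2 = 0.0 if done else phi(s2).dot(theta)
            delta = r + gamma * v_s2 - phi(s).dot(theta)
            # decay old gradients, add the current one (gradient of a linear model is phi(s))
            e = gamma * lambda_ * e + phi(s)
            # update with the trace instead of just the current gradient
            theta += alpha * delta * e
            s = s2
    return theta
```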
Code
# https://deeplearningcourses.com/c/deep-reinforcement-learning-in-python
# https://www.udemy.com/deep-reinforcement-learning-in-python
from __future__ import print_function, division
from builtins import range
# Note: you may need to update your version of future
# sudo pip install -U future
#
# Note: gym changed from version 0.7.3 to 0.8.0
# MountainCar episode length is capped at 200 in later versions.
# This means your agent can't learn as much in the earlier episodes
# since they are no longer as long.
#
# Adapt Q-Learning script to use TD(lambda) method instead
import gym
import os
import sys
import numpy as np
import matplotlib.pyplot as plt
from gym import wrappers
from datetime import datetime
# code we already wrote
from q_learning import plot_cost_to_go, FeatureTransformer, plot_running_avg
Base model
class BaseModel:
    # linear model whose weights are updated with an eligibility trace (see partial_fit)
    def __init__(self, D):
        self.w = np.random.randn(D) / np.sqrt(D)

    def partial_fit(self, input_, target, eligibility, lr=1e-2):
        self.w += lr*(target - input_.dot(self.w))*eligibility

    def predict(self, X):
        X = np.array(X)
        return X.dot(self.w)
model
# Holds one BaseModel for each action
class Model:
    def __init__(self, env, feature_transformer):
        self.env = env
        self.models = []
        self.feature_transformer = feature_transformer
        D = feature_transformer.dimensions  # 2000 D
        # one eligibility-trace vector per action
        self.eligibilities = np.zeros((env.action_space.n, D))
        for i in range(env.action_space.n):
            model = BaseModel(D)
            self.models.append(model)

    def predict(self, s):
        X = self.feature_transformer.transform([s])
        assert(len(X.shape) == 2)
        result = np.stack([m.predict(X) for m in self.models]).T
        assert(len(result.shape) == 2)
        return result

    def update(self, s, a, G, gamma, lambda_):
        X = self.feature_transformer.transform([s])
        assert(len(X.shape) == 2)
        # decay every trace by gamma*lambda, then add the current features
        # to the trace of the action that was taken
        self.eligibilities *= gamma*lambda_
        self.eligibilities[a] += X[0]
        self.models[a].partial_fit(X[0], G, self.eligibilities[a])

    def sample_action(self, s, eps):
        if np.random.random() < eps:
            return self.env.action_space.sample()
        else:
            return np.argmax(self.predict(s))
play
# plays one episode and returns the total reward
def play_one(model, env, eps, gamma, lambda_):
    observation = env.reset()
    done = False
    totalreward = 0
    iters = 0
    # while not done and iters < 200:
    while not done and iters < 10000:
        action = model.sample_action(observation, eps)
        prev_observation = observation
        observation, reward, done, info = env.step(action)

        # update the model
        next = model.predict(observation)
        assert(next.shape == (1, env.action_space.n))
        G = reward + gamma*np.max(next[0])
        model.update(prev_observation, action, G, gamma, lambda_)

        totalreward += reward
        iters += 1

    return totalreward
main
if __name__ == '__main__':
    env = gym.make('MountainCar-v0')
    ft = FeatureTransformer(env)
    model = Model(env, ft)
    gamma = 0.9999
    lambda_ = 0.7

    if 'monitor' in sys.argv:
        filename = os.path.basename(__file__).split('.')[0]
        monitor_dir = './' + filename + '_' + str(datetime.now())
        env = wrappers.Monitor(env, monitor_dir)

    N = 300
    totalrewards = np.empty(N)
    costs = np.empty(N)
    for n in range(N):
        # eps = 1.0/(0.1*n+1)
        eps = 0.1*(0.97**n)
        # eps = 0.5/np.sqrt(n+1)
        totalreward = play_one(model, env, eps, gamma, lambda_)
        totalrewards[n] = totalreward
        # print("episode:", n, "total reward:", totalreward)
    print("avg reward for last 100 episodes:", totalrewards[-100:].mean())
    print("total steps:", -totalrewards.sum())

    plt.plot(totalrewards)
    plt.title("Rewards")
    plt.show()

    plot_running_avg(totalrewards)

    # plot the optimal state-value function
    plot_cost_to_go(env, model)
TD Lambda summary
- we learned techniques that can be applied anywhere we would use Q-learning or SARSA
- general tools that we can test in a variety of situations
- “technically” got more practice with RBF networks too
N-step Methods
- we saw how N-step methods provide a “bridge” between TD(0) and MC
- we saw that we can combine all the N-step returns
- this gives us the λ-return
- we saw that λ=0 gives us TD(0), λ=1 gives us MC
Eligibility Traces
- we looked at using eligibility traces to approximate the true λ-return (since actually calculating it would be a lot of effort)
- we saw that it looks a lot like deep learning momentum
- new tools for our toolbox
- not guaranteed that any particular algorithm will work for our problem, but more tools = more things to try
```
Reference:
Artificial Intelligence Reinforcement Learning
```