13. Temporal Difference (TD) Learning
29 Sep 2019 | Reinforcement Learning
Temporal Difference (TD) Learning
- this section covers a third technique for solving MDPs
- TD = Temporal Difference Learning
- Combines ideas from DP and MC
- Disadvantage of DP: requires full model of environment, never learns from experience
- MC and TD learn from experience
- MC can only update after completing an episode, while DP uses bootstrapping (updating an estimate from other estimates)
- We will see that TD also uses bootstrapping and is fully online: it can update values during an episode
1. TD Learning
- Same approach as before
- First the prediction problem
- then the control problem
- 2 control methods:
- SARSA
- Q-Learning
- Model-Free Reinforcement Learning
- TD methods learn directly from episodes of experience.
- no knowledge of MDP transitions / rewards is required
- TD learns from incomplete episodes by Bootstrapping
- TD updates a guess towards a guess
- Unlike MC, which updates the value function using the actual return received after the episode ends,
- TD learns from the actual reward plus the estimated value of the next step.
- The sum of that reward and the next state's value is called the TD target,
- and the difference between the TD target and the current V(S) is called the TD error (both are written out as equations at the end of this section).
- The left figure shows how the value function is updated in MC (the episode finishes first, and its return is then propagated back step by step).
- The right figure shows TD, where an update happens at every step.
- The advantage of TD is that it learns even in the middle of an episode,
- whereas MC has to wait until the episode ends before any update or learning happens.
- TD can also learn from continuing episodes that never terminate.
- Return $G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T$ is an unbiased estimate of $v_\pi(S_t)$
- TD target $R_{t+1} + \gamma V(S_{t+1})$ is a biased estimate of $v_\pi(S_t)$, because it uses the current estimate $V$ rather than the true $v_\pi$
- TD Target is much lower variance than the return:
- return depends on many random actions, transitions, rewards
- TD Target depends on one random action, transition, reward
- If the true $v_\pi$ were used in the target, the TD target would also be unbiased; but because the TD target uses $V(S_{t+1})$, which only estimates $v_\pi$, it is built from an estimate rather than the true value, so bias appears. On the other hand, the TD target is computed from only a single step, so it picks up little noise and therefore has relatively low variance.
- MC has high variance, zero bias
- good convergence properties
- even with function approximation
- not very sensitive to initial value
- very simple to understand and use
- TD has low variance, some bias
- Usually more efficient than MC
- TD(0) converges to $v_π$(s)
- but not always with function approximation
- More sensitive to initial value
- Comparing the variance of MC and TD:
- Bootstrapping is more efficient for learning.
- MC and TD converge: V(s) -> $v_π$(s) as experience -> ∞ (if episodes are repeated indefinitely, both eventually converge.)
- but what about the batch solution for finite experience?
- e.g., repeatedly sample episode k ∈ [1,K]
- Apply MC or TD(0) to episode k
- For example (a batch of 8 episodes): in the first episode, A gets reward 0 and transitions to B, which also gets reward 0.
- In episodes two through seven, B gets reward 1; in the last episode, B gets reward 0.
- From this data, A transitions to B 100% of the time with reward 0.
- With MC, V(A) = 0, because the only episode passing through A ended with a total return of zero.
- With TD, V(B) is estimated from the six rewards of 1 out of 8 episodes, and since A bootstraps off B, V(A) gets updated as well.
- MC converges to the solution with minimum mean-squared error on the observed returns
- TD(0) converges to the solution of the maximum-likelihood Markov model
- TD(0) converges via the maximum-likelihood Markov model: it computes the values implied by the fitted MDP, so V(A) is updated to the average value 6/8 = 0.75.
- TD exploits Markov property
- Usually more efficient in Markov environment
- MC does not exploit Markov property
- Usually more effective in Non-Markov environment
- MC algorithm: backs up over the full sampled episode
- TD: estimates state values by bootstrapping from the sampled next state
- DP approach
- DP already knows the MDP through its model, so it learns from the values and rewards of all successor states, as shown above.
- Bootstrapping: update involves an estimate
- MC does not bootstrap
- DP bootstraps
- TD bootstraps
- Sampling: the update samples an expectation
- MC samples
- DP does not sample
- TD samples
- The bootstrapping used by DP and TD updates based on estimated values.
- The sampling used by MC updates from samples of the expectation.
- TD also samples, but unlike DP it does not do a full backup.
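To make the quantities above concrete, here are the MC and TD(0) update rules in standard notation, with the TD target and TD error written out ($\alpha$ is the step size); this is a summary sketch rather than an excerpt from the course:

$$
\begin{aligned}
\text{MC:}\quad & V(S_t) \leftarrow V(S_t) + \alpha\,[\,G_t - V(S_t)\,] \\
\text{TD(0):}\quad & V(S_t) \leftarrow V(S_t) + \alpha\,[\,R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\,] \\
\text{TD target:}\quad & R_{t+1} + \gamma V(S_{t+1}) \\
\text{TD error:}\quad & \delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)
\end{aligned}
$$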
2. TD(0) Prediction
- Apply TD to the prediction problem
- the algorithm is called TD(0)
- there are also TD(1) and TD(λ)
- it is related to Q-Learning and approximation methods
MC
- Recall: one disadvantage of MC is that we need to wait until the episode is finished before we can calculate the return
- also recall: there are multiple ways of calculating averages
- General "average-finding" equation
- Does not require us to store all returns
- Constant alpha gives a moving average / exponential decay (a quick sketch of both follows below)
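A minimal sketch of that running-average update (my own illustration, not code from the course): with step size 1/N it reproduces the exact sample mean, while a constant alpha gives an exponentially weighted moving average.

```python
def running_average(samples, alpha=None):
    """V <- V + step * (sample - V); step = 1/N gives the exact mean,
    a constant alpha gives an exponentially weighted moving average."""
    v, n = 0.0, 0
    for g in samples:
        n += 1
        step = (1.0 / n) if alpha is None else alpha
        v += step * (g - v)
    return v

print(running_average([1, 0, 1, 1]))       # ~0.75, the sample mean
print(running_average([1, 0, 1, 1], 0.1))  # moving average, still pulled toward the initial value 0
```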
Value function
TD(0)
- can we just combine these (MC & the value function)?
- instead of sampling the return, use the recursive definition
- this is TD(0)
- why is this fully online? we can update V(s) as soon as we know s' (see the sketch below)
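A minimal TD(0) prediction sketch. It assumes a hypothetical environment with `reset()` and `step(a)` returning `(next_state, reward, done)`, and a fixed `policy(s)`; these interface names are illustrative, not prescribed by the notes.

```python
from collections import defaultdict

def td0_prediction(env, policy, n_episodes=1000, alpha=0.1, gamma=0.9):
    """Estimate V_pi with TD(0): update V(s) as soon as (r, s') is known."""
    V = defaultdict(float)                       # V(s) defaults to 0
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s2, r, done = env.step(a)            # assumed env interface
            target = r + (0.0 if done else gamma * V[s2])   # TD target
            V[s] += alpha * (target - V[s])      # fully online update
            s = s2
    return V
```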
sources of randomness
- In MC, randomness arises because an episode can play out in different ways (due to a stochastic policy, or stochastic state transitions)
- In TD(0), we have another source of randomness
- G is an exact sample in MC
- r + ɣV(s’) is itself an estimate of G
- we are estimating from other estimates(bootstrapping)
Summary
- TD(0) is advantageous in comparison to MC/DP
- Unlike DP, we don’t require a model of the environment, and only update V for states we visit
- unlike MC, we don’t need to wait for an episode to finish
- advantageous for very long episodes
- also applicable to continuing (non-episodic) tasks
3. TD Learning for Control
- Apply TD(0) to Control
- we can probably guess what we’re going to do by now
- use generalized policy iteration: alternate between TD(0) for prediction and policy improvement using greedy action selection
- Use the value iteration approach: update Q in place and improve the policy after every change
- skip the part where we do full policy evaluation
4. SARSA
- Recall from MC: we want to use Q because it’s indexed by a, V is only indexed by s
- Q has the same recursive form
- Same limitation as MC: need lots of samples in order to converge
- requires the 5-tuple (s,a,r,s',a')
- Hence the name
- The difference from TD(0) prediction is that we use the Q function instead of the state-value function V.
- Like MC, it still requires us to know Q(s,a) for all a in order to choose the argmax
- problem: if we follow a deterministic policy, we would only fill in 1/|A| of the values on each episode
- leaving most of Q untouched
- Remedy: exploring starts, or a policy that includes exploration
- use epsilon-greedy (a minimal sketch of epsilon-greedy selection follows)
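A minimal epsilon-greedy selection sketch (an illustrative helper matching the `epsilon_greedy_from` call in the pseudocode below; `Q` is assumed to be a dict such as `defaultdict(float)` keyed by `(s, a)`):

```python
import random

def epsilon_greedy(Q, s, actions, eps):
    """With probability eps take a random action, otherwise the greedy one."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
    # decaying eps (e.g. eps = 1/t) makes the policy greedy in the limit, as discussed below
```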
Pseudocode
Q(s,a) = arbitrary, Q(terminal,a) = 0
for episode = 1..N:
  s = start_state, a = epsilon_greedy_from(Q(s))
  while not game over:
    s', r = do_action(a)
    a' = epsilon_greedy_from(Q(s'))  # the action we will actually take next (on-policy)
    Q(s,a) = Q(s,a) + alpha * [r + gamma*Q(s',a') - Q(s,a)]  # update toward the TD target
    s = s', a = a'
- Interesting fact: a convergence proof has never been published
- it has been stated informally that SARSA will converge if the policy converges to the greedy policy
- we can achieve this by setting ε = 1/t
- Or ε = c/t or ε= c/$t^a$
- Hyperparameters
Learning Rate
- Recall that learning rate can also decay
- problem:
- if we set α = 1/t
- at every iteration, only one (s,a) pair of Q is updated, yet
- the learning rate decays even for values we have never updated
- Could we instead decrease α only after every episode?
- No
- Many elements of Q(s,a) are not updated during an episode
- we take inspiration from deep learning: AdaGrad and RMSprop (adaptive learning rates)
- the effective learning rate decays more when previous gradients have been larger
- in other words: the more it has changed in the past, the less it will change in the future
- our version is simpler: keep a count of every (s,a) pair seen (sketched below):
- α(s,a) = $α_0$ / count(s,a)
- More generally:
- α(s,a) = $α_0$ / (k + m*count(s,a))
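A sketch of that count-based step size; `alpha0`, `k`, and `m` are the constants from the formula above, and this is my illustration rather than code from the course:

```python
from collections import defaultdict

alpha0, k, m = 1.0, 1.0, 0.5          # tuning constants
counts = defaultdict(int)             # visit count per (s, a) pair

def alpha_for(s, a):
    """Per-(s,a) learning rate: decays only for pairs we have actually updated."""
    counts[(s, a)] += 1
    return alpha0 / (k + m * counts[(s, a)])
```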
5. Q-Learning
- Main theme: generalized policy iteration
- Policy evaluation
- Policy improvement (greedy with respect to the current value)
- what we’ve been studying: on-policy methods
- we always follow the current best policy
- Q-Learning is an off-policy method
- take any random action, and still find Q*
- Looks similar to SARSA
- instead of choosing a' based on the argmax of Q, we update Q(s,a) directly with the max over Q(s',a') (see the sketch at the end of this section)
- isn't that the same, since a' = argmax[a']{Q(s',a')}?
- no: we don't need to actually take action a' as the next move
- therefore, we use Q(s',a') in the update for Q(s,a), even if we don't take a' next
- it doesn't matter what policy we follow
- Reality: purely random actions give suboptimal play (compared with acting greedily), so episodes take longer to finish
- takeaway: for correctness it doesn't matter what policy we use
- Under what circumstances is Q-learning == SARSA?
- if the policy used for Q-learning is greedy
- then we'll be doing SARSA, but we'll also be doing Q-Learning
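A minimal Q-learning sketch under the same assumed environment interface as the earlier snippets, reusing the `epsilon_greedy` helper sketched above: the behavior policy is exploratory, but the update bootstraps off the greedy max, which is what makes it off-policy.

```python
from collections import defaultdict

def q_learning(env, actions, n_episodes=1000, alpha=0.1, gamma=0.9, eps=0.1):
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(Q, s, actions, eps)      # behavior policy (any exploratory policy works)
            s2, r, done = env.step(a)                   # assumed env interface
            best_next = 0.0 if done else max(Q[(s2, a2)] for a2 in actions)
            # off-policy update: use the max over a', regardless of the action taken next
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```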
6. Summary
- TD combines aspects of MC and DP
- MC: Learn From experience / play the game
- Generalized idea of taking sample mean of returns
- Multi-armed bandit
- MC is not fully online
- DP: bootstrapping, recursive form of the value function
- TD(0) = MC + DP (combines)
- Instead of taking sample mean of returns, we take sample mean of estimated returns, based on r and V(s’)
TD Summary
- Control
- On-policy: SARSA
- Off-policy : Q-Learning
TD Disadvantage
- Need Q(s,a)
- state space can easily become infeasible to enumerate
- need to enumerate every action for every state
- Q may not even fit into memory
- Estimating and storing Q(s,a) for all s and a is called the tabular method
- Next, we will learn about function approximated methods which allow us to compress the amount of space needed to represent Q
Reference:
Artificial Intelligence Reinforcement Learning