for Robot Artificial Intelligence

7. The Value function


1. Assigning Rewards

  • The programmers are like "coaches" to the AI
  • A pet owner is another good analogy
  • We define how to give rewards to the agent (Ex. give the same reward no matter what the agent does - the agent will just behave randomly) (Ex2. give our dog a treat when it behaves badly - this encourages the bad behavior)

2. Maze Example

  • A robot is trying to solve a maze
  • Reward of 1 for finding the maze exit, else 0
  • But the robot is unlikely to solve the maze with this reward structure
  • If all the robot ever sees is a reward of 0, then 0 looks like the maximum reward, so there is no signal pushing it toward the exit
  • Better: every step yields a reward of -1
  • Now the robot is encouraged to solve the maze as quickly as possible (see the sketch after this list)
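
A minimal sketch of the two reward schemes above, in Python; the function names are illustrative, not part of the original notes.

def sparse_reward(reached_exit):
  # Reward of 1 only at the exit, 0 everywhere else.
  # Almost every step returns 0, so the robot gets no signal pointing toward the exit.
  return 1 if reached_exit else 0

def step_penalty_reward(reached_exit):
  # -1 per step (0 at the exit): longer paths accumulate more penalty,
  # so the robot is pushed to reach the exit as quickly as possible.
  return 0 if reached_exit else -1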

3. What, Not How

  • Be careful not to build our own prior knowledge into the AI (Ex. chess)
  • The agent should be rewarded for winning, not for taking the opponent's pieces
  • No reward for implementing a strategy we read about in a chess book
  • Free the agent to find its own solution
  • It's OK to lose all but one piece and then win
  • Tell the agent what we want it to achieve, not how we want it achieved

4. Planning

  • Scenario: we are thinking about studying for tomorrow's exam, but we would rather hang out with friends
    • Hang out with friends -> dopamine hit -> happy
    • Study -> feel tired and bored
    • So why study?
  • We don't think only about immediate rewards, but about future rewards too. We want to assign some value to the current state that reflects the future as well. Call this the "value function"

5. Credit Assignment Problem

  • We receive a reward - getting hired for our dream job
  • Which previous action led to this success?
  • This is called the "credit assignment problem"
  • Ask the question: "What did I do in the past that led to the reward I'm receiving now?"
  • Which action gets the credit?

6. Attribution

  • Related to the online advertising concept of attribution
  • If we show the user the same ad 10 times before they buy, which ad impression gets the credit?
  • In RL we don't just assign credit ad hoc like this

7. Delayed Rewards

  • Delayed rewards: another way of thinking about the same thing
  • Credit assignment: present -> past
  • Delayed rewards look in the other direction: present -> future
  • Related to the field known as "planning"
  • The idea of delayed rewards tells us that an AI needs the ability of foresight, or planning; planning is a field of study that crosses over with reinforcement learning

8. Scenario

  • 2 possible next states from A: B or C (say B is worth 1 and C is worth 0)
  • 50% probability of ending up in either
  • Reasonable value for A?
  • Value(A) = 0.5 x 1 + 0.5 x 0 = 0.5

  • 1 possible next state from A: B (worth 1)
  • 100% probability of ending up in B
  • Reasonable value for A?
  • Value(A) = 1 x 1 = 1 (see the numeric check below)
  • Value tells us the "future goodness" of a state
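
A one-line numeric check of the two cases above, using the probabilities and state values assumed in the scenario.

# Value of A as a probability-weighted sum over the possible next states
next_states = {"B": (0.5, 1.0), "C": (0.5, 0.0)}  # state -> (probability, value)
value_A = sum(p * v for p, v in next_states.values())
print(value_A)  # 0.5; with a single successor B of value 1, it would be 1 x 1 = 1.0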

9. Value Function

  • V(s) - the value of a state (taking into account the probabilities of all possible future rewards)
  • Value is a measure of the possible future rewards we may get from being in this state
  • Reward is immediate (Ex. jumping on a Goomba will immediately increase our score) (Ex2. standing in front of a Goomba will not increase our score, but will put us in a position to jump in the next few states)
  • Estimating the value function is a central task in RL
  • Not all RL algorithms require it (Ex. an evolutionary algorithm that mutates & spawns offspring; only those who survive the longest make it to the next generation)
  • By pure evolution + natural selection we can breed better and better agents
  • But that's not the type of algorithm we are interested in for RL most of the time

10. Efficiency

  • The value function is a fast & efficient alternative to searching the game tree
  • It is time consuming to enumerate every possible state transition and its probability of occurring
  • Tic-tac-toe: 3^(3*3) = 19,683 board configurations
  • Connect 4 (even on just a 4x4 board): 3^(4*4) = 3^16 ≈ 43 million
  • Exponential growth is never good
  • This is the curse of dimensionality
  • V(s) gives the answer instantly, in O(1) (see the lookup sketch below)
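
A minimal sketch of the O(1) claim, assuming V is stored as a dictionary keyed by a board encoding; the tuple encoding here is just an illustrative choice.

V = {}                           # state -> estimated value
empty_board = (0,) * 9           # hypothetical 3x3 tic-tac-toe encoding: 0 = empty cell
V[empty_board] = 0.5             # initial guess for an unseen state
value = V.get(empty_board, 0.5)  # O(1) hash lookup - no game-tree enumeration needed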

11. Finding V(s)

  • V(s) = E[ all future rewards | S(t) = s ]
  • E[X] = the expected value (average) of X
  • This is a generic definition
  • So we need to introduce some constraints here
  • We'll use an iterative algorithm
  • Initialize V(s):
    • V(s) = 1 if s = a winning state
    • V(s) = 0 if s = a losing or drawn state
    • V(s) = 0.5 otherwise
  • When we study "real" algorithms, we won't need such careful initialization
  • V(s) can be interpreted as the probability of winning after arriving in s (for this game only, with this initialization)
  • After we initialize V(s), we update it as follows
  • s = current state, s' = next state
  • s represents every state we encounter in an episode
  • This means we need to actually play an episode and keep track of the state history
  • Terminal states are never updated, since they don't have a next state
  • We'll do this over many episodes

Pseudocode:

for t in range(max_iterations):
  state_history = play_game()  # play one episode, return the list of visited states
  # iterate over (s, s') pairs from the end of the episode back to the start
  for s, s_prime in reversed(list(zip(state_history, state_history[1:]))):
    V[s] = V[s] + learning_rate * (V[s_prime] - V[s])
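
Putting the initialization and the backward update together, a minimal dict-based sketch; play_game(), is_win(), and the constants are hypothetical placeholders rather than anything from the original notes.

from collections import defaultdict

LEARNING_RATE = 0.1
MAX_ITERATIONS = 10000

# Initialization from above: every state we haven't classified yet starts at 0.5
V = defaultdict(lambda: 0.5)

def train(play_game, is_win):
  for _ in range(MAX_ITERATIONS):
    state_history = play_game()                     # one episode: list of visited states
    terminal = state_history[-1]
    V[terminal] = 1.0 if is_win(terminal) else 0.0  # win -> 1, loss/draw -> 0
    # sweep backwards so V(s') is already more accurate than V(s)
    for s, s_prime in reversed(list(zip(state_history, state_history[1:]))):
      V[s] += LEARNING_RATE * (V[s_prime] - V[s])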

12. Playing the Game

  • How do we actually play the game / generate an episode?
  • Take random actions? No!
  • We don't need to - we have the value function, so we pick the action leading to the next state with the highest value:

maxV = float('-inf')
maxA = None
for a, s_prime in possible_next_states:
  if V[s_prime] > maxV:
    maxV = V[s_prime]  # best value seen so far; its state becomes our next state
    maxA = a
perform_action(maxA)   # execute the chosen action in the environment

Note: would taking random actions even give us the right value function? No, because a game tree with random actions has different probabilities than a game tree with "best" actions.

13. Problem

  • Problem with the previous approach: the value function isn't accurate yet
  • If we already had the true value function, we wouldn't need to do any of this work
  • This is an example of the explore-exploit dilemma
  • Random actions lead us to states we may not have otherwise visited
  • We can thus improve our value estimates for those states
  • But to win, we need to take the action that yields the maximum value
  • We will use epsilon-greedy (see the sketch below)
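
A minimal epsilon-greedy sketch reusing the greedy selection from section 12; EPSILON and the list of (action, next_state) pairs are assumptions for illustration.

import random

EPSILON = 0.1  # assumed exploration rate

def choose_action(possible_next_states, V):
  # possible_next_states: list of (action, next_state) pairs
  if random.random() < EPSILON:
    return random.choice(possible_next_states)[0]                    # explore: random action
  return max(possible_next_states, key=lambda pair: V[pair[1]])[0]   # exploit: highest-value next state

With small probability we explore; otherwise we exploit the current value estimates, which is exactly the trade-off described above.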

14. Intuition

  • This should remind you of the low-pass-filter / average-value-finding equation we saw earlier (and of gradient descent, if we've seen that)
  • Since we visit the states stochastically, V(s) will try to get close to V(s') for all possible next states s'
  • By playing infinitely many episodes, the proportion of time we spend in each s' will approach the true probabilities
  • Extremely important detail, hard to discern from the update equation alone:
  • In what order should we update V(s)?
  • Key: we're moving V(s) closer to V(s')
  • Therefore we want V(s') to be more accurate than V(s)
  • V(terminal) is exact: 0 or 1
  • For all others, if V(s') is no better an estimate than V(s), the update doesn't help
  • Therefore, the update goes backwards along the state history (see the demo below)
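
A tiny demonstration of the update-order point, using a made-up 4-state episode that ends in a win; after one backward pass the terminal value has already propagated to every earlier state, while a forward pass only moves the last state.

def one_pass(order, lr=0.5):
  # Episode: s0 -> s1 -> s2 -> win (terminal); non-terminal values start at 0.5
  V = {"s0": 0.5, "s1": 0.5, "s2": 0.5, "win": 1.0}
  pairs = [("s0", "s1"), ("s1", "s2"), ("s2", "win")]
  if order == "backward":
    pairs = list(reversed(pairs))
  for s, s_prime in pairs:
    V[s] += lr * (V[s_prime] - V[s])
  return V

print(one_pass("forward"))   # only s2 moves toward 1; s0 and s1 stay at 0.5
print(one_pass("backward"))  # the win propagates all the way back to s0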

15. Summary

  • Credit assignment problem / delayed rewards
  • Value function for representing future reward
  • Value function efficiency vs. searching the game tree
  • Iterative algorithm to find the value function
  • Warning! This is not the "formal" value function
  • Everything in the tic-tac-toe section is informal, designed to get us acquainted with solving an RL problem
