Review

Multi-armed bandit Review(Bayesian machine learning: A/B Testing)
Explore-exploit dilemma
4 Algorithms:
- Epsilon-greedy
- Optimistic initial Value
- UCB1
- Thompson Sampling
Basic definitions in RL
Tic-Tac-Toe
MDPs
Policies state-value functions, action-value funtions
Return
3 methods:
- Dynamic Programming(direct application of bellman’s Equation)
  - Policy iteration, Value iteration
- Monte Carlo
  - Learning from experience
  - Not fully online
- Temporal Difference Learning
  - Fully online with bootstrapping
  - Also learn from experience
Approximation Methods
Tabular methods can be infeasible for large state spaces
- action value funtion을 Q-table로 작성하여 푸는 방법
differential models(feature engineering)
1. Review of MDPs
Markov decision Processes
MDPs a collection of 5 things:
- set of all states
- set of all actions
- set of all rewards
- state transition probabilities
- Discount factor(gamma)
States
- State represents what the sensors of our agent measure from the environment
- In GridWorld, that would be our position on the board
- in tic-tac-toe: specific configuration of pieces on the board
- Video game: pixel on the screen
- Maybe also : # of lives we have left, health, etc
- For an AI to be as human as possible maybe we should require it to learn that only from pixels on the screen
Actions
- Anything the agent can do while in a state
- tic-tac-toe: placing a piece on the board
- Video game: moving up/down/left/right, pressing an action button
Rewards
- Agent receives a reward at every time step
- reward are real-valued
- goal of agent is to maximize total future reward
- careful to define rewards the right way
- Ex. Robot trying to solve a maze receives a reward of 0 at every step and 1 for solving the maze
- possible that robot will never solve the maze, or solve it very inefficiently
- it has only experience 0 reward, thinks that is the best it can do, no incentive to not act randomly
- better solution is -1 reward at every time step
- now it has incentive to solve the maze as quickly as possible
- “Negative” and “Positive” don’t have connotations(含义) when it comes to RL agents
- just a number on a scale
- E.g -3 is better reward than -300
- been debated whether or not we should override the default rewards
- in real world environment, we would be the ones defining rewards
State-transition probabilities
- At first glance, might seem unnecessary
- p(s’,r I s,a)
- if we do action a while in state s, won’t i always go to s?
- GridWorld: action “up”, should always take us to the square above
- Not all environments are deterministic - have a source of randomness
- reading of the state can be imperfect
- may only reflect partial knowledge
Makrov propertys
- p[s(t+1),r(t+1)I s(t), a(t), s(t-1), a(t-1),…,(s1),a(1)]=p[s(t+1),r(t+1) I s(t), a(t)]
- as usual by “Markov” we mean first-order Markov
- A.K.a Markov assumption
- p(s’,r I s,a)
- Note: ‘ don’t mean t+1
Discount factor
- we would rather get $100 now than $100 years from now
- we would like to maximize total future reward
- the further we look into the future, the harder it is to predict
- therefore, discount future rewards to make them “matter less”
- we can then define the return:
Value Function
- Because rewards are probabilistic, and return are sums of rewards, they are also probabilistic
- can define the expected value - call this the Value Funtion
- The expected return from being in state s
State-Value and Action-Value
- State-Value: V(s), Action-Value: Q(s,a)
- Typically use Q for learning the optimal policy (control problem)
- Finding V or Q given a fixed policy is called the prediction problem
- Policy denoted by π, can be deterministic or probabilistic
When to use Q(s,a)
- Q(s,a) is better for the control problem since it is indexed by action, tells us what the best action is given a state
- With V(s) we need to actually do the action, and see what the next state s’ is, to determine the best action
- Not feasible to “reset” a state in the real world
Episodes
- E.g one run of tic-tac-toe
- Typically an agent will need to play many episodes to learn an optimal policy
- Tasks that end are called episodic tasks
- Tasks that last forever are called continuous task
- Terminal state - the state at which an episode ends
- Since the value function is the expected future reward after arriving in a state, the value of any terminal state is 0

2. Dynamic Programming

Pioneered by Richard Bellman
The bellman Equations for MDPs allow us to define the value function recursively
if we look carefully, this is actually a linear system of equations:
- V(s1)= $a_11$V(s1) + $a_12$V(s1) + … + $a_1N$V(sN)
- V(s2)= $a_21$V(s2) + $a_22$V(s1) + … + $a_2N$V(sN)
but since this is machine learning we are more interested in iterative solutions
prediction problem: Given a polic, find the value function
simply keep recalculating the right side and assign it to the left until there is no more change
bellman’s equation says they should be equal
one “trick” : don’t update set of V(s)’s only from previous set of V(s)’s, just update V(s) in-place -it’s faster
Policy Iteration
- For solving the control problem
```
while not converged:
Step 1) Policy evaluation on current policy
Step 2) Policy Improvement (take the argmax over Q(s,a))
```
- Policy iteration is inefficient
- outer loop is iterative
- policy evaluation(inner loop) is also iterative
- iterative algorithm inside another
Value Iteration
- Instead of waiting for policy evaluation to converge just do iteration
- overall it will still converge
- Furthermore, we don’t need to explicitly do the policy improvement step at all
- Since it’s the argmax, then taking the max in the policy evaluation step in equivalent
- Taking the max appears again in Q-Learning
lays the ground work, but is not very practical
we need to loop through all state on every iteration
state space may be very large or infinite
requires us to know p(s’,r I s,a)
calculating that can be infeasible
doesn’t learn from experience
MC and TD learning do, no model of the environment needed

3. Monte Carlo Methods(MC)

Unlike DP, MC is all about learning from experience
Expected values can be approximated by sample means
Play a bund of episodes, gather returns, average them
only gives us values for states we encountered
if we never encountered a state, its value is unknown
what about the control problem?
we return to the ide of policy iteration
Montel Carlo control
1. initialize random policy
2. while not converged:
  - play an episode, calculate returns for each state
  - Do policy improvement based on current Q(s,a)(take the argmax)
  - Look-ahead search with V(s) is not practical if we don’t have full control over environment, so we use Q(s,a)
  - Unusual/interesting: the averaged returns are for difference policies, yet it still converged
Problem
- MC control as given won’t always work
- requires many episodes
- what if we’re not doing an episodic task?
- what if the current policy is to bump into a wall or walk around in a circle? the episode will never end
- one solution: end the episode if we catch ourselves in a loop, give a large negative reward for that action
- Agent won’t do it again, since policy is chosen as argmax
Another Problem
- MC can leave may states unexplored
- we had never know if going to these states is better/worse than our current policy, because we have no data on them
- one solution: the “exploring stats” method-start from a random state each episode
- requires full control over environment, not always feasible
- Another solution: epsilon-greedy
- by moving randomly with small probability epsilon, we’ll eventually explore all states sufficiently

4. Temporal difference Learning(TD)

Unique to RL
MC: sample returns based on an episode
TD: estimate return based on current value function estimate
We looked at TD(0)
instead of using G, we use r+ɣV(s’)
Calculating means
- basic naive way:
- collect all samples in an array, sum them, divide by N
- Major disadvantage - it requires us to store all samples in memory - space inefficient
- can calculate current sample mean from last sample mean
- Not only is it space efficient, it also looks curiously like gradient descent(in fact, it is ) we can Generalized it:
Back to TD(0)
- remember, instead of using G, we use r and V
- Allows us to do true online learning
- only need to wait until t+1 to update value for state at time t
- MC - need to wait until entire epsiode is over
- can learn during the episode, can be used for long or non-episodic tasks
TD control
- we learned 2 algorithms
- Sarsa

Pseudocode

Q(s,a) = arbitrary, Q(terminal,a) = 0
for t=1..N
  s = start_state, a = epsilon_greedy_from(Q(s))
  while not game over:
    s', r = do_action(a)
    a' = episode_greedy_from(Q(s'))
    Q(s,a) = Q(s,a) + alpha + [r + gamma*Q(s',a')-Q(s,a)]
    s = s', a = q'

Q-Learning
- similar to SARSA, but an off-policy algorithm
- it doesn’t matter what action we take next, the target remains the same since it’s a max

5. Approximation Methods

we studied DP, MC, and TD using tabular methods
we stored V(s) and Q(s,a) as dictionaries
this work ok but only for small problems
it won’t work for billion or trillion of states, and certainly not an infinite number of states
we know that supervised learning can be used for funtion approximation
we want to estimate V or Q
Since reward is a real number, so is , therefore we do regression, and the appropriate loss is squared error
Linear Models
- we need a differential model so we can do gradient descent
- theta is the model parameter, so that’s what we need to update
General Models
- In general, the model need not be linear, just differentiable like a neural network
- All deep learning techinque can be applied batch gradient descent, momentum, dropout and other regulation, etc.
- in general form:
  Next Steps
Continuous state-spaces
- light intensity in 3-D space
- 3-D space is Continuous, light intensity is Continuous
Continuous action-space
- Amount of force applied to a motor
parameterized V
parameterized π
- it called “policy gradient” method
learned from our current experience only
as we play, we accumulate(input, target) pairs
- E.g accumulate training data
save these in a file and do more training
in deep learning, we loop through the same training data multiple times (called ‘epochs’)
No reason we can’t learn from our previous experience as well
Replay a previous episode, update params based on old episodes
CNNs
RNNs
Regression, Classification

Reference:

Artificial Intelligence Reinforcement Learning

Advance AI : Deep-Reinforcement Learning

Cutting-Edge Deep-Reinforcement Learning

for Robot Artificial Inteligence

15. Review

01 Oct 2019 | Reinforcement Learning

Review

1. Review of MDPs

2. Dynamic Programming

3. Monte Carlo Methods(MC)

4. Temporal difference Learning(TD)

5. Approximation Methods

Next Steps

Comments

for Robot Artificial Inteligence

15. Review

01 Oct 2019 | Reinforcement Learning (adsbygoogle = window.adsbygoogle || []).push({});

Review

1. Review of MDPs

2. Dynamic Programming

3. Monte Carlo Methods(MC)

4. Temporal difference Learning(TD)

5. Approximation Methods

Next Steps

Comments

01 Oct 2019 | Reinforcement Learning