9. Markov Decision Processes
28 Sep 2019 | Reinforcement Learning
Markov Decision Processes
- This section: formalize some RL concepts we already know about
- Agent, environment, action, state, reward, episode
- Formal framework: Markov Decision Processes (MDPs)
1. A Typical Game Solved with MDPs: GridWorld
- Possible actions: up, down, left, right
- (1,1) -> wall, can't go here
- (0,3) -> terminal (+1 reward)
- (1,3) -> terminal (-1 reward)
- 12 positions (w x h = 3 x 4 = 12)
- 11 states (the positions the robot can occupy: 12 minus the wall)
- 4 actions
2. Markov Property
- Given a sequence: x(1), x(2), ..., x(t)
- Generally, p(x(t) | x(t-1), x(t-2), ..., x(1)) can't be simplified
- First-order Markov: p(x(t) | x(t-1), ..., x(1)) = p(x(t) | x(t-1))
- Second-order Markov: p(x(t) | x(t-1), ..., x(1)) = p(x(t) | x(t-1), x(t-2))
- Simple Example
```
Consider the sentence: "Let's do a simple example"
Given: "Let's do a simple"
Predict the next word? Easy
Given: "simple"
Predict the next word? Not as easy
Given: "a"
Predict the next word? Very difficult
Is the Markov property limiting? Not necessarily
```
3. Markov Property in RL
- {S(t), A(t)} produces 2 things -> {S(t+1), R(t+1)}
- Markov Property: p(S(t+1), R(t+1) | S(t), A(t), S(t-1), A(t-1), ..., S(0), A(0)) = p(S(t+1), R(t+1) | S(t), A(t))
- Convenience notation: p(s', r | s, a)
- Joint on s' and r, conditioned on 2 other variables
- Different from the "usual" Markov property: 1 RV (random variable) conditioned on 1 other RV
4. Other Conditional Distributions
- These can be derived from the joint p(s', r | s, a) using the rules of probability (see the derivation below)
- For pretty much all cases we'll consider, these will be deterministic
- i.e., a state will always give us the same reward
- An action will always bring us to the same next state
- But these distributions are part of the core theory of RL
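For example, the individual state-transition and reward distributions fall out of the joint by marginalization; this is a standard derivation, written here to match the notation above:
```
p(s' \mid s, a) = \sum_{r} p(s', r \mid s, a)
p(r \mid s, a) = \sum_{s'} p(s', r \mid s, a)
```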
5. Is the Markov Assumption Limiting?
- Not necessarily
- Recent application: DeepMind used a concatenation of the 4 most recent frames to represent the state when playing Atari games (see the sketch after this list)
- The state can be made up of anything from any point in the past up to now
- Typically we think of the state right now = something we measure right now
- Also, we don't need to use raw data (the state can be features transformed from raw data)
- Any input from the agent's sensors can be used to form the state
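A minimal sketch of that frame-stacking idea, assuming frames arrive as NumPy arrays; the class and method names are illustrative, not DeepMind's actual code:
```
from collections import deque
import numpy as np

class FrameStack:
    """Keep the k most recent frames and expose them as a single state."""
    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        # At episode start, fill the stack by repeating the first frame.
        for _ in range(self.k):
            self.frames.append(first_frame)
        return self.state()

    def step(self, frame):
        # Each new observation pushes out the oldest frame.
        self.frames.append(frame)
        return self.state()

    def state(self):
        # Shape: (k, height, width) - the agent's state is the stack, not one frame.
        return np.stack(self.frames, axis=0)
```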
6. Markov Decision Processes (MDPs)
- Any RL task with a set of states, actions, and rewards that follows the Markov property is an MDP
- An MDP is defined as the collection of:
- a set of states
- a set of actions
- a set of rewards
- the state-transition and reward probabilities (defined jointly as p(s', r | s, a) earlier)
- a discount factor
- Often written as a 5-tuple (one possible packaging is sketched below)
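A minimal sketch of the GridWorld from section 1 packaged as a 5-tuple; the representation is hypothetical, and the joint distribution is deterministic here (every probability is 0 or 1):
```
# GridWorld as a 5-tuple (S, A, R, p, gamma); states are (row, col) positions.
STATES  = [(r, c) for r in range(3) for c in range(4) if (r, c) != (1, 1)]
ACTIONS = ["up", "down", "left", "right"]
REWARDS = {(0, 3): +1, (1, 3): -1}    # terminal positions
GAMMA   = 0.9

def p(s_next, r, s, a):
    """Joint probability p(s', r | s, a); deterministic, so returns 0.0 or 1.0."""
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    dr, dc = moves[a]
    target = (s[0] + dr, s[1] + dc)
    if target not in STATES:          # hitting the wall or the grid edge
        target = s                    # the agent stays in place
    reward = REWARDS.get(target, 0)
    return 1.0 if (s_next == target and r == reward) else 0.0
```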
7. Policy
- One more piece to complete the puzzle: the policy (denoted by π)
- Technically π is not part of the MDP itself, but it, along with the value function, forms the solution
- Left out until now because it's a weird symbol
- There's no "equation" for it
- How do we write epsilon-greedy as an equation? It's more like an algorithm (see the sketch below)
- The only exception is the optimal policy, which can be defined in terms of the value function
- Think of π as shorthand for the algorithm the agent uses to navigate the environment
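A minimal sketch of epsilon-greedy written as an algorithm rather than an equation, assuming a table Q of action-value estimates keyed by (state, action); the names are illustrative:
```
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore; otherwise exploit the best-known action."""
    if random.random() < epsilon:
        return random.choice(actions)                     # explore
    return max(actions, key=lambda a: Q[(state, a)])      # exploit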
8. State-Transition Probability: p(s' | s, a)
- State diagram
- p(s' | s, a)
- Why is this stochastic? If I press the "jump" button, doesn't it always do the same thing? (see the sampling sketch after this list)
- Recall: the state is only derived from what the agent senses; it's not the environment itself
- The state can be an imperfect representation of the environment
- Ex) one state could represent multiple configurations of the environment
- Ex) Blackjack: if we're the agent, the dealer's next card is not part of our state (but it is part of the environment)
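A minimal sketch of why an imperfect state makes transitions look stochastic: the environment's hidden detail (e.g., the dealer's next card) gets marginalized into probabilities over s'. The state names and probability values here are made up for illustration:
```
import random

# Hypothetical table: p(s' | s, a) for one (state, action) pair.
transition = {
    ("player_sum_15", "hit"): {"player_sum_17": 0.3,
                               "player_sum_21": 0.1,
                               "bust": 0.6},
}

def sample_next_state(s, a):
    dist = transition[(s, a)]
    states, probs = zip(*dist.items())
    return random.choices(states, weights=probs)[0]

# The same (state, action) pair can lead to different next states.
print([sample_next_state("player_sum_15", "hit") for _ in range(5)])
```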
9. Actions vs Environment
- Typically we think of actions like joystick inputs (up/down/left/right/jump) or Blackjack moves (hit/stand)
- Actions can also be very broad: e.g., how to distribute government funding
- We are navigating an environment, we are the agent - what constitutes "us"?
- Is it our body? No
- Our body is part of the environment; our body doesn't make decisions or learn
- The brain/mind does the learning
10. Total Reward & Future Reward
- We are interested in measuring the total future reward
- Everything from t+1 onward
- We call this the Return: G(t) = R(t+1) + R(t+2) + R(t+3) + ...
- Note: it does not count the current reward R(t)
Future Reward
- Imagine a very long task (thousands of steps)
- Is there a difference between getting a reward now and getting the same reward 10 years from now?
- Think finance
- Because of inflation, $1,000 received 10 years from now is worth less than $1,000 received today
- Would you rather get $1,000 today or $1,000 ten years from now? Choose today
11. Discount Factor
- The discounted return: G(t) = R(t+1) + γR(t+2) + γ²R(t+3) + ...
- Gamma = 1: don't care how far in the future the reward is; weight all rewards equally
- Gamma = 0: truly greedy, only try to maximize the immediate reward
- Usually we choose something close to 1, e.g., 0.9 (see the worked computation below)
- Short episodic tasks: maybe don't discount at all
- "The further we look into the future, the harder it is to predict"
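A minimal sketch of computing G(t) from a list of future rewards via the backward recursion G(t) = R(t+1) + γ·G(t+1); the function name is illustrative:
```
def discounted_return(rewards, gamma=0.9):
    """Compute G(t), where rewards = [R(t+1), R(t+2), ...]."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g    # G(t) = R(t+1) + gamma * G(t+1)
    return g

# A reward of 1 arriving 10 steps ahead is discounted by 0.9^9 ≈ 0.39.
print(discounted_return([0] * 9 + [1]))    # ≈ 0.387
```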
12. Merging Continuous and Episodic Tasks
- Why sum up to infinity? Aren't we doing episodic tasks?
- Yes, but the math is easier with infinity, so we make the two technically equivalent: treat the terminal state as an absorbing state that transitions only to itself with reward 0, so the finite episodic sum equals the infinite sum
Extras
- (Epsilon-greedy is on-policy when we roll the dice once to pick an action and then actually use that action at the next step)
- (Q-learning is off-policy because it takes the max Q regardless of the action actually used at the next step)
- Compute the future reward while repeatedly taking the optimal value at each point
- The Q-function can be approximated iteratively using the Bellman equation (see the updates below)
- (SARSA: compute in advance the action that will actually be fed into the game next, and use it in the update)
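For reference, the two tabular update rules behind these notes, in their standard form with step size α:
```
Q-learning (off-policy): Q(s,a) \leftarrow Q(s,a) + \alpha [ r + \gamma \max_{a'} Q(s',a') - Q(s,a) ]
SARSA (on-policy):       Q(s,a) \leftarrow Q(s,a) + \alpha [ r + \gamma Q(s',a') - Q(s,a) ]
```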
- Two classic problems in Reinforcement Learning
    - The Credit Assignment Problem
    - The Exploration-Exploitation trade-off
- DQN Properties
    - Target Q-function
        - Problem: the Q-function being learned keeps changing while it is also the learning target
        - The Q-function is learned by gradient descent
        - Every fixed number of steps, copy the Q-function's weights into the target Q-function
    - Replay Memory (see the sketch after this list)
        - A store of samples that serve as the raw material for training
        - Don't train on a transition immediately; store it in memory
        - Pull a fixed number of samples out at random for training
    - Q-function built with an ANN
        - DQN approximates the Q-function with an artificial neural network
    - Characteristics of DQN's Q-function
        - Model-free
            - No model of the environment; the policy is approximated directly from samples
            - Training with an ANN largely escapes the curse of dimensionality
        - Off-policy
            - Learns from actions other than the ones the current policy would take
            - Splits the target policy and the behavior policy
                - Target policy: the reference policy we are teaching the RL agent
                - Behavior policy: the policy that explores and generates new actions
            - Harder to implement because two policies must be handled
        - Minibatch
            - Randomly draw samples from a large pool of data for training
            - Removes the strong correlation between consecutive samples
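A minimal sketch of the replay memory and the periodic target-network copy described above; the class and variable names are illustrative, and the commented weight copy assumes PyTorch-style networks:
```
import random
from collections import deque

class ReplayMemory:
    """Store transitions; sample random minibatches to break up correlation."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        # Store the transition instead of training on it immediately.
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # A random draw removes the correlation between consecutive samples.
        return random.sample(self.buffer, batch_size)

# Periodic target update, inside a hypothetical training loop:
# if step % target_update_interval == 0:
#     target_net.load_state_dict(q_net.state_dict())   # copy the weights
```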
- Value-Based Reinforcement Learning
    - First approximate the value function, then derive a policy from it
- Decaying Epsilon-Greedy (see the schedule sketch below)
    - At first, act randomly -> the probability of a random action is reduced gradually -> once it reaches 1%, it stops decaying
    - Two hyperparameters:
        (1) exploration_fraction: how long to keep decaying; default 0.5 (the probability of taking a random action shrinks until 50% of the total timesteps have passed)
        (2) exploration_final_eps: the final epsilon value; default 0.01 (once epsilon reaches 0.01, it stops decreasing and holds that value)
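A minimal sketch of that linear decay schedule; the two parameter names echo the hyperparameters above, while the function itself is illustrative:
```
def epsilon_at(step, total_timesteps,
               exploration_fraction=0.5, exploration_final_eps=0.01):
    """Linearly anneal epsilon from 1.0 down to its final value, then hold."""
    decay_steps = exploration_fraction * total_timesteps
    progress = min(step / decay_steps, 1.0)
    return 1.0 + progress * (exploration_final_eps - 1.0)

# Example: halfway through the decay window of a 100,000-step run.
print(epsilon_at(25_000, 100_000))    # ≈ 0.505
```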
Reference:
Artificial Intelligence Reinforcement Learning