for Robot Artificial Intelligence

1. Reinforcement Learning Introduction


What is Reinforcement Learning?

There are three big parts in AI:

  • Supervised Learning (e.g. spam detection, image classification)
  • Unsupervised Learning (e.g. topic modeling of web pages, clustering genetic sequences)
  • Reinforcement Learning (e.g. Tic-Tac-Toe, Go, Chess, walking, Super Mario, Doom, StarCraft)

1. Supervised/Unsupervised Interfaces

  • as we know from ML theory, both kinds of model share a simple function-like interface:
    class SupervisedModel:
        def fit(self, X, y): ...      # learn from inputs X and targets y
        def predict(self, X): ...     # predict targets for new inputs

    class UnsupervisedModel:
        def fit(self, X): ...         # learn structure from inputs X alone
        def transform(self, X): ...   # e.g. cluster assignments
    
  • common theme is “training data”
  • input data: X (an N x D matrix)
  • Targets: Y (an N x 1 vector)
  • “all data is the same”
  • Format doesn’t change whether you’re in biology, finance, economics, etc.
  • fits neatly into one library: scikit-learn (see the sketch after this list)
  • “simplistic”, but still useful:
    • Face Detection
    • Speech Recognition
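A minimal sketch of that shared interface, assuming scikit-learn and NumPy are installed; the model choices and toy data are illustrative only:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LogisticRegression

    X = np.random.randn(100, 5)              # input data: N x D matrix
    y = np.random.randint(0, 2, size=100)    # targets: N x 1 vector

    clf = LogisticRegression()
    clf.fit(X, y)                            # supervised: learns from inputs + targets
    preds = clf.predict(X)                   # maps inputs to predicted targets

    km = KMeans(n_clusters=3, n_init=10)
    km.fit(X)                                # unsupervised: learns from inputs alone
    clusters = km.predict(X)                 # e.g. cluster assignments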

2. Reinforcement Learning

  • Not just a static table of data
  • An agent interacts with the world (environment)
  • The environment can be simulated or real (e.g. a vacuum robot)
  • Data comes from sensors
    • cameras, microphones, GPS, accelerometers
  • Continuous stream of data
    • must consider past and future
  • An RL agent is a “thing” with a lifetime (see the loop sketched after this list)
    • at each step, it decides what to do
  • An (un)supervised model is just a static object: input -> output
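To make the contrast concrete, here is a minimal sketch of the agent-environment loop; the toy Environment class and its reset/step methods are illustrative assumptions, not the API of any specific library:

    import random

    class Environment:                              # toy one-dimensional world
        def reset(self):
            self.state = 0
            return self.state

        def step(self, action):
            self.state += action                    # toy dynamics
            reward = 1 if self.state >= 5 else 0    # feedback comes from the environment
            done = self.state >= 5                  # the episode ends at state 5
            return self.state, reward, done

    env = Environment()
    state = env.reset()
    done = False
    while not done:                                 # the agent's "lifetime"
        action = random.choice([0, 1])              # at each step, decide what to do
        state, reward, done = env.step(action)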

3. Isn’t It Still Supervised Learning?

  • X can represent the state I’m in, Y the target (the ideal action to perform in that state)
    • state = sensor recordings from a self-driving car
    • state = video game screenshot
    • state = chess board position
  • In a sense, yes - but consider Go: $ N = 8 \times 10^{100} $
  • ImageNet, the image classification benchmark, has $ N = 10^6 $ images
    • Go is 94 orders of magnitude larger
    • training on ImageNet takes ~1 day with good hardware
  • 1 order of magnitude larger -> 10 days
  • 2 orders of magnitude larger -> 100 days (the arithmetic is spelled out below)
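Spelling out that scaling argument - a rough extrapolation which assumes training time grows linearly with dataset size:

$$ \frac{8 \times 10^{100}}{10^6} \approx 10^{94}, \qquad 1 \text{ day} \times 10^{94} = 10^{94} \text{ days} $$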

4. Rewards

  • sometimes you’ll see references to psychology; RL has been used to model animal behavior
  • An RL agent’s goal is in the future
    • in contrast, a supervised model simply tries to get good accuracy / minimize cost on the current input
  • Feedback signals (rewards) come from the environment (i.e. the agent experiences them)

5. Rewards vs Targets

  • you might think of supervised targets/labels as something like rewards, but these handmade labels are coded by humans - they do not come from the environment
  • Supervised inputs/targets are just database tables
  • A supervised model instantly knows if it is right or wrong, because inputs + targets are provided simultaneously
  • RL is dynamic - an agent in a maze only knows its decisions were correct once it eventually solves the maze (see the sparse-reward sketch below)
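A minimal sketch of that delayed feedback, assuming a sparse reward that is zero everywhere except at the goal (the 4x4 grid world below is purely illustrative):

    import random

    GOAL = (3, 3)                                   # goal cell of a 4x4 maze

    def step(pos, action):
        moves = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
        dx, dy = moves[action]
        x = min(max(pos[0] + dx, 0), 3)             # stay inside the grid
        y = min(max(pos[1] + dy, 0), 3)
        new_pos = (x, y)
        reward = 1 if new_pos == GOAL else 0        # sparse: no signal until the maze is solved
        return new_pos, reward

    pos, steps = (0, 0), 0
    while pos != GOAL:                              # many decisions with zero feedback along the way
        pos, reward = step(pos, random.choice(["up", "down", "left", "right"]))
        steps += 1
    print(f"solved after {steps} steps; the only nonzero reward arrived at the end")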

6. On Unusual or Unexpected Strategies of RL

  • The goal of AlphaGo is to win at Go, and the goal of a video game agent is a high score / to live as long as possible
  • What is the goal of an animal/human?
  • Evolutionary psychologists believe in the “selfish gene” theory
    • Richard Dawkins - The Selfish Gene
  • Genes simply want to make more of themselves
  • We humans (conscious living beings) are totally unaware of this
  • we can’t ask our genes how they feel
  • we are simply vessels for our genes’ proliferation
  • is consciousness just an illusion?
  • Disconnect between what we think we want vs “true goal”

  • Like AlphaGo, we’ve found roundabout and unlikely ways of achieving our goal
  • The action taken doesn’t necessarily have an obvious/explicit relationship to the goal
  • we might desire riches/money - but why? Maybe natural selection favored it, or wealth leads to better health and social status; there are no laws of physics which connect riches to gene replication
  • it’s a novel solution to the problem
  • AI can also find such strange or unusual ways to achieve a goal
  • we can replace “getting rich” with any trait we want
    • being healthy and strong
    • having strong analytical skills
  • That’s a sociologist’s job
  • Our interest lies in the fact that there are multiple novel strategies for achieving the same goal (gene replication)
  • What is considered a good strategy can fluctuate
  • Ex. Sugar:
    • our brains run on sugar; it gives us energy
    • today, excess sugar causes disease and death
  • Thus, a Strategy that seems good right now may not be globally optimal

7. Speed of Learning and Adaptation
