for Robot Artificial Intelligence

3. Multi-Armed Bandit


Multi-Armed Bandit

  • First seen in: Bayesian Machine Learning (Probability Algorithm): A/B Testing (e.g. go to a casino and choose between 3 slot machines)
  • Each machine (for now) gives a reward of 0 or 1
  • The win rate of each machine is unknown (e.g. it could be 0.1, 0.3, 0.5)
  • Goal: maximize our winnings
  • Problem: we can only discover the best bandit by collecting data
  • If we were psychic, we could just play the bandit with the highest win rate
  • Balance explore (collecting data) + exploit (playing the "best-so-far" machine)

1. Traditional A/B Testing

  • Predetermine the # of times we need to play/collect data in order to establish statistical significance (a rough sketch of such a calculation is shown after this list)
  • The # of times needed depends on several things, such as the difference between the win rates of each bandit (but if we knew that, we wouldn't be running the test in the first place; a quirk of traditional statistics)
  • Important rule: don't stop the test early. Even if we set N = 1000 and after 500 trials we see a 90% win rate vs. a 10% win rate, we still may not stop
  • This is much different from human behavior
  • Suppose we're playing in a real casino and we get 2/3 on one bandit and 0/3 on the other
  • We have a strong urge to play the first bandit
  • We know "as a data scientist" that 3 is a small sample size
  • As a gambler we do not "feel" that way
  • We'll look at ways to systematically make the explore-exploit trade-off (explained in the upcoming sections)
  • These are better than the 2 suboptimal solutions: A/B testing and human emotion
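Not part of the original notes, but to make the "predetermined #" concrete, here is a rough sketch of the standard two-proportion sample-size calculation (it assumes scipy is available; the function name and the win rates are illustrative only):

import numpy as np
from scipy.stats import norm

def required_samples_per_group(p1, p2, alpha=0.05, power=0.8):
    # rough two-proportion formula: plays needed per bandit to detect
    # the difference between win rates p1 and p2 with the given power
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance threshold
    z_beta = norm.ppf(power)            # desired statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2
    return int(np.ceil(n))

print(required_samples_per_group(0.1, 0.3))  # ~59 plays per bandit
print(required_samples_per_group(0.1, 0.5))  # ~17 plays per bandit: bigger gap, fewer plays

Note the circularity: the required N depends on the very win rates we are trying to estimate.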

2. Applications of Explore-Exploit

  • Already discussed in Bayesian Machine Learning
    • If you don't know it, check it here
  • Lots of scary math! How does it apply in the "real world"?
  • The casino example seems contrived
  • The explore-exploit dilemma (and the subsequent algorithms intended to "solve" it) is one of the most, if not the most, "real world" concepts in Machine Learning
  • Almost every business and online entrepreneur can make use of it

3. Comparing things

  • Quantitatively comparing things applies to almost any business (e.g. we are Samsung, selling the Note 8)
  • We have 2 ads we really like
  • Which one is better? The one more likely to lead the user to purchase the phone (naturally)
  • We can measure the click-through rate (CTR) of each ad (e.g. 10 clicks / 1000 impressions = 1% CTR)
  • How do we measure the CTR? Do an experiment
  • Show the first ad to 1 million users and show the second ad to 1 million users
  • Calculate the CTRs and compare (a simulated version of this experiment is sketched below)
  • How do I know 1 million is the correct number?
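A minimal sketch of that experiment as a simulation (the "true" CTRs below are made up purely for illustration; in reality they are exactly what we don't know):

import numpy as np

np.random.seed(0)
true_ctr_a, true_ctr_b = 0.010, 0.012   # hypothetical true CTRs, unknown in real life
impressions = 1_000_000

# each impression is a Bernoulli trial: 1 = click, 0 = no click
clicks_a = np.random.binomial(1, true_ctr_a, impressions)
clicks_b = np.random.binomial(1, true_ctr_b, impressions)

print("estimated CTR of ad A:", clicks_a.mean())
print("estimated CTR of ad B:", clicks_b.mean())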

4. The nature of Probability

  • We would need an infinite number of samples to get an absolutely precise estimate of the CTR
  • As we collect more samples, the confidence interval shrinks (see the formula after this list)
  • Therefore, the more samples we collect, the better
  • But if one advertisement is better, that means the other is worse
  • If I show the suboptimal ad 1 million times, I've wasted 1 million impressions to get a suboptimal CTR
  • My desire to show only the best advertisement (and hence make more money) is fundamentally at odds with my desire to have an accurate CTR
    • Which is needed to know which advertisement is "best" in the first place
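As a reminder of what "the confidence interval shrinks" means (this formula is not from the original notes; it is the standard normal approximation for a Bernoulli success rate), the approximate 95% interval for an estimated CTR $\hat{p}$ after $N$ impressions is

\[\hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{N}}\]

so its width shrinks like $1/\sqrt{N}$: halving the uncertainty costs roughly 4 times as many impressions.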

5. Who cares?

  • Samsung is not the only company selling things
  • Almost every modern business uses online advertising
  • But we don't have to be one; we can simply be someone who owns a website
    • Compare viewing time on the old design vs. the new design
    • Is the "Buy Now" button better at the top or the bottom of the sales page?
    • Which "price ending" leads to the most conversions? ₩2000 vs. ₩1999
  • These are all the same problem numerically
  • Note: click-through rate and conversion rate are the same thing to us, just a generic "success rate"

6. Gaussian vs Bernoulli

  • Bernoulli is good for measuring success rates (click / no click, buy / no buy)
  • Gaussian is good for continuous variables (e.g. viewing time); a small sketch of both follows
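A small sketch of the two reward models (the function names and parameter values are mine, just for illustration); note that the same sample-mean machinery handles both:

import numpy as np

def bernoulli_reward(p=0.3):
    # success/failure reward, e.g. click / no click
    return 1 if np.random.random() < p else 0

def gaussian_reward(mu=3.0, sigma=1.0):
    # continuous reward, e.g. minutes spent on a page
    return sigma * np.random.randn() + mu

# the sample mean estimates the success rate / true mean in either case
print(np.mean([bernoulli_reward() for _ in range(10000)]))  # close to 0.3
print(np.mean([gaussian_reward() for _ in range(10000)]))   # close to 3.0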

7. Epsilon-Greedy

  • A simple solution to the explore-exploit dilemma
  • Choose a small number ε as the probability of exploration
  • Typical values: 5%, 10%
p = random()
if p < eps:
  pull random arm
else:
  pull current-best arm
  • Eventually, we'll discover which arm is the true best, since exploring lets us update every arm's estimate
  • In the long run, this allows us to explore each arm an infinite number of times
  • Problem: we reach a point where we keep exploring when we no longer need to
  • For epsilon = 10%, we'll continue to spend 10% of the time doing suboptimal things
  • We could stop at some predetermined time (as in an A/B test), then set epsilon = 0
  • But we'll learn better ways to adapt (a minimal sketch of the arm-selection step follows; the full experiment is in section 10)
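A minimal sketch of the arm-selection step as a standalone function (the helper name and the example estimates are mine; the full experiment code appears in section 10):

import numpy as np

def choose_arm(estimates, eps=0.1):
    # epsilon-greedy: with probability eps explore a random arm,
    # otherwise exploit the arm with the highest current estimated mean
    if np.random.random() < eps:
        return np.random.randint(len(estimates))   # explore
    return int(np.argmax(estimates))               # exploit

# example with made-up running estimates for 3 arms
print(choose_arm([0.2, 0.5, 0.1], eps=0.1))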

8. Estimating Bandit Rewards

  • Bandit rewards need not be just coin tosses; they can come from any distribution
  • The general way to keep track of a bandit's reward is the sample mean
  • It works for $ 1/0 $ rewards too:
\[p(k=1) = \frac{\#\text{ of 1's}}{\#\text{ of 1's} + \#\text{ of 0's}} = \frac{\#\text{ of 1's}}{N}\]
  • Here $p$ is the estimated probability of success
  • Naively, we would need to keep track of all the samples X
  • Can we be more efficient? Yes!
  • The new mean can be calculated from the old mean, as shown below
  • This "form" is very important in Reinforcement Learning
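The update referred to above is the standard incremental-mean identity (it is exactly what the update method in the code further below implements):

\[\bar{X}_N = \left(1 - \frac{1}{N}\right)\bar{X}_{N-1} + \frac{1}{N}X_N\]

so we only need to store the current estimate and the count $N$, not every individual sample.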

9. Let's recap how we do this in supervised learning (Naive Bayes, Decision Trees, Neural Networks)

class MyModel:
  def fit(self, X, Y):
    # our job
    pass
  def predict(self, X):
    # our job
    pass
  • For example, the boilerplate looks like this:
Xtrain, Ytrain, Xtest, Ytest = get_data()  # 1. Get data
model = MyModel()                          # 2. Instantiate model
model.fit(Xtrain, Ytrain)                  # 3. Train on the training set
model.score(Xtest, Ytest)                  # 4. Evaluate on the test set

10. Designing our bandit Program

  • Even though this is not supervised learning, there is still a pattern to be followed
  • Reminder: a "casino machine" is just an analogy for real-life applications
    • Serving advertisements you hope users will click on
    • Testing different website designs to see which keeps users engaged the longest
class MyAwesomeCasinoMachine:
  def pull():
    # simulates drawing from the true distribution,
    # which we wouldn't know in "real life"

for t in range(max_iterations):
  # pick a casino machine to play based on the algorithm
  # update the algorithm's parameters

# plot useful info (average reward, best machine, etc.)
import numpy as np
import matplotlib.pyplot as plt

class Bandit:
    """A single slot machine with true mean m; keeps a running estimate of that mean."""

    def __init__(self, m):
        self.m = m        # true mean (unknown in "real life")
        self.mean = 0     # our current estimate of the bandit's mean
        self.N = 0        # number of times this arm has been pulled

    def pull(self):
        # simulates pulling the bandit's arm:
        # every bandit's reward is Gaussian with unit variance around its true mean
        return np.random.randn() + self.m


    def update(self, x):
        # incremental mean update using the latest sample x received from this bandit
        self.N += 1
        self.mean = (1 - 1.0/self.N)*self.mean + 1.0/self.N*x

def run_experiment(m1, m2, m3, eps, N):
    # compare three bandits with three different true means
    bandits = [Bandit(m1), Bandit(m2), Bandit(m3)]

    data = np.empty(N)

    for i in range(N):
        # epsilon-greedy action selection
        p = np.random.random()
        if p < eps:
            # explore: choose a bandit at random
            j = np.random.choice(3)
        else:
            # exploit: choose the bandit with the best current sample mean
            j = np.argmax([b.mean for b in bandits])
        x = bandits[j].pull()
        bandits[j].update(x)
        # record the reward for the plot
        data[i] = x
    # cumulative average reward after every play
    cumulative_average = np.cumsum(data) / (np.arange(N) + 1)

    plt.plot(cumulative_average)
    plt.plot(np.ones(N)*m1)
    plt.plot(np.ones(N)*m2)
    plt.plot(np.ones(N)*m3)
    plt.xscale('log')
    plt.show()

    for b in bandits:
        print(b.mean)

    return cumulative_average

if __name__ == '__main__':
    c_1 = run_experiment(1.0, 2.0, 3.0, 0.1, 100000)
    c_05 = run_experiment(1.0, 2.0, 3.0, 0.05, 100000)
    c_01 = run_experiment(1.0, 2.0, 3.0, 0.01, 100000)

    # log scale plot
    plt.plot(c_1, label='eps = 0.1')
    plt.plot(c_05, label='eps = 0.05')
    plt.plot(c_01, label='eps = 0.01')
    plt.legend()
    plt.xscale('log')
    plt.show()

    # linear plot
    plt.plot(c_1, label='eps = 0.1')
    plt.plot(c_05, label='eps = 0.05')
    plt.plot(c_01, label='eps = 0.01')
    plt.legend()
    plt.show()    

11. Optimistic Initial Values

  • Another simple way of solving the explore-exploit dilemma
  • Suppose we know the true mean of each bandit is ≪ 10
  • Pick a high ceiling as the initial mean estimate
  • Previously we set the initial mean to 0; this time we set the initial value to 10
  • The initial sample mean is "too good to be true"
  • All collected data will cause it to go down
  • If the true mean is 1, the sample mean will still approach 1 as we collect more samples
  • All means will eventually settle into their true values (exploitation)
  • It helps exploration as well
  • If we haven't explored a bandit, its mean will remain high, causing us to explore it more
  • In the main loop we use the greedy strategy only
  • But with optimistic means
  • The estimate behaves optimistically until enough real data has been collected to pull it down toward the true mean (see the equation after this list and the code that follows)
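To see why the optimism washes out (this follows directly from the incremental-mean update used in the code below, where N starts at 1 and the estimate starts at the upper limit of 10), the estimate after $k$ pulls is

\[\bar{X}_k = \frac{10 + x_1 + x_2 + \dots + x_k}{k + 1}\]

i.e. the optimistic value acts like one extra "phantom" sample whose influence fades as real data accumulates.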

Code

import numpy as np
import matplotlib.pyplot as plt
from Comparing_epsilons import run_experiment as run_experiment_eps

class Bandit:
    """A single slot machine with true mean m; starts with an optimistic estimate of that mean."""

    def __init__(self, m, upper_limit=10):
        self.m = m                 # true mean (unknown in "real life")
        self.mean = upper_limit    # optimistic initial estimate of the bandit's mean
        self.N = 1                 # the optimistic value counts as one "phantom" sample

    def pull(self):
        # simulates pulling the bandit's arm:
        # every bandit's reward is Gaussian with unit variance around its true mean
        return np.random.randn() + self.m


    def update(self, x):
        # incremental mean update using the latest sample x received from this bandit
        self.N += 1
        self.mean = (1 - 1.0/self.N)*self.mean + 1.0/self.N*x

def run_experiment(m1, m2, m3, N):
    # compare three bandits with three different true means, greedy strategy only
    bandits = [Bandit(m1), Bandit(m2), Bandit(m3)]

    data = np.empty(N)

    for i in range(N):
        # greedy: choose the bandit with the best current (optimistic) sample mean
        j = np.argmax([b.mean for b in bandits])
        x = bandits[j].pull()
        bandits[j].update(x)
        # record the reward for the plot
        data[i] = x
    # cumulative average reward after every play
    cumulative_average = np.cumsum(data) / (np.arange(N) + 1)

    plt.plot(cumulative_average)
    plt.plot(np.ones(N)*m1)
    plt.plot(np.ones(N)*m2)
    plt.plot(np.ones(N)*m3)
    plt.xscale('log')
    plt.show()

    for b in bandits:
        print(b.mean)

    return cumulative_average

if __name__ == '__main__':
    c_1 = run_experiment_eps(1.0, 2.0, 3.0, 0.1, 100000)
    oiv = run_experiment(1.0, 2.0, 3.0, 100000)

    # log scale plot
    plt.plot(c_1, label='eps = 0.1')
    plt.plot(oiv, label='optimistic')
    plt.legend()
    plt.xscale('log')
    plt.show()

    # linear plot
    plt.plot(c_1, label='eps = 0.1')
    plt.plot(oiv, label='optimistic')
    plt.legend()
    plt.show()

