3. Multi-Armed Bandit
20 Sep 2019 | Reinforcement Learning
Multi-Armed Bandit
- First encountered in Bayesian Machine Learning (probabilistic algorithms): A/B testing (e.g., you go to a casino and must choose between 3 slot machines)
- Each machine (for now) gives a reward of 0 or 1
- The win rate of each machine is unknown (e.g., it could be 0.1, 0.3, 0.5)
- Goal: maximize our winnings
- Problem: we can only discover the best bandit by collecting data
- If we were psychic, we could just play the bandit with the highest win rate
- Balance exploring (collecting data) and exploiting (playing the "best-so-far" machine)
1. Traditional A/B Testing
- Predetermine the number of times we need to play / collect data in order to establish statistical significance
- The number of plays needed depends on several things, such as the difference between the win rates of the bandits (but if we knew that, we wouldn't be running the test in the first place; one of the odd ideas of traditional statistics). A quick sample-size sketch follows this list.
- Important rule: don't stop the test early. Even if we set N = 1000 and after 500 trials we see a 90% win rate vs. a 10% win rate, we still shouldn't stop
- This is very different from human behavior
- Suppose we're playing in a real casino and we get 2/3 on one bandit and 0/3 on the other
- We have a strong urge to play the first bandit
- As data scientists we know that 3 is a small sample size
- As gamblers we don't "feel" that way
- We'll look at ways to make the explore-exploit trade-off systematically (explained next)
- Better than these 2 suboptimal solutions: A/B testing and human emotion
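To make the sample-size point above concrete, here is a minimal sketch (not part of the original notes) using the standard two-proportion z-test sample-size formula; the function name samples_per_group and the example win rates are illustrative assumptions.

import numpy as np
from scipy.stats import norm

def samples_per_group(p1, p2, alpha=0.05, power=0.8):
    """Approximate plays needed per bandit to detect win rate p1 vs p2."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for a two-sided test
    z_beta = norm.ppf(power)            # critical value for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int(np.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2))

print(samples_per_group(0.10, 0.11))  # a small difference needs many plays
print(samples_per_group(0.10, 0.30))  # a large difference needs far fewer

A small difference in win rates requires orders of magnitude more plays than a large one, which is exactly the circularity noted above: we need the answer in order to size the test.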
2. Applications of Explore-Exploit
- Already discussed in Bayesian Machine Learning
- If you haven't seen it, we cover it here
- Lots of scary math! How does it apply in the "real world"?
- The casino example seems contrived
- The explore-exploit dilemma (and the subsequent algorithms intended to "solve" it) is one of the most, if not the most, "real world" concepts in Machine Learning
- Almost every business and online entrepreneur can make use of it
3. Comparing Things
- Quantitatively comparing things applies to almost any business (e.g., we are Samsung, selling the Note8)
- We have 2 ads we really like
- Which one is better? The one more likely to lead the user to purchase the phone (naturally)
- We can measure the click-through rate (CTR) of each ad (e.g., 10 clicks / 1000 impressions = 1% CTR)
- How do we measure the CTR? Run an experiment
- Show the first ad to 1 million users and the second ad to another 1 million users
- Calculate the CTRs and compare
- How do I know 1 million is the correct number?
4. The Nature of Probability
- We would need an infinite number of samples to get an absolutely precise estimate of the CTR
- As we collect more samples, the confidence interval shrinks (see the sketch after this list)
- Therefore, the more samples we collect, the better
- But if one advertisement is better, that means the other is worse
- If I show the suboptimal ad 1 million times, I've wasted 1 million impressions to measure a suboptimal CTR
- My desire to show only the best advertisement (and hence make more money) is fundamentally at odds with my desire to have an accurate CTR
- Which is needed to know which advertisement is "best" in the first place
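Here is a minimal simulation sketch (added here, not from the original notes) of how the 95% confidence interval around a CTR estimate shrinks as impressions grow; the true CTR of 1% and the sample sizes are illustrative.

import numpy as np

true_ctr = 0.01
rng = np.random.default_rng(0)

for n in [1_000, 10_000, 100_000, 1_000_000]:
    clicks = rng.binomial(n, true_ctr)                     # simulated number of clicks
    p_hat = clicks / n                                     # estimated CTR
    half_width = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)   # 95% CI half-width (normal approximation)
    print(f"n={n:>9,}  CTR estimate={p_hat:.4f}  95% CI = +/- {half_width:.4f}")

The interval width only shrinks like 1/sqrt(n), so each extra digit of precision costs roughly 100x more impressions, many of which go to the worse ad.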
5. Who Cares?
- Samsung is not the only company selling things
- Almost every modern business uses online advertising
- But we don't have to be an advertiser; we can simply be someone who owns a website
- Compare viewing time on the old design vs. the new design
- Is the "Buy Now" button better at the top or the bottom of the sales page?
- Which "price ending" leads to the most conversions? ₩2000 vs. ₩1999
- Numerically, these are all the same problem
- Note: click-through rate and conversion rate are the same thing to us, just a generic "success rate"
6. Gaussian vs. Bernoulli
- Bernoulli rewards are good for modeling success rates (0/1 outcomes)
- Gaussian rewards are good for continuous variables (e.g., viewing time); a tiny sketch of both follows
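A tiny sketch (illustrative values, not from the original notes) of drawing one reward from each model; the win rate p and the mean m are made up, and the Gaussian case with unit variance matches the simulation code further down.

import numpy as np

rng = np.random.default_rng(0)

# Bernoulli bandit: the reward is 0 or 1 with an unknown win rate p
p = 0.3
bernoulli_reward = 1 if rng.random() < p else 0

# Gaussian bandit: the reward is continuous, centered on an unknown mean m
m = 2.0
gaussian_reward = rng.normal(loc=m, scale=1.0)  # unit variance, as in the code below

print(bernoulli_reward, gaussian_reward)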
7. Epsilon-Greedy
- A solution to the explore-exploit dilemma
- Choose a small number ε as the probability of exploration
- Typical values: 5%, 10%
p = random()
if p < eps:
    pull random arm
else:
    pull current-best arm
- Eventually, we'll discover which arm is the true best, since this strategy lets us update every arm's estimate
- In the long run, it allows us to explore each arm an infinite number of times
- Problem: we reach a point where we keep exploring even when we don't need to
- With ε = 10%, we'll continue to spend 10% of the time doing suboptimal things
- We could run the A/B test up to some predetermined time, then set ε = 0
- But we'll learn better ways to adapt
8. Estimating Bandit Rewards
- Now assume bandit rewards are not just coin tosses (they can be any number)
- The general way to keep track of a reward is the sample mean
- It works for 0/1 rewards too:
\[p(k=1) = \frac{\#\text{ of 1's}}{\#\text{ of 1's} + \#\text{ of 0's}} = \frac{\#\text{ of 1's}}{N}\]
- p is the probability of success
- The naive approach requires keeping track of every sample X
- Can we be more efficient? Yes!
- The new mean can be calculated from the old mean, as shown below
- This "form" is very important in Reinforcement Learning
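Filling in the algebra implied by "as shown below": starting from the definition of the sample mean, the incremental update is

\[\bar{X}_N = \frac{1}{N}\sum_{i=1}^{N} X_i = \frac{(N-1)\,\bar{X}_{N-1} + X_N}{N} = \left(1 - \frac{1}{N}\right)\bar{X}_{N-1} + \frac{1}{N} X_N\]

We only need to store the current estimate and the count N; this is exactly the update used in the update() method of the code below.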
9. Let's recap how we do this in supervised learning (Naive Bayes, Decision Trees, Neural Networks)
class MyModel:
    def fit(self, X, Y):
        # our job
        pass

    def predict(self, X):
        # our job
        pass
- Example boilerplate:
Xtrain, Ytrain, Xtest, Ytest = get_data()  # 1. Get data
model = MyModel()                          # 2. Instantiate model
model.fit(Xtrain, Ytrain)                  # 3. Train
model.score(Xtest, Ytest)                  # 4. Evaluate
10. Designing Our Bandit Program
- Although this is not supervised learning, there is still a pattern to be followed
- Reminder: a "casino machine" is just an analogy for real-life applications
- Serving advertisements you hope users will click on
- Testing different website designs to see which keeps users engaged longest
class MyAwesomeCasinoMachine:
    def pull(self):
        # simulates drawing from the true distribution,
        # which we wouldn't know in "real life"
        pass

for t in range(max_iterations):
    # pick a casino machine to play based on the algorithm
    # update the algorithm's parameters
    # plot useful info (average reward, best machine, etc.)
    pass
import numpy as np
import matplotlib.pyplot as plt


class Bandit:
    """One slot machine with true mean m and a running estimate of that mean."""
    def __init__(self, m):
        self.m = m        # true mean (unknown in "real life")
        self.mean = 0     # our current estimate of the bandit's mean
        self.N = 0        # number of times this bandit has been played

    def pull(self):
        # simulate pulling the arm: each reward is Gaussian with unit variance
        return np.random.randn() + self.m

    def update(self, x):
        # x is the latest sample received from this bandit
        self.N += 1
        self.mean = (1 - 1.0/self.N)*self.mean + 1.0/self.N*x


def run_experiment(m1, m2, m3, eps, N):
    # compare three bandits with different true means
    bandits = [Bandit(m1), Bandit(m2), Bandit(m3)]

    data = np.empty(N)
    for i in range(N):
        # epsilon-greedy
        p = np.random.random()
        if p < eps:
            j = np.random.choice(3)                    # explore: pick a bandit at random
        else:
            j = np.argmax([b.mean for b in bandits])   # exploit: best current sample mean
        x = bandits[j].pull()
        bandits[j].update(x)

        # for the plot
        data[i] = x

    # cumulative average reward after every play
    cumulative_average = np.cumsum(data) / (np.arange(N) + 1)

    plt.plot(cumulative_average)
    plt.plot(np.ones(N)*m1)
    plt.plot(np.ones(N)*m2)
    plt.plot(np.ones(N)*m3)
    plt.xscale('log')
    plt.show()

    for b in bandits:
        print(b.mean)

    return cumulative_average


if __name__ == '__main__':
    c_1 = run_experiment(1.0, 2.0, 3.0, 0.1, 100000)
    c_05 = run_experiment(1.0, 2.0, 3.0, 0.05, 100000)
    c_01 = run_experiment(1.0, 2.0, 3.0, 0.01, 100000)

    # log scale plot
    plt.plot(c_1, label='eps = 0.1')
    plt.plot(c_05, label='eps = 0.05')
    plt.plot(c_01, label='eps = 0.01')
    plt.legend()
    plt.xscale('log')
    plt.show()

    # linear plot
    plt.plot(c_1, label='eps = 0.1')
    plt.plot(c_05, label='eps = 0.05')
    plt.plot(c_01, label='eps = 0.01')
    plt.legend()
    plt.show()
11. Optimistic Initial Values
- Another simple way of addressing the explore-exploit dilemma
- Suppose we know the true mean of each bandit is ≪ 10
- Pick a high ceiling as the initial mean estimate
- Before, we set the initial mean estimate to 0; this time we set it to 10
- The initial sample mean is "too good to be true"
- All collected data will cause it to go down
- If the true mean is 1, the sample mean will still approach 1 as we collect more samples
- All estimates eventually settle at their true values (exploitation)
- This helps exploration as well
- If we haven't explored a bandit, its estimate remains high, which causes us to explore it more
- In the main loop: a greedy strategy only
- But using the optimistic estimates
- Early on, the estimate reflects the optimistic initial value more than the actual sample mean
Code
import numpy as np
import matplotlib.pyplot as plt
from Comparing_epsilons import run_experiment as run_experiment_eps


class Bandit:
    """One slot machine with true mean m and an optimistic estimate of that mean."""
    def __init__(self, m, upper_limit=10):
        self.m = m                 # true mean (unknown in "real life")
        self.mean = upper_limit    # optimistic initial estimate of the bandit's mean
        self.N = 1                 # count the optimistic value as one "sample"

    def pull(self):
        # simulate pulling the arm: each reward is Gaussian with unit variance
        return np.random.randn() + self.m

    def update(self, x):
        # x is the latest sample received from this bandit
        self.N += 1
        self.mean = (1 - 1.0/self.N)*self.mean + 1.0/self.N*x


def run_experiment(m1, m2, m3, N):
    # compare three bandits with different true means
    bandits = [Bandit(m1), Bandit(m2), Bandit(m3)]

    data = np.empty(N)
    for i in range(N):
        # purely greedy: pick the bandit with the best current (optimistic) estimate
        j = np.argmax([b.mean for b in bandits])
        x = bandits[j].pull()
        bandits[j].update(x)

        # for the plot
        data[i] = x

    # cumulative average reward after every play
    cumulative_average = np.cumsum(data) / (np.arange(N) + 1)

    plt.plot(cumulative_average)
    plt.plot(np.ones(N)*m1)
    plt.plot(np.ones(N)*m2)
    plt.plot(np.ones(N)*m3)
    plt.xscale('log')
    plt.show()

    for b in bandits:
        print(b.mean)

    return cumulative_average


if __name__ == '__main__':
    c_1 = run_experiment_eps(1.0, 2.0, 3.0, 0.1, 100000)
    oiv = run_experiment(1.0, 2.0, 3.0, 100000)

    # log scale plot
    plt.plot(c_1, label='eps = 0.1')
    plt.plot(oiv, label='optimistic')
    plt.legend()
    plt.xscale('log')
    plt.show()

    # linear plot
    plt.plot(c_1, label='eps = 0.1')
    plt.plot(oiv, label='optimistic')
    plt.legend()
    plt.show()
Reference:
Artificial Intelligence Reinforcement Learning