Nonstationary Bandits

there equations will show up again and again
what is stationary?
A stationary process is one whose statistics don’t change over rime(e.g mean)
weak-sense stationary: mean(1st order statistic) and autocovariance(2nd order statistic) don’t change overtime
- if don’t know covariance let’s check it in here
strong-sense stationary: entire PDF doesn’t change over time
SSS -> WSS
what if our bandits are not stationary? does it make sense to calculate the mean as we have been?
Mean Update Equation
from earlier:
some rearranging for convenience:
Replace 1/t:
Alpha can be anything(even constant). looks kind of like gradient descent
One adaptive learning rate used 1/t

Low pass filter

should look familiar:
Low-pass filter studied in ML related to RNN(i will post ML too in future)
Get rid of recurrence(复发):
Q has exponentially decaying dependence on X
More emphasis on recent value

Convergence Criteria for Q

Q will converge if :
think back to calculus 2(두번쨰)
constant alpha does not converge
1/t does
if problem is Nonstationary, we don’t want Convergence(融合)(equal weighting of all samples doesn’t make sense) so we will see constant alpha used
Does this converge?
No. Sum of alphas is not infinity
Does this convergence?
Nonstationary system은 모델 파라미터 Θ가 시간의 변화에 따라서 상수가 변한다. 광고를 예시를 들면, 광고에 대한 방문자의 선호도는 제품에 대한 소문, 잦은 광고, 노출에 따른 여러가지 이유로 변할 수 있기 때문이다. 이러한 환경의 MAB문제를 Non-stationary 문제라고 한다. 시간에 흐름에 따라 α+β값이 증가함으로 Θ에 대한 추정이 어려워진다.

# From the course: Bayesin Machine Learning in Python: A/B Testing
# https://deeplearningcourses.com/c/bayesian-machine-learning-in-python-ab-testing
# https://www.udemy.com/bayesian-machine-learning-in-python-ab-testing
from __future__ import print_function, division
from builtins import range
# Note: you may need to update your version of future
# sudo pip install -U future


import matplotlib.pyplot as plt
import numpy as np
from bayesian_bandit import Bandit


def run_experiment(p1, p2, p3, N):
  bandits = [Bandit(p1), Bandit(p2), Bandit(p3)]

  data = np.empty(N)

  for i in range(N):
    # thompson sampling
    j = np.argmax([b.sample() for b in bandits])
    x = bandits[j].pull()
    bandits[j].update(x)

    # for the plot
    data[i] = x
  cumulative_average_ctr = np.cumsum(data) / (np.arange(N) + 1)

  # plot moving average ctr
  plt.plot(cumulative_average_ctr)
  plt.plot(np.ones(N)*p1)
  plt.plot(np.ones(N)*p2)
  plt.plot(np.ones(N)*p3)
  plt.ylim((0,1))
  plt.xscale('log')
  plt.show()


run_experiment(0.2, 0.25, 0.3, 100000)

Reference:

Artificial Intelligence Reinforcement Learning

Advance AI : Deep-Reinforcement Learning

Cutting-Edge Deep-Reinforcement Learning

for Robot Artificial Inteligence

5. Nonstationary bandit

22 Sep 2019 | Reinforcement Learning

Nonstationary Bandits

Mean Update Equation

Low pass filter

Convergence Criteria for Q

Comments

for Robot Artificial Inteligence

5. Nonstationary bandit

22 Sep 2019 | Reinforcement Learning (adsbygoogle = window.adsbygoogle || []).push({});

Nonstationary Bandits

Mean Update Equation

Low pass filter

Convergence Criteria for Q

Comments

22 Sep 2019 | Reinforcement Learning