[RL Series] Multi-Armed Bandits

Introduction

Data is an important part of machine learning. In fact, the very definition of machine learning is: “algorithms that help computers learn how to perform tasks from data.” In modern machine learning algorithms, there are two main ways of collecting data. The first is to obtain labeled data (for example, identifying the species of an animal in an image), or unlabeled data generated by humans (such as text or images on the Internet). This approach has the advantage that the collected data often contains rich information that is useful for learning, since it has already been filtered and curated by humans. However, in some cases, collecting data in this way is very difficult. For example, it is hard to explicitly teach a robot all the detailed steps required to cook an arbitrary dish in an arbitrary kitchen. In such cases, another approach is to let the machine collect data by itself. This type of data may be noisy or imperfect—for instance, a robot might spill dishes or burn a fried egg—but the system can learn from its own mistakes and gradually improve over time. During this process, the machine interacts with the environment to gain “experience.” Algorithms that learn through such interaction are called reinforcement learning algorithms.

Multi-Armed Bandit Problem

The multi-arm bandit problem is an example of a setting where the agent must learn through trial and error. This problem is inspired by the mechanism of a slot machine (one-armed bandit) (Wiki): a player inserts a bet and pulls a lever, and then a random outcome is generated that determines the reward received from that play.

In the multi-armed bandit problem, the agent has $k$ choices (options/actions) corresponding to $k$ different machines (arms). Each choice leads to a reward from the selected machine. Note that:

The reward at each play is stochastic and depends on the chosen machine.
Different machines may have different reward distributions.
The reward distribution of each machine remains fixed over time (stationary) or may change over time (non-stationary).

Assume we have a limit of $N$ plays, and the goal is to maximize the total reward. The problem would be easy if we knew the expected reward (value of action) of all machines: we would simply choose the machine with the highest expected reward. However, we can only estimate these values by trying the arms. Trying different arms to estimate their values is called exploration, and if at a given step we instead choose the arm with the highest estimated value, this is called exploitation. Since the number of plays is limited, the main challenge is how to balance exploration and exploitation. If we do not explore enough, we may end up selecting a suboptimal machine. On the other hand, if we explore too much, we miss the opportunity to exploit the best-known machine. This trade-off also appears in real life, for example when deciding whether to try a new restaurant or continue eating at a familiar one. Next, we will study methods to address this challenge.

Estimating the value of each machine

Above, we discussed estimating the value of actions. Now we go into more detail. One simple estimation method is to take the average of all rewards received from machine $a$ up to the current time. More precisely, let $t$ be the number of plays so far, $N_t(a)$ be the number of times machine $a$ has been selected after $t$ plays, and $R_1, \dots, R_{N_t(a)}$ be the rewards obtained from those plays of machine $a$. Then we define: $\begin{equation} Q_t(a) = \frac{R_1 + R_2 + \dots + R_{N_t(a)}}{N_t(a)} \end{equation}$ Here, $Q_t(a)$ is the estimated expected value of machine $a$ after $t$ plays. If machine $a$ has never been tried before, i.e., $N_t(a) = 0$, we can assign $Q_t(a)$ a default value (for example, 0). As we try an arm more often and $N_t(a)$ increases, we become more confident that the estimated value approaches the true expected value.

Selecting actions from estimates – Method 1: Greedy

After estimating the reward of each machine, the next step is to make a decision, i.e., choose which machine to pull next. A simple rule is to select the machine with the highest estimated value:

\[\begin{equation} A_t = \arg\max_a Q_t(a) \end{equation}\]

This method is called greedy, because it always aims to maximize immediate reward. However, this can lead to a situation where we keep choosing a machine that initially had a high estimated value, even though this estimate may not be accurate, while the truly optimal machine is underestimated.

Selecting actions from estimates – Method 2: $\epsilon$-greedy

Another approach is to act greedily most of the time, but with probability $\epsilon$, we explore by selecting a random machine. This ensures that over time, every machine is sampled more evenly, and the value estimates become more accurate.

Example: 10-armed testbed

Now let us compare these methods in a specific scenario: suppose we have 10 machines, denoted by $a \in {1,2,\dots,10}$, each with a true expected reward $q(a)$. At time step $t$, if machine $A_t$ is selected, the reward $R_t$ is generated based on $q(A_t)$ plus some random noise. We assume this noise follows a normal distribution with mean 0 and variance 1. For analysis convenience, at initialization we also assign each $q(a)$ randomly from a normal distribution with mean 0 and variance 1. However, $q(a)$ remains fixed throughout the process.

An illustration of the example. We have 10 machines (or 10 arms), where each machine has a true expected reward $q(a)$, shown as a red dashed line. The blue curve represents the reward distribution after adding random noise. Each black dot corresponds to the actual reward received each time the corresponding machine is pulled. We observe that these sampled rewards can sometimes deviate significantly from the true expectation $q(a)$, which motivates the need to try each arm multiple times in order to obtain more accurate estimates.

An illustration of this setup is shown in Figure 10armtestbed, where machine 7 is the best arm with $q(7)=1.58$. In Figure fig-1sample_per_arm, we show the rewards obtained after pulling each arm once (10 total plays). Based only on this small sample, we might incorrectly conclude that machine 4 is the best and underestimate machine 7. In this case, the greedy algorithm may continue exploiting machine 4 and ignore machine 7, leading to a suboptimal total reward. So the question is: does the $\epsilon$-greedy method suffer from this same issue?

Illustration of the rewards obtained after pulling each arm once.

Exercise

Assume we have three methods: greedy, 0.1-greedy, and 0.01-greedy applied to the example in Figure \ref{fig:10armtestbed_viz}. After a very large number of plays, which method will achieve a higher total reward and select the optimal machine more often? Answer this by giving a quantitative analysis.

Solution

When the number of plays becomes very large, i.e., $t \to \infty$, the $\epsilon$-greedy algorithms ensure that every arm is selected infinitely often, meaning:

\[N_t(a) \to \infty\]

As a result, the value estimates become increasingly accurate:

\[Q_t(a) \to q(a)\]

This implies that for sufficiently large $t$, the best arm according to the estimated values converges to the true optimal arm.

However, when comparing 0.1-greedy and 0.01-greedy, the 0.1-greedy method explores more (10% of the time vs. 1%). This allows it to discover the optimal arm faster, but it also introduces more long-term exploration cost when $t$ is very large.

From this perspective, when $t$ is large:

0.01-greedy exploits more often and therefore selects the optimal arm with probability approximately 99.1%
0.1-greedy selects the optimal arm with probability approximately 91%

Thus, for very large $t$, 0.01-greedy achieves higher total reward and selects the optimal machine more frequently than 0.1-greedy.

Vietnamese version.

Giới thiệu

Dữ liệu là một phần quan trọng trong học máy, bản chất định nghĩa của học máy cũng là: “các thuật toán giúp máy tính học cách thực hiện công việc \textit{từ dữ liệu}”. Trong các thuật toán học máy hiện nay có 2 cách thu thập dữ liệu chủ yếu. Một là lấy dữ liệu từ quá trình gán nhãn (ví dụ như chỉ ra tên loài vật có trong ảnh), hoặc dữ liệu không nhãn được sinh ra bởi con người (ví dụ như văn bản hay hình ảnh trên Internet). Cách này có điểm lợi là các dữ liệu thu thập được thường chứa nhiều thông tin giúp ích cho việc học của máy, do đã qua quá trình chắt lọc bởi con người. Nhưng trong một số trường hợp, việc thu thập dữ liệu như vậy rất khó khăn. Ví dụ như rất khó để chỉ cho robot chi tiết biết tất cả các hành động để nấu một món ăn bất kì, trong một căn bếp bất kì. Trong những trường hợp như vậy, có một cách khác là để máy tính tự thu thập dữ liệu. Dữ liệu như vậy có thể có lỗi, ví dụ như robot có thể làm đổ bát đĩa hoặc làm cháy quả trứng ốp la, nhưng ta có thể để máy học từ lỗi sai của bản thân và cải thiện dần qua thời gian. Trong quá trình này, máy tương tác với môi trường để lấy “kinh nghiệm”. Và các thuật toán học thông qua tương tác được gọi là thuật toán học tăng cường (reinforcement learning).

Bài toán “Máy đánh bạc nhiều tay” (multi-arm bandits) là một ví dụ cho trường hợp mà máy phải học bằng cách thử và sai như vậy. Bài toán này dựa trên cơ chế của máy đánh bạc dạng kéo cần (Wiki): người chơi chỉ cần đưa vào một khoản tiền cược và kéo cần gạt, sau đó một tổ hợp ngẫu nhiên được được xoay ra ám chỉ khoản tiền mà người chơi nhận được từ lần chơi đó. Trong bài toán máy đánh bạc nhiều tay, người tham gia có $k$ lựa chọn (options/actions) ứng với $k$ máy khác nhau. Mỗi lựa chọn sẽ sẽ dẫn đến một phần thưởng (reward) từ máy tương ứng. Lưu ý rằng

Giá trị phần thưởng ở mỗi lần chơi là ngẫu nhiên và phụ thuộc vào máy được chọn.
Các máy khác nhau có thể có phân phối phần thưởng khác nhau.
Phân phối phần thưởng ở mỗi máy được giữ nguyên qua các lần chơi (stationary). Giả sử ta có giới hạn là $N$ lần chơi, mục tiêu là làm thế nào để có thể tối đa hoá phần thưởng sau cùng.

Câu hỏi trên sẽ trở nên dễ dàng nếu ta biết giá trị phần thưởng kì vọng của tất cả các máy (value of action): đơn giản là chọn máy có kì vọng phần thưởng cao nhất. Thay vào đó, ta chỉ có thể ước lượng các giá trị này thông qua việc thử các cần gạt. Việc thử kéo các cần để ước lượng như vậy được gọi là khám phá (exploration) và, nếu ở một bước nào đó thay vì khám phá, ta chọn cần gạt ứng với giá trị ước lượng cao nhất hiện tại, thì đó được gọi là khai thác (exploitation). Do chỉ có số lượng lần chơi hữu hạn, điều khó khăn là làm thế nào để cân bằng giữa việc khám phá và khai thác. Nếu không khám phá đủ nhiều, sẽ có khả năng cao máy ta chọn chưa phải là tốt nhất. Còn nếu khám phá quá nhiều, ta sẽ lỡ cơ hội để khai thác. Điều này cũng diễn ra nhiều trong thực tế, ví dụ như khi ta phải lựa chọn giữa việc đi ăn thử ở một quán mới hay tiếp tục ăn ở quán ta đã quen rồi. Tiếp theo ta sẽ tìm hiểu phương pháp giúp giải quyết khó khăn này.

Ước lượng giá trị của mỗi máy: Ở trên ta có nói về việc ước lượng giá trị của hành động, bây giờ ta sẽ bàn cụ thể hơn. Một phép ước lượng ta có thể làm là lấy trung bình của phần thưởng thu được từ máy $a$ cho đến thời điểm hiện tại. Để cụ thể hơn, ta gọi $t$ là số lần đã chơi, $N_t(a)$ là số lần chọn máy $a$ sau $t$ lần chơi, và $R_1,\dots,R_{N(a)}$ là phần thưởng (reward) ở mỗi lần thử máy $a$ như vậy. Từ đó, ta có công thức như sau:

\[\begin{equation} Q_t(a) = \frac{R_1 + R_2 + \dots + R_{N_t(a)}}{N_{t}(a)} \end{equation}\]

ở đây, $Q_t(a)$ là giá trị kì vọng ta ước lượng được cho máy $a$ sau $t$ lần chơi. Nếu chưa từng thử $a$ hay $N_t(a) = 0$, ta có thể đặt $Q_t(a)$ bằng một giá trị mặc định nào đó (ví dụ như bằng 0). Như vậy khi thử càng nhiều và $N_t(a)$ tăng lên, ta sẽ chắc chắn hơn rằng giá trị kì vọng ước lượng gần giá trị kì vọng thực tế.

Lựa chọn cần gạt từ ước lượng - Cách 1: Tham lam.

Sau khi có ước lượng cho phần thưởng từ mỗi máy, điều tiếp theo là đưa ra \textit{quyết định (decision)} hay lựa chọn máy để kéo ở lần tiếp theo. Một nguyên tắc đơn giản có thể làm theo là lựa chọn máy có giá trị ước lượng cao nhất, hay $\begin{equation} A_t = {\arg\max}_{a} Q_t(a) \end{equation}$ Lựa chọn theo cách này được gọi là \textit{tham lam (greedy)}. Do mục đích là tối ưu cái lợi (phần thưởng) trước mắt. Điều này dẫn đến khả năng ta luôn chọn một máy có giá trị ước lượng cao lúc đầu nhưng ước lượng này có thể chưa đủ tốt và máy tối ưu thực tế đang bị đánh giá thấp.

Lựa chọn cần gạt từ ước lượng - Cách 2: Hơi tham lam ($\epsilon$-greedy).

Một nguyên tắc khác là ta chỉ tham lam trong hầu hết thời gian và để dành một số lần (với xác xuất $\epsilon$) để khám phá ngẫu nhiên một máy. Điều này đảm bảo rằng khi chơi lâu hơn, số lần chơi ở mỗi máy sẽ đều được tăng lên và giá trị ước lượng sẽ ngày càng chính xác hơn.

Ví dụ với 10 máy đánh bạc (10-armed testbed).

Bây giờ ta hãy thử so sánh các phương pháp trên trong một tình huống cụ thể: Giả sử ta có 10 máy đánh cược, gọi là $a$ với $a={1,2,\dots,10}$, mỗi máy sẽ có kỳ vọng phần thưởng thực tế là $q(a)$. Ở lần chơi thứ $t$, giả sử máy $A_t$ được chọn, phần thưởng $R_t$ được trả về sẽ dựa trên $q(A_t)$ và được cộng thêm (hoặc bớt đi) một khoản ngẫu nhiên. Ta giả sử khoản thêm bớt này tuân theo phân phối chuẩn với trung bình là 0 và phương sai là 1. Để tiện phân tích, ở thời điểm khởi tạo, ta cũng sẽ đặt $q(a)$ ngẫu nhiên từ phân phối chuẩn với trung bình là 0 và phương sai là 1. Nhưng lưu ý rằng trong quá trình chơi, $q(a)$ sẽ được giữ nguyên như thời điểm khởi tạo. Một minh hoạ cho ví dụ này được cho trong hình \ref{fig:10armtestbed_viz}, nhận thấy rằng máy số 7 là máy có kì vọng thực tế tốt nhất với $q(7)=1.58$. Trong hình \ref{fig:1sample_per_arm}, ta có ví dụ về phần thưởng thu được nếu kéo mỗi cần 1 lần (tổng cộng 10 lần chơi). Nếu chỉ dựa vào đây ta có thể lầm tưởng rằng máy số 4 là máy tốt nhất và máy số 7 bị đánh thấp. Trong trường hợp này thuật toán tham lam ở trên có thể tiếp tục khai thác máy số 4 và bỏ qua máy số 7, dẫn đến tổng phần thưởng cuối cùng không được tối ưu. Vậy còn thuật toán hơi tham lam có bị vấn đề này không?

\begin{figure} \centering \includegraphics[width=1.0\linewidth]{10armtestbed_viz.png} \caption{Minh hoạ cho ví dụ được cho. Ta có 10 máy (hay 10 cần gạt (arms)), mỗi máy có 1 \textcolor{red}{kì vọng phần thưởng $q(a)$} được biểu diễn bằng đường thẳng đứt đoạn màu đỏ. Đường cong màu xanh biểu diễn \textcolor{blue}{phân phối của phần thưởng} nhận được sau khi thêm bớt một khoản ngẫu nhiên. Mỗi điểm màu đen thể hiện phần thưởng thực tế nhận được sau mỗi lần kéo cần gặt máy tương ứng. Nhận thấy rằng các phần thưởng thực tế này đôi khi ở xa kì vọng $q(a)$ nên ta cần thử nhiều lần để có thể ước lượng chính xác hơn.} \label{fig:10armtestbed_viz} \end{figure}

\begin{figure} \centering \includegraphics[width=1\linewidth]{1sample_per_arm.png} \caption{Minh hoạ cho phần thưởng thu được nếu kéo mỗi cần 1 lần.} \label{fig:1sample_per_arm} \end{figure}

Bài tập. Giả sử ta có 3 phương pháp: tham lam (greedy), 0.1-tham lam ($0.1-$greedy), và 0.01-tham lam ($0.01-$greedy) được áp dụng cho ví dụ trong Hình \ref{fig:10armtestbed_viz}, sau một số lượng rất lớn lần chơi, phương pháp nào sẽ cho ta tổng phần thưởng lớn hơn và số lần chọn máy tốt nhất nhiều hơn? Hãy trả lời bằng cách phân tích định lượng.

Lời giải: Khi ta tăng số lần chơi lên rất lớn hay vô cùng $t\to \infty$. Các thuật toán hơi tham lam đảm bảo rằng số lần kéo cần của mỗi máy cũng sẽ tăng lên $N_t(a) \to \infty$ và từ đó ước lượng của ta sẽ ngày càng chính xác hơn $Q_t(a) \to q(a)$. Điều giúp các thuật toán hơi tham lam đảm bảo rằng với $t$ đủ lớn, máy tốt nhất theo ước lượng cũng chính là máy tốt nhất thực tế. Tuy nhiên, so sánh giữa $0.1-$greedy và $0.01-$greedy, phương pháp $0.1-$greedy khám phá nhiều hơn (10\% so với (1\%)) điều này giúp $0.1-$greedy tìm ra máy tối ưu nhanh hơn nhưng sẽ là lãng phí khi $t$ rất lớn. Từ đây ta thấy rằng khi $t$ lớn, $0.01-$greedy sẽ chọn máy tối ưu với xác xuất $99.1\%$ trong khi $0.1-$greedy chọn máy tối ưu với xác xuất $91\%$.