Wednesday, July 30, 2025

Simple Guide to Multi-Armed Bandits: A Key Concept Before Reinforcement Learning


How can a system make smart decisions when it starts out knowing nothing and can only learn by trial and error?

That is exactly what one of the simplest but most important models in reinforcement learning is all about:

A multi-armed bandit is a simple model for learning by trial and error.

Just like we do.

We'll explore why the choice between trying something new (exploration) and sticking with what works (exploitation) is trickier than it seems. And what this has to do with AI, online ads and A/B testing.

Visualization by ChatGPT 4o.

Why is it important to understand this concept?

The multi-armed bandit introduces one of the core dilemmas of reinforcement learning: how to make good decisions under uncertainty.

It is not only relevant for AI, data science and behavioral models, but also because it reflects how we humans learn by trial and error.

What machines learn by trial and error is not so different from what we humans do intuitively.

The difference?

Machines do it in a mathematically optimized way.

Let's look at a simple example:

We are standing in front of a slot machine. This machine has 10 arms, and each of these arms has an unknown chance of winning.

Some levers give higher rewards, others lower ones.

We can pull the levers as often as we like, but our goal is to win as much as possible.

This means we have to find out which arm is the best (= yields the most profit) without knowing from the start which one it is.
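
As a minimal sketch in Python (with hypothetical win probabilities), such a 10-armed bandit could look like this: each arm pays out according to a hidden probability that the player never gets to see.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical 10-armed bandit: every arm has its own hidden chance of winning.
true_win_probs = rng.uniform(0.1, 0.9, size=10)

def pull(arm: int) -> int:
    """Pull one arm and return 1 (win) or 0 (loss) according to its hidden probability."""
    return int(rng.random() < true_win_probs[arm])

# The player only observes the outcomes, never true_win_probs.
print(pull(3))
```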

The model is very reminiscent of what we often experience in everyday life:

We try out different strategies. At some point, we stick with the one that brings us the most pleasure, enjoyment, money, etc. Whatever it is that we are aiming for.

In behavioral psychology, we speak of trial-and-error learning.

Or we can also think of reward learning in cognitive psychology: animals in a laboratory experiment find out over time which lever dispenses food, because they get the highest gain at that particular lever.

Now back to the concept of multi-armed bandits:

It serves as an introduction to decision-making under uncertainty and is a cornerstone for understanding reinforcement learning.

I wrote about reinforcement learning (RL) in detail in the last article "Reinforcement Learning Made Simple: Build a Q-Learning Agent in Python". But at its core, it is about an agent learning to make good decisions through trial and error. It is a subfield of machine learning. The agent finds itself in an environment, decides on certain actions and receives rewards or penalties for them. The goal of the agent is to develop a strategy (policy) that maximizes the long-term overall benefit.

So with the multi-armed bandit, we have to find out:

  1. Which levers are worthwhile in the long run?
  2. When should we keep exploiting a lever (exploitation)?
  3. When should we try out a new lever (exploration)?

These last two questions lead us directly to the central dilemma of reinforcement learning:

Central dilemma in Reinforcement Learning: Exploration vs. Exploitation

Have you ever held on to a good option, only to find out later that there was a better one? That's exploitation winning over exploration.

This is the core problem of learning from experience:

  • Exploration: We try something new in order to learn more. Maybe we discover something better. Or maybe not.
  • Exploitation: We use the best of what we have learned so far, with the intention of gaining as much reward as possible.

The problem with this?

We never know for sure whether we have already found the best option.

Choosing the arm with the highest reward so far means relying on what we know. This is called exploitation. However, if we commit too early to a seemingly good arm, we may overlook an even better option.

Trying a different or rarely used arm gives us new information. We gain more knowledge. This is exploration. We might find a better option. But it may also be that we find a worse one.

That is the dilemma at the heart of reinforcement learning.

If we only exploit too early, we may miss out on the better arms (here arm 3 instead of arm 1). However, too much exploration also leads to less overall yield (if we already know that arm 1 is good).
Visualization by the author.

What we can conclude from this:

If we only exploit too early, we may miss out on the better arms (here arm 3 instead of arm 1). However, too much exploration also leads to less overall yield (if we already know that arm 1 is good).

Let me explain the same thing again in non-techy language (but somewhat simplified):

Let's imagine we know a good restaurant. We have gone to the same restaurant for 10 years because we like it. But what if there is a better, cheaper place just around the corner? And we have never tried it? If we never try something new, we will never find out.

Interestingly, this is not just a problem in AI. It is well known in psychology and economics too:

The exploration vs. exploitation dilemma is a prime example of decision-making under uncertainty.

The psychologist and Nobel Prize winner Daniel Kahneman and his colleague Amos Tversky have shown that people often do not make rational decisions when faced with uncertainty. Instead, we follow heuristics, i.e. mental shortcuts.

These shortcuts typically reflect either habit (=exploitation) or curiosity (=exploration). It is exactly this dynamic that can be seen in the multi-armed bandit:

  • Do we play it safe (=known arm with high reward)
    or
  • do we risk something new (=new arm with unknown reward)?

Why does this matter for reinforcement learning?

We face the dilemma between exploration and exploitation everywhere in reinforcement learning (RL).

An RL agent must constantly decide whether it should stick with what has worked best so far (=exploitation) or try something new to discover even better strategies (=exploration).

You can see this trade-off in action in recommendation systems: should we keep showing users content they already like, or risk suggesting something new they might love?

And what strategies are there to select the best arm? Action selection strategies

Action selection strategies determine how an agent decides which arm to select in the next step. In other words, how an agent deals with the exploration vs. exploitation dilemma.

Each of the following strategies (also called policies/rules) answers one simple question: How do we choose the next action when we don't know for sure what is best?

Strategy 1 – Greedy

This is the simplest strategy: we always choose the arm with the highest estimated reward (= the highest Q(a)). In other words, always go for what seems best right now.

The advantage of this strategy is that the reward is maximized in the short term and that the strategy is very simple.

The disadvantage is that there is no exploration. No risk is taken to try something new, because the current best always wins. The agent might miss better options that it simply hasn't discovered yet.

The formal rule is as follows: always choose the arm a with the highest current estimate, a = argmax Q(a).

Let's look at a simplified example:

Imagine we try two new pizzerias, and the second one is quite good. From then on, we only return to that one, even though there are six more we have never tried. Maybe we are missing out on the best pizza in town. But we will never know.
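
As a rough sketch (the array name q_estimates and the example values are purely illustrative), greedy selection boils down to a single argmax over the current estimates:

```python
import numpy as np

def select_greedy(q_estimates: np.ndarray) -> int:
    """Greedy rule: always pick the arm with the highest current estimate Q(a)."""
    return int(np.argmax(q_estimates))

# Example: arm 1 currently looks best, so greedy will keep choosing it.
q_estimates = np.array([0.0, 1.2, 0.8, 0.3])
print(select_greedy(q_estimates))  # -> 1
```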

Strategy 2 – ε-Greedy:

Instead of always choosing the best-known option, this strategy allows for some randomness:

  • With probability ε, we explore (try something new).
  • With probability 1-ε, we exploit (stick with the current best).

This strategy deliberately mixes chance into the decision and is therefore practical and often effective.

  • The higher ε is chosen, the more exploration happens.
  • The lower ε is chosen, the more we exploit what we already know.

For example, if ε = 0.1, exploration occurs in 10% of cases, while exploitation occurs in 90% of cases.
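
Sketched in Python (again with illustrative names and values), the rule needs just one random draw per decision:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def select_epsilon_greedy(q_estimates: np.ndarray, epsilon: float = 0.1) -> int:
    """With probability epsilon pick a random arm (explore), otherwise the best-known arm (exploit)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_estimates)))  # exploration: any arm
    return int(np.argmax(q_estimates))              # exploitation: current best

q_estimates = np.array([0.0, 1.2, 0.8, 0.3])
print(select_epsilon_greedy(q_estimates, epsilon=0.1))
```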

The advantage of ε-Greedy is that it is easy to implement and provides good baseline performance.

The disadvantage is that choosing the right ε is difficult: if ε is too large, a lot of exploration takes place and the loss of rewards can be too great. If ε is too small, there is little exploration.

If we stick with the pizza example:

We roll a die before every restaurant visit. If we get a 6, we try out a new pizzeria. If not, we go to the usual pizzeria.

Strategy 3 – Optimistic Initial Values:

The point of this strategy is that all Q0(a) start with artificially high values (e.g. 5.0 instead of 0.0). At the beginning, the agent assumes all options are great.

This encourages the agent to try everything (exploration). It wants to disprove the high initial value. As soon as an action has been tried, the agent sees that it is worth less and adjusts the estimate downwards.

The advantage of this strategy is that exploration happens automatically. It is particularly suitable in deterministic environments where rewards do not change.

The disadvantage is that the strategy works poorly if the rewards are already high: if the real rewards are close to the optimistic starting values, there is little left to "disprove" and the built-in exploration effect disappears.

If we look at the restaurant example again, we would rate every new restaurant with 5 stars at the beginning. As we try them, we adjust the ratings based on real experience.
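
A possible sketch of this idea, assuming a sample-average update (introduced in the next section): all estimates start at 5.0 and real rewards gradually pull them down.

```python
import numpy as np

n_arms = 10
q_estimates = np.full(n_arms, 5.0)   # optimistic start: every arm looks great at first
counts = np.zeros(n_arms)

def update(arm: int, reward: float) -> None:
    """Pull the inflated estimate of this arm toward the rewards actually observed."""
    counts[arm] += 1
    q_estimates[arm] += (reward - q_estimates[arm]) / counts[arm]

# A greedy agent on top of these estimates will try every arm at least once,
# because untried arms still carry the optimistic value of 5.0.
update(arm=2, reward=1.0)
print(q_estimates[2])  # already lower than 5.0 after a single pull
```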

To put it simply, Greedy is pure routine behavior. ε-Greedy is a mixture of habit and curiosity. Optimistic Initial Values is comparable to when a child initially thinks every new toy is great – until it has tried it out.


On my Substack Data Science Espresso, I regularly share practical guides and bite-sized updates from the world of Data Science, Python, AI, Machine Learning and Tech — made for curious minds like yours. Take a look — and subscribe if you want to stay in the loop.


How the agent learns which options are worthwhile: Estimating Q-values

For an agent to make good decisions, it must estimate how good each individual arm is. It needs to find out which arm will bring the highest reward in the long run.

However, the agent does not know the true reward distribution.

This means the agent must estimate the average reward of each arm based on experience. The more often an arm is drawn, the more reliable this estimate becomes.

We use an estimated value Q(a) for this:

Q(a) ≈ expected reward if we choose arm a

Our aim is for our estimated value Qt(a) to get better and better, until it comes as close as possible to the true value q(a):

The agent should learn from its experience in such a way that its estimated value Qt(a) eventually matches the long-run average profit of arm a.

Let's look again at our simple restaurant example:

Imagine we want to find out how good a particular café is. Every time we go there, we give some feedback, for example by rating it 3, 4 or 5 stars. Our goal is for the perceived average to eventually match the real average that we would get if we went there infinitely often.

There are two basic methods by which an agent calculates this Q value:

Methods (Sample average & Incremental update) to estimate Q-Values in Multi-Armed Bandits and Reinforcement Learning
Visualization by the author.

Method 1 – Sample average method

This method calculates the average of the observed rewards and is really as simple as it sounds.

All previous rewards for this arm are considered and the average is calculated: Q(a) = (R1 + R2 + … + Rn) / n, where

  • n: number of times arm a was chosen
  • Ri: reward received the i-th time

The advantage of this method is that it is simple and intuitive, and it is statistically correct for stable, stationary problems.

The disadvantage is that it reacts too slowly to changes, especially in non-stationary environments where conditions shift over time.

For example, consider a music recommendation system: a user might suddenly develop a new taste. The user used to prefer rock, but now they listen to jazz. If the system keeps averaging over all past preferences, it reacts very slowly to this change.

Similarly, in the multi-armed bandit setting, if arm 3 suddenly starts giving much better rewards from round 100 onwards, the running average will be too slow to reflect that. The early data still dominates and hides the improvement.
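
In code, the sample average is nothing more than summing the rewards observed for an arm and dividing by their count (a minimal sketch with made-up ratings):

```python
def sample_average(rewards: list[float]) -> float:
    """Q(a) as the plain average of all rewards observed for this arm so far."""
    return sum(rewards) / len(rewards)

# e.g. star ratings collected for one café / one arm
rewards_arm_a = [3, 4, 5, 4]
print(sample_average(rewards_arm_a))  # 4.0
```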

Method 2 – Incremental Implementation

Here the Q value is adjusted immediately with each new reward – without saving all previous data: Qn+1(a) = Qn(a) + α · (Rn − Qn(a)), where

  • α: learning rate (0 < α ≤ 1)
  • Rn: newly observed reward
  • Qn(a): previous estimated value
  • Qn+1(a): updated estimated value

If the environment is stable and rewards do not change, the sample average method works best. But if things change over time, the incremental method with a constant learning rate α adapts more quickly.
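
A minimal sketch of this update rule with a constant learning rate α: each new reward nudges the old estimate by a fraction of the difference between reward and estimate.

```python
def incremental_update(q_old: float, reward: float, alpha: float = 0.1) -> float:
    """Q_{n+1} = Q_n + alpha * (R_n - Q_n): step a fraction alpha toward the new reward."""
    return q_old + alpha * (reward - q_old)

q = 0.0
for r in [1.0, 0.0, 1.0, 1.0]:   # stream of rewards for one arm
    q = incremental_update(q, r, alpha=0.1)
print(round(q, 3))  # recent rewards weigh more than they would in a plain average
```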

Before Reinforcement Learning: Understand the Multi-Armed Bandit
Personal visualization — Illustrations from unDraw.com.

Final Thoughts: What do we need it for?

Multi-armed bandits are the basis for many real-world applications such as recommendation engines or online advertising.

At the same time, they are the perfect stepping stone into reinforcement learning. They teach us the mindset: learning from feedback, acting under uncertainty and balancing exploration and exploitation.

Technically, multi-armed bandits are a simplified form of reinforcement learning: there are no states, no future planning, only the rewards right now. But the logic behind them shows up again and again in advanced methods like Q-learning, policy gradients, and deep reinforcement learning.


Curious to go further?
On my Substack Data Science Espresso, I share guides like this one, breaking down complex AI topics into digestible, practicable steps. If you enjoyed this, subscribe here to stay in the loop.

