
modelType: "epsilon_greedy" — the simplest workable bandit. On most calls, exploit the arm with the highest observed mean reward. With probability ε, pick a random arm to explore. ε decays over time so exploration tapers off as evidence accumulates.

When to use

  • You want a simple, debuggable bandit baseline before reaching for Thompson.
  • You can tolerate the exploration noise — on average, one in every 1/ε calls is essentially random.
  • You need an explicit exploration rate knob — operators can set ε directly, unlike Thompson where exploration is implicit in posterior variance.
Skip it when you have customer features (use a contextual bandit or neural_cf) or when sample efficiency matters (Thompson dominates ε-greedy asymptotically).

The math

ε_t = ε / (1 + decayRate × totalPulls)

if random() < ε_t:
  pick random arm                    # explore
else:
  pick argmax(mean_reward_per_arm)    # exploit

mean_reward_i = totalReward_i / pulls_i
# Optimistic initialization: arms with pulls_i = 0 get score 1.0
After each interaction:
totalReward_i += reward    # 1 for positive, 0 for negative
pulls_i      += 1
totalPulls   += 1
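
A minimal TypeScript sketch of the rules above, assuming per-arm state shaped like the fixture below; ArmState and selectArm are illustrative names, not the engine's API:

// Sketch of the selection rule; assumes non-empty arms.
interface ArmState { totalReward: number; pulls: number }

function selectArm(
  arms: Record<string, ArmState>,
  epsilon: number,
  decayRate: number,
  totalPulls: number
): string {
  const ids = Object.keys(arms);
  const epsilonT = epsilon / (1 + decayRate * totalPulls);  // decayed exploration rate ε_t
  if (Math.random() < epsilonT) {
    return ids[Math.floor(Math.random() * ids.length)];     // explore: uniform random arm
  }
  // exploit: highest observed mean reward; unpulled arms score 1.0 (optimistic init)
  const score = (a: ArmState) => (a.pulls === 0 ? 1.0 : a.totalReward / a.pulls);
  return ids.reduce((best, id) => (score(arms[id]) > score(arms[best]) ? id : best), ids[0]);
}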

Fixture config

{
  "modelType": "epsilon_greedy",
  "modelState": {
    "arms": {
      "off-travel":   { "totalReward": 30, "pulls": 100 },
      "off-cashback": { "totalReward": 65, "pulls": 100 },
      "off-nofee":    { "totalReward": 18, "pulls": 100 }
    },
    "epsilon": 0.1,
    "decayRate": 0.001,
    "totalPulls": 300
  }
}
Mean rewards: travel 0.30, cashback 0.65, nofee 0.18. ε_t = 0.1 / (1 + 0.001 × 300) = 0.077. So 7.7% of calls explore randomly; 92.3% exploit and pick cashback. The proof script disables exploration (ε=0) for deterministic verification: cashback wins with score 0.650, travel 0.300, nofee 0.180 — exactly the means.
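
A quick check of that arithmetic against the fixture's modelState (standalone TypeScript, not engine code):

// Recompute the walkthrough numbers from the fixture above.
const state = {
  arms: {
    "off-travel":   { totalReward: 30, pulls: 100 },
    "off-cashback": { totalReward: 65, pulls: 100 },
    "off-nofee":    { totalReward: 18, pulls: 100 },
  },
  epsilon: 0.1,
  decayRate: 0.001,
  totalPulls: 300,
};

const epsilonT = state.epsilon / (1 + state.decayRate * state.totalPulls);
console.log(epsilonT.toFixed(3));                        // 0.077
for (const [id, a] of Object.entries(state.arms)) {
  console.log(id, (a.totalReward / a.pulls).toFixed(3)); // 0.300, 0.650, 0.180
}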

Training

Same as Thompson — updates happen via POST /api/v1/respond. The engine’s auto-learn.ts increments totalReward and pulls per arm. To bootstrap an unexplored arm: leave pulls = 0 and rely on optimistic initialization (score = 1.0). The first random exploration that picks it provides initial data.
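
As a sketch, the per-interaction update amounts to the following; recordOutcome is a hypothetical name, not the actual auto-learn.ts export:

// Mirrors the update step from "The math"; creates missing arms at zero
// so optimistic initialization applies until the first pull.
function recordOutcome(
  state: { arms: Record<string, { totalReward: number; pulls: number }>; totalPulls: number },
  armId: string,
  reward: 0 | 1            // 1 for a positive interaction, 0 for a negative one
): void {
  const arm = (state.arms[armId] ??= { totalReward: 0, pulls: 0 });
  arm.totalReward += reward;
  arm.pulls += 1;
  state.totalPulls += 1;
}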

Score interpretation

  • score = totalReward_i / pulls_i for arms with pulls > 0.
  • score = 1.0 for unpulled arms (optimistic init — gets explored at least once).
  • score = random() for ALL arms during an explore call.
The output is rankable but not a probability — interpret as a noisy estimate of arm conversion rate. PRIE composition is not meaningful (mean rewards aren’t on the same scale as propensities); use method: propensity only.
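
Put together, the scoring rules amount to something like this sketch (scoreArms is illustrative, not the engine's exact code):

// Score every arm per the three rules above.
function scoreArms(
  arms: Record<string, { totalReward: number; pulls: number }>,
  exploring: boolean       // true when this call was chosen to explore
): Record<string, number> {
  const scores: Record<string, number> = {};
  for (const [id, arm] of Object.entries(arms)) {
    if (exploring) {
      scores[id] = Math.random();                // explore call: noise for every arm
    } else if (arm.pulls === 0) {
      scores[id] = 1.0;                          // optimistic initialization
    } else {
      scores[id] = arm.totalReward / arm.pulls;  // observed mean reward
    }
  }
  return scores;
}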

Pitfalls

  • ε too high — at ε = 0.2, one in five calls is random. If your traffic is millions of impressions, that’s hundreds of thousands of wasted opportunities. Default to 0.05–0.10.
  • ε too low — at ε = 0.001, exploration barely happens and the model can’t recover from an early bad estimate. If you suspect this, force-reset via POST /api/v1/algorithm-models/<id>/reset-offer and warm-start from a Thompson prior.
  • decayRate too aggressive — decayRate = 0.01 halves ε_t after 100 pulls. Combine that with a small base ε and you'll have effectively no exploration after a few thousand requests. Calibrate against expected total traffic; see the sketch after this list.
  • Random number quality — the engine uses Math.random() which is not cryptographically random but is good enough for traffic-allocation noise. Don’t repurpose this model for security decisions.
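
One way to calibrate decayRate: solve ε / (1 + decayRate × n) = floor for n, the pull count at which exploration drops below a rate you still consider useful. A sketch, with a hypothetical helper name:

// Pulls after which the decayed rate ε_t falls below `floor`.
// From ε / (1 + decayRate × n) = floor  follows  n = (ε / floor - 1) / decayRate.
function pullsUntilEpsilonBelow(epsilon: number, decayRate: number, floor: number): number {
  return Math.ceil((epsilon / floor - 1) / decayRate);
}

console.log(pullsUntilEpsilonBelow(0.1, 0.01, 0.05));   // 100 (ε_t halves from 0.1 to 0.05)
console.log(pullsUntilEpsilonBelow(0.1, 0.001, 0.01));  // 9000 pulls until ε_t < 0.01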

Cross-reference