
modelType: "epsilon_greedy" — the simplest workable bandit. On most calls, exploit the arm with the highest observed mean reward. With probability ε, pick a random arm to explore. ε decays over time so exploration tapers off as evidence accumulates.

When to use

  • You want a simple, debuggable bandit baseline before reaching for Thompson.
  • You can tolerate the exploration noise — on average, one in every 1/ε calls is essentially random.
  • You need an explicit exploration rate knob — operators can set ε directly, unlike Thompson where exploration is implicit in posterior variance.
Skip it when you have customer features (use a contextual bandit or neural_cf) or when sample efficiency matters (Thompson dominates ε-greedy asymptotically).

The math

ε_t = ε / (1 + decayRate × totalPulls)

if random() < ε_t:
  pick random arm                    # explore
else:
  pick argmax(mean_reward_per_arm)    # exploit

mean_reward_i = totalReward_i / pulls_i
# Optimistic initialization: arms with pulls_i = 0 get score 1.0
After each interaction:
totalReward_i += reward    # 1 for positive, 0 for negative
pulls_i      += 1
totalPulls   += 1
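
A minimal TypeScript sketch of the rules above, assuming per-arm state shaped like the fixture below; ArmState and selectArm are illustrative names, not the engine's API:

// Sketch of the selection rule; assumes non-empty arms.
interface ArmState { totalReward: number; pulls: number }

function selectArm(
  arms: Record<string, ArmState>,
  epsilon: number,
  decayRate: number,
  totalPulls: number
): string {
  const ids = Object.keys(arms);
  const epsilonT = epsilon / (1 + decayRate * totalPulls);  // decayed exploration rate ε_t
  if (Math.random() < epsilonT) {
    return ids[Math.floor(Math.random() * ids.length)];     // explore: uniform random arm
  }
  // exploit: highest observed mean reward; unpulled arms score 1.0 (optimistic init)
  const score = (a: ArmState) => (a.pulls === 0 ? 1.0 : a.totalReward / a.pulls);
  return ids.reduce((best, id) => (score(arms[id]) > score(arms[best]) ? id : best), ids[0]);
}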

Fixture config

{
  "modelType": "epsilon_greedy",
  "modelState": {
    "arms": {
      "off-travel":   { "totalReward": 30, "pulls": 100 },
      "off-cashback": { "totalReward": 65, "pulls": 100 },
      "off-nofee":    { "totalReward": 18, "pulls": 100 }
    },
    "epsilon": 0.1,
    "decayRate": 0.001,
    "totalPulls": 300
  }
}
Mean rewards: travel 0.30, cashback 0.65, nofee 0.18. ε_t = 0.1 / (1 + 0.001 × 300) = 0.077. So 7.7% of calls explore randomly; 92.3% exploit and pick cashback. The proof script disables exploration (ε=0) for deterministic verification: cashback wins with score 0.650, travel 0.300, nofee 0.180 — exactly the means.
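
A quick check of that arithmetic against the fixture's modelState (standalone TypeScript, not engine code):

// Recompute the walkthrough numbers from the fixture above.
const state = {
  arms: {
    "off-travel":   { totalReward: 30, pulls: 100 },
    "off-cashback": { totalReward: 65, pulls: 100 },
    "off-nofee":    { totalReward: 18, pulls: 100 },
  },
  epsilon: 0.1,
  decayRate: 0.001,
  totalPulls: 300,
};

const epsilonT = state.epsilon / (1 + state.decayRate * state.totalPulls);
console.log(epsilonT.toFixed(3));                        // 0.077
for (const [id, a] of Object.entries(state.arms)) {
  console.log(id, (a.totalReward / a.pulls).toFixed(3)); // 0.300, 0.650, 0.180
}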

Training

Same as Thompson — updates happen via POST /api/v1/respond. The engine’s auto-learn.ts increments totalReward and pulls per arm. To bootstrap an unexplored arm: leave pulls = 0 and rely on optimistic initialization (score = 1.0). The first random exploration that picks it provides initial data.
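
As a sketch, the per-interaction update amounts to the following; recordOutcome is a hypothetical name, not the actual auto-learn.ts export:

// Mirrors the update step from "The math"; creates missing arms at zero
// so optimistic initialization applies until the first pull.
function recordOutcome(
  state: { arms: Record<string, { totalReward: number; pulls: number }>; totalPulls: number },
  armId: string,
  reward: 0 | 1            // 1 for a positive interaction, 0 for a negative one
): void {
  const arm = (state.arms[armId] ??= { totalReward: 0, pulls: 0 });
  arm.totalReward += reward;
  arm.pulls += 1;
  state.totalPulls += 1;
}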

Score interpretation

  • score = totalReward_i / pulls_i for arms with pulls > 0.
  • score = 1.0 for unpulled arms (optimistic init — gets explored at least once).
  • score = random() for ALL arms during an explore call.
The output is rankable but not a probability — interpret as a noisy estimate of arm conversion rate. PRIE composition is not meaningful (mean rewards aren’t on the same scale as propensities); use method: propensity only.
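
Put together, the scoring rules amount to something like this sketch (scoreArms is illustrative, not the engine's exact code):

// Score every arm per the three rules above.
function scoreArms(
  arms: Record<string, { totalReward: number; pulls: number }>,
  exploring: boolean       // true when this call was chosen to explore
): Record<string, number> {
  const scores: Record<string, number> = {};
  for (const [id, arm] of Object.entries(arms)) {
    if (exploring) {
      scores[id] = Math.random();                // explore call: noise for every arm
    } else if (arm.pulls === 0) {
      scores[id] = 1.0;                          // optimistic initialization
    } else {
      scores[id] = arm.totalReward / arm.pulls;  // observed mean reward
    }
  }
  return scores;
}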

Pitfalls

  • ε too high — at ε = 0.2, one in five calls is random. If your traffic is millions of impressions, that’s hundreds of thousands of wasted opportunities. Default to 0.05–0.10.
  • ε too low — at ε = 0.001, exploration barely happens and the model can’t recover from an early bad estimate. If you suspect this, force-reset via POST /api/v1/algorithm-models/<id>/reset-offer and warm-start from a Thompson prior.
  • decayRate too aggressive — decayRate = 0.01 halves ε_t after 100 pulls. Combine that with a small base ε and you'll have effectively no exploration after a few thousand requests. Calibrate against expected total traffic; see the sketch after this list.
  • Random number quality — the engine uses Math.random() which is not cryptographically random but is good enough for traffic-allocation noise. Don’t repurpose this model for security decisions.
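
One way to calibrate decayRate: solve ε / (1 + decayRate × n) = floor for n, the pull count at which exploration drops below a rate you still consider useful. A sketch, with a hypothetical helper name:

// Pulls after which the decayed rate ε_t falls below `floor`.
// From ε / (1 + decayRate × n) = floor  follows  n = (ε / floor - 1) / decayRate.
function pullsUntilEpsilonBelow(epsilon: number, decayRate: number, floor: number): number {
  return Math.ceil((epsilon / floor - 1) / decayRate);
}

console.log(pullsUntilEpsilonBelow(0.1, 0.01, 0.05));   // 100 (ε_t halves from 0.1 to 0.05)
console.log(pullsUntilEpsilonBelow(0.1, 0.001, 0.01));  // 9000 pulls until ε_t < 0.01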

Cross-reference