modelType: "thompson_bandit" — a multi-armed bandit where each offer (“arm”) maintains a Beta posterior distribution over its true conversion rate. To score, the engine samples from each arm’s posterior and ranks by sample value. High-evidence arms cluster tightly around their true rate; low-evidence arms have wider posteriors, so they occasionally sample high and get explored.

When to use

  • Small set of offers (typically ≤ 50) competing for the same eyeballs.
  • No useful customer features — Thompson uses no inputs other than the offer ID and its win/loss history.
  • Continuous traffic flow — bandits update on every interaction.
  • Goal is maximum cumulative conversion — Thompson minimizes regret asymptotically.

Skip it when customer features matter — Thompson treats every customer the same. Reach for contextual-bandit techniques (or neural_cf for explicit user × offer modeling) when context carries signal.

The math

For each arm i:
  posterior_i  ~  Beta(α_i, β_i)
  sample_i      = sampleBeta(α_i, β_i)

ranked_arms = sort(arms, by: sample_i, desc)

# After observing outcome for arm i:
α_i += (positive ? 1 : 0)
β_i += (positive ? 0 : 1)
Beta is the conjugate prior for Bernoulli — the posterior after n_pos positives and n_neg negatives starting from α₀ = β₀ = 1 (uniform prior) is Beta(α₀ + n_pos, β₀ + n_neg). Mean = α / (α + β). Variance shrinks as α + β grows.
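
For illustration, here is a minimal TypeScript sketch of the same score-and-update loop. This is not the engine's implementation; the type and function names (ArmState, rankArms) are illustrative, and the Beta sampler uses the order-statistic trick, which is valid here because the +1 updates above keep α and β integers.

type ArmState = { alpha: number; beta: number };

// Beta(a, b) sampler for positive integers a, b: the a-th smallest of
// (a + b - 1) independent Uniform(0, 1) draws is Beta(a, b)-distributed.
function sampleBeta(a: number, b: number): number {
  const draws = Array.from({ length: a + b - 1 }, () => Math.random());
  draws.sort((x, y) => x - y);
  return draws[a - 1];
}

// One scoring pass: draw a posterior sample per arm, rank descending.
function rankArms(arms: Record<string, ArmState>): [string, number][] {
  return Object.entries(arms)
    .map(([id, s]): [string, number] => [id, sampleBeta(s.alpha, s.beta)])
    .sort((x, y) => y[1] - x[1]);
}

// After observing an outcome for the chosen arm:
function updateArm(arm: ArmState, positive: boolean): void {
  if (positive) arm.alpha += 1;
  else arm.beta += 1;
}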

Fixture config

{
  "modelType": "thompson_bandit",
  "modelState": {
    "arms": {
      "off-travel":   { "alpha": 22, "beta": 80 },
      "off-cashback": { "alpha": 65, "beta": 38 },
      "off-nofee":    { "alpha": 12, "beta": 90 }
    }
  }
}
Mean conversion rates: travel ≈ 22% (22 / 102), cashback ≈ 63% (65 / 103), nofee ≈ 12% (12 / 102). The proof script runs 1000 Thompson draws and confirms cashback wins 100% of them — its posterior is concentrated far above the others. With less evidence (smaller α + β), the draws would mix in more exploration.
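
You can reproduce that check with the rankArms sketch from The math section above (illustrative only, not the actual proof script):

const arms: Record<string, ArmState> = {
  "off-travel":   { alpha: 22, beta: 80 },
  "off-cashback": { alpha: 65, beta: 38 },
  "off-nofee":    { alpha: 12, beta: 90 },
};

let cashbackWins = 0;
for (let i = 0; i < 1000; i++) {
  if (rankArms(arms)[0][0] === "off-cashback") cashbackWins++;
}
console.log(`off-cashback won ${cashbackWins} of 1000 draws`); // expect ~1000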

Training

There’s no batch training. Updates happen via POST /api/v1/respond — every outcome increments the relevant arm’s α or β. The engine’s auto-learn.ts handles this in the respond pipeline, so no retrain endpoint is needed. To bootstrap, seed each arm with α = β = 1 (uniform prior). To inject domain priors, seed α = expected_rate × N and β = (1 - expected_rate) × N, giving the arm N “virtual samples” of expected behavior.
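
For example (offer IDs hypothetical), an arm you expect to convert at ~8% with N = 50 virtual samples gets α = 0.08 × 50 = 4 and β = 0.92 × 50 = 46, while an arm with no prior knowledge starts at the uniform prior:

{
  "modelType": "thompson_bandit",
  "modelState": {
    "arms": {
      "off-known":   { "alpha": 4, "beta": 46 },
      "off-unknown": { "alpha": 1, "beta": 1 }
    }
  }
}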

Score interpretation

  • score for any one call is a Beta sample — stochastic by design.
  • Over many requests, the chosen arm distribution converges to the optimal arm asymptotically (Thompson regret is O(log T)).
  • explanations[] contains all arm scores for the call, so traces can show the full draw.

Pitfalls

  • Combining with PRIE — Thompson’s stochastic sampling doesn’t compose well with PRIE’s geometric mean. Use method: propensity (raw Thompson score = propensity) rather than method: formula.
  • Sample-size starvation — once an arm accumulates 50+ negative-only outcomes (α=1, β=51), it almost never gets explored. The propensity floor helps, but the right fix is to reset arms periodically (POST /api/v1/algorithm-models/<id>/reset-offer) when business conditions change.
  • Reward latency — if conversions take days to fire back as outcomes, Thompson runs blind during the lag. Pair with a short-term proxy reward (e.g. click-through) for faster feedback.
  • Cross-customer drift — Thompson assumes a stationary conversion rate per arm. If demand shifts (seasonality, campaign changes), the bandit lags. Decay old evidence by aging α/β toward the prior over time (not built-in; implement in auto-learn — see the sketch after this list).
  • Arms that should never compete — Thompson treats every arm as exchangeable. If you have a “premium” arm that should only show to specific customers, encode the gating via Qualify rules — Thompson runs only over the surviving candidates.
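
One way to handle the drift pitfall — a sketch only, assuming you call it on a schedule from auto-learn.ts; the decayArm name and gamma value are illustrative, not engine API:

type ArmState = { alpha: number; beta: number };

// Age an arm's evidence toward the uniform prior (alpha = beta = 1).
// gamma = 1 keeps all history; smaller gamma forgets old outcomes faster.
function decayArm(arm: ArmState, gamma = 0.98): void {
  arm.alpha = 1 + gamma * (arm.alpha - 1);
  arm.beta  = 1 + gamma * (arm.beta - 1);
}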

Cross-reference