modelType: "thompson_bandit" — a multi-armed bandit where each offer (“arm”) maintains a Beta posterior distribution over its true conversion rate. To score, the engine samples from each arm’s posterior and ranks by sample value. High-evidence arms cluster tightly around their true rate; low-evidence arms have wider posteriors, so they occasionally sample high and get explored.

When to use

  • Small set of offers (typically ≤ 50) competing for the same eyeballs.
  • No useful customer features — Thompson uses no inputs other than the offer ID and its win/loss history.
  • Continuous traffic flow — bandits update on every interaction.
  • Goal is maximum cumulative conversion — Thompson minimizes regret asymptotically.

Skip it when customer features matter — Thompson treats every customer the same. Reach for contextual-bandit techniques (or neural_cf for explicit user × offer modeling) when context carries signal.

The math

For each arm i:
  posterior_i  ~  Beta(α_i, β_i)
  sample_i      = sampleBeta(α_i, β_i)

ranked_arms = sort(arms, by: sample_i, desc)

# After observing outcome for arm i:
α_i += (positive ? 1 : 0)
β_i += (positive ? 0 : 1)
Beta is the conjugate prior for Bernoulli — the posterior after n_pos positives and n_neg negatives starting from α₀ = β₀ = 1 (uniform prior) is Beta(α₀ + n_pos, β₀ + n_neg). Mean = α / (α + β). Variance shrinks as α + β grows.
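
For illustration, here is a minimal TypeScript sketch of the same score-and-update loop. This is not the engine's implementation; the type and function names (ArmState, rankArms) are illustrative, and the Beta sampler uses the order-statistic trick, which is valid here because the +1 updates above keep α and β integers.

type ArmState = { alpha: number; beta: number };

// Beta(a, b) sampler for positive integers a, b: the a-th smallest of
// (a + b - 1) independent Uniform(0, 1) draws is Beta(a, b)-distributed.
function sampleBeta(a: number, b: number): number {
  const draws = Array.from({ length: a + b - 1 }, () => Math.random());
  draws.sort((x, y) => x - y);
  return draws[a - 1];
}

// One scoring pass: draw a posterior sample per arm, rank descending.
function rankArms(arms: Record<string, ArmState>): [string, number][] {
  return Object.entries(arms)
    .map(([id, s]): [string, number] => [id, sampleBeta(s.alpha, s.beta)])
    .sort((x, y) => y[1] - x[1]);
}

// After observing an outcome for the chosen arm:
function updateArm(arm: ArmState, positive: boolean): void {
  if (positive) arm.alpha += 1;
  else arm.beta += 1;
}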

Fixture config

{
  "modelType": "thompson_bandit",
  "modelState": {
    "arms": {
      "off-travel":   { "alpha": 22, "beta": 80 },
      "off-cashback": { "alpha": 65, "beta": 38 },
      "off-nofee":    { "alpha": 12, "beta": 90 }
    }
  }
}
Mean conversion rates: travel ≈ 22% (22 / 102), cashback ≈ 63% (65 / 103), nofee ≈ 12% (12 / 102). The proof script runs 1000 Thompson draws and confirms cashback wins 100% of them — its posterior is concentrated far above the others. With less evidence (smaller α + β), the draws would mix in more exploration.
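
You can reproduce that check with the rankArms sketch from The math section above (illustrative only, not the actual proof script):

const arms: Record<string, ArmState> = {
  "off-travel":   { alpha: 22, beta: 80 },
  "off-cashback": { alpha: 65, beta: 38 },
  "off-nofee":    { alpha: 12, beta: 90 },
};

let cashbackWins = 0;
for (let i = 0; i < 1000; i++) {
  if (rankArms(arms)[0][0] === "off-cashback") cashbackWins++;
}
console.log(`off-cashback won ${cashbackWins} of 1000 draws`); // expect ~1000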

Training

There’s no batch training. Updates happen via POST /api/v1/respond — every outcome increments the relevant arm’s α or β. The engine’s auto-learn.ts handles this in the respond pipeline, so no retrain endpoint is needed. To bootstrap, seed each arm with α = β = 1 (uniform prior). To inject domain priors, seed α = expected_rate × N and β = (1 - expected_rate) × N, giving the arm N “virtual samples” of expected behavior.
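
For example (offer IDs hypothetical), an arm you expect to convert at ~8% with N = 50 virtual samples gets α = 0.08 × 50 = 4 and β = 0.92 × 50 = 46, while an arm with no prior knowledge starts at the uniform prior:

{
  "modelType": "thompson_bandit",
  "modelState": {
    "arms": {
      "off-known":   { "alpha": 4, "beta": 46 },
      "off-unknown": { "alpha": 1, "beta": 1 }
    }
  }
}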

Score interpretation

  • score for any one call is a Beta sample — stochastic by design.
  • Over many requests, the chosen arm distribution converges to the optimal arm asymptotically (Thompson regret is O(log T)).
  • explanations[] contains all arm scores for the call, so traces can show the full draw.

Pitfalls

  • Combining with PRIE — Thompson’s stochastic sampling doesn’t compose well with PRIE’s geometric mean. Use method: propensity (raw Thompson score = propensity) rather than method: formula.
  • Sample-size starvation — once an arm accumulates 50+ negative-only outcomes (α=1, β=51), it almost never gets explored. The propensity floor helps, but the right fix is to reset arms periodically (POST /api/v1/algorithm-models/<id>/reset-offer) when business conditions change.
  • Reward latency — if conversions take days to fire back as outcomes, Thompson runs blind during the lag. Pair with a short-term proxy reward (e.g. click-through) for faster feedback.
  • Cross-customer drift — Thompson assumes a stationary conversion rate per arm. If demand shifts (seasonality, campaign changes), the bandit lags. Decay old evidence by aging α/β toward the prior over time (not built-in; implement in auto-learn — see the sketch after this list).
  • Arms that should never compete — Thompson treats every arm as exchangeable. If you have a “premium” arm that should only show to specific customers, encode the gating via Qualify rules — Thompson runs only over the surviving candidates.
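
One way to handle the drift pitfall — a sketch only, assuming you call it on a schedule from auto-learn.ts; the decayArm name and gamma value are illustrative, not engine API:

type ArmState = { alpha: number; beta: number };

// Age an arm's evidence toward the uniform prior (alpha = beta = 1).
// gamma = 1 keeps all history; smaller gamma forgets old outcomes faster.
function decayArm(arm: ArmState, gamma = 0.98): void {
  arm.alpha = 1 + gamma * (arm.alpha - 1);
  arm.beta  = 1 + gamma * (arm.beta - 1);
}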

Cross-reference