modelType: "thompson_bandit" — a multi-armed bandit where each offer (“arm”) maintains a Beta posterior distribution over its true conversion rate. To score, the engine samples from each arm’s posterior and ranks by sample value. High-evidence arms cluster tightly around their true rate; low-evidence arms have wider posteriors, so they occasionally sample high and get explored.
When to use
- Small set of offers (typically ≤ 50) competing for the same eyeballs.
- No useful customer features — Thompson uses no inputs other than the offer ID and its win/loss history.
- Continuous traffic flow — bandits update on every interaction.
- Goal is maximum cumulative conversion — Thompson minimizes regret asymptotically.
Otherwise prefer a contextual model (e.g. neural_cf for explicit user × offer modeling) when context carries signal.
The math
The posterior for an arm with n_pos positives and n_neg negatives, starting from α₀ = β₀ = 1 (uniform prior), is Beta(α₀ + n_pos, β₀ + n_neg). The mean is α / (α + β), and the variance shrinks as α + β grows.
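A small helper that captures this update; the variance expression below is the standard Beta variance, written out for reference.

```typescript
// Posterior summary for one arm under the uniform Beta(1, 1) prior.
function posteriorSummary(nPos: number, nNeg: number) {
  const alpha = 1 + nPos;
  const beta = 1 + nNeg;
  const n = alpha + beta;
  return {
    alpha,
    beta,
    mean: alpha / n,                              // α / (α + β)
    variance: (alpha * beta) / (n * n * (n + 1)), // αβ / ((α + β)² (α + β + 1))
  };
}

// Example: 3 conversions and 37 non-conversions -> Beta(4, 38), mean ≈ 0.095.
// posteriorSummary(3, 37)
```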
Fixture config
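The fixture schema itself isn't reproduced in this section. As a rough illustration of the shape such a fixture could take, here is a sketch in which every field except modelType is an assumption rather than the engine's documented schema:

```typescript
// Illustrative only: field names besides modelType are assumptions.
const thompsonFixture = {
  modelType: "thompson_bandit",
  arms: [
    { offerId: "offer_a", alpha: 1, beta: 1 }, // uniform prior
    { offerId: "offer_b", alpha: 1, beta: 1 },
    { offerId: "offer_c", alpha: 5, beta: 95 }, // domain prior: ~5% over 100 virtual samples
  ],
};
```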
Training
There’s no batch training. Update happens via POST /api/v1/respond — every outcome increments the relevant arm’s α or β. The engine’s auto-learn.ts handles this in the respond pipeline. No retrain endpoint needed.
To bootstrap: seed each arm with α = β = 1 (uniform prior). To inject domain priors: seed α = expected_rate × N, β = (1 - expected_rate) × N for a “virtual N samples” of expected behavior.
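Both seeding strategies, plus the per-outcome increment applied in the respond pipeline, sketched in illustrative TypeScript (the state shape and function names are assumptions, not the engine's API):

```typescript
// Illustrative arm state and updates; names are assumptions, not engine API.
interface ArmState {
  offerId: string;
  alpha: number;
  beta: number;
}

// Uniform bootstrap: every arm starts at Beta(1, 1).
function seedUniform(offerId: string): ArmState {
  return { offerId, alpha: 1, beta: 1 };
}

// Domain prior worth "virtual N samples" of an expected conversion rate,
// e.g. expectedRate = 0.04, n = 100 -> Beta(4, 96).
function seedWithPrior(offerId: string, expectedRate: number, n: number): ArmState {
  return { offerId, alpha: expectedRate * n, beta: (1 - expectedRate) * n };
}

// Per-outcome update when a result arrives via POST /api/v1/respond:
// a positive outcome increments alpha, a negative one increments beta.
function recordOutcome(arm: ArmState, positive: boolean): void {
  if (positive) arm.alpha += 1;
  else arm.beta += 1;
}
```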
Score interpretation
- score for any one call is a Beta sample — stochastic by design.
- Over many requests, the chosen-arm distribution converges to the optimal arm asymptotically (Thompson regret is O(log T)).
- explanations[] contains all arm scores for the call, so traces can show the full draw.
Pitfalls
- Combining with PRIE — Thompson’s stochastic sampling doesn’t compose well with PRIE’s geometric mean. Use method: propensity (raw Thompson score = propensity) rather than method: formula.
- Sample-size starvation — once an arm accumulates 50+ negative-only outcomes (α=1, β=51), it almost never gets explored. The propensity floor helps, but the right fix is to reset arms periodically (POST /api/v1/algorithm-models/<id>/reset-offer) when business conditions change.
- Reward latency — if conversions take days to fire back as outcomes, Thompson runs blind during the lag. Pair with a short-term proxy reward (e.g. click-through) for faster feedback.
- Cross-customer drift — Thompson assumes a stationary conversion rate per arm. If demand shifts (seasonality, campaign changes), the bandit lags. Decay old evidence by aging α/β toward the prior over time (not built-in; implement in auto-learn; see the sketch after this list).
- Arms that should never compete — Thompson treats every arm as exchangeable. If you have a “premium” arm that should only show to specific customers, encode the gating via Qualify rules — Thompson runs only over the surviving candidates.
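For the cross-customer drift pitfall, a minimal sketch of aging α/β toward the prior inside auto-learn, assuming a multiplicative decay applied on a schedule of your choosing (neither the factor nor the hook point is part of the engine):

```typescript
// Illustrative evidence decay, not built into the engine. Shrinks accumulated
// evidence toward the Beta(1, 1) prior so the bandit can re-adapt after demand shifts.
interface ArmState {
  offerId: string;
  alpha: number;
  beta: number;
}

// decay in (0, 1): e.g. 0.98 applied nightly keeps roughly half the evidence after a month.
function decayTowardPrior(arm: ArmState, decay: number): void {
  arm.alpha = 1 + (arm.alpha - 1) * decay;
  arm.beta = 1 + (arm.beta - 1) * decay;
}
```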
Cross-reference
- Algorithm Selection Guide.
- Epsilon-Greedy Bandit — simpler alternative.
- Ranking — Exp3-IX — adversarial-bandit alternative for non-stationary settings.